Converting GenBank files to FASTA format with Biopython
This page follows on from dealing with GenBank files in BioPython and shows how to use the GenBank parser to convert a GenBank file into a FASTA format file. See also this example of dealing with Fasta Nucelotide files.
As before, I'm going to use a small bacterial genome, Nanoarchaeum equitans Kin4-M (RefSeq NC_005213, GI:38349555, GenBank AE017199) which can be downloaded from the NCBI here:
- NC_005213.gbk (1190 KB) - GenBank file
- NC_005213.fna (487 KB) - FASTA Nucleic Acids - entire DNA nucleotide sequence as one record, see gbk -> fna
- NC_005213.faa (198 KB) - FASTA Amino Acids - amino acid sequences for each gene, see gbk -> faa
- NC_005213.ffn (487 KB) - FASTA Feature Nucleotides - nucleotide sequences for each gene, see gbk -> ffn
As you can see, the NCBI provide this file in GenBank format, and preconverted into assorted FASTA formats. This document shows how to do this for yourself, which has the advantage that you can choose what to put into the FASTA title lines.
Note that the NCBI's genomic GenBank files (like the one above) only contain a single large record. However, for maximum flexibility, the example code should cope with multi-record GenBank files.
Record by Record : GenBank to FASTA Nucleotides (*.gbk to *.fna)
Simple sequence file format between supported file formats is very easy using Bio.SeqIO - assuming you are happy with its default choices! This bit of code will record the full DNA nucleotide sequence for each record in the GenBank file as a fasta record:
from Bio import SeqIO
SeqIO.convert("NC_005213.gbk", "genbank", "NC_005213_converted.fna", "fasta")
For comparison, in this next version (gbk_to_fna.py ) we construct the FASTA file "by hand" giving full control:
from Bio import SeqIO gbk_filename = "NC_005213.gbk" faa_filename = "NC_005213_converted.fna" input_handle = open(gbk_filename, "r") output_handle = open(faa_filename, "w") for seq_record in SeqIO.parse(input_handle, "genbank") : print "Dealing with GenBank record %s" % seq_record.id output_handle.write(">%s %s\n%s\n" % ( seq_record.id, seq_record.description, seq_record.seq)) output_handle.close() input_handle.close()
These examples should all give basically the same output, with a single FASTA record for each GenBank record. In this case NC_005213.gbk only has one very large record, so you only get one FASTA record. Note that I have added the line breaks shown here:
>NC_005213 Nanoarchaeum equitans Kin4-M, complete genome. TCTCGCAGAGTTCTTTTTTGTATTAACAAACCCAAAACCCATAGAATTTAATGAACCCAAACCGCAATCGTACAAAAATT TGTAAAATTCTCTTTCTTCTTTGTCTAATTTTCTATAAACATTTAACTCTTTCCATAATGTGCCTATATATACTGCTTCC CCTCTGTTAATTCTTATTCTTATTGATACTGTTTTATAGAAAGTAAAACCTTCGAATATTGCTTCTTCAAAATAAAAGTT ...
Gene by Gene : GenBank to FASTA Amino Acids (*.gbk to *.faa)
As before, the suggested input file only has a single GenBank record (as a SeqRecord object), but the code below will cope with multiple GenBank records - each with multiple features (as SeqFeature objects). You can download this example too: gbk_to_faa.py .
from Bio import SeqIO gbk_filename = "NC_005213.gbk" faa_filename = "NC_005213_converted.faa" input_handle = open(gbk_filename, "r") output_handle = open(faa_filename, "w") for seq_record in SeqIO.parse(input_handle, "genbank") : print "Dealing with GenBank record %s" % seq_record.id for seq_feature in seq_record.features : if seq_feature.type=="CDS" : assert len(seq_feature.qualifiers['translation'])==1 output_handle.write(">%s from %s\n%s\n" % ( seq_feature.qualifiers['locus_tag'][0], seq_record.name, seq_feature.qualifiers['translation'][0])) output_handle.close() input_handle.close()
This results in one FASTA record for each CDS feature in each GenBank record. I have added line breaks to the sequence below to display this better:
>NEQ001 from NC_005213 MRLLLELKALNSIDKKQLSNYLIQGFIYNILKNTEYSWLHNWKKEKYFNFTLIPKKDIIENKRYYLIISSPDKRFIEVLH NKIKDLDIITIGLAQFQLRKTKKFDPKLRFPWVTITPIVLREGKIVILKGDKYYKVFVKRLEELKKYNLIKKKEPILEEP IEISLNQIKDGWKIIDVKDRYYDFRNKSFSAFSNWLRDLKEQSLRKYNNFCGKNFYFEEAIFEGFTFYKTVSIRIRINRG EAVYIGTLWKELNVYRKLDKEEREFYKFLYDCGLGSLNSMGFGFVNTKKNSAR >NEQ003 from NC_005213 MKKPQPYKDEEIYSILEEPVKQWFKEKYKTFTPPQRYAIMEIHKRNNVLISSPTGSGKTLAAFLAIINELIKLSHKGKLE NRVYAIYVSPLRSLNNDVKKNLETPLKEIKEKAKELNYYIGDIRIAVRTSDTKESEKAKMLKQPPHILITTPESLAIILS ... >NEQ550 from NC_005213 MLELLAGFKQSILYVLAQFKKPEYATSYTIKLVNPFYYISDSLNVITSTKEDKVNYKVSLSDIAFDFPFKFPIVAIVEGK ANREFTFIIDRQNKKLSYDLKKGIIYIQDATIIPNGIKITVNGLAELKNIKINPNDPSITVQKVVGEQNTYIIKTSKDSV KITISADFVVKAEKWLFIQ
Its easy enough to adapt the code if you wanted to output the protein names for example, or GI numbers.
Note that not all GenBank CDS features include the translation. In a related example, for viruses, the mat_peptide (mature peptide) features don't include the protein sequence either. In such cases, you can first extract the nucleotide sequence (see below) and then translate it to get the amino acids.
Gene by Gene : GenBank to FASTA Nucleotides (*.gbk to *.ffn)
I've saved this one till last, because it was the hardest. However, as described in the preceding document, Biopython 1.53 adds a new extract method to the SeqFeature object. This means you don't have to deal with anything complicated like "joins" or "fuzzy" positions explicitly.
I still need to find the time to sit down and expand this example to use the new method...