Dealing with GenBank files in Biopython
This page has recently been updated to mention using the SeqFeature object's extract method, added in Biopython 1.53. Please let me know using the contact link at the bottom of the page if you find any mistakes. Thanks!
This page demonstrates how to use Biopython's GenBank (via the Bio.SeqIO module available in Biopython 1.43 onwards) to interrogate a GenBank data file with the python programming language. The nucleotide sequence for a specific protein feature is extracted from the full genome DNA sequence, and then translated into amino acids. This is then verified against the stated translation.
Depending on the type of GenBank file(s) you are interested in, they will either contain a single record, or multiple records. You can easily determine this by looking at the raw file - each record will start with a LOCUS line, followed by various other header lines, usually a list of features, the sequence data, and ends with a // line (slash slash).
An example GenBank file
For this demonstration I'm going to use a small bacterial genome, Nanoarchaeum equitans Kin4-M (RefSeq NC_005213, GI:38349555, GenBank AE017199) which can be downloaded from the NCBI here: NC_005213.gbk (only 1.15 MB).
There is a single record in this file, and it starts as follows:
LOCUS NC_005213 490885 bp DNA circular BCT 02-DEC-2005 DEFINITION Nanoarchaeum equitans Kin4-M, complete genome. ACCESSION NC_005213 VERSION NC_005213.1 GI:38349555 KEYWORDS . SOURCE Nanoarchaeum equitans Kin4-M ORGANISM Nanoarchaeum equitans Kin4-M Archaea; Nanoarchaeota; Nanoarchaeum. REFERENCE 1 (bases 1 to 490885) AUTHORS Waters,E., Hohn,M.J., Ahel,I., Graham,D.E., Adams,M.D., Barnstead,M., Beeson,K.Y., Bibbs,L., Bolanos,R., Keller,M., Kretz,K., Lin,X., Mathur,E., Ni,J., Podar,M., Richardson,T.H., Sutton,G.S., Simon,M., Soll,D., Stetter,K.O., Short,J.M. and Noorderwier,M. TITLE The genome of Nanoarchaeum equitans: insights into early archaeal evolution and derived parasitism JOURNAL Proc. Natl. Acad. Sci. U.S.A. 100 (22), 12984-12988 (2003) PUBMED 14566062 ...
Loading the GenBank file
The following code uses Bio.SeqIO to get SeqRecord objects for each entry in the GenBank file. In this case, there is actually only one record:
from Bio import SeqIO gb_file = "NC_005213.gbk" for gb_record in SeqIO.parse(open(gb_file,"r"), "genbank") : # now do something with the record print "Name %s, %i features" % (gb_record.name, len(gb_record.features)) print repr(gb_record.seq)
This gives the following output:
Name NC_005213, 1183 features Seq('TCTCGCAGAGTTCTTTTTTGTATTAACAAACCCAAAACCCATAGAATTTAATGA...TTA', IUPACAmbiguousDNA())
That example above uses a for loop and would cope with a GenBank file containing a multiple records. There is related example on my page about converting GenBank to FASTA.
If you are expecting one and only one record, since Biopython 1.44 you can do this:
from Bio import SeqIO gb_file = "NC_005213.gbk" gb_record = SeqIO.read(open(gb_file,"r"), "genbank") print "Name %s, %i features" % (gb_record.name, len(gb_record.features)) print repr(gb_record.seq)
GenBank Features
From our GenBank file we got a single SeqRecord object which we stored as the variable gb_record, and so far we have just printed its name and the number of features:
print "Name %s, %i features" % (gb_record.name, len(gb_record.features)) print gb_record.seq
The GenBank record's features property is a list of SeqFeature objects, each created from a feature in the original GenBank file. For example, look at the CDS entry for hypothetical protein NEQ010:
CDS complement(7398..8423) /locus_tag="NEQ010" /codon_start=1 /transl_table=11 /product="hypothetical protein" /protein_id="NP_963304.1" /db_xref="GI:41614806" /db_xref="GeneID:2654552" /translation="MENREGIIEFVKEAYPITKGKYRIPLAVTEFFIEKEKLKEIEEK LNAIFLRIKEIVFEFLDNSLKGYPIDEFRKNLYLKIYENYNYKFKGIIRFDMINFKLI EINADAVEGVIQLDITSKWFSELFDFVGEPRYNREPFLSLFDKPAIIVYPDNYLFVDE YYLESKLLKGPYIKESEFYENKKYRIYPIYRRAIDIDKVKKLDYWLEKQNINDILFTI MGNKANLAFLDKEVDWALDIVAKTSFKPEFKKYVCKPYFGTHGEVYFEKRDNCVYQEF IEIPKREISYLDIDDKIKKGIFYYDFNPYAFFDKKLRFGNILLRFSKDHILNVAKGGG VGFYTIV"
This is the twenty-seventh entry in the features list (one based counting), and so its element 26 in the list (zero based counting). How did I know this? Well, trial and error or by indexing the features. But anyway:
>>> gb_feature = gb_record.features[26] >>> print gb_feature type: CDS location: (7397..8423) ref: None:None strand: -1 qualifiers: Key: codon_start, Value: ['1'] Key: db_xref, Value: ['GI:41614806', 'GeneID:2654552'] Key: locus_tag, Value: ['NEQ010'] Key: product, Value: ['hypothetical protein'] Key: protein_id, Value: ['NP_963304.1'] Key: transl_table, Value: ['11'] Key: translation, Value: ['MENREGIIEFVKEAYPITKGKYRIPLAVTEFFIEKEKLKEIEEK LNAIFLRIKEIVFEFLDNSLKGYPIDEFRKNLYLKIYENYNYKFKGIIRFDMINFKLIEINADAVEGVIQLDITSKWFS ELFDFVGEPRYNREPFLSLFDKPAIIVYPDNYLFVDEYYLESKLLKGPYIKESEFYENKKYRIYPIYRRAIDIDKVKKL DYWLEKQNINDILFTIMGNKANLAFLDKEVDWALDIVAKTSFKPEFKKYVCKPYFGTHGEVYFEKRDNCVYQEFIEIPK REISYLDIDDKIKKGIFYYDFNPYAFFDKKLRFGNILLRFSKDHILNVAKGGGVGFYTIV']
As you can see, this entry is for a CDS feature (use .type), and its location is given as complement(7398..8423) in the GenBank file (one based counting). BioPython uses the notation of a +1 and -1 strand for the forward and reverse/complement strands (use .strand), while this location (use .location) is held as 7397 to 8423 (zero based counting) to make it easy to use sequence splicing.
i.e. To obtain the DNA sequence corresponding to complement(7398..8423) in the GenBank file:
>>> gb_record.seq[7397:8423].reverse_complement() Seq('ATGGAAAACCGAGAAGGAATTATAGAGTTTGTAAAAGAAGCATATCCAATAACT...TGA', IUPACAmbiguousDNA())
In this example the location is simple and exact - but Biopython can cope with fuzzy locations. Typically in this case you just want to get integer positions back for where to slice:
>>> start = gb_feature.location.nofuzzy_start >>> end = gb_feature.location.nofuzzy_end >>> gb_record.seq[start:end].reverse_complement() Seq('ATGGAAAACCGAGAAGGAATTATAGAGTTTGTAAAAGAAGCATATCCAATAACT...TGA', IUPACAmbiguousDNA())
This is still rather tricky, and it gets worse for complex situations like joins. Biopython 1.53 makes this much easier:
>>> gb_feature.extract(gb_record.seq) Seq('ATGGAAAACCGAGAAGGAATTATAGAGTTTGTAAAAGAAGCATATCCAATAACT...TGA', IUPACAmbiguousDNA())
Checking GenBank feature translations
Having got our nucleotide sequence, Biopython will happily translate this for you (so you can check it agrees with the stated translation in the GenBank file). The GenBank file even tells us which translation table to use (the standard bacterial table, 11). Refer to the tutorial for more details.
>>> gb_record.seq[7397:8423].reverse_complement().translate(table=11) Seq('MENREGIIEFVKEAYPITKGKYRIPLAVTEFFIEKEKLKEIEEKLNAIFLRIKEIVFEFLDNSLKGYPIDEFRK NLYLKIYENYNYKFKGIIRFDMINFKLIEINADAVEGVIQLDITSKWFSELFDFVGEPRYNREPFLSLFDKPAIIVYPD NYLFVDEYYLESKLLKGPYIKESEFYENKKYRIYPIYRRAIDIDKVKKLDYWLEKQNINDILFTIMGNKANLAFLDKEV DWALDIVAKTSFKPEFKKYVCKPYFGTHGEVYFEKRDNCVYQEFIEIPKREISYLDIDDKIKKGIFYYDFNPYAFFDKK LRFGNILLRFSKDHILNVAKGGGVGFYTIV*', HasStopCodon(ExtendedIUPACProtein(), '*'))
Notice that the translate method will translate the included stop codon(s). We can also use the optional to_stop argument to avoid this.
>>> t = gb_record.seq[7397:8423].reverse_complement().translate(table=11, to_stop=True) >>> str(t) 'MENREGIIEFVKEAYPITKGKYRIPLAVTEFFIEKEKLKEIEEKLNAIFLRIKEIVFEFLDNSLKGYPIDEFRKNLYL KIYENYNYKFKGIIRFDMINFKLIEINADAVEGVIQLDITSKWFSELFDFVGEPRYNREPFLSLFDKPAIIVYPDNYLF VDEYYLESKLLKGPYIKESEFYENKKYRIYPIYRRAIDIDKVKKLDYWLEKQNINDILFTIMGNKANLAFLDKEVDWAL DIVAKTSFKPEFKKYVCKPYFGTHGEVYFEKRDNCVYQEFIEIPKREISYLDIDDKIKKGIFYYDFNPYAFFDKKLRFG NILLRFSKDHILNVAKGGGVGFYTIV'
If you have Biopython 1.51 or later, you can translate this as a CDS - this means Biopython will check there is a valid start codon which will be translated at methionine, and check there is a string valid stop codon:
>>> t = gb_record.seq[7397:8423].reverse_complement().translate(table=11, cds=True) >>> str(t) 'MENREGIIEFVKEAYPITKGKYRIPLAVTEFFIEKEKLKEIEEKLNAIFLRIKEIVFEFLDNSLKGYPIDEFRKNLYL KIYENYNYKFKGIIRFDMINFKLIEINADAVEGVIQLDITSKWFSELFDFVGEPRYNREPFLSLFDKPAIIVYPDNYLF VDEYYLESKLLKGPYIKESEFYENKKYRIYPIYRRAIDIDKVKKLDYWLEKQNINDILFTIMGNKANLAFLDKEVDWAL DIVAKTSFKPEFKKYVCKPYFGTHGEVYFEKRDNCVYQEFIEIPKREISYLDIDDKIKKGIFYYDFNPYAFFDKKLRFG NILLRFSKDHILNVAKGGGVGFYTIV'
The short version using Biopython 1.53 or later would be just:
>>> t = gb_feature.extract(gb_record.seq).translate(table=11, cds=True) >>> str(t) 'MENREGIIEFVKEAYPITKGKYRIPLAVTEFFIEKEKLKEIEEKLNAIFLRIKEIVFEFLDNSLKGYPIDEFRKNLYL KIYENYNYKFKGIIRFDMINFKLIEINADAVEGVIQLDITSKWFSELFDFVGEPRYNREPFLSLFDKPAIIVYPDNYLF VDEYYLESKLLKGPYIKESEFYENKKYRIYPIYRRAIDIDKVKKLDYWLEKQNINDILFTIMGNKANLAFLDKEVDWAL DIVAKTSFKPEFKKYVCKPYFGTHGEVYFEKRDNCVYQEFIEIPKREISYLDIDDKIKKGIFYYDFNPYAFFDKKLRFG NILLRFSKDHILNVAKGGGVGFYTIV'
In case you are wondering, yes, this is identical to the translation for the protein given in the GenBank file - note that the qualifiers dictionary returns a list of entries, and in the case of the translation there should be one and only one entry (entry zero):
>>> gb_feature.qualifiers['translation'][0] == str(t) True
Indexing GenBank features
Did you notice the slight of hand above, where I just declared that the CDS entry for locus tag NEQ010 was gb_record.features[26]?
In general, how can we find a particular entry from a unique identifier like the locus tag?
One way is to scan through all the features, and build up a mapping (stored as a python dictionary) from (say) the locus tag to the feature index. This is illustrated in the following function:
def index_genbank_features(gb_record, feature_type, qualifier) : answer = dict() for (index, feature) in enumerate(gb_record.features) : if feature.type==feature_type : if qualifier in feature.qualifiers : #There should only be one locus_tag per feature, but there #are usually several db_xref entries for value in feature.qualifiers[qualifier] : if value in answer : print "WARNING - Duplicate key %s for %s features %i and %i" \ % (value, feature_type, answer[value], index) else : answer[value] = index return answer
How does this work then? We'll show this by looking for the features list entry for the CDS feature with locus_tag of NEQ010:
>>> locus_tag_cds_index = index_genbank_features(gb_record,"CDS","locus_tag") >>> print locus_tag_index["NEQ010"] 26
This doesn't just work for the locus tag, using the db_xref (database cross-reference) we can index the features allowing us to search them using GI numbers or GeneID:
>>> db_xref_cds_index = index_genbank_features(gb_record,"CDS","db_xref") >>> print db_xref_cds_index["GI:41614806"] 26 >>> print db_xref_cds_index["GeneID:2654552"] 26
It would also make sense to index by protein_id.