Dealing with GenBank files in Biopython

This page has recently been updated to mention using the SeqFeature object's extract method, added in Biopython 1.53. Please let me know using the contact link at the bottom of the page if you find any mistakes. Thanks!

This page demonstrates how to use Biopython's GenBank (via the Bio.SeqIO module available in Biopython 1.43 onwards) to interrogate a GenBank data file with the python programming language. The nucleotide sequence for a specific protein feature is extracted from the full genome DNA sequence, and then translated into amino acids. This is then verified against the stated translation.

Depending on the type of GenBank file(s) you are interested in, they will either contain a single record, or multiple records. You can easily determine this by looking at the raw file - each record will start with a LOCUS line, followed by various other header lines, usually a list of features, the sequence data, and ends with a // line (slash slash).

An example GenBank file

For this demonstration I'm going to use a small bacterial genome, Nanoarchaeum equitans Kin4-M (RefSeq NC_005213, GI:38349555, GenBank AE017199) which can be downloaded from the NCBI here: NC_005213.gbk (only 1.15 MB).

There is a single record in this file, and it starts as follows:

LOCUS       NC_005213             490885 bp    DNA     circular BCT 02-DEC-2005
DEFINITION  Nanoarchaeum equitans Kin4-M, complete genome.
ACCESSION   NC_005213
VERSION     NC_005213.1  GI:38349555
KEYWORDS    .
SOURCE      Nanoarchaeum equitans Kin4-M
  ORGANISM  Nanoarchaeum equitans Kin4-M
            Archaea; Nanoarchaeota; Nanoarchaeum.
REFERENCE   1  (bases 1 to 490885)
  AUTHORS   Waters,E., Hohn,M.J., Ahel,I., Graham,D.E., Adams,M.D.,
            Barnstead,M., Beeson,K.Y., Bibbs,L., Bolanos,R., Keller,M.,
            Kretz,K., Lin,X., Mathur,E., Ni,J., Podar,M., Richardson,T.H.,
            Sutton,G.S., Simon,M., Soll,D., Stetter,K.O., Short,J.M. and
            Noorderwier,M.
  TITLE     The genome of Nanoarchaeum equitans: insights into early archaeal
            evolution and derived parasitism
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 100 (22), 12984-12988 (2003)
   PUBMED   14566062
...

Loading the GenBank file

The following code uses Bio.SeqIO to get SeqRecord objects for each entry in the GenBank file. In this case, there is actually only one record:

from Bio import SeqIO
gb_file = "NC_005213.gbk"
for gb_record in SeqIO.parse(open(gb_file,"r"), "genbank") :
    # now do something with the record
    print "Name %s, %i features" % (gb_record.name, len(gb_record.features))
    print repr(gb_record.seq)

This gives the following output:

Name NC_005213, 1183 features
Seq('TCTCGCAGAGTTCTTTTTTGTATTAACAAACCCAAAACCCATAGAATTTAATGA...TTA', IUPACAmbiguousDNA())

That example above uses a for loop and would cope with a GenBank file containing a multiple records. There is related example on my page about converting GenBank to FASTA.

If you are expecting one and only one record, since Biopython 1.44 you can do this:

from Bio import SeqIO
gb_file = "NC_005213.gbk"
gb_record = SeqIO.read(open(gb_file,"r"), "genbank")
print "Name %s, %i features" % (gb_record.name, len(gb_record.features))
print repr(gb_record.seq)

GenBank Features

From our GenBank file we got a single SeqRecord object which we stored as the variable gb_record, and so far we have just printed its name and the number of features:

print "Name %s, %i features" % (gb_record.name, len(gb_record.features))
print gb_record.seq

The GenBank record's features property is a list of SeqFeature objects, each created from a feature in the original GenBank file. For example, look at the CDS entry for hypothetical protein NEQ010:

     CDS             complement(7398..8423)
                     /locus_tag="NEQ010"
                     /codon_start=1
                     /transl_table=11
                     /product="hypothetical protein"
                     /protein_id="NP_963304.1"
                     /db_xref="GI:41614806"
                     /db_xref="GeneID:2654552"
                     /translation="MENREGIIEFVKEAYPITKGKYRIPLAVTEFFIEKEKLKEIEEK
                     LNAIFLRIKEIVFEFLDNSLKGYPIDEFRKNLYLKIYENYNYKFKGIIRFDMINFKLI
                     EINADAVEGVIQLDITSKWFSELFDFVGEPRYNREPFLSLFDKPAIIVYPDNYLFVDE
                     YYLESKLLKGPYIKESEFYENKKYRIYPIYRRAIDIDKVKKLDYWLEKQNINDILFTI
                     MGNKANLAFLDKEVDWALDIVAKTSFKPEFKKYVCKPYFGTHGEVYFEKRDNCVYQEF
                     IEIPKREISYLDIDDKIKKGIFYYDFNPYAFFDKKLRFGNILLRFSKDHILNVAKGGG
                     VGFYTIV"

This is the twenty-seventh entry in the features list (one based counting), and so its element 26 in the list (zero based counting). How did I know this? Well, trial and error or by indexing the features. But anyway:

>>> gb_feature = gb_record.features[26]
>>> print gb_feature
type: CDS
location: (7397..8423)
ref: None:None
strand: -1
qualifiers: 
	Key: codon_start, Value: ['1']
	Key: db_xref, Value: ['GI:41614806', 'GeneID:2654552']
	Key: locus_tag, Value: ['NEQ010']
	Key: product, Value: ['hypothetical protein']
	Key: protein_id, Value: ['NP_963304.1']
	Key: transl_table, Value: ['11']
	Key: translation, Value: ['MENREGIIEFVKEAYPITKGKYRIPLAVTEFFIEKEKLKEIEEK
LNAIFLRIKEIVFEFLDNSLKGYPIDEFRKNLYLKIYENYNYKFKGIIRFDMINFKLIEINADAVEGVIQLDITSKWFS
ELFDFVGEPRYNREPFLSLFDKPAIIVYPDNYLFVDEYYLESKLLKGPYIKESEFYENKKYRIYPIYRRAIDIDKVKKL
DYWLEKQNINDILFTIMGNKANLAFLDKEVDWALDIVAKTSFKPEFKKYVCKPYFGTHGEVYFEKRDNCVYQEFIEIPK
REISYLDIDDKIKKGIFYYDFNPYAFFDKKLRFGNILLRFSKDHILNVAKGGGVGFYTIV']

As you can see, this entry is for a CDS feature (use .type), and its location is given as complement(7398..8423) in the GenBank file (one based counting). BioPython uses the notation of a +1 and -1 strand for the forward and reverse/complement strands (use .strand), while this location (use .location) is held as 7397 to 8423 (zero based counting) to make it easy to use sequence splicing.

i.e. To obtain the DNA sequence corresponding to complement(7398..8423) in the GenBank file:

>>> gb_record.seq[7397:8423].reverse_complement()
Seq('ATGGAAAACCGAGAAGGAATTATAGAGTTTGTAAAAGAAGCATATCCAATAACT...TGA', IUPACAmbiguousDNA())

In this example the location is simple and exact - but Biopython can cope with fuzzy locations. Typically in this case you just want to get integer positions back for where to slice:

>>> start = gb_feature.location.nofuzzy_start
>>> end = gb_feature.location.nofuzzy_end
>>> gb_record.seq[start:end].reverse_complement()
Seq('ATGGAAAACCGAGAAGGAATTATAGAGTTTGTAAAAGAAGCATATCCAATAACT...TGA', IUPACAmbiguousDNA())

This is still rather tricky, and it gets worse for complex situations like joins. Biopython 1.53 makes this much easier:

>>> gb_feature.extract(gb_record.seq)
Seq('ATGGAAAACCGAGAAGGAATTATAGAGTTTGTAAAAGAAGCATATCCAATAACT...TGA', IUPACAmbiguousDNA())

Checking GenBank feature translations

Having got our nucleotide sequence, Biopython will happily translate this for you (so you can check it agrees with the stated translation in the GenBank file). The GenBank file even tells us which translation table to use (the standard bacterial table, 11). Refer to the tutorial for more details.

>>> gb_record.seq[7397:8423].reverse_complement().translate(table=11)
Seq('MENREGIIEFVKEAYPITKGKYRIPLAVTEFFIEKEKLKEIEEKLNAIFLRIKEIVFEFLDNSLKGYPIDEFRK
NLYLKIYENYNYKFKGIIRFDMINFKLIEINADAVEGVIQLDITSKWFSELFDFVGEPRYNREPFLSLFDKPAIIVYPD
NYLFVDEYYLESKLLKGPYIKESEFYENKKYRIYPIYRRAIDIDKVKKLDYWLEKQNINDILFTIMGNKANLAFLDKEV
DWALDIVAKTSFKPEFKKYVCKPYFGTHGEVYFEKRDNCVYQEFIEIPKREISYLDIDDKIKKGIFYYDFNPYAFFDKK
LRFGNILLRFSKDHILNVAKGGGVGFYTIV*', HasStopCodon(ExtendedIUPACProtein(), '*'))

Notice that the translate method will translate the included stop codon(s). We can also use the optional to_stop argument to avoid this.

>>> t = gb_record.seq[7397:8423].reverse_complement().translate(table=11, to_stop=True)
>>> str(t)
'MENREGIIEFVKEAYPITKGKYRIPLAVTEFFIEKEKLKEIEEKLNAIFLRIKEIVFEFLDNSLKGYPIDEFRKNLYL
KIYENYNYKFKGIIRFDMINFKLIEINADAVEGVIQLDITSKWFSELFDFVGEPRYNREPFLSLFDKPAIIVYPDNYLF
VDEYYLESKLLKGPYIKESEFYENKKYRIYPIYRRAIDIDKVKKLDYWLEKQNINDILFTIMGNKANLAFLDKEVDWAL
DIVAKTSFKPEFKKYVCKPYFGTHGEVYFEKRDNCVYQEFIEIPKREISYLDIDDKIKKGIFYYDFNPYAFFDKKLRFG
NILLRFSKDHILNVAKGGGVGFYTIV'

If you have Biopython 1.51 or later, you can translate this as a CDS - this means Biopython will check there is a valid start codon which will be translated at methionine, and check there is a string valid stop codon:

>>> t = gb_record.seq[7397:8423].reverse_complement().translate(table=11, cds=True)
>>> str(t)
'MENREGIIEFVKEAYPITKGKYRIPLAVTEFFIEKEKLKEIEEKLNAIFLRIKEIVFEFLDNSLKGYPIDEFRKNLYL
KIYENYNYKFKGIIRFDMINFKLIEINADAVEGVIQLDITSKWFSELFDFVGEPRYNREPFLSLFDKPAIIVYPDNYLF
VDEYYLESKLLKGPYIKESEFYENKKYRIYPIYRRAIDIDKVKKLDYWLEKQNINDILFTIMGNKANLAFLDKEVDWAL
DIVAKTSFKPEFKKYVCKPYFGTHGEVYFEKRDNCVYQEFIEIPKREISYLDIDDKIKKGIFYYDFNPYAFFDKKLRFG
NILLRFSKDHILNVAKGGGVGFYTIV'

The short version using Biopython 1.53 or later would be just:

>>> t = gb_feature.extract(gb_record.seq).translate(table=11, cds=True)
>>> str(t)
'MENREGIIEFVKEAYPITKGKYRIPLAVTEFFIEKEKLKEIEEKLNAIFLRIKEIVFEFLDNSLKGYPIDEFRKNLYL
KIYENYNYKFKGIIRFDMINFKLIEINADAVEGVIQLDITSKWFSELFDFVGEPRYNREPFLSLFDKPAIIVYPDNYLF
VDEYYLESKLLKGPYIKESEFYENKKYRIYPIYRRAIDIDKVKKLDYWLEKQNINDILFTIMGNKANLAFLDKEVDWAL
DIVAKTSFKPEFKKYVCKPYFGTHGEVYFEKRDNCVYQEFIEIPKREISYLDIDDKIKKGIFYYDFNPYAFFDKKLRFG
NILLRFSKDHILNVAKGGGVGFYTIV'

In case you are wondering, yes, this is identical to the translation for the protein given in the GenBank file - note that the qualifiers dictionary returns a list of entries, and in the case of the translation there should be one and only one entry (entry zero):

>>> gb_feature.qualifiers['translation'][0] == str(t)
True

Indexing GenBank features

Did you notice the slight of hand above, where I just declared that the CDS entry for locus tag NEQ010 was gb_record.features[26]?

In general, how can we find a particular entry from a unique identifier like the locus tag?

One way is to scan through all the features, and build up a mapping (stored as a python dictionary) from (say) the locus tag to the feature index. This is illustrated in the following function:

def index_genbank_features(gb_record, feature_type, qualifier) :
    answer = dict()
    for (index, feature) in enumerate(gb_record.features) :
        if feature.type==feature_type :
            if qualifier in feature.qualifiers :
                #There should only be one locus_tag per feature, but there
                #are usually several db_xref entries
                for value in feature.qualifiers[qualifier] :
                    if value in answer :
                        print "WARNING - Duplicate key %s for %s features %i and %i" \
                           % (value, feature_type, answer[value], index)
                    else :
                        answer[value] = index
    return answer

How does this work then? We'll show this by looking for the features list entry for the CDS feature with locus_tag of NEQ010:

>>> locus_tag_cds_index = index_genbank_features(gb_record,"CDS","locus_tag")
>>> print locus_tag_index["NEQ010"]
26

This doesn't just work for the locus tag, using the db_xref (database cross-reference) we can index the features allowing us to search them using GI numbers or GeneID:

>>> db_xref_cds_index = index_genbank_features(gb_record,"CDS","db_xref")
>>> print db_xref_cds_index["GI:41614806"]
26
>>> print db_xref_cds_index["GeneID:2654552"]
26

It would also make sense to index by protein_id.

Python

Biopython

NCBI