Nucleotide and Amino Acid Sequences in BioPython
In day-to-day coding, like many Biopython users, I used to work with plain Python strings rather than Biopython Seq objects, which are strings with an associated alphabet. I think that recent releases of Biopython have made the Seq object much more useful, especially in combination with the SeqIO system.
I started this page several years ago in an effort to get to grips with Biopython's "alphabet" system. So far it just summarises the IUPAC/IUBMB standards for nucleotide and amino acid "letter" names.
Still under construction...
Nucleotide Alphabet
The nucleotides making up RNA or DNA sequences, taken from IUBMB - Nucleotides:
Symbol | Meaning |
---|---|
G | Guanine |
A | Adenine |
T | Thymine (in DNA) |
C | Cytosine |
U | Uracil (in RNA) |
Then there are the ambiguous nucleotide letters:
Symbol | Meaning | Origin of designation |
---|---|---|
R | G or A | puRine |
Y | T or C | pYrimidine |
M | A or C | aMino |
K | G or T | Keto |
S | G or C | Strong interaction (3 H bonds) |
W | A or T | Weak interaction (2 H bonds) |
H | A, C or T | not-G, H follows G in the alphabet |
B | G, T or C | not-A, B follows A in the alphabet |
V | G, C or A | not-T (not-U), V follows U in the alphabet |
D | G, A or T | not-C, D follows C in the alphabet |
N | G, A, T or C | aNy |
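As a minimal sketch of how the table above might be used in code, the following plain Python snippet (it does not use Biopython) stores the ambiguity codes in a dictionary and turns an ambiguous DNA pattern into a regular expression. The names `AMBIGUOUS_DNA` and `ambiguous_to_regex` are made up for this illustration and are not part of any library.

```python
import re

# Ambiguity codes copied from the table above, mapped to the bases they cover.
AMBIGUOUS_DNA = {
    "G": "G", "A": "A", "T": "T", "C": "C",
    "R": "GA", "Y": "TC", "M": "AC", "K": "GT",
    "S": "GC", "W": "AT",
    "H": "ACT", "B": "GTC", "V": "GCA", "D": "GAT",
    "N": "GATC",
}


def ambiguous_to_regex(pattern):
    """Build a regular expression matching every DNA string the pattern allows."""
    parts = []
    for letter in pattern.upper():
        bases = AMBIGUOUS_DNA[letter]  # raises KeyError for letters not in the table
        # A single base is used as-is; several bases become a character class.
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


regex = ambiguous_to_regex("GAYN")
print(regex)                               # GA[TC][GATC]
print(bool(re.fullmatch(regex, "GATC")))   # True
print(bool(re.fullmatch(regex, "GAGC")))   # False, G is not a pyrimidine
```

For RNA the same idea should work with U in place of T throughout the dictionary.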
Nucleotide Sequences in BioPython
Right then... DNA and RNA... unambiguous and ambiguous...
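To make the four combinations concrete (DNA or RNA, unambiguous or ambiguous), here is a rough sketch in plain Python that reports the smallest of these letter sets a given string fits into. It is only an illustration built from the tables above, not how Biopython itself handles alphabets, and `classify_nucleotides` is a hypothetical helper name.

```python
# Letter sets taken from the nucleotide tables above.
UNAMBIGUOUS_DNA = set("GATC")
UNAMBIGUOUS_RNA = set("GAUC")
AMBIGUITY_LETTERS = set("RYMKSWHBVDN")


def classify_nucleotides(seq):
    """Return the name of the first (smallest) letter set that contains seq."""
    letters = set(seq.upper())
    # Order matters: a sequence with neither T nor U is reported as DNA.
    candidates = [
        ("unambiguous DNA", UNAMBIGUOUS_DNA),
        ("unambiguous RNA", UNAMBIGUOUS_RNA),
        ("ambiguous DNA", UNAMBIGUOUS_DNA | AMBIGUITY_LETTERS),
        ("ambiguous RNA", UNAMBIGUOUS_RNA | AMBIGUITY_LETTERS),
    ]
    for name, alphabet in candidates:
        if letters <= alphabet:
            return name
    return "not a nucleotide sequence"


for example in ["GATTACA", "GAUUACA", "GAYN", "GAUN", "GATU"]:
    print(example, "->", classify_nucleotides(example))
```

Note that a string mixing T and U, such as the last example, fits none of the four sets and is rejected.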
Amino Acid Alphabet
The standard twenty amino acids have one-letter and three-letter codes as follows, taken from IUPAC/IUBMB - Amino Acids:
One | Three | Meaning | One | Three | Meaning |
---|---|---|---|---|---|
A | Ala | Alanine | M | Met | Methionine |
C | Cys | Cysteine | N | Asn | Asparagine |
D | Asp | Aspartic acid | P | Pro | Proline |
E | Glu | Glutamic acid | Q | Gln | Glutamine |
F | Phe | Phenylalanine | R | Arg | Arginine |
G | Gly | Glycine | S | Ser | Serine |
H | His | Histidine | T | Thr | Threonine |
I | Ile | Isoleucine | V | Val | Valine |
K | Lys | Lysine | W | Trp | Tryptophan |
L | Leu | Leucine | Y | Tyr | Tyrosine |
There are, of course, some special cases:
One | Three | Meaning |
---|---|---|
X | Xaa | Unknown or 'other' amino acid |
U | Sec | Selenocysteine (see IUBMB recommendations) |
O | Pyl | Pyrrolysine |
B | Asx | Aspartic acid (D) or Asparagine (N) |
Z | Glx | Glutamic acid (E) or Glutamine (Q), or substances such as 4-carboxyglutamic acid and 5-oxoproline that yield glutamic acid on acid hydrolysis of peptides |
J | Xle | Leucine (L) or Isoleucine (I); sometimes used in NMR work to designate signals that cannot be assigned unambiguously to one or the other |
Amino Acid Sequences in BioPython
The first point is that BioPython uses the one-letter codes almost exclusively - they are simply much more convenient for manipulating on the computer than the three-letter codes.
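To illustrate why the one-letter codes are easier to handle, here is a small plain Python sketch that converts between the two notations using dictionaries built from the tables above. The helper names `to_three_letter` and `to_one_letter` are hypothetical, and J is left out because it is not part of the core set.

```python
# One-letter to three-letter codes, copied from the tables above.
ONE_TO_THREE = {
    "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu", "F": "Phe",
    "G": "Gly", "H": "His", "I": "Ile", "K": "Lys", "L": "Leu",
    "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln", "R": "Arg",
    "S": "Ser", "T": "Thr", "V": "Val", "W": "Trp", "Y": "Tyr",
    # Special cases
    "X": "Xaa", "U": "Sec", "O": "Pyl", "B": "Asx", "Z": "Glx",
}
THREE_TO_ONE = {three: one for one, three in ONE_TO_THREE.items()}


def to_three_letter(seq):
    """Convert a one-letter protein string such as 'MKV' to 'MetLysVal'."""
    return "".join(ONE_TO_THREE[letter] for letter in seq.upper())


def to_one_letter(seq3):
    """Convert a three-letter protein string such as 'MetLysVal' to 'MKV'."""
    if len(seq3) % 3 != 0:
        raise ValueError("length of a three-letter sequence must be a multiple of 3")
    # The string has to be cut into chunks of three before any lookup.
    chunks = [seq3[i:i + 3].capitalize() for i in range(0, len(seq3), 3)]
    return "".join(THREE_TO_ONE[chunk] for chunk in chunks)


print(to_three_letter("MKV"))        # MetLysVal
print(to_one_letter("MetLysVal"))    # MKV
```

With one-letter codes, residue number i is simply character i of the string, whereas the three-letter form needs chunking first. Biopython's `Bio.SeqUtils` module provides `seq1` and `seq3` functions for this kind of conversion, so in practice there should be no need to maintain such dictionaries by hand.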