The Genetic Code

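First                     Second base                     Third
base      T            C            A            G        base

 T     Phe [F]      Ser [S]      Tyr [Y]      Cys [C]       T
       Phe [F]      Ser [S]      Tyr [Y]      Cys [C]       C
       Leu [L]      Ser [S]      Ter [end]    Ter [end]     A
       Leu [L]      Ser [S]      Ter [end]    Trp [W]       G

 C     Leu [L]      Pro [P]      His [H]      Arg [R]       T
       Leu [L]      Pro [P]      His [H]      Arg [R]       C
       Leu [L]      Pro [P]      Gln [Q]      Arg [R]       A
       Leu [L]      Pro [P]      Gln [Q]      Arg [R]       G

 A     Ile [I]      Thr [T]      Asn [N]      Ser [S]       T
       Ile [I]      Thr [T]      Asn [N]      Ser [S]       C
       Ile [I]      Thr [T]      Lys [K]      Arg [R]       A
       Met [M]      Thr [T]      Lys [K]      Arg [R]       G

 G     Val [V]      Ala [A]      Asp [D]      Gly [G]       T
       Val [V]      Ala [A]      Asp [D]      Gly [G]       C
       Val [V]      Ala [A]      Glu [E]      Gly [G]       A
       Val [V]      Ala [A]      Glu [E]      Gly [G]       G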
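
The table above can be held in code as a lookup from codon to one-letter amino acid symbol. The following is a minimal sketch in Python; the names BASES, AMINO_ACIDS and CODON_TABLE are illustrative choices, and '*' is assumed as the symbol for Ter [end].

```python
from itertools import product

# Base order used by the table for the first, second and third position.
BASES = "TCAG"

# One-letter symbols for all 64 codons, enumerated TTT, TTC, TTA, TTG, TCT, ...
# (third base varies fastest). '*' stands for Ter [end].
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # first base T
    "LLLLPPPPHHQQRRRR"  # first base C
    "IIIMTTTTNNKKSSRR"  # first base A
    "VVVVAAAADDEEGGGG"  # first base G
)

# Map each codon of the coding (sense) strand, read 5' -> 3', to its symbol.
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

print(CODON_TABLE["ATG"])  # M
print(CODON_TABLE["TGG"])  # W
print(CODON_TABLE["TAA"])  # *
```

Writing the 64 symbols in table order keeps the data compact and easy to proofread against the printed table, one row of 16 symbols per first base.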

An explanation of the Genetic Code: DNA is a two-stranded molecule. Each strand is a polynucleotide composed of A (adenine), T (thymine), C (cytosine), and G (guanine) nucleotide residues, polymerized by "dehydration" synthesis into linear chains with specific sequences. Each strand has polarity: the 5' position of the first nucleotide (which usually carries a phosphate group) begins the strand, and the 3'-hydroxyl group of the final nucleotide ends the strand; accordingly, we say that the strand runs 5' to 3'. The two strands of DNA run antiparallel, so that one strand runs 5' -> 3' while the other runs 3' -> 5'. At each position along the double-stranded DNA molecule, the paired nucleotides are complementary: A forms two hydrogen bonds with T, and C forms three hydrogen bonds with G. In most cases the two-stranded, antiparallel, complementary DNA molecule coils into a helical structure that resembles a spiral staircase, which is why DNA is referred to as the "Double Helix".
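
As a small illustration of the pairing rule just described (two hydrogen bonds for A-T, three for C-G), the sketch below counts the hydrogen bonds holding a short duplex together. The function name and the error handling are illustrative assumptions, and the model ignores everything except the two standard base pairs.

```python
# Hydrogen bonds per complementary pair, as stated in the text.
BONDS = {frozenset("AT"): 2, frozenset("CG"): 3}

def duplex_hydrogen_bonds(strand, partner):
    """Count hydrogen bonds between two aligned, antiparallel strands.

    `strand` is written 5' -> 3' and `partner` 3' -> 5', so position i of
    one strand pairs with position i of the other.
    """
    if len(strand) != len(partner):
        raise ValueError("strands must have the same length")
    total = 0
    for a, b in zip(strand.upper(), partner.upper()):
        pair = frozenset((a, b))
        if pair not in BONDS:
            raise ValueError(f"{a} and {b} are not complementary")
        total += BONDS[pair]
    return total

print(duplex_hydrogen_bonds("ATGC", "TACG"))  # 10
```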

For a given gene, one strand of DNA serves as the template for transcription; this strand is called the template strand or antisense strand (its triplets are complementary to the codons). The other, complementary strand is called the coding strand or sense strand (containing codons). Since mRNA is synthesized as the complement of the template strand, it carries the same sequence as the coding strand. The table above refers to the 5' -> 3' sequence of the coding or sense strand of DNA; the code for the mRNA is identical except that RNA contains U (uracil) rather than T.

An example of two complementary strands of DNA would be:

(5' -> 3') ATGGAATTCTCGCTC (Coding, sense strand)
(3' <- 5') TACCTTAAGAGCGAG (Template, antisense strand)
(5' -> 3') AUGGAAUUCUCGCUC (mRNA made from Template strand)

Since the amino acid residues of a protein are specified by triplet codons, the protein sequence made from the above example would be Met-Glu-Phe-Ser-Leu... (MEFSL...).
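
The example above can be reproduced with a few lines of Python. This is a sketch under simple assumptions: the input is the coding strand written 5' -> 3', translation starts at the first base, and the function names (template_strand, transcribe, translate) are illustrative only.

```python
from itertools import product

# Standard genetic code in T, C, A, G order; '*' marks a terminator.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def template_strand(coding):
    """Base-by-base complement, written 3' -> 5' so it aligns under the coding strand."""
    return coding.upper().translate(COMPLEMENT)

def transcribe(coding):
    """The mRNA carries the coding-strand sequence with U in place of T."""
    return coding.upper().replace("T", "U")

def translate(mrna):
    """Translate 5' -> 3' from the first base; stop at a terminator codon."""
    dna = mrna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):  # only complete triplets
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

coding = "ATGGAATTCTCGCTC"
print(template_strand(coding))        # TACCTTAAGAGCGAG
print(transcribe(coding))             # AUGGAAUUCUCGCUC
print(translate(transcribe(coding)))  # MEFSL
```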


Sequence Symbols

GCG programs allow all upper- and lower-case letters, periods (.), asterisks (*), pluses (+), ampersands (&), and ats (@) as symbols in biological sequences. Nucleotide symbols, their complements, and the standard one-letter amino acid symbols are shown below in separate lists. The meanings of the symbols +, &, and @ have not been assigned at this writing (March, 1989).
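
A check for the symbol set described above might look like the following sketch. It only tests individual characters against the list in the text; it is not part of GCG, does not model the GCG file format, and the function names are hypothetical.

```python
import re

# Letters of either case, plus the symbols . * + & @ named in the text.
ALLOWED = re.compile(r"[A-Za-z.*+&@]*")
NOT_ALLOWED = re.compile(r"[^A-Za-z.*+&@]")

def is_valid_sequence(text):
    """True if every character is one of the permitted sequence symbols."""
    return ALLOWED.fullmatch(text) is not None

def invalid_symbols(text):
    """Sorted list of the distinct characters that are not permitted."""
    return sorted(set(NOT_ALLOWED.findall(text)))

print(is_valid_sequence("ACGTnnnn..*"))  # True
print(is_valid_sequence("ACG-T"))        # False
print(invalid_symbols("AC-G T1"))        # [' ', '-', '1']
```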

GCG uses the letter codes for amino acid codes and nucleotide ambiguity proposed by IUB (Nomenclature Committee, 1985, Eur. J. Biochem. 150; 1-5). These codes are compatible with the codes used by the EMBL, GenBank, and PIR data libraries.


Nucleotides

The meaning of each symbol, its complement, and its Cambridge (Staden/Sanger) equivalent are shown below. Cambridge files can be converted into GCG files and vice versa with the programs FromStaden and ToStaden.

          IUB/GCG      Meaning     Complement   Staden/Sanger
          
              A             A             T             A
              C             C             G             C
              G             G             C             G
             T/U            T             A             T
              M           A or C          K             5
              R           A or G          Y             R
              W           A or T          W             7
              S           C or G          S             8
              Y           C or T          R             Y
              K           G or T          M             6
              V        A or C or G        B       not supported
              H        A or C or T        D       not supported
              D        A or G or T        H       not supported
              B        C or G or T        V       not supported
             X/N     G or A or T or C     X            -/X
              .    not G or A or T or C   .       not supported
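
The Complement column of the table above translates directly into a lookup table. The sketch below is an illustration, not GCG code: it assumes that both X and N complement to X and that U complements to A (as the table lists them), and it preserves the case of each input symbol as an added convenience.

```python
# Complement of each IUB/GCG nucleotide symbol, taken from the table.
IUB_COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A", "U": "A",
    "M": "K", "R": "Y", "W": "W", "S": "S", "Y": "R", "K": "M",
    "V": "B", "H": "D", "D": "H", "B": "V",
    "X": "X", "N": "X", ".": ".",
}

def complement(seq):
    """Complement each symbol in place, keeping upper or lower case."""
    out = []
    for symbol in seq:
        comp = IUB_COMPLEMENT[symbol.upper()]  # KeyError on unknown symbols
        out.append(comp.lower() if symbol.islower() else comp)
    return "".join(out)

def reverse_complement(seq):
    """Complement, then reverse, so the result again reads 5' -> 3'."""
    return complement(seq)[::-1]

print(complement("MRWSYK"))                   # KYWSRM
print(reverse_complement("ATGGAATTCTCGCTC"))  # GAGCGAGAATTCCAT
```

Note that W and S are their own complements, so a symbol can survive complementation unchanged without being an error.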

The uncertainty and frame ambiguity codes used by Staden are not supported by GCG and are translated by FromStaden as the lower-case single base equivalent.

                  Staden Code          Meaning              GCG
          
                      1               probably C              c
                      2               probably T              t
                      3               probably A              a
                      4               probably G              g
                      D                C or CC                c
                      V                T or TT                t
                      B                A or AA                a
                      H                G or GG                g
                      K                C or CX                c
                      L                T or TX                t
                      M                A or AX                a
                      N                G or GX                g
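
Taken together, the two tables above define a symbol-by-symbol mapping from Staden codes to GCG codes. The sketch below shows that mapping as a Python dictionary. It is not the FromStaden program, whose actual behavior and file handling are not reproduced here; the function name from_staden and the choice of X (rather than N) for the Staden symbols - and X are assumptions.

```python
# Staden symbol -> GCG symbol, combining both tables.
STADEN_TO_GCG = {
    # unambiguous bases
    "A": "A", "C": "C", "G": "G", "T": "T",
    # two-base ambiguities
    "5": "M", "R": "R", "7": "W", "8": "S", "Y": "Y", "6": "K",
    # unknown base
    "-": "X", "X": "X",
    # uncertainty codes: lower-case single base
    "1": "c", "2": "t", "3": "a", "4": "g",
    # frame ambiguity codes: lower-case single base
    "D": "c", "V": "t", "B": "a", "H": "g",
    "K": "c", "L": "t", "M": "a", "N": "g",
}

def from_staden(seq):
    """Translate a string of upper-case Staden symbols into GCG symbols."""
    return "".join(STADEN_TO_GCG[symbol] for symbol in seq)

print(from_staden("A1T5-"))  # AcTMX
print(from_staden("KLMN"))   # ctag
```

The letters D, V, B, H, K, M and N mean different things in the two schemes, which is why a Staden sequence has to be converted rather than read directly as IUB codes.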

Amino Acids

Here is a list of the standard one-letter amino acid codes and their three-letter equivalents. The synonymous codons and their depiction in the IUB codes are shown. You should recognize that the codons following semicolons (;) are not sufficiently specific to define a single amino acid even though they represent the best possible backtranslation into the IUB codes! All of the relationships in this list can be redefined by you in a local data file, as described below.

                                                           IUB
    Symbol 3-letter  Meaning      Codons                Depiction
    
      A    Ala       Alanine      GCT,GCC,GCA,GCG         !GCX
      B    Asp,Asn   Aspartic,
                     Asparagine   GAT,GAC,AAT,AAC         !RAY
      C    Cys       Cysteine     TGT,TGC                 !TGY
      D    Asp       Aspartic     GAT,GAC                 !GAY
      E    Glu       Glutamic     GAA,GAG                 !GAR
      F    Phe     Phenylalanine  TTT,TTC                 !TTY
      G    Gly       Glycine      GGT,GGC,GGA,GGG         !GGX
      H    His       Histidine    CAT,CAC                 !CAY
      I    Ile       Isoleucine   ATT,ATC,ATA             !ATH
      K    Lys       Lysine       AAA,AAG                 !AAR
      L    Leu       Leucine      TTG,TTA,CTT,CTC,CTA,CTG !TTR,CTX,YTR;YTX
      M    Met       Methionine   ATG                     !ATG
      N    Asn       Asparagine   AAT,AAC                 !AAY
      P    Pro       Proline      CCT,CCC,CCA,CCG         !CCX
      Q    Gln       Glutamine    CAA,CAG                 !CAR
      R    Arg       Arginine     CGT,CGC,CGA,CGG,AGA,AGG !CGX,AGR,MGR;MGX
      S    Ser       Serine       TCT,TCC,TCA,TCG,AGT,AGC !TCX,AGY;WSX
      T    Thr       Threonine    ACT,ACC,ACA,ACG         !ACX
      V    Val       Valine       GTT,GTC,GTA,GTG         !GTX
      W    Trp       Tryptophan   TGG                     !TGG
      X    Xxx       Unknown                              !XXX
      Y    Tyr       Tyrosine     TAT,TAC                 !TAY
      Z    Glu,Gln   Glutamic,
                     Glutamine    GAA,GAG,CAA,CAG         !SAR
      *    End       Terminator   TAA,TAG,TGA             !TAR,TRA;TRR
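
To see why the depictions after a semicolon are not specific, it helps to expand an IUB depiction back into the codons it stands for. The following sketch does that using the nucleotide meanings listed earlier; the names IUB_BASES and expand are illustrative, and the leading "!" of the table column is not part of the depiction passed in.

```python
from itertools import product

# Bases represented by each IUB/GCG nucleotide symbol.
IUB_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "X": "GATC", "N": "GATC",
}

def expand(depiction):
    """Return the set of concrete codons matched by an IUB depiction."""
    choices = (IUB_BASES[symbol] for symbol in depiction)
    return {"".join(codon) for codon in product(*choices)}

LEU = {"TTG", "TTA", "CTT", "CTC", "CTA", "CTG"}

# The depictions before the semicolon cover leucine exactly...
print(expand("TTR") | expand("CTX") == LEU)  # True
# ...while YTX also matches the two phenylalanine codons.
print(sorted(expand("YTX") - LEU))           # ['TTC', 'TTT']
# RAY matches exactly the Asp and Asn codons listed for B.
print(sorted(expand("RAY")))                 # ['AAC', 'AAT', 'GAC', 'GAT']
```

The same check shows that TRR, listed after the semicolon for the terminator, also matches TGG (Trp).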


References:
  1. A. Cornish-Bowden. Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984.
  2. IUPAC-IUB Commission on Biochemical Nomenclature (CBN). Abbreviations and Symbols for Nucleic Acids, Polynucleotides and their Constituents.
  3. Nomenclature Committee of the International Union of Biochemistry (NC-IUB). Nomenclature for Incompletely Specified Bases in Nucleic Acid Sequences.

  • Compiled by Zhiliang Hu