The genetic code is the set of rules that determines how living cells translate the information encoded in genetic material (DNA or RNA sequences) into proteins (amino acid sequences). The code defines how a sequence of three nucleotides, called a codon, specifies which amino acid is added next to the protein during protein synthesis.

A sequence of codons within a messenger RNA (mRNA) molecule. Every codon consists of three nucleotides and generally represents a single amino acid. The nucleotides are abbreviated with the letters A, U, G, and C. This is mRNA, which uses U (uracil), as opposed to DNA, which uses T (thymine). The mRNA molecule directs the synthesis of a protein according to this code.

The standard genetic code.

Because the majority of genes use the same code, this specific code is often referred to as the canonical or standard genetic code, or simply the genetic code, even though several variants of the code have evolved. Protein synthesis in human mitochondria is an example that uses a genetic code deviating from the standard genetic code.

Living cells use 20 types of amino acids to build proteins, each of which is assigned its own uppercase letter. With four different nucleotides, a code based on codons of 2 nucleotides can encode at most $$4^2$$ or 16 different amino acids, which is not enough. Genetic codes are therefore 3-letter codes with $$4^3$$ or 64 possible codons, where some codons map to the same amino acid or serve as a stop codon. A specific genetic code can be recorded by linking each of the 64 possible codons to an amino acid (indicated by an uppercase letter) or a stop codon (indicated by an asterisk (*)).
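The counting argument above can be checked with a few lines of Python. This is only an illustrative sketch: the names NUCLEOTIDES, count_words and codons are hypothetical and not part of the assignment, and the RNA alphabet is chosen arbitrarily.

```python
from itertools import product

NUCLEOTIDES = "ACGU"  # RNA alphabet; the DNA alphabet would use T instead of U


def count_words(length):
    """Number of distinct nucleotide words of the given length."""
    return len(NUCLEOTIDES) ** length


# Enumerate every possible codon (word of three nucleotides).
codons = ["".join(letters) for letters in product(NUCLEOTIDES, repeat=3)]

print(count_words(2))  # 16, fewer than the 20 amino acids
print(count_words(3))  # 64, enough for 20 amino acids plus stop codons
print(len(codons))     # 64
print(codons[:4])      # ['AAA', 'AAC', 'AAG', 'AAU']
```

Since 64 codons have to cover only 20 amino acids and the stop signal, several codons necessarily share the same meaning.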

Assignment

Define a class GeneticCode to represent specific genetic codes for translating genes into proteins. When creating a new genetic code (GeneticCode), the location (str) of a text file containing the translation table of the genetic code must be passed. Such a file consists of 64 lines, each containing a codon and the corresponding amino acid (or an asterisk for a stop codon), separated by a space. You may assume the file contains all 64 codons. However, the codons may be written in either the DNA alphabet (with T for thymine) or the RNA alphabet (with U for uracil). The example below uses a file in which the DNA alphabet is used.
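One possible way to deal with the two alphabets is to rewrite every codon in a single alphabet while reading the table. The sketch below is an assumption about how this could be done, not a prescribed solution: the helper name parse_translation_table is hypothetical, it takes an iterable of lines rather than a file location so that it can be tried without a file, and skipping malformed lines is a choice made here for illustration.

```python
def parse_translation_table(lines):
    """Build a codon -> amino acid dict from lines such as 'TTT F'.

    Hypothetical helper: keys are stored in uppercase in the DNA alphabet,
    so tables written with U and tables written with T end up identical.
    """
    table = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # assumption: silently skip blank or malformed lines
        codon, amino_acid = fields
        table[codon.upper().replace("U", "T")] = amino_acid
    return table


dna_style = parse_translation_table(["TTT F", "TAA *", "ATG M"])
rna_style = parse_translation_table(["UUU F", "UAA *", "AUG M"])
print(dna_style)               # {'TTT': 'F', 'TAA': '*', 'ATG': 'M'}
print(dna_style == rna_style)  # True
```

An open text file is itself an iterable of lines, so the same helper could be applied to the file whose location is passed when a genetic code is created.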

Furthermore, a genetic code $$\mathcal{C}$$ (GeneticCode) must support at least the following methods:

A method amino_acid that takes a valid codon (str; DNA or RNA). The method must return the uppercase letter (str) representing the corresponding amino acid in genetic code $$\mathcal{C}$$. In this translation, the method must not distinguish between uppercase and lowercase letters for the given codon, nor between the letters U (uracil) and T (thymine). In this way, both DNA and RNA codons can be passed to the method. The example below shows how the method should respond if the argument does not represent a valid codon.
A method protein that takes a DNA or RNA sequence (str). This sequence may contain both uppercase and lowercase letters, but only contains letters from the DNA or the RNA alphabet. The example below shows how the method should respond if the argument does not represent a valid DNA or RNA sequence. If a valid sequence is passed, the method must return its translation into the corresponding protein sequence (str). This translation must consist only of uppercase letters. If the length of the given DNA or RNA sequence is not a multiple of three, the last one or two letters of the sequence must be ignored in the translation.
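The two preprocessing steps that both methods rely on (making the input case-insensitive and alphabet-independent, and cutting a sequence into complete codons) can be sketched as follows. The helper names normalize and split_codons are hypothetical, and the sketch assumes that a sequence mixing T and U is also accepted, which the description above does not settle.

```python
def normalize(sequence):
    """Uppercase a DNA or RNA sequence and rewrite it in the DNA alphabet."""
    sequence = sequence.upper().replace("U", "T")
    # Assumption: after this rewrite only the letters A, C, G and T may remain.
    assert set(sequence) <= set("ACGT"), "invalid DNA or RNA sequence."
    return sequence


def split_codons(sequence):
    """Cut a normalized sequence into complete codons, dropping leftovers."""
    usable = len(sequence) - len(sequence) % 3  # ignore the last 1 or 2 letters
    return [sequence[i:i + 3] for i in range(0, usable, 3)]


print(split_codons(normalize("uauccuaguguc")))
# ['TAT', 'CCT', 'AGT', 'GTC']
print(split_codons(normalize("ATGCTGATGATGGGCTATTATCGAT")))
# ['ATG', 'CTG', 'ATG', 'ATG', 'GGC', 'TAT', 'TAT', 'CGA']
```

Looking up each of these codons in the translation table and joining the resulting letters would then give the protein sequence.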

Example

In the example below, we assume that the file standard_code.txt is located in the current directory and has the following contents:

TTT F
TTC F
TTA L
TTG L
CTT L
CTC L
CTA L
CTG L
ATT I
ATC I
ATA I
ATG M
GTT V
GTC V
GTA V
GTG V
TCT S
TCC S
TCA S
TCG S
CCT P
CCC P
CCA P
CCG P
ACT T
ACC T
ACA T
ACG T
GCT A
GCC A
GCA A
GCG A
TAT Y
TAC Y
TAA *
TAG *
CAT H
CAC H
CAA Q
CAG Q
AAT N
AAC N
AAA K
AAG K
GAT D
GAC D
GAA E
GAG E
TGT C
TGC C
TGA *
TGG W
CGT R
CGC R
CGA R
CGG R
AGT S
AGC S
AGA R
AGG R
GGT G
GGC G
GGA G
GGG G

>>> code = GeneticCode('standard_code.txt')

>>> code.amino_acid('AGT')
'S'
>>> code.amino_acid('cga')
'R'
>>> code.amino_acid('UCU')
'S'
>>> code.amino_acid('ABC')
Traceback (most recent call last):
AssertionError: 'ABC' is not a valid codon.
>>> code.amino_acid('aagc')
Traceback (most recent call last):
AssertionError: 'aagc' is not a valid codon.

>>> code.protein('ATGCTGATGATGGGCTATTATCGAT')
'MLMMGYYR'
>>> code.protein('uauccuaguguc')
'YPSV'
>>> code.protein('AAGTCGTAGCTACGXXXXGAGAAGGAT')
Traceback (most recent call last):
AssertionError: invalid DNA or RNA sequence.