In 2020, Pfizer and BioNTech developed a vaccine against the SARS-CoV-2 virus. It contains genetic material very similar to that of the virus's famous Spike protein: the protein responsible for creating the characteristic club-shaped spikes that project from the surface of coronaviruses.

Coronaviruses have characteristic club-shaped spikes that project from their surface, which in electron micrographs create an image reminiscent of the solar corona, from which their name derives.

Through clever chemical processes, the vaccine manages to get this genetic material into specific cells of the human body. These cells then dutifully start producing large quantities of SARS-CoV-2 Spike proteins. When our immune system senses these Spike proteins, it develops a powerful response against the Spike protein and its production process.

Assignment

A peptide is a short stretch of genetic material that is made up of a sequence of building blocks. Each building block is represented by one character (str), so that the peptides can be represented as strings (str). Different characters represent different types of building blocks. No distinction is made between uppercase and lowercase letters.

There are four types of building blocks (called bases) in DNA and RNA. DNA bases are represented by the letters A, C, G and T. In RNA, base T is replaced by base U. This immediately tells us how DNA is transcribed into RNA (and vice versa). In proteins there are 21 types of building blocks (called amino acids), which are represented by 20 letters and an asterisk (*; technically not an amino acid but a stop codon).
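
The T/U substitution described above can be sketched in a few lines of Python. The function names `dna_to_rna` and `rna_to_dna` are hypothetical helpers used only for illustration; they are not part of the assignment. The sketch assumes that the case of each letter should be preserved.

```python
def dna_to_rna(peptide: str) -> str:
    """Replace every base T by U, keeping the case of each letter."""
    return peptide.replace('T', 'U').replace('t', 'u')


def rna_to_dna(peptide: str) -> str:
    """Replace every base U by T, keeping the case of each letter."""
    return peptide.replace('U', 'T').replace('u', 't')


print(dna_to_rna('ATGTTT'))  # AUGUUU
print(rna_to_dna('auguuu'))  # atgttt
```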

A DNA or RNA peptide can describe how proteins are made. To do so, the peptide is split into codons: sequences of three consecutive bases. With four possible bases at each of the three positions, there are $$4^3 = 64$$ different codons. Since each codon corresponds to one amino acid and there are only 21 types of amino acids, there must be multiple codons that correspond to the same amino acid. Such codons are called synonyms.
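
As a minimal sketch of the two ideas above, the following Python fragment splits a peptide into codons and enumerates all possible DNA codons. The helper name `split_codons` is hypothetical, and the sketch assumes that the length of the peptide is a multiple of three.

```python
from itertools import product


def split_codons(peptide: str) -> list[str]:
    """Split a peptide into consecutive groups of three bases.

    Assumes the length of the peptide is a multiple of three.
    """
    return [peptide[i:i + 3] for i in range(0, len(peptide), 3)]


# every combination of three bases drawn from the four DNA bases
all_codons = [''.join(bases) for bases in product('ACGT', repeat=3)]

print(split_codons('ATGTTTGTT'))  # ['ATG', 'TTT', 'GTT']
print(len(all_codons))            # 64
```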

The standard genetic code shows how most organisms convert RNA codons into amino acids.

A genetic code describes how codons are converted to amino acids. Different genetic codes (with slight variations) are used by different groups of organisms. A genetic code is stored in a text file in the following format. The first line contains a header (and can therefore be ignored). This is followed by 64 lines, each describing the conversion of a codon to an amino acid using four comma-separated fields: i) codon, ii) amino acid name, iii) three-letter amino acid abbreviation and iv) one-letter amino acid abbreviation. This is, for example, how the standard genetic code is stored:

Codon,Full Name,Abbreviation (3 Letter),Abbreviation (1 Letter)
TTT,Phenylalanine,Phe,F
TTC,Phenylalanine,Phe,F
TTA,Leucine,Leu,L
…
TAA,Termination (ochre),Ter,*
TAG,Termination (amber),Ter,*
…
GGC,Glycine,Gly,G
GGA,Glycine,Gly,G
GGG,Glycine,Gly,G

A file describing a genetic code may use either DNA or RNA codons. The codons can be written in uppercase or lowercase.
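
One possible way to read this format is sketched below with Python's `csv` module. The name `parse_code` and the five-line `SAMPLE` text are made up for illustration; the sample deliberately mixes DNA, RNA and lowercase codons only to show the normalization. Storing every codon as an uppercase DNA codon is an assumption, chosen because the interactive session in the Example section looks codons up that way.

```python
import csv
import io

# Hypothetical sample in the described format (header + a few codon lines).
SAMPLE = """Codon,Full Name,Abbreviation (3 Letter),Abbreviation (1 Letter)
TTT,Phenylalanine,Phe,F
ttc,Phenylalanine,Phe,F
UUA,Leucine,Leu,L
TAA,Termination (ochre),Ter,*
"""


def parse_code(lines) -> dict[str, str]:
    """Map each codon (normalized to uppercase DNA) to its one-letter amino acid."""
    reader = csv.reader(lines)
    next(reader)  # the first line is a header and is ignored
    table = {}
    for row in reader:
        if not row:  # tolerate blank lines
            continue
        codon, _name, _abbr3, abbr1 = row
        table[codon.upper().replace('U', 'T')] = abbr1
    return table


table = parse_code(io.StringIO(SAMPLE))
print(table)  # {'TTT': 'F', 'TTC': 'F', 'TTA': 'L', 'TAA': '*'}
```

Passing an open file object instead of the `io.StringIO` wrapper should work the same way, since `csv.reader` accepts any iterable of lines.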

The Pfizer/BioNTech vaccine uses a technique called codon optimization, where each codon of a peptide is replaced by a synonym containing as many C's and G's as possible. As a result, the peptide still describes the same protein, but it turns out that RNA with a higher amount of C's and G's is converted more efficiently into proteins. This can already be seen, for example, if we compare the start of the SARS-CoV-2 Spike protein with the start of the vaccine:

          M   F   V   F   L   V   L   L   P   L   V   S   S   Q   C   V
virus:   AUG UUU GUU UUU CUU GUU UUA UUG CCA CUA GUC UCU AGU CAG UGU GUU
               |   |   |   |   | | | |     |   |   |   |   |
vaccine: AUG UUC GUG UUC CUG GUG CUG CUG CCU CUG GUG UCC AGC CAG UGU GUU
          M   F   V   F   L   V   L   L   P   L   V   S   S   Q   C   V
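
The comparison above can be checked with a short sketch that counts the differing bases and the number of C's and G's in both fragments. The helper name `gc_count` is hypothetical; the two sequences are copied from the alignment.

```python
virus = 'AUG UUU GUU UUU CUU GUU UUA UUG CCA CUA GUC UCU AGU CAG UGU GUU'.split()
vaccine = 'AUG UUC GUG UUC CUG GUG CUG CUG CCU CUG GUG UCC AGC CAG UGU GUU'.split()


def gc_count(sequence: str) -> int:
    """Count the bases C and G in a sequence (case-insensitive)."""
    return sum(base in 'CG' for base in sequence.upper())


# number of positions where the two fragments carry a different base
changed = sum(
    a != b
    for v_codon, w_codon in zip(virus, vaccine)
    for a, b in zip(v_codon, w_codon)
)

print(changed)                    # 13
print(gc_count(''.join(virus)))   # 16
print(gc_count(''.join(vaccine))) # 27
```

Out of 48 bases, the virus fragment contains 16 C's and G's, whereas the vaccine fragment contains 27.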

All mutations are synonymous and all but one introduce an extra C or G in the RNA of the vaccine. This is what gets us to a 95% efficient vaccine. Your task:

Example

In the following interactive session we assume the text file genetic_code.txt to be located in the current directory.

>>> code = genetic_code('genetic_code.txt')
>>> code['TCT']
'S'
>>> code['TCA']
'S'
>>> code['ATG']
'M'

>>> inverse_code = inverse_genetic_code(code)
>>> inverse_code['S']
{'TCA', 'TCC', 'TCT', 'AGC', 'TCG', 'AGT'}
>>> inverse_code['M']
{'ATG'}
>>> inverse_code['*']
{'TGA', 'TAA', 'TAG'}

>>> synonyms('TCT', code)
{'TCC', 'AGT', 'TCT', 'AGC', 'TCA', 'TCG'}
>>> synonyms('atg', code, RNA=True)
{'AUG'}
>>> synonyms('UGA', code)
{'UAA', 'UAG', 'UGA'}

>>> codon_optimization(['tca', 'tcc', 'tct', 'agc', 'tcg', 'agt'])
'agc'
>>> codon_optimization({'ATG'})
'ATG'
>>> codon_optimization(('UAA', 'UAG', 'UGA'))
'UAG'

>>> peptide_optimization('ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTT', code)
'ATGTTCGTCTTCCTCGTCCTCCTCCCCCTCGTCAGCAGCCAGTGCGTC'
>>> peptide_optimization('AUGUUUGUUUUUCUUGUUUUAUUGCCACUAGUCUCUAGUCAGUGUGUU', code, RNA=True)
'AUGUUCGUCUUCCUCGUCCUCCUCCCCCUCGUCAGCAGCCAGUGCGUC'
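
The codon_optimization results above are consistent with a simple selection rule: take the codon with the most C's and G's, and break ties by alphabetical order. The tie-breaking rule is only inferred from the examples, so it should be read as an assumption. The sketch below uses the hypothetical names `gc_count` and `pick_gc_richest` and is not meant as the reference solution.

```python
def gc_count(codon: str) -> int:
    """Count the bases C and G in a codon (case-insensitive)."""
    return sum(base in 'CG' for base in codon.upper())


def pick_gc_richest(codons) -> str:
    """Return the codon with the most C's and G's.

    Assumption: ties are broken alphabetically, ignoring case.
    """
    return min(codons, key=lambda codon: (-gc_count(codon), codon.upper()))


print(pick_gc_richest(['tca', 'tcc', 'tct', 'agc', 'tcg', 'agt']))  # agc
print(pick_gc_richest({'ATG'}))                                     # ATG
print(pick_gc_richest(('UAA', 'UAG', 'UGA')))                       # UAG
```

Sorting on the pair (negated count, uppercase codon) lets a single call to `min` handle both criteria, and the codon is returned in its original case.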