Frederick Sanger (1918-2013) was a British biochemist who won the Nobel Prize in chemistry twice, one of two people who have done so in the same category (the other is John Bardeen in physics), the fourth person overall with two Nobel Prizes, and the third person overall with two Nobel Prizes in the sciences. In 1958, he was awarded a Nobel Prize in chemistry "for his work on the structure of proteins, especially that of insulin". In 1980, Walter Gilbert and Sanger shared half of the chemistry prize "for their contributions concerning the determination of base sequences in nucleic acids". The other half was awarded to Paul Berg "for his fundamental studies of the biochemistry of nucleic acids, with particular regard to recombinant DNA".

Sanger's second Nobel Prize was awarded for a method that he and his colleagues developed in 1977, currently known as Sanger sequencing. It was the most widely used sequencing method for approximately 25 years. More recently, Sanger sequencing has been supplanted by Next Generation sequencing methods, especially for large-scale, automated genome analyses. However, the Sanger method remains in wide use, for smaller-scale projects, validation of Next-Gen results and for obtaining especially long contiguous DNA sequence reads (> 500 nucleotides). The following video describes in general terms how Sanger sequencing works.

In this assignment we represent a DNA sequence as a string that only contains the uppercase letters A, C, G and T, representing the individual bases (or nucleotides) of the sequence. In addition, all you need to know about the Sanger method is that sequencing a DNA molecule yields a large collection of fragments, of which only the length and the last base are known. We will represent such a fragment as a string that starts with zero or more dashes (-) representing the unknown bases at the start of the DNA sequence), followed by one of the four base letters. As an example, the following string represents a fragment of length 6 that tells us that the sixth base letter of the DNA sequence is a G.

-----G

All fragments that are obtained after sequencing a DNA molecule are stored in a text file that contains one fragment per line. Because the length of each fragment is determined by a stochastic process, it is possible that there are no fragments of a given length or that there are multiple fragments having the same length. Due to errors during the sequencing process, it is quite possible that some fragments have the same length but a different base at the end. As a result, for some positions the correct base letter cannot be derived unambiguously.

----T
C
-C
-----G
C
-----C
---T
----A
----C
-----G

Based on the fragments in the sample file above, we can infer that the first base of the DNA sequence is a C (confirmed by two fragments), that the second base is also a C, and that the fourth base is a T. There is no information about the third base because there are no fragments of length 3. The fifth base can either be an A, a C or a T, and the sixth base can be a C or a G.

The International Union of Pure and Applied Chemistry (IUPAC) has developed a coding system (the IUPAC code) that assigns a unique uppercase letter to each combination of one, two, three and four different bases. For example, the combination "A or C or T" is represented by the letter H, the combination "C or G" by the letter S, and the combination "A or C or G or T" by the letter N. Based on the information in the sample file above, the DNA sequence can be reconstructed as CCNTHS. For this reconstruction we have also used the letter N at positions in the sequence for which no information can be derived from the text file.

Assignment

In this assignment you have to process two types of text files. An IUPAC file contains one line for each possible combination of one, two, three and four different base letters, that consists of a unique uppercase letter, followed by a space and the base letters of the combination in alphabetic order. All uppercase letters in the first column are different. A fragments file contains all fragments obtained after sequencing a DNA molecule using the Sanger method, with each fragment on a separate line. Your task:

Example

The following interactive session assumes that the text files fragments1.txt1, fragments2.txt2 and IUPAC.txt3 are located in the current directory.

>>> codes = IUPAC('IUPAC.txt')
>>> codes
{'G': 'G', 'C': 'C', 'ACGT': 'N', 'ACT': 'H', 'ACG': 'V', 'AC': 'M', 'CT': 'Y', 'AT': 'W', 'AG': 'R', 'CG': 'S', 'T': 'T', 'A': 'A', 'GT': 'K', 'AGT': 'D', 'CGT': 'B'}

>>> code('ATC', codes)
'H'
>>> code(['G', 'C', 'G', 'C', 'C', 'C'], codes)
'S'
>>> code(('A', 'T', 'T', 'T', 'A', 'T', 'A', 'T', 'A', 'A'), codes)
'W'
>>> code({'G', 'A', 'T'}, codes)
'D'

>>> fragments('fragments1.txt')
{1: {'C'}, 2: {'C'}, 4: {'T'}, 5: {'C', 'A', 'T'}, 6: {'C', 'G'}}
>>> fragments('fragments2.txt')
{1: {'T'}, 2: {'G'}, 3: {'A'}, 4: {'C'}, 6: {'G'}}

>>> sequence('fragments1.txt', 'IUPAC.txt')
'CCNTHS'
>>> sequence('fragments2.txt', 'IUPAC.txt')
'TGACNG'