Frederick Sanger (1918-2013) was a British biochemist who won the Nobel Prize in chemistry twice, one of two people who have done so in the same category (the other is John Bardeen in physics), the fourth person overall with two Nobel Prizes, and the third person overall with two Nobel Prizes in the sciences. In 1958, he was awarded a Nobel Prize in chemistry "for his work on the structure of proteins, especially that of insulin". In 1980, Walter Gilbert and Sanger shared half of the chemistry prize "for their contributions concerning the determination of base sequences in nucleic acids". The other half was awarded to Paul Berg "for his fundamental studies of the biochemistry of nucleic acids, with particular regard to recombinant DNA".
Sanger's second Nobel Prize was awarded for a method that he and his colleagues developed in 1977, currently known as Sanger sequencing. It was the most widely used sequencing method for approximately 25 years. More recently, Sanger sequencing has been supplanted by Next Generation sequencing methods, especially for large-scale, automated genome analyses. However, the Sanger method remains in wide use, for smaller-scale projects, validation of Next-Gen results and for obtaining especially long contiguous DNA sequence reads (> 500 nucleotides). The following video describes in general terms how Sanger sequencing works.
In this assignment we represent a DNA sequence as a string that only contains the uppercase letters A, C, G and T, representing the individual bases (or nucleotides) of the sequence. In addition, all you need to know about the Sanger method is that sequencing a DNA molecule yields a large collection of fragments, of which only the length and the last base are known. We will represent such a fragment as a string that starts with zero or more dashes (-) representing the unknown bases at the start of the DNA sequence), followed by one of the four base letters. As an example, the following string represents a fragment of length 6 that tells us that the sixth base letter of the DNA sequence is a G.
-----G
All fragments that are obtained after sequencing a DNA molecule are stored in a text file that contains one fragment per line. Because the length of each fragment is determined by a stochastic process, it is possible that there are no fragments of a given length or that there are multiple fragments having the same length. Due to errors during the sequencing process, it is quite possible that some fragments have the same length but a different base at the end. As a result, for some positions the correct base letter cannot be derived unambiguously.
----T C -C -----G C -----C ---T ----A ----C -----G
Based on the fragments in the sample file above, we can infer that the first base of the DNA sequence is a C (confirmed by two fragments), that the second base is also a C, and that the fourth base is a T. There is no information about the third base because there are no fragments of length 3. The fifth base can either be an A, a C or a T, and the sixth base can be a C or a G.
The International Union of Pure and Applied Chemistry (IUPAC) has developed a coding system (the IUPAC code) that assigns a unique uppercase letter to each combination of one, two, three and four different bases. For example, the combination "A or C or T" is represented by the letter H, the combination "C or G" by the letter S, and the combination "A or C or G or T" by the letter N. Based on the information in the sample file above, the DNA sequence can be reconstructed as CCNTHS. For this reconstruction we have also used the letter N at positions in the sequence for which no information can be derived from the text file.
In this assignment you have to process two types of text files. An IUPAC file contains one line for each possible combination of one, two, three and four different base letters, that consists of a unique uppercase letter, followed by a space and the base letters of the combination in alphabetic order. All uppercase letters in the first column are different. A fragments file contains all fragments obtained after sequencing a DNA molecule using the Sanger method, with each fragment on a separate line. Your task:
Write a function IUPAC that takes the location of an IUPAC file. The function must return a dictionary that maps all possible combinations of base letters in the given file onto the uppercase letter that is used to represent this combination. The uppercase letter can be found in the first column of the file.
Write a function code that takes two arguments: i) a collection type object (a string, a list, a tuple or a set) containing one or more base letters, and ii) a dictionary that maps all possible combinations of one, two, three and four different base letters (represented as a string containing the base letters in alphabetic order) onto the unique uppercase letter that represents the combination. The function must return the uppercase letter that corresponds to the combination of bases that is passed as the first argument to the function, taking into account the mapping determined by the dictionary that is passed as the second argument to the function.
Write a function fragments that takes the location of a fragments file. The function must return a dictionary that maps each position in the DNA sequence for which information can be found in the given text file onto the set of bases on that position according to the given text file. The first position in the DNA sequence is position 1, the second position is position 2, and so on.
Write a function sequence that takes the locations of a fragments file and an IUPAC file. The function must return the DNA sequence that can be reconstructed from the fragments in the given fragments file. The reconstruction must use the IUPAC codes as defined in the given IUPAC file. Positions in the DNA sequence for which no information is available in the fragments file have to be filled up by the letter in the given IUPAC file that corresponds to any of the four possible bases (A or C or G or T). The last position of the DNA sequence corresponds to the last position for which information can be derived from the given fragments file.
The following interactive session assumes that the text files fragments1.txt1, fragments2.txt2 and IUPAC.txt3 are located in the current directory.
>>> codes = IUPAC('IUPAC.txt')
>>> codes
{'G': 'G', 'C': 'C', 'ACGT': 'N', 'ACT': 'H', 'ACG': 'V', 'AC': 'M', 'CT': 'Y', 'AT': 'W', 'AG': 'R', 'CG': 'S', 'T': 'T', 'A': 'A', 'GT': 'K', 'AGT': 'D', 'CGT': 'B'}
>>> code('ATC', codes)
'H'
>>> code(['G', 'C', 'G', 'C', 'C', 'C'], codes)
'S'
>>> code(('A', 'T', 'T', 'T', 'A', 'T', 'A', 'T', 'A', 'A'), codes)
'W'
>>> code({'G', 'A', 'T'}, codes)
'D'
>>> fragments('fragments1.txt')
{1: {'C'}, 2: {'C'}, 4: {'T'}, 5: {'C', 'A', 'T'}, 6: {'C', 'G'}}
>>> fragments('fragments2.txt')
{1: {'T'}, 2: {'G'}, 3: {'A'}, 4: {'C'}, 6: {'G'}}
>>> sequence('fragments1.txt', 'IUPAC.txt')
'CCNTHS'
>>> sequence('fragments2.txt', 'IUPAC.txt')
'TGACNG'