Determining an organism's complete genome (called genome sequencing) is one of the cornerstones of bioinformatics. Unfortunately, we still don't possess the microscope technology to zoom in to the nucleotide level and determine the sequence of a genome's nucleotides, one at a time. However, researchers can apply chemical methods to generate and identify much smaller snippets of DNA, called reads. After obtaining a large collection of reads from multiple copies of the same genome, the remaining challenge is to reconstruct the desired genome from these small pieces of DNA. This process is called genome assembly.

fragment assembly
Genome assembly works by blasting many copies of the same genome into smaller, identifiable reads, which are then used to computationally assemble one copy of the genome.

Even though genome sequencing now has become a multi-billion dollar enterprise, the sequencing machines that identify the reads still produce errors a substantial percentage of the time. To make matters worse, these errors are unpredictable. It is difficult to determine if the machine has made an error, let alone where in the read the error has occurred. For this reason, error correction in reads is typically a vital first step in genome assembly. As is the case with point mutations, the most common type of sequencing error occurs when a single nucleotide from a read is interpreted incorrectly.

Assignment

You are given a collection of reads of equal length. For each read $$r$$ in the data set, one of the following applies:

Your task is to determine the orphan reads and to determine the error corrections for the other incorrect reads. In order to do so, you proceed as follows:

Reverse complement

The reverse complement of a DNA string is formed by reversing the string and taking the complement of each base symbol (A and T are complementary base symbols, as are C and G). We must reverse the string in addition to taking complements because of the directionality of DNA. DNA replication and transcription occurs from the 5' end to the 3' end, and the 3' end of one strand is opposite from the 5' end of the complementary strand. Thus, if we were to simply take complements, then we would be reading the second strand in the wrong direction.

reverse complement

Hamming distance

The Hamming distance between two strings having the same length is the minimum number of symbol substitutions required to transform one string into the other. If the strings are given by $$s_1$$ and $$s_2$$, then we write the Hamming distance between them as $$d_H(s_1,s_2)$$.

We can compute the Hamming distance by visual inspection: the Hamming distance between two strings is simply the number of positions in the strings at which corresponding symbols differ. For example, the Hamming distance between the two strings in the figure below is equal to 7.

Hamming distance

Example

>>> hamming('TTCAT', 'TTCAT')
0
>>> hamming('TTCAT', 'TTGAT')
1
>>> hamming('TTCAT', 'GAGGA')
5
>>> hamming('TTCAT', 'TTT')
Traceback (most recent call last):
AssertionError: strings should have equal length

>>> complement('TTCAT')
'ATGAA'
>>> complement('TTGAT')
'ATCAA'
>>> complement('GAGGA')
'TCCTC'

>>> normalform('TTCAT')
'ATGAA'
>>> normalform('TTGAT')
'ATCAA'
>>> normalform('GAGGA')
'GAGGA'

>>> reads = ['TCATC', 'TTCAT', 'TCATC', 'TGAAA', 'GAGGA', 'TTTCA', 'ATCAA', 'TTGAT', 'AGGCT']
>>> occurrences(reads)
{'GAGGA': 1, 'ATGAA': 1, 'TGAAA': 2, 'GATGA': 2, 'AGCCT': 1, 'ATCAA': 2}
>>> errors(reads)
({'AGGCT'}, [('GAGGA', 'GATGA'), ('TTCAT', 'TTGAT')])

>>> reads = ['GATTA', 'GATTA', 'TAATC', 'GAATC', 'GATTA', 'TAAGC', 'TAATA']
>>> occurrences(reads)
{'GATTA': 4, 'TAATA': 1, 'GCTTA': 1, 'GAATC': 1}
>>> errors(reads)
(set(), [('GAATC', 'TAATC'), ('TAAGC', 'TAATC'), ('TAATA', 'TAATC')])