Genome assembly is the computational process in which the actual sequence of nucleotides making up an organism's genome is determined. Researchers are presently unable to read the nucleotides of an entire chromosome one at a time. Instead, they rely on high-technological chemical methods to determine the order of bases along short strands of DNA (usually less than 1000 bp). These short strands are called reads.

To sequence a genome, researchers chemically "blow up" multiple copies of a genome (typically taken from multiple cells of the same organism) into reads and then determine the nature of the reads. To reassemble the genome, they must use overlaps in reads to determine which read pairs are adjacent in the genome. For example, if the short reads GACCTACA and ACCTACAA are produced, then we might surmise from the fact that they overlap that they both belong to a substring GACCTACAA. An efficient algorithm for assembly that can handle all possible wrinkles and complications arising from practical concerns (such as errors in reads) still does not exist, so that research in genome assembly remains a highly competitive field.

Assignment

As a first step in genome assembly, most algorithms try to get insight in the overlap between reads by constructing an overlap graph. This is a data structure where overlapping reads (represented as labeled circles in the picture below) are connected using arrows. Note that it is possible that one read overlaps multiple other reads, usually due to repeats in the genome or sequencing errors. This causes so-called "forks" in the overlap graph.

overlapgraaf
Example of an overlap graph (right) constructed from a set of 10 reads of 8 base pairs (left). Reads having an overlap of at least 6 base pairs are indicated by drawing an arrow between both reads. Transitive overlaps — which are implied by other longer overlaps — are indicated as dotted arrows. The fork in the graph is the result of repeat sequences or read errors, and makes it extremely difficult for assembly algorithms to figure out which route has to be followed. Note that we have only considered the forward orientation of each sequence to simplify the figure.

The goal of this assignment is to construct the overlap graph for a given set of reads. In doing so, you proceed as follows:

Example

>>> overlap('AAATTTT', 'TTTTCCC', 3)
True
>>> overlap('AAATTTT', 'TTTTCCC', 5)
False
>>> overlap('ATATATATAT', 'TATATATATA', 4)
False
>>> overlap('ATATATATAT', 'TATATATATA', 5)
True

>>> maximalOverlap('AAATTTT', 'TTTTCCC')
4
>>> maximalOverlap('ATATATATAT', 'TATATATATA')
9

>>> reads = ['AAATAAA', 'AAATTTT', 'TTTTCCC', 'AAATCCC', 'GGGTGGG']
>>> overlapGraph(reads, 3)
{'AAATTTT': {'TTTTCCC'}, 'AAATAAA': {'AAATTTT', 'AAATCCC'}}
>>> overlapGraph(reads, 4)
{'AAATTTT': {'TTTTCCC'}}

>>> reads = ['GACCTACA', 'ACCTACAA', 'CCTACAAG', 'CTACAAGT', 'TACAAGTT', 'ACAAGTTA', 'CAAGTTAG', 'TACAAGTC', 'ACAAGTCC', 'CAAGTCCG']
>>> overlapGraph(reads, 6)
{'CTACAAGT': {'ACAAGTCC', 'ACAAGTTA', 'TACAAGTC', 'TACAAGTT'}, 'TACAAGTT': {'CAAGTTAG', 'ACAAGTTA'}, 'ACCTACAA': {'CTACAAGT', 'CCTACAAG'}, 'ACAAGTCC': {'CAAGTCCG'}, 'ACAAGTTA': {'CAAGTTAG'}, 'GACCTACA': {'CCTACAAG', 'ACCTACAA'}, 'TACAAGTC': {'ACAAGTCC', 'CAAGTCCG'}, 'CCTACAAG': {'CTACAAGT', 'TACAAGTC', 'TACAAGTT'}}

Note: the last example represents the overlap graph depicted in the introduction of this assignment.

Resources