Genome assembly is the computational process in which the actual sequence of nucleotides making up an organism's genome is determined. Researchers are presently unable to read the nucleotides of an entire chromosome one at a time. Instead, they rely on high-technological chemical methods to determine the order of bases along short strands of DNA (usually less than 1000 bp). These short strands are called reads.

genome assembly
Genome assembly works by blasting many copies of the same genome into smaller, identifiable reads, which are then used to computationally assemble one copy of the genome.

To sequence a genome, researchers chemically "blow up" multiple copies of a genome (typically taken from multiple cells of the same organism) into reads and then determine the nature of the reads. To reassemble the genome, they must use overlaps in reads to determine which read pairs are adjacent in the genome. For example, if the short reads GACCTACA and ACCTACAA are produced, then we might surmise from the fact that they overlap that they both belong to a substring GACCTACAA. An efficient algorithm for assembly that can handle all possible wrinkles and complications arising from practical concerns (such as errors in reads) still does not exist, so that research in genome assembly remains a highly competitive field.

In this assignment we solve a simplified version of the genome assembly problem, where all reads have a fixed length $$k$$ ($$k$$-mers), all $$k$$-mers in the original genome occur just once and there are no read errors.


We start by breaking up a given word in all its $$n - k + 1$$ overlapping substrings of length $$k$$. These are called the $$k$$-mers of the given word. Then we randomly shuffle the list of $$k$$-mers. The problem you're now faced with is to reconstruct the original word. For example, consider the following list of four substrings of length 3 (given here in alphabetic order):


These are the four 3-mers of the word BANANA. Your task:

Although this is a simplified version of the true genome assembly problem, this problem in itself is quite challenging in providing an efficient solution. Your challenge is not only to come up with a program that can find a least one solution, but also to try to make that program as fast as possible. You may find some inspiration for solving this problem in the paper of Compeau et al. (2011). In practice, genome assemblies are made from millions of reads with lengths between 40 and 10.000 base pairs (dependent on the sequence technology used), that need to be composed into genomes with lengths in the orders of millions or billions of base pairs.


>>> kmers('banana', 2)
['ba', 'an', 'na', 'an', 'na']
>>> kmers('cucumber', 2)
['cu', 'uc', 'cu', 'um', 'mb', 'be', 'er']
>>> kmers('elderberry', 3)
['eld', 'lde', 'der', 'erb', 'rbe', 'ber', 'err', 'rry']
>>> kmers('pineapple', 3)
['pin', 'ine', 'nea', 'eap', 'app', 'ppl', 'ple']
>>> kmers('mississippi', 3)
['mis', 'iss', 'ssi', 'sis', 'iss', 'ssi', 'sip', 'ipp', 'ppi']
>>> kmers('alfalfa', 3)
['alf', 'lfa', 'fal', 'alf', 'lfa']

>>> assemble(['an', 'an', 'ba', 'na', 'na'])
>>> assemble(['be', 'cu', 'cu', 'er', 'mb', 'uc', 'um'])
>>> assemble(['ber', 'der', 'eld', 'erb', 'err', 'lde', 'rbe', 'rry'])
>>> assemble(['app', 'eap', 'ine', 'nea', 'pin', 'ple', 'ppl'])
>>> assemble(['ipp', 'iss', 'iss', 'mis', 'ppi', 'sip', 'sis', 'ssi', 'ssi'])
>>> assemble(['alf', 'alf', 'fal', 'lfa', 'lfa'])


Now that you have implemented an algorithm that reconstructs words from a shuffled list of all its $$k$$-mers, try to see how fast the algorithm works. The file words8.txt1 contains a list of 70.000+ English words of at least 8 characters, each word on a separate line. Modify your program such that it takes each word in this list, splits it into all its $$k$$-mers (pick a fixed $$k$$; the problem becomes more difficult with smaller $$k$$) and then try to reconstruct the original word from its k-mers. Analyze the behavior of your "assembly" program: