Introduction to genome sequencing

Recall from "Computing GC-content1" that almost all humans share approximately 99.9% of the same nucleotides on the genome. Thus, if we know only a few complete genomes from the species, then we already possess an important key to unlocking the species genome.

Determining an organism's complete genome (called genome sequencing) forms a central task of bioinformatics. Unfortunately, we still don't possess the microscope technology to zoom into the nucleotide level and determine the sequence of a genome's nucleotides, one at a time. However, researchers can apply chemical methods to generate and identify much smaller snippets of DNA, called reads.

After obtaining a large collection of reads from multiple copies of the same genome, the aim is to reconstruct the desired genome from these small pieces of DNA. This process is called fragment assembly.

fragment assembly
Fragment assembly works by blasting many copies of the same genome into smaller, identifiable reads, which are then used to computationally assemble one copy of the genome.

Although the goal of fragment assembly is to produce an entire genome, in practice it is only possible to construct several contiguous portions of each chromosome, called contigs. Furthermore, the assumption made above that reads all derive from the same strand is also practically unrealistic. In reality, researchers will not know the strand of DNA from which a given read has been sequenced.

Assignment

For a collection of strings, a larger string containing every one of the smaller strings as a substring is called a superstring. By the assumption of parsimony, a shortest possible superstring over a collection of reads serves as a candidate chromosome.

Write a function shortestSuperstring that takes the location of a FASTA file. The FASTA file should contain a collection of DNA strings of equal length, representing reads derived from the same strand of a single linear chromosome. The dataset is guaranteed to satisfy the following condition: there exists a unique way to reconstruct the entire chromosome from these reads by gluing together pairs of reads that overlap by more than half their length. The function must return a shortest superstring containing all the given strings (thus corresponding to a reconstructed chromosome).

Example

In the following interactive session, we assume the FASTA file data.fna2 to be located in the current directory.

>>> shortestSuperstring('data.fna')
'ATTAGACCTGCCGGAATAC'