Even after read breaking, most assemblies still have gaps in $$k$$-mer coverage, causing the de Bruijn graph to have missing edges, and so the search for an Eulerian path fails. In this case, biologists often settle on assembling contigs (long, contiguous segments of the genome) rather than entire chromosomes. For example, a typical bacterial sequencing project may result in about a hundred contigs, ranging in length from a few thousand to a few hundred thousand nucleotides. For most genomes, the order of these contigs along the genome remains unknown. Needless to say, biologists would prefer to have the entire genomic sequence, but the cost of ordering the contigs into a final assembly and closing the gaps using more expensive experimental methods is often prohibitive.

Fortunately, we can derive contigs from the de Bruijn graph. A path in a graph is called non-branching if In($$v$$) = Out($$v$$) = 1 for each intermediate node $$v$$ of this path, i.e., for each node except possibly the starting and ending node of a path. A maximal non-branching path is a non-branching path that cannot be extended into a longer non-branching path. We are interested in these paths because the strings of nucleotides that they spell out must be present in any assembly with a given $$k$$-mer composition. For this reason, contigs correspond to strings spelled by maximal non-branching paths in the de Bruijn graph.

Assignment

Write a function contigs that takes two arguments: i) the location of a FASTA file containing a collection of $$k$$-mers $$C_\text{DNA}$$ and ii) another file location. The function must write all contigs in the de Bruijn graph of $$C_\text{DNA}$$ to a FASTA file whose location is passed as the second argument. These contigs must be sorted in lexicographic order.

Example

In the following interactive session, we assume the FASTA files data01.fna1 and output01.fna2 to be located in the current directory.

>>> contigs('data01.fna', 'output01.fna')
>>> print(open('output01.fna').read().rstrip())
>seq01
AGA
>seq02
ATG
>seq03
ATG
>seq04
CAT
>seq05
GAT
>seq06
TGGA
>seq07
TGT