In this assignment, you are asked to reconstruct a string from a sequence of $$(k, d)$$-mers corresponding to a path in a paired de Bruijn graph.

Assignment

Write a function reconstruction that takes five arguments. The first two arguments are integers $$k$$ ($$k \geq 2$$) and $$d \in \mathbb{N}_0$$. The next two arguments are locations of FASTA files containing a sequence of $$(k, d)$$-mers $$(a_1, b_1), \ldots, (a_n, b_n)$$ such that Suffix($$a_i|b_i$$) = Prefix($$a_{i+1}|b_{i+1}$$) for all $$i$$ from 1 to $$n - 1$$. The fifth argument is the location of the FASTA file to which the result must be written.

The function must determine the DNA string $$s$$ whose $$i$$-th $$(k, d)$$-mer is equal to $$(a_i|b_i)$$ for all $$i$$ from 1 to $$n$$. If such a string $$s$$ exists, it must be written to the FASTA file whose location is passed as the fifth argument. If such a string $$s$$ does not exist, the function must raise an AssertionError with the message invalid gapped genome path.
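
The core of the reconstruction can be sketched independently of the file handling. The following is a minimal sketch, not a reference solution: the helper name spell_gapped_path and its in-memory list of pairs are assumptions made for illustration, and reading or writing the FASTA files is deliberately left out. It spells one string from the first components and one from the second components, then checks that the two agree where they overlap. Treating a path that is too short for the two strings to meet (fewer than $$d + 1$$ pairs) as invalid is also an assumption of this sketch.

```python
MESSAGE = 'invalid gapped genome path'


def spell_gapped_path(k, d, pairs):
    """Spell the string for a list of (a_i, b_i) pairs (hypothetical helper).

    pairs is a list of 2-tuples of k-mers, in path order.
    """
    if not pairs:
        raise AssertionError(MESSAGE)

    # consecutive pairs must overlap in both components
    for (a, b), (a_next, b_next) in zip(pairs, pairs[1:]):
        if a[1:] != a_next[:-1] or b[1:] != b_next[:-1]:
            raise AssertionError(MESSAGE)

    # string spelled by the first components and by the second components
    first = pairs[0][0] + ''.join(a[-1] for a, _ in pairs[1:])
    second = pairs[0][1] + ''.join(b[-1] for _, b in pairs[1:])

    # second starts k + d positions after first, so the two strings
    # share this many characters (assumption: negative means invalid)
    overlap = len(first) - (k + d)
    if overlap < 0 or first[k + d:] != second[:overlap]:
        raise AssertionError(MESSAGE)

    return first + second[overlap:]


example_pairs = [
    ('GACC', 'GCGC'),
    ('ACCG', 'CGCC'),
    ('CCGA', 'GCCG'),
    ('CGAG', 'CCGG'),
    ('GAGC', 'CGGA'),
]
print(spell_gapped_path(4, 2, example_pairs))  # GACCGAGCGCCGGA
```

The error is raised explicitly rather than with an assert statement, so the check still runs when Python is started with the -O option.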

Example

In the following interactive session, we assume the FASTA files data01_1.fna, data01_2.fna, data02_1.fna, data02_2.fna and output01.fna to be located in the current directory.

>>> reconstruction(4, 2, 'data01_1.fna', 'data01_2.fna', 'output01.fna')
>>> print(open('output01.fna').read().rstrip())
>seq01
GACCGAGCGCCGGA
>>> reconstruction(5, 8, 'data02_1.fna', 'data02_2.fna', 'output02.fna')
Traceback (most recent call last):
AssertionError: invalid gapped genome path