Given a genome $$s$$, the path graph $$\mathcal{G}_\text{path}(k, s)$$ is the path consisting of $$|s| - k + 1$$ edges, where the $$i$$-th edge of this path is labeled by the $$i$$-th $$k$$-mer in $$s$$ and the $$i$$-th node of the path is labeled by the $$i$$-th $$(k - 1)$$-mer in $$s$$. The de Bruijn graph $$\mathcal{G}_\text{DB}(k, s)$$ is formed by gluing identically labeled nodes in $$\mathcal{G}_\text{path}(k, s)$$.

Note

In this assignment we only build the de Bruijn graph from the $$k$$-mers, not taking into account their reverse complements.

Assignment

Write a function de_bruijn_graph that takes two arguments: i) an integer $$k$$ ($$k \geq 2$$) and ii) a FASTA file containing a DNA string $$s$$. The function must return the de Bruijn graph $$\mathcal{G}_\text{DB}(k, s)$$. The de Bruijn graph is represented as a dictionary that maps each label of a starting node in $$\mathcal{G}_\text{DB}(k, s)$$ onto a list containing the lexicographically sorted labels of all ending nodes connected to that starting node.

Example

In the following interactive session, we assume the FASTA file data01.fna1 to be located in the current directory.

>>> de_bruijn_graph(4, 'data01.fna')
{'AAG': ['AGA'], 'AGA': ['GAT'], 'GAT': ['ATT'], 'ATT': ['TTC'], 'TTC': ['TCT'], 'TCT': ['CTA', 'CTC'], 'CTC': ['TCT'], 'CTA': ['TAC']}