Given a genome $$s$$, the path graph $$\mathcal{G}_\text{path}(k, s)$$ is the path consisting of $$|s| - k + 1$$ edges, where the $$i$$-th edge of this path is labeled by the $$i$$-th $$k$$-mer in $$s$$ and the $$i$$-th node of the path is labeled by the $$i$$-th $$(k - 1)$$-mer in $$s$$. The de Bruijn graph $$\mathcal{G}_\text{DB}(k, s)$$ is formed by gluing identically labeled nodes in $$\mathcal{G}_\text{path}(k, s)$$.

Note

In this assignment we only build the de Bruijn graph from the $$k$$-mers, not taking into account their reverse complements.

Assignment

Write a function de_bruijn_graph that takes two arguments: i) an integer $$k$$ ($$k \geq 2$$) and ii) a FASTA file containing a DNA string $$s$$. The function must return the de Bruijn graph $$\mathcal{G}_\text{DB}(k, s)$$. The de Bruijn graph is represented as a dictionary that maps each label of a starting node in $$\mathcal{G}_\text{DB}(k, s)$$ onto a list containing the lexicographically sorted labels of all ending nodes connected to that starting node.

Example

In the following interactive session, we assume the FASTA file data01.fna¹ to be located in the current directory.

        >>> de_bruijn_graph(4, 'data01.fna')
{'AAG': ['AGA'], 'AGA': ['GAT'], 'GAT': ['ATT'], 'ATT': ['TTC'], 'TTC': ['TCT'], 'TCT': ['CTA', 'CTC'], 'CTC': ['TCT'], 'CTA': ['TAC']}