Given an arbitrary collection of $$k$$-mers $$C_\text{DNA}$$ (where some $$k$$-mers may appear multiple times), we define the composition graph $$\mathcal{G}_\text{comp}(C_\text{DNA})$$ as a graph with $$|C_\text{DNA}|$$ isolated edges. Every edge is labeled by a $$k$$-mer from $$C_\text{DNA}$$, and the starting and ending nodes of an edge are labeled by the prefix and suffix of the $$k$$-mer labeling that edge. We then define the de Bruijn graph of $$C_\text{DNA}$$, denoted $$\mathcal{G}_\text{DB}(C_\text{DNA})$$, by gluing identically labeled nodes in $$\mathcal{G}_\text{C}(C_\text{DNA})$$, which yields the following algorithm.

DeBruin(CDNA)
    represent every k-mer in CDNA as an isolated edge between its prefix and suffix
    glue all nodes with identical labels, yielding the graph GDB(CDNA)
    return GDB(CDNA)

Assignment

Write a function de_bruijn_graph that takes a FASTA file containing a collection of $$k$$-mers DNA string $$C_\text{DNA}$$. The function must return the de Bruijn graph $$\mathcal{G}_\text{DB}(C_\text{DNA})$$. The de Bruijn graph is represented as a dictionary that maps each label of a starting node in $$\mathcal{G}_\text{DB}(k, s)$$ onto a list containing the lexicographically sorted labels of all ending nodes connected to that starting node.

Example

In the following interactive session, we assume the FASTA file data01.fna1 to be located in the current directory.

>>> de_bruijn_graph('data01.fna')
{'GAG': ['AGG'], 'CAG': ['AGG', 'AGG'], 'GGG': ['GGA', 'GGG'], 'AGG': ['GGG'], 'GGA': ['GAG']}