Given an arbitrary collection of $$k$$-mers $$C_\text{DNA}$$ (where some $$k$$-mers may appear multiple times), we define the composition graph $$\mathcal{G}_\text{comp}(C_\text{DNA})$$ as a graph with $$|C_\text{DNA}|$$ isolated edges. Every edge is labeled by a $$k$$-mer from $$C_\text{DNA}$$, and the starting and ending nodes of an edge are labeled by the prefix and suffix of the $$k$$-mer labeling that edge. We then define the de Bruijn graph of $$C_\text{DNA}$$, denoted $$\mathcal{G}_\text{DB}(C_\text{DNA})$$, by gluing identically labeled nodes in $$\mathcal{G}_\text{C}(C_\text{DNA})$$, which yields the following algorithm.

        DeBruin(CDNA)
    represent every k-mer in CDNA as an isolated edge between its prefix and suffix
    glue all nodes with identical labels, yielding the graph GDB(CDNA)
    return GDB(CDNA)
      

Assignment

Write a function de_bruijn_graph that takes a FASTA file containing a collection of $$k$$-mers DNA string $$C_\text{DNA}$$. The function must return the de Bruijn graph $$\mathcal{G}_\text{DB}(C_\text{DNA})$$. The de Bruijn graph is represented as a dictionary that maps each label of a starting node in $$\mathcal{G}_\text{DB}(k, s)$$ onto a list containing the lexicographically sorted labels of all ending nodes connected to that starting node.

Example

In the following interactive session, we assume the FASTA file data01.fna¹ to be located in the current directory.

        >>> de_bruijn_graph('data01.fna')
{'GAG': ['AGG'], 'CAG': ['AGG', 'AGG'], 'GGG': ['GGA', 'GGG'], 'AGG': ['GGG'], 'GGA': ['GAG']}