Given a genome , the path graph is the path consisting of edges, where the -th
edge of this path is labeled by the -th -mer in and the
-th node of the path is labeled by the -th -mer in
. The de Bruijn graph
is formed by gluing identically labeled
nodes in .
Note
In this assignment we only build the de Bruijn graph from the
-mers, not taking into account their reverse complements.
Assignment
Write a function de_bruijn_graph that takes two arguments:
i) an integer () and ii) a FASTA file
containing a DNA string . The function must return the de Bruijn
graph . The de Bruijn graph is represented
as a dictionary that maps each label of a starting node in
onto a list containing the
lexicographically sorted labels of all ending nodes connected to that
starting node.
Example
In the following interactive session, we assume the FASTA file data01.fna to be located
in the current directory.
>>> de_bruijn_graph(4, 'data01.fna')
{'AAG': ['AGA'], 'AGA': ['GAT'], 'GAT': ['ATT'], 'ATT': ['TTC'], 'TTC': ['TCT'], 'TCT': ['CTA', 'CTC'], 'CTC': ['TCT'], 'CTA': ['TAC']}