In this assignment, we use the terms Prefix($$p$$) and Suffix($$p$$) to refer to the first $$k - 1$$ nucleotides and the last $$k - 1$$ nucleotides of a $$k$$-mer $$p$$, respectively. Given an arbitrary collection of $$k$$-mers $$C_\text{DNA}$$, we form a graph having a node for each k-mer in $$C_\text{DNA}$$ and connect $$k$$-mers $$p$$ and $$p'$$ by a directed edge if Suffix($$p$$) is equal to Prefix($$p'$$). The resulting graph is called the overlap graph on these $$k$$-mers, denoted $$\mathcal{G}_\text{overlap}(C_\text{DNA})$$.

Assignment

Write a function overlap_graph that takes a FASTA file containing a collection of $$k$$-mers $$C_\text{DNA}$$. The function must return the overlap graph $$\mathcal{G}_\text{overlap}(C_\text{DNA})$$. An edge from $$k$$-mer $$p$$ to $$k$$-mer $$p'$$ in $$\mathcal{G}_\text{overlap}(C_\text{DNA})$$ is represented as a tuple $$(p, p')$$. The overlap graph itself is represented as a dictionary that maps all tuples $$(p, p')$$ onto the number of edges from $$k$$-mer $$p$$ to $$k$$-mer $$p'$$ in $$\mathcal{G}_\text{overlap}(C_\text{DNA})$$.

Example

In the following interactive session, we assume the FASTA file data01.fna1 to be located in the current directory.

>>> overlap_graph('data01.fna')
{('GCATG', 'CATGC'): 1, ('CATGC', 'ATGCG'): 1, ('AGGCA', 'GGCAT'): 1, ('GGCAT', 'GCATG'): 1}