A brief introduction to graph theory

Networks arise everywhere in the practical world, especially in biology. Networks are prevalent in popular applications such as modeling the spread of disease, but the extent of network applications spreads far beyond popular science. Our first question asks how to computationally model a network without actually needing to render a picture of the network.

First, some terminology: graph is the technical term for a network. A graph is made up of hubs called nodes (or vertices), pairs of which are connected via segments/curves called edges. If an edge connects nodes $$v$$ and $$w$$, then it is denoted by $$(v, w)$$ (or equivalently $$(w, v)$$).

Graph theory is the abstract mathematical study of graphs and their properties.

Assignment

A graph whose nodes have all been labeled can be represented by an adjacency list, in which each row of the list contains the two node labels corresponding to a unique edge.

A directed graph (or digraph) is a graph containing directed edges, each of which has an orientation. That is, a directed edge is represented by an arrow instead of a line segment. The starting and ending nodes of an edge form its tail and head, respectively. The directed edge with tail $$v$$ and head $$w$$ is represented by the tuple $$(v, w)$$ (but not by $$(w, v)$$). A directed loop is a directed edge of the form $$(v, v)$$.

For a collection of strings and a positive integer $$k$$, the overlap graph for the strings is a directed graph $$O_k$$ in which each string is represented by a node, and string $$s$$ is connected to string $$t$$ with a directed edge when there is a length $$k$$ suffix of $$s$$ that matches a length $$k$$ prefix of $$t$$, as long as $$s \neq t$$. We demand $$s \neq t$$ to prevent directed loops in the overlap graph (although directed cycles may be present).

Write a function overlapGraph that takes an integer $$k \in \mathbb{N}_0$$ and the location of a FASTA file containing a collection of DNA strings. The function must return the adjacency list corresponding to the overlap graph $$O_k$$, represented as a set of tuples. The nodes of the overlap graph are labeled with the labels of the DNA strings taken from the FASTA file.

Example

In the following interactive session, we assume the FASTA files data01.fna1 and data02.fna2 to be located in the current directory.

>>> overlapGraph('data01.fna', 3)
{('seq01', 'seq02'), ('seq01', 'seq04'), ('seq02', 'seq03')}

>>> overlapGraph('data02.fna', 3)
{('seq14', 'seq11'), ('seq07', 'seq04'), ('seq11', 'seq01'), ('seq02', 'seq08'), ('seq05', 'seq17'), ('seq15', 'seq09'), ('seq11', 'seq17'), ('seq05', 'seq01'), ('seq16', 'seq06'), ('seq14', 'seq08'), ('seq09', 'seq14'), ('seq13', 'seq02'), ('seq04', 'seq05'), ('seq02', 'seq11'), ('seq08', 'seq16')}