In "Finding the longest multiple repeat1" we introduced the suffix tree. This data structure has a wide array of applications, one of which was to help us identify long repeats in a genome. In that problem, we provided the tree as part of the dataset, but a vital computational exercise is to create the suffix tree solely from a string.
Given a string $$s$$ having length $$n$$, recall that its suffix tree $$T(s)$$ is defined by the following properties:
$$T(s)$$ is a rooted tree having exactly $$n$$ leaves
every edge of $$T(s)$$ is labeled with a substring of $$s^{∗}$$, where $$s^{∗}$$ is the string formed by adding a placeholder symbol $ to the end of $$s$$.
every internal node of $$T(s)$$ other than the root has at least two children; i.e., it has degree at least 3
the substring labels for the edges leading down from a node to its children must begin with different symbols
by concatenating the substrings along edges, each path from the root to a leaf corresponds to a unique suffix of $$s^{∗}$$.
Write a function suffixTree that takes an DNA strings $$s$$. The function must return a list containing the substrings of $$s^{∗}$$ encoding the edges of the suffix tree for $$s$$. These substrings may listed in any order.
In the following interactive session, we assume the FASTA file data.fna2 to be located in the current directory.
>>> suffixTree('ATAAATG') ['AAATG$', 'G$', 'T', 'ATG$', 'TG$', 'A', 'A', 'AAATG$', 'G$', 'T', 'G$', '$'] >>> from Bio import SeqIO >>> suffixTree(*SeqIO.parse('data.fna', 'fasta')) ['ACTT$', '$', 'T', '$', 'T$', 'C', 'ACTT$', 'TT$']