Creating a suffix tree

In "Finding the longest multiple repeat1" we introduced the suffix tree. This data structure has a wide array of applications, one of which was to help us identify long repeats in a genome. In that problem, we provided the tree as part of the dataset, but a vital computational exercise is to create the suffix tree solely from a string.

Assignment

Given a string $$s$$ having length $$n$$, recall that its suffix tree $$T(s)$$ is defined by the following properties:

Write a function suffixTree that takes an DNA strings $$s$$. The function must return a list containing the substrings of $$s^{∗}$$ encoding the edges of the suffix tree for $$s$$. These substrings may listed in any order.

Example

In the following interactive session, we assume the FASTA file data.fna2 to be located in the current directory.

>>> suffixTree('ATAAATG')
['AAATG$', 'G$', 'T', 'ATG$', 'TG$', 'A', 'A', 'AAATG$', 'G$', 'T', 'G$', '$']

>>> from Bio import SeqIO
>>> suffixTree(*SeqIO.parse('data.fna', 'fasta'))
['ACTT$', '$', 'T', '$', 'T$', 'C', 'ACTT$', 'TT$']