We saw in "Introduction to pattern matching1" that a data structure commonly used to encode the relationships among a collection of strings was the trie, which is particularly useful when the strings represent a collection of patterns that we wish to match to a larger text.
The trie is helpful when processing multiple strings at once, but when we want to analyze a single string, we need something different.
In this assignment, we will use a new data structure to handle the problem of finding long repeats in the genome. Recall from "Finding a motif in DNA2" that cataloguing these repeats is a problem of the utmost interest to molecular biologists, as a natural correlation exists between the frequency of a repeat and its influence on RNA transcription. Our aim is therefore to identify long repeats that occur more than some predetermined number of times.
A repeated substring of a string $$s$$ of length $$n$$ is simply a substring that appears in more than one location of $$s$$. More specifically, a $$k$$-fold substring appears in at least $$k$$ distinct locations.
The suffix tree of $$s$$, denoted $$T(s)$$, is defined as follows:
The figure below shows an example of a suffix tree.
Write a function longestRepeatedSubstring that takes two arguments: i) a DNA string $$s$$ and ii) an integer $$k \in \mathbb{N}_0$$. The function must return the longest substring of $$s$$ that occurs at least $$k$$ times in $$s$$. If multiple solutions exist, the function may return any solution. If no solutions exist, the function must return the value None.
In the following interactive session, we assume the FASTA file data.fna3 to be located in the current directory.
>>> longestRepeatedSubstring('CATACATAC', 2) 'CATAC' >>> from Bio import SeqIO >>> longestRepeatedSubstring(*SeqIO.parse('data.fna', 'fasta'), 19) 'AAAAC'