Reads will form a collection of strings Patterns that we wish to match against a reference genome Text. For each string in Patterns, we will first find all its exact matches as a substring of Text (or conclude that it does not appear in Text). When hunting for the cause of a genetic disorder, we can immediately eliminate from consideration areas of the reference genome where exact matches occur. We will later generalize this problem to find approximate matches, where single nucleotide substitutions in reads separate the individual from the reference genome (or represent errors in reads).
Multiple Pattern Matching Problem: Find all occurrences of a collection of patterns in a text.
Input: A string Text and a collection Patterns containing (shorter) strings.
Output: All starting positions in Text where a string from Patterns appears as a substring.
To solve this problem, we will consolidate Patterns into a directed tree called a trie (pronounced “try”), which is written Trie(Patterns) and has the following properties.
The most obvious way to construct Trie(Patterns) is by iteratively adding each string from Patterns to the growing trie, as implemented by the following algorithm.
Add a method create_trie_from_patterns
that takes a list of patterns/strings.
The function returns the Trie representing the given patterns consisting of a label and a list of Edges.
As output we expect the representation of the trie, so the class you use or return has to atleast implement:
__repr__(self)__str__(self)
We also expect that Trie's __init__ method has a label and varargs edges.
The Edge class should contain an end of class Trie and a label that contains the string of the edge.
The number on the labels of the Trie are not important for the evaluation.
>>> Trie(0) Trie(0) >>> Edge(None, 'A') Edge(None, 'A') >>> Trie(0, Edge(Trie(1), 'A'), Edge(Trie(2), 'B')) Trie(0, Edge(Trie(1), 'A'), Edge(Trie(2), 'B')) >>> trie = create_trie_from_patterns(['ATAGA', 'ATC', 'GAT']) >>> print(repr(trie)) Trie(0, Edge(Trie(1, Edge(Trie(2, Edge(Trie(3, Edge(Trie(4, Edge(Trie(5), 'A')), 'G')), 'A'), Edge(Trie(6), 'C')), 'T')), 'A'), Edge(Trie(7, Edge(Trie(8, Edge(Trie(9), 'T')), 'A')), 'G'))