We hope that you have noticed what is still a limitation of BETTERBWMATCHING from “Implement BetterBWMatching” — even though this algorithm counts the number of occurrences of Pattern in Text, it does not tell us where these occurrences are located in Text! To locate pattern matches identified by BETTERBWMATCHING, we can once again use the suffix array.
However, recall that our original motivation for using the Burrows-Wheeler transform was to reduce the amount of memory used by the suffix array for pattern matching. If we add the suffix array to Burrows-Wheeler-based pattern matching, then we are right back where we started!
The memory-saving device that we will employ is inelegant but useful. We will build a partial suffix array of Text, denoted SuffixArrayK(Text), which only contains values that are multiples of some positive integer K. In real applications, partial suffix arrays are often constructed for K = 100, thus reducing memory usage by a factor of 100 compared to a full suffix array.
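As a sketch (not the book's reference implementation), a partial suffix array can be built by constructing the full suffix array and keeping only the entries that are multiples of K:

```python
def partial_suffix_array(text, k):
    # Full suffix array via naive suffix sorting; fine for illustration,
    # far too slow for genome-scale text.
    full = sorted(range(len(text)), key=lambda i: text[i:])
    # Keep only entries whose suffix starts at a multiple of k; the dict
    # maps a row of the suffix array to that suffix's starting position.
    return {row: start for row, start in enumerate(full) if start % k == 0}

partial_suffix_array('panamabananas$', 5)   # {1: 5, 11: 10, 12: 0}
```

Only the kept rows are stored, so for K = 100 the dictionary holds roughly one entry per hundred suffixes.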
We should also discuss how to improve BETTERBWMATCHING (reproduced below) by resolving the trade-off between precomputing the values of Count_symbol(i, LastColumn) (requiring substantial memory) and computing these values as we go (requiring substantial runtime).
The balance that we strike is similar to the one used for the partial suffix array. Rather than storing Count_symbol(i, LastColumn) for all positions i, we will only store the Count arrays when i is divisible by C, where C is a constant; these arrays are called checkpoint arrays. When C is large (C is typically equal to 100 in practice) and the alphabet is small (e.g., 4 nucleotides), checkpoint arrays require only a fraction of the memory used by BWT(Text).
What about runtime? Using checkpoint arrays, we can compute the top and bottom pointers in a constant number of steps (i.e., fewer than C). Since each Pattern requires at most |Pattern| pointer updates, the modified BETTERBWMATCHING algorithm now requires O(|Patterns|) runtime, which is the same as using a trie or suffix array.
Furthermore, we now only have to store the following data in memory: BWT(Text), FirstOccurrence, the partial suffix array, and the checkpoint arrays. Storing this data requires memory approximately equal to 1.5 · |Text|. Thus, we have finally knocked down the memory required for solving the Multiple Pattern Matching Problem for millions of sequencing reads into a workable range, and we can at last solve a full version of the Multiple Pattern Matching Problem.
Implement generate_first_occurrence(text). This function returns a mapping from each symbol to the index of its first occurrence in the first column of the Burrows-Wheeler matrix of the text.
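One possible sketch: the first column of the Burrows-Wheeler matrix is simply the text's symbols in sorted order, so each symbol's first occurrence can be read off in a single scan of the sorted text:

```python
def generate_first_occurrence(text):
    # The first column of the BW matrix is sorted(text), so record the
    # index at which each symbol first appears in that sorted order.
    first_occurrence = {}
    for index, symbol in enumerate(sorted(text)):
        if symbol not in first_occurrence:
            first_occurrence[symbol] = index
    return first_occurrence
```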
Implement generate_partial_suffix_array(k, text). The returned object can be queried (via .get(i)) for the starting position of the suffix at row i of the suffix array; it returns None for rows whose suffix does not start at a multiple of k.
Implement generate_checkpoint_arrays(c, bwttext). This function returns a mapping from each (index, symbol) tuple to the number of occurrences of that symbol in bwttext before that index. Although the mapping is conceptually complete, do not store every entry: keep only the checkpoint arrays (at indices divisible by c) and compute the remaining entries on demand, for instance in __getitem__.
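A minimal sketch of this lazy lookup, assuming bwttext is a plain string: store a full Count array only at every c-th index, and let __getitem__ finish any other count by scanning the fewer-than-c symbols past the nearest checkpoint.

```python
class CheckpointArrays:
    def __init__(self, c, bwttext):
        self.c = c
        self.bwttext = bwttext
        # Store a full Count array only at checkpoints (indices divisible
        # by c); running counts are copied as the BWT is scanned once.
        self.checkpoints = {}
        running = {symbol: 0 for symbol in set(bwttext)}
        for i, symbol in enumerate(bwttext):
            if i % c == 0:
                self.checkpoints[i] = dict(running)
            running[symbol] += 1

    def __getitem__(self, key):
        index, symbol = key
        # Nearest checkpoint at or below index, then scan the remainder:
        # at most c - 1 extra symbols of the BWT.
        base = index - index % self.c
        stored = self.checkpoints[base].get(symbol, 0)
        return stored + self.bwttext[base:index].count(symbol)
```

Each lookup therefore costs fewer than c extra comparisons, which is the constant-time bound claimed above.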
Implement the find_all_occurrences function, which takes the name of a FASTA file containing a string and a number of patterns (none of which contain a '$'). The function should return the set of indices at which at least one of the patterns matches the string.
>>> text = 'panamabananas$'
>>> bwttext = 'smnpbnnaaaaa$a'
>>> symbols = sorted(set(text))
>>> first_occurrence = generate_first_occurrence(text)
>>> {symbol: first_occurrence[symbol] for symbol in symbols}
{'$': 0, 'a': 1, 'b': 7, 'm': 8, 'n': 9, 'p': 12, 's': 13}
>>> partial_suffix_array = generate_partial_suffix_array(5, text)
>>> [partial_suffix_array.get(i) for i in range(len(text))]
[None, 5, None, None, None, None, None, None, None, None, None, 10, 0, None]
>>> checkpoint_arrays = generate_checkpoint_arrays(5, bwttext)
>>> import pprint
>>> pprint.pprint([[checkpoint_arrays[(i, sym)] for sym in symbols] for i in range(len(bwttext))])
[[0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 1],
 [0, 0, 0, 1, 0, 0, 1],
 [0, 0, 0, 1, 1, 0, 1],
 [0, 0, 0, 1, 1, 1, 1],
 [0, 0, 1, 1, 1, 1, 1],
 [0, 0, 1, 1, 2, 1, 1],
 [0, 0, 1, 1, 3, 1, 1],
 [0, 1, 1, 1, 3, 1, 1],
 [0, 2, 1, 1, 3, 1, 1],
 [0, 3, 1, 1, 3, 1, 1],
 [0, 4, 1, 1, 3, 1, 1],
 [0, 5, 1, 1, 3, 1, 1],
 [1, 5, 1, 1, 3, 1, 1]]
>>> find_all_occurrences('data01.fna')
{42, 50}
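As a hedged sketch of the backward search at the heart of find_all_occurrences (match_pattern is a hypothetical helper name, not part of the exercise interface): counts can be any mapping with the (index, symbol) interface described above, and first_occurrence is the mapping from the first exercise.

```python
def match_pattern(pattern, bwttext, first_occurrence, counts):
    # BetterBWMatching backward search: narrow the [top, bottom] range of
    # matching Burrows-Wheeler matrix rows, consuming pattern right to left.
    top, bottom = 0, len(bwttext) - 1
    while top <= bottom:
        if not pattern:
            # Surviving rows; map them through the (partial) suffix array
            # to recover starting positions in the text.
            return range(top, bottom + 1)
        pattern, symbol = pattern[:-1], pattern[-1]
        if symbol not in first_occurrence:
            break   # symbol absent from the text, so no match is possible
        top = first_occurrence[symbol] + counts[(top, symbol)]
        bottom = first_occurrence[symbol] + counts[(bottom + 1, symbol)] - 1
    return range(0)
```

Rows whose entries are stored in the partial suffix array yield text positions directly; for the others, one repeatedly applies the last-to-first mapping until a stored row is reached, adding one to the recovered position per step.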