The case of mutated repeats

In "Finding a motif with modifications1" we considered a problem in which we were given a motif and a long string (perhaps representing a genome), and we aimed to find the "closest" substring of the long string to the motif. In that problem, "closest" was defined as a minimum with respect to edit distance.

Yet there may be multiple substring candidates from the genome that achieve the minimum distance to the motif; this situation might occur in practice when the motif forms a repeat that occurs multiple times with variations deriving from mutations.

In this assignment, we would like to find all substrings of a genome that are within a certain fixed distance of the desired motif.

Assignment

Write a function similarMotifs that takes an integer $$k \in \mathbb{N}_0$$, a DNA strings $$s$$ representing a motif, and a DNA string $$t$$ representing a genome. The function must return the set of all substrings $$t'$$ of $$t$$ such that the edit distance $$d_E(s, t')$$ is less than or equal to $$k$$. Each substring should be encoded by a tuple containing its location in $$t$$ followed by its length.

Example

In the following interactive session, we assume the FASTA file data.fna2 to be located in the current directory.

>>> similarMotifs(2, 'ACGTAG', 'ACGGATCGGCATCGT')
{(1, 4), (1, 5), (1, 6)}

>>> from Bio import SeqIO
>>> similarMotifs(2, *SeqIO.parse('data.fna', 'fasta'))
{(1, 4), (1, 5), (1, 6)}