We say that a $$k$$-mer $$p$$ appears as a substring of string $$s$$ with at most $$d$$ mismatches if there is some $$k$$-mer substring $$p'$$ of $$s$$ having $$d$$ or fewer mismatches with $$p$$, i.e. $$\text{hammingDistance}(p, p') \leq d$$. Our observation that a DnaA box may appear with slight variations leads to the following generalization of the Pattern Matching problem.

Assignment

Write a function approximate_matches that takes two DNA strings $$p$$ and $$s$$ and an integer $$d$$. The function must return a tuple containing all starting positions where pattern $$p$$ appears as a substring of $$s$$ with at most $$d$$ mismatches.

Example

In the following interactive session, we assume the FASTA file data.fna¹ to be located in the current directory.

        >>> approximate_matches('ATTCTGGA', 'CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAATGCCTAGCGGCTTGTGGTTTCTCCTACGCTCC', 3)
(6, 7, 26, 27, 78)

>>> from Bio import SeqIO
>>> approximate_matches(*SeqIO.parse('data.fna', 'fasta'), 3)
(217, 1145, 2135, 5981, 12433, 13010, 16156)