We say that a $$k$$-mer $$p$$ appears as a substring of string $$s$$
with at most $$d$$ mismatches if there is some $$k$$-mer substring $$p'$$ of
$$s$$ having $$d$$ or fewer mismatches with $$p$$, i.e.
$$\text{hammingDistance}(p, p') \leq d$$. Our observation that a
DnaA
box may appear with slight variations leads to the following generalization
of the Pattern Matching problem.
Assignment
Write a function
approximate_matches that takes two DNA strings
$$p$$ and $$s$$ and an integer $$d$$. The function must return a tuple
containing all starting positions where pattern $$p$$ appears as a substring
of $$s$$ with at most $$d$$ mismatches.
Example
In the following interactive session, we assume the FASTA file data.fna to be located in
the current directory.
>>> approximate_matches('ATTCTGGA', 'CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAATGCCTAGCGGCTTGTGGTTTCTCCTACGCTCC', 3)
(6, 7, 26, 27, 78)
>>> from Bio import SeqIO
>>> approximate_matches(*SeqIO.parse('data.fna', 'fasta'), 3)
(217, 1145, 2135, 5981, 12433, 13010, 16156)