We defined a mismatch in "Hamming distance1" and we now generalize "Most frequent patterns2" to incorporate mismatches as well.
Given strings $$s$$ and $$p$$ as well as an integer $$d$$, we define PatternCountd($$s$$, $$p$$) as the total number of occurrences of $$p$$ in $$s$$ with at most $$d$$ mismatches. For example
patternCount1(AACAAGCTGATAAACATTTAAAGAG, AAAAA) = 4
because AAAAA appears four times in this string with at most one mismatch: AACAA, ATAAA, AAACA, and AAAGA. Note that two of these occurrences overlap.
A most frequent $$k$$-mer with up to $$d$$ mismatches in $$s$$ is simply a string $$p$$ maximizing PatternCountd($$s$$, $$p$$) among al $$k$$-mers. Note that $$p$$ does not need to actually appear as a substring of $$s$$. For example, AAAAA is the most frequent 5-mer with 1 mismatch in AACAAGCTGATAAACATTTAAAGAG, even though AAAAA does not appear exactly in this string. Keep this in mind while solving the following problem.
Write a function most_frequent_kmers that takes a DNA string $$s$$ and two integers $$k$$ and $$d$$. The function must return the set of all most frequent $$k$$-mers with up to $$d$$ mismatches in $$s$$.
In the following interactive session, we assume the FASTA file data.fna3 to be located in the current directory.
>>> most_frequent_kmers('ACGTTGCATGTCGCATGATGCATGAGAGCT', 4, 1) {'GATG', 'ATGC', 'ATGT'} >>> most_frequent_kmers('AACAAGCTGATAAACATTTAAAGAG', 5, 1) {'AAAAA'} >>> from Bio import SeqIO >>> most_frequent_kmers(*SeqIO.parse('data.fna', 'fasta'), 10, 2) {'GCACACAGAC', 'GCGCACACAC'}
The algorithms for finding all most frequent $$k$$-mers with up to $$d$$ mismatches in $$s$$ becomes rather slow as $$k$$ and $$d$$ increase. In practice, your solution should work for $$k \leq 12$$ and $$d \leq 3$$.