We defined a mismatch in "Hamming distance1" and we now generalize "Most frequent patterns2" to incorporate mismatches as well.

Given strings $$s$$ and $$p$$ as well as an integer $$d$$, we define PatternCountd($$s$$, $$p$$) as the total number of occurrences of $$p$$ in $$s$$ with at most $$d$$ mismatches. For example

patternCount1(AACAAGCTGATAAACATTTAAAGAG, AAAAA) = 4

because AAAAA appears four times in this string with at most one mismatch: AACAA, ATAAA, AAACA, and AAAGA. Note that two of these occurrences overlap.

A most frequent $$k$$-mer with up to $$d$$ mismatches in $$s$$ is simply a string $$p$$ maximizing PatternCountd($$s$$, $$p$$) among al $$k$$-mers. Note that $$p$$ does not need to actually appear as a substring of $$s$$. For example, AAAAA is the most frequent 5-mer with 1 mismatch in AACAAGCTGATAAACATTTAAAGAG, even though AAAAA does not appear exactly in this string. Keep this in mind while solving the following problem.

Assignment

Write a function most_frequent_kmers that takes a DNA string $$s$$ and two integers $$k$$ and $$d$$. The function must return the set of all most frequent $$k$$-mers with up to $$d$$ mismatches in $$s$$.

Example

In the following interactive session, we assume the FASTA file data.fna3 to be located in the current directory.

>>> most_frequent_kmers('ACGTTGCATGTCGCATGATGCATGAGAGCT', 4, 1)
{'GATG', 'ATGC', 'ATGT'}
>>> most_frequent_kmers('AACAAGCTGATAAACATTTAAAGAG', 5, 1)
{'AAAAA'}

>>> from Bio import SeqIO
>>> most_frequent_kmers(*SeqIO.parse('data.fna', 'fasta'), 10, 2)
{'GCACACAGAC', 'GCGCACACAC'}

Note

The algorithms for finding all most frequent $$k$$-mers with up to $$d$$ mismatches in $$s$$ becomes rather slow as $$k$$ and $$d$$ increase. In practice, your solution should work for $$k \leq 12$$ and $$d \leq 3$$.