We defined a mismatch in "Hamming distance1" and we now generalize "Most frequent patterns2" to incorporate mismatches as well.

Given strings s and p as well as an integer d, we define PatternCountd(s, p) as the total number of occurrences of p in s with at most d mismatches. For example

patternCount1(AACAAGCTGATAAACATTTAAAGAG, AAAAA) = 4

because AAAAA appears four times in this string with at most one mismatch: AACAA, ATAAA, AAACA, and AAAGA. Note that two of these occurrences overlap.

A most frequent k-mer with up to d mismatches in s is simply a string p maximizing PatternCountd(s, p) among al k-mers. Note that p does not need to actually appear as a substring of s. For example, AAAAA is the most frequent 5-mer with 1 mismatch in AACAAGCTGATAAACATTTAAAGAG, even though AAAAA does not appear exactly in this string. Keep this in mind while solving the following problem.

Assignment

Write a function most_frequent_kmers that takes a DNA string s and two integers k and d. The function must return the set of all most frequent k-mers with up to d mismatches in s.

Example

In the following interactive session, we assume the FASTA file data.fna3 to be located in the current directory.

>>> most_frequent_kmers('ACGTTGCATGTCGCATGATGCATGAGAGCT', 4, 1)
{'GATG', 'ATGC', 'ATGT'}
>>> most_frequent_kmers('AACAAGCTGATAAACATTTAAAGAG', 5, 1)
{'AAAAA'}

>>> from Bio import SeqIO
>>> most_frequent_kmers(*SeqIO.parse('data.fna', 'fasta'), 10, 2)
{'GCACACAGAC', 'GCGCACACAC'}

Note

The algorithms for finding all most frequent k-mers with up to d mismatches in s becomes rather slow as k and d increase. In practice, your solution should work for k12 and d3.