We defined a mismatch in "Hamming distance¹" and we now generalize "Most frequent patterns²" to incorporate mismatches as well.

Given strings $s$ and $p$ as well as an integer $d$ , we define PatternCount_d( $s$ , $p$ ) as the total number of occurrences of $p$ in $s$ with at most $d$ mismatches. For example

patternCount₁(AACAAGCTGATAAACATTTAAAGAG, AAAAA) = 4

because AAAAA appears four times in this string with at most one mismatch: AACAA, ATAAA, AAACA, and AAAGA. Note that two of these occurrences overlap.

A most frequent $k$ -mer with up to $d$ mismatches in $s$ is simply a string $p$ maximizing PatternCount_d( $s$ , $p$ ) among al $k$ -mers. Note that $p$ does not need to actually appear as a substring of $s$ . For example, AAAAA is the most frequent 5-mer with 1 mismatch in AACAAGCTGATAAACATTTAAAGAG, even though AAAAA does not appear exactly in this string. Keep this in mind while solving the following problem.

Assignment

Write a function most_frequent_kmers that takes a DNA string $s$ and two integers $k$ and $d$ . The function must return the set of all most frequent $k$ -mers with up to $d$ mismatches in $s$ .

Example

In the following interactive session, we assume the FASTA file data.fna³ to be located in the current directory.

        >>> most_frequent_kmers('ACGTTGCATGTCGCATGATGCATGAGAGCT', 4, 1)
{'GATG', 'ATGC', 'ATGT'}
>>> most_frequent_kmers('AACAAGCTGATAAACATTTAAAGAG', 5, 1)
{'AAAAA'}

>>> from Bio import SeqIO
>>> most_frequent_kmers(*SeqIO.parse('data.fna', 'fasta'), 10, 2)
{'GCACACAGAC', 'GCGCACACAC'}

Note

The algorithms for finding all most frequent $k$ -mers with up to $d$ mismatches in $s$ becomes rather slow as $k$ and $d$ increase. In practice, your solution should work for $k \leq 12$ and $d \leq 3$ .