We defined a mismatch in "Hamming distance", and we now generalize "Most frequent patterns" to incorporate mismatches as well.
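Given strings text and pattern as well as an integer d, we define patternCount_d(text, pattern) as the number of occurrences of pattern in text with at most d mismatches. For example,
patternCount_1(AACAAGCTGATAAACATTTAAAGAG, AAAAA) = 4
because AAAAA appears four times in this string with at most one mismatch: AACAA, ATAAA, AAACA, and AAAGA. Note that two of these occurrences (ATAAA and AAACA) overlap.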
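The count in this example can be checked with a short sketch. The function names hamming_distance and pattern_count are illustrative choices rather than part of the assignment, and the code assumes plain Python strings: it slides a window of the pattern's length over the text and counts the windows that differ from the pattern in at most d positions.

```python
def hamming_distance(p, q):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(p, q))


def pattern_count(text, pattern, d):
    """Count the windows of text that differ from pattern in at most d positions.

    Overlapping occurrences are counted separately.
    """
    k = len(pattern)
    return sum(
        hamming_distance(text[i:i + k], pattern) <= d
        for i in range(len(text) - k + 1)
    )


print(pattern_count('AACAAGCTGATAAACATTTAAAGAG', 'AAAAA', 1))  # 4
```

With d = 0 the same function counts exact occurrences only, so it returns 0 for this text and pattern.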
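A most frequent k-mer with up to d mismatches in a string is a k-mer that maximizes the number of occurrences with at most d mismatches among all k-mers. Such a k-mer does not need to occur as an exact substring of the string: in the example above, AAAAA has four occurrences with at most one mismatch, even though it never appears exactly.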
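Write a function most_frequent_kmers that takes a DNA string and two integers k and d (in that order). The function must return a set containing all most frequent k-mers with up to d mismatches in the given string.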
In the following interactive session, we assume the FASTA file data.fna to be located in the current directory.
>>> most_frequent_kmers('ACGTTGCATGTCGCATGATGCATGAGAGCT', 4, 1)
{'GATG', 'ATGC', 'ATGT'}
>>> most_frequent_kmers('AACAAGCTGATAAACATTTAAAGAG', 5, 1)
{'AAAAA'}
>>> from Bio import SeqIO
>>> most_frequent_kmers(*SeqIO.parse('data.fna', 'fasta'), 10, 2)
{'GCACACAGAC', 'GCGCACACAC'}
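One possible way to obtain such results is sketched below. It is not necessarily the intended solution: it is a neighborhood-counting approach that assumes the input is a plain Python string over the alphabet ACGT and that k is at least 1. The helper neighbors is a hypothetical name introduced here. For every window of length k, the sketch generates all k-mers within Hamming distance d and tallies them, so k-mers that never occur exactly in the string are still counted.

```python
from collections import Counter


def hamming_distance(p, q):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(p, q))


def neighbors(pattern, d):
    """Set of all strings within Hamming distance d of pattern (alphabet ACGT)."""
    if d == 0:
        return {pattern}
    if len(pattern) == 1:
        return set('ACGT')
    result = set()
    for suffix in neighbors(pattern[1:], d):
        if hamming_distance(pattern[1:], suffix) < d:
            # One mismatch is still available for the first position.
            result.update(base + suffix for base in 'ACGT')
        else:
            # The suffix uses all d mismatches: keep the first symbol as is.
            result.add(pattern[0] + suffix)
    return result


def most_frequent_kmers(text, k, d):
    """Set of k-mers with the most occurrences in text, allowing up to d mismatches."""
    counts = Counter()
    for i in range(len(text) - k + 1):
        # Each window counts once for every k-mer in its d-neighborhood.
        counts.update(neighbors(text[i:i + k], d))
    if not counts:
        return set()
    best = max(counts.values())
    return {kmer for kmer, count in counts.items() if count == best}


print(most_frequent_kmers('ACGTTGCATGTCGCATGATGCATGAGAGCT', 4, 1))
```

To handle the FASTA call from the session, the record returned by Biopython would first need to be converted to a plain string (for example via str(record.seq)); that conversion is left out of this sketch.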
The algorithms for finding all most frequent