A $$k$$-mer is a string of length $$k$$. We define the function PatternCount($$s$$, $$p$$) as the number of times that a $$k$$-mer $$p$$ appears as a substring of $$s$$. For example,
>>> pattern_count('ACAACTATGCATACTATCGGGAACTATCCT', 'ACTAT') 3
We note that
>>> pattern_count('CGATATATCCATAG ', 'ATA') 3
is equal to 3 (not 2) since we should account for overlapping occurrences of $$p$$ in $$s$$.
To compute pattern_count($$s$$, $$p$$), our plan is to "slide a window" down $$s$$, checking whether each $$k$$-mer substring of $$s$$ matches $$p$$. We will therefore refer to the $$k$$-mer starting at position $$i$$ of $$s$$ as $$s(i, k)$$. We will use 0-based indexing, meaning that we count starting at 0 instead of 1. In this case, $$s$$ begins at position 0 and ends at position $$|s| - 1$$ ($$|s|$$ denotes the number of symbols in $$s$$). For example, if $$s$$ = GACCATACTG, then $$s(4, 3)$$ = ATA. Note that the last $$k$$-mer of $$s$$ begins at position $$|s| - k$$, e.g., the last 3-mer of GACCATACTG starts at position $$10 - 3 = 7$$. This discussion results in the following pseudocode for computing pattern_count($$t$$, $$p$$).
PatternCount(s, p)
count ← 0
for i ← 0 to |s| − |p|
if t(i, |p|) = p
count ← count + 1
return count
In the following interactive session, we assume the FASTA file data.fna1 to be located in the current directory.
>>> pattern_count('ACAACTATGCATACTATCGGGAACTATCCT', 'ACTAT') 3 >>> pattern_count('CGATATATCCATAG ', 'ATA') 3 >>> from Bio import SeqIO >>> pattern_count(*SeqIO.parse('data.fna', 'fasta')) 294
In all assignments a DNA string is either represented in Python as
a string object that only contain the letters A, C, G and T
a Seq object2 (defined in the Bio.Seq module) with a DNA alphabet
a SeqRecord object3 (defined in the Bio.SeqRecord module) that holds a sequence (as a Seq object) with a DNA alphabet