A $$k$$-mer is a string of length $$k$$. We define the function PatternCount($$s$$, $$p$$) as the number of times that a $$k$$-mer $$p$$ appears as a substring of $$s$$. For example,

        >>> pattern_count('ACAACTATGCATACTATCGGGAACTATCCT', 'ACTAT')
3

We note that

        >>> pattern_count('CGATATATCCATAG ', 'ATA')
3

is equal to 3 (not 2) since we should account for overlapping occurrences of $$p$$ in $$s$$.

To compute pattern_count($$s$$, $$p$$), our plan is to "slide a window" down $$s$$, checking whether each $$k$$-mer substring of $$s$$ matches $$p$$. We will therefore refer to the $$k$$-mer starting at position $$i$$ of $$s$$ as $$s(i, k)$$. We will use 0-based indexing, meaning that we count starting at 0 instead of 1. In this case, $$s$$ begins at position 0 and ends at position $$|s| - 1$$ ($$|s|$$ denotes the number of symbols in $$s$$). For example, if $$s$$ = GACCATACTG, then $$s(4, 3)$$ = ATA. Note that the last $$k$$-mer of $$s$$ begins at position $$|s| - k$$, e.g., the last 3-mer of GACCATACTG starts at position $$10 - 3 = 7$$. This discussion results in the following pseudocode for computing pattern_count($$t$$, $$p$$).

        PatternCount(s, p)
    count ← 0
    for i ← 0 to |s| − |p|
        if t(i, |p|) = p
            count ← count + 1
    return count

Assignment

Write a function pattern_count that takes two DNA strings $$s$$ and $$p$$. The function must return the number of occurrences of $$p$$ in $$s$$.

Example

In the following interactive session, we assume the FASTA file data.fna¹ to be located in the current directory.

        >>> pattern_count('ACAACTATGCATACTATCGGGAACTATCCT', 'ACTAT')
3
>>> pattern_count('CGATATATCCATAG ', 'ATA')
3

>>> from Bio import SeqIO
>>> pattern_count(*SeqIO.parse('data.fna', 'fasta'))
294

Note

In all assignments a DNA string is either represented in Python as

a string object that only contain the letters A, C, G and T
a Seq object² (defined in the Bio.Seq module) with a DNA alphabet
a SeqRecord object³ (defined in the Bio.SeqRecord module) that holds a sequence (as a Seq object) with a DNA alphabet