Generalizing GC-content

A length $$k$$ substring of a genetic string is commonly called a k-mer. A genetic string of length $$n$$ can be seen as composed of $$n−k+1$$ overlapping k-mers. The k-mer composition of a genetic string encodes the number of times that each possible k-mer occurs in the string (see figure below). The 1-mer composition is a generalization of the GC-content of a strand of DNA, and the 2-mer, 3-mer, and 4-mer compositions of a DNA string are also commonly known as its di-nucleotide, tri-nucleotide, and tetra-nucleotide compositions.

The biological significance of k-mer composition is manyfold. GC-content is helpful not only in helping to identify a piece of unknown DNA (see "Computing GC-content"), but also because a genomic region having high GC-content compared to the rest of the genome signals that it may belong to an exon. Analyzing k-mer composition is vital to fragment assembly as well.

In "Computing GC-content" we also drew an analogy between analyzing the frequency of characters and identifying the underlying language. For larger values of $$k$$, the k-mer composition offers a more robust fingerprint of a string's identity because it offers an analysis on the scale of substrings (i.e., words) instead of that of single symbols. As a basis of comparison, in language analysis the k-mer composition of a text can be used not only to pin down the language, but also often the author.

Assignment

For a fixed positive integer $$k$$, order all possible k-mers taken from an underlying alphabet lexicographically.

Then the k-mer composition of a string $$s$$ can be represented by an array $$A$$ for which $$A[m]$$ denotes the number of times that the $$m$$-th k-mer (with respect to the lexicographic order) appears in $$s$$.

Write a function kmerComposition that takes two arguments: i) a DNA string $$s$$ and ii) an integer $$k \in \mathbb{N}_0$$. The function must return a list containing the k-mer composition of the given DNA string.

Example

In the following interactive session, we assume the FASTA file data.fna to be located in the current directory.

        >>> from Bio import SeqIO
>>> kmerComposition(*SeqIO.parse('data.fna', 'fasta'), 1)
[105, 109, 101, 100]
>>> kmerComposition(*SeqIO.parse('data.fna', 'fasta'), 2)
[31, 29, 30, 15, 26, 25, 23, 35, 24, 25, 22, 29, 24, 29, 26, 21]
>>> kmerComposition(*SeqIO.parse('data.fna', 'fasta'), 3)
[12, 7, 7, 5, 6, 7, 5, 11, 5, 6, 2, 16, 4, 5, 4, 2, 6, 8, 11, 1, 6, 5, 7, 7, 7, 5, 8, 3, 8, 10, 9, 8, 7, 5, 9, 3, 6, 6, 4, 9, 6, 7, 6, 3, 7, 9, 6, 7, 6, 9, 3, 6, 8, 7, 7, 7, 6, 7, 6, 7, 5, 5, 7, 4]
>>> kmerComposition(*SeqIO.parse('data.fna', 'fasta'), 4)
[4, 1, 4, 3, 0, 1, 1, 5, 1, 3, 1, 2, 2, 1, 2, 0, 1, 1, 3, 1, 2, 1, 3, 1, 1, 1, 1, 2, 2, 5, 1, 3, 0, 2, 2, 1, 1, 1, 1, 3, 1, 0, 0, 1, 5, 5, 1, 5, 0, 2, 0, 2, 1, 2, 1, 1, 1, 2, 0, 1, 0, 0, 1, 1, 3, 2, 1, 0, 3, 2, 3, 0, 0, 2, 0, 8, 0, 0, 1, 0, 2, 1, 3, 0, 0, 0, 1, 4, 3, 2, 1, 1, 3, 1, 2, 1, 3, 1, 2, 1, 2, 1, 1, 1, 2, 3, 2, 1, 1, 0, 1, 1, 3, 2, 1, 2, 6, 2, 1, 1, 1, 2, 3, 3, 3, 2, 3, 0, 3, 2, 1, 1, 0, 0, 1, 4, 3, 0, 1, 5, 0, 2, 0, 1, 2, 1, 3, 0, 1, 2, 2, 1, 1, 0, 3, 0, 0, 4, 5, 0, 3, 0, 2, 1, 1, 3, 0, 3, 2, 2, 1, 1, 0, 2, 1, 0, 2, 2, 1, 2, 0, 2, 2, 5, 2, 2, 1, 1, 2, 1, 2, 2, 2, 2, 1, 1, 3, 4, 0, 2, 1, 1, 0, 1, 2, 2, 1, 1, 1, 5, 2, 0, 3, 2, 1, 1, 2, 2, 3, 0, 3, 0, 1, 3, 1, 2, 3, 0, 2, 1, 2, 2, 1, 2, 3, 0, 1, 2, 3, 1, 1, 3, 1, 0, 1, 1, 3, 0, 2, 1, 2, 2, 0, 2, 1, 1]