Getting repetitive

We have seen that every genome contains a large number of repeats and noted that the Alu repeat recurs around a million times on the human genome. Yet exactly how repetitive is the human genome?

To frame such a vague question mathematically, we first need to make the observation that if the genome were formed by adding nucleobases randomly, with each base having a 1/4 probability of being added at each nucleotide position, then we should expect to see a huge number of different substrings in the genome. Yet (to take a simple case) the genome containing only adenosine and forming the DNA string AAAAAA…AAA has relatively very few distinct substrings.

Now, real genomes are formed by a process that chooses nucleotides somewhere in between these two extreme cases, and so to quantify just how random this process is, we need to take the percentage of distinct substrings appearing in a genome with respect to the maximum possible number of distinct substrings that could appear in a genome of the same length.

Assignment

Given a length $$n$$ string $$s$$ formed over an alphabet $$\mathscr{A}$$ of size $$a$$, let the substring count $$\textrm{sub}(s)$$ denote the total number of distinct substrings of $$s$$. Furthermore, let the maximum substring count $$\textrm{m}(a, n)$$ denote the maximum number of distinct substrings that could appear in a string of length $$n$$ formed over $$\mathscr{A}$$.

The linguistic complexity of $$s$$ (written $$\textrm{lc}(s)$$) is equal to \[ \textrm{lc}(s) = \frac{\textrm{sub}(s)}{\textrm{m}(a, n)} \] In other words, $$\textrm{lc}(s)$$ represents the percentage of observed substrings of $$s$$ to the total number that are theoretically possible. Note that $$0 < \textrm{lc}(s) <1$$, with smaller values of $$\textrm{lc}(s)$$ indicating that $$s$$ is more repetitive.

As an example, consider the DNA string $$s$$ = ATTTGGATT ($$a = 4$$). In the following table, we demonstrate that \[ \textrm{lc}(s) = \frac{35}{40} = 0.875 \] by considering the number of observed and possible length $$k$$ substrings of $$s$$, which are denoted by $$\textrm{sub}_k(s)$$ and $$\textrm{m}(a, n, k)$$, respectively.

$$k$$ $$\textrm{sub}_k(s)$$ $$\textrm{m}(a, k, n)$$
1 3 4
2 5 8
3 6 7
4 6 6
5 5 5
6 4 4
7 3 3
8 2 2
9 1 1
total 35 40

Observe that \[ \textrm{m}(a, n) = \sum_{k=1}^n{\textrm{m}(a, n, k)} = 40 \] and \[ \textrm{sub}(s) = \sum_{k=1}^n{\textrm{sub}_k(s)} = 35 \]

Write a function linguisticComplexity that takes an DNA strings $$s$$. The function must return the linguistic complexity $$\textrm{lc}(s)$$.

Example

In the following interactive session, we assume the FASTA file data.fna to be located in the current directory.

>>> linguisticComplexity('ATTTGGATT')
0.875

>>> from Bio import SeqIO
>>> linguisticComplexity(*SeqIO.parse('data.fna', 'fasta'))
0.7481964105905349

Hint

Think about using suffix trees for solving this assignment.