Modeling random genomes

We already know that the genome is not just a random strand of nucleotides. Recall from "Finding a motif in DNA¹" that motifs recur commonly across individuals and species. If a DNA motif occurs in many different organisms, then chances are good that it serves an important function.

At the same time, if you form a long enough DNA string, then you should theoretically be able to locate every possible short substring in the string. And genomes are very long. The human genome contains about 3.2 billion base pairs. As a result, when analyzing an unknown piece of DNA, we should try to ensure that a motif does not occur out of random chance.

To conclude whether motifs are random or not, we need to quantify the likelihood of finding a given motif randomly. If a motif occurs randomly with high probability, then how can we really compare two organisms to begin with? In other words, all very short DNA strings will appear randomly in a genome, and very few long strings will appear. What is the critical motif length at which we can throw out random chance and conclude that a motif appears in a genome for a reason?

In this assignment, our first step toward understanding random occurrences of strings is to form a simple model for constructing genomes randomly. We will then apply this model to a somewhat simplified exercise: calculating the probability of a given motif occurring randomly at a fixed location in the genome.

Assignment

An array is a structure containing an ordered collection of objects (numbers, strings, other arrays, etc.). We let $$A[k]$$ denote the $$k$$-th value in array $$A$$. You may like to think of an array as simply a matrix having only one row.

A random string is constructed so that the probability of choosing each subsequent symbol is based on a fixed underlying symbol frequency.

GC-content offers us natural symbol frequencies for constructing random DNA strings. If the GC-content is $$x$$, then we set the symbol frequencies of C and G equal to $$\frac{x}{2}$$ and the symbol frequencies of A and T equal to $$\frac{1 - x}{2}$$. For example, if the GC-content is 40%, then as we construct the string, the next symbol is G/C with probability 0.2, and the next symbol is A/T with probability 0.3.

In practice, many probabilities wind up being very small. In order to work with small probabilities, we may plug them into a function that "blows them up" for the sake of comparison. Specifically, the common logarithm of $$x$$ (defined for $$x > 0$$ and denoted $$\log_{10}(x)$$ is the exponent to which we must raise 10 to obtain $$x$$.

The graph of the common logarithm function of $$x$$. For a given $$x$$-value, the corresponding $$y$$-value is the exponent to which we must raise 10 to obtain $$x$$. Note that $$x$$-values between 0 and 1 get mapped to $$y$$-values between $$-\infty$$ and 0.

See the above figure for a graph of the common logarithm function $$y = \log_{10}(x)$$. In this graph, we can see that the logarithm of $$x$$-values between 0 and 1 always winds up mapping to $$y$$-values between $$-\infty$$ and 0: $$x$$-values near 0 have logarithms close to $$-\infty$$, and $$x$$-values close to 1 have logarithms close to $$0$$. Thus, we will select the common logarithm as our function to "blow up" small probability values for comparison.

Write a function probability that takes a DNA string $$s$$ and a value $$g \in \mathbb{R}$$, with $$0 \leq g \leq 1$$. The function must return the common logarithm of the probability that a random string constructed with the GC-content $$g$$ will exactly match $$s$$.

Example

In the following interactive session, we assume the FASTA file data.fna² to be located in the current directory.

        >>> probability('ACGATACAA', 0.129)
-5.7373919000321045

>>> from Bio import SeqIO
>>> probability(*SeqIO.parse('data.fna', 'fasta'), 0.096)
-64.59701871764963