A shot in the dark

In "Locating restriction sites¹" we first familiarized ourselves with restriction enzymes. Recall that these enzymes are used by bacteria to cut through both strands of viral DNA, thus disarming the virus: the viral DNA locations where these cuts are made are known as restriction sites. Recall also that every restriction enzyme is preprogrammed with a reverse palindromic interval of DNA to which it will bind and cut, called a recognition sequence. These even length intervals are usually either 4 or 6 base pairs long, although longer ones do exist. Rare-cutter enzymes have recognition sequences of 8 or more base pairs.

In this assignment, we will ask a simple question: how does the bacterium "know" that it will probably succeed in finding a restriction site within the virus's DNA? The answer is that the short length of recognition sequences guarantees a large number of matches occurring randomly.

Intuitively, we would expect for a recognition sequence of length 6 to occur on average once every $$4^6 = 4096$$ base pairs. Note that this fact does not imply that the associated restriction enzyme will cut the viral DNA every 4096 bp. It may find two restriction sites close together, then not find a restriction site for many thousand nucleotides.

In this assignment, we will generalize the problem of finding an average number of restriction sites to take into account the GC-content of the underlying string being analyzed.

Assignment

Say that you place a number of bets on your favorite sports teams. If their chances of winning are $$0.3$$, $$0.8$$, and $$0.6$$, then you should expect on average to win $$0.3 + 0.8 + 0.6 = 1.7$$ of your bets (of course, you can never win exactly $$1.7$$ times!).

More generally, if we have a collection of events $$A_1, A_2, \ldots, A_n$$, then the expected number of events occurring is $$P(A_1) + P(A_2) + \ldots + P(A_n)$$. Consult the note following the assignment for a precise explanation of this fact. In this assignment, we extend the idea of finding an expected number of events to finding the expected number of times that a given string occurs as a substring of a random string.

Write a function expectedSites that takes an integer $$n \in \mathbb{N}$$, a number $$g \in \mathbb{R}$$ with $$0 \leq g \leq 1$$, and a DNA string $$s$$ of even length. The function must return the expected number of times that $$s$$ will appear as a substring of a random DNA string $$t$$ of length $$n$$, where $$t$$ is formed with GC-content $$x$$ (see "Introduction to random strings²").

Example

        >>> expectedSites(10, 0.25, 'AG')
0.421875
>>> expectedSites(10, 0.5, 'AG')
0.5625
>>> expectedSites(10, 0.75, 'AG')
0.421875

The mathematical details

In this assignment, we are speaking of an expected number of events. How can we tie this into the definition of expected value that we already have from "Calculating expected offspring³" ?

The answer relies on a slick mathematical trick. For any event $$A$$, we can form a random variable for $$A$$, called an indicator random variable $$I_A$$. For an outcome $$x$$, $$I_A(x) = 1$$ when $$x$$ belongs to $$A$$ and $$I_A(x) = 0$$ when $$x$$ belongs to $$A^c$$.

For an indicator random variable $$I_A(x) = 1$$, verify that $$E(I_A) = P(A)$$.

You should also verify from our original formula for expected value that for any two random variables $$X$$ and $$Y$$, $$E(X + Y)$$ is equal to $$E(X)+E(Y)$$. As a result, the expected number of events $$A_1, A_2, \ldots, A_n$$ occurring, or \[ E(I_{A_1} + I_{A_2} + \cdots + I_{A_n}) \] reduces to \[ P(A_1) + P(A_2) + \cdots + P(A_n) \]