More random strings

In "Introduction to random strings1" we discussed searching for motifs in large genomes, in which random occurrences of the motif are possible. Our aim is to quantify just how frequently random motifs occur.

One class of motifs of interest are promoters, or regions of DNA that initiate the transcription of a gene. A promoter is usually located shortly before the start of its gene, and it contains specific intervals of DNA that provide an initial binding site for RNA polymerase to initiate transcription. Finding a promoter is usually the second step in gene prediction after establishing the presence of an ORF (see "Open reading frames2").

Unfortunately, there is no quick rule for identifying promoters. In Escherichia coli, the promoter contains two short intervals (TATAAT and TTGACA), which are respectively located 10 and 35 base pairs upstream from the beginning of the gene's ORF. Yet even these two short intervals are consensus strings (see "Consensus and profile3"): they represent average-case strings that are not found intact in most promoters. Bacterial promoters further vary in that some contain additional intervals used to bind to specific proteins or to change the intensity of transcription.

Eukaryotic promoters are even more difficult to characterize. Most have a TATA box (consensus sequence: TATAAA), preceded by an interval called a B recognition element, or BRE. These elements are typically located within 40 bp of the start of transcription. For that matter, eukaryotic promoters can hold a larger number of additional "regulatory" intervals, which can be found as far as several thousand base pairs upstream of the gene.

Assignment

Our aim in this assignment is to determine the probability with which a given motif (a known promoter, say) occurs in a randomly constructed genome. Unfortunately, finding this probability is tricky. Instead of forming a long genome, we will form a large collection of smaller random strings having the same length as the motif. These smaller strings represent the genome's substrings, which we can then test against our motif.

Given a probabilistic event $$A$$, the complement of $$A$$ is the collection $$A^c$$ of outcomes not belonging to $$A$$. Because $$A^c$$ takes place precisely when $$A$$ does not, we may also call $$A^c$$ "not $$A$$".

For a simple example, if $$A$$ is the event that a rolled die is 2 or 4, then $$P(A) = \frac{1}{3}$$. $$A^c$$ is the event that the die is 1, 3, 5, or 6, and $$P(A^c) = \frac{2}{3}$$. In general, for any event we will have the identity that $$P(A) + P(A^c) = 1$$.

Write a function probability that takes a positive integer $$N \in \mathbb{N}$$, a number $$x \in \mathbb{R}$$ between 0 and 1, and a DNA string $$s$$. The function must return the probability that if $$N$$ random DNA strings having the same length as $$s$$ are constructed with GC-content $$x$$ (see "Introduction to random strings4"), then at least one of the strings equals $$s$$. We allow for the same random string to be created more than once.

Example

In the following interactive session, we assume the FASTA file data.fna5 to be located in the current directory.

>>> probability(90000, 0.6, 'ATAGCCGA')
0.689

>>> from Bio import SeqIO
>>> probability(90000, 0.6, *SeqIO.parse('data.fna', 'fasta'))
0.689