You may have had the misfortune to participate in a team-building event that featured the "human knot", in which everyone joins hands with two other people, and the group must undo the giant knot of arms without letting go.
Let's consider a simplified version of the human knot. Say that we have an even number of people at a party who are standing in a circle, and they pair off and shake hands at the same time. One combinatorial question at hand asks us to count the total number of ways that the guests can shake hands without any two pairs interfering with each other by crossing arms.
This silly little counting problem is actually an excellent analogy for RNA folding. In practice, base pairing can occur anywhere along the RNA molecule, but the secondary structure of RNA often forbids base pairs crossing over each other, which forms a structure called a pseudoknot. Pseudoknots are not technically knots, but they nevertheless cause RNA to fold over itself.
Forbidding pseudoknots offers an interesting wrinkle to the problem of counting potential RNA secondary structures that we started working with in "Perfect matchings and RNA secondary structures1", in which every possible nucleotide of a strand of RNA must base pair with another nucleotide.
A matching in a graph is non-crossing if none of its edges cross each other. If we assume that the $$n$$ nodes of this graph are arranged around a circle, and if we label these nodes with positive integers between 1 and $$n$$, then a matching is non-crossing as long as there are not edges $$\{i, j\}$$ and $$\{k, l\}$$ such that $$i < k < j <l$$.
A non-crossing matching of basepair edges in the bonding graph corresponding to an RNA string will correspond to a possible secondary structure of the underlying RNA strand that lacks pseudoknots.
In this assignment, we will consider counting non-crossing perfect matchings of basepair edges. As a motivating example of how to count non-crossing perfect matchings, let $$c_n$$ denote the number of non-crossing perfect matchings in the complete graph $$K_{2n}$$. After setting $$C_0 = 1$$, we can see that $$C_1$$ should equal 1 as well. As for the case of a general $$n$$, say that the nodes of $$K_{2n}$$ are labeled with the positive integers from $$1$$ to $$2n$$. We can join node 1 to any of the remaining $$2n - 1$$ nodes. Yet once we have chosen this node (say $$m$$), we cannot add another edge to the matching that crosses the edge $$\{1, m\}$$. As a result, we must match all the edges on one side of $$\{1, m\}$$ to each other. This requirement forces $$m$$ to be even, so that we can write $$m = 2k$$ for some positive integer $$k$$.
There are $$2k - 2$$ nodes on one side of $$\{1, m\}$$ and $$2n - 2k$$
nodes on the other side of $$\{1, m\}$$, so that in turn there will be
$$C_{k - 1}\cdot C_{n - k}$$ different ways of forming a perfect matching
on the remaining nodes of $$K_{2n}$$. If we let $$m$$ vary over all
possible $$n - 1$$ choices of even numbers between $$1$$ and $$2n$$, then
we obtain the recurrence relation \[ C_n = \sum_{k=1}^n{C_{k - 1}\cdot
C_{n - k}} \] The resulting numbers $$C_n$$ counting non-crossing perfect
matchings in $$K_{2n}$$ are called the Catalan numbers, and they
appear in a huge number of other settings. See the figure below for an
illustration counting the first four Catalan numbers.
Your task:
Write a function catalanNumber that takes an integer $$n \in \mathbb{N}$$ and returns the Catalan number $$C_n$$.
Write a function noncrossingPerfectMatchings that takes an RNA strings $$s$$ having the same number of occurrences of A as U and the same number of occurrences of C as G. The function must return the total number of non-crossing perfect matchings of basepair edges in the bonding graph of $$s$$ modulo 1,000,000.
In the following interactive session, we assume the FASTA file data.fna2 to be located in the current directory.
>>> catalanNumber(0) 1 >>> catalanNumber(1) 1 >>> catalanNumber(2) 2 >>> catalanNumber(3) 5 >>> catalanNumber(4) 14 >>> noncrossingPerfectMatchings('AUAU') 2 >>> from Bio import SeqIO >>> noncrossingPerfectMatchings(*SeqIO.parse('data.fna', 'fasta')) 412480