The genome of most bacteria consists of a single circular molecule of DNA. DNA replication — the process of producing two identical replicas from one original DNA molecule — is initiated at a particular sequence in the genome called the origin of replication, and proceeds from this point simultaneous in both directions (see figure below). The replication process ends at a position in the genome called the terminus of replication.

DNA replicatie

The specific structure of the origin of replication varies somewhat from species to species, but all share some common characteristics such as high AT content (adenine and thymine are easier to separate because they form only two hydrogen bonds whereas guanine and cytosine form three). The origin of replication binds the pre-replication complex, a protein complex that recognizes, unwinds, and begins to copy DNA.

A simple procedure to determine the position of the origin and terminus of replication in a genome sequence, makes use of a vector representation of the DNA molecule. This vector is constructed as a list (list) of $$(x, y)$$-coordinates (tuple; $$x, y \in \mathbb{N}$$ (int)) that starts at the point $$(0, 0)$$ in the origin. For each successive base in the DNA sequence a neighboring point is visited: the left neighbor for base A, the right neighbor for base T, the upstairs neighbor for base C and the downstairs neighbor for base G. The neighboring points are always at distance one from the previous point. The figure below shows a graphical display of the vector representation of the DNA sequence GACCCTTGT.

DNA vector

The positions in the vector representation where the $$y$$-coordinate for the first time reaches its maximal and minimal value, correspond to the positions of the origin and terminus of replication on the genome. However, which of the two points is the origin and which one is the terminus can not be determined unambiguously. Below you see an example of the vector representation of the complete genome sequence of Haemophilus influenzae strain Rd (L42023), where the positions of the origin and terminus are indicated using black circles.

DNA vectorvoorstelling van H. influenza

Assignment

We represent DNA sequences as strings (str) that only consists of the uppercase letters A, C, G and T. Your task:

Example

>>> vector('GACCCTTGT')
[(0, 0), (0, -1), (-1, -1), (-1, 0), (-1, 1), (-1, 2), (0, 2), (1, 2), (1, 1), (2, 1)]
>>> vector('CTGGGGTAA')
[(0, 0), (0, 1), (1, 1), (1, 0), (1, -1), (1, -2), (1, -3), (2, -3), (1, -3), (0, -3)]

>>> replicatie('GACCCTTGT')
(5, 1)
>>> replicatie('CTGGGGTAA')
(1, 6)

>>> sequentie([(0, 0), (0, -1), (-1, -1), (-1, 0), (-1, 1), (-1, 2), (0, 2), (1, 2), (1, 1), (2, 1)])
'GACCCTTGT'
>>> sequentie([(0, 0), (0, 1), (1, 1), (1, 0), (1, -1), (1, -2), (1, -3), (2, -3), (1, -3), (0, -3)])
'CTGGGGTAA'

Resources