The genome of most bacteria consists of a single circular molecule of DNA. DNA replication, the process of producing two identical replicas from one original DNA molecule, is initiated at a particular sequence in the genome called the origin of replication, and proceeds from this point simultaneously in both directions (see figure below). The replication process ends at a position in the genome called the terminus of replication.

DNA replication
Topology of bi-directional replication of a circular prokaryotic chromosome. The continuous line is the DNA strand replicated as the leading strand. The dashed line is the DNA strand replicated as the lagging strand. Ori: the origin of replication. Ter: the terminus of replication. Ori and Ter divide the chromosome into two replichores, arbitrarily called left and right.

The specific structure of the origin of replication varies somewhat from species to species, but all share some common characteristics such as high AT content (adenine and thymine are easier to separate because they form only two hydrogen bonds whereas guanine and cytosine form three). The origin of replication binds the pre-replication complex, a protein complex that recognizes, unwinds, and begins to copy DNA.

A simple procedure to determine the position of the origin and terminus of replication in a genome sequence makes use of a vector representation of the DNA molecule. This vector is constructed as a list (list) of $$(x, y)$$-coordinates (tuple; $$x, y \in \mathbb{Z}$$ (int)) that starts at the point $$(0, 0)$$, the origin of the coordinate system. For each successive base in the DNA sequence a neighboring point is visited: the left neighbor for base A, the right neighbor for base T, the upper neighbor for base C and the lower neighbor for base G. Each neighboring point is at distance one from the previous point. The figure below shows a graphical display of the vector representation of the DNA sequence GACCCTTGT.

DNA vector
Vector representation of the DNA sequence GACCCTTGT. The positions where the $$y$$-coordinate first reaches its maximal and minimal values are indicated by orange dots.
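
The construction described above can be sketched in a few lines of Python. This is only one possible approach: the lookup table `STEPS` is a helper name introduced here for illustration, and the function name `vector` is borrowed from the example further down.

```python
# Step taken for each base: A = left, T = right, C = up, G = down.
# STEPS is an illustrative helper name, not part of the assignment text.
STEPS = {'A': (-1, 0), 'T': (1, 0), 'C': (0, 1), 'G': (0, -1)}


def vector(sequence):
    """Return the list of (x, y) points visited while walking along the sequence."""
    x, y = 0, 0
    points = [(x, y)]  # the walk always starts at (0, 0)
    for base in sequence:
        dx, dy = STEPS[base]
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


print(vector('GACCCTTGT'))
# [(0, 0), (0, -1), (-1, -1), (-1, 0), (-1, 1), (-1, 2), (0, 2), (1, 2), (1, 1), (2, 1)]
```

A sequence of n bases yields n + 1 points, because the starting point is included.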

The positions in the vector representation where the $$y$$-coordinate first reaches its maximal and minimal values correspond to the positions of the origin and terminus of replication on the genome. However, which of the two points is the origin and which one is the terminus cannot be determined unambiguously. Below you see an example of the vector representation of the complete genome sequence of Haemophilus influenzae strain Rd (L42023), where the positions of the origin and terminus are indicated using black circles.

DNA vector representation of H. influenzae
Vector representation of the complete genome sequence of Haemophilus influenzae strain Rd (L42023). The positions of the origin and terminus of replication are indicated using black circles. Start and end points of the genome sequence as recorded in the INSDC database are indicated using red dots.
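
Since only the bases C and G change the $$y$$-coordinate, the two positions can be found by tracking $$y$$ alone. The sketch below assumes that the result is a tuple with the index of the first maximum followed by the index of the first minimum; this order is inferred from the example further down, from which the function name `replicatie` is also taken.

```python
def replicatie(sequence):
    """Return (index of first maximal y, index of first minimal y) along the walk."""
    y = 0
    ys = [0]  # y-coordinate of the starting point (0, 0)
    for base in sequence:
        if base == 'C':
            y += 1  # C moves one step up
        elif base == 'G':
            y -= 1  # G moves one step down
        # A and T only move horizontally, so y is unchanged
        ys.append(y)
    # list.index returns the first occurrence, which is what is needed here
    return ys.index(max(ys)), ys.index(min(ys))


print(replicatie('GACCCTTGT'))  # (5, 1)
print(replicatie('CTGGGGTAA'))  # (1, 6)
```

The returned indices are positions in the vector representation, so index 0 refers to the starting point rather than to the first base.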

Assignment

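We represent DNA sequences as strings (str) that only consist of the uppercase letters A, C, G and T. Your task is to implement the functions vector, replicatie and sequentie, whose behavior is illustrated in the example below.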

Example

>>> vector('GACCCTTGT')
[(0, 0), (0, -1), (-1, -1), (-1, 0), (-1, 1), (-1, 2), (0, 2), (1, 2), (1, 1), (2, 1)]
>>> vector('CTGGGGTAA')
[(0, 0), (0, 1), (1, 1), (1, 0), (1, -1), (1, -2), (1, -3), (2, -3), (1, -3), (0, -3)]

>>> replicatie('GACCCTTGT')
(5, 1)
>>> replicatie('CTGGGGTAA')
(1, 6)

>>> sequentie([(0, 0), (0, -1), (-1, -1), (-1, 0), (-1, 1), (-1, 2), (0, 2), (1, 2), (1, 1), (2, 1)])
'GACCCTTGT'
>>> sequentie([(0, 0), (0, 1), (1, 1), (1, 0), (1, -1), (1, -2), (1, -3), (2, -3), (1, -3), (0, -3)])
'CTGGGGTAA'
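
The last pair of examples suggests that sequentie inverts the vector representation. A minimal sketch, assuming every pair of consecutive points differs by exactly one unit step (the lookup table `BASES` is an illustrative helper name, not part of the assignment):

```python
# Base that corresponds to each unit step (dx, dy) between consecutive points.
BASES = {(-1, 0): 'A', (1, 0): 'T', (0, 1): 'C', (0, -1): 'G'}


def sequentie(points):
    """Rebuild the DNA sequence from a list of (x, y) points."""
    bases = []
    # Pair every point with its successor and translate the step into a base.
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        bases.append(BASES[(x1 - x0, y1 - y0)])
    return ''.join(bases)


print(sequentie([(0, 0), (0, -1), (-1, -1), (-1, 0), (-1, 1)]))  # GACC
```

A step that is not one of the four unit moves raises a KeyError in this sketch; whether invalid input has to be handled at all is not stated in the assignment.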

Resources