The billion-year war

The war between viruses and bacteria has been waged for over a billion years. Viruses called bacteriophages (or simply phages) require a bacterial host to propagate, and so they must somehow infiltrate the bacterium. Such deception can only be achieved if the phage understands the genetic framework underlying the bacterium's cellular functions. The phage's goal is to insert DNA that will be replicated within the bacterium and lead to the reproduction of as many copies of the phage as possible, which sometimes also involves the bacterium's demise.

To defend itself, the bacterium must either obfuscate its cellular functions so that the phage cannot infiltrate it, or better yet, go on the counterattack by calling in the air force. Specifically, the bacterium employs aerial scouts called restriction enzymes, which operate by cutting through viral DNA to cripple the phage. But what kind of DNA are restriction enzymes looking for?

EcoRV
DNA cleaved by EcoRV restriction enzyme.

The restriction enzyme is a homodimer, which means that it is composed of two identical substructures. Each of these structures separates from the restriction enzyme in order to bind to and cut one strand of the phage DNA molecule. Both substructures are pre-programmed with the same target string containing 4 to 12 nucleotides to search for within the phage DNA (see figure above). The chance that both strands of phage DNA will be cut (thus crippling the phage) is greater if the target is located on both strands of phage DNA, as close to each other as possible. By extension, the best chance of disarming the phage occurs when the two target copies appear directly across from each other along the phage DNA, a phenomenon that occurs precisely when the target is equal to its own reverse complement. Eons of evolution have made sure that most restriction enzyme targets now have this form.

Assignment

A DNA string is a reverse palindrome if it is equal to its reverse complement. For instance, GCATGC is a reverse palindrome because its reverse complement is GCATGC.

Reverse palindrome
Palindromic recognition site.
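
The definition of a reverse palindrome can be checked directly in code. The following is a minimal sketch, assuming the DNA string contains only the letters A, C, G and T; the helper names reverse_complement and is_reverse_palindrome are hypothetical and not part of the assignment.

```python
# Translation table mapping each nucleotide to its complement.
# Assumption: the input only contains the letters A, C, G and T.
COMPLEMENT = str.maketrans('ACGT', 'TGCA')


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (hypothetical helper)."""
    # Complement every nucleotide, then reverse the result.
    return dna.upper().translate(COMPLEMENT)[::-1]


def is_reverse_palindrome(dna):
    """Check whether a DNA string equals its own reverse complement."""
    return dna.upper() == reverse_complement(dna)


print(reverse_complement('GCATGC'))     # GCATGC
print(is_reverse_palindrome('GCATGC'))  # True
print(is_reverse_palindrome('GCATGA'))  # False
```

Note that a string of odd length can never be a reverse palindrome, because its middle nucleotide would have to be its own complement.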

Write a function restrictionSites that takes a DNA string. The function must return a set containing all restriction sites in the given DNA string. A restriction site is a position in a DNA sequence where a reverse palindrome is located. Each restriction site is represented by a tuple containing the position of the first letter of the palindrome and the length of the palindrome. Positions are 1-based: the first letter of the DNA string is at position 1, the second letter at position 2, and so on. The function has two optional arguments, minLength (default value: 4) and maxLength (default value: 12), that give the minimal and maximal length of the palindromes that must be taken into account to determine the restriction sites.

Example

>>> restrictionSites('TCAATGCATGCGGGTCTATATGCAT')
{(4, 6), (5, 4), (6, 6), (7, 4), (17, 4), (18, 4), (20, 6), (21, 4)}
>>> restrictionSites('AAGTCATAGCTATCGATCAGATCAC', minLength=5)
{(6, 8), (7, 6), (12, 6)}

>>> from Bio import SeqIO
>>> restrictionSites(str(SeqIO.read('data.fna', 'fasta').seq), maxLength=5)
{(1, 4), (12, 4), (14, 4), (18, 4), (20, 4)}
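
One possible way to approach the assignment is a brute-force scan over every start position and every allowed length. The sketch below is only an illustration under stated assumptions, not a reference solution: it assumes the input contains only the letters A, C, G and T, and the helper reverse_complement is a hypothetical name that does not appear in the assignment.

```python
# Translation table mapping each nucleotide to its complement.
# Assumption: the input only contains the letters A, C, G and T.
COMPLEMENT = str.maketrans('ACGT', 'TGCA')


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (hypothetical helper)."""
    return dna.translate(COMPLEMENT)[::-1]


def restrictionSites(dna, minLength=4, maxLength=12):
    """Return the set of (position, length) tuples of all reverse palindromes."""
    dna = dna.upper()
    sites = set()
    for start in range(len(dna)):
        for length in range(minLength, maxLength + 1):
            # Stop as soon as the fragment would run past the end of the string.
            if start + length > len(dna):
                break
            fragment = dna[start:start + length]
            if fragment == reverse_complement(fragment):
                # Positions are reported 1-based, as the assignment requires.
                sites.add((start + 1, length))
    return sites


sites = restrictionSites('AAGTCATAGCTATCGATCAGATCAC', minLength=5)
print(sorted(sites))  # [(6, 8), (7, 6), (12, 6)]
```

The scan examines every fragment whose length lies between minLength and maxLength, which is simple to reason about and fast enough for short sequences. Skipping odd lengths would be an easy optimisation, since a reverse palindrome always has even length.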

Epilogue

You may be curious how the bacterium prevents its own DNA from being cut by restriction enzymes. The short answer is that it protects its DNA through a chemical process called DNA methylation: the cell bonds methyl groups ($$CH_3$$) to nucleotides of its own DNA, which effectively locks them from being involved in a reaction (especially those involving further bonding, like transcription).

DNA methylation
Illustration of a methylated base pair of DNA.

Methylation serves a number of fascinating practical purposes. In one example, restriction enzymes employed by a bacterium would not be capable of discriminating between the foreign DNA of a phage and the bacterium's own DNA, so the bacterium methylates its DNA to protect it from its own restriction enzymes.

Methylation is also a remarkable way to regulate gene activity, as methylated DNA can be inherited, which has opened up a brand new field called epigenetics. This field studies functionally relevant modifications to the genome that do not involve a change in the genome's sequence of nucleotides. In short, there is a lot more to inheritance than simply replicating DNA!

Methylation usually occurs at CpG sites, where cytosine and guanine nucleotides appear consecutively. In recent years, researchers have shown that DNA methylation occurs in higher organisms and that it is important for normal development: methylated areas of the genome are protected from transcription activators and remain inactive. These "silent" parts of the genome are called heterochromatin.
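
As a small illustration of what a CpG site looks like in sequence data, the following sketch lists the positions where a cytosine is immediately followed by a guanine. The function name cpg_positions is hypothetical, and positions are assumed to be 1-based to match the convention used in the assignment.

```python
def cpg_positions(dna):
    """Return the 1-based positions of every C that is directly followed by a G.

    Hypothetical helper for illustration only; it says nothing about whether
    a site is actually methylated.
    """
    dna = dna.upper()
    return [i + 1 for i in range(len(dna) - 1) if dna[i:i + 2] == 'CG']


print(cpg_positions('ACGCGT'))  # [2, 4]
```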