In February 2013, brothers named Elwin and Yohan were arrested for six rapes in France, but both denied the charges. Deciding which is guilty is a tricky affair: they're identical twins, so the genetic difference between them is very slight. Marseille police chief Emmanuel Kiehl said, "It could take thousands of separate tests before we know which one of them may be guilty."

This is only the latest in a series of legal conundrums involving identical twins and DNA evidence. During a jewel heist in Germany in January 2009, thieves left behind a drop of sweat on a latex glove. A crime database showed two hits: identical twins Hassan and Abbas O. (under German law their last name was withheld). Both brothers had criminal records for theft and fraud, but both were released. The court ruled, "From the evidence we have, we can deduce that at least one of the brothers took part in the crime, but it has not been possible to determine which one."

Later that year, identical twins Sathis Raj and Sabarish Raj escaped hanging in Malaysia when a judge ruled it was impossible to determine which was guilty of drug smuggling. "Although one of them must be called to enter a defence, I can't be calling the wrong twin to enter his defence," the judge told the court. "I also can't be sending the wrong person to the gallows."

In 2003, a Missouri woman had sex with identical twins Raymon and Richard Miller within hours of one another. When she became pregnant, both men denied fathering the child. In Missouri a man can be named a legal father only if a paternity test shows a 98 percent or higher probability of a DNA match, but the Miller twins both showed a probability of more than 99.9 percent.

"With identical twins, even if you sequenced their whole genome you wouldn't find difference," forensic scientist Bob Gaensslen told ABC News at the time. More recent research shows that this isn't the case5, but teasing out the difference can be expensive — in the Marseilles case, police were told that such a test would cost €996,000.

It goes on. In August 2013, British authorities were trying to decide how to prosecute a rape when DNA evidence identified both Mohammed and Aftab Asghar. "It is an unusual case," said prosecutor Sandra Beck. "They are identical twins. The allegation is one of rape. There is further work due."


Most differences found in the genomes of identical twins are due to copy-number variations (CNV). These structural variations alter the DNA of a genome such that cells have an abnormal number of copies of one or more sections of the DNA (or, for certain genes, a normal variation in that number). CNVs correspond to relatively large regions of the genome that have been removed (deletions) or duplicated (insertions) on certain chromosomes. For example, a chromosome that normally has its sections in the order A-B-C-D might instead have sections A-B-C-C-C-D (a duplication of C) or A-B-D (a deletion of C).

In order to detect CNVs, we assume that a DNA sequence is composed of a prefix, followed by an infix and a suffix, where the infix consists of $$n \in \mathbb{N}$$ repeats of a DNA fragment that we call the copy.

Figure: CNV structure (a prefix, an infix made of repeats of the copy, and a suffix).

In this assignment, we represent a DNA sequence as a string that only contains the uppercase letters A, C, G and T. Now, say that we have two DNA sequences that only differ in the number of repeats of the copy. Based on a comparison of both sequences, we can identify their different components. In order to do so, you proceed as follows:


>>> replicate('GATC')
>>> replicate('GATC', number=4)
>>> replicate('GATC', number=2, prefix='TAGCC')
>>> replicate(copy='GATC', number=3, suffix='AAGCTC')
>>> replicate(copy='GATC', number=3, prefix='TAGCC', suffix='AAGCTC')
>>> replicate(copy='GATC', number=5, suffix='AAGCTC', prefix='TAGCC')
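
The calls above suggest that replicate builds a sequence from a prefix, a number of repeats of the copy, and a suffix, in that order. A minimal sketch under that reading is given below. The default values (number=1, and empty strings for prefix and suffix) are assumptions, since the session does not show them.

```python
def replicate(copy, number=1, prefix='', suffix=''):
    """Return prefix + (copy repeated `number` times) + suffix.

    The defaults are assumptions made for this sketch; the assignment
    text does not state them.
    """
    return prefix + copy * number + suffix


# These two results agree with the sequences quoted in the comments
# of the session (seq1 and the four-fold repeat of GATC).
print(replicate(copy='GATC', number=3, prefix='TAGCC', suffix='AAGCTC'))
# TAGCCGATCGATCGATCAAGCTC
print(replicate('GATC', number=4))
# GATCGATCGATCGATC
```

Because all arguments after the copy are named, the order in which prefix and suffix are passed does not matter.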

>>> copy_number('CTCTCTCTCTCTCTCTCTCTCTCT')   # replicate(copy='CT', number=12)
('CT', 12)
>>> copy_number('GATCGATCGATCGATC')           # replicate(copy='GATC', number=4)
('GATC', 4)
>>> copy_number(replicate('GATCGATC', number=2))
('GATC', 4)
>>> copy_number('GATCGATCGATCGATCG')          # replicate(copy='GATC', number=4) + 'G'
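
Judging from the results shown above, copy_number reports the shortest fragment whose repetition yields the whole sequence, together with the number of repeats. One possible sketch tries every candidate length from short to long. Rejecting an empty sequence with a ValueError is an assumption of this sketch, not something the text prescribes.

```python
def copy_number(sequence):
    """Return (copy, number), where copy is the shortest fragment whose
    repetition yields the given sequence."""
    length = len(sequence)
    if length == 0:
        # assumption: an empty sequence is treated as invalid input
        raise ValueError('sequence must not be empty')
    for size in range(1, length + 1):
        # a candidate copy must split the sequence into whole repeats
        if length % size == 0:
            copy, number = sequence[:size], length // size
            if copy * number == sequence:
                return copy, number


print(copy_number('CT' * 12))           # ('CT', 12)
print(copy_number('GATCGATCGATCGATC'))  # ('GATC', 4)
```

Under this sketch, a sequence without a shorter repeating unit, such as 'GATCGATCGATCGATCG', is reported as a single copy of itself. The session does not show the expected result for that call, so this behaviour should be read as an assumption.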

>>> seq1 = 'TAGCCGATCGATCGATCAAGCTC'          # replicate(copy='GATC', number=3, prefix='TAGCC', suffix='AAGCTC')
>>> seq2 = 'TAGCCGATCGATCGATCGATCGATCAAGCTC'  # replicate(copy='GATC', number=5, suffix='AAGCTC', prefix='TAGCC')
>>> LCE(seq1, seq2)
>>> LCE(seq1, seq2, suffix=True)
>>> LCE(seq1, seq1)
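
The session does not show what LCE returns. The sketch below assumes that it returns the longest common prefix of both sequences as a string, or the longest common suffix when suffix=True. Both the return type and the default suffix=False are assumptions.

```python
def LCE(seq1, seq2, suffix=False):
    """Return the longest common prefix of seq1 and seq2, or their
    longest common suffix if suffix is True (assumed behaviour)."""
    if suffix:
        # a common suffix is a common prefix of the reversed sequences
        return LCE(seq1[::-1], seq2[::-1])[::-1]
    length = 0
    for base1, base2 in zip(seq1, seq2):
        if base1 != base2:
            break
        length += 1
    return seq1[:length]


seq1 = 'TAGCCGATCGATCGATCAAGCTC'
seq2 = 'TAGCCGATCGATCGATCGATCGATCAAGCTC'
print(LCE(seq1, seq2))               # TAGCCGATCGATCGATC
print(LCE(seq1, seq2, suffix=True))  # CGATCGATCGATCAAGCTC
```

Note that the common prefix and the common suffix of two sequences may overlap in the shorter one: here they have lengths 17 and 19, while seq1 has only 23 bases.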

>>> seq1 = 'TAGCCGATCGATCGATCAAGCTC'          # replicate(copy='GATC', number=3, prefix='TAGCC', suffix='AAGCTC')
>>> seq2 = 'TAGCCGATCGATCGATCGATCGATCAAGCTC'  # replicate(copy='GATC', number=5, suffix='AAGCTC', prefix='TAGCC')
>>> CNV(seq1, seq2)
('GATC', 3, 5)
>>> CNV(seq2, seq1)
('GATC', 5, 3)
>>> CNV(seq1, seq1)
Traceback (most recent call last):
AssertionError: no CNV found
>>> seq3 = 'TAGCGATCGATCGATCGATCGATCAAGCTC'   # replicate(copy='GATC', number=5, prefix='TAGC', suffix='AAGCTC')
>>> CNV(seq1, seq3)
('GATCGAT', 0, 1)
>>> seq4 = 'TAGCCGATCGATCGATCGATCGATCAGCTC'   # replicate(copy='GATC', number=5, suffix='AGCTC', prefix='TAGCC')
>>> CNV(seq1, seq4)
Traceback (most recent call last):
AssertionError: no CNV found
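
The results above can be reproduced with the following procedure, which is a reconstruction and not necessarily the intended one: strip the longest common prefix from both sequences, strip the longest common suffix from what remains, and require that exactly one remainder is non-empty. The shortest repeating unit of that remainder is taken as the copy, and copies at the end of the common prefix are counted for both sequences. The helper names common_length and shortest_unit are hypothetical.

```python
def common_length(a, b):
    """Length of the longest common prefix of a and b (hypothetical helper)."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length


def shortest_unit(s):
    """Shortest fragment whose repetition yields s, with the repeat count
    (hypothetical helper; s must not be empty)."""
    for size in range(1, len(s) + 1):
        if len(s) % size == 0 and s[:size] * (len(s) // size) == s:
            return s[:size], len(s) // size


def CNV(seq1, seq2):
    # strip the longest common prefix
    p = common_length(seq1, seq2)
    prefix, rest1, rest2 = seq1[:p], seq1[p:], seq2[p:]
    # strip the longest common suffix of the remainders
    s = common_length(rest1[::-1], rest2[::-1])
    diff1, diff2 = rest1[:len(rest1) - s], rest2[:len(rest2) - s]
    # exactly one sequence may carry extra bases
    assert (diff1 == '') != (diff2 == ''), 'no CNV found'
    copy, extra = shortest_unit(diff1 or diff2)
    # count the copies that both sequences share at the end of the prefix
    shared = 0
    while prefix.endswith(copy * (shared + 1)):
        shared += 1
    if diff1:
        return copy, shared + extra, shared
    return copy, shared, shared + extra


seq1 = 'TAGCCGATCGATCGATCAAGCTC'
seq2 = 'TAGCCGATCGATCGATCGATCGATCAAGCTC'
seq3 = 'TAGCGATCGATCGATCGATCGATCAAGCTC'
print(CNV(seq1, seq2))  # ('GATC', 3, 5)
print(CNV(seq1, seq3))  # ('GATCGAT', 0, 1)
```

Only the end of the common prefix is inspected for shared copies. Because the prefix is stripped greedily, the extra bases always sit as far to the right as possible, so the common suffix cannot start with another copy.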