A group led by Dr. Fred Sanger at Cambridge University sequenced the mitochondrial genome of one individual of European descent during the 1970s, determining it to have a length of 16,569 base pairs (0.0006% of the total human genome) containing some 37 genes. The Cambridge Reference Sequence (CRS) for human mitochondrial DNA was first published in 1981, leading to the initiation of the Human Genome Project.

When other researchers repeated the sequencing, some striking discrepancies were noted. The originally published sequence contained eleven errors, including one extra base pair at position 3107 and incorrect assignments of single base pairs. Some of these were the result of contamination with bovine and HeLa specimens. The revised CRS was published by Andrews et al. in 1999 and is designated rCRS.

When mitochondrial DNA sequencing is used for genealogical purposes, the results are often reported as differences from the revised CRS. This notation is illustrated in the example below, which uses a fictional reference sequence.

reference: GCTGTCCAGATA

sequence: GCTCTCTAGAGA → 4C,7T,11G

In this notation, 4C indicates that the sequence differs from the reference sequence at the fourth position, where it has the base C (whereas the reference sequence has the base G at that position). Likewise, 7T indicates that the sequence differs from the reference sequence at the seventh position, where it has the base T (whereas the reference sequence has the base C). Positions are counted starting from 1. Observed differences from the reference sequence are separated by commas. If the sequence does not differ from the reference sequence at any position, the result is written as an empty string.
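
To make the notation concrete, the following Python sketch splits a difference string into (position, base) pairs. The helper name parse_diff is hypothetical and not part of the assignment; it assumes that every entry consists of a 1-based position followed by a single-letter base, as in the example above.

```python
def parse_diff(diff):
    """Split a difference string such as '4C,7T,11G' into (position, base) pairs.

    Hypothetical helper for illustration only; positions are 1-based.
    """
    if not diff:
        # an empty string means the sequence equals the reference sequence
        return []
    # the last character of each entry is the base, the rest is the position
    return [(int(entry[:-1]), entry[-1]) for entry in diff.split(',')]


print(parse_diff('4C,7T,11G'))  # [(4, 'C'), (7, 'T'), (11, 'G')]
print(parse_diff(''))           # []
```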

Assignment

  1. Write a function seq2diff that returns the observed differences between a given sequence seq and a given reference sequence refseq in the format explained above. Both sequences must be passed to the function as parameters.

  2. Write a function diff2seq that returns the original sequence when the observed differences diff and the reference sequence refseq are given. A string with the observed differences and the reference sequence must be passed to the function as parameters.

Example

>>> seq2diff(seq='GCTCTCTAGAGA', refseq='GCTGTCCAGATA')
'4C,7T,11G'
>>> diff2seq(diff='4C,7T,11G', refseq='GCTGTCCAGATA')
'GCTCTCTAGAGA'
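
One possible approach to both functions is sketched below. It is a minimal sketch rather than a reference solution, and it assumes that seq and refseq have the same length, so that only substitutions occur (no insertions or deletions), and that every base is a single letter.

```python
def seq2diff(seq, refseq):
    """Return the differences between seq and refseq, e.g. '4C,7T,11G'."""
    # assumption: both sequences have the same length (substitutions only)
    return ','.join(
        f'{position}{base}'
        for position, (base, refbase) in enumerate(zip(seq, refseq), start=1)
        if base != refbase
    )


def diff2seq(diff, refseq):
    """Rebuild the original sequence from the differences and refseq."""
    bases = list(refseq)
    if diff:  # an empty diff means the sequence equals the reference sequence
        for entry in diff.split(','):
            position, base = int(entry[:-1]), entry[-1]
            bases[position - 1] = base  # 1-based position to 0-based index
    return ''.join(bases)
```

Under these assumptions the two functions are inverses of each other: applying diff2seq to the output of seq2diff with the same reference sequence gives back the original sequence.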