One of the most remarkable qualities of genome sequences is the large repetitions of certain fragments within the genome. This is mainly the case with eukaryotes. The human chromosome Y, for example, consists mainly of repeated subsequences. When screening 3.6 million DNA base pairs extracted from C. elegans more than 7000 families with repeated sequences were identified. Prokaryotes, on the other hand are hardly ever a part of repeated sequences in the genome, except for some minor and well structured repetitions. Aside from the incredible amount of repetitions, DNA also stands out because of the large variety of repeated structures, the diversity of mechanisms which are represented to explain the origin and maintenance of these repetitions, and the biological functions that some of these repetitions can play.
It can be concluded from the introduction that genome research requires efficient algorithms to trace different types of repeated structures in genome sequences. Especially required is the implementation of a algorithm to find the longest repeated substring (LRS) within a given DNA sequence. To do this, follow these steps:
Write a function LCP
to which two strings must be passed as argument. This function should
return the longest common prefix (LCP) of the two given
strings. If the given strings have no common prefix, then an emtpy
string should be returned as a result. The longest common prefix of the
strings banaan
and banana,
for example, is bana. The
function should differentiate between uppercase and lowercase letter
when comparing the prefixes.
Note(terminology): A prefix consisting of one or more letters
in front of a string is called the prefix of that string. A
suffix consisting of one or more letters at the end of a string is
called the suffix of that string. For example, bio
is the prefix of the string biotechnology,
and technology is the
suffix of the same string.
Write a function LRS to which a string must be passed as argument. This function should return the longest repeated substring (LRS) of a given string. Note that we are mainly interested in using this function to determine the longest repeated subsequence of a given DNA sequence (a string consisting only of the letters A, C, G and T), but that this function should also be able to extract the longest repeated substring out of any string. The required algorithm to determine the LRS of a given string, works as follows:
First make a list of all suffixes of a given string. For the string banana, for example, a list with the suffixes a, na, ana, nana, anana and banana should be drawn up. The order of the suffixes in this list is initially of no importance (see next step).
Sort the suffix list alphabetically. This causes the suffixes with a common prefix to be listed underneath each other. The sorted suffix list of the previous example would then be: a, ana, anana, banana, na en nana.
Calculate the LCP for each pair of successive suffixes (use the function LCP). The longest of these common prefixes should then be the LRS we are looking for. If there are multiple common prefixes that are the longest, then the first one which was found when going through the pairs (from left to right) should be returned. From the sorted suffix list a, ana, anana, banana, na and nana it is immediately clear that ana is the LRS of the string banana. This example also shows that repetitions of a LRS can partially overlap.
>>> LCP('banaan', 'banana')
'bana'
>>> LCP('mango', 'mandarin')
'man'
>>> LCP('peach', 'peach')
'peach'
>>> LCP('apple', 'appletree')
'apple'
>>> LCP('appletree', 'apple')
'apple'
>>> LCP('apple', 'pear')
''
>>> LCP('potential', 'Pomelo')
''
>>> LRS('CAATCCGCCTGTATTAAGGTACGGCCTTTTGACTGAAGTT')
'GCCT'
>>> LRS('AGTGTGGGGCATGCCTATTGGCCACCTTGTAGGAGCCTTG')
'CCTTG'
>>> LRS('GACTAGCCGTCTACTTATTGAATCGCAGCATCATTTCGAG')
'ACT'
>>> LRS('GGCCCGTCGCGACATTTGAGCTAGTCAGCGATCTTTACCT')
'GCGA'
>>> LRS('TGCCCCGTCAACCTCTACTCATCTCGCTCTAAACTGTAAG')
'CTCTA'