In "Finding a shared motif1" we discussed searching through a database containing multiple genetic strings to find a longest common substring of these strings, which served as a motif shared by the two strings. However, as we saw in "RNA splicing2", coding regions of DNA are often interspersed by introns that do not code for proteins.
We therefore need to locate shared motifs that are separated across
exons, which means that the motifs are not required to be contiguous. To
model this situation, we need to enlist subsequences.
A string $$u$$ is a common subsequence of strings $$s$$ and $$t$$ if the symbols of $$u$$ appear in order as a subsequence of both $$s$$ and $$t$$. For example, ACTG is a common subsequence of AACCTTGG and ACACTGTGA.
Analogously to the definition of longest common substring, $$u$$ is a longest common subsequence of $$s$$ and $$t$$ if there does not exist a longer common subsequence of the two strings. Continuing our above example, ACCTTG is a longest common subsequence of AACCTTGG and ACACTGTGA, as is AACTGG.
Write a function LCS that takes two DNA strings $$s$$ and $$t$$. The function must return a string that contains a longest common sequence of $$s$$ and $$t$$. If more than one solution exists, the function may return any one.
In the following interactive session, we assume the FASTA file data.fna3 to be located in the current directory.
>>> LCS('AACCTTGG', 'ACACTGTGA') 'AACTTG' >>> from Bio import SeqIO >>> LCS(*SeqIO.parse('data.fna', 'fasta')) 'CACTTCGAGGGGTATGCGCCGTTG'