In "Finding a motif in DNA1" we searched for occurrences of a motif as a substring of a larger genetic string. However, because a DNA strand coding for a protein is often interspersed with introns (see "RNA splicing2"), we need a way to recognize a motif that has been chopped up into pieces along a chromosome.
A subsequence of a string is a collection of symbols contained in order (though not necessarily contiguously) in the string (e.g., ACG is a subsequence of TATGCTAAGATC). The indices of a subsequence are the positions in the string at which the symbols of the subsequence appear. Thus, the indices of ACG in TATGCTAAGATC can be represented by the tuple (2, 5, 9).
As a substring can have multiple locations, a subsequence can have multiple collections of indices, and the same index can be reused in more than one appearance of the subsequence. For example, ACG is a subsequence of AACCGGTT in 8 different ways. Your task:
Write a function subsequence that takes two arguments: i) a DNA string and ii) the indices of a subsequence of the DNA string. The function must return a string containing the subsequence of the DNA string that corresponds with the given indices. If the given indices do not correspond to a subsequence of the given DNA string, the function must raise an AssertionError with the message invalid indices.
Write a function indices that takes two DNA strings $$s$$ and $$t$$. The function must return one collection of indices (a tuple) of $$s$$ in which the symbols of $$t$$ appear as a subsequence of $$s$$. If multiple indices exist, the function may return any one. However, if no such indices exist, the function must raise an AssertionError with the message invalid subsequence.
In the following interactive session, we assume the FASTA files data01.fna3 and data02.fna4 to be located in the current directory.
>>> from Bio import SeqIO >>> subsequence('ACGTACGTGACG', (3, 8, 10)) 'GTA' >>> subsequence(*SeqIO.parse('data01.fna', 'fasta'), (7, 12, 21, 32)) 'TGAC' >>> indices('ACGTACGTGACG', 'GTA') (3, 8, 10) >>> indices(*SeqIO.parse('data02.fna', 'fasta')) (7, 11, 21, 32)