Motifs are rarely contiguous

In "Finding a motif in DNA1" we searched for occurrences of a motif as a substring of a larger genetic string. However, because a DNA strand coding for a protein is often interspersed with introns (see "RNA splicing2"), we need a way to recognize a motif that has been chopped up into pieces along a chromosome.

Assignment

A subsequence of a string is a collection of symbols contained in order (though not necessarily contiguously) in the string (e.g., ACG is a subsequence of TATGCTAAGATC). The indices of a subsequence are the positions in the string at which the symbols of the subsequence appear. Thus, the indices of ACG in TATGCTAAGATC can be represented by the tuple (2, 5, 9).

As a substring can have multiple locations, a subsequence can have multiple collections of indices, and the same index can be reused in more than one appearance of the subsequence. For example, ACG is a subsequence of AACCGGTT in 8 different ways. Your task:

Example

In the following interactive session, we assume the FASTA files data01.fna3 and data02.fna4 to be located in the current directory.

>>> from Bio import SeqIO

>>> subsequence('ACGTACGTGACG', (3, 8, 10))
'GTA'
>>> subsequence(*SeqIO.parse('data01.fna', 'fasta'), (7, 12, 21, 32))
'TGAC'

>>> indices('ACGTACGTGACG', 'GTA')
(3, 8, 10)
>>> indices(*SeqIO.parse('data02.fna', 'fasta'))
(7, 11, 21, 32)