There are three ways to divide a DNA string into codons for translation, one starting at each of the first three positions of the string. These ways of dividing a DNA string into codons are called reading frames. Since DNA is double-stranded, a genome has six reading frames (three on each strand), as shown in the figure below.

Figure: six-frame translation
Six different reading frames give six different ways for the same fragment of DNA to be transcribed and translated, three from each strand. The top three amino acid strings are read from left to right, whereas the bottom three strings are read from right to left. Stop codons are represented by XXX.
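
As a rough illustration of the six reading frames, the sketch below splits a DNA string into complete codons for each of the three offsets on both strands. The helper names (reverse_complement, codons, six_frames) and the 10-base input are hypothetical and not part of the assignment; the code also assumes the input contains only the uppercase bases A, C, G and T.

```python
# Complement of each DNA base (assumes uppercase A, C, G, T only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def codons(dna, offset):
    """Split dna into complete codons, starting at offset 0, 1 or 2."""
    # Stopping at len(dna) - 2 drops a trailing incomplete codon.
    return [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]


def six_frames(dna):
    """Return six codon lists: three forward frames, then three backward frames."""
    forward = [codons(dna, k) for k in range(3)]
    backward = [codons(reverse_complement(dna), k) for k in range(3)]
    return forward + backward


# Hypothetical 10-base fragment, used only for illustration.
for frame in six_frames("ATGGCCATGG"):
    print(frame)
```

The backward frames are read from the reverse complement, which corresponds to reading the bottom strand from right to left.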

We say that a DNA string $$s$$ encodes an amino acid string $$p$$ if the RNA string transcribed from either $$s$$ or its reverse complement $$\bar{s}$$ translates into $$p$$.
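
The definition above can be sketched as a small predicate. This is a minimal sketch, not a reference implementation: the names transcribe, translate and encodes are hypothetical, the codon table is the standard genetic code, and stop codons are marked with '*' so that a string containing one never equals a stop-free peptide (an assumption about how stop codons should be handled).

```python
from itertools import product

# Standard genetic code on RNA codons, bases ordered U, C, A, G; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("UCAG", repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def transcribe(dna):
    """Transcribe a DNA string into RNA by replacing T with U."""
    return dna.replace("T", "U")


def translate(rna):
    """Translate the complete codons of rna; stop codons become '*'."""
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))


def encodes(s, p):
    """Return True if s or its reverse complement translates into p."""
    forward = translate(transcribe(s))
    backward = translate(transcribe(reverse_complement(s)))
    return p in (forward, backward)


print(encodes("ATGGCC", "MA"))  # s itself translates into MA
print(encodes("GGCCAT", "MA"))  # the reverse complement of s translates into MA
```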

Assignment

Write a function peptide_matches that takes an amino acid string $$p$$ and the location of a FASTA file containing a DNA string $$g$$. The function must return a set containing all substrings of $$g$$ that encode $$p$$. Each of these substrings is represented as a tuple ($$x$$, $$y$$, $$s$$): the position $$x$$ where the substring starts, the position $$y$$ where it ends, and the DNA string $$s$$ as read on the strand that encodes $$p$$.

All positions in the genome $$g$$ are zero-indexed. Substrings that encode $$p$$ on the forward strand have $$x < y$$. Substrings that encode $$p$$ on the backward strand (the reverse complement of $$g$$) have $$x > y$$.
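
The position convention can be made concrete with a sketch that searches an in-memory DNA string rather than a FASTA file. Several things here are assumptions: the helper name find_matches is hypothetical, the 12-base genome is invented, and the exact meaning of $$x$$ and $$y$$ (first position read, and the position one step past the last position read, in reading direction) is inferred from the tuples in the Example section rather than stated in the text. The code translates DNA codons directly as a shortcut for transcribing first, and expects only uppercase A, C, G and T.

```python
from itertools import product

# Standard genetic code on DNA codons, bases ordered T, C, A, G; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate_dna(dna):
    """Translate the complete codons of a DNA string; stop codons become '*'."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def find_matches(g, p):
    """Hypothetical helper: return {(x, y, s)} for every window of g encoding p."""
    if not p:
        return set()
    width = 3 * len(p)  # a peptide of n amino acids is encoded by 3n bases
    matches = set()
    for start in range(len(g) - width + 1):
        window = g[start:start + width]
        # Forward strand: read from start up to, but not including, start + width.
        if translate_dna(window) == p:
            matches.add((start, start + width, window))
        # Backward strand: read from the last base of the window down to the
        # first, so x > y; y is one step past the last base read (assumed).
        backward = reverse_complement(window)
        if translate_dna(backward) == p:
            matches.add((start + width - 1, start - 1, backward))
    return matches


# Invented genome, used only for illustration.
print(find_matches("ATGGCCATGGCC", "MA"))
```

Because windows containing a stop codon translate to a string with '*', they never match a peptide, which is one way to keep stop codons out of the reported substrings.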

Example

In the following interactive session, we assume that the FASTA file data01.fna is located in the current directory.

>>> peptide_matches('MA', 'data01.fna')
{(6, 12, 'ATGGCC'), (0, 6, 'ATGGCC'), (7, 1, 'ATGGCC')}

Note

The stop codon should not be translated, as shown in the sample dataset.