Classifying open reading frames

One of the first steps toward identifying possible genes in a piece of DNA is to search for an open reading frame (ORF), or an interval of DNA that can serve as a template for translation. An ORF is a reading frame that begins with a start codon, ends with either a stop codon or the end of the strand, and has no other stop codons in between.

Figure: open reading frames. The reading frame in the top strand that starts with ATG and ends with the stop codon TAA is an ORF.
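The definition above can be sketched for a single reading frame that has already been split into codons. This is a minimal sketch, not a reference solution. The name orfs_in_frame is hypothetical, and the code assumes that ATG is the only start codon and that the stop codon itself is not part of the returned ORF.

```python
# Stop codons of the standard genetic code, written as DNA.
STOP_CODONS = {'TAA', 'TAG', 'TGA'}

def orfs_in_frame(codons):
    """Return every ORF in one reading frame, each as a list of codons.

    Assumption: an ORF starts at ATG and runs up to (not including) the
    next stop codon, or to the end of the frame if no stop codon follows.
    """
    orfs = []
    for start, codon in enumerate(codons):
        if codon != 'ATG':
            continue
        end = start
        # Extend the ORF until a stop codon or the end of the frame.
        while end < len(codons) and codons[end] not in STOP_CODONS:
            end += 1
        orfs.append(codons[start:end])
    return orfs

frame = ['CCC', 'ATG', 'GGG', 'ATG', 'TAA', 'ATG', 'CCC']
print(orfs_in_frame(frame))
# [['ATG', 'GGG', 'ATG'], ['ATG'], ['ATG', 'CCC']]
```

Note that every start codon opens its own ORF, so ORFs in the same frame may be nested, and the last one here ends at the end of the strand rather than at a stop codon.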

Recall that there are six reading frames for any strand of DNA: three derive from shifting translation of the strand itself (we can begin parsing codons at the first, second or third nucleotide) and three derive from the same shifts applied to the reverse complement of the strand. Both strands are counted because either strand of DNA can serve as the coding strand during transcription.
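As an illustration of the six reading frames, the following sketch splits a DNA string into codons for each of the three shifts on the strand and on its reverse complement. The function names reverse_complement and reading_frames are hypothetical helpers chosen for this example, and trailing nucleotides that do not fill a complete codon are simply dropped.

```python
def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A<->T, C<->G)."""
    pairs = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
    return ''.join(pairs[base] for base in reversed(dna))

def reading_frames(dna):
    """Return the six reading frames of a DNA string as lists of codons."""
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            # Only complete codons are kept; leftover bases are ignored.
            codons = [strand[i:i + 3]
                      for i in range(offset, len(strand) - 2, 3)]
            frames.append(codons)
    return frames

for frame in reading_frames('ATGCCC'):
    print(frame)
# ['ATG', 'CCC']
# ['TGC']
# ['GCC']
# ['GGG', 'CAT']
# ['GGC']
# ['GCA']
```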

Of course, identifying genes by looking for ORFs is an oversimplification. To find a bona fide gene, you may need to search for promoters and (in the case of eukaryotes) identify introns. However, using ORFs to identify putative genes is a useful approximation in prokaryotes and viruses, whose genomes are less complicated than eukaryotic genomes.

Assignment

An ORF begins with a start codon and ends either at a stop codon or at the end of the string. We will assume the standard genetic code for translating an RNA string into a protein string.

Write a function longestORF that takes a DNA string $$s$$ and returns the longest protein string that can be translated from an ORF of $$s$$. If more than one protein string of maximal length exists, return the one that comes first in lexicographical order.

Example

>>> longestORF('AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG')
'MLLGSFRLIPKETLIQVAGSSPCNLS'

Programming shortcut

To find ORFs using Biopython, it may be useful to recall the translate() and reverse_complement() methods from the Bio.Seq module.
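A short sketch of these two methods on a Seq object, assuming Biopython is installed. The sequence used here is a made-up example, and the to_stop argument is one optional way to cut the translation at the first stop codon.

```python
from Bio.Seq import Seq

dna = Seq('ATGGCCTAA')

# Translate with the standard genetic code; '*' marks a stop codon.
print(dna.translate())              # MA*

# Stop translating at the first in-frame stop codon.
print(dna.translate(to_stop=True))  # MA

# The other three reading frames start from the reverse complement.
print(dna.reverse_complement())     # TTAGGCCAT
```

When translating a shifted frame, it may help to trim the sequence to a multiple of three nucleotides first, since Biopython warns about partial codons.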

The EMBOSS package contains a program getorf that can be used to find ORFs of a given DNA string.