Once the complete genome of an organism has been determined, the question arises what regions of the genome are coding for proteins. Open reading frames (ORFs) are an important concept in predicting these coding regions. To understand how this works, it is important to know that translation of a DNA sequence into a protein happens by converting each three consecutive nucleotides (called codons) into the next amino acid that is appended to the growing protein chain. Certain codons play a special role in translating DNA into proteins:
start codon: the translation of a DNA sequence into a protein can start here
stop codon: the translation of a DNA sequence into a protein ends here
An ORF is a region within a genome that starts at the next start codon (from the beginning of the sequence or from the previous stop codon) and runs until the next stop codon. Note that according to this definition, it is perfectly possible that there are several start codons in between the first and the last codon of an ORF, but that there can not be any intermediate stop codons. The stop codon of an ORF is always located a multiple of three positions upstream to the position of the first start codon of the ORF. It is said that the start and stop codons are in the same reading frame. Since codons consist of three nucleotides, a DNA sequence thus has three reading frames that each may contain zero or more ORFs.
As an example, consider the following DNA sequence that can be read in three reading frames. The first reading frame starts at the letter g, the reading frame at the first letter c and the third reading frame at the second letter c.
1 gcc ctt aat ttt att cat tgg att cca ttc att aac gtg ctg atg tcc cat ttg ttt a
|-------------* |-----------|-----------|---------------------------------*
2 g ccc tta att tta ttc att gga ttc cat tca tta acg tgc tga tgt ccc att tgt tta
|-----------|---------------------------------* |
3 gc cct taa ttt tat tca ttg gat tcc att cat taa cgt gct gat gtc cca ttt gtt ta
* |---------------------------------*
We have separated the codons in each reading frame by a single space in the above representation. In this example we consider the codons att, gca and gcc as start codons (green) and the codons tct, tga and ttt as stop codons (red). Below each reading frame, we have underlined the ORFs with hyphens (-). In addition, we mark the first nucleotide of a start codon with a vertical bar (|) and the last nucleotide of a stop codon with an asterisk (*). In this way, it is clear to see that there are two ORFs in the first reading frame, and one ORF each in the second and third reading frame.
In this exercise we represent DNA sequences (including codons) as strings that only contain the lower case letters a, c, g and t. Write a function ORFs that takes three arguments: i) a DNA sequence (str), ii) a list of start codons (str) and iii) a list of stop codons (str). The function must return a list that contains all ORFs found in each of the three reading frames of the DNA sequence. An ORF is represented as a tuple having the following content
(start position, stop position, frame)
The start position of the ORF is the position (int) of the first nucleotide of the start codon, and the stop position is the position (int) of the last (third) nucleotide of the stop codon. Positions are given relative to the start of the DNA sequence, so that the first nucleotide is at position zero. The frame of an ORF refers to the reading frame in which the ORF was found. The list returned by the function must contain the ORFs sorted according to increasing starting position.
>>> ORFs('gcccttaattttattcattggattccattcattaacgtgctgatgtcccatttgttta', ['att', 'gca', 'gcc'], ['tct', 'tga', 'ttt'])
[(0, 11, 1), (7, 42, 2), (12, 56, 1), (26, 52, 3)]
>>> ORFs('cgtgtgcacaactcaccccgtagacccaaaatgtggataacatg', ['aca', 'gtg'], ['aaa', 'ccc', 'gac'])
[(1, 18, 2), (3, 17, 1)]
>>> ORFs('cgagggctctcactgggacggcagaggctagtcacagtat', ['agt'], ['gac', 'ggc'])
[]