Gene finding

The first step in finding protein-coding genes in DNA is identifying Open Reading Frames (ORFs). A reading frame is a starting position from which the DNA is divided into codons, which are groups of three letters. You can start at the 1st, 2nd, or 3rd letter of the forward DNA strand, or at the 1st, 2nd, or 3rd letter of the complementary strand (read in the opposite direction), which gives 6 reading frames in total. An Open Reading Frame is a start codon, followed by a series of codons, and closed by a stop codon in the same reading frame. A piece of DNA usually contains multiple overlapping ORFs across the different reading frames. The longest ORF is typically the one most likely to be a real protein-coding gene, while the shorter overlapping ORFs usually do not code for proteins. Confirming that the longest ORF really codes for a protein requires additional signals, such as an RNA polymerase binding site and a ribosome binding site. In this assignment, we focus only on the longest ORFs.
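To make the idea of a reading frame concrete, the sketch below splits a short made-up sequence into codons for the three forward frames. The helper name codons_in_frame and the example sequence are illustrative assumptions, not part of the assignment; the three frames of the other strand would be obtained the same way from the reverse complement.

```python
def codons_in_frame(dna, offset):
    """Split dna into complete codons, starting at offset 0, 1, or 2.

    Hypothetical helper for illustration; trailing letters that do not
    fill a whole codon are dropped.
    """
    return [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]


example = "ATGAAATAGC"  # made-up sequence
frames = [codons_in_frame(example, offset) for offset in range(3)]

for offset, codons in enumerate(frames):
    print(f"frame {offset + 1}: {' '.join(codons)}")
# frame 1: ATG AAA TAG
# frame 2: TGA AAT AGC
# frame 3: GAA ATA
```

Only frame 1 contains a start codon (ATG) followed by a stop codon (TAG) in the same frame, so only that frame holds an ORF in this example.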

Overlapping Open Reading Frames

Task

Write a function read_dna that reads a file containing a DNA sequence. The DNA strand should be stored in a string variable, without any whitespace or newline characters that may be present in the file. The input for the function is the filename, and the output is the string variable. You can assume that the file is located in the current directory.
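One possible shape for read_dna is sketched below. It assumes the file is plain text and that removing all whitespace (spaces, tabs, newlines) is enough; it does not handle header lines such as those in FASTA files, which the assignment does not mention.

```python
def read_dna(filename):
    """Return the DNA sequence in filename as one string without whitespace."""
    with open(filename) as handle:
        # str.split() with no arguments splits on any run of whitespace,
        # so joining the pieces removes spaces, tabs, and newlines.
        return "".join(handle.read().split())
```

Called as read_dna("dna.txt") (a hypothetical filename), it looks for the file in the current directory.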

Write a function reverse_complement that takes a string variable as input and returns the complementary strand in reverse order.
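A minimal sketch of reverse_complement, assuming the input contains only the uppercase letters A, C, G, and T (any other character is passed through unchanged):

```python
# Translation table pairing each base with its complement: A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the complementary strand of dna, read in reverse order."""
    return dna.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))  # GCAT
```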

Write a function find_orfs that generates a list of all Open Reading Frames in a piece of DNA in both directions. Each element in the list should be a string variable containing the DNA sequence. Within this function, you will call the reverse_complement function. We will only find ORFs that begin with the start codon ATG and end with one of the three stop codons TAG, TAA, or TGA.
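The sketch below shows one way find_orfs could work. It rests on two assumptions the assignment leaves open: every ATG that is followed by a stop codon in the same frame counts as an ORF (so nested ORFs sharing a stop codon are all reported), and each returned ORF includes its stop codon. A compact reverse_complement is repeated here only so that the example runs on its own.

```python
STOP_CODONS = {"TAG", "TAA", "TGA"}


def reverse_complement(dna):
    """Return the complementary strand of dna, read in reverse order."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna):
    """Return all ORFs (ATG ... stop codon) found on both strands of dna."""
    orfs = []
    for strand in (dna, reverse_complement(dna)):
        # Trying every start position covers all three frames of the strand.
        for start in range(len(strand) - 2):
            if strand[start:start + 3] != "ATG":
                continue
            # Walk codon by codon in the frame fixed by this ATG.
            for pos in range(start + 3, len(strand) - 2, 3):
                if strand[pos:pos + 3] in STOP_CODONS:
                    orfs.append(strand[start:pos + 3])
                    break
    return orfs


print(find_orfs("CCATGTAACC"))  # ['ATGTAA']
```

An ATG with no stop codon in its frame before the end of the strand is skipped, because it does not form a complete ORF.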

Write a function longest_orf that takes a list as input (such as the one generated by find_orfs) and returns the longest ORF.
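A short sketch of longest_orf follows. Returning an empty string for an empty list and returning the first ORF when several share the maximum length are both assumptions; the assignment does not say how these cases should be handled.

```python
def longest_orf(orfs):
    """Return the longest string in orfs, or "" if the list is empty."""
    # max() with key=len returns the first item of maximal length.
    return max(orfs, key=len, default="")


print(longest_orf(["ATGTAA", "ATGAAATAG", "ATGTGA"]))  # ATGAAATAG
```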

Write a function translate_orf that converts an ORF to an amino acid sequence. You can use the codon table from the file codon_tabel.txt for this. You can assume that the file codon_tabel.txt is located in the current directory.
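The layout of codon_tabel.txt is not described here, so the sketch below assumes one entry per line with a DNA codon followed by whitespace and a one-letter amino acid code (for example "ATG M"). If the file uses another layout, such as RNA codons or three-letter codes, the parsing has to be adapted. The helper read_codon_table and the table_file parameter are illustrative names, and stopping at the stop codon without adding a symbol for it is also an assumption.

```python
STOP_CODONS = {"TAG", "TAA", "TGA"}


def read_codon_table(filename):
    """Read 'codon amino_acid' lines into a dict (assumed file layout)."""
    table = {}
    with open(filename) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 2:  # skip blank or incomplete lines
                table[fields[0].upper()] = fields[1]
    return table


def translate_orf(orf, table_file="codon_tabel.txt"):
    """Translate orf codon by codon, stopping at the first stop codon."""
    table = read_codon_table(table_file)
    protein = []
    for i in range(0, len(orf) - 2, 3):
        codon = orf[i:i + 3]
        if codon in STOP_CODONS:
            break
        protein.append(table[codon])
    return "".join(protein)
```

With a table that maps ATG to M and AAA to K, translate_orf("ATGAAATAG") would return "MK" under these assumptions.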