In this assignment you are asked to write a number of Python functions that together produce an overview of the %GC of each DNA sequence in a given file in FASTA format. The terms that were displayed in italics in the preceding sentence are explained in detail below.
>118480563|DQ207729|Bacillus cereus|16S ribosomal RNA gene
AGAGTTTGATCCTGGCTCAGGATGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGAATGGATTA
AGAGCTTGCTCTTATGAAGTTAGCGGCGGACGGGTGAGTAACACGTGGGTAACCTGCCCATAAGACTGGG
ATAACTCCGGGAAACCGGGGCTAATACCGGATAACATTTTGAACCGCATGGTTCGAAATTGAAAGGCGGC
TTCGGCTGTCACTTATGGATGGACCCGCGTCGCATTAGCTAGTTGGTGAGGTAACGGCTCACCAAGGCAA
CGATGCGTA
>571435|U16165|Clostridium acetobutylicum|16S ribosomal RNA gene
TGGCGGCGTGCTTAACACATGCAAGTCGAGCGATGAAGCTCCTTCGGGAGTGGATTAGCGGCGGACGGGT
GAGTAACACGTGGGTAACCTGCCTCATAGAGGGGAATAGCCTTTCGAAAGGAAGATTAATACCGCATAAG
ATTGTAGTGCCGCATGGCATAGCAATTAAAGGAGTAATCCGCTATGAGATGGACCCGCGTCGCATTAGCT
AGTTGGTGAGGTAACGGCTCACCAAGGCGACGATGCGTAGCCGACCTGAGAGGGTGATCGGCCACATTGG
GACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTG
>996091|L07834|Geobacter metallireducens|16S ribosomal RNA gene
AGAGTTTGATCCTGGCTCAGAACGAACGCTGGCGGAGTGCCTAACACATGCAAGTCGAACGTGAAGGGGG
CTTCGGTCCCCGGAAAGTGGCGCACGGGTGAGTAACGCGTGGATAATCTGCCCAGTGATCTGGGATAACA
TCTCGAAAGGGGTGCTAATACCGGATAAGCCCACGGAGTCCTTGGATTCTGCGGGAAAAGGGGGGGACCT
TCGGGCCTTTTGTCACTGGATGAGTCCGCGTACCATTAGCTAGTTGGTGGGGTAATGGCCCACCAAGGCT
ACGATGGTTAG
Write a function readFasta that takes the location of a file as its argument. The file at this location must be a text file that contains one or more DNA sequences in FASTA format. The function must return a list that contains, for each record in the FASTA file, a tuple with the accession number, the generic name and the DNA sequence (in that order).
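One possible way to approach readFasta is sketched below. It is only a sketch under stated assumptions: it assumes that every header line has the layout shown in the example records above (fields separated by `|`, with the accession number as the second field and the generic name as the third), and it introduces a hypothetical helper parseFasta that is not part of the assignment.

```python
def parseFasta(lines):
    """Hypothetical helper: turn an iterable of FASTA lines into a list of
    (accession number, generic name, sequence) tuples."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            # assumed header layout: >identifier|accession|name|description
            fields = line[1:].split('|')
            records.append([fields[1], fields[2], ''])
        elif records:
            # a sequence may span several lines; concatenate them
            records[-1][2] += line
    return [tuple(record) for record in records]


def readFasta(location):
    """Read the FASTA file at the given location."""
    with open(location) as handle:
        return parseFasta(handle)
```

Separating the parsing from the file handling keeps the parsing logic easy to try out on a small list of lines without creating a file first.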
Write a function percentGC that takes a DNA sequence as its argument, calculates the %GC of that sequence and returns it as a real (floating-point) value.
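A minimal sketch of this calculation follows. It takes the %GC to be the number of G and C bases divided by the sequence length, times 100. Two choices in it are assumptions rather than requirements of the assignment: lowercase bases are counted as well, and an empty sequence yields 0.0.

```python
def percentGC(sequence):
    """Return the percentage of G and C bases in a DNA sequence as a float."""
    if not sequence:
        return 0.0  # assumption: an empty sequence has 0% GC
    sequence = sequence.upper()  # assumption: lowercase bases count too
    gc = sequence.count('G') + sequence.count('C')
    return 100.0 * gc / len(sequence)


print(percentGC('AGAGTTTGATAGAGCTTGCT'))  # 8 of 20 bases are G or C: 40.0
```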
Write a function showOverview that takes as its argument a list of tuples as generated by the function readFasta. The function must print an overview that starts with one line for each sequence in the given list. Each of these lines contains the generic name (left-aligned in a field of 30 characters), followed by the %GC of the sequence (rounded to two decimal places), a space and the accession number in parentheses. These lines are followed by a blank line, after which the minimum, maximum and mean %GC of all sequences are printed on separate lines. See the examples below for an illustration of the format in which the overview is to be displayed.
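The formatting rules above could be sketched as follows. The helper overviewLines is hypothetical, and the compact percentGC is included only so that the block runs on its own. The sketch also assumes that the summary labels (minimum, maximum, mean) are padded to the same 30-character field as the generic names.

```python
def percentGC(sequence):
    # compact stand-in so that this sketch is self-contained
    return 100.0 * sum(sequence.count(base) for base in 'GC') / len(sequence)


def overviewLines(records):
    """Hypothetical helper: build the overview as a list of strings."""
    values = [percentGC(sequence) for _, _, sequence in records]
    # one line per record: name in a 30-character field, %GC, accession
    lines = ['{:<30}{:.2f}% ({})'.format(name, value, accession)
             for (accession, name, _), value in zip(records, values)]
    lines.append('')  # blank line before the summary
    # assumption: summary labels use the same 30-character field
    lines.append('{:<30}{:.2f}%'.format('minimum', min(values)))
    lines.append('{:<30}{:.2f}%'.format('maximum', max(values)))
    lines.append('{:<30}{:.2f}%'.format('mean', sum(values) / len(values)))
    return lines


def showOverview(records):
    """Print the overview for a list of tuples as returned by readFasta."""
    print('\n'.join(overviewLines(records)))
```

Building the lines first and printing them afterwards makes it straightforward to check the layout of each line separately.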
This interactive Python session uses the file seq1.fasta.
>>> fasta = readFasta('seq1.fasta')
>>> fasta
[('ABCDE', 'elephant', 'AGAGTTTGATAGAGCTTGCT'), ('FGHIJ', 'donkey', 'GAACGCTGGCGGCATGCCTT')]
>>> percentGC('AGAGTTTGATAGAGCTTGCT')
40.0
>>> showOverview(fasta)
elephant                      40.00% (ABCDE)
donkey                        65.00% (FGHIJ)

minimum                       40.00%
maximum                       65.00%
mean                          52.50%
This interactive Python session uses the file seq2.fasta.
>>> fasta = readFasta('seq2.fasta')
>>> showOverview(fasta)
Bacillus cereus               53.49% (DQ207729)
Burkholderia xenovorans       56.41% (U86373)
Clostridium acetobutylicum    52.68% (U16165)
Geobacter metallireducens     56.40% (L07834)
Listeria welshimeri           53.61% (X98532)
Methanosarcina acetivorans    56.63% (M59137)
Oceanobacillus iheyensis      52.85% (AB010863)
Thermus thermophilus          63.96% (X07998)
Xanthomonas campestris        55.13% (X95917)
Bacillus sporothermodurans    54.38% (U49078)

minimum                       52.68%
maximum                       63.96%
mean                          55.55%