A number of different data presentation formats have been used to represent genetic strings. The history of file formats presents its own kind of evolution: some formats have died out, being replaced by more successful ones. Three file formats are currently the most popular:
A simple reference on file formats can be found here.
In this assignment, we will familiarize ourselves with FASTA. We will save the other two formats for later problems. In FASTA format, a string is introduced by a line that begins with a greater-than symbol (>), followed by some information labeling the string. Subsequent lines contain the string itself. The next line beginning with > indicates that the current string is complete and begins the label of the next string in the file.
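To make the layout concrete, here is a minimal sketch of a reader for FASTA text as described above. It uses only the standard library; the function name parse_fasta and the two example records are made up for illustration and are not part of the assignment or of any library.

```python
def parse_fasta(text):
    """Return a list of (label, sequence) pairs from FASTA-formatted text."""
    records = []
    label = None   # label of the record currently being read
    chunks = []    # sequence lines collected for that record
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith('>'):
            # a new label closes the previous record, if there was one
            if label is not None:
                records.append((label, ''.join(chunks)))
            label = line[1:]
            chunks = []
        elif label is not None:
            chunks.append(line)
    # the last record is closed by the end of the text
    if label is not None:
        records.append((label, ''.join(chunks)))
    return records


# hypothetical example data, not taken from GenBank
example = """>seq1 first hypothetical string
ACGTAC
GTTT
>seq2 second hypothetical string
GGCC
"""

for label, sequence in parse_fasta(example):
    print(label, len(sequence))
```

The sequence of a record may span several lines, so the sketch joins all lines up to the next label before reporting a length.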
GenBank hosts its own file format for storing genome data, containing a large amount of information about each interval of DNA. The GenBank file describes the interval's source, taxonomic position, authors, and features.
A sample GenBank entry can be found here. You may export an entry to a variety of file formats by selecting the appropriate file format under the Send To: dropdown menu at the top of the page.
GenBank can be accessed here. A detailed description of the GenBank format can be found here.
Write a function shortestSequence that takes two arguments: i) a collection of $$n$$ ($$1 \leq n \leq 10$$) GenBank accession numbers and ii) a string containing a file path. The function must write the shortest of the DNA strings associated with the given accession numbers to a FASTA file with the given path.
>>> shortestSequence(['FJ817486', 'JX069768', 'JX469983'], 'output.fna')
>>> print(open('output.fna', 'r').read(), end='')
>JX469983.1 Zea mays subsp. mays clone UT3343 G2-like transcription factor mRNA, partial cds
ATGATGTATCATGCGAAGAATTTTTCTGTGCCCTTTGCTCCGCAGAGGGCACAGGATAAT
GAGCATGCAAGTAATATTGGAGGTATTGGTGGACCCAACATAAGCAACCCTGCTAATCCT
GTAGGAAGTGGGAAACAACGGCTACGGTGGACATCGGATCTTCATAATCGCTTTGTGGAT
GCCATCGCCCAGCTTGGTGGACCAGACAGAGCTACACCTAAAGGGGTTCTCACTGTGATG
GGTGTACCAGGGATCACAATTTATCATGTGAAGAGCCATCTGCAGAAGTATCGCCTTGCA
AAGTATATACCCGACTCTCCTGCTGAAGGTTCCAAGGACGAAAAGAAAGATTCGAGTGAT
TCCCTCTCGAACACGGATTCGGCACCAGGATTGCAAATCAATGAGGCACTAAAGATGCAA
ATGGAGGTTCAGAAGCGACTACATGAGCAACTCGAGGTTCAAAGACAACTGCAACTAAGA
ATTGAAGCACAAGGAAGATACTTGCAGATGATCATTGAGGAGCAACAAAAGCTTGGTGGA
TCAATTAAGGCTTCTGAGGATCAGAAGCTTTCTGATTCACCTCCAAGCTTAGATGACTAC
CCAGAGAGCATGCAACCTTCTCCCAAGAAACCAAGGATAGACGCATTATCACCAGATTCA
GAGCGCGATACAACACAACCTGAATTCGAATCCCATTTGATCGGTCCGTGGGATCACGGC
ATTGCATTCCCAGTGGAGGAGTTCAAAGCAGGCCCTGCTATGAGCAAGTCA

>>> shortestSequence(['JQ867090', 'JX317622', 'JX445144', 'JX472277', 'NM_002124', 'JF927163', 'JX308815', 'JQ290344'], 'output.fna')
>>> print(open('output.fna', 'r').read(), end='')
>JX317622.1 Ochlerotatus triseriatus prohibitin-1 mRNA, partial cds
GCGCTGCGTATTCTGTTCCGACCGATTCCGGATCAGCTGCCGAAGATCTACACAATTCTG
GGTCCGGATTACGACGAGCGAGTGCTTCCGTCGATTACGACTGAAGTCCTGAAGGCCGTC
GTGGCCCAGTTCGATGCCGGAGAGTTGATCACCCAGCGTGAGATGGTGTCGCAGAAGGTT
TCCGACGATCTGACCGAACGTGCCGCCCAGTTCGGCGTCATTCTGGATGACATTTCGATT
ACGCATTTGACGTTCGGAAAGGAATTCACGCAGGCCGTTGAAATGAAGCAGGTTGCCCAG
CAGGAAGCCGAGAAGGCCCGGTTCATGGTCGAAAAGGCGGAACAGATGAAGATGGCTGCG
ATCATTTCGGCGGAAGGTGACGCCGAGGCTGCTGCCCTACTGGCGAAATCGTTCGGCGAC
AGCGGAGACGGTTTGGTCGAACTGCGAAGAATCGAAGCGGCCGAGGACATTGCCTACCAG
ATGAGCCGGTCC
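In the sample output above, each sequence line holds at most 60 nucleotides. The following sketch shows one way to pick the shortest of several records and lay it out that way, assuming the records are already available as (label, sequence) pairs. The names format_fasta and write_shortest, the width parameter, and the demo records are hypothetical helpers for illustration, not part of the assignment; fetching the records from GenBank is deliberately left out.

```python
def format_fasta(label, sequence, width=60):
    """Return one FASTA record as text, wrapping the sequence at `width` characters."""
    lines = ['>' + label]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return '\n'.join(lines) + '\n'


def write_shortest(records, path, width=60):
    """Write the shortest (label, sequence) pair to `path` and return its label."""
    # min() returns the first shortest record when several share the minimum length
    label, sequence = min(records, key=lambda record: len(record[1]))
    with open(path, 'w') as handle:
        handle.write(format_fasta(label, sequence, width))
    return label


# hypothetical demo records, not real GenBank entries
demo_records = [
    ('demo1 long hypothetical string', 'ACGT' * 40),
    ('demo2 short hypothetical string', 'ACGT' * 20),
]

# the 80-nucleotide record is shown as one line of 60 and one line of 20
shortest = min(demo_records, key=lambda record: len(record[1]))
print(format_fasta(*shortest), end='')
```

Keeping the formatting separate from the file writing makes the wrapping easy to check without touching the file system.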
Here we can again use the Bio.Entrez module introduced in "GenBank introduction". To fetch the records for particular accession numbers you can use the function Bio.Entrez.efetch(). Besides the id parameter, which takes the accession numbers, it takes two further parameters: the db parameter takes the database to search, and the rettype parameter takes the data format to be returned. For example, we use nucleotide (or nuccore) as the db parameter for GenBank and fasta as the rettype parameter for FASTA format.
The following code illustrates efetch() in action. It obtains plain text records in FASTA format from NCBI's Nucleotide database.
>>> from Bio import Entrez
>>> Entrez.email = 'your_name@your_mail_server.com'
>>> handle = Entrez.efetch(db='nucleotide', id=['FJ817486', 'JX069768', 'JX469983'], rettype='fasta')
>>> records = handle.read()
>>> print(records)
>FJ817486.1 Malus hybrid cultivar flavanone 3-hydroxylase protein (F3H) mRNA, complete cds
CGCGTATTTCGTTTGAGCCAATACCAAGTAGACAGAACCAACAAATTCGACACCAAATATGGCTCCTGCT
ACTACGCTCACATCCATAGCGCATGAGAAAACCCTGCAACAAAAATTTGTCCGAGACGAAGACGAGCGTC
CAAAGGTTGCCTACAACGACTTCAGCAACGAAATTCCGATCATCTCGCTTGCCGGGATCGATGAGGTGGA
AGGCCGCCGGGGCGAGATTTGCAAGAAGATTGTAGCGGCTTGTGAAGACTGGGGTATTTTCCAGATTGTT
GACCATGGGGTTGATGCTGAGCTCATATCGGAAATGACCGGTCTCGCTAGAGAGTTCTTTGCTTTGCCAT
CGGAGGAGAAGCTCCGCTTCGACATGTCCGGTGGCAAAAAGGGTGGCTTCATCGTGTCCAGTCATTTACA
GGGAGAAGCTGTGCAAGATTGGCGTGAAATTGTGACCTACTTTTCATATCCGATTCGTCACCGGGACTAT
TCGAGGTGGCCAGACAAGCCTGAGGCCTGGAGGGAGGTGACAAAGAAGTACAGTGACGAGTTGATGGGGC
TGGCATGCAAGCTCTTGGGCGTTTTATCAGAAGCCATGGGGTTGGATACAGAGGCATTGACAAAGGCATG
TGTGGACATGGACCAAAAAGTCGTCGTGAATTTCTACCCAAAATGCCCTCAGCCCGACCTAACCCTTGGC
CTCAAGCGCCATACCGACCCGGGCACAATTACCCTTCTGCTTCAAGACCAAGTTGGGGGCCTCCAGGCTA
CTCGGGATGATGGGAAAACGTGGATCACCGTTCAACCAGTGGAAGGAGCTTTTGTGGTCAATCTTGGAGA
TCATGGTCATCTTCTGAGCAATGGGAGGTTCAAGAATGCTGATCACCAAGCAGTGGTGAACTCAAACAGC
AGCAGGCTGTCCATAGCCACATTCCAGAACCCAGCGCAAGAAGCAATAGTGTATCCACTCAGTGTGAGGG
AGGGAGAGAAGCCGATTCTCGAGGCGCCAATCACCTACACCGAGATGTACAAGAAGAAGATGAGCAAGGA
TCTTGAGCTCGCCAGGCTGAAAAAACTGGCCAAGGAACAGCAATCGCAGGACTTGGAGAAAGCCAAAGTG
GATACAAAGCCAGTGGACGACATTTTTGCTTAGACTTTTCCAGTCACTTGAAAGCTCTTTGTGGAACTAT
AGCTACTTGTACCTTTTCCTTCCACTTCTTGTACTCGTAACTTCTTTTTGGTGTGCTGGTGGCTTCCCCC
CTAATCTGTTTAAGATCCGTGGTTGTCAAGGGCCCTTATATCCCATATTTAGTTTTTGTTCTTGAATTTT
CATATCAGTTTCTTATCCTCCAACTTAAAAAAAAAAAAAA
>JX069768.1 Momordica charantia carotenoid cleavage dioxygenase 1 (CCD1) mRNA, complete cds
ATGGCGGAGGAGAAGCAGAAGCTCAATGGCGGAGTTGTTGACCGCTCGTTGGTGGAGGTCAATCCCAAGC
CAAGCAAAGGCCTGGCTTCGAAGGCCACGGATTTGTTGGAGAAGCTGTTTGTGAAGCTCATGTATGATGC
TTCAAACCCTCAGCATTATCTTTCCGGTAATTTCGCTCCGGTTCGCGATGAGACGCCTCCGATTACCGAT
CTCCCTGTTAAAGGGTATCTTCCGGAATGCTTAAATGGAGAGTTTGTTAGGGTGGGACCAAATCCGAAGT
TTAGCCCAGTTGCTGGCTATCACTGGTTTGACGGAGATGGCATGATCCATGGACTGCGCATTAAAGATGG
AAAAGCAACATATGTTTCCCGTTATGTGAAGACATCTCGACTTAAACAAGAAGAATATTTTGGAGGTGCT
AAATTCATGAAGATTGGTGATCTCAAAGGGTTCTTTGGGTTAATAATGGTCAATATGCAAATGCTGAGAG
CAAAGTTGAAAGTGTTGGATGTTTCATATGGAACTGGGACAGGTAACACGGCTCTCATATATCATCATGG
GAAGCTGCTTGCACTATCGGAGGCAGATAAACCTTATGTTATAAAAGTGTTGGAGGATGGAGACCTGCAA
ACACTTGGTCTGCTGGATTATGACAAGAGATTAACGCACTCCTTCACTGCTCACCCAAAGGTTGACCCAG
TGACTGGCGAAATGTTTACATTTGGTTATTCCCATTCACCACCATATGTTACTTACAGAGTTATTTCCAA
GGATGGTCTCATGCATGACCCAATACCAATCACAATACCAAACCCGGTCATGATGCATGACTTCGCCATT
ACTGAAAATTATGCAATTTTTATGGATCTTCCTTTATATTTTAAACCCAAGGAATTGGTCAAAGAAAATA
AGTTAATTTTCACATTTGATGCTACTAAAAGAGCACGGTTTGGCGTGCTTCCAAGATATGCAAGAGATGA
TTTGCTTATCCGATGGTTTGAGCTTCCAAATTGTTTTATATTTCATAATGCTAATGCCTGGGAGGAAGGA
GATGAAGTAGTTTTGATTACTTGCCGTCTTGAGAACCCAGACTTGGACATGGTCAGTGGGTCTGTCAAGG
AGAAGCTTGAGAACTTCTCAAATGAGCTGTATGAGATGAGATTCAATCTTAAATCTGGTCGAGCTTCACA
GAAGAAACTATCAGAATCTGCTGTAGATTTTCCTAGAGTGAACGAAAGCTACACTGGCAGGAAACAACAA
TATGTATATGGAACTATACTGGACAGCATTGCAAAAGTCACGGGGATTGCCAAATTTGATCTGAATGCTA
AACCAGAAACTGGAAAAACAAAGATTGAAGTTGGAGGAAATGTTCAGGGCCTCTATGACCCCGGACCTGG
TAGATTTGGTTCTGAAGCTATCTTTGTTCCTCGCATACCTGGCACCACTTCAGAAGAAGATGATGGCTAC
TTAATATTCTTCGTACATGATGAGAACACCGGAAAATCGTCGGTGAATGTCATTGATGCAAAAACTATGT
CAACTGAGCCTGTTGCAGTCGTTGAACTGCCACACAGAGTTCCATACGGGTTTCATGCCTTCTTTGTAAC
AGAGGAGCAACTTCAAGAACAAGAAAGGCTCTGA
>JX469983.1 Zea mays subsp. mays clone UT3343 G2-like transcription factor mRNA, partial cds
ATGATGTATCATGCGAAGAATTTTTCTGTGCCCTTTGCTCCGCAGAGGGCACAGGATAATGAGCATGCAA
GTAATATTGGAGGTATTGGTGGACCCAACATAAGCAACCCTGCTAATCCTGTAGGAAGTGGGAAACAACG
GCTACGGTGGACATCGGATCTTCATAATCGCTTTGTGGATGCCATCGCCCAGCTTGGTGGACCAGACAGA
GCTACACCTAAAGGGGTTCTCACTGTGATGGGTGTACCAGGGATCACAATTTATCATGTGAAGAGCCATC
TGCAGAAGTATCGCCTTGCAAAGTATATACCCGACTCTCCTGCTGAAGGTTCCAAGGACGAAAAGAAAGA
TTCGAGTGATTCCCTCTCGAACACGGATTCGGCACCAGGATTGCAAATCAATGAGGCACTAAAGATGCAA
ATGGAGGTTCAGAAGCGACTACATGAGCAACTCGAGGTTCAAAGACAACTGCAACTAAGAATTGAAGCAC
AAGGAAGATACTTGCAGATGATCATTGAGGAGCAACAAAAGCTTGGTGGATCAATTAAGGCTTCTGAGGA
TCAGAAGCTTTCTGATTCACCTCCAAGCTTAGATGACTACCCAGAGAGCATGCAACCTTCTCCCAAGAAA
CCAAGGATAGACGCATTATCACCAGATTCAGAGCGCGATACAACACAACCTGAATTCGAATCCCATTTGA
TCGGTCCGTGGGATCACGGCATTGCATTCCCAGTGGAGGAGTTCAAAGCAGGCCCTGCTATGAGCAAGTC
A
To work with FASTA format, we can use the Bio.SeqIO module, which provides an interface to input and output methods for different file formats. One of its main functions is Bio.SeqIO.parse(), which takes a handle and a format name as parameters and returns an iterator over the entries as SeqRecord objects.
>>> from Bio import Entrez
>>> from Bio import SeqIO
>>> Entrez.email = 'your_name@your_mail_server.com'
>>> handle = Entrez.efetch(db='nucleotide', id=['FJ817486', 'JX069768', 'JX469983'], rettype='fasta')
>>> records = list(SeqIO.parse(handle, 'fasta'))  # obtain list of SeqRecord objects parsed from FASTA format
>>> print(records[0].id)  # identifier of first SeqRecord object
FJ817486.1
>>> print(len(records[-1].seq))  # sequence length of last SeqRecord object
771