Same data, different formats

A number of different data presentation formats have been used to represent genetic strings. The history of file formats presents its own kind of evolution: some formats have died out, being replaced by more successful ones. Three file formats are currently the most popular.

A simple reference on file formats can be found here.

In this assignment, we will familiarize ourselves with FASTA. We will save the other two formats for later problems. In FASTA format, a string is introduced by a line that begins with a greater-than symbol (>), followed by some information labeling the string. Subsequent lines contain the string itself. The next line beginning with > indicates that the current string is complete and starts the label of the next string in the file.
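
The FASTA rules described above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the function name parse_fasta and the toy records are made up for this example, and blank lines and any text before the first > line are simply skipped.

```python
def parse_fasta(text):
    """Return a list of (label, sequence) pairs from FASTA-formatted text."""
    records = []
    label, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines between records
        if line.startswith('>'):
            if label is not None:         # a new label completes the previous string
                records.append((label, ''.join(chunks)))
            label, chunks = line[1:], []
        elif label is not None:
            chunks.append(line)           # sequence lines are concatenated
    if label is not None:                 # do not forget the last record
        records.append((label, ''.join(chunks)))
    return records


# toy data, not taken from GenBank
example = """>seq1 first toy string
ACGT
ACG
>seq2 second toy string
TTGA
"""
for label, sequence in parse_fasta(example):
    print(label, len(sequence))
```

In practice the Bio.SeqIO module discussed further down does this job and handles more corner cases.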

GenBank hosts its own file format for storing genome data, containing a large amount of information about each interval of DNA. The GenBank file describes the interval's source, taxonomic position, authors, and features.

An example of a GenBank record header.

A sample GenBank entry can be found here. You may export an entry to a variety of file formats by selecting the appropriate file format under the Send To: dropdown menu at the top of the page.

Assignment

GenBank can be accessed here. A detailed description of the GenBank format can be found here.

Write a function shortestSequence that takes two arguments: i) a collection of $$n$$ ($$1 \leq n \leq 10$$) GenBank accession numbers and ii) a string containing a file path. The function must write the shortest of the DNA strings associated with the given accession numbers to a FASTA file with the given path.

Example

>>> shortestSequence(['FJ817486', 'JX069768', 'JX469983'], 'output.fna')
>>> print(open('output.fna', 'r').read(), end='')
>JX469983.1 Zea mays subsp. mays clone UT3343 G2-like transcription factor mRNA, partial cds
ATGATGTATCATGCGAAGAATTTTTCTGTGCCCTTTGCTCCGCAGAGGGCACAGGATAAT
GAGCATGCAAGTAATATTGGAGGTATTGGTGGACCCAACATAAGCAACCCTGCTAATCCT
GTAGGAAGTGGGAAACAACGGCTACGGTGGACATCGGATCTTCATAATCGCTTTGTGGAT
GCCATCGCCCAGCTTGGTGGACCAGACAGAGCTACACCTAAAGGGGTTCTCACTGTGATG
GGTGTACCAGGGATCACAATTTATCATGTGAAGAGCCATCTGCAGAAGTATCGCCTTGCA
AAGTATATACCCGACTCTCCTGCTGAAGGTTCCAAGGACGAAAAGAAAGATTCGAGTGAT
TCCCTCTCGAACACGGATTCGGCACCAGGATTGCAAATCAATGAGGCACTAAAGATGCAA
ATGGAGGTTCAGAAGCGACTACATGAGCAACTCGAGGTTCAAAGACAACTGCAACTAAGA
ATTGAAGCACAAGGAAGATACTTGCAGATGATCATTGAGGAGCAACAAAAGCTTGGTGGA
TCAATTAAGGCTTCTGAGGATCAGAAGCTTTCTGATTCACCTCCAAGCTTAGATGACTAC
CCAGAGAGCATGCAACCTTCTCCCAAGAAACCAAGGATAGACGCATTATCACCAGATTCA
GAGCGCGATACAACACAACCTGAATTCGAATCCCATTTGATCGGTCCGTGGGATCACGGC
ATTGCATTCCCAGTGGAGGAGTTCAAAGCAGGCCCTGCTATGAGCAAGTCA

>>> shortestSequence(['JQ867090', 'JX317622', 'JX445144', 'JX472277', 'NM_002124', 'JF927163', 'JX308815', 'JQ290344'], 'output.fna')
>>> print(open('output.fna', 'r').read(), end='')
>JX317622.1 Ochlerotatus triseriatus prohibitin-1 mRNA, partial cds
GCGCTGCGTATTCTGTTCCGACCGATTCCGGATCAGCTGCCGAAGATCTACACAATTCTG
GGTCCGGATTACGACGAGCGAGTGCTTCCGTCGATTACGACTGAAGTCCTGAAGGCCGTC
GTGGCCCAGTTCGATGCCGGAGAGTTGATCACCCAGCGTGAGATGGTGTCGCAGAAGGTT
TCCGACGATCTGACCGAACGTGCCGCCCAGTTCGGCGTCATTCTGGATGACATTTCGATT
ACGCATTTGACGTTCGGAAAGGAATTCACGCAGGCCGTTGAAATGAAGCAGGTTGCCCAG
CAGGAAGCCGAGAAGGCCCGGTTCATGGTCGAAAAGGCGGAACAGATGAAGATGGCTGCG
ATCATTTCGGCGGAAGGTGACGCCGAGGCTGCTGCCCTACTGGCGAAATCGTTCGGCGAC
AGCGGAGACGGTTTGGTCGAACTGCGAAGAATCGAAGCGGCCGAGGACATTGCCTACCAG
ATGAGCCGGTCC

Programming shortcut

Here we can again use the Bio.Entrez module introduced in "GenBank introduction". To fetch the records for particular accession numbers, you can use the function Bio.Entrez.efetch(), called here with the keyword arguments db, id and rettype: the db parameter names the database to query, the id parameter takes one or more accession numbers, and the rettype parameter names the data format to be returned. For example, we use nucleotide (or nuccore) as the db parameter for GenBank and fasta as the rettype parameter for FASTA format.

The following code illustrates efetch() in action. It obtains plain text records in FASTA format from NCBI's Nucleotide database.

>>> from Bio import Entrez
>>> Entrez.email = 'your_name@your_mail_server.com'
>>> handle = Entrez.efetch(db='nucleotide', id=['FJ817486', 'JX069768', 'JX469983'], rettype='fasta')
>>> records = handle.read()
>>> print(records)
>FJ817486.1 Malus hybrid cultivar flavanone 3-hydroxylase protein (F3H) mRNA, complete cds
CGCGTATTTCGTTTGAGCCAATACCAAGTAGACAGAACCAACAAATTCGACACCAAATATGGCTCCTGCT
ACTACGCTCACATCCATAGCGCATGAGAAAACCCTGCAACAAAAATTTGTCCGAGACGAAGACGAGCGTC
CAAAGGTTGCCTACAACGACTTCAGCAACGAAATTCCGATCATCTCGCTTGCCGGGATCGATGAGGTGGA
AGGCCGCCGGGGCGAGATTTGCAAGAAGATTGTAGCGGCTTGTGAAGACTGGGGTATTTTCCAGATTGTT
GACCATGGGGTTGATGCTGAGCTCATATCGGAAATGACCGGTCTCGCTAGAGAGTTCTTTGCTTTGCCAT
CGGAGGAGAAGCTCCGCTTCGACATGTCCGGTGGCAAAAAGGGTGGCTTCATCGTGTCCAGTCATTTACA
GGGAGAAGCTGTGCAAGATTGGCGTGAAATTGTGACCTACTTTTCATATCCGATTCGTCACCGGGACTAT
TCGAGGTGGCCAGACAAGCCTGAGGCCTGGAGGGAGGTGACAAAGAAGTACAGTGACGAGTTGATGGGGC
TGGCATGCAAGCTCTTGGGCGTTTTATCAGAAGCCATGGGGTTGGATACAGAGGCATTGACAAAGGCATG
TGTGGACATGGACCAAAAAGTCGTCGTGAATTTCTACCCAAAATGCCCTCAGCCCGACCTAACCCTTGGC
CTCAAGCGCCATACCGACCCGGGCACAATTACCCTTCTGCTTCAAGACCAAGTTGGGGGCCTCCAGGCTA
CTCGGGATGATGGGAAAACGTGGATCACCGTTCAACCAGTGGAAGGAGCTTTTGTGGTCAATCTTGGAGA
TCATGGTCATCTTCTGAGCAATGGGAGGTTCAAGAATGCTGATCACCAAGCAGTGGTGAACTCAAACAGC
AGCAGGCTGTCCATAGCCACATTCCAGAACCCAGCGCAAGAAGCAATAGTGTATCCACTCAGTGTGAGGG
AGGGAGAGAAGCCGATTCTCGAGGCGCCAATCACCTACACCGAGATGTACAAGAAGAAGATGAGCAAGGA
TCTTGAGCTCGCCAGGCTGAAAAAACTGGCCAAGGAACAGCAATCGCAGGACTTGGAGAAAGCCAAAGTG
GATACAAAGCCAGTGGACGACATTTTTGCTTAGACTTTTCCAGTCACTTGAAAGCTCTTTGTGGAACTAT
AGCTACTTGTACCTTTTCCTTCCACTTCTTGTACTCGTAACTTCTTTTTGGTGTGCTGGTGGCTTCCCCC
CTAATCTGTTTAAGATCCGTGGTTGTCAAGGGCCCTTATATCCCATATTTAGTTTTTGTTCTTGAATTTT
CATATCAGTTTCTTATCCTCCAACTTAAAAAAAAAAAAAA

>JX069768.1 Momordica charantia carotenoid cleavage dioxygenase 1 (CCD1) mRNA, complete cds
ATGGCGGAGGAGAAGCAGAAGCTCAATGGCGGAGTTGTTGACCGCTCGTTGGTGGAGGTCAATCCCAAGC
CAAGCAAAGGCCTGGCTTCGAAGGCCACGGATTTGTTGGAGAAGCTGTTTGTGAAGCTCATGTATGATGC
TTCAAACCCTCAGCATTATCTTTCCGGTAATTTCGCTCCGGTTCGCGATGAGACGCCTCCGATTACCGAT
CTCCCTGTTAAAGGGTATCTTCCGGAATGCTTAAATGGAGAGTTTGTTAGGGTGGGACCAAATCCGAAGT
TTAGCCCAGTTGCTGGCTATCACTGGTTTGACGGAGATGGCATGATCCATGGACTGCGCATTAAAGATGG
AAAAGCAACATATGTTTCCCGTTATGTGAAGACATCTCGACTTAAACAAGAAGAATATTTTGGAGGTGCT
AAATTCATGAAGATTGGTGATCTCAAAGGGTTCTTTGGGTTAATAATGGTCAATATGCAAATGCTGAGAG
CAAAGTTGAAAGTGTTGGATGTTTCATATGGAACTGGGACAGGTAACACGGCTCTCATATATCATCATGG
GAAGCTGCTTGCACTATCGGAGGCAGATAAACCTTATGTTATAAAAGTGTTGGAGGATGGAGACCTGCAA
ACACTTGGTCTGCTGGATTATGACAAGAGATTAACGCACTCCTTCACTGCTCACCCAAAGGTTGACCCAG
TGACTGGCGAAATGTTTACATTTGGTTATTCCCATTCACCACCATATGTTACTTACAGAGTTATTTCCAA
GGATGGTCTCATGCATGACCCAATACCAATCACAATACCAAACCCGGTCATGATGCATGACTTCGCCATT
ACTGAAAATTATGCAATTTTTATGGATCTTCCTTTATATTTTAAACCCAAGGAATTGGTCAAAGAAAATA
AGTTAATTTTCACATTTGATGCTACTAAAAGAGCACGGTTTGGCGTGCTTCCAAGATATGCAAGAGATGA
TTTGCTTATCCGATGGTTTGAGCTTCCAAATTGTTTTATATTTCATAATGCTAATGCCTGGGAGGAAGGA
GATGAAGTAGTTTTGATTACTTGCCGTCTTGAGAACCCAGACTTGGACATGGTCAGTGGGTCTGTCAAGG
AGAAGCTTGAGAACTTCTCAAATGAGCTGTATGAGATGAGATTCAATCTTAAATCTGGTCGAGCTTCACA
GAAGAAACTATCAGAATCTGCTGTAGATTTTCCTAGAGTGAACGAAAGCTACACTGGCAGGAAACAACAA
TATGTATATGGAACTATACTGGACAGCATTGCAAAAGTCACGGGGATTGCCAAATTTGATCTGAATGCTA
AACCAGAAACTGGAAAAACAAAGATTGAAGTTGGAGGAAATGTTCAGGGCCTCTATGACCCCGGACCTGG
TAGATTTGGTTCTGAAGCTATCTTTGTTCCTCGCATACCTGGCACCACTTCAGAAGAAGATGATGGCTAC
TTAATATTCTTCGTACATGATGAGAACACCGGAAAATCGTCGGTGAATGTCATTGATGCAAAAACTATGT
CAACTGAGCCTGTTGCAGTCGTTGAACTGCCACACAGAGTTCCATACGGGTTTCATGCCTTCTTTGTAAC
AGAGGAGCAACTTCAAGAACAAGAAAGGCTCTGA

>JX469983.1 Zea mays subsp. mays clone UT3343 G2-like transcription factor mRNA, partial cds
ATGATGTATCATGCGAAGAATTTTTCTGTGCCCTTTGCTCCGCAGAGGGCACAGGATAATGAGCATGCAA
GTAATATTGGAGGTATTGGTGGACCCAACATAAGCAACCCTGCTAATCCTGTAGGAAGTGGGAAACAACG
GCTACGGTGGACATCGGATCTTCATAATCGCTTTGTGGATGCCATCGCCCAGCTTGGTGGACCAGACAGA
GCTACACCTAAAGGGGTTCTCACTGTGATGGGTGTACCAGGGATCACAATTTATCATGTGAAGAGCCATC
TGCAGAAGTATCGCCTTGCAAAGTATATACCCGACTCTCCTGCTGAAGGTTCCAAGGACGAAAAGAAAGA
TTCGAGTGATTCCCTCTCGAACACGGATTCGGCACCAGGATTGCAAATCAATGAGGCACTAAAGATGCAA
ATGGAGGTTCAGAAGCGACTACATGAGCAACTCGAGGTTCAAAGACAACTGCAACTAAGAATTGAAGCAC
AAGGAAGATACTTGCAGATGATCATTGAGGAGCAACAAAAGCTTGGTGGATCAATTAAGGCTTCTGAGGA
TCAGAAGCTTTCTGATTCACCTCCAAGCTTAGATGACTACCCAGAGAGCATGCAACCTTCTCCCAAGAAA
CCAAGGATAGACGCATTATCACCAGATTCAGAGCGCGATACAACACAACCTGAATTCGAATCCCATTTGA
TCGGTCCGTGGGATCACGGCATTGCATTCCCAGTGGAGGAGTTCAAAGCAGGCCCTGCTATGAGCAAGTC
A

To work with FASTA format, we can use the Bio.SeqIO module, which provides an interface to input and output methods for different file formats. One of its main functions is Bio.SeqIO.parse(), which takes a handle and format name as parameters and returns entries as SeqRecord objects.

>>> from Bio import Entrez
>>> from Bio import SeqIO
>>> Entrez.email = 'your_name@your_mail_server.com'
>>> handle = Entrez.efetch(db='nucleotide', id=['FJ817486', 'JX069768', 'JX469983'], rettype='fasta')
>>> records = list(SeqIO.parse(handle, 'fasta'))    # parse FASTA text into a list of SeqRecord objects
>>> print(records[0].id)                            # identifier of first SeqRecord object
FJ817486.1
>>> print(len(records[-1].seq))                     # sequence length of last SeqRecord object
771
771
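
Note that the example output files wrap each sequence at 60 characters per line, whereas the FASTA text returned by efetch() above uses 70. The sketch below shows, on toy data and without any network access, how the shortest string could be selected and written with a given line width. The names format_fasta and shortest_record and the (label, sequence) tuples are assumptions made for this illustration, not part of Biopython.

```python
def format_fasta(label, sequence, width=60):
    """Return one FASTA record as text, wrapping the sequence at `width` characters."""
    lines = ['>' + label]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return '\n'.join(lines) + '\n'


def shortest_record(records):
    """Return the (label, sequence) pair with the shortest sequence.

    If several sequences share the minimal length, the first one is returned.
    """
    return min(records, key=lambda record: len(record[1]))


# toy records, not taken from GenBank
toy = [
    ('a first toy string', 'ACGT' * 40),   # 160 characters
    ('b second toy string', 'AC' * 35),    # 70 characters
    ('c third toy string', 'G' * 200),     # 200 characters
]
label, sequence = shortest_record(toy)
print(format_fasta(label, sequence), end='')
```

With SeqRecord objects the same selection can be made by comparing len(record.seq) instead of the length of a plain string.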