The world's biggest genetic database

The most comprehensive database available to molecular biologists is GenBank, an open access resource that contains an annotated collection of all publicly available DNA sequences and their protein translations. GenBank was established in 1982 and is maintained by the National Center for Biotechnology Information (NCBI). Over the last three decades, the data it houses have grown exponentially, doubling roughly every 18 months. As of August 2012, GenBank contained over 143 billion nucleobases.

Every sequence has a unique GenBank identifier that will directly retrieve its full sequence record. Here are some examples of database IDs:

CAA79696
NP_778203
263191547
BC043443
NM_002020

You can also search by submitter/author name in the format last_name first_initials (e.g., Smith JR). To search for an exact phrase, enclose the phrase in quotation marks:

"contactin associated protein"
"duchenne muscular dystrophy"

You can also narrow your search with Boolean operators (AND, OR, NOT) or restrict it to specific subsets of records.
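
As a rough illustration, the sketch below assembles such search strings in Python. The helper names quote_phrase and combine are hypothetical and not part of any NCBI tool; they only produce the text you would otherwise type into the search box.

```python
def quote_phrase(phrase):
    """Wrap a phrase in double quotes so it is searched as an exact match."""
    # strip surrounding whitespace and any quotes that are already present
    return '"{}"'.format(phrase.strip().strip('"'))


def combine(operator, *terms):
    """Join search terms with one Boolean operator (AND, OR or NOT)."""
    operator = operator.upper()
    if operator not in ('AND', 'OR', 'NOT'):
        raise ValueError('unknown Boolean operator: ' + operator)
    return (' ' + operator + ' ').join(terms)


query = combine(
    'OR',
    quote_phrase('contactin associated protein'),
    quote_phrase('duchenne muscular dystrophy'),
)
print(query)  # "contactin associated protein" OR "duchenne muscular dystrophy"
```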

Assignment

GenBank comprises several subdivisions:

Nucleotide¹: a collection of nucleic acid sequences from several sources
Genome Survey Sequence² (GSS): uncharacterized short genomic sequences
Expressed Sequence Tags³ (EST): uncharacterized short cDNA sequences

Searching the Nucleotide database with general text queries will produce the most relevant results. You can also use a simple query based on protein name, gene name or gene symbol.

To limit your search to only certain kinds of records, you can search using GenBank's Limits⁴ page or alternatively use the Filter your results field to select categories of records after a search.

If you cannot find what you are searching for, check how the database interpreted your query by inspecting the Search details field on the right side of the page. This field shows how your search was automatically translated into standard keywords.

For example, if you search for Drosophila, the Search details field will contain

Drosophila[All Fields]

and you will obtain all entries that mention Drosophila (including all its endosymbionts). You can restrict your search to only organisms belonging to the Drosophila genus by using a search tag and searching for

Drosophila[Organism]

Write a function recordCount that takes an organism name (string) as its first argument. The function also has an optional second parameter publicationDate that takes either a datetime.date object or a tuple of two datetime.date objects. The function must return the number of records that have been deposited in GenBank with the given organism name. If a single date is passed to the parameter publicationDate, the count must be restricted to records published on the given date. If a tuple of two dates is passed, the count must be restricted to records published between the two given dates.

Example

>>> from datetime import date
>>> recordCount('Anthoxanthum')
84506
>>> recordCount('Anthoxanthum', publicationDate=date(2003, 8, 2))
4
>>> recordCount('Anthoxanthum', publicationDate=(date(2003, 7, 25), date(2005, 12, 27)))
7
>>> recordCount('Stenosemella', publicationDate=(date(2000, 11, 9), date(2012, 8, 9)))
22

Programming shortcut

NCBI's databases, such as PubMed, GenBank, GEO, and many others, can be accessed via Entrez, a data retrieval system offered by NCBI. For direct access to Entrez, you can use Biopython's Bio.Entrez module.

The Bio.Entrez.esearch() function searches any of the NCBI databases. Its two most important arguments are:

db: the database to search; for example, nucleotide for GenBank or pubmed for PubMed
term: the search term for the "Query" field; search tags can be used here

We will now demonstrate a quick search for the rbcL gene in corn (Zea mays):

>>> from Bio import Entrez
>>> Entrez.email = 'your_name@your_mail_server.com'
>>> handle = Entrez.esearch(db='nucleotide', term='"Zea mays"[Organism] AND rbcL[Gene]')
>>> record = Entrez.read(handle)
>>> int(record['Count'])
6  # this count will change over time because GenBank is constantly updated
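
The term argument can also carry an organism and publication date restriction of the kind recordCount needs. The sketch below only builds the term string and makes no request. The function name build_term is hypothetical, and the sketch assumes that the [Publication Date] search tag accepts a range of YYYY/MM/DD dates separated by a colon; check the Search details field to confirm how NCBI interprets the result.

```python
from datetime import date


def build_term(organism, publication_date=None):
    """Build a search term for an organism, optionally limited by date.

    Assumption: [Publication Date] accepts a start:end range of
    YYYY/MM/DD dates.
    """
    term = '"{}"[Organism]'.format(organism)
    if publication_date is None:
        return term
    if isinstance(publication_date, date):
        # a single date is treated as a range that starts and ends that day
        start = end = publication_date
    else:
        start, end = publication_date
    return '{} AND {}:{}[Publication Date]'.format(
        term, start.strftime('%Y/%m/%d'), end.strftime('%Y/%m/%d'))


print(build_term('Anthoxanthum', (date(2003, 7, 25), date(2005, 12, 27))))
# "Anthoxanthum"[Organism] AND 2003/07/25:2005/12/27[Publication Date]
```

The resulting string can be passed as the term argument of Entrez.esearch() in the same way as in the rbcL example.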

Note that when you query the Entrez databases, you must obey NCBI's requirements⁵:

for any series of more than 100 requests, access the database on the weekend or outside peak times in the US
make no more than three requests every second
fill in the Entrez.email field so that NCBI can contact you if there is a problem
be sensible with your usage levels; if you want to download whole mammalian genomes, use NCBI's FTP⁶.
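
One conservative way to respect the limit of three requests per second is to leave at least a third of a second between consecutive requests. The sketch below shows this idea with a hypothetical Throttle class that is not part of Biopython; it makes no network calls itself and only pauses before each request you send.

```python
import time


class Throttle:
    """Keep at least period / max_calls seconds between consecutive calls."""

    def __init__(self, max_calls=3, period=1.0):
        self.min_interval = period / max_calls
        self.last_call = None

    def wait(self):
        """Sleep just long enough to keep the minimum interval between calls."""
        now = time.monotonic()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                time.sleep(remaining)
        self.last_call = time.monotonic()


throttle = Throttle(max_calls=3, period=1.0)
for organism in ('Anthoxanthum', 'Stenosemella'):
    throttle.wait()  # call this right before each Entrez request
    print('ready to query', organism)
```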