Four commonly used protein databases

Proteins are identified in different labs around the world, and data about them is gathered into freely accessible databases. A central repository for protein data is UniProt¹, a comprehensive high-quality database established by an international consortium. UniProt provides detailed protein annotation, including function description, domain structure, and post-translational modifications. It also supports protein similarity search, taxonomy analysis, and literature citations.

Screenshot of the UniProt homepage.

UniProt is part of ExPASy, the resource portal of the Swiss Institute of Bioinformatics. ExPASy provides access to numerous databases and software tools in different areas of the life sciences.

The most important component of UniProt is the UniProt Knowledgebase² (or UniProtKB for short), a protein database partially curated by experts. UniProtKB comprises two major parts:

Swiss-Prot: a manually annotated and reviewed, non-redundant protein sequence database. Each Swiss-Prot entry holds all relevant information about a particular protein, including data taken from the scientific literature and the results of computational analysis of the protein.
TrEMBL (stands for "Translated EMBL"): an automatically annotated database that is not reviewed. TrEMBL is not a true protein database; instead, it holds translated versions of nucleic acid sequences taken from multiple sources, including the European Molecular Biology Laboratory³ (EMBL) database⁴, worldwide genome sequencing projects, and the coding regions of genes accessed from GenBank⁵.

The National Center for Biotechnology Information⁶ (NCBI) also maintains a Protein Database⁷. Data equivalent to that of Swiss-Prot is part of the NCBI's RefSeq⁸ database, a curated collection of genetic sequences. RefSeq records are annotated by NCBI personnel, and they provide reliable information for genomic DNA along with RNA transcribed from DNA and the corresponding translated proteins. NCBI's equivalent project to TrEMBL is the NCBI Protein database, which is part of the larger GenBank database and holds unannotated protein sequences. If a nucleotide sequence contained in GenBank codes for a protein, then the corresponding amino acid sequence is automatically annotated and included in the NCBI database with its own protein ID. The NCBI databases also recognize UniProt IDs as search terms.

Because Swiss-Prot annotation provides so much information, NCBI protein records usually provide links to corresponding Swiss-Prot entries whenever possible.

Assignment

You can see a complete description of a protein in the UniProt Knowledgebase⁹ by entering its UniProt access ID into the site's query field. Equivalently, you may simply insert its ID (uniprot_id) directly into a UniProt hyperlink as follows:

http://www.uniprot.org/uniprot/uniprot_id

For example, the data for protein B5ZC00 can be found at http://www.uniprot.org/uniprot/B5ZC00¹⁰.

Swiss-Prot holds protein data as a structured .txt file. You can obtain it by simply adding .txt to the link:

http://www.uniprot.org/uniprot/uniprot_id.txt
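
As a small illustration, the two link patterns above can be assembled with a helper function. This is only a sketch: the name uniprot_url and its as_text parameter are hypothetical and not part of any library.

```python
def uniprot_url(uniprot_id, as_text=False):
    """Build the link to a UniProt entry, following the two patterns above.

    uniprot_url and as_text are illustrative names (assumptions),
    not functions or parameters of UniProt or Biopython.
    """
    base = 'http://www.uniprot.org/uniprot/' + uniprot_id
    # Appending '.txt' requests the structured text version of the entry.
    return base + '.txt' if as_text else base
```

Calling uniprot_url('B5ZC00') gives the link of the example entry, and uniprot_url('B5ZC00', as_text=True) gives the link to its text file.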

Your task:

Write a function biologicalProcesses that takes the UniProt ID of a protein (a string). The function must return the set of all biological processes in which the protein is involved.
Write a function molecularFunctions that takes the UniProt ID of a protein (a string). The function must return the set of all molecular functions of the given protein.
Write a function cellularComponents that takes the UniProt ID of a protein (a string). The function must return the set of all cellular components where the given protein is active.

Biological processes, molecular functions and cellular components are found as subsections of the protein's "Gene Ontology" (GO) section.
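
In the .txt version of an entry, GO annotations appear on cross-reference lines that start with DR   GO;, and the term name is prefixed with P: (biological process), F: (molecular function) or C: (cellular component). The following is a minimal sketch of how such lines could be filtered. The function name go_terms is hypothetical, and the sample lines are hand-written in the flat-file layout rather than copied from a real entry.

```python
def go_terms(flat_text, aspect):
    """Collect GO term names of one aspect ('P', 'F' or 'C') from flat-file text.

    go_terms is an illustrative helper (an assumption), not a library function.
    """
    terms = set()
    for line in flat_text.splitlines():
        # Only cross-reference lines that point to the GO database are relevant.
        if not line.startswith('DR   GO;'):
            continue
        # Drop the 5-character line code, then split the ';'-separated fields,
        # e.g. ['GO', 'GO:0006310', 'P:DNA recombination', 'IEA:UniProtKB-KW.']
        fields = [field.strip() for field in line[5:].split(';')]
        if len(fields) >= 3 and fields[2].startswith(aspect + ':'):
            terms.add(fields[2][2:])  # strip the 'P:', 'F:' or 'C:' prefix
    return terms


# Hand-written sample lines (illustrative only, not a real UniProt entry).
SAMPLE = (
    'DR   EMBL; AF079160; AAC28386.2; -; Genomic_DNA.\n'
    'DR   GO; GO:0003697; F:single-stranded DNA binding; IEA:UniProtKB-KW.\n'
    'DR   GO; GO:0006310; P:DNA recombination; IEA:UniProtKB-KW.\n'
    'DR   GO; GO:0006281; P:DNA repair; IEA:UniProtKB-KW.\n'
)
```

With this sample, go_terms(SAMPLE, 'P') collects the two process names and go_terms(SAMPLE, 'C') is empty. Splitting on ';' is a simplification that assumes term names contain no semicolons.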

Example

>>> biologicalProcesses('Q5SLP9')
{'DNA recombination', 'DNA repair', 'DNA replication'}

>>> molecularFunctions('Q5SLP9')
{'single-stranded DNA binding'}

>>> cellularComponents('A0A010Q1J5')
{'fungal-type vacuole membrane', 'Golgi apparatus', 'Vps55/Vps68 complex', 'integral component of membrane', 'late endosome'}

Programming shortcut

ExPASy databases can be accessed automatically via Biopython's Bio.ExPASy module. The function .get_sprot_raw will find a target protein by its ID.

We can obtain data from an entry by using the SwissProt module. The read() function handles a single SwissProt record, and parse() allows you to read multiple records at a time. Let's get the data for the Q5SLP9 protein:

>>> from Bio import ExPASy
>>> from Bio import SwissProt
>>> handle = ExPASy.get_sprot_raw('Q5SLP9') # you can give several IDs separated by commas
>>> record = SwissProt.read(handle)         # use SwissProt.parse for multiple proteins

>>> dir(record)
[..., 'accessions', 'annotation_update', 'comments', 'created', 'cross_references', 'data_class', 'description', 'entry_name', 'features', 'gene_name', 'host_organism', 'host_taxonomy_id', 'keywords', 'molecule_type', 'organelle', 'organism', 'organism_classification', 'references', 'seqinfo', 'sequence', 'sequence_length', 'sequence_update', 'taxonomy_id']

To see the list of references to other databases, we can check the .cross_references attribute of our record:

>>> record.cross_references[0]
('EMBL', 'AF079160', 'AAC28386.2', '-', 'Genomic_DNA')
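
Each cross-reference is a tuple whose first element names the target database, so GO annotations can be picked out by filtering on 'GO'. The sketch below works on a plain list of such tuples, so it runs without network access. The helper name go_names is hypothetical, and the GO tuples in the sample are hand-written under the assumption that their third field holds the prefixed term name (for example 'P:DNA recombination').

```python
def go_names(cross_references, prefix):
    """Return the GO term names whose third field starts with prefix, e.g. 'P:'.

    go_names is an illustrative helper (an assumption), not part of Biopython.
    """
    return {
        ref[2][len(prefix):]              # drop the 'P:', 'F:' or 'C:' prefix
        for ref in cross_references
        if len(ref) > 2 and ref[0] == 'GO' and ref[2].startswith(prefix)
    }


# Sample in the shape of record.cross_references; the GO tuples are
# hand-written for illustration, not copied from a real record.
sample_refs = [
    ('EMBL', 'AF079160', 'AAC28386.2', '-', 'Genomic_DNA'),
    ('GO', 'GO:0003697', 'F:single-stranded DNA binding', 'IEA:UniProtKB-KW'),
    ('GO', 'GO:0006310', 'P:DNA recombination', 'IEA:UniProtKB-KW'),
]
```

The same function could be applied to record.cross_references from a parsed record, with 'P:', 'F:' or 'C:' selecting biological processes, molecular functions or cellular components.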