In this exercise we represent protein sequences as strings that contain only uppercase letters. Each letter represents an amino acid within the protein sequence. Trypsin is a serine protease found in the digestive system of humans and many other vertebrates, where it helps to digest food proteins. The enzyme has a very specific function: it only cleaves peptide chains at the carboxyl side of the amino acids lysine (represented by the letter K) or arginine (represented by the letter R). As such, it is often used in laboratories studying protein structures.

High-performance liquid chromatography (HPLC) is a chromatographic technique used to separate the components in a mixture, to identify each component, and to quantify each component. When combined with shotgun tandem mass spectrometric methods, the active proteins within a biological sample may be determined. A trypsin digest is used to cleave the proteins in a sample immediately after every K or R. The individual components that result from the cleavage step are called tryptic peptides. The amino acid sequence of these tryptic peptides may then be determined by means of mass spectrometry. However, most devices have a detection limit that only allows determining the amino acid sequence of peptides with a length between 5 and 50 amino acids. Furthermore, if the last peptide of the protein chain does not end with K or R, it will not be picked up by the mass spectrometer.
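The digest described above can be sketched in a few lines of Python. This is only an illustration, not a prescribed solution: the function name `tryptic_peptides` and its parameters `min_len` and `max_len` are hypothetical, and treating the limits 5 and 50 as inclusive is an assumption. The sketch also assumes the input consists of uppercase letters only.

```python
import re


def tryptic_peptides(protein, min_len=5, max_len=50):
    """Return the detectable tryptic peptides of a protein, in order.

    Hypothetical helper: cleaves after every K or R and keeps only the
    fragments whose length lies within [min_len, max_len] (assumed inclusive).
    """
    # Each match is a run of letters other than K/R followed by one K or R.
    # A trailing fragment that does not end in K or R is never matched,
    # so it is dropped, mirroring what the mass spectrometer would miss.
    fragments = re.findall(r'[^KR]*[KR]', protein)
    return [f for f in fragments if min_len <= len(f) <= max_len]


print(tryptic_peptides('YEMWQHPGEQHKECESAWKVPYCGFITRPCHSHTKECESAWK'))
# ['YEMWQHPGEQHK', 'ECESAWK', 'VPYCGFITR', 'PCHSHTK', 'ECESAWK']
```

The list keeps duplicates and the original order; a caller that only needs to know which peptides occur can wrap the result in `set(...)`.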

Figure: trypsin digest

Software suites such as Unipept are based on large protein databases, containing tryptic peptides taken from more than 29 million known proteins. This online platform can be used to determine both the diversity and the functional activity of a biological sample by comparing the tryptic peptides found in the sample with those found in the database.

Assignment

Define a class proteinDB that can be used to create simple protein databases. These protein databases can be used to look for proteins that contain a given list of tryptic peptides. The objects of the class proteinDB must support at least the following methods:

When implementing these methods, reuse the methods you have already implemented wherever possible.
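One way to read this advice is that adding a whole protein should delegate the bookkeeping to the method that adds a single peptide. The sketch below illustrates that idea under stated assumptions. The method names `addPeptide` and `addProtein` and the attribute `peptides` are taken from the example session, but the parameter names are hypothetical. The validity rule is inferred from the example session rather than given: 5 to 50 uppercase letters, ending in K or R, with no K or R before the last position.

```python
import re

# Assumed validity rule (inferred from the examples, not stated explicitly):
# 4 to 49 uppercase letters other than K and R, followed by a single K or R.
VALID_PEPTIDE = re.compile(r'[A-JL-QS-Z]{4,49}[KR]')


class proteinDB:

    def __init__(self):
        # Maps each tryptic peptide to the set of identifiers of the
        # proteins in which it occurs.
        self.peptides = {}

    def addPeptide(self, protein_id, peptide):
        assert VALID_PEPTIDE.fullmatch(peptide), 'invalid peptide'
        self.peptides.setdefault(peptide, set()).add(protein_id)

    def addProtein(self, protein_id, sequence):
        # Cleave after every K or R; a tail without K or R is never matched.
        for fragment in re.findall(r'[^KR]*[KR]', sequence):
            # Fragments outside the detectable length range are skipped
            # silently; the rest is delegated to addPeptide.
            if 5 <= len(fragment) <= 50:
                self.addPeptide(protein_id, fragment)
```

Because `addProtein` only ever passes on fragments that already satisfy the assumed rule, the assertion in `addPeptide` acts as a safety net there and as the real check for direct calls.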

Example

In the following interactive session we assume that the text file proteins.txt is located in the current directory.

>>> unipept = proteinDB()

>>> unipept.addPeptide('PROT0001', 'ECESAWK')
>>> unipept.peptides
{'ECESAWK': {'PROT0001'}}

>>> unipept.addPeptide('PROT0002', 'WHK')
Traceback (most recent call last):
AssertionError: invalid peptide
>>> unipept.addPeptide('PROT0002', 'ESHLSTLAVQENEIG')
Traceback (most recent call last):
AssertionError: invalid peptide
>>> unipept.addPeptide('PROT0002', 'NWAQNAKIGGADWDCVCR')
Traceback (most recent call last):
AssertionError: invalid peptide

>>> unipept.addProtein('PROT0002', 'HAEWTDNQCCPVLKECESAWKYEMWQHPGEQHKRRRYEMWQHPGEQHKPCHSHTKVWKRY')
>>> unipept.peptides
{'ECESAWK': {'PROT0002', 'PROT0001'}, 'PCHSHTK': {'PROT0002'}, 'HAEWTDNQCCPVLK': {'PROT0002'}, 'YEMWQHPGEQHK': {'PROT0002'}}

>>> unipept.addProtein('PROT0003', 'NRRPCHSHTKECESAWKNRPCHSHTKKPCHSHTKKNRKVWKIPPFFW')
>>> unipept.peptides
{'ECESAWK': {'PROT0003', 'PROT0002', 'PROT0001'}, 'PCHSHTK': {'PROT0003', 'PROT0002'}, 'HAEWTDNQCCPVLK': {'PROT0002'}, 'YEMWQHPGEQHK': {'PROT0002'}}

>>> unipept.addProtein('PROT0004', 'YEMWQHPGEQHKECESAWKVPYCGFITRPCHSHTKECESAWK')
>>> unipept.peptides
{'ECESAWK': {'PROT0004', 'PROT0003', 'PROT0002', 'PROT0001'}, 'PCHSHTK': {'PROT0004', 'PROT0003', 'PROT0002'}, 'HAEWTDNQCCPVLK': {'PROT0002'}, 'VPYCGFITR': {'PROT0004'}, 'YEMWQHPGEQHK': {'PROT0004', 'PROT0002'}}

>>> unipept.identify(['VPYCGFITR'])
['PROT0004']
>>> unipept.identify({'ECESAWK', 'PCHSHTK'})
['PROT0002', 'PROT0003', 'PROT0004']
>>> unipept.identify(('YEMWQHPGEQHK', 'ECESAWK', 'PCHSHTK'))
['PROT0002', 'PROT0004']
>>> unipept.identify({'PCHSHTK', 'VPYCGFITR'})
['PROT0004']

>>> unipept.addProteins('proteins.txt')
>>> unipept.peptides
{'ECESAWK': {'PROT0005', 'PROT0004', 'PROT0003', 'PROT0002', 'PROT0001'}, 'VCEFPWFPMLINDVCR': {'PROT0007'}, 'VPYCGFITR': {'PROT0005', 'PROT0004'}, 'YEMWQHPGEQHK': {'PROT0006', 'PROT0005', 'PROT0004', 'PROT0002'}, 'PCHSHTK': {'PROT0006', 'PROT0005', 'PROT0004', 'PROT0003', 'PROT0002'}, 'HAEWTDNQCCPVLK': {'PROT0002'}, 'CSFHCLEK': {'PROT0006'}, 'AFNYMMPNTK': {'PROT0006'}, 'AYDDEVASFPGCMMATK': {'PROT0007', 'PROT0006'}, 'FIPYYPIYSR': {'PROT0006'}, 'TLCHETMR': {'PROT0005'}, 'HTPNYGVMWMFMNEWMSYDR': {'PROT0006', 'PROT0005'}, 'CDQMHVFDIYMIAIACSWGGPPSLTK': {'PROT0007'}, 'FGHSMTR': {'PROT0005'}}
>>> unipept.identify(('YEMWQHPGEQHK', 'VPYCGFITR', 'ECESAWK'))
['PROT0004', 'PROT0005']
>>> unipept.identify(('PCHSHTK', 'AYDDEVASFPGCMMATK'))
['PROT0006']
>>> unipept.identify(['NEGNLNVMK'])
[]
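The `identify` calls in the session above behave like a set intersection: a protein is reported only if it occurs in the protein set of every given peptide, and the result is sorted. A minimal sketch of that logic follows, written as a standalone function over a dictionary shaped like the `peptides` attribute. The name `identify_proteins` and both parameter names are hypothetical, and returning an empty list for an empty collection of peptides is an assumption, since the session does not show that case.

```python
def identify_proteins(index, peptides):
    """Return the sorted identifiers of proteins containing all given peptides.

    index maps each peptide to a set of protein identifiers;
    peptides may be any iterable (list, tuple, set) of peptide strings.
    """
    # A peptide missing from the index contributes an empty set, which
    # makes the whole intersection empty.
    protein_sets = [index.get(peptide, set()) for peptide in peptides]
    if not protein_sets:
        return []  # assumption: no peptides given means nothing to identify
    return sorted(set.intersection(*protein_sets))


index = {
    'ECESAWK': {'PROT0001', 'PROT0002', 'PROT0003', 'PROT0004'},
    'PCHSHTK': {'PROT0002', 'PROT0003', 'PROT0004'},
    'VPYCGFITR': {'PROT0004'},
}
print(identify_proteins(index, {'ECESAWK', 'PCHSHTK'}))
# ['PROT0002', 'PROT0003', 'PROT0004']
```

Inside the class, the same body could serve as the `identify` method with `self.peptides` in place of `index`.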
