In this exercise we represent protein sequences as strings that contain only uppercase letters. Each letter represents an amino acid within the protein sequence. Trypsin is a serine protease found in the digestive system of humans and many other vertebrates, where it helps to digest food proteins. The enzyme has a very specific function: it only cleaves peptide [1] chains at the carboxyl [2] side of the amino acids [3] lysine [4] (represented by the letter K) or arginine [5] (represented by the letter R). As such, it is often used in laboratories studying protein structures.
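
The cleavage rule described above can be sketched in Python as follows. This is a minimal sketch rather than a reference solution: the function name `trypsin` is borrowed from the Example section, and the behavior (cut immediately after every K or R, and keep any leftover residues as a final peptide) is an assumption inferred from the text and those examples.

```python
def trypsin(protein):
    """Split a protein string after every K or R (assumed cleavage rule)."""
    peptides = []
    current = ''
    for amino_acid in protein:
        current += amino_acid
        # Trypsin cuts at the carboxyl side of K and R, so the peptide
        # ends with that residue and a new peptide starts after it.
        if amino_acid in 'KR':
            peptides.append(current)
            current = ''
    # Residues after the last K or R form a final, uncleaved peptide.
    if current:
        peptides.append(current)
    return peptides
```

Under these assumptions, a protein that ends in K or R yields no trailing empty peptide, and a protein without any K or R is returned as a single peptide.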

High-performance liquid chromatography (HPLC) is a chromatographic [6] technique used to separate the components in a mixture and to identify and quantify each component. When combined with shotgun tandem mass spectrometric methods, it can be used to determine the active proteins within a biological sample. A trypsin digest is used to cleave the proteins in a sample immediately after every K or R. The fragments that result from the cleavage step are called tryptic peptides. The amino acid sequence of these tryptic peptides may then be determined by means of mass spectrometry. However, most devices have a detection limit that only allows them to determine the amino acid sequence of peptides with a length between 5 and 50 amino acids. Furthermore, if the last peptide of the protein chain does not end with K or R, it will not be picked up by the mass spectrometer.
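
The two detection constraints in this paragraph can be sketched as a filter over a list of tryptic peptides. The function name `detectable` is hypothetical, and treating the bounds 5 and 50 as inclusive is an assumption, since the text only says "between 5 and 50". The input list is the digest of the first protein shown in the Example section.

```python
def detectable(peptides, min_length=5, max_length=50):
    """Keep peptides that a mass spectrometer could pick up (assumed rules)."""
    return [
        peptide
        for peptide in peptides
        # Assumption: the length bounds are inclusive.
        if min_length <= len(peptide) <= max_length
        # Only the last peptide of a digest can lack a terminal K or R,
        # so testing every peptide is equivalent to testing the last one.
        and peptide[-1] in 'KR'
    ]


digest = ['NR', 'R', 'PCHSHTK', 'ECESAWK', 'NR', 'PCHSHTK', 'K',
          'PCHSHTK', 'K', 'NR', 'K', 'VWK', 'IPPFFW']
print(detectable(digest))
```

Note that 'IPPFFW' is long enough but is dropped because it does not end with K or R, while 'VWK' ends with K but is too short.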

Figure: trypsin digest

Software suites such as Unipept [7] are based on large protein databases containing tryptic peptides taken from more than 29 million known proteins. This online platform can be used to determine both the diversity and the functional activity of a biological sample by comparing the tryptic peptides found in the sample with those found in the database.

Assignment

Example

>>> trypsin('NRRPCHSHTKECESAWKNRPCHSHTKKPCHSHTKKNRKVWKIPPFFW')
['NR', 'R', 'PCHSHTK', 'ECESAWK', 'NR', 'PCHSHTK', 'K', 'PCHSHTK', 'K', 'NR', 'K', 'VWK', 'IPPFFW']
>>> trypsin('HAEWTDNQCCPVLKECESAWKYEMWQHPGEQHKRRRYEMWQHPGEQHKPCHSHTKVWKRY')
['HAEWTDNQCCPVLK', 'ECESAWK', 'YEMWQHPGEQHK', 'R', 'R', 'R', 'YEMWQHPGEQHK', 'PCHSHTK', 'VWK', 'R', 'Y']

>>> massSpectrometer('NRRPCHSHTKECESAWKNRPCHSHTKKPCHSHTKKNRKVWKIPPFFW')
['PCHSHTK', 'ECESAWK', 'PCHSHTK', 'PCHSHTK']
>>> massSpectrometer('HAEWTDNQCCPVLKECESAWKYEMWQHPGEQHKRRRYEMWQHPGEQHKPCHSHTKVWKRY')
['HAEWTDNQCCPVLK', 'ECESAWK', 'YEMWQHPGEQHK', 'YEMWQHPGEQHK', 'PCHSHTK']

Resources