In this exercise we represent protein sequences as strings that only contain uppercase letters. Each letter represents an amino acid within the protein sequence. Trypsin is a serine protease found in the digestive system of humans and many other vertebrates, where it helps to digest food proteins. The enzyme has a very specific function: it only cleaves peptide chains at the carboxyl side of the amino acids lysine (represented by the letter K) or arginine (represented by the letter R). As such, it is often used in laboratories studying protein structures.
High-performance liquid chromatography (HPLC) is a chromatographic technique used to separate the components in a mixture, and to identify and quantify each component. When HPLC is combined with shotgun tandem mass spectrometry, the active proteins within a biological sample can be determined. A trypsin digest is used to cleave the proteins in a sample after every K or R. The individual components that result from the cleavage step are called tryptic peptides. The amino acid sequence of these tryptic peptides can then be determined by means of mass spectrometry. However, most devices have a detection limit that only allows the amino acid sequence to be determined for peptides with a length between 5 and 50 amino acids. Further, if the last peptide of the protein chain does not end with K or R, it will not be picked up by the mass spectrometer.
Software suites such as Unipept are based on large protein databases, containing tryptic peptides taken from more than 29 million known proteins. This online platform can be used to determine both the diversity and the functional activity of a biological sample by comparing the tryptic peptides found in the sample with those found in the database.
Write a function trypsin that takes a protein sequence as its argument. The function must return the list of tryptic peptides that results from cleaving the given protein sequence by trypsin. The order of the peptides in the list should correspond to the order of the peptides in the protein sequence.
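One possible way to approach the function trypsin is sketched below. It is a minimal sketch, not a reference solution: it assumes the argument only contains uppercase letters, and the local names peptides, current and amino_acid are illustrative choices that do not come from the assignment.

```python
def trypsin(protein):
    """Return the list of tryptic peptides of a protein sequence.

    Assumption: protein is a string of uppercase letters.
    """
    peptides = []   # peptides found so far, in sequence order
    current = ''    # peptide that is currently being built

    for amino_acid in protein:
        current += amino_acid
        # trypsin cleaves right after every lysine (K) or arginine (R)
        if amino_acid in 'KR':
            peptides.append(current)
            current = ''

    # the last peptide may not end with K or R, but it is still a peptide
    if current:
        peptides.append(current)

    return peptides
```

The final check keeps a trailing fragment that does not end with K or R, and avoids adding an empty string when the protein itself ends with K or R.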
Write a function massSpectrometer that takes a protein sequence as its argument. Analogous to the function trypsin, the function must return the list of tryptic peptides that results from cleaving the given protein sequence by trypsin. However, only those tryptic peptides that fall within the detection limit of a mass spectrometer (length between 5 and 50 amino acids, limits included; ending with K or R) must be included in the list.
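The filtering step for massSpectrometer could be sketched as follows. To keep the block self-contained, it uses a small hypothetical helper cleave (not a name from the assignment) that performs the cleavage with a regular expression; any working cleavage routine could be used in its place. The constants MIN_LENGTH and MAX_LENGTH are likewise illustrative names for the limits stated above.

```python
import re

MIN_LENGTH = 5    # shortest detectable peptide (limit included)
MAX_LENGTH = 50   # longest detectable peptide (limit included)


def cleave(protein):
    """Hypothetical helper: split a protein after every K or R.

    Each match is either a run of non-K/R letters closed by one K or R,
    or a trailing run of non-K/R letters at the end of the sequence.
    """
    return re.findall(r'[^KR]*[KR]|[^KR]+', protein)


def massSpectrometer(protein):
    """Return the tryptic peptides a mass spectrometer can detect."""
    return [
        peptide
        for peptide in cleave(protein)
        # keep peptides of detectable length that end with K or R
        if MIN_LENGTH <= len(peptide) <= MAX_LENGTH and peptide[-1] in 'KR'
    ]
```

The list comprehension preserves the order of the peptides in the protein sequence, so no sorting is needed afterwards.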
>>> trypsin('NRRPCHSHTKECESAWKNRPCHSHTKKPCHSHTKKNRKVWKIPPFFW')
['NR', 'R', 'PCHSHTK', 'ECESAWK', 'NR', 'PCHSHTK', 'K', 'PCHSHTK', 'K', 'NR', 'K', 'VWK', 'IPPFFW']
>>> trypsin('HAEWTDNQCCPVLKECESAWKYEMWQHPGEQHKRRRYEMWQHPGEQHKPCHSHTKVWKRY')
['HAEWTDNQCCPVLK', 'ECESAWK', 'YEMWQHPGEQHK', 'R', 'R', 'R', 'YEMWQHPGEQHK', 'PCHSHTK', 'VWK', 'R', 'Y']
>>> massSpectrometer('NRRPCHSHTKECESAWKNRPCHSHTKKPCHSHTKKNRKVWKIPPFFW')
['PCHSHTK', 'ECESAWK', 'PCHSHTK', 'PCHSHTK']
>>> massSpectrometer('HAEWTDNQCCPVLKECESAWKYEMWQHPGEQHKRRRYEMWQHPGEQHKPCHSHTKVWKRY')
['HAEWTDNQCCPVLK', 'ECESAWK', 'YEMWQHPGEQHK', 'YEMWQHPGEQHK', 'PCHSHTK']