We have thus far worked with theoretical spectra of cyclic peptides, in which the mass of every subpeptide is given. However, mass spectrometers generate spectra that are far from ideal — they are characterized by having both false masses and missing masses. A false mass is present in the experimental spectrum but absent from the theoretical spectrum; a missing mass is present in the theoretical spectrum but absent from the experimental spectrum.
To generalize finding a cyclic peptide with theoretical spectrum matching an ideal spectrum to handle “noisy” spectra having false and missing masses, we need to relax the requirement that a candidate peptide’s theoretical spectrum must match the experimental spectrum exactly, and instead incorporate a scoring function that will select the peptide whose theoretical spectrum matches the given experimental spectrum the most closely. Given a cyclic peptide Peptide
and a spectrum Spectrum
, we define score(Peptide, Spectrum)
as the number of masses shared between cyclospectrum(Peptide)
and Spectrum
. E.g. for Spectrum = {0, 99, 113, 114, 128, 227, 257, 299, 355, 356, 370, 371, 484}
, score("NQEL", Spectrum) = 11
.
To limit the number of candidate peptides under consideration, we will use a leaderboard, which holds the N highest scoring candidate peptides for further extension. At each step, we will expand all candidate peptides found in the leaderboard by adding every possible amino acid to the end. Then, we will eliminate those peptides whose newly calculated scores are not high enough to keep them on the leaderboard. This idea is similar to the notion of a “cut” in a golf tournament; after the cut, only the top N golfers are allowed to play in the next round, since they are the only players who have a reasonable chance of winning.
To be fair, a cut should include anyone who is tied with the Nth-place competitor. Thus, the leaderboard should be trimmed down to the “N highest-scoring peptides including ties”, which may include more than N peptides. Given an integer N, a list of peptides Leaderboard
and a spectrum Spectrum
, the algorithm Trim(N, Leaderboard, Spectrum)
shown below returns the top N highest-scoring peptides in Leaderboard
(including ties) with respect to Spectrum
. It sorts all peptides in Leaderboard
according to their scores, resulting in a sorted Leaderboard
. It then retains the top N scoring peptides including ties, and removes all other peptides from Leaderboard
.
Trim(N, Leaderboard, Spectrum)
for j ← 1 to |Leaderboard|
Peptide ← j-th peptide in Leaderboard
LinearScores(j) ← LinearScore(Peptide, Spectrum)
sort Leaderboard according to the decreasing order of scores in LinearScores
sort LinearScores in decreasing order
for j ← N + 1 to |Leaderboard|
if LinearScores(j) < LinearScores(N)
remove all peptides starting from the j-th peptide from Leaderboard
return Leaderboard
return Leaderboard
Below is the algorithm LeaderboardCyclopeptideSequencing
. In what follows, the 0-peptide is the peptide “” containing no amino acids.
LeaderboardCyclopeptideSequencing(N, Spectrum)
Leaderboard ← {0-peptide}
LeaderPeptide ← 0-peptide
while Leaderboard is non-empty
Leaderboard ← Expand(Leaderboard)
for each Peptide in Leaderboard
if Mass(Peptide) = ParentMass(Spectrum)
if Score(Peptide, Spectrum) > Score(LeaderPeptide, Spectrum)
LeaderPeptide ← Peptide
else if Mass(Peptide) > ParentMass(Spectrum)
remove Peptide from Leaderboard
Leaderboard ← Trim(N, Leaderboard, Spectrum)
output LeaderPeptide
Write a Python function leaderpeptide
that takes an integer N and a collection of integers Spectrum
and returns the leaderpeptide obtained by the above algorithm.
>>> leaderpeptide(10, [0, 71, 113, 129, 147, 200, 218, 260, 313, 331, 347, 389, 460])
'113-147-71-129'
>>> leaderpeptide(325, [0, 71, 71, 71, 87, 97, 97, 99, 101, 103, 113, 113, 114, 115, 128, 128, 129, 137, 147, 163, 163, 170, 184, 184, 186, 186, 190, 211, 215, 226, 226, 229, 231, 238, 241, 244, 246, 257, 257, 276, 277, 278, 299, 300, 312, 316, 317, 318, 318, 323, 328, 340, 343, 344, 347, 349, 356, 366, 370, 373, 374, 391, 401, 414, 414, 415, 419, 427, 427, 431, 437, 441, 446, 453, 462, 462, 462, 470, 472, 502, 503, 503, 511, 515, 529, 530, 533, 533, 540, 543, 547, 556, 559, 569, 574, 575, 584, 590, 600, 600, 604, 612, 616, 617, 630, 640, 640, 643, 646, 648, 660, 671, 683, 684, 687, 693, 703, 703, 719, 719, 719, 729, 730, 731, 737, 740, 741, 745, 747, 754, 774, 780, 784, 790, 797, 800, 806, 818, 826, 827, 832, 833, 838, 846, 846, 847, 850, 868, 869, 877, 884, 889, 893, 897, 903, 908, 913, 917, 930, 940, 947, 956, 960, 960, 961, 964, 965, 966, 983, 983, 985, 1002, 1009, 1010, 1011, 1021, 1031, 1031, 1036, 1053, 1054, 1058, 1059, 1062, 1063, 1074, 1076, 1084, 1092, 1103, 1113, 1122, 1124, 1130, 1133, 1134, 1145, 1146, 1146, 1149, 1150, 1155, 1156, 1171, 1173, 1174, 1187, 1191, 1193, 1200, 1212, 1221, 1233, 1240, 1242, 1246, 1259, 1260, 1262, 1277, 1278, 1283, 1284, 1287, 1287, 1288, 1299, 1300, 1303, 1309, 1311, 1320, 1330, 1341, 1349, 1357, 1359, 1370, 1371, 1374, 1375, 1379, 1380, 1397, 1402, 1402, 1412, 1422, 1423, 1424, 1431, 1448, 1450, 1450, 1467, 1468, 1469, 1472, 1473, 1473, 1477, 1486, 1493, 1503, 1516, 1520, 1525, 1530, 1536, 1540, 1544, 1549, 1556, 1564, 1565, 1583, 1586, 1587, 1587, 1595, 1600, 1601, 1606, 1607, 1615, 1627, 1633, 1636, 1643, 1649, 1653, 1659, 1679, 1686, 1688, 1692, 1693, 1696, 1702, 1703, 1704, 1714, 1714, 1714, 1730, 1730, 1740, 1746, 1749, 1750, 1762, 1773, 1785, 1787, 1790, 1793, 1793, 1803, 1816, 1817, 1821, 1829, 1833, 1833, 1843, 1849, 1858, 1859, 1864, 1877, 1886, 1890, 1893, 1900, 1900, 1903, 1904, 1918, 1922, 1930, 1930, 1931, 1961, 1963, 1971, 1971, 1971, 1980, 1987, 1992, 1996, 2002, 2006, 2006, 2014, 2018, 2019, 2019, 2032, 2042, 2059, 2060, 2063, 2067, 2077, 2084, 2086, 2089, 2090, 2093, 2105, 2110, 2115, 2115, 2116, 2117, 2121, 2133, 2134, 2155, 2156, 2157, 2176, 2176, 2187, 2189, 2192, 2195, 2202, 2204, 2207, 2207, 2218, 2222, 2243, 2247, 2247, 2249, 2249, 2263, 2270, 2270, 2286, 2296, 2304, 2305, 2305, 2318, 2319, 2320, 2320, 2330, 2332, 2334, 2336, 2336, 2346, 2362, 2362, 2362, 2433])
'97-129-97-147-99-71-186-71-113-163-115-71-113-128-103-87-128-101-137-163-114'