Imagine that we have an algorithm generating the set of all peptides scoring at least threshold against a spectral vector Spectrum'. We will henceforth call this set of high-scoring peptides a spectral dictionary, denoted Dictionarythreshold(Spectrum'). For a PSM (Peptide, Spectrum'), we will use the term PSM dictionary, denoted Dictionary(Peptide, Spectrum'), to refer to the spectral dictionary DictionaryScore(Peptide, Spectrum')(Spectrum).
Say we have a spectral dictionary consisting of a single amino acid string Peptide, which we will attempt to match against a randomly generated string DecoyProteome of length n. Because DecoyProteome was randomly generated, the probability that Peptide matches the string beginning at a given position of DecoyProteome is 1/20|Peptide|. We call this expression the probability of Peptide.
If the spectral dictionary consists of more than one peptide, then we define the probability of the dictionary as
$$\displaystyle\sum_{\text{each peptide Peptide in Dictionary}} \frac{1}{20^{|\textit{Peptide}|}}.$$
We will first compute the number of peptides in a spectral dictionary, since this simpler problem will provide insights on how to compute the probability of a spectral dictionary.
We will use dynamic programming to find the size of a spectral Dictionary problem. Given a spectral vector Spectrum' = (s1, . . . , sm), we define its i-prefix (for i between 1 and m) as Spectrum'i = (s1, . . . , si ) and introduce a variable Size(i, t) as the number of peptides Peptide of mass i such that Score(Peptide, Spectrum'i) is equal to t.
The key to establishing a recurrence relation for computing Size(i, t) is to realize that the set of peptides contributing to Size(i, t) can be split into 20 subsets depending on their final amino acid a. Each peptide ending in a specific amino acid a results in a shorter peptide with mass i − |a| and score t − si if we remove a from the peptide (here, |a| denotes the mass of a). Thus,
$$\textit{Size}(i, t) = \displaystyle\sum_{\text{all amino acids }a} \textit{Size}(i-|a|, t-s_i).$$
Since there is a single “empty” peptide of length zero, we initialize Size(0, 0) = 1. We also define Size(0, t) = 0 for all possible scores t, and set Size(i, t) = 0 for negative values of i. Using the above recurrence, we can compute the size of a spectral dictionary of Spectrum' = (s1, . . . , sm) as
$$\left|\textit{Dictionary}_{\textit{threshold}}({\textit{Spectrum}})\right|= \displaystyle\sum_{t \geq \textit{threshold}} \textit{Size}(m, t).$$
Given a file with a spectral vector Spectrum', an integer threshold, and an integer max_score.
Return the size of the dictionary Dictionarythreshold(Spectrum').
Use the provided max_score for the height of your table. Your answer should be the number of peptides whose score is at least threshold and at most max_score.
>>> spectral_dictionary_size('data01.txt'1, 15, 62) 1 >>> spectral_dictionary_size('data02.txt'2, 35, 55) 0 >>> spectral_dictionary_size('data03.txt'3, 28, 77) 355