
Overview of Tandem Mass Spectrometry (MS/MS) Database Search Algorithms


Mass spectrometry‐based methods for the identification of proteins are fundamental platform technologies for proteomics. One comprehensive approach is to subject trypsinized peptides to tandem mass spectrometry (MS/MS) to obtain detailed structural information. Different strategies are available for interpreting MS/MS data and hence deducing the amino acid sequence of the peptides. The most common method is to use a search algorithm to identify peptides by correlating experimental and theoretical MS/MS data (the latter being generated from possible peptides in the protein sequence database). Identified peptides are collated and protein entries from the sequence database inferred. This unit focuses on the most widely used tandem MS peptide identification search algorithms (commercial and open source), their availability, ease of use, strengths, speed and scoring, as well as their relative sensitivity and specificity. Curr. Protoc. Protein Sci. 49:25.2.1–25.2.19. © 2007 by John Wiley & Sons, Inc.

Keywords: SEQUEST; Mascot; X!Tandem; OMSSA; PLGS; Sorcerer; ProteinPilot; Phenyx; SpectrumMill

  • Introduction
  • SEQUEST
  • Spectrum Mill
  • X!Tandem
  • Mascot
  • ProteinLynx Global Server
  • Phenyx
  • OMSSA
  • PEAKS (SPIDER)
  • ProteinPilot
  • SEQUEST Sorcerer
  •   Figure 25.2.1 Overview of the MS/MS search process. A peak representing a peptide is isolated in the first stage of analysis. Each isolated peptide is then induced to fragment, possibly by collision, and the second stage of mass analysis is used to collect an MS/MS spectrum. The mass of the peptide to search against a protein or DNA sequence database is determined from the precursor m/z and charge state. For each MS/MS spectrum, software is used to determine which peptide sequence in the database gives the best match to the experimental spectrum. This involves simulating the cleavage specificity of the enzyme, followed by calculation of the mass values expected from the gas‐phase fragmentation of the peptide. A score or probability is assigned to each peptide and the top 10 peptide scores for each MS/MS spectrum are output.
  •   Figure 25.2.2 Calculation of the SEQUEST preliminary score (Sp). The search algorithm performs data reduction on the experimental MS/MS spectrum and then compares it with the constructed (simulated) spectra of candidate peptide sequences using an empirically derived formula based on the notion of shared peak counts. Bonus points are awarded for consecutive matches within an ion series as well as for matching immonium ions.
  •   Figure 25.2.3 Calculation of the SEQUEST cross‐correlation score (XCorr). The search algorithm performs data reduction on the experimental MS/MS spectrum by dividing the spectrum into 10 equal segments and normalizing each segment to an abundance of 50. The spectra of candidate peptide sequences are constructed so as to resemble the experimental spectrum (y and b ions are normalized to 50, a ions and neutral losses are normalized to 10). Fast Fourier transforms (FFT) are used to compare the simulated spectrum with the experimental one. While this approach is computationally intensive, it has been found to be very sensitive. A deltaCn value (relative score difference between the 1st and 2nd ranked peptides) is calculated to indicate the significance of the match.
  •   Figure 25.2.4 The Spectrum Mill search parameters Web‐interface page. To speed up searches, spectra can be filtered and discarded prior to searching by enabling the Sequence tag length checkbox as well as Minimum detected peaks. An Identity or Homology search mode is available, the latter allowing for a potential unknown modification in a peptide sequence.
  •   Figure 25.2.5 Derivation of the X!Tandem hyperscore and Expectation value. (A) Histogram distribution of hyperscores of all matching peptides. Dot‐product scores are converted to hyperscores by multiplying the score by the factorials of the numbers of matching b and y ions (based on the hypergeometric distribution). The high‐scoring correct match is indicated by a circle. (B) The data in the right half of the histogram (shaded portion in A) are log‐transformed and any scores higher than the intersection with zero are assumed to be significant. (C) Extrapolation of the linear regression of the data in (B) to high‐scoring peptide matches. The log of the Expectation value (E‐value) for the given example (circle; hyperscore of 83) is −8.2 (i.e., E‐value = e^−8.2).
  •   Figure 25.2.6 The Mascot Daemon (Windows) application interface showing the Task Editor pane. Mascot Daemon enables local and remote automation as well as security. Multiple MS/MS datasets, acquired on many different mass spectrometry instruments, can be searched simultaneously. Searches can be scheduled and follow‐up tasks created, allowing iterative searches. For example, unidentified MS/MS spectra can be re‐searched against other databases or searched using an alternative set of parameters (e.g., no enzyme).
  •   Figure 25.2.7 The method definition screen for the Paragon Algorithm. The left half of the screen, under Describe Sample, contains fields that can be derived directly from what was done in the laboratory and from attributes of the sample, such as ‘species of origin.’ A full database can also be searched without a species filter by selecting None in the species field. The right side of the screen contains fields that control what is desired from the analysis. Large sets of biological modifications and substitutions can be considered in the search if the ‘Thorough’ search mode is selected. All modifications and their importance are determined automatically, and mass tolerance controls are managed by the algorithm based on the Instrument setting.
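
The search process summarized in Figure 25.2.1 can be illustrated with a minimal Python sketch. It is not the implementation of any of the search engines named above. It assumes tryptic cleavage after K or R (but not before P) with no missed cleavages, singly charged b and y fragment ions only, and a plain shared peak count as the score. The function names, the tolerances, and the abbreviated residue mass table are hypothetical choices made for the illustration.

```python
# Monoisotopic residue masses in Da (abbreviated table, illustration only;
# a residue missing from this table raises KeyError).
MONO = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
        "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
        "E": 129.04259, "R": 156.10111}
WATER = 18.01056
PROTON = 1.00728


def neutral_mass(precursor_mz, charge):
    """Neutral peptide mass from the precursor m/z and charge state."""
    return (precursor_mz - PROTON) * charge


def tryptic_peptides(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(seq):
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(MONO[aa] for aa in seq) + WATER


def fragment_ions(seq):
    """Singly charged b and y ion m/z values expected from fragmentation."""
    ions = []
    for i in range(1, len(seq)):
        ions.append(sum(MONO[aa] for aa in seq[:i]) + PROTON)           # b_i
        ions.append(sum(MONO[aa] for aa in seq[-i:]) + WATER + PROTON)  # y_i
    return ions


def shared_peak_count(observed_mz, theoretical_mz, tol=0.5):
    """Number of theoretical peaks lying within tol of any observed peak."""
    return sum(any(abs(t - o) <= tol for o in observed_mz)
               for t in theoretical_mz)


def identify(precursor_mz, charge, observed_mz, protein,
             precursor_tol=1.0, fragment_tol=0.5):
    """Score candidates whose mass matches the precursor; best match first."""
    target = neutral_mass(precursor_mz, charge)
    hits = []
    for pep in tryptic_peptides(protein):
        if abs(peptide_mass(pep) - target) <= precursor_tol:
            score = shared_peak_count(observed_mz, fragment_ions(pep),
                                      fragment_tol)
            hits.append((score, pep))
    return sorted(hits, reverse=True)
```

Calling `identify(238.63938, 2, fragment_ions("VTEK"), "GASKPLRVTEK")` digests the toy protein into GASKPLR and VTEK, keeps only VTEK because its mass agrees with the precursor, and scores it against the supplied peak list. Real search engines replace the shared peak count with the scoring schemes described in the later figures.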

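
The hyperscore and Expectation value derivation outlined in Figure 25.2.5 can be sketched in the same way. The code below follows only the caption: the dot‐product score is multiplied by the factorials of the matched b and y ion counts, and a straight line fitted to the log‐transformed right half of the score histogram is extrapolated to a high‐scoring match. The integer binning, the use of the histogram mode to define the right half, the natural logarithm, and the function names are assumptions made for the illustration, not details of the X!Tandem source code.

```python
import math
from collections import Counter


def hyperscore(dot_product, n_b, n_y):
    """Dot-product score times the factorials of matched b and y ion counts."""
    return dot_product * math.factorial(n_b) * math.factorial(n_y)


def log_expectation(all_scores, candidate_score):
    """Extrapolate a linear fit of ln(count) vs. score to candidate_score.

    all_scores are assumed to be already on the scale used for the histogram.
    The return value is the natural log of the estimated Expectation value.
    """
    # Histogram with integer-wide bins (assumed bin width).
    hist = Counter(math.floor(s) for s in all_scores)
    # Right half of the histogram: bins at or above the most populated bin.
    mode = max(hist, key=hist.get)
    xs = sorted(b for b in hist if b >= mode)
    ys = [math.log(hist[b]) for b in xs]
    if len(xs) < 2:
        raise ValueError("need at least two populated bins to fit a line")
    # Ordinary least-squares fit of ys against xs.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope * candidate_score + intercept
```

A strongly negative return value corresponds to a small Expectation value, that is, a match unlikely to have arisen by chance under the fitted distribution of scores.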