Overview of Tandem Mass Spectrometry (MS/MS) Database Search Algorithms

Source: Internet, 2013-12-31

Abstract

Mass spectrometry-based methods for the identification of proteins are fundamental platform technologies for proteomics. One comprehensive approach is to subject trypsinized peptides to tandem mass spectrometry (MS/MS) to obtain detailed structural information. Different strategies are available for interpreting MS/MS data and hence deducing the amino acid sequence of the peptides. The most common method is to use a search algorithm to identify peptides by correlating experimental and theoretical MS/MS data (the latter being generated from possible peptides in the protein sequence database). Identified peptides are collated and protein entries from the sequence database inferred. This unit focuses on the most widely used tandem MS peptide identification search algorithms (commercial and open source), their availability, ease of use, strengths, speed, and scoring, as well as their relative sensitivity and specificity. Curr. Protoc. Protein Sci. 49:25.2.1-25.2.19. © 2007 by John Wiley & Sons, Inc.

Keywords: SEQUEST; Mascot; X!Tandem; OMSSA; PLGS; Sorcerer; ProteinPilot; Phenyx; SpectrumMill

GO TO THE FULL PROTOCOL:

PDF or HTML at Wiley Online Library

Table of Contents

Introduction
SEQUEST
Spectrum Mill
X!Tandem
Mascot
ProteinLynx Global Server
Phenyx
OMSSA
PEAKS (SPIDER)
ProteinPilot
SEQUEST Sorcerer
Acknowledgements
Literature Cited
Figures
Tables

Figures

Figure 25.2.1 Overview of the MS/MS search process. A peak representing a peptide is isolated in the first stage of analysis. Each isolated peptide is then induced to fragment, possibly by collision, and the second stage of mass analysis is used to collect an MS/MS spectrum. The mass of the peptide to search against a protein or DNA sequence database is determined from the precursor m/z and charge state. For each MS/MS spectrum, software is used to determine which peptide sequence in the database gives the best match to the experimental spectrum. This involves simulating the cleavage specificity of the enzyme, followed by calculation of the mass values expected from the gas‐phase fragmentation of the peptide. A score or probability is assigned to each peptide, and the top 10 peptide scores for each MS/MS spectrum are output.
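
The steps in this caption can be illustrated with a minimal Python sketch. It is not taken from any of the search engines discussed here. The function names, the 0.5 Da tolerance, and the simplified trypsin rule (cleave after K or R unless followed by P, no missed cleavages) are assumptions made for illustration, and only singly charged b and y ions are generated.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O added to the summed residues
PROTON = 1.00728   # mass of the charge-carrying proton


def neutral_mass(mz, charge):
    """Neutral peptide mass from the precursor m/z and charge state."""
    return (mz - PROTON) * charge


def tryptic_peptides(protein):
    """Simplified trypsin rule (assumption): cleave after K/R unless P follows."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        last = i == len(protein) - 1
        if last or (aa in "KR" and protein[i + 1] != "P"):
            peptides.append(protein[start:i + 1])
            start = i + 1
    return peptides


def peptide_mass(seq):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[a] for a in seq) + WATER


def fragment_ions(seq):
    """Singly charged b and y ion m/z values expected for a peptide."""
    b, y = [], []
    running = 0.0
    for a in seq[:-1]:                 # b ions grow from the N terminus
        running += RESIDUE_MASS[a]
        b.append(running + PROTON)
    running = 0.0
    for a in reversed(seq[1:]):        # y ions grow from the C terminus
        running += RESIDUE_MASS[a]
        y.append(running + WATER + PROTON)
    return b, y


def candidates(protein, mz, charge, tol=0.5):
    """Peptides whose mass matches the precursor within tol Da (hypothetical tolerance)."""
    target = neutral_mass(mz, charge)
    return [p for p in tryptic_peptides(protein)
            if abs(peptide_mass(p) - target) <= tol]
```

A search engine would then score the `fragment_ions` of each candidate against the observed spectrum; the scoring step is where the algorithms covered in this unit differ.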

Figure 25.2.2 Calculation of the SEQUEST preliminary score (Sp). The search algorithm performs data reduction on the experimental MS/MS spectrum and then compares it with the constructed (simulated) spectrum of each candidate peptide sequence using an empirically derived formula based on the notion of shared peak counts. Bonus points are awarded for consecutive matches within an ion series as well as for matching immonium ions.
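
As a rough illustration of a shared-peak-count score with bonuses, the Python sketch below uses the form commonly quoted for Sp: summed matched intensity times the number of matched ions, scaled by (1 + beta) and (1 + rho), and divided by the number of predicted ions. The function name, the 0.5 Da tolerance, and the 0.075 bonus per consecutive match are assumptions for illustration, not values confirmed by this unit; `rho` stands in for the immonium-ion bonus and is left to the caller.

```python
def preliminary_score(observed, predicted, tol=0.5, beta_step=0.075, rho=0.0):
    """Shared-peak-count score with bonuses (illustrative sketch only).

    observed  -- list of (m/z, intensity) pairs from the reduced spectrum
    predicted -- fragment m/z values of one ion series, in sequence order
    tol       -- matching tolerance in Da (assumed value)
    beta_step -- bonus added per consecutive match (assumed value)
    rho       -- bonus for matched immonium ions, supplied by the caller
    """
    if not predicted:
        return 0.0
    matched_intensity = 0.0
    n_matched = 0
    consecutive = 0
    prev_matched = False
    for mz in predicted:
        hits = [inten for obs_mz, inten in observed if abs(obs_mz - mz) <= tol]
        if hits:
            matched_intensity += max(hits)   # take the strongest peak in the window
            n_matched += 1
            if prev_matched:                 # this ion and the previous one both matched
                consecutive += 1
            prev_matched = True
        else:
            prev_matched = False
    beta = beta_step * consecutive
    return matched_intensity * n_matched * (1 + beta) * (1 + rho) / len(predicted)
```

With three of four predicted ions matched, two of them adjacent, the consecutive-match bonus raises the score by 7.5% over the plain shared-peak product.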

Figure 25.2.3 Calculation of the SEQUEST cross‐correlation score (XCorr). The search algorithm performs data reduction on the experimental MS/MS spectrum by dividing the spectrum into 10 equal segments and normalizing each segment to an abundance of 50. Candidate peptide sequences that are to be compared are constructed in such a way as to resemble the experimental spectrum (y and b ions are normalized to 50; a ions and neutral losses are normalized to 10). Fast Fourier transforms (FFT) are used to compare the simulated spectrum with the experimental one. While this approach is computationally intensive, it has been found to be very sensitive. A deltaCn value (the relative score difference between the first‐ and second‐ranked peptides) is calculated to indicate the significance of the match.
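
The data reduction and correlation described in this caption can be sketched in plain Python. Two departures are deliberate: the correlation is computed as a direct sum rather than with an FFT, which gives the same values more slowly, and the background window of ±75 bins is an assumption taken from common descriptions of the score. The 1 Da bin width and all function names are likewise hypothetical.

```python
def bin_spectrum(peaks, n_bins):
    """Place (m/z, intensity) peaks into unit-width bins, keeping the maximum per bin."""
    bins = [0.0] * n_bins
    for mz, inten in peaks:
        idx = int(round(mz))
        if 0 <= idx < n_bins:
            bins[idx] = max(bins[idx], inten)
    return bins


def normalize_segments(bins, n_segments=10, top=50.0):
    """Split into n_segments windows and scale each so its tallest peak equals top."""
    if not bins:
        return []
    out = [0.0] * len(bins)
    size = -(-len(bins) // n_segments)       # ceiling division
    for start in range(0, len(bins), size):
        seg = bins[start:start + size]
        peak = max(seg)
        if peak > 0:
            for i, v in enumerate(seg):
                out[start + i] = v * top / peak
    return out


def correlation(x, y, shift):
    """Dot product of x with y displaced by shift bins (direct sum, no FFT)."""
    n = len(x)
    return sum(x[i] * y[i + shift] for i in range(n) if 0 <= i + shift < n)


def xcorr(experimental, theoretical, max_shift=75):
    """Zero-shift correlation minus the mean correlation over nonzero shifts."""
    zero = correlation(experimental, theoretical, 0)
    shifts = [s for s in range(-max_shift, max_shift + 1) if s != 0]
    background = sum(correlation(experimental, theoretical, s) for s in shifts)
    return zero - background / len(shifts)


def delta_cn(scores):
    """Relative difference between the best and second-best scores."""
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2 or ranked[0] <= 0:
        return 0.0
    return (ranked[0] - ranked[1]) / ranked[0]
```

Subtracting the mean off-register correlation is what penalizes spectra that match a candidate only by chance alignment of dense peaks.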

Figure 25.2.4 The Spectrum Mill search parameters Web‐interface page. To speed up searches, spectra can be filtered and discarded prior to searching by enabling the Sequence tag length checkbox as well as Minimum detected peaks. An Identity or Homology search mode is available, allowing for a potential unknown modification in a peptide sequence.

Figure 25.2.5 Derivation of the X!Tandem hyperscore and Expectation value. (A) Histogram distribution of hyperscores of all matching peptides. Dot‐product scores are converted to hyperscores by multiplying the score by the factorials of the numbers of matching b and y ions (based on the hypergeometric distribution). The high‐scoring correct match is indicated by a circle. (B) The data in the right half of the histogram (shaded portion in A) are log‐transformed, and any scores higher than the intersection with zero are assumed to be significant. (C) Extrapolation of the linear regression of the data in (B) to high‐scoring peptide matches. The log‐transformed Expectation value (E‐value) for the given example (circle; hyperscore of 83) is −8.2 (i.e., E‐value = e^−8.2).
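
The three panels can be mimicked with a short Python sketch. It is an illustration of the idea, not X!Tandem's code: the function names are hypothetical, the "right half" of the histogram is approximated by scores at or above the median, and natural logarithms are used only because the caption writes the result as a power of e. The fit takes a list of scores on whatever scale the caller has chosen.

```python
import math


def hyperscore(dot_product, n_b, n_y):
    """Dot product multiplied by the factorials of the matched b and y ion counts."""
    return dot_product * math.factorial(n_b) * math.factorial(n_y)


def fit_tail(scores):
    """Least-squares line through (score, ln(count of scores >= score)).

    Only scores at or above the median are used (an assumed stand-in for
    the shaded right half of the histogram). Returns (slope, intercept).
    """
    ordered = sorted(scores)
    median = ordered[len(ordered) // 2]
    xs = sorted({s for s in ordered if s >= median})
    if len(xs) < 2:
        raise ValueError("need at least two distinct tail scores")
    ys = [math.log(sum(1 for s in ordered if s >= x)) for x in xs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def log_expectation(score, slope, intercept):
    """Value of the fitted line at score; negative values lie beyond the zero crossing."""
    return intercept + slope * score


def expectation(score, slope, intercept):
    """Expected number of random matches scoring at least this high."""
    return math.exp(log_expectation(score, slope, intercept))
```

For example, `slope, intercept = fit_tail(all_scores)` followed by `expectation(best_score, slope, intercept)` extrapolates the fitted line to a top-ranked match, as in panel C.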

Figure 25.2.6 The Mascot Daemon (Windows) application interface showing the Task Editor pane. Mascot Daemon enables local and remote automation as well as security. Multiple MS/MS datasets, acquired on many different mass spectrometry instruments, can be searched simultaneously. Searches can be scheduled and follow‐up tasks created, allowing iterative searches. For example, unidentified MS/MS spectra can be re‐searched against other databases or searched using an alternative set of parameters (e.g., no enzyme).

Figure 25.2.7 The method definition screen for the Paragon Algorithm. The left half of the screen, under Describe Sample, contains fields that can be derived directly from what was done in the laboratory and from attributes of the sample, such as ‘species of origin.’ A full database can also be searched without a species filter by selecting None in the species field. The right side of the screen contains fields that control what is desired from the analysis. Large sets of biological modifications and substitutions can be considered in the search if a ‘Thorough’ search mode is selected. All modifications and their importance are determined automatically, and mass tolerance controls are managed by the algorithm based on the Instrument setting.
