PepArML: A Meta‐Search Peptide Identification Platform for Tandem Mass Spectra
Abstract
The PepArML meta‐search peptide identification platform for tandem mass spectra provides a unified search interface to seven search engines; a robust cluster, grid, and cloud computing scheduler for large‐scale searches; and an unsupervised, model‐free, machine‐learning‐based result combiner, which selects the best peptide identification for each spectrum, estimates false‐discovery rates, and outputs pepXML format identifications. The meta‐search platform supports Mascot; Tandem with native, k‐score, and s‐score scoring; OMSSA; MyriMatch; and InsPecT with MS‐GF spectral probability scores, reformatting spectral data and constructing search configurations for each search engine on the fly. The combiner selects the best peptide identification for each spectrum based on search engine results and features that model enzymatic digestion, retention time, precursor isotope clusters, mass accuracy, and proteotypic peptide properties, requiring no prior knowledge of feature utility or weighting. The PepArML meta‐search peptide identification platform often identifies two to three times more spectra than individual search engines at 10% FDR. Curr. Protoc. Bioinform. 44:13.23.1‐13.23.23. © 2013 by John Wiley & Sons, Inc.
Keywords: proteomics; tandem mass spectra; machine learning; cloud computing
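The abstract reports results "at 10% FDR" but does not say how the false‐discovery rate is computed. As a minimal sketch only, assuming a target‐decoy strategy of the kind described by Elias and Gygi (2007) and the simple estimate FDR = decoys / targets above a score threshold, q‐values for a ranked list of peptide‐spectrum matches could be computed as follows. The function names, the `(score, is_decoy)` tuple layout, and the example scores are hypothetical and are not taken from PepArML.

```python
from typing import List, Tuple

Psm = Tuple[float, bool]  # (score, is_decoy); higher score means a better match


def q_values(psms: List[Psm]) -> List[Tuple[float, bool, float]]:
    """Return PSMs sorted by descending score, each with an estimated q-value.

    Assumption: FDR at a threshold is (#decoys) / (#targets) among PSMs
    scoring at or above that threshold. This is one common variant, not
    necessarily the estimator PepArML uses.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)

    # Running FDR estimate at each rank.
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(min(1.0, decoys / max(targets, 1)))

    # q-value: the smallest FDR at this rank or any lower-scoring rank,
    # which makes the values non-decreasing as the score drops.
    qs = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qs.append(running_min)
    qs.reverse()

    return [(score, is_decoy, q) for (score, is_decoy), q in zip(ranked, qs)]


def accepted_targets(psms: List[Psm], max_q: float) -> int:
    """Count target (non-decoy) PSMs whose q-value is at most max_q."""
    return sum(1 for _, is_decoy, q in q_values(psms) if not is_decoy and q <= max_q)


# Hypothetical scores, for illustration only.
example = [(9.1, False), (8.7, False), (8.2, True),
           (7.9, False), (7.5, False), (6.0, True)]
print(accepted_targets(example, 0.10))
```

Counting accepted target identifications at a fixed q‐value threshold in this way is how one would compare a combiner against an individual search engine on the same spectra.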
Table of Contents
- Introduction
- Basic Protocol 1: Upload Tandem Mass Spectra
- Alternate Protocol 1: Batch Upload of Many, Large, or Vendor‐Format Spectra Datafiles
- Support Protocol 1: Registration and Login
- Basic Protocol 2: Configure and Initiate the Search
- Basic Protocol 3: Monitor and Manage the Search Jobs
- Alternate Protocol 2: Run Search Jobs in the Cloud
- Basic Protocol 4: Combine Search Results using PepArML Combiner
- Guidelines for Understanding Results
- Commentary
- Literature Cited
- Figures
- Tables
Materials
Basic Protocol 1: Upload Tandem Mass Spectra
Necessary Resources
Alternate Protocol 1: Batch Upload of Many, Large, or Vendor‐Format Spectra Datafiles
Necessary Resources
Support Protocol 1: Registration and Login
Necessary Resources
Basic Protocol 2: Configure and Initiate the Search
Necessary Resources
Basic Protocol 3: Monitor and Manage the Search Jobs
Necessary Resources
Alternate Protocol 2: Run Search Jobs in the Cloud
Necessary Resources
Basic Protocol 4: Combine Search Results using PepArML Combiner
Necessary Resources
Figures
- Figure 13.23.1 PepArML homepage.
- Figure 13.23.2 Uploading 17mix‐test2.mzxml.gz to the Tutorial folder of the spectra repository.
- Figure 13.23.3 Completed upload of datafile 17mix‐test2.mzxml.gz to the Tutorial folder of the spectra repository.
- Figure 13.23.4 Tutorial folder of spectra repository populated with spectra 17mix‐test2 and selection of Search from the popup menu.
- Figure 13.23.5 Batch upload of 17mix‐test2.mzxml.gz to the Tutorial folder of the spectra repository.
- Figure 13.23.6 Search parameters for the example analysis of 17mix‐test2.
- Figure 13.23.7 Tutorial folder of results repository showing progress of the example analysis.
- Figure 13.23.8 Example analysis search jobs running on the Edwards lab cluster (http://edwardslab.bmcb.georgetown.edu), Amazon Web Services (http://amazonaws.com), and Georgetown HPC computing resources (http://matrix.georgetown.edu).
- Figure 13.23.9 Selection of PepArML Worker Amazon Machine Image for spot request.
- Figure 13.23.10 Setting the Amazon spot request instance type and bid price.
- Figure 13.23.11 PepArML username and password in the Amazon spot request User Data field.
- Figure 13.23.12 Completed PepArML analysis for the Tutorial folder.
- Figure 13.23.13 Evaluation of combiner methods by spectrum and peptide q‐values (fdrcurves.png).
- Figure 13.23.14 Information gain of PepArML PSM features for the example analysis (infogain.png).
- Figure 13.23.15 Schema for unsupervised PepArML training heuristic (Edwards et al., used with permission).
Literature Cited
Breiman, L. 2001. Random forests. Mach. Learn. 45:5‐32.
Craig, R. and Beavis, R.C. 2004. TANDEM: Matching proteins with tandem mass spectra. Bioinformatics 20:1466‐1467.
Edwards, N., Wu, X., and Tseng, C.‐W. 2009. An unsupervised, model‐free, machine‐learning combiner for peptide identifications from tandem mass spectra. Clin. Proteomics 5 (1).
Elias, J.E. and Gygi, S.P. 2007. Target‐decoy search strategy for increased confidence in large‐scale protein identifications by mass spectrometry. Nat. Methods 4:207‐214.
Geer, L.Y., Markey, S.P., Kowalak, J.A., Wagner, L., Xu, M., Maynard, D.M., Yang, X., Shi, W., and Bryant, S.H. 2004. Open mass spectrometry search algorithm. J. Proteome Res. 3:958‐964.
Keller, A., Nesvizhskii, A.I., Kolker, E., and Aebersold, R. 2002. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 74:5383‐5392.
Kessner, D., Chambers, M., Burke, R., Agus, D., and Mallick, P. 2008. ProteoWizard: Open source software for rapid proteomics tools development. Bioinformatics 24:2534‐2536.
Kim, S., Gupta, N., and Pevzner, P.A. 2008. Spectral probabilities and generating functions of tandem mass spectra: A strike against decoy databases. J. Proteome Res. 7:3354‐3363.
MacLean, B., Eng, J.K., Beavis, R.C., and McIntosh, M. 2006. General framework for developing and evaluating database scoring algorithms using the TANDEM search engine. Bioinformatics 22:2830‐2832.
Mallick, P., Schirle, M., Chen, S.S., Flory, M.R., Lee, H., Martin, D., Ranish, J., Raught, B., Schmitt, R., Werner, T., Kuster, B., and Aebersold, R. 2006. Computational prediction of proteotypic peptides for quantitative proteomics. Nat. Biotechnol. 25:125‐131.
Nesvizhskii, A.I., Keller, A., Kolker, E., and Aebersold, R. 2003. A statistical model for identifying proteins by tandem mass spectrometry. Anal. Chem. 75:4646‐4658.
Peng, J., Elias, J.E., Thoreen, C.C., Licklider, L.J., and Gygi, S.P. 2003. Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC‐MS/MS) for large‐scale protein analysis: The yeast proteome. J. Proteome Res. 2:43‐50.
Perkins, D.N., Pappin, D.J., Creasy, D.M., and Cottrell, J.S. 1999. Probability‐based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20:3551‐3567.
Tabb, D.L., Fernando, C.G., and Chambers, M.C. 2007. MyriMatch: Highly accurate tandem mass spectral peptide identification by multivariate hypergeometric analysis. J. Proteome Res. 6:654‐661.
Tanner, S., Shu, H., Frank, A., Wang, L.C., Zandi, E., Mumby, M., Pevzner, P.A., and Bafna, V. 2005. InsPecT: Identification of post translationally modified peptides from tandem mass spectra. Anal. Chem. 77:4626‐4639.