Using OrthoMCL to Assign Proteins to OrthoMCL‐DB Groups or to Cluster Proteomes Into New Ortholog Groups

Internet, 2013-12-31

Abstract

OrthoMCL is an algorithm for grouping proteins into ortholog groups based on their sequence similarity. OrthoMCL‐DB is a public database that allows users to browse and view ortholog groups that were pre‐computed using the OrthoMCL algorithm. Version 4 of this database contained 116,536 ortholog groups clustered from 1,270,853 proteins obtained from 88 eukaryotic genomes, 16 archaean genomes, and 34 bacterial genomes. Future versions of OrthoMCL‐DB will include more proteomes as more genomes are sequenced. Here, we describe how you can group your proteins of interest into ortholog clusters using two different means provided by the OrthoMCL system. The OrthoMCL‐DB Web site has a tool for uploading and grouping a set of protein sequences, typically representing a proteome. This method maps the uploaded proteins to existing groups in OrthoMCL‐DB. Alternatively, if you have proteins from a set of genomes that need to be grouped, you can download, install, and run the stand‐alone OrthoMCL software. Curr. Protoc. Bioinform. 35:6.12.1‐6.12.19. © 2011 by John Wiley & Sons, Inc.

Keywords: OrthoMCL; ortholog groups; paralog; proteome; Markov clustering; reciprocal best hits; MCL

GO TO THE FULL PROTOCOL:

PDF or HTML at Wiley Online Library

Introduction
Strategic Planning
Basic Protocol 1: Assign a Proteome to OrthoMCL‐DB Groups
Basic Protocol 2: Create Ortholog Groups from Your Proteomes Using the OrthoMCL Software
Support Protocol 1: Downloading, Installing, and Configuring the OrthoMCL Programs
Guidelines for Understanding Results
Commentary
Literature Cited
Figures
Tables


Materials


Figures

Figure 6.12.1 Overview of the OrthoMCL algorithm. (1) Proteomes must each be in FASTA format where the file name and definition lines comply with simple requirements. (2) The proteome files are filtered to remove low‐quality sequences based on length and percent stop codons. (3) The proteomes are all compared to each other using BLASTP. They are masked with seg and an e‐value cutoff of 1e‐5 is applied. (4) For each pair of sequences that match, compute the “percent match length” score: count the number of amino acids in the shorter sequence that participate in any HSP, divide that by the length of the shorter sequence, and multiply by 100. Filter away matches with a percent match length below 50%. (5) For all pairs of proteomes, find all pairs of proteins across them that have hits as good as or better than any other hits between these proteins and other proteins in those species. (6) Find all pairs of proteins within a species that have mutual e‐values that are better than or equal to all of those proteins' hits to proteins in other species. (7) Find all pairs of proteins across two species that are connected through orthology and in‐paralogy. (8) Normalize in‐paralog e‐values by averaging all qualifying in‐paralog pairs in a genome and dividing each pair by the average. Within a genome, in‐paralog pairs qualify if either of the proteins in the pair has an ortholog in any genome. If no in‐paralogs within a genome have any orthologs, all in‐paralogs in that genome qualify. Normalize ortholog and co‐ortholog pairs for any two species by averaging the e‐values across them, and normalize using that average. (9) Pass on all ortholog, in‐paralog, and co‐ortholog pairs, with their normalized e‐values, to the MCL program for clustering.
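
The "percent match length" score in step 4 can be illustrated with a short Python sketch. This is not code from the OrthoMCL distribution. The function names, the 1-based inclusive coordinate convention, and the representation of each HSP as a (start, end) pair on the shorter sequence are assumptions made for illustration only. The sketch counts each residue once even when HSPs overlap, which is one reasonable reading of "participate in any HSP".

```python
from typing import Iterable, Tuple

# Hypothetical representation: an HSP is a (start, end) pair of 1-based,
# inclusive coordinates on the SHORTER of the two matched sequences.
Interval = Tuple[int, int]


def covered_residues(hsps: Iterable[Interval]) -> int:
    """Count residues covered by at least one HSP; overlaps count once."""
    total = 0
    cur_start, cur_end = None, None
    for start, end in sorted(hsps):
        if cur_end is None or start > cur_end:
            # Disjoint from the running interval: flush it and start anew.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping: extend the running interval if needed.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def percent_match_length(hsps: Iterable[Interval], shorter_length: int) -> float:
    """Covered residues divided by the shorter sequence length, times 100."""
    if shorter_length <= 0:
        raise ValueError("shorter_length must be positive")
    return 100.0 * covered_residues(hsps) / shorter_length


def passes_filter(hsps: Iterable[Interval], shorter_length: int,
                  cutoff: float = 50.0) -> bool:
    """Keep a match only if its percent match length is at least the cutoff."""
    return percent_match_length(hsps, shorter_length) >= cutoff


if __name__ == "__main__":
    hsps = [(1, 40), (30, 60), (80, 100)]  # made-up example coordinates
    print(percent_match_length(hsps, 150), passes_filter(hsps, 150))
```

In this made-up example, the three HSPs cover 81 distinct residues, so the match would be kept for a shorter sequence of 150 residues (54%) and discarded for one of 200 residues (40.5%).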


Figure 6.12.2 OrthoMCL‐DB home page with the Tools link circled.


Figure 6.12.3 A proteome mapped to OrthoMCL‐DB. The results are downloaded as a .zip file that contains five files. Shown here is the orthologGroups file obtained after submitting the Erwinia carotovora proteome (Bell et al., 2004).

Literature Cited

	Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403‐410.
	Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths‐Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E.L., Studholme, D.J., Yeats, C., and Eddy, S.R. 2004. The Pfam protein families database. Nucleic Acids Res. 32:D138‐D141.
	Bell, K.S., Sebaihia, M., Pritchard, L., Holden, M.T., Hyman, L.J., Holeva, M.C., Thomson, N.R., Bentley, S.D., Churcher, L.J., Mungall, K., Atkin, R., Bason, N., Brooks, K., Chillingworth, T., Clark, K., Doggett, J., Fraser, A., Hance, Z., Hauser, H., Jagels, K., Moule, S., Norbertczak, H., Ormond, D., Price, C., Quail, M.A., Sanders, M., Walker, D., Whitehead, S., Salmond, G.P., Birch, P.R., Parkhill, J., and Toth, I.K. 2004. Genome sequence of the enterobacterial phytopathogen Erwinia carotovora subsp. atroseptica and characterization of virulence factors. Proc. Natl. Acad. Sci. U.S.A. 101:11105‐11110.
	Chen, F., Mackey, A.J., Stoeckert, C.J. Jr., and Roos, D.S. 2006. OrthoMCL‐DB: Querying a comprehensive multi‐species collection of ortholog groups. Nucleic Acids Res. 34:D363‐D368.
	Chen, F., Mackey, A.J., Vermunt, J.K., and Roos, D.S. 2007. Assessing performance of orthology detection strategies applied to eukaryotic genomes. PLoS One. 2:e383.
	Enright, A.J., Van Dongen, S., and Ouzounis, C.A. 2002. An efficient algorithm for large‐scale detection of protein families. Nucleic Acids Res. 30:1575‐1584.
	The Gene Ontology Consortium. 2000. Gene ontology: Tool for the unification of biology. Nat. Genet. 25:25‐29.
	Li, L., Stoeckert, C.J. Jr., and Roos, D.S. 2003. OrthoMCL: Identification of ortholog groups for eukaryotic genomes. Genome Res. 13:2178‐2189.
	Webb, E., and International Union of Biochemistry and Molecular Biology. Enzyme nomenclature 1992: Recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the nomenclature and classification of enzymes. 1984th ed. Academic Press, New York.
Key References
	Li et al., 2003. See above.
	The original paper describing the OrthoMCL algorithm.
	Chen et al., 2006. See above.
	A paper describing the OrthoMCL‐DB.
	Chen et al., 2007. See above.
	A paper comparing OrthoMCL to other approaches.
Internet Resources
	http://orthomcl.org
	The OrthoMCL‐DB site
	http://pfam.sanger.ac.uk/search#tabview=tab1
	Submit a set of proteins to find Pfam domains
	http://www.ebi.ac.uk/Tools/msa/clustalw2/
	Submit a set of proteins for multiple sequence alignment
	http://www.biolayout.org/
	Download software to visualize groups using Biolayout.