Computational Methods for Protein Sequence Comparison and Search

Source: Internet, 2013-12-31

Abstract

Protein sequence comparison and search have become commonplace not only for bioinformatics researchers but also for many experimentalists. Because of the exponential growth in sequence data, sequence comparison in particular has become an increasingly important tool. Relating a new gene sequence to other known sequences often reveals its function, structure, and evolution. Many sequence comparison and search tools are available through public Web servers, and biologists can use them easily with little knowledge of computers or bioinformatics. This unit provides some theoretical background and describes popular tools for dot plots, sequence searches against a database, multiple sequence alignment, protein tree construction, and protein family and motif searches. Step-by-step examples are provided to illustrate how to use some of the best-known tools. Finally, some general advice is given on combining different sequence analysis tools for biological inference. Curr. Protoc. Protein Sci. 56:2.1.1-2.1.27. © 2009 by John Wiley & Sons, Inc.

Keywords: protein sequence comparison; dot plot; multiple sequence alignment; protein tree; protein family; motif search

GO TO THE FULL PROTOCOL:

PDF or HTML at Wiley Online Library

Table of Contents

Introduction
Theoretical Background for Protein Sequence Analysis
Matrix Methods for Sequence Comparison: Dot Plots
Sequence Similarity Searching
Multiple Alignments
Protein Trees
Protein Family and Functional Site Identification
General Strategy for Sequence Analyses
Acknowledgement
Internet Resources
Literature Cited
Figures
Tables

Figures

Figure 2.1.1 Dot plot generated from comparison of peanut allergens Ara h 1 and Ara h 3 using PLALIGN. Regions of similarity between the two sequences appear as lines parallel to, and offset from, the line of identity. The expectation values for the local alignments of these regions are shown in color. The horizontal axis indicates Ara h 1 and the vertical axis indicates Ara h 3.

Figure 2.1.2 The best local sequence alignment for peanut allergens Ara h 1 and Ara h 3 using PLALIGN. In the alignment, the lower sequence is Ara h 1 and the upper one is Ara h 3.

Figure 2.1.3 FASTA histogram from a global‐alignment search of the SWISS‐PROT database for a lectin protein. Numbers of windows at each opt score are plotted. Note that there are seven highly significant alignments.

Figure 2.1.4 FASTA alignment table and the best scoring alignment for the same search illustrated in Figure . The table shows the best alignment scores sorted by the highest opt score.

Figure 2.1.5 Input sequence file to run TCoffee for multiple sequence alignment. The sequences are from the query protein (“test”) and top seven significant hits in Figure .

Figure 2.1.6 TCoffee output multiple sequence alignment results in the ClustalW format for the input sequences in Figure . The fully conserved residues are marked with “*”, while somewhat conserved residues are indicated with “:” or “.”, the latter of which marks weaker conservation.

Figure 2.1.7 TreeView display for the phylogenetic tree produced using TCoffee based on the multiple sequence alignment in Figure .

Figure 2.1.8 Partial output from MotifScan for protein Sin1, indicating the bipartite localization signals.
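
As a rough illustration of the dot plot idea behind Figure 2.1.1, the following Python sketch marks every pair of positions where short windows of two sequences are identical. It is a plain identity-based dot matrix, not the local-alignment method that PLALIGN uses; the function names `dot_plot` and `render`, the `window` and `min_matches` parameters, and the toy sequences are assumptions introduced here for illustration only.

```python
def dot_plot(seq_a, seq_b, window=3, min_matches=3):
    """Return the set of (i, j) start positions whose windows match.

    A dot is placed at (i, j) when the window of length `window`
    starting at seq_a[i] and the one starting at seq_b[j] share at
    least `min_matches` identical residues. Both parameters are
    illustrative defaults, not values taken from any published tool.
    """
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(
                1 for k in range(window) if seq_a[i + k] == seq_b[j + k]
            )
            if matches >= min_matches:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the plot as text: seq_a on the horizontal axis, seq_b on the vertical."""
    rows = []
    for j in range(len(seq_b)):
        rows.append(
            "".join("*" if (i, j) in dots else "." for i in range(len(seq_a)))
        )
    return "\n".join(rows)


# Toy sequences (hypothetical): seq_b contains the segment ACDEF of seq_a,
# shifted two positions to the right.
seq_a = "ACDEFGHIK"
seq_b = "XXACDEFXX"
dots = dot_plot(seq_a, seq_b)
print(render(seq_a, seq_b, dots))
```

In the toy example the shared segment produces a run of dots along a single diagonal, offset from the main diagonal by the difference in starting positions, which is the pattern the Figure 2.1.1 caption describes for real sequences. Tools for real proteins typically score windows with a substitution matrix rather than exact identity.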
