Selecting the Right Similarity‐Scoring Matrix

Source: Internet, 2013-12-31

Abstract

Protein sequence similarity searching programs like BLASTP, SSEARCH, and FASTA use scoring matrices that are designed to identify distant evolutionary relationships (BLOSUM62 for BLAST, BLOSUM50 for SSEARCH and FASTA). Different similarity scoring matrices are most effective at different evolutionary distances. “Deep” scoring matrices like BLOSUM62 and BLOSUM50 target alignments with 20% to 30% identity, while “shallow” scoring matrices (e.g., VTML10 to VTML80) target alignments that share 90% to 50% identity, reflecting much less evolutionary change. While “deep” matrices provide very sensitive similarity searches, they also require longer sequence alignments and can sometimes produce alignment overextension into nonhomologous regions. Shallower scoring matrices are more effective when searching for short protein domains, or when the goal is to limit the scope of the search to sequences that are likely to be orthologous between recently diverged organisms. Likewise, in DNA searches, the match and mismatch parameters set evolutionary look‐back times and domain boundaries. In this unit, we will discuss the theoretical foundations that drive practical choices of protein and DNA similarity scoring matrices and gap penalties. Deep scoring matrices (BLOSUM62 and BLOSUM50) should be used for sensitive searches with full‐length protein sequences, but short domains or restricted evolutionary look‐back require shallower scoring matrices. Curr. Protoc. Bioinform. 43:3.5.1‐3.5.9. © 2013 by John Wiley & Sons, Inc.

Keywords: similarity scoring matrices; PAM matrices; BLOSUM matrices; sequence alignment

GO TO THE FULL PROTOCOL:

PDF or HTML at Wiley Online Library

Table of Contents

Similarity Searching, Homology, and Statistical Significance
Amino Acid Substitution Matrices: History and Classification
The Algebra of Similarity Scoring (Log‐Odds) Matrices
Scoring Matrices and Gap Penalties
Long Alignments and Overextension
Scoring Matrices for DNA
Summary
Literature Cited
Figures
Tables

Figures

Figure 3.5.1 The BLOSUM62 matrix. The BLOSUM62 matrix used by BLASTP, BLASTX, and TBLASTN is actually 23 × 23: 20 amino acids plus X (any amino acid), B (D or N), and Z (E or Q). Only the lower half of the symmetric matrix is shown to highlight the identity scores on the diagonal. The most positive value is 11 (W:W alignment); the most negative is −4 (found for many hydrophobic/hydrophilic and small/large replacements). The BLOSUM62 matrix is scaled in 1/2‐bit units, so the W:W score of 11 means that this alignment is 2^5.5 ≈ 45 times more common in homologous proteins than by chance. Weighted by amino acid abundance, the average similarity score is about −1 half‐bit.
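
The half‐bit arithmetic in this caption can be checked directly. The sketch below assumes only what the caption states: a score s in 1/2‐bit units corresponds to an odds ratio of 2^(s/2) relative to chance. The function name is illustrative and is not part of BLAST, FASTA, or any other tool.

```python
def half_bits_to_odds_ratio(score: float) -> float:
    """Convert a score in 1/2-bit units to an odds ratio
    (frequency in homologous alignments / frequency expected by chance)."""
    return 2.0 ** (score / 2.0)


# W:W scores 11 half-bits in BLOSUM62, i.e. 5.5 bits.
ww_ratio = half_bits_to_odds_ratio(11)

# The most negative BLOSUM62 score is -4 half-bits, i.e. -2 bits.
worst_ratio = half_bits_to_odds_ratio(-4)

print(f"W:W odds ratio: {ww_ratio:.1f}")       # W:W odds ratio: 45.3
print(f"-4 score odds ratio: {worst_ratio:.2f}")  # -4 score odds ratio: 0.25
```

Read the other way, a score of −4 marks a replacement seen about four times less often in homologous alignments than chance would predict.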

Figure 3.5.2 Comparison of a “shallow” (VTML20) and a “deep” (BLOSUM62) scoring matrix. Both matrices are scaled in 1/2‐bit units. For the small part of the matrices shown here, the VTML20 matrix produces an average identity score of 2.80 half‐bits and an average nonidentity score of −0.59 half‐bits (weighted by amino‐acid abundance). In contrast, BLOSUM62 produces 1.86 for identities but only −0.06 for nonidentities. Thus, VTML20 targets shorter, higher‐identity alignments, because it penalizes nonidentities much more strongly.
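
An abundance‐weighted average of this kind might be computed as in the sketch below. The caption does not spell out the weighting, so this is an assumption: each residue pair (a, b) is weighted by the product of the two residue frequencies, and identities and nonidentities are averaged separately. The six scores are BLOSUM62 values for A, R, and N; the frequencies are hypothetical, chosen only for illustration, so the results will not reproduce the numbers in the caption.

```python
from itertools import product

# BLOSUM62 scores (1/2-bit units) for three residues only.
SCORES = {
    ("A", "A"): 4, ("R", "R"): 5, ("N", "N"): 6,
    ("A", "R"): -1, ("A", "N"): -2, ("R", "N"): 0,
}

# Hypothetical residue frequencies (illustration only, not real abundances).
FREQS = {"A": 0.5, "R": 0.3, "N": 0.2}


def pair_score(a: str, b: str) -> int:
    """Look up a score in the symmetric matrix."""
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]


def weighted_averages(freqs: dict) -> tuple:
    """Return (average identity score, average nonidentity score),
    each weighted by the product of the two residue frequencies."""
    ident_sum = ident_weight = 0.0
    nonident_sum = nonident_weight = 0.0
    for a, b in product(freqs, repeat=2):
        weight = freqs[a] * freqs[b]
        if a == b:
            ident_sum += weight * pair_score(a, b)
            ident_weight += weight
        else:
            nonident_sum += weight * pair_score(a, b)
            nonident_weight += weight
    return ident_sum / ident_weight, nonident_sum / nonident_weight


identity_avg, nonidentity_avg = weighted_averages(FREQS)
print(f"average identity score:    {identity_avg:.2f} half-bits")
print(f"average nonidentity score: {nonidentity_avg:.2f} half-bits")
```

Running the same calculation on the corresponding block of a shallow matrix would show the more negative nonidentity average that the caption describes for VTML20.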

Figure 3.5.3 Overextension of an alignment of homologous SH2 domains. (A) BLASTP alignment of VAV_HUMAN with SKAP2_XENTR. The two proteins share a homologous SH2 domain (highlighted in red) over about 58 amino acids that contributes more than 85% of the similarity score. The remaining 140 amino acids of the alignment juxtapose an SH3 domain from VAV_HUMAN (brown) with a Pleckstrin domain from SKAP2_XENTR (green). These two domains are not homologous; they are classified as having different folds in SCOP. (B) Sub‐alignment scores produced by the SSEARCH36 program using the same scoring matrix and gap penalties as BLASTP (BLOSUM62, 11/1) for the VAV_HUMAN/SKAP2_XENTR alignment. Boundaries for annotated domains in the two proteins were taken from InterPro using the query VAV_HUMAN (qRegion) or the subject SKAP2_XENTR (sRegion). Thus, 103‐206 for the Pleckstrin domain comes from InterPro annotations for SKAP2_XENTR, as does 671‐765 for the SH3 domain in VAV_HUMAN. The raw score, bit‐score, and percent identity are shown for the subregions. The Q‐score is −10 log10(p‐value) based on the bit score; thus, Q = 30 corresponds to a probability (uncorrected for database size) of 0.001.
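
The Q‐score conversion described in this caption is simple enough to sketch. The code below assumes a base‐10 logarithm, which the caption's own example implies (Q = 30 for a probability of 0.001). The function names are illustrative and are not SSEARCH36 options or output fields, and the p‐value is taken as given rather than derived from a bit score.

```python
import math


def q_score(p_value: float) -> float:
    """Q = -10 * log10(p); larger Q means a less probable chance match."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must be in the interval (0, 1]")
    return -10.0 * math.log10(p_value)


def p_from_q(q: float) -> float:
    """Invert the Q-score back to a probability."""
    return 10.0 ** (-q / 10.0)


print(round(q_score(0.001), 6))  # 30.0
print(p_from_q(30))
```

On this scale every 10 units of Q is a tenfold drop in probability, so sub‐alignments with very different levels of support can be compared at a glance.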

Literature Cited
	Altschul, S.F. 1991. Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 219:555‐565.
	Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. A basic local alignment search tool. J. Mol. Biol. 215:403‐410.
	Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI‐BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25:3389‐3402.
	Gonnet, G.H., Cohen, M.A., and Benner, S.A. 1992. Exhaustive matching of the entire protein sequence database. Science 256:1443‐1445.
	Gonzalez, M.W. and Pearson, W.R. 2010. Homologous over‐extension: A challenge for iterative similarity searches. Nucleic Acids Res. 38:2177‐2189.
	Henikoff, S. and Henikoff, J.G. 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. U.S.A. 89:10915‐10919.
	Jones, D.T., Taylor, W.R., and Thornton, J.M. 1992. The rapid generation of mutation data matrices from protein sequences. Comp. Appl. Biosci. 8:275‐282.
	Mueller, T., Spang, R., and Vingron, M. 2002. Estimating amino acid substitution models: A comparison of Dayhoff's estimator, the resolvent approach and a maximum likelihood method. Mol. Biol. Evol. 19:8‐13.
	Pearson, W.R. 1991. Searching protein sequence libraries: Comparison of the sensitivity and selectivity of the Smith‐Waterman and FASTA algorithms. Genomics 11:635‐650.
	Pearson, W.R. and Lipman, D.J. 1988. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. U.S.A. 85:2444‐2448.
	Reese, J.T. and Pearson, W.R. 2002. Empirical determination of effective gap penalties for sequence comparison. Bioinformatics 18:1500‐1507.
	Schwartz, R.M. and Dayhoff, M. 1978. Matrices for detecting distant relationships. In Atlas of Protein Sequence and Structure, Volume 5, Supplement 3 (M. Dayhoff, ed.), pp. 353‐358. National Biomedical Research Foundation, Silver Spring, Maryland.
	Smith, T.F. and Waterman, M.S. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147:195‐197.
	States, D.J., Gish, W., and Altschul, S.F. 1991. Improved sensitivity of nucleic acid database searches using application‐specific scoring matrices. Methods Enzymol. 3:66‐70.