Finding Similar Nucleotide Sequences Using Network BLAST Searches
Abstract
The Basic Local Alignment Search Tool (BLAST) is a keystone of bioinformatics due to its performance and user‐friendliness. Beginner and intermediate users will learn how to design and submit blastn and Megablast searches on the Web pages at the National Center for Biotechnology Information. We map nucleic acid sequences to genomes, find identical or similar mRNA, expressed sequence tag, and noncoding RNA sequences, and run Megablast searches, which are much faster than blastn. Taxonomy reports, genomic views, and multiple alignments help in understanding the results. We interpret expected frequency thresholds, biological significance, and statistical significance. Weak hits do not constitute evidence, but they provide hints for further analyses. We find genes that may code for homologous proteins by translated BLAST. We reduce false positives by filtering out low‐complexity regions. Parsed BLAST results can be integrated into analysis pipelines. Links in the output connect to Entrez, PubMed, structural, sequence, interaction, and expression databases, which facilitates integration with a wide spectrum of biological knowledge. Curr. Protoc. Bioinform. 26:3.3.1‐3.3.26. © 2009 by John Wiley & Sons, Inc.
Keywords: BLAST; sequence alignment; database search; homology search; mapping; nucleic acid; DNA; RNA; genome; blastn; Megablast
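The expected frequency thresholds mentioned in the abstract can be illustrated with the standard relation between a normalized bit score S' and the expected number of chance hits, E = m × n × 2^(−S') (Karlin and Altschul, 1990), where m is the query length and n is the database length. The following is a minimal sketch under the assumption that raw lengths are used instead of the edge‐corrected effective lengths that BLAST applies; the function names and the example numbers are hypothetical and do not come from the protocol.

```python
import math


def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E-value for a normalized bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


def min_bit_score(e_threshold: float, query_len: int, db_len: int) -> float:
    """Bit score at which the approximate E-value equals e_threshold."""
    return math.log2(query_len * db_len / e_threshold)


if __name__ == "__main__":
    # Hypothetical numbers: a 22-nt query against 10**9 database residues.
    m, n = 22, 10**9
    for score in (30.0, 40.0, 50.0):
        print(f"bit score {score:5.1f} -> E ~ {expected_hits(score, m, n):.2e}")
    print(f"bit score needed for E <= 0.01: {min_bit_score(0.01, m, n):.1f}")
```

Because E scales with m × n, the same alignment score becomes less significant as the database grows, which is one reason a fixed E threshold should be interpreted together with the size of the search space.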
Table of Contents
- Introduction
- Basic Protocol 1: Using the Web‐Interface BLAST from the NCBI BLAST Server for Nucleotide Sequences
- Basic Protocol 2: The Default Blastn Result Output
- Support Protocol 1: Setting Optional Parameters
- Support Protocol 2: Formatting Results of a BLAST Search
- Alternate Protocol 1: Megablast Search for Ribosomal RNA
- Alternate Protocol 2: Finding Transcribed Gene Copies and Splice Variants Using Megablast
- Guidelines for Understanding Results
- Commentary
- Literature Cited
- Figures
- Tables
Figures
- Figure 3.3.1 The home page of the NCBI BLAST server (http://blast.ncbi.nlm.nih.gov).
- Figure 3.3.2 The basic search screen for nucleic acid BLAST at NCBI. The title line and the sequence of the human let‐7c microRNA in FASTA format were pasted into the search field.
- Figure 3.3.3 The results of a blastn search using . (A) Administrative section and the color‐coded graphical display of the best hits to the query sequence. (B) One‐line descriptions of the database sequences similar to the query with maximal and total scores, total coverage, E‐value, maximal percent identity, and links to other databases.
- Figure 3.3.4 Pairwise local alignment of the query and the mouse BAC clone RP24‐270A10 from chromosome 13. Note that the query matches this clone at two distant locations.
- Figure 3.3.5 Algorithm parameters.
- Figure 3.3.6 Reformatting BLAST results.
- Figure 3.3.7 The hit table view for automatic parsing.
- Figure 3.3.8 The query, database, and limit selection page for Megablast for the human 5.8S ribosomal RNA (NR_003285).
- Figure 3.3.9 The Algorithm parameter selection page for the Megablast project of Figure .
- Figure 3.3.10 Alignment number 1000: human 5.8S ribosomal RNA (NR_003285) versus the 18S ribosomal RNA of Pisidium nitidum, a bivalve mollusc.
- Figure 3.3.11 The query, database, and limit selection page for Megablast for the human Duchenne muscular dystrophy gene (NM_000109).
- Figure 3.3.12 The Algorithm parameter selection page for the Megablast search for the human Duchenne muscular dystrophy gene (NM_000109).
- Figure 3.3.13 Splice variants of the human Duchenne muscular dystrophy gene (NM_000109). Splice variants are indicated by interrupted lines representing sequences.
- Figure 3.3.14 The Genomic View of the localizations of the sequences similar to the human Duchenne muscular dystrophy gene (NM_000109).
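The hit table view of Figure 3.3.7 is intended for automatic parsing. As a hedged illustration, the sketch below assumes the commonly used tab‐delimited layout with twelve columns (query id, subject id, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E‐value, bit score) and comment lines starting with "#". The `Hit` type, the `parse_hit_table` function, and the sample identifiers are hypothetical names introduced for this example only.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    percent_identity: float
    alignment_length: int
    evalue: float
    bit_score: float


def parse_hit_table(lines: Iterable[str], max_evalue: float = 1e-5) -> List[Hit]:
    """Parse tab-delimited hit table lines and keep hits with E <= max_evalue."""
    hits = []
    for line in lines:
        line = line.rstrip("\r\n")
        # Skip blank lines and the '#' comment header lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a standard 12-column hit line
        hit = Hit(
            query=fields[0],
            subject=fields[1],
            percent_identity=float(fields[2]),
            alignment_length=int(fields[3]),
            evalue=float(fields[10]),
            bit_score=float(fields[11]),
        )
        if hit.evalue <= max_evalue:
            hits.append(hit)
    return hits


if __name__ == "__main__":
    # Made-up example lines; real identifiers and values will differ.
    sample = [
        "# Fields: query id, subject id, % identity, ...",
        "query1\tsubjA\t100.00\t22\t0\t0\t1\t22\t101\t122\t2e-05\t44.1",
        "query1\tsubjB\t90.00\t20\t2\t0\t1\t20\t5\t24\t3.1\t20.3",
    ]
    for hit in parse_hit_table(sample, max_evalue=1e-3):
        print(hit.subject, hit.evalue, hit.bit_score)
```

Filtering on the E‐value at parse time keeps weak hits out of downstream steps; a looser threshold can be passed when weak hits are wanted as hints for further analyses.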
Literature Cited
Altschul, S.F. 1991. Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 219:555‐565.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403‐410.
Altschul, S.F., Boguski, M.S., Gish, W., and Wootton, J.C. 1994. Issues in searching molecular sequence databases. Nat. Genet. 6:119‐129.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI‐BLAST: A new generation of protein database search programs. Nucl. Acids Res. 25:3389‐3402.
Baker, M.E., Yan, L., and Pear, M.R. 2000. Three‐dimensional model of human TIP30, a coactivator for HIV‐1 Tat‐activated transcription, and CC3, a protein associated with metastasis suppression. Cell Mol. Life Sci. 57:851‐858.
Barrett, C., Hughey, R., and Karplus, K. 1997. Scoring hidden Markov models. Comput. Appl. Biosci. 13:191‐199.
Baxevanis, A.D. 2005. Assessing pairwise sequence similarity: BLAST and FASTA. In Bioinformatics. A Practical Guide to the Analysis of Genes and Proteins (A.D. Baxevanis and B.F. Ouellette, eds.), pp. 295‐324. John Wiley & Sons, Hoboken, N.J.
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E. 2000. The Protein Data Bank. Nucl. Acids Res. 28:235‐242.
Birney, E. and Durbin, R. 1997. Dynamite: A flexible code generating language for dynamic programming methods used in sequence comparison. Proc. Int. Conf. Intell. Syst. Mol. Biol. 5:56‐64.
Bolten, E., Schliep, A., Schneckener, S., Schomburg, D., and Schrader, R. 2001. Clustering protein sequences–structure prediction by transitive homology. Bioinformatics 17:935‐941.
Dayhoff, M.O. and Eck, R.V. 1968. A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure (M.O. Dayhoff, ed.), pp. 33‐45. National Biomedical Research Foundation, Silver Spring, Md.
Eddy, S.R. 1998. Profile hidden Markov models. Bioinformatics 14:755‐763.
Elbashir, S.M., Harborth, J., Weber, K., and Tuschl, T. 2002. Analysis of gene function in somatic mammalian cells using small interfering RNAs. Methods 26:199‐213.
Finn, R.D., Tate, J., Mistry, J., Coggill, P.C., Sammut, S.J., Hotz, H.R., Ceric, G., Forslund, K., Eddy, S.R., Sonnhammer, E.L., and Bateman, A. 2008. The Pfam protein families database. Nucl. Acids Res. 36:D281‐D288.
Gerstein, M. 1998. Measurement of the effectiveness of transitive sequence comparison, through a third ‘intermediate’ sequence. Bioinformatics 14:707‐714.
Henikoff, S. and Henikoff, J.G. 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. U.S.A. 89:10915‐10919.
Holm, L. and Sander, C. 1998. Removing near‐neighbor redundancy from large protein sequence collections. Bioinformatics 14:423‐429.
Johnson, M., Zaretskaya, I., Raytselis, Y., Merezhuk, Y., McGinnis, S., and Madden, T.L. 2008. NCBI BLAST: A better web interface. Nucl. Acids Res. 36:W5‐W9.
Jurka, J., Kapitonov, V.V., Kohany, O., and Jurka, M.V. 2007. Repetitive sequences in complex genomes: Structure and evolution. Annu. Rev. Genomics Hum. Genet. 8:241‐259.
Karlin, S. and Altschul, S.F. 1990. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. U.S.A. 87:2264‐2268.
Karlin, S. and Bucher, P. 1992. Correlation analysis of amino acid usage in protein classes. Proc. Natl. Acad. Sci. U.S.A. 89:12165‐12169.
Karolchik, D., Kuhn, R.M., Baertsch, R., Barber, G.P., Clawson, H., Diekhans, M., Giardine, B., Harte, R.A., Hinrichs, A.S., Hsu, F., Kober, K.M., Miller, W., Pedersen, J.S., Pohl, A., Raney, B.J., Rhead, B., Rosenbloom, K.R., Smith, K.E., Stanke, M., Thakkapallayil, A., Trumbower, H., Wang, T., Zweig, A.S., Haussler, D., and Kent, W.J. 2008. The UCSC Genome Browser Database: 2008 update. Nucl. Acids Res. 36:D773‐D779.
Kent, W.J. 2002. BLAT–the BLAST‐like alignment tool. Genome Res. 12:656‐664.
Korf, I., Yandell, M., and Bedell, J. 2003. BLAST. An Essential Guide to the Basic Local Alignment Tool. O'Reilly, Sebastopol, Calif.
Letunic, I., Copley, R.R., Pils, B., Pinkert, S., Schultz, J., and Bork, P. 2006. SMART 5: Domains in the context of genomes and networks. Nucl. Acids Res. 34:D257‐D260.
Liang, Y.D. 2006. Introduction to JAVA programming: Comprehensive Version, 3rd Ed. Pearson Prentice Hall, Lebanon, Ind.
Møller, A. and Schwartzbach, M.I. 2006. An introduction to XML and Web technologies. Addison‐Wesley, New York.
Morgulis, A., Gertz, E.M., Schaffer, A.A., and Agarwala, R. 2006. A fast and symmetric DUST implementation to mask low‐complexity DNA sequences. J. Comput. Biol. 13:1028‐1040.
Morgulis, A., Coulouris, G., Raytselis, Y., Madden, T.L., Agarwala, R., and Schaffer, A.A. 2008. Database indexing for production MegaBLAST searches. Bioinformatics 24:1757‐1764.
Ning, Z., Cox, A.J., and Mullikin, J.C. 2001. SSAHA: A fast search method for large DNA databases. Genome Res. 11:1725‐1729.
Schultz, J., Milpetz, F., Bork, P., and Ponting, C.P. 1998. SMART, a simple modular architecture research tool: Identification of signaling domains. Proc. Natl. Acad. Sci. U.S.A. 95:5857‐5864.
Stajich, J.E. 2007. An introduction to BioPerl. Methods Mol. Biol. 406:535‐548.
Stein, L. 1998. Official Guide to Programming with CGI.pm. The Standard for Building Web Scripts. John Wiley and Sons, New York.
Tisdall, J.D. 2001. Beginning PERL for Bioinformatics. An Introduction to PERL for Biologists. O'Reilly, Sebastopol, Calif.
Ullman, L. 2006. MySQL: Visual Quickstart Guide. Peachpit Press, Berkeley, Calif.
Wang, Y., Addess, K.J., Chen, J., Geer, L.Y., He, J., He, S., Lu, S., Madej, T., Marchler‐Bauer, A., Thiessen, P.A., Zhang, N., and Bryant, S.H. 2007. MMDB: Annotating protein sequences with Entrez's 3D‐structure database. Nucl. Acids Res. 35:D298‐D300.
Wheeler, D.L., Barrett, T., Benson, D.A., Bryant, S.H., Canese, K., Chetvernin, V., Church, D.M., DiCuccio, M., Edgar, R., Federhen, S., Feolo, M., Geer, L.Y., Helmberg, W., Kapustin, Y., Khovayko, O., Landsman, D., Lipman, D.J., Madden, T.L., Maglott, D.R., Miller, V., Ostell, J., Pruitt, K.D., Schuler, G.D., Shumway, M., Sequeira, E., Sherry, S.T., Sirotkin, K., Souvorov, A., Starchenko, G., Tatusov, R.L., Tatusova, T.A., Wagner, L., and Yaschenko, E. 2008. Database resources of the National Center for Biotechnology Information. Nucl. Acids Res. 36:D13‐D21.
Wootton, J.C. and Federhen, S. 1996. Analysis of compositionally biased regions in sequence databases. Methods Enzymol. 266:554‐571.
Wu, T.D. and Watanabe, C.K. 2005. GMAP: A genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21:1859‐1875.
Zhang, Z., Schwartz, S., Wagner, L., and Miller, W. 2000. A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 7:203‐214.
Zweig, A.S., Karolchik, D., Kuhn, R.M., Haussler, D., and Kent, W.J. 2008. UCSC genome browser tutorial. Genomics 92:75‐84.
Key References
Altschul et al., 1994. See above.
An excellent review of the application of pairwise BLAST tools to the identification of possible coding regions and to the elucidation of gene structure and protein function. The review discusses significance, sequence filtering, database issues, alignment statistics, gap costs, scoring systems, and other topics.
Altschul et al., 1997. See above.
This is the original research paper on gapped BLAST and position‐specific iterated BLAST (PSI‐BLAST). It covers a series of algorithmic and performance improvements, gap penalties, and statistical considerations, as well as biological examples with marginal similarities.
Baxevanis, A.D. and Ouellette, B.F. 2005. Bioinformatics. A Practical Guide to the Analysis of Genes and Proteins. John Wiley & Sons, Hoboken, N.J.
A widely taught, clearly written textbook that introduces pairwise sequence similarity searches, biological databases, and many other areas of bioinformatics. Reviews the general concepts of alignments, scoring matrices, and BLAST with practical applications and guidelines for interpretation.
Korf et al., 2003. See above.
An excellent overview of the theory and practice of the BLAST tools as of 2003. This comprehensive and easy‐to‐understand textbook is highly recommended to everyone in bioinformatics or computational biology.
Internet Resources
http://blast.ncbi.nlm.nih.gov
The NCBI BLAST Web site.
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helpentrez.chapter.EntrezHelp
The Entrez documentation at NCBI.
http://www.ncbi.nlm.nih.gov/sites/entrez?db=nuccore&itool=toolbar
The Entrez site for nucleic acid searches at NCBI.
http://www.bioperl.org
The BioPerl site.
http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs
The full documentation for BLAST at NCBI.
http://www.ebi.ac.uk/blast2/nucleotide.html
The European Bioinformatics Institute server for the Washington University BLAST.
http://repeatmasker.genome.washington.edu
The RepeatMasker Web site.
http://www.girinst.org/Censor_Server.html
The Genetic Information Research Institute Web site.