Finding Protein and Nucleotide Similarities with FASTA
Abstract
The FASTA package provides a comprehensive set of similarity searching programs, similar to those provided by the BLAST package, as well as additional programs, not available in BLAST, for searching with short peptides and oligonucleotides. The FASTA programs work with a wide variety of database formats, including MySQL sequence databases. FASTA provides very accurate statistical significance estimates and is more sensitive than BLASTN when comparing DNA sequences. These protocols describe how to use the FASTA programs to characterize protein and DNA sequences using protein:protein, protein:DNA, and DNA:DNA comparisons.
Table of Contents
- Strategic Planning
- Basic Protocol 1: Using the FASTA Programs Interactively
- Alternate Protocol 1: Using the Command‐Line Version of FASTA
- Support Protocol 1: Downloading and Installing the FASTA Programs
- Support Protocol 2: Downloading and Preparing Sequence Databases
- Basic Protocol 2: Large‐Scale Sequence Analysis
- Guidelines for Understanding Results
- Commentary
- Literature Cited
- Figures
- Tables
Materials
Basic Protocol 1: Using the FASTA Programs Interactively
Necessary Resources
Hardware
Alternate Protocol 1: Using the Command‐Line Version of FASTA
Necessary Resources
Hardware
Support Protocol 1: Downloading and Installing the FASTA Programs
Necessary Resources
Hardware
Support Protocol 2: Downloading and Preparing Sequence Databases
Necessary Resources
Hardware
Figures
- Figure 3.9.1 A simple fasta34 search using a soybean cytochrome P450 (SwissProt C972_SOYBN) as a query sequence in a search of PIR1 (the annotated section of the PIR database). (A) Search summary and statistical output. The name and version of the program, and the name and length of the query sequence, are reported, as well as the name of the database searched. The histogram shows the distribution of similarity scores calculated by the program. The left column of numbers indicates a normalized similarity score; the center column reports the number of sequences obtaining that score; and the right column reports the number of sequences expected to obtain the score, based on the database size. (B) The list of top-scoring sequences, with their raw similarity scores (opt), the normalized bit score, and the expectation value.
- Figure 3.9.2 Testing availability of machines with the pvm command.
- Figure 3.9.3 Homologs of carbonic anhydrase (crhu2).
- Figure 3.9.4 fasta34 output (F102M–m) and statistics (–z) options.
- Figure 3.9.5 Alternate formats for high-scoring sequences. (A) The conventional display of high-scoring database sequence alignment scores. (B) Alignment information that is added with the -m 9 option; additional information includes the percent identity with and without counting gaps (%_id, %_gid), the Smith-Waterman score (sw), the alignment length (alen), the coordinates of the start and end of the alignments in the query (an0, ax0) and library (an1, ax1) sequences, the start and end of the query (pn0, px0) and library (pn1, px1) sequences, the number of gaps in the query (gapq) and library (gapl) sequences, and the number of frameshifts (fs). (C) A conventional alignment display. (D) An encoding of the alignment, provided with the -m 9c option.
- Figure 3.9.6 Virtual sequence coordinates. (A) A query sequence that indicates, with the @C:51 token, that the beginning of the sequence should have the virtual coordinate 51. (B) A -m 9 coordinate output indicating that the alignment begins at residue 51 (virtual coordinate) rather than residue 1. (C) Virtual coordinate numbering in the alignment display.
- Figure 3.9.7 Example of a fastlibs file. Each line in the fastlibs file specifies: (a) a descriptive title of the library; (b) whether the library is protein (0) or DNA (1); (c) the abbreviation for the library; and (d) the library's filename and library type.
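The four fields listed for Figure 3.9.7 can be illustrated with a short Python sketch. The figure itself is not reproduced here, so the exact layout is an assumption: the sketch takes each line to be the title, a `$` separator, a single digit (0 or 1), a single-character abbreviation, and then the filename with an optional library-type number after whitespace. The function name `parse_fastlibs_line`, the `FastlibsEntry` class, and the example line are hypothetical and are not part of the FASTA package.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FastlibsEntry:
    title: str                   # (a) descriptive title of the library
    is_dna: bool                 # (b) False for protein (0), True for DNA (1)
    abbreviation: str            # (c) abbreviation for the library
    filename: str                # (d) library filename
    library_type: Optional[int]  # (d) library type, if one is given


def parse_fastlibs_line(line: str) -> FastlibsEntry:
    """Parse one fastlibs line under the ASSUMED layout
    '<title>$<0|1><abbreviation><filename> [<library type>]'."""
    title, sep, rest = line.rstrip("\n").partition("$")
    if not sep or len(rest) < 3:
        raise ValueError(f"not a fastlibs-style line: {line!r}")

    seq_flag, abbreviation, location = rest[0], rest[1], rest[2:]
    if seq_flag not in ("0", "1"):
        raise ValueError(f"expected 0 (protein) or 1 (DNA), got {seq_flag!r}")

    parts = location.split()
    if not parts:
        raise ValueError(f"missing library filename: {line!r}")

    filename = parts[0]
    library_type = int(parts[1]) if len(parts) > 1 else None
    return FastlibsEntry(title.strip(), seq_flag == "1", abbreviation,
                         filename, library_type)


# Hypothetical example line; the path and type number are placeholders.
entry = parse_fastlibs_line("Example Protein Library$0E/path/to/example.fa 0")
print(entry)
```

Keeping the parsed fields in a small dataclass makes it easy to filter libraries by sequence type or look them up by abbreviation before launching a search.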
Literature Cited
Mackey, A.J., Haystead, T.A.J., and Pearson, W.R. 2002. Getting more from less: Algorithms for rapid protein identification with multiple short peptide sequences. Mol. Cell. Proteomics 1:139‐147.
Mott, R. 1992. Maximum likelihood estimation of the statistical distribution of Smith‐Waterman local sequence similarity scores. Bull. Math. Biol. 54:59‐75.
Pearson, W.R., Wood, T.C., Zhang, Z., and Miller, W. 1997. Comparison of DNA sequences with protein sequences. Genomics 46:24‐36.
Reese, J.T. and Pearson, W.R. 2002. Empirical determination of effective gap penalties for sequence comparison. Bioinformatics 18:1500‐1507.
Smith, T.F. and Waterman, M.S. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147:195‐197.
Wootton, J.C. and Federhen, S. 1993. Statistics of local complexity in amino acid sequences and sequence databases. Comput. Chem. 17:149‐163.