Finding Protein and Nucleotide Similarities with FASTA

Source: Internet, 2013-12-31

Abstract

The FASTA package provides a comprehensive set of similarity searching programs, similar to those provided by the BLAST package, and some additional programs that are not provided by BLAST for searching with short peptides and oligonucleotides. The FASTA programs work with a wide variety of database formats, including mySQL sequence databases. FASTA provides very accurate statistical significance estimates, and is more sensitive than BLASTN when comparing DNA sequences. These protocols describe how to use the FASTA programs to characterize protein and DNA sequences, using protein:protein, protein:DNA, and DNA:DNA comparisons.

GO TO THE FULL PROTOCOL:

PDF or HTML at Wiley Online Library

Strategic Planning
Basic Protocol 1: Using the FASTA Programs Interactively
Alternate Protocol 1: Using the Command‐Line Version of FASTA
Support Protocol 1: Downloading and Installing the FASTA Programs
Support Protocol 2: Downloading and Preparing Sequence Databases
Basic Protocol 2: Large‐Scale Sequence Analysis
Guidelines for Understanding Results
Commentary
Literature Cited
Figures
Tables

Materials

Basic Protocol 1: Using the FASTA Programs Interactively

Necessary Resources

Hardware

A Windows 32‐bit (Windows 95, 98, NT, 2000, XP), Macintosh (PowerPC), or Unix/Linux computer with at least 5 Mb of free disk space for the programs and 100 to 600 Mb of disk space for protein sequence databases. The FASTA programs require very little memory over that required by the computer's operating system.

Software

The FASTA programs, installed and configured as described in Support Protocol 1.

Files

Appropriate sequence databases, downloaded from the NCBI (ftp://ftp.ncbi.nih.gov) or EBI (ftp://ftp.ebi.ac.uk/databases), as described in Support Protocol 2.
A FASTLIBS file that indicates where the sequence databases are located and their corresponding abbreviations (see Support Protocol 1).
A query protein sequence in FASTA format (appendix 1B); this example uses the SwissProt sequence C972_SOYBN and the PIR sequence CRHU2. These sequences can be downloaded from the NCBI Entrez Internet site (http://www.ncbi.nlm.nih.gov/Entrez/) by searching the protein database for C972_SOYBN or CRHU2. Copy the FASTA‐formatted sequences into a file for the searches.
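
Before running a search, it can help to confirm that the query file really is in FASTA format (a ">" header line followed by sequence lines). The following is a minimal Python sketch, not part of the FASTA package; the function name `read_fasta` and the residues in the example are placeholders chosen for illustration and are not the real C972_SOYBN or CRHU2 sequences.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Placeholder data for illustration only.
example = """>query1 hypothetical example protein
MKTAYIAKQR
QISFVKSHFS
>query2 second hypothetical entry
MSHHWGYG
"""

records = read_fasta(example)
for header, sequence in records:
    # Print the identifier (first word of the header) and the sequence length.
    print(header.split()[0], len(sequence))
```

A file that yields no records, or a record with an empty sequence, is a sign that the copy from Entrez did not preserve the FASTA layout.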

Alternate Protocol 1: Using the Command‐Line Version of FASTA

Necessary Resources

Hardware

A Windows 32‐bit (Windows 95, 98, NT, 2000, XP), Macintosh (PowerPC), or Unix/Linux computer with at least 5 Mb of free disk space for the programs and 100 to 600 Mb of disk space for protein sequence databases. The FASTA programs require very little memory over that required by the computer's operating system.

Software

The FASTA programs, installed and configured as described in Support Protocol 1.

Files

Appropriate sequence databases, downloaded from the NCBI (ftp://ftp.ncbi.nih.gov) or EBI (ftp://ftp.ebi.ac.uk/databases), as described in Support Protocol 2.
A FASTLIBS file that indicates where the sequence databases are located and their corresponding abbreviations (see Support Protocol 1).
A query protein sequence in FASTA format (appendix 1B); this example uses the SwissProt sequence C972_SOYBN and the PIR sequence CRHU2. These sequences can be downloaded from the NCBI Entrez Internet site (http://www.ncbi.nlm.nih.gov/Entrez/) by searching the protein database for C972_SOYBN or CRHU2. Copy the FASTA‐formatted sequences into a file for the searches.
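
With these resources in place, a command-line search takes a query file and a library as arguments. The Python sketch below only assembles and, if the program is installed, launches such a command. It assumes the `fasta34` program name used in the figures, the `-q` (quiet, non-interactive) option, and the usual "query file, then library file" argument order; the file names `c972_soybn.aa` and `pir1.seg` are hypothetical and should be replaced with local paths.

```python
import shutil
import subprocess


def build_fasta_command(query_file, library_file, program="fasta34"):
    """Return the argument list for a quiet (non-interactive) search."""
    return [program, "-q", query_file, library_file]


# Hypothetical file names; substitute your own query and database files.
cmd = build_fasta_command("c972_soybn.aa", "pir1.seg")
print(" ".join(cmd))

# Only attempt the search when the program is actually on the PATH.
if shutil.which(cmd[0]) is not None:
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout[:500])  # show the start of the search summary
```

Building the argument list separately from running it makes it easy to log or review the exact command before a long search starts.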

Support Protocol 1: Downloading and Installing the FASTA Programs

Necessary Resources

Hardware

A Windows 32‐bit (Windows 95, 98, NT, 2000, XP), Macintosh (PowerPC), or Unix/Linux computer with at least 5 Mb of free disk space for the programs and 100 to 600 Mb of disk space for protein sequence databases. The FASTA programs require very little memory over that required by the computer's operating system.

Software

Current versions of the FASTA programs can be downloaded from ftp://ftp.virginia.edu/pub/fasta. Unix/Linux versions of the programs are provided as compressed “shell archives,” e.g., fasta3.shar.Z; be sure to transfer this file in binary mode. Windows and Macintosh versions of the programs are available in the win32_fasta and mac_fasta directories. The latest versions of the similarity searching programs are in fasta3.zip (for Windows) or fasta3.sea.bin (for Mac OS 8.5 to 9.x). The programs come with complete source code, but recompiling should not be necessary on Windows and Macintosh machines, except those running Mac OS X, which is a variant of Unix. Copy the FASTA distribution file to a new directory for installation.

Files

To verify that the program is installed correctly, this protocol uses the mgstm1.aa and prot_test.lseg files included in the FASTA3 distribution file.

Support Protocol 2: Downloading and Preparing Sequence Databases

Necessary Resources

Hardware

Clusters of Linux workstations, or Beowulf clusters, can provide a very cost‐effective computing platform. Systems costing less than $50,000 (16 dual‐processor 2 GHz Intel/Athlon machines with 1 Gb memory) are capable of meeting large‐scale sequence comparison needs of all but the largest genome centers. To run the PVM/MPI parallel versions of the programs in the FASTA package, one will need accounts on several Unix/Linux/MacOSX computers that share access to the same directories.

Software

PVM or MPI parallel environment: There are two widely used environments for network parallel computing, PVM (Parallel Virtual Machine; http://www.epm.ornl.gov/pvm/pvm_home.html), and MPI (Message Passing Interface; implementations are available from www-unix.mcs.anl.gov/mpi/mpich/ and http://www.lam-mpi.org/). Both environments have their proponents; FASTA supports both environments. Installing and testing PVM or MPI is more difficult than installing FASTA, so one should probably use the implementation that is best supported at one's institution.
PVM/MPI parallel versions of FASTA: The PVM/MPI parallel versions of the FASTA programs are included with the standard FASTA distribution, and use identical code for the comparison, statistics, and alignment display functions. However, the program names are different, and the compilation process uses a different Makefile.

Figures

Figure 3.9.1 A simple fasta34 search using a soybean cytochrome P450 (SwissProt C972_SOYBN) as a query sequence in a search of PIR1 (the annotated section of the PIR database). (A) Search summary and statistical output. The name and version of the program, and the name and length of the query sequence, are reported, as well as the name of the database searched. The histogram shows the distribution of similarity scores calculated by the program. The left column of numbers indicates a normalized similarity score; the center column reports the number of sequences obtaining that score, and the right column reports the number of sequences expected to obtain the score, based on the database size. (B) The list of top‐scoring sequences, with their raw similarity scores (opt), the normalized bit score, and the expectation value.

Figure 3.9.2 Testing availability of machines with the pvm command.

Figure 3.9.3 Homologs of carbonic anhydrase (crhu2).

Figure 3.9.4 fasta34 output (-m) and statistics (-z) options.

Figure 3.9.5 Alternate formats for high‐scoring sequences. (A) The conventional display of high‐scoring database sequence alignment scores. (B) Alignment information that is added with the -m 9 option; additional information includes the percent identity with and without counting gaps (%_id, %_gid), the Smith‐Waterman score (sw), the alignment length (alen), the coordinates of the start and end of the alignments in the query (an0, ax0) and library (an1, ax1) sequences, the start and end of the query (pn0, px0) and library (pn1, px1) sequences, the number of gaps in the query (gapq) and library (gapl) sequences, and the number of frameshifts (fs). (C) A conventional alignment display. (D) An encoding of the alignment, provided with the -m 9c option.

Figure 3.9.6 Virtual sequence coordinates. (A) A query sequence that indicates, with the @C:51 token, that the beginning of the sequence should have the virtual coordinate 51. (B) A -m 9 coordinate output indicating that the alignment begins at residue 51 (virtual coordinate) rather than residue 1. (C) Virtual coordinate numbering in the alignment display.
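
The coordinate shift described in this caption can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not code from the FASTA package: the helper names `virtual_offset` and `to_virtual`, the example header, and the assumption that the @C: token sits on the header line are all hypothetical.

```python
import re


def virtual_offset(header):
    """Return the shift implied by an @C:<n> token in a header (0 if absent)."""
    match = re.search(r"@C:(-?\d+)", header)
    # @C:51 means residue 1 is reported as 51, so the shift is n - 1.
    return int(match.group(1)) - 1 if match else 0


def to_virtual(position, header):
    """Convert a 1-based residue position to its virtual coordinate."""
    return position + virtual_offset(header)


# Hypothetical header line carrying the token from the caption.
header = ">example_query @C:51 hypothetical fragment"
print(to_virtual(1, header))   # the first residue maps to coordinate 51
print(to_virtual(10, header))  # the tenth residue maps to coordinate 60
```

Virtual coordinates are useful when the query is a fragment of a longer sequence and alignment positions should be reported in the numbering of the full-length sequence.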

Figure 3.9.7 Example of a fastlibs file. Each line in the fastlibs file specifies: (a) a descriptive title of the library; (b) whether the library is protein (0) or DNA (1); (c) the abbreviation for the library; and (d) the library's filename and library type.
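
The four fields named in this caption can be pulled apart with a short parser. The Python sketch below assumes one common layout for a fastlibs line: the title, a "$" separator, a single digit (0 or 1), a one-character abbreviation, then the filename optionally followed by a space and a numeric library type. That exact layout, the names `LibraryEntry` and `parse_fastlibs_line`, and the example line are assumptions for illustration; check them against the fastlibs file shipped with your distribution.

```python
from typing import NamedTuple, Optional


class LibraryEntry(NamedTuple):
    title: str                   # (a) descriptive title
    is_dna: bool                 # (b) False for protein (0), True for DNA (1)
    abbreviation: str            # (c) one-character library abbreviation
    filename: str                # (d) path to the library file
    library_type: Optional[int]  # (d) optional numeric library type


def parse_fastlibs_line(line):
    """Split one fastlibs line into its fields (assumed layout, see above)."""
    title, _, rest = line.rstrip("\n").partition("$")
    if len(rest) < 3 or rest[0] not in "01":
        raise ValueError(f"unrecognized fastlibs line: {line!r}")
    parts = rest[2:].split()
    if not parts:
        raise ValueError(f"missing filename in fastlibs line: {line!r}")
    library_type = int(parts[1]) if len(parts) > 1 else None
    return LibraryEntry(title, rest[0] == "1", rest[1], parts[0], library_type)


# Hypothetical entry; the path and type number are placeholders.
entry = parse_fastlibs_line("PIR1 annotated protein database$0A/seqlib/pir1.seg 5")
print(entry)
```

Parsing the file this way makes it straightforward to check that every listed database file exists before starting a batch of searches.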
