Finding Homologs to Nucleic Acid or Protein Sequences Using the Framesearch Program

Internet, 2013-12-31

Abstract

The Framesearch algorithm allows for frameshift errors when it builds an alignment, and can therefore find alignments that span different reading frames. Protocols in this unit describe the use of Framesearch to search a protein sequence database for sequences that are similar to a query nucleotide sequence, and to search a nucleotide sequence database for sequences that are similar to a query protein sequence. Three alternate protocols describe ways to improve the speed of Framesearch and thus make it practical for routine use. Framesearch is especially appropriate for low-quality single-read nucleotide sequence data, such as ESTs (expressed sequence tags) or early drafts of genomic sequences. For relatively high-quality nucleotide sequences without many single-nucleotide insertion or deletion errors, it offers no significant advantage over less CPU-intensive algorithms.
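
To make the frameshift problem concrete, the following minimal Python sketch translates a short sequence in a single reading frame and shows what one deleted nucleotide does to the result. The sequence, the `translate` helper, and the variable names are illustrative assumptions, not part of Framesearch or the Wisconsin Package.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, offset=0):
    """Translate dna in one reading frame, starting at offset 0, 1, or 2."""
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


# Hypothetical toy sequence encoding the peptide MAKFGHEDW.
clean = "ATGGCCAAATTTGGGCATGAAGATTGG"
# Simulate a single-read error by deleting the nucleotide at index 4.
damaged = clean[:4] + clean[5:]

print(translate(clean))       # MAKFGHEDW
print(translate(damaged))     # MANLGMKI  (frame 0 is wrong after the deletion)
print(translate(damaged, 2))  # GKFGHEDW  (the rest of the peptide is now in another frame)
```

After the deletion, no single reading frame contains the whole peptide, so a fixed-frame comparison sees only fragments. A frameshift-aware alignment can follow the peptide across the change of frame.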

GO TO THE FULL PROTOCOL:

PDF or HTML at Wiley Online Library

Table of Contents

Basic Protocol 1: Framesearch Using a Nucleic Acid Query Sequence
Basic Protocol 2: Framesearch Using a Protein Query Sequence
Alternate Protocol 1: Prefiltering with a Search Algorithm to Improve the Speed of Framesearch with a Nucleic Acid Query Sequence
Alternate Protocol 2: Prefiltering with a Search Algorithm to Improve the Speed of Framesearch with a Protein Query Sequence
Alternate Protocol 3: Improving Speed of Framesearch by Using Specialized Hardware
Support Protocol 1: Downloading and Converting Sequence Files for the Examples Used in the Protocols
Guidelines for Understanding Results
Commentary
Figures

Materials

Basic Protocol 1: Framesearch Using a Nucleic Acid Query Sequence

Necessary Resources

Hardware
Framesearch can be run on any Unix or VMS system that has the Wisconsin Package installed; because it is so CPU‐intensive, Framesearch should be run on the fastest computer available to the user

Software
GCG Wisconsin Package (v. 8.1 or higher)

Files
DNA sequence file of interest (this will be the query sequence; maximum length, 350 kb)
Protein database of sequences to which the DNA sequence will be compared

For example, BA000007.faa contains the amino acid translations of all putative genes found in the example bacterial genome by the lab where it was sequenced, as a single FASTA format text file (appendix 1B). The files used in this example should be downloaded from NCBI or from the Current Protocols Web site (http://www3.interscience.wiley.com/c_p/cpbi_sampledatafiles.htm). Both the query sequence and the database files must then be converted to GCG format, as described in Support Protocol 1.
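
Before converting files, it can be useful to confirm that a FASTA file parses cleanly and that the query respects the 350-kb limit. The sketch below is one way to do that in Python. It assumes "350 kb" means 350,000 nucleotides, and the helper names `read_fasta` and `query_fits` are hypothetical. It does not perform the GCG conversion itself.

```python
MAX_QUERY_LENGTH = 350_000  # assumption: "350 kb" read as 350,000 nucleotides


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def query_fits(sequence, limit=MAX_QUERY_LENGTH):
    """Return True if the query is no longer than the assumed length limit."""
    return len(sequence) <= limit


# Toy input; with a real file, pass an open file handle instead of a list.
records = list(read_fasta([">query1 example", "ATGGCCAAA", "TTTGGG", ">query2", "ACGT"]))
for header, sequence in records:
    print(header, len(sequence), query_fits(sequence))
```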

Basic Protocol 2: Framesearch Using a Protein Query Sequence

Necessary Resources

Hardware
Framesearch can be run on any Unix or VMS system that has the Wisconsin Package installed; because it is so CPU‐intensive, Framesearch should be run on the fastest computer available to the user

Software
GCG Wisconsin Package (v. 8.1 or higher)

Files
Protein sequence file of interest (this will be the query sequence)
Nucleic acid database of sequences to which the protein sequence will be compared

For example, BA000007.fna contains the nucleotide sequences of all putative genes found in the example bacterial genome by the laboratory where it was sequenced, as a single FASTA format text file (appendix 1B). The files used in this example should be downloaded from NCBI or from the Current Protocols Web site (http://www3.interscience.wiley.com/c_p/cpbi_sampledatafiles.htm). Both the query sequence and the database files must then be converted to GCG format, as described in Support Protocol 1.

Alternate Protocol 1: Prefiltering with a Search Algorithm to Improve the Speed of Framesearch with a Nucleic Acid Query Sequence

Necessary Resources

Hardware
Framesearch can be run on any Unix or VMS system that has the Wisconsin Package installed; because it is so CPU‐intensive, Framesearch should be run on the fastest computer available to the user

Software
GCG Wisconsin Package (v. 8.1 or higher)
BLAST program (unit 3.4). In the GCG environment assumed for these examples, both BLAST and Framesearch are included.

Files
DNA sequence file of interest (this will be the query sequence; maximum length, 350 kb)
Protein database of sequences to which the DNA sequence will be compared

For example, the protein database file contains the amino acid translations of all putative genes found in the example bacterial genome by the lab where it was sequenced, as a single FASTA format text file (appendix 1B). The files used in this example should be downloaded from NCBI or from the Current Protocols Web site (http://www3.interscience.wiley.com/c_p/cpbi_sampledatafiles.htm). Both the query sequence and the database files must then be converted to GCG format, as described in Support Protocol 1.
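
The prefiltering idea behind this protocol is a two-stage search: a fast heuristic picks a small set of candidates, and the expensive comparison runs only on those. The Python sketch below illustrates that pattern only. Shared k-mers stand in for BLAST and a position-by-position identity count stands in for Framesearch. All function names are hypothetical, and the toy inputs are plain protein strings rather than GCG files.

```python
def kmers(seq, k=3):
    """Return the set of overlapping k-letter words in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def prefilter(query, database, min_shared=2, k=3):
    """Cheap first pass: keep entries sharing at least min_shared k-mers with the query."""
    query_words = kmers(query, k)
    return [
        name for name, seq in database.items()
        if len(query_words & kmers(seq, k)) >= min_shared
    ]


def identity_count(a, b):
    """Toy stand-in for the expensive scoring step."""
    return sum(x == y for x, y in zip(a, b))


def two_stage_search(query, database, expensive_score=identity_count):
    """Run the expensive scorer only on entries that pass the prefilter."""
    candidates = prefilter(query, database)
    scored = [(expensive_score(query, database[name]), name) for name in candidates]
    return sorted(scored, reverse=True)


database = {"p1": "MAKFGHEDW", "p2": "QQQQQQQ", "p3": "MAKFGHEDL"}
print(two_stage_search("MAKFGHEDW", database))
```

The saving comes from how few entries reach the second stage. The trade-off is that anything the fast stage misses is never seen by the slow one.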

Alternate Protocol 2: Prefiltering with a Search Algorithm to Improve the Speed of Framesearch with a Protein Query Sequence

Necessary Resources

Hardware
Framesearch can be run on any Unix or VMS system that has the Wisconsin Package installed; because it is so CPU‐intensive, Framesearch should be run on the fastest computer available to the user

Software
GCG Wisconsin Package (v. 8.1 or higher)
BLAST program (unit 3.4). In the GCG environment assumed for these examples, both BLAST and Framesearch are included.

Files
Protein sequence file of interest (this will be the query sequence)
Nucleic acid database of sequences to which the protein sequence will be compared

Alternate Protocol 3: Improving Speed of Framesearch by Using Specialized Hardware

Necessary Resources

Hardware
Any Unix or VMS system that has the Wisconsin Package installed

Software
GCG Wisconsin Package (v. 8.1 or higher; includes FROMFASTA)

Files
The files used in this example can be downloaded from the NCBI FTP server as described below, or from the Current Protocols Web site (http://www3.interscience.wiley.com/c_p/cpbi_sampledatafiles.htm)

Figures

Figure 3.2.1 Six‐frame‐translated search versus Framesearch.

Figure 3.2.2 Distribution of scores generated by using Framesearch to compare nucleotides 52500 through 55000 of gi‐15829254_55.seq with all peptide sequences from the example bacterial genome. Since the selected region comprises all of one gene and parts of two flanking genes, there are three very strong hits, highlighted by arrows above. There are also many lower‐quality hits with scores below 400. Most likely, hits with scores above 200 represent genes related to the three genes contained in this region, while hits with scores between 100 and 200 may represent borderline matches, but scores below 100 probably do not represent biologically significant matches.
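
The score ranges described in this caption can be written as a small Python helper. The thresholds come from the caption and apply to this example search only. Treating scores of exactly 100 and exactly 200 as borderline is an assumption, and the function name is hypothetical.

```python
def interpret_score(score):
    """Classify a Framesearch score using the ranges given for the example search."""
    if score > 200:
        return "likely related"
    if score >= 100:  # assumption: the boundary values 100 and 200 count as borderline
        return "borderline"
    return "probably not significant"


for score in (450, 150, 60):
    print(score, interpret_score(score))
```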

Figure 3.2.3 The list of hits from a Framesearch run in which a nucleic acid sequence was used to search a number of peptide sequences. The name of the query sequence, the wildcard expression specifying the target sequences, and the name of the peptide sequence with the best match have been boldfaced in the sample output.

Figure 3.2.4 Alignment of a nucleotide query sequence against a peptide database sequence, generated by Framesearch. Note that the middle portion has been omitted here. The names of the query and database sequences, just above this alignment, have been boldfaced for emphasis.

Figure 3.2.5 The list of hits from a Framesearch run in which an amino acid sequence was used to search a number of nucleotide sequences. The name of the query sequence, the wildcard expression specifying the target sequences, and the name of the nucleotide sequence with the best match have been boldfaced in the sample output.

Figure 3.2.6 Alignment of an amino acid query sequence against a nucleotide database sequence, generated by Framesearch. Note that the middle portion has been omitted here. The names of the query and database sequences, just above this alignment, have been boldfaced for emphasis. Also note that following the name of the nucleotide sequence in this example is the string “/rev”, which means this alignment is to the reverse complement of this nucleotide sequence.
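
A hit reported against the reverse complement means the match lies on the opposite strand of the database sequence. As a minimal illustration of what that strand looks like, the Python sketch below computes a reverse complement. The helper name is hypothetical, and the code handles only the four standard bases.

```python
# Translation table mapping each base to its complement (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string of A, C, G, and T."""
    return dna.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGGCC"))  # GGCCAT
```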

Figure 3.2.7 Illustration of how insertion and deletion errors affect alignments generated by the six‐frame‐translated Smith‐Waterman algorithm. Note that this example was generated on a DeCypher genomics accelerator, manufactured by TimeLogic. SSEARCH in the GCG environment would give very similar results, in a slightly different format. The nucleotides selected are the reverse complement of those nucleotides from the E. coli O157:H7 genome, NCBI REFSEQ number NC_002695, which correspond to amino acids 1 to 84 of the protein with NCBI gi number 13361126.

Figure 3.2.8 A Framesearch alignment between a nucleotide query sequence and a peptide target sequence, in the format generated by a TimeLogic DeCypher genomics accelerator system. Framesearch in the GCG environment would generate the same output, in a slightly different format. The nucleotides selected for this example are the reverse complement of those nucleotides from the E. coli O157:H7 genome, NCBI REFSEQ number NC_002695, which correspond to amino acids 1 to 84 of the protein with NCBI gi number 13361126. Compare this figure with Figure , which shows how Framesearch dynamically follows the correct reading frame despite the frameshift errors created when indel errors are deliberately introduced into the nucleotide sequence.
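
The idea of following the reading frame through an indel can be sketched as a dynamic-programming recurrence. The Python code below is a simplified illustration, not the Framesearch implementation: it scores a local alignment of DNA against protein, where a codon can pair with a residue, either side can be gapped, and one or two nucleotides can be skipped at a frameshift penalty. The scoring values, the function name, and the toy sequences are all assumptions chosen for readability.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def frameshift_local_score(dna, protein, match=2, mismatch=-1, gap=-2, frameshift=-3):
    """Best local score for aligning dna to protein, allowing reading-frame changes."""
    n, m = len(dna), len(protein)
    # best[i][j]: best score of an alignment ending after i nucleotides and j residues
    best = [[0] * (m + 1) for _ in range(n + 1)]
    top = 0
    for i in range(1, n + 1):
        for j in range(m + 1):
            options = [0]  # a local alignment may start anywhere
            options.append(best[i - 1][j] + frameshift)      # skip one nucleotide
            if i >= 2:
                options.append(best[i - 2][j] + frameshift)  # skip two nucleotides
            if i >= 3:
                options.append(best[i - 3][j] + gap)         # codon against a gap
            if j >= 1:
                options.append(best[i][j - 1] + gap)         # residue against a gap
                if i >= 3:
                    residue = CODON_TABLE.get(dna[i - 3:i], "X")
                    step = match if residue == protein[j - 1] else mismatch
                    options.append(best[i - 3][j - 1] + step)
            best[i][j] = max(options)
            top = max(top, best[i][j])
    return top


protein = "MAKFGHEDW"
clean = "ATGGCCAAATTTGGGCATGAAGATTGG"
shifted = clean[:12] + "C" + clean[12:]  # insert one nucleotide after the fourth codon

print(frameshift_local_score(clean, protein))
print(frameshift_local_score(shifted, protein))
# A prohibitive frameshift penalty mimics a search confined to one frame at a time.
print(frameshift_local_score(shifted, protein, frameshift=-10**6))
```

With the frameshift move available, the alignment pays one penalty and continues across the inserted nucleotide. With the move effectively disabled, the best score comes from whichever single frame holds the longest matching fragment.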

Figure 3.2.9 This is a continuation of Figure , and should be compared with it.

Literature Cited

	Accelerys. 2001. Announcement of new features in SeqWeb version 2 http://www.accelerys.com/products/seqweb/whats_new2p0.html.
	Edelman, I., Faigler, S., Mintz, E., Natan, A., and Devereux, J. 1995. Framesearch: A rigorous alignment program for searching protein databases with nucleic acid queries. Poster, Genome Sequence and Analysis Conference, Hilton Head, South Carolina, 1995.
	NOTE: The text of this poster can be found at http://sulu.gcg.com/company/posters/framesearch.html.
	GCG. 1995. GCG Transcript 3:2. Genetics Computing Group, Madison, Wisconsin.
	NOTE: The GCG Transcript, subtitled “Bio‐Computing News for Users of the Wisconsin Package,” was published by the company for a number of years. The text of this issue, which features a discussion of the newly added Framesearch program, can be found at http://sulu.gcg.com/pub/newsletter/vol3_no2_nov95.html.
	Halperin, E., Faigler, S., and Gill‐More, R. 1999. FramePlus: Aligning DNA to protein sequences. Bioinformatics 15(11):867‐873.
	TimeLogic. 2001. Manuals supplied with a DeCypher bioinformatics accelerator. TimeLogic Corporation, Incline Village, Nevada.
	Zhang, Z., Pearson, W.R., and Miller, W. 1997. Aligning a DNA sequence with a protein sequence. Journal of Computational Biology 4(3):339‐349.
Key References
	Edelman et al., 1995. See above.
	The key reference for the Framesearch algorithm is the poster by Edelman. The key reference for a particular implementation of Framesearch is the documentation supplied with that implementation.
Internet Resources
	http://www.accelerys.com/
	Web site of Accelerys, the corporate parent of GCG.
	http://www.cgen.com/
	Web site of the Compugen company.
	http://www.paracel.com/
	Web site of the Paracel company.
	http://www.timelogic.com
	Web site of the TimeLogic company.
