Finding Homologs to Nucleic Acid or Protein Sequences Using the Framesearch Program


Abstract

 

The Framesearch algorithm includes the possibility of a frameshift error in its alignment algorithm, and therefore can find alignments that span different reading frames. Protocols in this unit describe the use of Framesearch to search a protein sequence database for sequences that are similar to a query nucleotide sequence, and to search a nucleotide sequence database for sequences that are similar to a query protein sequence. Three alternate protocols describe ways to improve the speed of Framesearch and thus make it practical for routine use. Framesearch is especially appropriate for low-quality single-read nucleotide sequence data, such as ESTs (expressed sequence tags) or early drafts of genomic sequences; it does not offer any significant advantage over less CPU-intensive algorithms for relatively high-quality nucleotide sequences without many single-nucleotide insertion or deletion errors.
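To make the frameshift problem concrete, the following minimal Python sketch translates a short, made-up DNA sequence in its three forward reading frames, before and after one extra base is inserted. It is not the Framesearch algorithm; the sequence and the names `translate` and `forward_frames` are illustrative assumptions, and only the standard genetic code is used.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, frame=0):
    """Translate one forward reading frame (0, 1, or 2); unknown codons become X."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def forward_frames(dna):
    """Return the translations of the three forward reading frames."""
    return [translate(dna, frame) for frame in range(3)]


# Hypothetical example sequence (not taken from the unit's sample data).
original = "ATGGCTAAAGGTGAACTGCATTGG"
# Simulate a sequencing error: one extra base inserted after the third codon.
with_insertion = original[:9] + "C" + original[9:]

print(forward_frames(original)[0])        # MAKGELHW
print(forward_frames(with_insertion)[0])  # MAKR*TAL
print(forward_frames(with_insertion)[1])  # WLNGELHW
```

After the insertion, the start of the peptide (MAK) is still in frame 0, but the rest (GELHW) is only readable in frame 1. A search that translates each frame separately sees two short fragments, whereas an alignment that allows frameshifts can follow the peptide across both frames.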

     
 
GO TO THE FULL PROTOCOL:
PDF or HTML at Wiley Online Library

Table of Contents

  • Basic Protocol 1: Framesearch Using a Nucleic Acid Query Sequence
  • Basic Protocol 2: Framesearch Using a Protein Query Sequence
  • Alternate Protocol 1: Prefiltering with a Search Algorithm to Improve the Speed of Framesearch with a Nucleic Acid Query Sequence
  • Alternate Protocol 2: Prefiltering with a Search Algorithm to Improve the Speed of Framesearch with a Protein Query Sequence
  • Alternate Protocol 3: Improving Speed of Framesearch by Using Specialized Hardware
  • Support Protocol 1: Downloading and Converting Sequence Files for the Examples Used in the Protocols
  • Guidelines for Understanding Results
  • Commentary
  • Figures
     
 

Materials

Basic Protocol 1: Framesearch Using a Nucleic Acid Query Sequence

  Necessary Resources
  • Hardware
  • Framesearch can be run on any Unix or VMS system that has the Wisconsin Package installed; because it is so CPU‐intensive, Framesearch should be run on the fastest computer available to the user
  • Software
  • GCG Wisconsin Package (v. 8.1 or higher)
  • Files
  • DNA sequence file of interest (this will be the query sequence; maximum length, 350 kb)
  • Protein database of sequences to which the DNA sequence will be compared
For example, BA000007.faa contains the amino acid translations of all putative genes found in this bacterial genome by the laboratory where it was sequenced, as a single FASTA format text file (appendix 1B). Both the query sequence and the database files must be converted to GCG format (Support Protocol 1). The files used in this example should be downloaded from NCBI or from the Current Protocols Web site (http://www3.interscience.wiley.com/c_p/cpbi_sampledatafiles.htm) and converted to GCG format, as described in Support Protocol 1.
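Before conversion, it can be useful to check what a multi-record FASTA file contains and whether a query is within the stated length limit. The Python sketch below only reads FASTA text; it does not perform the GCG conversion. The names `read_fasta`, `query_fits`, and `MAX_QUERY_LENGTH` are hypothetical, the records are invented, and reading "350 kb" as 350,000 nucleotides is an assumption.

```python
import io

# Assumption: "350 kb" is interpreted as 350,000 nucleotides.
MAX_QUERY_LENGTH = 350_000


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-format text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def query_fits(sequence, limit=MAX_QUERY_LENGTH):
    """Return True if the query sequence is no longer than the limit."""
    return len(sequence) <= limit


# Invented two-record example standing in for a real FASTA file.
example = io.StringIO(">seq1 hypothetical protein\nMKT\nAYI\n>seq2 another record\nMSS\n")
records = list(read_fasta(example))
for header, sequence in records:
    print(header, len(sequence))
```

With a real file, `open(path)` would replace the `io.StringIO` object.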

Basic Protocol 2: Framesearch Using a Protein Query Sequence

  Necessary Resources
  • Hardware
  • Framesearch can be run on any Unix or VMS system that has the Wisconsin Package installed; because it is so CPU‐intensive, Framesearch should be run on the fastest computer available to the user
  • Software
  • GCG Wisconsin Package (v. 8.1 or higher)
  • Files
  • Protein sequence file of interest (this will be the query sequence)
  • Nucleic acid database of sequences to which the protein sequence will be compared
For example, BA000007.fna contains the nucleotide sequence of all putative genes found in this bacterial genome by the laboratory where it was sequenced, as a single FASTA format text file (appendix 1B). Both the query sequence and the database files must be converted to GCG format (Support Protocol 1). The files used in this example should be downloaded from NCBI or from the Current Protocols Web site (http://www3.interscience.wiley.com/c_p/cpbi_sampledatafiles.htm) and converted to GCG format, as described in Support Protocol 1.

Alternate Protocol 1: Prefiltering with a Search Algorithm to Improve the Speed of Framesearch with a Nucleic Acid Query Sequence

  Necessary Resources
  • Hardware
  • Framesearch can be run on any Unix or VMS system that has the Wisconsin Package installed; because it is so CPU‐intensive, Framesearch should be run on the fastest computer available to the user
  • Software
  • GCG Wisconsin Package (v. 8.1 or higher)
  • BLAST program (unit 3.4)
In the GCG environment assumed for these examples, both BLAST and Framesearch are included.
  • Files
  • DNA sequence file of interest (this will be the query sequence; maximum length, 350 kb)
  • Protein database of sequences to which the DNA sequence will be compared
For example, the database file contains the amino acid translations of all putative genes found in this bacterial genome by the laboratory where it was sequenced, as a single FASTA format text file (appendix 1B). Both the query sequence and the database files must be converted to GCG format (Support Protocol 1). The files used in this example should be downloaded from NCBI or from the Current Protocols Web site (http://www3.interscience.wiley.com/c_p/cpbi_sampledatafiles.htm) and converted to GCG format, as described in Support Protocol 1.
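The prefiltering idea can be sketched in a few lines of Python: a fast search first reports candidate database entries, and only those entries are passed to the slower, more rigorous search. The record names, sequences, and the `prefilter` function below are hypothetical, and the list of fast-search hits is hard-coded rather than produced by an actual BLAST run.

```python
def prefilter(records, hit_ids):
    """Keep only database records whose identifier was reported by the fast search."""
    wanted = set(hit_ids)
    return {name: seq for name, seq in records.items() if name in wanted}


# Invented protein database and invented fast-search hit list.
database = {
    "prot_A": "MKTAYIAK",
    "prot_B": "MSSHEGGK",
    "prot_C": "MLLPQRST",
}
fast_search_hits = ["prot_C", "prot_A"]

candidates = prefilter(database, fast_search_hits)
print(sorted(candidates))
```

Only the entries in `candidates` would then be compared with the CPU-intensive algorithm, which is where the time saving comes from. The trade-off is that any true match missed by the fast search is never seen by the rigorous one.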

Alternate Protocol 2: Prefiltering with a Search Algorithm to Improve the Speed of Framesearch with a Protein Query Sequence

  Necessary Resources
  • Hardware
  • Framesearch can be run on any Unix or VMS system that has the Wisconsin Package installed; because it is so CPU‐intensive, Framesearch should be run on the fastest computer available to the user
  • Software
  • GCG Wisconsin Package (v. 8.1 or higher)
  • BLAST program (unit 3.4)
In the GCG environment assumed for these examples, both BLAST and Framesearch are included.
  • Files
  • Protein sequence file of interest (this will be the query sequence)
  • Nucleic acid database of sequences to which the protein sequence will be compared
For example, BA000007.fna contains the nucleotide sequence of all putative genes found in this bacterial genome by the laboratory where it was sequenced, as a single FASTA format text file (appendix 1B). Both the query sequence and the database files must be converted to GCG format (Support Protocol 1). The files used in this example should be downloaded from NCBI or from the Current Protocols Web site (http://www3.interscience.wiley.com/c_p/cpbi_sampledatafiles.htm) and converted to GCG format, as described in Support Protocol 1.

Alternate Protocol 3: Improving Speed of Framesearch by Using Specialized Hardware

  Necessary Resources
  • Hardware
  • Any Unix or VMS system that has the Wisconsin Package installed
  • Software
  • GCG Wisconsin Package (v. 8.1 or higher; includes FROMFASTA)
  • Files
  • The files used in this example can be downloaded from the NCBI FTP server as described below, or from the Current Protocols Web site (http://www3.interscience.wiley.com/c_p/cpbi_sampledatafiles.htm)

Figures

  • Figure 3.2.1 Six‐frame‐translated search versus Framesearch.
  • Figure 3.2.2 Distribution of scores generated by using Framesearch to compare nucleotides 52500 through 55000 of gi‐15829254_55.seq with all peptide sequences from the example bacterial genome. Since the selected region comprises all of one gene and parts of two flanking genes, there are three very strong hits, highlighted by arrows above. There are also many lower‐quality hits with scores below 400. Most likely, hits with scores above 200 represent genes related to the three genes contained in this region, while hits with scores between 100 and 200 may represent borderline matches, but scores below 100 probably do not represent biologically significant matches.
  • Figure 3.2.3 The list of hits from a Framesearch run in which a nucleic acid sequence was used to search a number of peptide sequences. The name of the query sequence, the wildcard expression specifying the target sequences, and the name of the peptide sequence with the best match have been boldfaced in the sample output.
  • Figure 3.2.4 Alignment of a nucleotide query sequence against a peptide database sequence, generated by Framesearch. Note that the middle portion has been omitted here. The names of the query and database sequences, just above this alignment, have been boldfaced for emphasis.
  • Figure 3.2.5 The list of hits from a Framesearch run in which an amino acid sequence was used to search a number of nucleotide sequences. The name of the query sequence, the wildcard expression specifying the target sequences, and the name of the nucleotide sequence with the best match have been boldfaced in the sample output.
  • Figure 3.2.6 Alignment of an amino acid query sequence against a nucleotide database sequence, generated by Framesearch. Note that the middle portion has been omitted here. The names of the query and database sequences, just above this alignment, have been boldfaced for emphasis. Also note that following the name of the nucleotide sequence in this example is the string “/rev”, which means this alignment is to the reverse complement of this nucleotide sequence.
  • Figure 3.2.7 Illustration of how insertion and deletion errors affect alignments generated by the six‐frame‐translated Smith‐Waterman algorithm. Note that this example was generated on a DeCypher genomics accelerator, manufactured by TimeLogic. SSEARCH in the GCG environment would give very similar results, in a slightly different format. The nucleotides selected are the reverse complement of those nucleotides from the E. coli O157:H7 genome, NCBI REFSEQ number NC_002695, which correspond to amino acids 1 to 84 of the protein with NCBI gi number 13361126.
  • Figure 3.2.8 A Framesearch alignment between a nucleotide query sequence and a peptide target sequence, in the format generated by a TimeLogic DeCypher genomics accelerator system. Framesearch in the GCG environment would generate the same output, in a slightly different format. The nucleotides selected for this example are the reverse complement of those nucleotides from the E. coli O157:H7 genome, NCBI REFSEQ number NC_002695, which correspond to amino acids 1 to 84 of the protein with NCBI gi number 13361126. Compare this figure with Figure , which shows how Framesearch dynamically follows the correct reading frame despite the frameshift errors created when indel errors are deliberately introduced into the nucleotide sequence.
  • Figure 3.2.9 This is a continuation of Figure , and should be compared with it.
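The score ranges given in the caption of Figure 3.2.2 can be written as a small Python helper for sorting hits into rough categories. This is only a sketch: the cutoffs of 100 and 200 come from that caption and apply to that particular example, the treatment of scores exactly at a boundary is an assumption, and the function name, labels, and hit names are invented.

```python
def classify_score(score):
    """Sort a Framesearch score into the rough categories of Figure 3.2.2."""
    if score > 200:
        return "likely related"
    if score >= 100:  # assumption: boundary values count as borderline
        return "borderline"
    return "probably not significant"


# Invented hit names and scores, for illustration only.
example_hits = {"hit_1": 850, "hit_2": 150, "hit_3": 40}
for name, score in example_hits.items():
    print(name, score, classify_score(score))
```

Cutoffs like these depend on the query, the database, and the scoring parameters, so they should be re-examined against the score distribution of each new search rather than reused as fixed values.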

Literature Cited

   Accelerys. 2001. Announcement of new features in SeqWeb version 2. http://www.accelerys.com/products/seqweb/whats_new2p0.html.
   Edelman, I., Faigler, S., Mintz, E., Natan, A., and Devereux, J. 1995. Framesearch: A rigorous alignment program for searching protein databases with nucleic acid queries. Poster, Genome Sequence and Analysis Conference, Hilton Head, South Carolina, 1995.
   NOTE: The text of this poster can be found at http://sulu.gcg.com/company/posters/framesearch.html.
   GCG. 1995. GCG Transcript 3:2. Genetics Computer Group, Madison, Wisconsin.
   NOTE: The GCG Transcript, subtitled “Bio‐Computing News for Users of the Wisconsin Package,” was published by the company for a number of years. The text of this issue, which features a discussion of the newly added Framesearch program, can be found at http://sulu.gcg.com/pub/newsletter/vol3_no2_nov95.html.
   Halperin, E., Faigler, S., and Gill‐More, R. 1999. FramePlus: Aligning DNA to protein sequences. Bioinformatics 15(11):867‐873.
   TimeLogic. 2001. Manuals supplied with a DeCypher bioinformatics accelerator. TimeLogic Corporation, Incline Village, Nevada.
   Zhang, Z., Pearson, W.R., and Miller, W. 1997. Aligning a DNA sequence with a protein sequence. Journal of Computational Biology 4(3):339‐349.
Key References
   Edelman et al., 1995. See above.
   The key reference for the Framesearch algorithm is the poster by Edelman et al. The key reference for a particular implementation of Framesearch is the documentation supplied with that implementation.
Internet Resources
   http://www.accelerys.com/
   Web site of Accelerys, the corporate parent of GCG.
   http://www.cgen.com/
   Web site of the Compugen company.
   http://www.paracel.com/
   Web site of the Paracel company.
   http://www.timelogic.com
   Web site of the TimeLogic company.