Constructing and Refining Multiple Sequence Alignments with PileUp, SeqLab, and the GCG Suite
Abstract
This unit describes how SeqLab, the graphical user interface of the Accelrys GCG Wisconsin Package, can be used to align, annotate, and analyze multiple biological sequences and to export them into alternative formats. The emphasis is on discovering and recognizing common elements within a dataset. The GCG programs covered, some of which are GCG implementations of public domain programs, include LookUp, PileUp, PlotSimilarity, FASTA, Motifs, MEME/MotifSearch, the Profile Package, the HMMER Package, PAUPSearch, and ToFastA. ReadSeq, a non‐GCG public domain program, is also used.
Table of Contents
- Basic Protocol 1: Multiple Sequence Alignment Using PileUp Within SeqLab
- Support Protocol 1: Using LookUp to Assemble a Dataset
- Support Protocol 2: Similarity Searching to Increase (or Decrease) Dataset Size
- Support Protocol 3: Using PlotSimilarity and SeqLab to Improve and Edit the Multiple Sequence Alignment
- Support Protocol 4: Consensus and Masking Issue: GCG's Mask Operation
- Support Protocol 5: Convert a Multiple Sequence Alignment to PAUP* Format for Phylogenetic Analysis
- Support Protocol 6: Convert a GCG Multiple Sequence Alignment to PHYLIP Format for Phylogenetic Analysis
- Basic Protocol 2: Searching PROSITE: GCG's Motifs – A Quick and Dirty Method
- Basic Protocol 3: Searching MEME Within GCG to Identify Motifs
- Basic Protocol 4: Profile‐Analysis: Position‐Specific Weighted Score Matrices of Multiple Sequence Alignments
- Commentary
- Figures
- Tables
Materials
Basic Protocol 1: Multiple Sequence Alignment Using PileUp Within SeqLab
Necessary Resources
Support Protocol 1: Using LookUp to Assemble a Dataset
Necessary Resources
Support Protocol 2: Similarity Searching to Increase (or Decrease) Dataset Size
Necessary Resources
Support Protocol 3: Using PlotSimilarity and SeqLab to Improve and Edit the Multiple Sequence Alignment
Necessary Resources
Support Protocol 4: Consensus and Masking Issue: GCG's Mask Operation
Necessary Resources
Support Protocol 5: Convert a Multiple Sequence Alignment to PAUP* Format for Phylogenetic Analysis
Necessary Resources
Support Protocol 6: Convert a GCG Multiple Sequence Alignment to PHYLIP Format for Phylogenetic Analysis
Necessary Resources
Basic Protocol 2: Searching PROSITE: GCG's Motifs – A Quick and Dirty Method
Necessary Resources
Basic Protocol 3: Searching MEME Within GCG to Identify Motifs
Necessary Resources
Figures
- Figure 3.6.1 The SeqLab Editor window with a LookUp dataset loaded and ready to analyze.
- Figure 3.6.2 An abridged GCG PileUp output MSF file. The format holds the file name, type, date, and checksum, as well as sequence names, checksums, lengths, and weights, and the aligned sequence data in an interleaved fashion.
- Figure 3.6.3 PileUp's similarity dendrogram. The PileUp program automatically plots a cluster dendrogram of the similarities between the sequences of the dataset. The lengths of the vertical lines are proportional to those similarities. This is not an evolutionary tree and should never be presented as one.
- Figure 3.6.4 The PileUp alignment of elongation factor, loaded into the SeqLab Editor, displayed using Residue Coloring.
- Figure 3.6.5 SeqLab can use “cartoons” to graphically display the feature annotation contained in sequence database entries and produced by programs such as Motifs (see Basic Protocol 2). SeqLab merges this annotation with existing datasets using the Add to Editor and Overwrite Old with New function. It also allows the user to zoom in or out on a dataset to see its entire length. This figure shows the PileUp aligned dataset visualized with SeqLab's Graphic Features annotation and a 4:1 zoom ratio. Aligned annotation now includes original database Feature Table sites, plus output from the program Motifs and from the program pair MEME/MotifSearch.
- Figure 3.6.6 The Wisconsin Package SeqLab LookUp window. LookUp is an SRS derivative that allows for the construction of complex, text‐based sequence‐database queries. It produces GCG list file format output.
- Figure 3.6.7 Abridged screen trace of GCG's LookUp output file. Notice the “list file” format that can be read by Wisconsin Package interfaces and programs, such as SeqLab and PileUp.
- Figure 3.6.8 An abridged output list file from GCG's implementation of FASTA. A histogram of score distributions is plotted preceding the list portion of the file, where hits are ranked statistically by E‐value. Normally a pairwise alignment section would follow the list, but that was turned off in this run with the ‐NoAlign option.
- Figure 3.6.9 The SeqLab editor loaded with sorted FASTA output. FASTA can be used as a tool to sort a list into ranked order based on similarity to a particular query. All or any desired portion of this output can then be loaded into the SeqLab editor for further analysis.
- Figure 3.6.10 GCG PlotSimilarity draws a graph of the running similarity along the length of a multiple sequence alignment using a sliding window averaging approach. Peaks are conserved regions, while valleys are dissimilar areas. The ordinate scale comes from the similarity matrix used (by default the BLOSUM62 table).
- Figure 3.6.11 PlotSimilarity can produce a color mask that can be superimposed over an open alignment in the editor. Dark regions now correspond to conserved peaks, whereas valleys are represented by white areas.
- Figure 3.6.12 SeqLab Consensus display of a region near the carboxy termini of the author's EF‐1α example using the BLOSUM30 matrix, 33% required for majority (plurality), and a cutoff value of 4 for the minimum score that represents a match (threshold).
- Figure 3.6.13 SeqLab Consensus mask display of the carboxy terminal region of the author's EF‐1α example using a weight mask generated from the BLOSUM30 matrix, a plurality of 15%, and a threshold of 4.
- Figure 3.6.14 The PAUPSearch program can reliably and quickly extract NEXUS format from GCG multiple sequence alignments using the ‐NoRun option. Zero mask weighted columns are excluded from the file.
- Figure 3.6.15 The GCG ToFastA program reliably converts GCG multiple sequence alignments into Pearson FASTA format (APPENDIX). This conversion takes advantage of the mask sequence to exclude columns with zero weights and changes gap periods and tildes to hyphens.
- Figure 3.6.16 A ReadSeq sample screen trace with user responses highlighted in bold.
- Figure 3.6.17 The beginning of the author's sample dataset in PHYLIP format produced by ReadSeq from a FASTA format file (APPENDIX). ToFastA stripped zero weight columns and changed gap periods and tildes to hyphens; the PHYLIP file reflects this.
- Figure 3.6.18 PROSITE patterns found by the GCG Motifs program in the example elongation factor dataset. Notice the extensive reference discussion for each PROSITE pattern found.
- Figure 3.6.19 Motifs can create an RSF file with the location of PROSITE patterns annotated by color and shape. The display now shows annotation from the database, from Motifs, and from MEME/MotifSearch, using Features Coloring.
- Figure 3.6.20 The unaligned dataset shown using Graphic Features and a 4:1 zoom ratio. Annotations now include the original database Feature Table entries as well as conserved elements discovered by MEME/MotifSearch.
- Figure 3.6.21 A greatly abridged GCG ProfileSearch output list file. Most of the known elongation factors have been edited from the file, although several distant homologues are left intact.
- Figure 3.6.22 The tRNA binding region of a ProfileSegments ‐MSF ‐Global alignment of selected near and distant EF‐1α homologues aligned against the author's example EF‐1α profile.
- Figure 3.6.23 EF‐1α primitive dataset aligned to the Thermus aquaticus EF‐Tu sequence by HmmerAlign. Inferred α‐helices based on the Thermus structure are displayed by Features Coloring (medium grey here). Text annotation lines have also been added to the display where the location of the helices is noted.
- Figure 3.6.24 Screen snapshot of the author's sample alignment showing the same region as Figure 3.6.23, but now including additional HmmerPfam annotation and displayed with Graphic Features. Inferred α‐helices are now seen as transparent red coils (seen here as open box zigzags).
- Figure 3.6.25 A multiple sequence alignment and simple motif consensus of elongation factor Tu/1α from several different organisms illustrates the conservation of the first of several GTP‐binding domains, the P‐Loop, which is the region around position 20 here.
- Figure 3.6.26 A traditional Gribskov‐style profile of the elongation factor 1α/Tu P‐Loop region. The horizontal axis contains all possible amino acid residues and the two gap penalties. The vertical axis lists the consensus positions along the profile. Noted positions from the text are highlighted with gray boxes.
- Figure 3.6.27 SeqLab can be used to align DNA sequences against an already aligned dataset of their translational products. This is sometimes very helpful, especially when phylogenetic inference is the eventual goal.
- Figure 3.6.28 Protein Data Bank 1EFT. The Thermus aquaticus elongation factor Tu structure in its GTP conformation (Kjeldgaard et al., 1993). Structural visualization by RasMol (Sayle and Milner‐White, 1995).
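The conversion described in the captions for Figures 3.6.15 and 3.6.17 (dropping zero‐weight mask columns and turning gap periods and tildes into hyphens) can be sketched in a few lines of Python. This is an illustrative sketch only, not the ToFastA implementation: the function names, the dictionary input, and the list‐of‐integers mask are assumptions made for the example, and parsing of real MSF or RSF files is omitted.

```python
def apply_mask(alignment, mask):
    """Drop zero-weight columns and normalise gap symbols.

    alignment: dict mapping sequence name -> aligned sequence string
               (hypothetical in-memory form of a GCG alignment).
    mask:      list of integer column weights, one per alignment column
               (hypothetical stand-in for a SeqLab mask sequence).
    """
    # Indices of the columns whose mask weight is non-zero.
    keep = [i for i, weight in enumerate(mask) if weight != 0]
    result = {}
    for name, seq in alignment.items():
        if len(seq) != len(mask):
            raise ValueError(f"{name}: sequence and mask lengths differ")
        kept = "".join(seq[i] for i in keep)
        # GCG writes gaps as '.' and unknown end regions as '~';
        # both become '-' in the exported file.
        result[name] = kept.replace(".", "-").replace("~", "-")
    return result


def to_fasta(alignment, width=60):
    """Render the alignment as Pearson FASTA text, wrapped at `width`."""
    lines = []
    for name, seq in alignment.items():
        lines.append(f">{name}")
        for start in range(0, len(seq), width):
            lines.append(seq[start:start + width])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Toy data invented for illustration; not taken from the EF-1 alpha dataset.
    toy = {"seq_a": "~~MGKEK.HIN", "seq_b": "MSKEKTHIN.."}
    toy_mask = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0]
    print(to_fasta(apply_mask(toy, toy_mask)), end="")
```

Keeping the masking step separate from the FASTA rendering mirrors the two effects the captions describe, and the masked dictionary could be passed to a different writer (for example a PHYLIP writer) without change.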
Literature Cited
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403‐410.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI‐BLAST: A new generation of protein database search programs. Nucl. Acids Res. 25:3389‐3402.
Bailey, T.L. and Elkan, C. 1994. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology (R. Altman, D. Brutlag, P. Karp, R. Lathrop, and D. Searls, eds.), pp. 28‐36. AAAI Press, Menlo Park, Calif.
Bailey, T.L. and Gribskov, M. 1998. Combining evidence using p‐values: Application to sequence homology searches. Bioinformatics 14:48‐54.
Bairoch, A. 1992. PROSITE: A dictionary of sites and patterns in proteins. Nucl. Acids Res. 20:2013‐2018.
Dobzhansky, T., Ayala, F.J., Stebbins, G.L., and Valentine, J.W. 1977. Evolution. W.H. Freeman and Co., San Francisco, Calif. [The source of the original 1973 quote is obscure, though it has been cited as being transcribed from the American Biology Teacher, March 1973, 35:125‐129.]
Doolittle, R.F. 1986. Of URFs and ORFs: A primer on how to analyze derived amino acid sequences. University Science Books, Mill Valley, Calif.
Eddy, S.R. 1996. Hidden Markov models. Curr. Opin. Struct. Biol. 6:361‐365.
Eddy, S.R. 1998. Profile hidden Markov models. Bioinformatics 14:755‐763.
Etzold, T. and Argos, P. 1993. SRS – An indexing and retrieval tool for flat file data libraries. Comput. Appl. Biosci. 9:49‐57.
Feng, D.F. and Doolittle, R.F. 1987. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25:351‐360.
Gribskov, M., McLachlan, A.D., and Eisenberg, D. 1987. Profile analysis: Detection of distantly related proteins. Proc. Natl. Acad. Sci. U.S.A. 84:4355‐4358.
Gribskov, M., Luethy, R., and Eisenberg, D. 1989. Profile analysis. Meth. Enzymol. 183:146‐159.
Hasegawa, M., Hashimoto, T., Adachi, J., Iwabe, N., and Miyata, T. 1993. Early branchings in the evolution of Eukaryotes: Ancient divergence of Entamoeba that lacks mitochondria revealed by protein sequence data. J. Mol. Evol. 36:380‐388.
Henikoff, S. and Henikoff, J.G. 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. U.S.A. 89:10915‐10919.
Iwabe, N., Kuma, K.‐I., Hasegawa, M., Osawa, S., and Miyata, T. 1989. Evolutionary relationship of Archaebacteria, Eubacteria, and Eukaryotes inferred from phylogenetic trees of duplicated genes. Proc. Natl. Acad. Sci. U.S.A. 86:9355‐9359.
Kjeldgaard, M., Nissen, P., Thirup, S., and Nyborg, J. 1993. The crystal structure of elongation factor EF‐Tu from Thermus aquaticus in the GTP conformation. Structure 1:35‐50.
Madsen, H.O., Poulsen, K., Dahl, O., Clark, B.F., and Hjorth, J.P. 1990. Retropseudogenes constitute the major part of the human elongation factor 1 alpha gene family. Nucl. Acids Res. 18:1513‐1516.
Pearson, W.R. 1998. Empirical statistical estimates for sequence similarity searches. J. Mol. Biol. 276:71‐84.
Pearson, W.R. and Lipman, D.J. 1988. Improved tools for biological sequence analysis. Proc. Natl. Acad. Sci. U.S.A. 85:2444‐2448.
Pearson, P., Francomano, C., Foster, P., Bocchini, C., Li, P., and McKusick, V. 1994. The status of online Mendelian inheritance in man (OMIM). Nucl. Acids Res. 22:3470‐3473.
Rivera, M.C. and Lake, J.A. 1992. Evidence that Eukaryotes and Eocyte Prokaryotes are immediate relatives. Science 257:74‐76.
Saraste, M., Sibbald, P.R., and Wittinghofer, A. 1990. The P‐loop – a common motif in ATP‐ and GTP‐binding proteins. Trends Biochem. Sci. 15:430‐434.
Sayle, R. and Milner‐White, E.J. 1995. RasMol: Biomolecular graphics for all. Trends Biochem. Sci. 20:374‐376.
Schwartz, R.M. and Dayhoff, M.O. 1979. Matrices for detecting distant relationships. In Atlas of Protein Sequence and Structure, Vol. 5 (M.O. Dayhoff, ed.) pp. 353‐358. National Biomedical Research Foundation, Washington, D.C.
Smith, S.W., Overbeek, R., Woese, C.R., Gilbert, W., and Gillevet, P.M. 1994. The genetic data environment, an expandable GUI for multiple sequence analysis. Comput. Appl. Biosci. 10:671‐675.
Sogin, M.L., Morrison, H.G., Hinkle, G., and Silberman, J.D. 1996. Ancestral relationships of the major eukaryotic lineages. Microbiologia SEM 12:17‐28.
Thompson, J.D., Higgins, D.G., and Gibson, T.J. 1994. ClustalW: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position‐specific gap penalties and weight matrix choice. Nucl. Acids Res. 22:4673‐4680.
Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F., and Higgins, D.G. 1997. The ClustalX windows interface: Flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucl. Acids Res. 25:4876‐4882.
von Heijne, G. 1987. Sequence Analysis in Molecular Biology: Treasure Trove or Trivial Pursuit. Academic Press, San Diego.
Internet Resources
http://evolution.genetics.washington.edu/phylip.html
The Phylogeny Inference Package (PHYLIP), version 3.5+, is public domain software distributed by the author, J. Felsenstein. It is available online from the Department of Genome Sciences, University of Washington, Seattle.
http://www.uni-giessen.de/~gx1052/ECDC/ecdc.htm
The E. coli Database Collection (ECDC). The K12 chromosome. Available online from Justus‐Liebig‐Universitaet, Giessen, Germany.
http://www.accelrys.com/products/gcg_wisconsin_package/index.html
The Wisconsin Package, version 10.3, is copyright protected (1982–2002) and is available from the Genetics Computer Group (GCG), a part of Accelrys, which is in turn a subsidiary of Pharmacopeia. The home page includes a copy of the program manual.
The Wisconsin Package provides a comprehensive toolkit of almost 150 integrated DNA and protein analysis programs, from database, pattern, and motif searching, fragment assembly, mapping, and sequence comparison, to gene finding, protein and evolutionary analysis, primer selection, and DNA and RNA secondary structure prediction. The SeqLab X‐windows based graphical user interface (GUI) is a front end to the package. It provides an alternative to the Unix command line by allowing menu‐driven access to most of GCG's programs. SeqLab is based on the Genetic Data Environment (GDE) of Steve Smith and collaborators (Smith et al., 1994) and makes running the Wisconsin Package much easier by providing a common editing interface from which most programs can be launched and alignments manipulated.
http://iubio.bio.indiana.edu/soft/molbio/readseq
ReadSeq is public domain software distributed by the author, D.G. Gilbert, and is available from the Bioinformatics Group at the Biology Department of Indiana University, Bloomington.
http://www.ncbi.nlm.nih.gov/Entrez
Entrez is public domain software distributed by the authors and available from the National Center for Biotechnology Information (NCBI) at the National Library of Medicine, National Institutes of Health (NIH), Bethesda, Maryland.
http://www.ncbi.nlm.nih.gov/omim
Online Mendelian Inheritance in Man (OMIM). Available from the Center for Medical Genetics, Johns Hopkins University, Baltimore, Maryland, and the NCBI at the National Library of Medicine, NIH, Bethesda, Maryland. Also see Pearson et al. (1994).
http://www.sinauer.com
Phylogenetic Analysis Using Parsimony (PAUP*) was developed by D.L. Swofford (copyright, 1989–2002). The official homepage is located at Florida State University (see below). A 4.0 beta version is available at the time of this writing, and is distributed by Sinauer Associates.
http://paup.csit.fsu.edu