Constructing and Refining Multiple Sequence Alignments with PileUp, SeqLab, and the GCG Suite

Internet, 2013-12-31


Abstract

This unit describes how the SeqLab graphical user interface of the Accelrys GCG Wisconsin Package can be used to align, annotate, and analyze multiple biological sequences, and to export them into alternative formats. The emphasis is on discovering and recognizing common elements within the dataset. The GCG programs investigated, some of which are GCG implementations of public domain programs, include LookUp, PileUp, PlotSimilarity, FASTA, Motifs, MEME/MotifSearch, the Profile Package, the HMMER Package, PAUPSearch, and ToFastA. ReadSeq, a non‐GCG public domain program, is also used.

GO TO THE FULL PROTOCOL:

PDF or HTML at Wiley Online Library

Basic Protocol 1: Multiple Sequence Alignment Using PileUp Within SeqLab
Support Protocol 1: Using LookUp to Assemble a Dataset
Support Protocol 2: Similarity Searching to Increase (or Decrease) Dataset Size
Support Protocol 3: Using PlotSimilarity and SeqLab to Improve and Edit the Multiple Sequence Alignment
Support Protocol 4: Consensus and Masking Issue: GCG's Mask Operation
Support Protocol 5: Convert a Multiple Sequence Alignment to PAUP* Format for Phylogenetic Analysis
Support Protocol 6: Convert a GCG Multiple Sequence Alignment to PHYLIP Format for Phylogenetic Analysis
Basic Protocol 2: Searching Prosite: GCG'S Motifs—A Quick and Dirty Method
Basic Protocol 3: Searching MEME Within GCG to Identify Motifs
Basic Protocol 4: Profile‐Analysis: Position‐Specific Weighted Score Matrices of Multiple Sequence Alignments
Commentary
Figures
Tables

Materials

Basic Protocol 1: Multiple Sequence Alignment Using PileUp Within SeqLab

Necessary Resources

Hardware
- Terminal or personal workstation with access to a Unix server running commercial GCG software
Software
- SeqLab (GCG Wisconsin Package; see Internet Resources )
- X‐server graphics communications software ( appendix 1D )
X‐server emulation software needs to be installed separately on personal‐style Microsoft Windows or Macintosh machines, but genuine X‐Windowing comes standard with most Unix operating systems. Microsoft Windows machines are often set up with either XWin32 or eXceed to provide this function, while Macintoshes are often loaded with either MacX or eXodus software. The details of X are provided in appendix 1D. If the user is unsure of these procedures, assistance from local computer support personnel should be sought.
Files
- Protein or DNA sequences of interest in GCG format (e.g., from LookUp in the GCG package and/or FASTA; see Support Protocols 1 and 2)

Support Protocol 1: Using LookUp to Assemble a Dataset

Necessary Resources

Hardware
- Terminal or personal workstation with access to a Unix server running commercial GCG software
Software
- LookUp (GCG Wisconsin Package; see Internet Resources )
- X‐server graphics communications software ( appendix 1D )
X‐server emulation software needs to be installed separately on personal‐style Microsoft Windows or Macintosh machines, but genuine X‐Windowing comes standard with most Unix operating systems. Microsoft Windows machines are often set up with either XWin32 or eXceed to provide this function, while Macintoshes are often loaded with either MacX or eXodus software. The details of X are given in appendix 1D . If the user is unsure of these procedures, assistance from local computer support personnel should be sought.
Files
- None

Support Protocol 2: Similarity Searching to Increase (or Decrease) Dataset Size

Necessary Resources

Hardware
- Terminal or personal workstation with access to a Unix server running commercial GCG software
Software
- SeqLab (GCG Wisconsin Package; see Internet Resources )
- X‐server graphics communications software ( appendix 1D )
X‐server emulation software needs to be installed separately on personal‐style Microsoft Windows or Macintosh machines, but genuine X‐Windowing comes standard with most Unix operating systems. Microsoft Windows machines are often set up with either XWin32 or eXceed to provide this function, while Macintoshes are often loaded with either MacX or eXodus software. The details of X are given in appendix 1D. If the user is unsure of these procedures, ask for assistance from local computer support personnel.
Files
- Protein or DNA sequences of interest in GCG format (e.g., from LookUp in the GCG package; see protocol 2 )

Support Protocol 3: Using PlotSimilarity and SeqLab to Improve and Edit the Multiple Sequence Alignment

Necessary Resources

Hardware
- Terminal or personal workstation with access to a Unix server running commercial GCG software
Software
- PlotSimilarity and SeqLab (GCG Wisconsin Package; see Internet Resources )
- X‐server graphics communications software ( appendix 1D )
X‐server emulation software needs to be installed separately on personal‐style Microsoft Windows or Macintosh machines, but genuine X‐Windowing comes standard with most Unix operating systems. Microsoft Windows machines are often set up with either XWin32 or eXceed to provide this function, while Macintoshes are often loaded with either MacX or eXodus software. The details of X are provided in appendix 1D. If the user is unsure of these procedures, ask for assistance from local computer support personnel.
Files
- Multiple sequence alignment in GCG format (see protocol 2 )

Support Protocol 4: Consensus and Masking Issue: GCG's Mask Operation

Necessary Resources

Hardware
- Terminal or personal workstation with access to a Unix server running commercial GCG software
Software
- SeqLab (GCG Wisconsin Package; see Internet Resources )
- X‐server graphics communications software ( appendix 1D )
X‐server emulation software needs to be installed separately on personal‐style Microsoft Windows or Macintosh machines, but genuine X‐Windowing comes standard with most Unix operating systems. Microsoft Windows machines are often set up with either XWin32 or eXceed to provide this function, while Macintoshes are often loaded with either MacX or eXodus software. The details of X are discussed in appendix 1D . If the user is unsure of these procedures, ask for assistance from local computer support personnel.
Files
- Multiple sequence alignment in GCG format (see protocol 2 )

Support Protocol 5: Convert a Multiple Sequence Alignment to PAUP* Format for Phylogenetic Analysis

Necessary Resources

Hardware
- Terminal or personal workstation with access to a Unix server running commercial GCG software
Software
- SeqLab (GCG Wisconsin Package; see Internet Resources )
- X‐server graphics communications software ( appendix 1D )
X‐server emulation software needs to be installed separately on personal‐style Microsoft Windows or Macintosh machines, but genuine X‐Windowing comes standard with most Unix operating systems. Microsoft Windows machines are often set up with either XWin32 or eXceed to provide this function, while Macintoshes are often loaded with either MacX or eXodus software. The details of X are provided in appendix 1D . If the user is unsure of these procedures, ask for assistance from local computer support personnel.
Files
- Multiple sequence alignment loaded into SeqLab (see protocol 1 )

Support Protocol 6: Convert a GCG Multiple Sequence Alignment to PHYLIP Format for Phylogenetic Analysis

Necessary Resources

Hardware
- Terminal or personal workstation with access to a Unix server running commercial GCG software
Software
- SeqLab (GCG Wisconsin Package; see Internet Resources )
- ReadSeq (D.G. Gilbert; see Internet Resources ; appendix 1E )
- X‐server graphics communications software ( appendix 1D )
X‐server emulation software needs to be installed separately on personal‐style Microsoft Windows or Macintosh machines, but genuine X‐Windowing comes standard with most Unix operating systems. Microsoft Windows machines are often set up with either XWin32 or eXceed to provide this function, while Macintoshes are often loaded with either MacX or eXodus software. The details of X are provided in appendix 1D . If the user is unsure of these procedures, ask for assistance from local computer support personnel.
Files
- Multiple sequence alignment loaded into SeqLab (see protocol 1 ).

Basic Protocol 2: Searching Prosite: GCG'S Motifs—A Quick and Dirty Method

Necessary Resources

Hardware
- Terminal or personal workstation with access to a Unix server running commercial GCG software
Software
- MotifSearch and SeqLab (GCG Wisconsin Package; see Internet Resources )
- X‐server graphics communications software ( appendix 1D )
X‐server emulation software needs to be installed separately on personal‐style Microsoft Windows or Macintosh machines, but genuine X‐Windowing comes standard with most Unix operating systems. Microsoft Windows machines are often set up with either XWin32 or eXceed to provide this function, while Macintoshes are often loaded with either MacX or eXodus software. The details of X are provided in appendix 1D . If the user is unsure of these procedures, ask for assistance from local computer support personnel.
Files
- Protein or DNA sequences of interest in GCG format (e.g., from LookUp in the GCG package; see protocol 2 ; also see Internet Resources )

Basic Protocol 3: Searching MEME Within GCG to Identify Motifs

Necessary Resources

Hardware
- Terminal or personal workstation with access to a Unix server running commercial GCG software
Software
- SeqLab (GCG Wisconsin Package; see Internet Resources )
- X‐server graphics communications software ( appendix 1D )
X‐server emulation software needs to be installed separately on personal‐style Microsoft Windows or Macintosh machines, but genuine X‐Windowing comes standard with most Unix operating systems. Microsoft Windows machines are often set up with either XWin32 or eXceed to provide this function, while Macintoshes are often loaded with either MacX or eXodus software. The details of X are provided in appendix 1D . If the user is unsure of these procedures, ask for assistance from local computer support personnel.
Files
- Multiple sequence alignment loaded into SeqLab (see protocol 1 ).

Figures

Figure 3.6.1 The SeqLab Editor window with a LookUp dataset loaded and ready to analyze.

Figure 3.6.2 An abridged GCG PileUp output MSF file. The format holds the file name, type, date, and checksum, as well as sequence names, checksums, lengths, and weights, and the aligned sequence data in an interleaved fashion.

Figure 3.6.3 PileUp's similarity dendrogram. The PileUp program automatically plots a cluster dendrogram of the similarities between the sequences of the dataset. The lengths of the vertical lines are proportional to those similarities. This is not an evolutionary tree and should never be presented as one.

Figure 3.6.4 The PileUp alignment of elongation factor, loaded into the SeqLab Editor, displayed using Residue Coloring.

Figure 3.6.5 SeqLab can use “cartoons” to graphically display the feature annotation contained in sequence database entries and produced by programs such as Motifs. SeqLab merges this annotation with existing datasets using the Add to Editor and Overwrite Old with New function. It also allows the user to zoom in or out on a dataset to see its entire length. This figure shows the PileUp aligned dataset visualized with SeqLab's Graphic Features annotation and a 4:1 zoom ratio. Aligned annotation now includes original database Feature Table sites, plus output from the program Motifs, and from the program pair MEME/MotifSearch.

Figure 3.6.6 The Wisconsin Package SeqLab LookUp window. LookUp is an SRS derivative that allows for the construction of complex, text‐based sequence‐database queries. It produces GCG list file format output.

Figure 3.6.7 Abridged screen trace of GCG's LookUp output file. Notice the “list file” format that can be read by Wisconsin Package interfaces and programs, such as SeqLab and PileUp.

Figure 3.6.8 An abridged output list file from GCG's implementation of FASTA. A histogram of score distributions is plotted preceding the list portion of the file where hits are ranked statistically by E‐value. Normally a pairwise alignment section would follow the list, but that was turned off in this run with the ‐NoAlign option.

Figure 3.6.9 The SeqLab editor loaded with sorted FASTA output. FASTA can be used as a tool to sort a list into ranked order based on similarity to a particular query. All or any desired portion of this output can then be loaded into the SeqLab editor for further analysis.

Figure 3.6.10 GCG PlotSimilarity draws a graph of the running similarity along the length of a multiple sequence alignment using a sliding window averaging approach. Peaks are conserved regions, while valleys are dissimilar areas. The ordinate scale comes from the similarity matrix used (by default the BLOSUM62 table).
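
The sliding window averaging idea behind such a plot can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not GCG's implementation: the helper names (`column_score`, `running_similarity`) are hypothetical, and the per-column score is a simple fraction of identical residue pairs rather than values from the BLOSUM62 table, so the ordinate scale differs from that of PlotSimilarity.

```python
from itertools import combinations

# Characters treated as gaps (GCG alignments use periods and tildes).
GAP_CHARS = set(".-~")

def column_score(column):
    """Fraction of sequence pairs in one column that share an identical residue.

    Assumption: identity scoring stands in for a real similarity matrix.
    """
    pairs = list(combinations(column, 2))
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b and a not in GAP_CHARS)
    return matches / len(pairs)

def running_similarity(alignment, window=3):
    """Average the per-column scores over a window centered on each column.

    Windows are truncated at the ends of the alignment.
    """
    length = len(alignment[0])
    scores = [column_score([seq[i] for seq in alignment]) for i in range(length)]
    half = window // 2
    smoothed = []
    for i in range(length):
        lo, hi = max(0, i - half), min(length, i + half + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    return smoothed

# Toy alignment: the first two columns are fully conserved, the last two are not.
profile = running_similarity(["ACDE", "ACDF", "ACGE"], window=3)
print(profile)
```

High values in the returned list correspond to the peaks (conserved regions) of the plot and low values to the valleys.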

Figure 3.6.11 PlotSimilarity can produce a color mask that can be superimposed over an open alignment in the editor. Dark regions now correspond to conserved peaks, whereas valleys are represented by white areas.

Figure 3.6.12 SeqLab Consensus display of a region near the carboxy termini of the author's EF‐1α example using the BLOSUM30 matrix, 33% required for majority (plurality), and a cutoff value of 4 for the minimum score that represents a match (threshold).
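
The role of the plurality setting can be illustrated with a deliberately simplified Python sketch. It is an assumption-laden stand-in, not SeqLab's method: the function name `plurality_consensus` is hypothetical, residues only "vote" for themselves (exact identity), and the matrix-based threshold described in the caption is ignored entirely.

```python
from collections import Counter

# Characters treated as gaps (GCG alignments use periods and tildes).
GAPS = set(".-~")

def plurality_consensus(alignment, plurality=0.33, unknown="x"):
    """Return one consensus character per alignment column.

    The most common residue is reported only if it occurs in at least
    `plurality` (a fraction) of the sequences; otherwise `unknown` is used.
    Columns made up entirely of gaps are reported as '-'.
    """
    consensus = []
    for column in zip(*alignment):
        residues = [c.upper() for c in column if c not in GAPS]
        if not residues:
            consensus.append("-")
            continue
        residue, count = Counter(residues).most_common(1)[0]
        consensus.append(residue if count / len(column) >= plurality else unknown)
    return "".join(consensus)

# Toy alignment: column 1 is invariant, column 2 has no majority,
# column 3 has a residue in half of the sequences.
print(plurality_consensus(["MKV", "MRV", "MA.", "MT."], plurality=0.33))
```

Lowering `plurality` makes the consensus more permissive, which is the same direction of effect as the lower plurality used for the mask display in the next figure.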

Figure 3.6.13 SeqLab Consensus mask display of the carboxy terminal region of the author's EF‐1α example using a weight mask generated from the BLOSUM30 matrix, a plurality of 15%, and a threshold of 4.

Figure 3.6.14 The PAUPSearch program can reliably and quickly extract NEXUS format from GCG multiple sequence alignments using the ‐NoRun option. Zero mask weighted columns are excluded from the file.

Figure 3.6.15 The GCG ToFastA program reliably converts GCG multiple sequence alignments into Pearson FASTA format ( APPENDIX ). This conversion takes advantage of the mask sequence to exclude columns with zero weights and changes gap periods and tildes to hyphens.
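
The two transformations named in this caption (dropping zero-weight mask columns and rewriting gap periods and tildes as hyphens) can be sketched in Python as follows. This is a minimal illustration, not the ToFastA program itself: the function name `msa_to_fasta`, the representation of the mask as a string of one digit per column, and the 60-character line width are all assumptions made for the example.

```python
def msa_to_fasta(names, seqs, mask=None, width=60):
    """Convert aligned sequences to Pearson FASTA text.

    names : list of sequence names
    seqs  : list of aligned sequences of equal length
    mask  : optional string with one weight digit per column (assumed
            representation); columns whose weight is '0' are excluded
    """
    if mask is None:
        keep = range(len(seqs[0]))
    else:
        keep = [i for i, weight in enumerate(mask) if weight != "0"]
    records = []
    for name, seq in zip(names, seqs):
        # Keep unmasked columns, then convert GCG gap symbols to hyphens.
        cleaned = "".join(seq[i] for i in keep).replace(".", "-").replace("~", "-")
        lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
        records.append(">" + name + "\n" + "\n".join(lines))
    return "\n".join(records) + "\n"

# Toy example: columns 1 and 5 carry zero weight and are dropped.
print(msa_to_fasta(["a", "b"], ["~MK.LV", "AMKTLV"], mask="011101"))
```

Because hyphenated, mask-stripped FASTA is a plain interchange format, a file of this shape is what a converter such as ReadSeq would then take as input.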

Figure 3.6.16 A ReadSeq sample screen trace with user responses highlighted in bold.

Figure 3.6.17 The beginning of the author's sample dataset in PHYLIP format produced by ReadSeq from a FASTA format file ( APPENDIX ). ToFastA stripped zero weight columns and changed gap periods and tildes to hyphens; the PHYLIP file reflects this.

Figure 3.6.18 PROSITE patterns found by the GCG Motifs program in the example elongation factor dataset. Notice the extensive reference discussion for each PROSITE pattern found.

Figure 3.6.19 Motifs can create an RSF file with the location of PROSITE patterns annotated by color and shape. The display now shows annotation from the database, from Motifs, and from MEME/MotifSearch, using Features Coloring.

Figure 3.6.20 The unaligned dataset shown using Graphic Features and a 4:1 zoom ratio. Annotations now include the original database Feature Table entries as well as conserved elements discovered by MEME/MotifSearch.

Figure 3.6.21 A greatly abridged GCG ProfileSearch output list file. Most of the known elongation factors have been edited from the file, although several distant homologues are left intact.

Figure 3.6.22 The tRNA‐binding region of a ProfileSegments ‐MSF ‐Global alignment of selected near and distant EF‐1α homologues aligned against the author's example EF‐1α profile.

Figure 3.6.23 EF‐1α primitive dataset aligned to the Thermus aquaticus EF‐Tu sequence by HmmerAlign. Inferred α‐helices based on the Thermus structure are displayed by Features Coloring (medium grey here). Text annotation lines have also been added to the display where the location of the helices is noted.

Figure 3.6.24 Screen snapshot of the author's sample alignment showing the same region as Figure , but now including additional HmmerPfam annotation and displayed with Graphic Features. Inferred α‐helices are now seen as transparent red coils (seen here as open box zigzags).

Figure 3.6.25 A multiple sequence alignment and simple motif consensus of elongation factor Tu/1α from several different organisms illustrates the conservation of the first of several GTP‐binding domains, the P‐loop, in the region around position 20 here.

Figure 3.6.26 A traditional Gribskov‐style profile of the elongation factor 1α/Tu P‐Loop region. The horizontal axis contains all possible amino acid residues and the two gap penalties. The vertical axis lists the consensus positions along the profile. Noted positions from the text are highlighted with gray boxes.

Figure 3.6.27 SeqLab can be used to align DNA sequences against an already aligned dataset of its translational products. This is sometimes very helpful, especially when phylogenetic inference is the eventual goal.

Figure 3.6.28 Protein Data Bank 1EFT. The Thermus aquaticus elongation factor Tu structure in its GTP conformation (Kjeldgaard et al., 1993). Structural visualization by RasMol (Sayle and Milner‐White, 1995).

Literature Cited
	Altschul, S.F., Gish, W., Miller, W., Myers, E. W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403‐410.
	Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI‐BLAST: A new generation of protein database search programs. Nucl. Acids Res. 25:3389‐3402.
	Bailey, T.L. and Elkan, C. 1994. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology (R. Altman, D. Brutlag, P. Karp, R. Lathrop, and D. Searls, eds.), pp.28‐36. AAAI Press, Menlo Park, Calif.
	Bailey, T.L. and Gribskov, M. 1998. Combining evidence using p‐values: Application to sequence homology searches. Bioinformatics 14:48‐54.
	Bairoch, A. 1992. PROSITE: A dictionary of sites and patterns in proteins. Nucl. Acids Res. 20:2013‐2018.
	Dobzhansky, T., Ayala, F.J., Stebbins, G.L., and Valentine, J.W. 1977. Evolution. W.H. Freeman and Co. San Francisco, Calif. [The source of the original 1973 quote is obscure though it has been cited as being transcribed from the American Biology Teacher, March 1973, 35:125‐129].
	Doolittle, R.F. 1986. Of URFs and ORFs: A primer on how to analyze derived amino acid sequences. University Science Books, Mill Valley, Calif.
	Eddy, S.R. 1996. Hidden Markov models. Curr. Opin. Struct. Biol. 6:361‐365.
	Eddy, S.R. 1998. Profile hidden Markov models. Bioinformatics 14:755‐763.
	Etzold, T. and Argos, P. 1993. SRS — An indexing and retrieval tool for flat file data libraries. Comp. App. Biosci. 9:49‐57.
	Feng, D.F. and Doolittle, R. F. 1987. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25:351‐360.
	Gribskov, M., McLachlan, A.D., and Eisenberg, D. 1987. Profile analysis: Detection of distantly related proteins. Proc. Nat. Acad. Sci. U.S.A. 84:4355‐4358.
	Gribskov, M., Luethy, R., and Eisenberg, D. 1989. Profile analysis. Meth. Enzym. 183:146‐159.
	Hasegawa, M., Hashimoto, T., Adachi, J., Iwabe, N., and Miyata, T. 1993. Early branchings in the evolution of Eukaryotes: Ancient divergence of Entamoeba that lacks mitochondria revealed by protein sequence data. J. Mol. Evol. 36:380‐388.
	Henikoff, S. and Henikoff, J.G. 1992. Amino acid substitution matrices from protein blocks. Proc. Nat. Acad. Sci. U.S.A. 89:10915‐10919.
	Iwabe, N., Kuma, E.‐I., Hasegawa, M., Osawa, S., and Miyata, T. 1989. Evolutionary relationship of Archaebacteria, Eubacteria, and Eukaryotes inferred from phylogenetic trees of duplicated genes. Proc. Nat. Acad. Sci. U.S.A. 86:9355‐9359.
	Kjeldgaard, M., Nissen, P., Thirup, S., and Nyborg, J. 1993. The crystal structure of elongation factor EF‐Tu from Thermus aquaticus in the GTP conformation. Structure 1:35‐50.
	Madsen, H.O., Poulsen, K., Dahl, O., Clark, B.F., and Hjorth, J.P. 1990. Retropseudogenes constitute the major part of the human elongation factor 1 alpha gene family. Nucl. Acids Res. 18:1513‐1516.
	Pearson, W.R. 1998. Empirical statistical estimates for sequence similarity searches. J. Mol. Biol. 276:71‐84.
	Pearson, W.R. and Lipman, D.J. 1988. Improved tools for biological sequence comparison. Proc. Nat. Acad. Sci. U.S.A. 85:2444‐2448.
	Pearson, P., Francomano, C., Foster, P., Bocchini, C., Li, P., and McKusick, V. 1994. The status of online Mendelian inheritance in man (OMIM). Nucl. Acids Res. 22:3470‐3473.
	Rivera, M.C. and Lake, J.A. 1992. Evidence that Eukaryotes and Eocyte Prokaryotes are immediate relatives. Science 257:74‐76.
	Saraste, M., Sibbald, P.R., and Wittinghofer, A. 1990. The P‐loop—a common motif in ATP‐ and GTP‐binding proteins. T.I.B.S. 15:430‐434.
	Sayle, R. and Milner‐White, E.J. 1995. RasMol: Biomolecular graphics for all. T.I.B.S. 20:374‐376.
	Schwartz, R.M. and Dayhoff, M.O. 1979. Matrices for detecting distant relationships. In Atlas of Protein Sequences and Structure, Vol.5 (M.O. Dayhoff, ed.) pp.353‐358. National Biomedical Research Foundation, Washington, D.C.
	Smith, S.W., Overbeek, R., Woese, C.R., Gilbert, W., and Gillevet, P.M. 1994. The genetic data environment, an expandable GUI for multiple sequence analysis. Comp. App. Biosci. 10:671‐675.
	Sogin, M.L., Morrison, H.G., Hinkle, G., and Silberman, J.D. 1996. Ancestral relationships of the major eukaryotic lineages. Microbiologia Sem. 12:17‐28.
	Thompson, J.D., Higgins, D.G., and Gibson, T.J. 1994. ClustalW: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position‐specific gap penalties and weight matrix choice. Nucl. Acids Res. 22:4673‐4680.
	Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F., and Higgins, D.G. 1997. The ClustalX windows interface: Flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucl. Acids Res. 25:4876‐4882.
	von Heijne, G. 1987. Sequence Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit. Academic Press, San Diego.
Internet Resources
	http://evolution.genetics.washington.edu/phylip.html
	The Phylogeny Inference Package (PHYLIP), version 3.5+, is public domain software distributed by the author, J. Felsenstein. It is available online from the Department of Genome Sciences, University of Washington, Seattle.
	http://www.uni‐giessen.de/∼gx1052/ECDC/ecdc.htm
	The E. coli Database Collection (ECDC). The K12 chromosome. Available online from Justus‐Liebig‐Universitaet, Giessen, Germany.
	http://www.accelrys.com/products/gcg_wisconsin_package/index.html
	The Wisconsin Package, version 10.3, is available from the Genetics Computer Group (GCG), a part of Accelrys, which is in turn a subsidiary of Pharmacopeia, and is copyright protected (1982–2002). The home page includes a copy of the program manual.
	The Wisconsin Package provides a comprehensive toolkit of almost 150 integrated DNA and protein analysis programs, from database, pattern and motif searching, fragment assembly, mapping, and sequence comparison, to gene finding, protein and evolutionary analysis, primer selection, and DNA and RNA secondary structure prediction. The SeqLab X‐windows based graphical user interface (GUI) is a front end to the package. It provides an alternative to the Unix command line by allowing menu‐driven access to most of GCG's programs. SeqLab is based on the genetic data environment (GDE) of Steve Smith and collaborators (Smith et al., 1994) and makes running the Wisconsin Package much easier by providing a common editing interface from which most programs can be launched and alignments manipulated.
	http://iubio.bio.indiana.edu/soft/molbio/readseq
	ReadSeq is public domain software distributed by the author, D.G. Gilbert, and is available from the Bioinformatics Group at the Biology Department of Indiana University, Bloomington.
	http://www.ncbi.nlm.nih.gov/Entrez
	Entrez is public domain software distributed by the authors and available from the National Center for Biotechnology Information (NCBI) at the National Library of Medicine, National Institutes of Health (NIH), Bethesda, Maryland.
	http://www.ncbi.nlm.nih.gov/omim
	Online Mendelian Inheritance in Man (OMIM). Available from the Center for Medical Genetics, Johns Hopkins University, Baltimore, Maryland, and the NCBI at the National Library of Medicine, NIH, Bethesda, Maryland. Also see Pearson et al. (1994).
	http://www.sinauer.com
	Phylogenetic Analysis Using Parsimony (PAUP*) was developed by D.L. Swofford (copyright, 1989–2002). The official homepage is located at Florida State University (see below). A 4.0 beta version is available at the time of this writing, and is distributed by Sinauer Associates.
	http://paup.csit.fsu.edu