
Using the Gibbs Motif Sampler to Find Conserved Domains in DNA and Protein Sequences


Abstract

 

The Gibbs Motif Sampler (Gibbs) is a software package for discovering conserved elements in biopolymer sequences. This unit describes the basic operation of the Web-based interface to Gibbs, gives advanced examples of its use, and covers the Web interface to dscan, a sequence database search program.

Keywords: Gibbs sampling; Transcription factor binding site; Sequence alignment; Motif; DNA; Protein; Phylogenetic footprinting; Stochastic algorithm; Markov chain Monte Carlo; Bayesian statistics

     
 
GO TO THE FULL PROTOCOL:
PDF or HTML at Wiley Online Library

Table of Contents

  • Basic Protocol 1: Running the Gibbs Motif Sampler
  • Basic Protocol 2: Searching for Other Sequences Containing Similar Motifs Using dscan
  • Guidelines for Understanding Results
  • Commentary
  • Appendix A
  • Appendix B
  • Literature Cited
  • Figures
  • Tables
     
 


Figures

  • Figure 2.8.1 The contents of the file crp.dat.
  • Figure 2.8.2 The Gibbs home page.
  • Figure 2.8.3 The basic Gibbs entry form. The fields marked with an asterisk (*) are the minimum entries required to run Gibbs in Site Sampler mode.
  • Figure 2.8.4 dscan motif models. (A) dscan model consisting of a collection of sites produced by Gibbs. (B) The same data displayed as a count matrix. (C) The same data displayed as a probability matrix.
  • Figure 2.8.5 The dscan entry form.
  • Figure 2.8.6 The output produced by running Gibbs on the crp.dat data file with a conserved motif width of 16, an estimated 22 sites, and 0, 1, or 2 sites allowed per sequence. (A) Gibbs program options. (B) Gibbs output, showing a listing of the input FASTA sequence headers. (C) Maximum MAP output.
  • Figure 2.8.7 dscan output from scanning the database of E. coli intergenic sequences.
  • Figure 2.8.8 Gibbs advanced options page for DNA data with default options selected.
  • Figure 2.8.9 Gibbs advanced options screen for protein data.
  • Figure 2.8.10 Restriction site for the enzyme EcoRI, illustrating its palindromic nature. The GAA at the 5′ end, at positions 1 through 3, is complementary to the TTC at the 3′ end, at positions 6 through 4.
  • Figure 2.8.11 Background composition. The figure shows the distribution of the probabilities of each nucleotide at each position, as generated by the Bayesian segmentation algorithm (Liu and Lawrence, 1999) for a 131‐bp region upstream of the Haemophilus influenzae purA gene.
  • Figure 2.8.12 Output from a Gibbs run with the Wilcoxon signed‐rank test option enabled. (A) The 18 E. coli CRP‐regulated promoter sequences have been supplemented with 18 shuffled sequences. (B) Maximum MAP alignment and the results of the Wilcoxon signed‐rank test. The p value of 0.000671 indicates that the alignment is highly significant despite the inclusion of three shuffled sequences in the alignment.
  • Figure 2.8.13 Sample spacing distribution. (A) The probability distribution of the distances of sites from the start codon for the default spacing model for prokaryotic DNA sequences. This is the model used when the option Prokaryotic Defaults is selected. (B) Values for the spacing distribution shown in panel A.
  • Figure 2.8.14 Sample prior information file. Prior pseudocounts are shown for the CRP TFBS model. The model is 16 rows by 4 columns; the order of the columns is A, T, C, G. In this example, each row sums to 10, although this is not a requirement. Rows may have different sums. By default, each table entry is multiplied by 0.1, resulting in 1.0 total pseudocounts for each position. Prior probabilities for 0, 1, or 2 sites per sequence are included, with a weight of 0.1.
  • Figure 2.8.15 Gibbs more advanced options screen for DNA. This screen includes options for controlling program performance.
  • Figure 2.8.16 Alignments from phylogenetic footprinting. (A) Alignment from the phylogenetic footprinting of the E. coli purL gene and six orthologous genes from related species. (B) Alignment from the phylogenetic footprinting of the E. coli glnA gene and six orthologous genes from related species.
  • Figure 2.8.17 (A) Alignment from the analysis of seven intergenic sequences that contain the ten M. tuberculosis promoters. (B) Sequence logo (Schneider and Stephens, 1990) of the alignment of seven intergenic sequences that contain the ten M. tuberculosis promoters. A sequence logo is a graphical representation of a multiple sequence alignment. The overall height of the letters at a position indicates the sequence conservation at that position. The height of the individual letters at a position indicates the relative frequency of the nucleotide at that position.
  • Figure 2.8.18 (A) Alignments for motifs a and b for the M. tuberculosis hypoxia microarray data. (B, C) Sequence logos (Schneider and Stephens, 1990) of the alignments of motifs a and b, respectively, for the M. tuberculosis hypoxia microarray data.
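
The captions for Figures 2.8.4, 2.8.14, and 2.8.17 describe three related views of a motif: a collection of aligned sites, a count matrix, and a probability matrix, with prior pseudocounts and a sequence logo layered on top. The Python sketch below illustrates how these views relate. It is not the Gibbs or dscan implementation: the function names, the uniform default prior (2.5 per base, so each row sums to 10), and the example sites are assumptions made for illustration. Only the A, T, C, G column order and the 0.1 pseudocount weight come from the Figure 2.8.14 caption. The information content shown is the uncorrected value (2 bits minus the entropy) and omits the small-sample correction used in published sequence logos.

```python
import math

BASES = "ATCG"  # column order stated in the Figure 2.8.14 caption


def count_matrix(sites):
    """Count each base at each position of equal-width aligned sites."""
    width = len(sites[0])
    counts = [[0] * len(BASES) for _ in range(width)]
    for site in sites:
        if len(site) != width:
            raise ValueError("all sites must have the same width")
        for pos, base in enumerate(site.upper()):
            counts[pos][BASES.index(base)] += 1
    return counts


def probability_matrix(counts, prior_rows=None, weight=0.1):
    """Convert counts to probabilities after adding weighted pseudocounts.

    prior_rows is a hypothetical stand-in for a prior table like the one in
    Figure 2.8.14; the default row [2.5, 2.5, 2.5, 2.5] sums to 10, so a
    weight of 0.1 adds 1.0 total pseudocounts per position.
    """
    probs = []
    for pos, row in enumerate(counts):
        prior = prior_rows[pos] if prior_rows else [2.5] * len(BASES)
        smoothed = [c + weight * p for c, p in zip(row, prior)]
        total = sum(smoothed)
        probs.append([x / total for x in smoothed])
    return probs


def information_content(prob_row):
    """Uncorrected information (bits) at one position: log2(4) - entropy."""
    entropy = -sum(p * math.log2(p) for p in prob_row if p > 0)
    return math.log2(len(prob_row)) - entropy


if __name__ == "__main__":
    # Hypothetical aligned sites, not taken from crp.dat.
    sites = ["GAATTC", "GAATTC", "GTATTC", "GAATAC"]
    counts = count_matrix(sites)
    probs = probability_matrix(counts)
    for pos, (c_row, p_row) in enumerate(zip(counts, probs), start=1):
        bits = information_content(p_row)
        print(pos, c_row, [round(p, 3) for p in p_row], round(bits, 3))
```

Positions where one base dominates have information content near 2 bits and would appear as tall stacks in a logo, while positions with a near-uniform distribution have values near 0.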

Literature Cited

   Altschul, S.F. and Lipman, D.J. 1990. Protein database searches for multiple alignments. Proc. Natl. Acad. Sci. U.S.A. 87:5509‐5513.
   Altschul, S.F., Boguski, M.S., Gish, W., and Wootton, J.C. 1994. Issues in searching molecular sequence databases. Nat. Genet. 6:119‐129.
   Bailey, T.L. and Elkan, C. 1995. Unsupervised learning of multiple motifs in biopolymers using EM. Machine Learning 21:51‐80.
   Claverie, J.M. and States, D.J. 1993. Information enhancement methods for large scale sequence analysis. Comput. Chem. 17:191‐201.
   Florczyk, M.A., McCue, L.A., Purkayastha, A., Currenti, E., Wolin, M.J., and McDonough, K.A. 2003. A family of acr‐coregulated Mycobacterium tuberculosis genes shares a common DNA motif and requires Rv3133c (dosR or devR) for expression. Infect. Immun. 71:5332‐5343.
   Lawrence, C.E. and Reilly, A.A. 1990. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins Struct. Funct. Genet. 7:41‐51.
   Lawrence, C., Altschul, S., Boguski, M., Liu, J., Neuwald, A., and Wootton, J. 1993. Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science 262:208‐214.
   Liu, J.S. and Lawrence, C.E. 1999. Bayesian inference on biopolymer models. Bioinformatics 15:38‐52.
   Liu, J., Neuwald, A., and Lawrence, C. 1995. Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. J. Am. Stat. Assoc. 90:1156‐1170.
   Liu, X., Brutlag, D.L., and Liu, J.S. 2001. BioProspector: Discovering conserved DNA motifs in upstream regulatory regions of co‐expressed genes. In Proceedings of the Pacific Symposium on Biocomputing, pp. 127‐138. World Scientific Press, Hawaii.
   McCue, L., Thompson, W., Carmack, C., Ryan, M.P., Liu, J.S., Derbyshire, V., and Lawrence, C.E. 2001. Phylogenetic footprinting of transcription factor binding sites in proteobacterial genomes. Nucl. Acids Res. 29:774‐782.
   McCue, L.A., Thompson, W., Carmack, C.S., and Lawrence, C.E. 2002. Factors influencing the identification of transcription factor binding sites by cross‐species comparison. Genome Res. 12:1523‐1532.
   Neuwald, A., Liu, J., and Lawrence, C. 1995. Gibbs motif sampling: Detection of bacterial outer membrane protein repeats. Protein Sci. 4:1618‐1632.
   Schneider, T.D. and Stephens, R.M. 1990. Sequence logos: A new way to display consensus sequences. Nucl. Acids Res. 18:6097‐6100.
   Schwartz, S., Kent, W.J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R.C., Haussler, D., and Miller, W. 2003. Human‐mouse alignments with BLASTZ. Genome Res. 13:103‐107.
   Sherman, D.R., Voskuil, M., Schnappinger, D., Liao, R., Harrell, M.I., and Schoolnik, G.K. 2001. Regulation of the Mycobacterium tuberculosis hypoxic response gene encoding alpha‐crystallin. Proc. Natl. Acad. Sci. U.S.A. 98:7534‐7539.
   Thompson, W., Rouchka, E.C., and Lawrence, C.E. 2003. Gibbs Recursive Sampler: Finding transcription factor binding sites. Nucl. Acids Res. 31:3580‐3585.
   Thompson, W., Palumbo, M.J., Wasserman, W.W., Liu, J.S., and Lawrence, C.E. 2004. Decoding human regulatory circuits. Genome Res. 14:1967‐1974.
   Wanner, B.L. 1996. Phosphorus assimilation and control of the phosphate regulon. In Escherichia coli and Salmonella: Cellular and Molecular Biology (F.C. Neidhardt, ed.), pp. 1357‐1381. ASM Press, Washington, D.C.
   Webb, B.J., Liu, J.S., and Lawrence, C.E. 2002. BALSA: Bayesian algorithm for local sequence alignment. Nucl. Acids Res. 30:1268‐1277.

Internet Resources
   http://bayesweb.wadsworth.org/gibbs/gibbs.html
   http://www.bioinfo.rpi.edu/applications/bayesian/gibbs/gibbs.html
   Web sites for running the Gibbs sampler.
   http://bayesweb.wadsworth.org/GIBBS-SAMPLER-ACADEMIC.htm
   http://bayesweb.wadsworth.org/GIBBS-SAMPLER-COMMERCIAL.htm
   The above sites provide information about obtaining Gibbs.
   http://bayesweb.wadsworth.org/gibbs/module
   Auxiliary data for running the examples.
   http://www.chem.qmul.ac.uk/iupac/AminoAcid/A2021.html#AA21
   IUPAC amino acid codes.
   http://bayesweb.wadsworth.org/web_help.PF.html
   http://bayesweb.wadsworth.org/web_help_text.CE.htm
   Annotated examples using Gibbs to analyze bacterial data.