Using PhyloCon to Identify Conserved Regulatory Motifs
Abstract
Understanding gene regulation has been and remains one of the major challenges for the molecular biology community. Gene regulation is mediated by a variety of short DNA sequences called regulatory elements, which include transcription factor binding sites. A first step toward understanding gene regulation is the identification of regulatory elements present in the genome. This challenge has been defined as the “motif finding problem” in the field of computational biology. Over the past 20 years, many algorithms have been developed to tackle the motif finding problem computationally. The PhyloCon algorithm, developed in 2003, is one of the first motif finding algorithms that take advantage of two important data resources, i.e., phylogenetic conservation and gene co‐regulation, to improve the efficiency of motif identification in biological datasets. This unit presents basic protocols to obtain, install, and apply the PhyloCon program and discusses the underlying algorithm and how to interpret the results. Curr. Protoc. Bioinform. 19:2.12.1‐2.12.29. © 2007 by John Wiley & Sons, Inc.
Keywords: motif discovery; comparative genomics; algorithm
Table of Contents
- Introduction
- Basic Protocol 1: Running the PhyloCon Program
- Basic Protocol 2: Post‐Processing PhyloCon Results with Auxiliary Scripts
- Support Protocol 1: Obtaining and Installing the PhyloCon Software
- Support Protocol 2: Understanding PhyloCon's File Format
- Guidelines for Understanding Results
- Commentary
- Literature Cited
- Figures
- Tables
Figures
- Figure 2.12.1 The contents of the LEU3.pcons file.
- Figure 2.12.2 Running the first PhyloCon operation. This figure shows steps 2 to 5 of the protocol, including displaying the command‐line options and running one example.
- Figure 2.12.3 Parsing the PhyloCon output and displaying the consensus pattern. This figure shows how the auxiliary scripts parseConsSite.pl and parseConsSite2.pl are used to display a PhyloCon‐predicted motif as a consensus pattern.
- Figure 2.12.4 Extracting a motif record from PhyloCon output. This figure shows how getConsMatrix.pl is used to extract the full motif record or the position‐specific count matrix of a motif predicted by PhyloCon.
- Figure 2.12.5 Masking a motif from a sequence file. This figure shows how motifMasker.pl is used to replace the matching sites of a given motif with N in a given sequence file.
- Figure 2.12.6 The contents of the LEU3_2.pcons file.
- Figure 2.12.7 The contents of two sample alphabet files.
- Figure 2.12.8 Structure of the PhyloCon sequence file format. Each sequence consists of a description line and the actual sequence. The description line contains an optional modifier, the sequence name, and any additional description. The actual sequence follows the description line, and begins and ends with a “\” character. Sequences of the same orthologous group are placed consecutively. The end of each orthologous group is marked by “\\”.
- Figure 2.12.9 Output from applying PhyloCon to the sample file LEU3.pcons. (A) Command‐line and algorithmic parameters. (B) Sequence information and sequence file statistics. (C) Runtime parameters and statistics. (D) Top four predicted motifs from cycle number 1. (E) Top four predicted motifs from cycle number 2. (F) Top two motifs from all cycles.
- Figure 2.12.10 Reapplying PhyloCon to modified sequence data. This figure shows how PhyloCon is run iteratively to discover additional motif signals.
- Figure 2.12.11 Outline of the PhyloCon algorithm. (A) A diagram of how PhyloCon organizes and processes data. Sequences are grouped based on orthology. Many initial profiles are generated for conserved regions. Comparison of profiles from different orthologous groups reveals common motifs. (B) Alignments of orthologous sequences from four yeast species show high conservation in the 5′UTR of three genes. Asterisks indicate positions where at least 3 out of 4 letters are identical. Conservation extends beyond the true motifs (LEU3), making it difficult to identify the motif by simply examining the phylogenetic relationship. (C) The motif emerges after comparing profiles from different orthologous groups. (D) Sequence logo of the predicted motif, which closely resembles the Leu3 binding site.
Literature Cited
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403‐410.
Berg, O.G. and von Hippel, P.H. 1987. Selection of DNA binding sites by regulatory proteins. Statistical‐mechanical theory and application to operators and promoters. J. Mol. Biol. 193:723‐750.
Cliften, P.F., Hillier, L.W., Fulton, L., Graves, T., Miner, T., Gish, W.R., Waterston, R.H., and Johnston, M. 2001. Surveying Saccharomyces genomes to identify functional elements by comparative DNA sequence analysis. Genome Res. 11:1175‐1186.
Cliften, P., Sudarsanam, P., Desikan, A., Fulton, L., Fulton, B., Majors, J., Waterston, R., Cohen, B.A., and Johnston, M. 2003. Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science 301:71‐76.
Hertz, G.Z. and Stormo, G.D. 1999. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 15:563‐577.
Hertz, G.Z., Hartzell, G.W., 3rd, and Stormo, G.D. 1990. Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Comput. Appl. Biosci. 6:81‐92.
Liu, J., Tan, K., and Stormo, G.D. 2003. Computational identification of the Spo0A‐phosphate regulon that is essential for the cellular differentiation and development in Gram‐positive spore‐forming bacteria. Nucl. Acids Res. 31:6891‐6903.
MacIsaac, K.D., Wang, T., Gordon, D.B., Gifford, D.K., Stormo, G.D., and Fraenkel, E. 2006. An improved map of conserved regulatory sites for Saccharomyces cerevisiae. BMC Bioinformatics 7:113.
Matys, V., Fricke, E., Geffers, R., Gossling, E., Haubrock, M., Hehl, R., Hornischer, K., Karas, D., Kel, A.E., Kel‐Margoulis, O.V., Kloos, D.U., Land, S., Lewicki‐Potapov, B., Michael, H., Münch, R., Reuter, I., Rotert, S., Saxel, H., Scheer, M., Thiele, S., and Wingender, E. 2003. TRANSFAC: Transcriptional regulation, from patterns to profiles. Nucl. Acids Res. 31:374‐378.
Pietrokovski, S. 1996. Searching databases of conserved sequence regions by aligning protein multiple‐alignments. Nucl. Acids Res. 24:3836‐3845.
Smith, T.F. and Waterman, M.S. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147:195‐197.
Stormo, G.D. 2000. Identification of coordinated gene expression and regulatory sequences. Pac. Symp. Biocomput. 12:416‐417.
Stormo, G.D. and Fields, D.S. 1998. Specificity, free energy and information content in protein‐DNA interactions. Trends Biochem. Sci. 23:109‐113.
Stormo, G.D. and Hartzell, G.W., 3rd. 1989. Identifying protein‐binding sites from unaligned DNA fragments. Proc. Natl. Acad. Sci. U.S.A. 86:1183‐1187.
Thompson, J.D., Higgins, D.G., and Gibson, T.J. 1994. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position‐specific gap penalties and weight matrix choice. Nucl. Acids Res. 22:4673‐4680.
Wang, T. and Stormo, G.D. 2003. Combining phylogenetic data with co‐regulated genes to identify regulatory motifs. Bioinformatics 19:2369‐2380.
Wang, T. and Stormo, G.D. 2005. Identifying the conserved network of cis‐regulatory sites of a eukaryotic genome. Proc. Natl. Acad. Sci. U.S.A. 102:17400‐17405.