Using the Ensembl Genome Server to Browse Genomic Sequence Data

Internet, 2013-12-31

Abstract

The Ensembl project provides a comprehensive source of automatic annotation of the human genome sequence, as well as other species of biomedical interest, with confirmed gene predictions that have been integrated with external data sources. This unit describes how to use the Ensembl genome browser (http://www.ensembl.org/), the public interface of the project. It describes how to find a gene or protein of interest, how to get additional information and external links, and how to use the comparative genomic data. Curr. Protoc. Bioinform. 30:1.15.1‐1.15.48. © 2010 by John Wiley & Sons, Inc.

Keywords: computer graphics; databases; genetic; genetic variation; genomics; sequence homology; genome; genome sequence

GO TO THE FULL PROTOCOL:

PDF or HTML at Wiley Online Library

Introduction
Basic Protocol 1: Search by Text/Keyword/Gene Name
Basic Protocol 2: Examining a Gene
Basic Protocol 3: Examining a Genomic Location
Support Protocol 1: Comparative Genomics: Gene Trees, Orthologues, and Paralogues
Support Protocol 2: Comparative Genomics: Pairwise Whole Genome Alignments
Support Protocol 3: Comparative Genomics: Multiple Whole Genome Alignments
Guidelines for Understanding Results
Commentary
Literature Cited
Figures

Figures

Figure 1.15.1 The Ensembl home page provides a gateway to genome information.

Figure 1.15.2 Ensembl search results page for a search with “tgf beta receptor”, showing the entry for the Ensembl protein‐coding gene ENSG00000069702 (TGFBR3). Red boxes indicate matches of the search terms in the gene entry.
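
Stable identifiers of the kind shown in search results follow a regular pattern, so an ID can be checked for typing or copying errors before it is entered in the search box. The following is a minimal sketch, assuming the human convention of the prefix ENS, a one‐letter feature code, and eleven digits. The function name and the feature‐code table are illustrative, and the pattern does not cover the species codes used for other genomes or version suffixes.

```python
import re

# Assumed pattern for human Ensembl stable IDs: "ENS", a feature-type
# code, then exactly eleven digits. Other species insert a species code
# after "ENS", and IDs may carry a ".version" suffix; neither is handled.
_HUMAN_STABLE_ID = re.compile(r"^ENS(G|T|P|E|R)(\d{11})$")

_FEATURE_TYPES = {
    "G": "gene",
    "T": "transcript",
    "P": "protein",
    "E": "exon",
    "R": "regulatory feature",
}


def classify_stable_id(identifier):
    """Return the feature type of a human stable ID, or None if malformed."""
    match = _HUMAN_STABLE_ID.match(identifier.strip())
    if match is None:
        return None
    return _FEATURE_TYPES[match.group(1)]


if __name__ == "__main__":
    print(classify_stable_id("ENSG00000139618"))  # human BRCA2 gene ID
    print(classify_stable_id("ENSG0000139618"))   # one digit short
```

An ID that returns None is worth rechecking against its source before searching, since a dropped or added zero is an easy mistake to make in the long run of digits.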

Figure 1.15.3 Ensembl Gene Splice variants page for the human SMAD2 gene showing transcript models for the three alternative splice variants annotated for this gene. While exon‐intron structures are drawn to scale in the top region, each transcript is joined to a zoomed representation below focusing on exons. Conserved protein domains are mapped onto the coding sequence and allow for correlating splice variants with conserved protein domains.

Figure 1.15.4 Ensembl Gene Variation Image pages, like most Ensembl data displays, can be reconfigured to change certain display options. Upon clicking the “Configure this page” button in the left‐hand navigation column, a modal window appears in the center of the screen. The configuration options are organized into tabs in the top row and sections in the navigation column to the left. Once display options are set, click the “Save and close” button in the upper right corner.

Figure 1.15.5 Ensembl Gene Variation image page for the human SMAD2 gene showing transcript models for the three alternative splice variants annotated for this gene. While exon‐intron structures are drawn to scale in the top region, each transcript is joined to a zoomed representation below focusing on exons. Conserved protein domains and, more importantly, genetic variations are mapped onto the transcript structures or onto a configurable amount of intron context. Each variation is color‐coded according to its biological consequences.

Figure 1.15.6 Ensembl Variation summary pages provide more details about a particular genetic variation. Here the page for NCBI dbSNP entry rs1051066 is shown, which overlaps with the coding sequence of SMAD2 transcripts.

Figure 1.15.7 Ensembl Gene summary pages display a table of alternative splice variants of a particular gene, here the human SMAD2 gene, in the top region. The main display is a transcript ideogram that annotates individual transcripts of the human SMAD2 gene and neighboring genes if they are close enough. Context menus are available upon clicking on transcripts and provide stable identifiers, which link to further Ensembl displays, as well as information about the annotation procedure.

Figure 1.15.8 Ensembl Gene Supporting evidence pages show, on a per‐exon basis, which sequence records from external databases have been used for the annotation of a particular transcript. Supporting evidence for the delta‐exon 3 splice variant (ENST00000356825) of the human SMAD2 gene is shown here.

Figure 1.15.9 Ensembl Transcript Exons pages provide exon information on the sequence level. Upstream and downstream sequences are indicated in green letters, untranslated regions (UTRs) in pink, and translated regions in black. By default, only a context sequence flanking the exons is shown for introns, in blue. This display is configurable so that the amount of up‐ and downstream sequence can be changed, as can the amount of intron sequence context, up to full intron sequences.

Figure 1.15.10 Ensembl Location view pages provide a high‐level view of the genome sequence assembly and its annotation in graphical form. Genes, transcripts, and many other feature types are color‐coded. At the top is the Chromosome panel, representing an entire chromosome, while further below the Top panel displays a chromosome region of up to 1 Mb. The Main panel in the bottom region is highly customizable and shows a broad range of features within an adjustable sequence region. This display is focused on the human SMAD2 locus.
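
A Location view can also be reached directly by address rather than by navigating from the home page. The sketch below builds such an address for a region; the URL layout (species name, then Location/View, then an r parameter of the form chromosome:start-end) is an assumption about how the site addresses regions and may differ between releases. The helper name and the coordinates in the example are illustrative only and do not correspond to a locus discussed in this unit.

```python
from urllib.parse import urlencode

ENSEMBL_BASE = "http://www.ensembl.org"  # project home page given in this unit


def location_view_url(species, chromosome, start, end):
    """Build a Location view address for a region (assumed URL layout)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    region = f"{chromosome}:{start}-{end}"
    # safe=":" keeps the colon readable instead of percent-encoding it
    query = urlencode({"r": region}, safe=":")
    return f"{ENSEMBL_BASE}/{species}/Location/View?{query}"


if __name__ == "__main__":
    # Illustrative coordinates on human chromosome 13
    print(location_view_url("Homo_sapiens", "13", 1000000, 1100000))
```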

Figure 1.15.11 Ensembl Location view pages are highly customizable. The Top panel has been reconfigured to show syntenic regions, i.e., regions of sequence and gene order conservation between human and Mus musculus (mouse), as well as Rattus norvegicus (rat). The locations of sequence‐tagged site (STS) markers are annotated in pink. Context menus providing additional information are available upon clicking a feature.

Figure 1.15.12 The Main panel of Ensembl Location view displays can be adjusted in several ways to focus on a particular genomic region of interest. A navigation bar on top of the panel has two buttons in the form of magnifying glasses with plus and minus signs. They can be used to zoom into or out of the current region by a factor of two, respectively. A zooming ladder allows for quick adjustments to predefined widths. The fastest method for focusing is dragging a box around a particular feature or region and selecting the “Jump to region” option from the context menu.
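
The effect of the two zoom buttons on the displayed coordinates can be worked out by hand. The sketch below is one way to do that arithmetic, assuming the zoom is centered on the midpoint of the current region and clamped to the chromosome ends. The browser's exact rounding and centering rules are not stated in this unit, so the function is an illustration, not a description of its implementation.

```python
def zoom(start, end, factor, chromosome_length):
    """Return (start, end) after zooming around the region's midpoint.

    factor=0.5 halves the width (zoom in); factor=2 doubles it (zoom out).
    Coordinates are 1-based and inclusive.
    """
    width = end - start + 1
    new_width = max(1, round(width * factor))
    centre = (start + end) // 2
    new_start = centre - new_width // 2
    new_end = new_start + new_width - 1
    # Clamp to the chromosome, keeping the requested width where possible
    if new_start < 1:
        new_start, new_end = 1, min(chromosome_length, new_width)
    if new_end > chromosome_length:
        new_end = chromosome_length
        new_start = max(1, chromosome_length - new_width + 1)
    return new_start, new_end


if __name__ == "__main__":
    print(zoom(1001, 2000, 0.5, 10000))  # zoom in by a factor of two
    print(zoom(1001, 2000, 2, 10000))    # zoom out by a factor of two
```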

Figure 1.15.13 The Main panel of Ensembl Location view has been reconfigured to display genome alignments of mammalian UniProtKB protein records in the “Normal” display option, and human cDNA entries from NCBI RefSeq and EMBL, as well as human ESTs, in the “Stacked unlimited” option. Sequence alignments displayed in these tracks form the biological evidence for transcript annotation. Aligned blocks represented by filled boxes or dark shading should correlate with exons of the Ensembl‐Havana transcripts.

Figure 1.15.14 Ensembl Location view displays can be configured to show regulatory features alongside gene and transcript annotation. Three regulatory features have been annotated at the 5′ end of the SMAD2 transcripts. The context menus for two of the regulatory features shown here (ENSR0000001421 and ENSR0000001422), opened by clicking the features in the panel, provide additional information.

Figure 1.15.15 Ensembl Regulation details pages show regulatory features in their genomic context similar to Location view pages. The selected regulatory feature has green background shading and thus can be distinguished from neighboring or overlapping regulatory features. Below the Reg. Feats track, additional tracks provide the core evidence on which this regulatory feature has been annotated, as well as groups of further experimental evidence. Once particular supporting features have been clicked upon, more information becomes available via context menus.

Figure 1.15.16 Ensembl Location view has been reconfigured to show microarray probe sets from the Affymetrix Human Genome U133 Plus 2.0 Array platform. Individual probes are aligned to the genome and clustered into probe sets before probe sets are associated with transcript predictions. This display annotates the probe sets directly in the genome.

Figure 1.15.17 Ensembl Gene Tree pages provide a graphical representation of a gene tree and the protein alignments that this tree is based upon. This tree is focused on the human SMAD2 gene, which is indicated in red. Orthologues (in other species) are shown on neighboring branches. Paralogues (in the same species) are labeled in blue. Triangles indicate collapsed branches of the gene tree.
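
The distinction drawn in this caption can be expressed as a simple rule over a list of homologues. The sketch below applies only the caption's shorthand (other species means orthologue, same species means paralogue). It is a simplification: gene tree methods infer these relationships from speciation and duplication nodes, so a homologue in another species that descends from an older duplication would be mislabeled by this rule. The function name and the example gene list are illustrative.

```python
def classify_homologues(query_species, homologues):
    """Split (gene, species) pairs using the caption's shorthand rule.

    Returns (orthologues, paralogues): homologues in other species and
    homologues in the query species, respectively.
    """
    orthologues, paralogues = [], []
    for gene, species in homologues:
        if species == query_species:
            paralogues.append((gene, species))
        else:
            orthologues.append((gene, species))
    return orthologues, paralogues


if __name__ == "__main__":
    # Illustrative homologues of human SMAD2
    homologues = [
        ("SMAD3", "Homo sapiens"),
        ("Smad2", "Mus musculus"),
        ("Smad2", "Rattus norvegicus"),
    ]
    orthologues, paralogues = classify_homologues("Homo sapiens", homologues)
    print("Orthologues:", orthologues)
    print("Paralogues:", paralogues)
```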

Figure 1.15.18 The reconfigured Ensembl Gene Tree display shows all orthologues in the primate branch expanded and the context menu for the Euteleostomi (bony vertebrates) duplication node between human SMAD2 and SMAD3.

Figure 1.15.19 The multiple sequence alignment viewer and editor Jalview (Waterhouse et al., 2009) showing a sub‐alignment of the SMAD2 and SMAD3 branches of the gene tree.

Figure 1.15.20 Ensembl Location view displays are highly customizable and can show genome‐wide pairwise alignments between the current species and one or more additional species. Pairwise alignments with the genomes of mouse, rat, and the Western clawed frog are shown here for the human SMAD2 locus.

Figure 1.15.21 Ensembl Location Multi‐species pages are based on genome‐wide pairwise alignments and can display genome representations of two or more species concurrently. A comparison of the human BRCA2 and mouse Brca2 loci indicates conserved regions between human and mouse as a pink block in the “H.sap‐M.mus BLASTZ” track. Orthologous blocks between two species are joined with green parallelograms.

Figure 1.15.22 Ensembl Location Alignment (image) pages provide graphical representations of pairwise or multiple genome alignments. A pairwise alignment of the human BRCA2 locus with the orthologous region in the chimpanzee genome indicates good overall sequence coverage. Alignment blocks in chimpanzee are arranged according to the human genome, which serves as reference.

Figure 1.15.23 Ensembl Location Synteny pages provide a graphical map of syntenic regions, i.e., regions of sequence and gene order conservation in a species pair. A panel in the upper section of the page shows the chromosome of the primary species (human chromosome 13) in the center. A red box on this chromosome indicates the location context (human BRCA2 locus). Colored boxes along the central chromosome represent syntenic regions, which are joined to their corresponding location on chromosomes of the secondary species (mouse). A pink box overlapping the red box indicates synteny beyond the human BRCA2 locus. A black line joins the region to its corresponding region on mouse chromosome 5. Upon clicking a synteny region, a context menu shows the exact location and extent of the region and provides links to corresponding Location Overview displays.

Figure 1.15.24 Ensembl Location View displays can be configured to show multiple whole genome alignment tracks. This image displays three tracks related to a 13‐way multiple genome alignment of the human BRCA2 locus. The salmon‐colored block in the “13 amniota vertebrate” track indicates that the entire region is represented in a multiple alignment. The “13 way GERP score” track plots the sequence conservation in the alignment at any position against the genomic location. Finally, the “GERP elements” track refines the GERP scores into regions of particularly high sequence conservation. Please note the correlation between conserved elements and coding exons of the human BRCA2 transcript.
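
The relationship between a per‐position score track and an elements track can be illustrated with a simple thresholding pass. The sketch below is not the GERP method. It only shows the general idea of turning a per‐position conservation score into contiguous high‐scoring regions, assuming a fixed threshold and a minimum run length. The function name, both parameters, and the scores are invented for illustration.

```python
def conserved_elements(scores, threshold, min_length=1):
    """Return (start, end) runs where every score is >= threshold.

    Positions are 0-based and end-exclusive; runs shorter than
    min_length are discarded.
    """
    elements = []
    run_start = None
    for position, score in enumerate(scores):
        if score >= threshold:
            if run_start is None:
                run_start = position  # a new run begins here
        elif run_start is not None:
            if position - run_start >= min_length:
                elements.append((run_start, position))
            run_start = None
    # Close a run that extends to the end of the score list
    if run_start is not None and len(scores) - run_start >= min_length:
        elements.append((run_start, len(scores)))
    return elements


if __name__ == "__main__":
    # Invented scores: two long conserved runs and one isolated position
    scores = [0.1, 2.5, 3.0, 2.8, 0.2, 0.0, 2.9, 0.3, 3.1, 3.2]
    print(conserved_elements(scores, threshold=2.0, min_length=2))
```

With these invented scores, the isolated high position at index 6 is dropped because it is shorter than the minimum length, which mirrors how an elements track is sparser than the underlying score plot.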

Figure 1.15.25 Ensembl Location Alignments (Text) displays show multiple sequence alignments in text form and can be configured to also annotate these alignments. The image shows a 13‐way amniota vertebrate (PECAN) alignment in a region corresponding to a GERP element, i.e., a region of high sequence conservation. Red characters indicate the location of exons annotated in the genomes overlapping the alignment region. Blue background shading highlights conserved residues. Please note the particularly high sequence conservation in this protein coding BRCA2 exon and the conservation of the exon boundaries.
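
Which columns of a text alignment count as conserved can be checked with a few lines of code once the alignment has been exported. The sketch below marks a column as conserved only when every sequence carries the same residue and none has a gap. This strict rule is an assumption, since the display's own highlighting criterion is not specified here and may tolerate some mismatches. The sequences are toy data.

```python
def conserved_columns(alignment):
    """Return 0-based indices of columns identical in every sequence.

    Columns containing a gap character ("-") are never reported.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    conserved = []
    for index, column in enumerate(zip(*alignment)):
        residues = {base.upper() for base in column}
        if len(residues) == 1 and "-" not in residues:
            conserved.append(index)
    return conserved


if __name__ == "__main__":
    # Toy three-sequence alignment, not taken from any genome
    alignment = ["ATG-CA", "ATGTCA", "ACGTCA"]
    print(conserved_columns(alignment))
```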

Literature Cited

	Bruford, E.A., Lush, M.J., Wright, M.W., Sneddon, T.P., Povey, S., and Birney, E. 2008. The HGNC Database in 2008: A resource for the human genome. Nucleic Acids Res. 36:D445‐D448.
	Curwen, V., Eyras, E., Andrews, T.D., Clarke, L., Mongin, E., Searle, S.M., and Clamp, M. 2004. The Ensembl automatic gene annotation system. Genome Res. 14:942‐950.
	Flicek, P., Aken, B.L., Ballester, B., Beal, K., Bragin, E., Brent, S., Chen, Y., Clapham, P., Coates, G., Fairley, S., Fitzgerald, S., Fernandez‐Banet, J., Gordon, L., Graf, S., Haider, S., Hammond, M., Howe, K., Jenkinson, A., Johnson, N., Kahari, A., Keefe, D., Keenan, S., Kinsella, R., Kokocinski, F., Koscielny, G., Kulesha, E., Lawson, D., Longden, I., Massingham, T., McLaren, W., Megy, K., Overduin, B., Pritchard, B., Rios, D., Ruffier, M., Schuster, M., Slater, G., Smedley, D., Spudich, G., Tang, Y.A., Trevanion, S., Vilella, A., Vogel, J., White, S., Wilder, S.P., Zadissa, A., Birney, E., Cunningham, F., Dunham, I., Durbin, R., Fernandez‐Suarez, X.M., Herrero, J., Hubbard, T.J., Parker, A., Proctor, G., Smith, J., and Searle, S.M. 2009. Ensembl's 10th year. Nucleic Acids Res. 38:D557‐D562.
	Gentleman, R.C., Carey, V.J., Bates, D.M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., Hornik, K., Hothorn, T., Huber, W., Iacus, S., Irizarry, R., Leisch, F., Li, C., Maechler, M., Rossini, A.J., Sawitzki, G., Smith, C., Smyth, G., Tierney, L., Yang, J.Y., and Zhang, J. 2004. Bioconductor: Open software development for computational biology and bioinformatics. Genome Biol. 5:R80.
	Haider, S., Ballester, B., Smedley, D., Zhang, J., Rice, P., and Kasprzyk, A. 2009. BioMart central portal–unified access to biological data. Nucleic Acids Res. 37:W23‐W27.
	Hunter, S., Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Binns, D., Bork, P., Das, U., Daugherty, L., Duquenne, L., Finn, R.D., Gough, J., Haft, D., Hulo, N., Kahn, D., Kelly, E., Laugraud, A., Letunic, I., Lonsdale, D., Lopez, R., Madera, M., Maslen, J., McAnulla, C., McDowall, J., Mistry, J., Mitchell, A., Mulder, N., Natale, D., Orengo, C., Quinn, A.F., Selengut, J.D., Sigrist, C.J., Thimma, M., Thomas, P.D., Valentin, F., Wilson, D., Wu, C.H., and Yeats, C. 2009. InterPro: The integrative protein signature database. Nucleic Acids Res. 37:D211‐D215.
	Jenkinson, A.M., Albrecht, M., Birney, E., Blankenburg, H., Down, T., Finn, R.D., Hermjakob, H., Hubbard, T.J., Jimenez, R.C., Jones, P., Kahari, A., Kulesha, E., Macias, J.R., Reeves, G.A., and Prlic, A. 2008. Integrating biological data‐the Distributed Annotation System. BMC Bioinformatics 9:S3.
	Leinonen, R., Akhtar, R., Birney, E., Bonfield, J., Bower, L., Corbett, M., Cheng, Y., Demiralp, F., Faruque, N., Goodgame, N., Gibson, R., Hoad, G., Hunter, C., Jang, M., Leonard, S., Lin, Q., Lopez, R., Maguire, M., McWilliam, H., Plaister, S., Radhakrishnan, R., Sobhany, S., Slater, G., Ten Hoopen, P., Valentin, F., Vaughan, R., Zalunin, V., Zerbino, D., and Cochrane, G. 2009. Improvements to services at the European Nucleotide Archive. Nucleic Acids Res. 38:D39‐D45.
	Paten, B., Herrero, J., Beal, K., Fitzgerald, S., and Birney, E. 2008a. Enredo and Pecan: Genome‐wide mammalian consistency‐based multiple alignment with paralogs. Genome Res. 18:1814‐1828.
	Paten, B., Herrero, J., Fitzgerald, S., Beal, K., Flicek, P., Holmes, I., and Birney, E. 2008b. Genome‐wide nucleotide‐level mammalian ancestor reconstruction. Genome Res. 18:1829‐1843.
	Paten, B., Herrero, J., Beal, K., and Birney, E. 2009. Sequence progressive alignment, a framework for practical large‐scale probabilistic consistency alignment. Bioinformatics 25:295‐301.
	Potter, S.C., Clarke, L., Curwen, V., Keenan, S., Mongin, E., Searle, S.M., Stabenau, A., Storey, R., and Clamp, M. 2004. The Ensembl analysis pipeline. Genome Res. 14:934‐941.
	Pruitt, K.D., Tatusova, T., Klimke, W., and Maglott, D.R. 2009. NCBI Reference Sequences: Current status, policy and new initiatives. Nucleic Acids Res. 37:D32‐D36.
	Ruan, J., Li, H., Chen, Z., Coghlan, A., Coin, L.J., Guo, Y., Heriche, J.K., Hu, Y., Kristiansen, K., Li, R., Liu, T., Moses, A., Qin, J., Vang, S., Vilella, A.J., Ureta‐Vidal, A., Bolund, L., Wang, J., and Durbin, R. 2008. TreeFam: 2008 Update. Nucleic Acids Res. 36:D735‐D740.
	Sayers, E.W., Barrett, T., Benson, D.A., Bolton, E., Bryant, S.H., Canese, K., Chetvernin, V., Church, D.M., Dicuccio, M., Federhen, S., Feolo, M., Geer, L.Y., Helmberg, W., Kapustin, Y., Landsman, D., Lipman, D.J., Lu, Z., Madden, T.L., Madej, T., Maglott, D.R., Marchler‐Bauer, A., Miller, V., Mizrachi, I., Ostell, J., Panchenko, A., Pruitt, K.D., Schuler, G.D., Sequeira, E., Sherry, S.T., Shumway, M., Sirotkin, K., Slotta, D., Souvorov, A., Starchenko, G., Tatusova, T.A., Wagner, L., Wang, Y., John Wilbur, W., Yaschenko, E., and Ye, J. 2009. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 38:D5‐D16.
	UniProt Consortium. 2008. The universal protein resource (UniProt). Nucleic Acids Res. 36:D190‐D195.
	Vilella, A.J., Severin, J., Ureta‐Vidal, A., Heng, L., Durbin, R., and Birney, E. 2009. EnsemblCompara GeneTrees: Complete, duplication‐aware phylogenetic trees in vertebrates. Genome Res. 19:327‐335.
	Waterhouse, A.M., Procter, J.B., Martin, D.M., Clamp, M., and Barton, G.J. 2009. Jalview Version 2–a multiple sequence alignment editor and analysis workbench. Bioinformatics 25:1189‐1191.
	Wilming, L.G., Gilbert, J.G., Howe, K., Trevanion, S., Hubbard, T., and Harrow, J.L. 2008. The vertebrate genome annotation (Vega) database. Nucleic Acids Res. 36:D753‐D760.

Internet Resources
	http://www.ensembl.org/
	Ensembl project home page
	http://www.biomart.org/
	BioMart Project
	http://vega.sanger.ac.uk/
	Vertebrate Genome Annotation (VEGA) at Sanger Institute
	http://www.ebi.ac.uk/interpro/
	InterPro
	http://www.genenames.org/
	HUGO Gene Nomenclature Committee (HGNC)
	http://www.geneontology.org/
	Gene Ontology Consortium
	http://biodas.org/
	Distributed Annotation System (DAS) and BioDAS
	http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unists
	UniSTS
	http://www.sanger.ac.uk/Software/analysis/eponine
	Eponine
	http://emboss.sourceforge.net/
	The cpgreport program written by Gos Micklem is available on this site.
	http://www.repeatmasker.org/
	RepeatMasker program
	http://www.treefam.org/
	TreeFam
	http://phylomedb.bioinfo.cipf.es/
	PhylomeDB