
Using Chado to Store Genome Annotation Data


Chado is a relational database schema that can be used to manage a wide variety of biological information, including genome annotation, genetic, phenotypic, and expression data. Its flexibility comes from its use of "ontologies," which are controlled vocabularies that describe data types and the relationships among them. By changing its ontologies, Chado can be customized to suit many different needs. Chado's modular design adds further flexibility: users can choose to use only those features of Chado that are suitable for their needs. XORT is the main software tool used to move data in and out of Chado databases. XORT uses an XML-based file format for data import and export; this format is called ChadoXML. The protocols described in this chapter show how to use XORT and related software to import genome annotation data into Chado databases, and how to export data stored in Chado databases into different file formats for reporting and data mining.

Keywords: Chado; genome; annotation; database; XORT; GAME; GMOD

  • Basic Protocol 1: Installing Chado and XORT in the Unix/Linux Environment
  • Basic Protocol 2: Building a Chado Annotation Database
  • Basic Protocol 3: Loading a GenBank File
  • Basic Protocol 4: Querying a Chado Annotation Database Using SQL
  • Basic Protocol 5: Generating Standard Reports from a Chado Annotation Database
  • Support Protocol 1: Installing Software for a Unix‐Like Environment on a PC
  •   Figure 9.6.1 A schematic representation of the protocols and the organizational relationship between the protocols and data flow for Chado.
  •   Figure 9.6.2 GAME XML format, which is one of the input formats for the annotation editor Apollo.
  •   Figure 9.6.3 Structure for ChadoXML, which serves as an intermediate format between the Chado database and other file formats.
  •   Figure 9.6.4 FEATURES section of GenBank record to be loaded into Chado.
  •   Figure 9.6.5 FEATURES section of GenBank record to be loaded into Chado, modified to reflect chromosomal coordinates.
  •   Figure 9.6.6 Example query to retrieve location information for the gene oaf.
  •   Figure 9.6.7 Results returned for the query depicted in Figure .
  •   Figure 9.6.8 Example query to get transcripts and their locations for the gene oaf.
  •   Figure 9.6.9 Results returned for the query depicted in Figure .
  •   Figure 9.6.10 Example query to get exons and their locations for a given transcript, “oaf‐RB”.
  •   Figure 9.6.11 Results returned for the query depicted in Figure .
  •   Figure 9.6.12 Example query to get exons and their locations for the gene oaf.
  •   Figure 9.6.13 Results returned for the query depicted in Figure .
  •   Figure 9.6.14 Example query to list types of analysis available and sets of data used in the analysis for a given genomic region (arm 2L, bases 1 to 49,999).
  •   Figure 9.6.15 Results returned for the query depicted in Figure .
  •   Figure 9.6.16 Example query to list aligned objects for a given genomic region (arm 2L, bases 1 to 49,999).
  •   Figure 9.6.17 Results returned for the query depicted in Figure .
  •   Figure 9.6.18 Example query to retrieve the alignment details for the alignment of a given sequence against the chromosome arm (e.g., GenBank record “AY129461”).
  •   Figure 9.6.19 Results returned for the query depicted in Figure .
  •   Figure 9.6.20 Examples of conf/bulkfiles/fbreleases.xml and conf/bulkfiles/fbbulk‐hetr3.xml files modified to reflect the database.
  •   Figure 9.6.21 Example of commands used for generating report files.
  •   Figure 9.6.22 Screen shot of the Cygwin setup window.
  •   Figure 9.6.23 Groups that must be installed in order to install Cygwin.
  •   Figure 9.6.24 The Central Dogma model for a protein‐coding gene with one known spliced transcript. The dashed lines denote the featureloc records of features aligned to the genomic contig, while the solid lines denote the feature_relationship records between two features (subject and object).
  •   Figure 9.6.25 Data implementation of prediction and alignment evidence in Chado to support genome annotation. The dashed line denotes the featureloc of features aligned to the genomic contig, while the solid line denotes the feature_relationship between two features.
  •   Figure 9.6.26 The “rebase” error message from Cygwin.
