An Overview of Gene Identification: Approaches, Strategies, and Considerations
Abstract
Modern biology is on the verge of officially ushering in a new era in science with the completion of the sequencing of the human genome in April 2003. While often erroneously called the “post‐genome era,” this will actually mark the beginning of the “genome era,” a time in which the availability of sequence data for many genomes will have a significant effect on how science is performed in the 21st century. This unit offers an overview of many of the gene prediction methods that are currently available and gives a general assessment of how well the methods work for various problems.
Table of Contents
- Remembering Biology in Deducing Gene Structure
- Categorizing the Methods
- How Well do the Methods Work?
- Strategies and Considerations
- Future Directions
- Acknowledgments
- Literature Cited
- Figures
- Tables
Figures
- Figure 4.1.1 The central dogma of molecular biology. Proceeding from the DNA through the RNA to the protein level, various sequence features and modifications can be identified that can be used in the computational deduction of gene structure. These include the presence of promoter and regulatory regions, intron‐exon boundaries, and both start and stop signals. Unfortunately, these signals are not always present, and when present may not always be in the same form or context. The reader is referred to the text for greater detail.
- Figure 4.1.2 Sensitivity vs. specificity. In the upper portion of the figure, the four possible outcomes of a prediction are shown: a true positive (TP), a true negative (TN), a false positive (FP), and a false negative (FN). The matrix at the bottom of the figure shows how both sensitivity and specificity are determined from these four possible outcomes, giving a tangible measure of the effectiveness of any gene prediction method. (Figure adapted from Burset and Guigó and from Snyder and Stormo.)
- Figure 4.1.3 Annotated output from GeneMachine showing the results of multiple gene prediction program runs. NCBI Sequin is used as the viewer. At the top of the output are shown the results from various BLAST runs (BLASTN vs. dbEST, BLASTN vs. nr, and BLASTX vs. SWISS‐PROT). Towards the bottom of the window are shown the results from the predictive methods (FGENES, GENSCAN, MZEF, and GRAIL 2). Annotations indicating the strength of the prediction are preserved and shown wherever possible within the viewer. Putative regions of high interest would be areas where hits from the BLAST runs line up with exon predictions from the gene prediction programs.
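The four outcomes named in the caption of Figure 4.1.2 can be turned into numbers with a few lines of Python. The sketch below is illustrative only and is not taken from the unit: the per‐nucleotide "C"/"N" labelling, the function names, and the toy sequences are all assumptions. Because the figure itself is not reproduced here, both common definitions of specificity are shown: the classical TN / (TN + FP), and TP / (TP + FP), which is widely used in the gene‐finding literature.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    tp: int  # coding nucleotide predicted as coding
    tn: int  # noncoding nucleotide predicted as noncoding
    fp: int  # noncoding nucleotide predicted as coding
    fn: int  # coding nucleotide predicted as noncoding


def count_outcomes(actual: str, predicted: str) -> Counts:
    """Tally outcomes per nucleotide; 'C' = coding, 'N' = noncoding (assumed encoding)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == "C" and p == "C":
            tp += 1
        elif a == "N" and p == "N":
            tn += 1
        elif a == "N" and p == "C":
            fp += 1
        else:  # a == "C" and p == "N"
            fn += 1
    return Counts(tp, tn, fp, fn)


def _ratio(numerator: int, denominator: int) -> float:
    # Return NaN rather than raising when a category is empty.
    return numerator / denominator if denominator else float("nan")


def sensitivity(c: Counts) -> float:
    """Fraction of true coding nucleotides that were predicted as coding."""
    return _ratio(c.tp, c.tp + c.fn)


def specificity_tn(c: Counts) -> float:
    """Classical specificity: fraction of true noncoding nucleotides predicted as noncoding."""
    return _ratio(c.tn, c.tn + c.fp)


def specificity_gene(c: Counts) -> float:
    """Gene-finding convention: fraction of predicted coding nucleotides that are truly coding."""
    return _ratio(c.tp, c.tp + c.fp)


if __name__ == "__main__":
    counts = count_outcomes("NNCCCCNNNN", "NCCCNNNNNN")  # toy data
    print(counts)
    print(sensitivity(counts), specificity_tn(counts), specificity_gene(counts))
```

In genomic sequence, noncoding nucleotides usually far outnumber coding ones, so TN tends to dominate and the classical specificity can look high even for a weak predictor. This is one reason the TP / (TP + FP) form is often preferred when evaluating gene finders.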
Literature Cited
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI‐BLAST: A new generation of protein database search programs. Nucl. Acids Res. 25:3389‐3402.
Burset, M. and Guigó, R. 1996. Evaluation of gene structure prediction programs. Genomics 34:353‐367.
Chothia, C. and Lesk, A.M. 1986. The relation between the divergence of sequence and structure in proteins. E.M.B.O. J. 5:823‐826.
Claverie, J.M. 1997a. Computational methods for the identification of genes in vertebrate genomic sequences. Hum. Mol. Genet. 6:1735‐1744.
Claverie, J.M. 1997b. Exon detection by similarity searches. Methods Mol. Biol. 68:283‐313.
Claverie, J.M. 1998. Computational methods for exon detection. Mol. Biotechnol. 10:27‐48.
Davuluri, R.V., Grosse, I., and Zhang, M.Q. 2002. Computational identification of promoters and first exons in the human genome. Nat. Genet. 29:412‐417.
Guigó, R. 1997. Computational gene identification. J. Mol. Med. 75:389‐393.
Guigó, R., Knudsen, S., Drake, N., and Smith, T. 1992. Prediction of gene structure. J. Mol. Biol. 226:141‐157.
Harris, N.L. 1997. Genotator: A workbench for sequence annotation. Genome Res. 7:754‐762.
International Human Genome Sequencing Consortium. 2001. Initial sequencing and analysis of the human genome. Nature 409:860‐921.
Kuehl, P., Weisemann, J., Touchman, J., Green, E., and Boguski, M. 1999. An effective approach for analyzing “prefinished” genomic sequence data. Genome Res. 9:189‐194.
Liu, A.Y., Torchia, B.S., Migeon, B.R., and Siliciano, R.F. 1997. The human NTT gene: Identification of a novel 17‐kb noncoding nuclear RNA expressed in activated CD4+ T cells. Genomics 39:171‐184.
Makalowska, I., Ryan, J., and Baxevanis, A. 1999. GeneMachine: A unified solution for performing content‐based, site‐based, and comparative gene prediction methods. 12th Cold Spring Harbor Meeting on Genome Mapping, Sequencing and Biology, Cold Spring Harbor, NY.
Makalowska, I., Sood, R., Faruque, M.U., Hu, P., Eddings, E.M., Mestre, J.D., Baxevanis, A.D., and Carpten, J.D. 2002. Identification of six novel genes by experimental validation of GeneMachine‐predicted genes. Gene 284:203‐213.
Pearson, W.R., Wood, T., Zhang, Z., and Miller, W. 1997. Comparison of DNA sequences with protein sequences. Genomics 46:24‐36.
Rogic, S., Mackworth, A., and Ouellette, B.F.F. 2001. Evaluation of gene‐finding programs. Genome Res. 11:817‐832.
Snyder, E.E. and Stormo, G.D. 1993. Identification of coding regions in genomic DNA sequences: An application of dynamic programming and neural networks. Nucl. Acids Res. 21:607‐613.
Snyder, E.E. and Stormo, G.D. 1997. Identifying genes in genomic DNA sequences. In DNA and Protein Sequence Analysis (M.J. Bishop and C.J. Rawlings, eds.) pp. 209‐224. Oxford University Press, New York.
Stormo, G.D. 2000. Gene‐finding approaches for eukaryotes. Genome Res. 10:511‐515.
Wevrick, R., Kerns, J.A., and Francke, U. 1996. The IPW gene is imprinted and is not expressed in the Prader‐Willi syndrome. Acta Genet. Med. Gemellol. 45:191‐197.
Zhang, J. and Madden, T.L. 1997. PowerBLAST: A new network BLAST application for interactive or automated sequence analysis and annotation. Genome Res. 7:649‐656.