Analysis and Management of Microarray Gene Expression Data
Abstract
Microarray experiments require careful planning and choice of analysis tools in order to get the most out of the data generated, especially considering the significant cost and effort involved. Microarray experiments also require careful documentation, which often resides in local databases and/or is submitted to public repositories. An often bewildering assortment of choices is available for experimental design, data preprocessing, data analysis (e.g., differential gene expression, classification), and data management. This unit covers the basic steps and common applications for planning, data processing, and data management of microarray experiments, and provides guidance on making choices based on the goals and practical realities of the experiment, as well as the authors' experience in this area.
Keywords: microarray; experimental design; data preprocessing; data analysis; databases; gene expression
Table of Contents
- Experimental Design
- Data Preprocessing
- Expression and Differential Expression
- Classification
- Looking at Gene Sets
- Databases
- Conclusions
- Literature Cited
- Figures
Figures
Figure 19.6.1 Steps involved in carrying out a complete microarray experiment.
Figure 19.6.2 (A) Plot of M versus A for a two‐channel microarray assay (see text). The dashed horizontal line is the line M = m, where m is the median of M over all reporters. The solid curve represents a global loess fit. (B) Plot of M versus A for the data from panel A after normalization by the global constant m. The solid curve represents a global loess fit and illustrates that this normalization did not eliminate intensity‐dependent biases. (C) Plot of M versus A for the data from panel A after global loess normalization. All global loess fitting and normalization were performed with the Bioconductor marray R package.
Figure 19.6.3 (A) Plot of M versus A for a two‐channel microarray assay. Each curve represents a lowess fit to one of the print‐tips. Data generated with the Bioconductor sma R package, using the MouseArray dataset provided therein. (B) Plot of M versus A for the data from panel A after print‐tip lowess normalization (performed with the Bioconductor sma R package).
Figure 19.6.4 Result of hierarchical clustering of both genes and samples for twelve one‐channel assays, using centered Pearson correlation and Eisen's TreeView software (http://rana.lbl.gov/EisenSoftware.htm). The sample tree is along the top and the gene tree is along the side. Brighter green represents higher signal intensity.
Figure 19.6.5 Simple example of a decision tree.
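The quantities plotted in Figure 19.6.2 can be illustrated with a minimal Python sketch. It assumes the conventional definitions M = log2(R/G) and A = (1/2) log2(R·G) for the two channel intensities R and G; the caption defers to the unit text for the definitions, so treat these as an assumption. The intensity values and function names are hypothetical and serve only as an illustration. The sketch covers normalization by the global median m (panel B), not the loess fit, which the figure shows is still needed to remove intensity‐dependent bias.

```python
import math
from statistics import median


def ma_values(red, green):
    """Return (M, A) lists for paired two-channel intensities.

    Assumed conventions: M = log2(R/G), A = 0.5 * log2(R*G).
    """
    m_vals, a_vals = [], []
    for r, g in zip(red, green):
        m_vals.append(math.log2(r / g))
        a_vals.append(0.5 * math.log2(r * g))
    return m_vals, a_vals


def median_normalize(m_vals):
    """Subtract the global median m of M from every reporter's M value."""
    m = median(m_vals)
    return [v - m for v in m_vals]


# Hypothetical background-corrected intensities for five reporters.
red = [1200.0, 800.0, 150.0, 4000.0, 60.0]
green = [600.0, 400.0, 100.0, 1000.0, 30.0]

M, A = ma_values(red, green)
M_norm = median_normalize(M)

for a, m_raw, m_adj in zip(A, M, M_norm):
    print(f"A={a:6.2f}  M={m_raw:6.3f}  M_normalized={m_adj:6.3f}")
```

After this step the median of the normalized M values is zero, but every reporter is shifted by the same constant, so any trend of M with A remains. That is why an intensity‐dependent method such as global or print‐tip loess is applied in panel C and in Figure 19.6.3.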