Quality Control Procedures for Genome‐Wide Association Studies

Internet, 2013-12-31

Abstract

Genome‐wide association studies (GWAS) are being conducted at an unprecedented rate in population‐based cohorts and have increased our understanding of the pathophysiology of complex disease. Regardless of context, the practical utility of this information will ultimately depend upon the quality of the original data. Quality control (QC) procedures for GWAS are computationally intensive, operationally challenging, and constantly evolving. Here we enumerate some of the challenges in QC of GWAS data and describe the approaches that the electronic MEdical Records and Genomics (eMERGE) network is using for quality assurance in GWAS data, thereby minimizing potential bias and error in GWAS results. We discuss common issues associated with QC of GWAS data, including data file formats, software packages for data manipulation and analysis, sex chromosome anomalies, sample identity, sample relatedness, population substructure, batch effects, and marker quality. We propose best practices and discuss areas of ongoing and future research. Curr. Protoc. Hum. Genet. 68:1.19.1‐1.19.18 © 2011 by John Wiley & Sons, Inc.

Keywords: genome‐wide association studies; GWAS; quality control; QC; biobanks; electronic medical records; eMERGE

GO TO THE FULL PROTOCOL:

PDF or HTML at Wiley Online Library

Table of Contents

Introduction
GWAS Data Format
Sample Quality
Marker Quality
Batch Effects
Evaluation of QC After Association Analysis
Future Directions
Acknowledgements
Literature Cited
Figures
Tables

Figures

Figure 1.19.1 A flowchart overview of the entire GWAS QC process. Each topic is discussed in detail in the corresponding section in the text. Squares represent steps, ovals represent input or output data, and trapezoids represent filtering of data.

Figure 1.19.2 Visualization of X and Y probe intensities. The x‐axis and y‐axis show, for the X chromosome and the Y chromosome respectively, the sum of the mean normalized Cartesian intensity for allele A and the mean normalized Cartesian intensity for allele B, each averaged over all probes available on that chromosome. The XX (female, red circles) and XY (male, blue triangles) subjects appear in the bottom right corner and in the top left corner, respectively. The plot reveals two mislabeled individuals (one male in the female cluster, and one female in the male cluster). Several XXY individuals are also clearly visible (upper right corner).
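
The clustering described in this caption can be turned into a simple discordance check. The following is a minimal sketch only: the intensity cutoffs, the `Sample` fields, and the function names are hypothetical, and real cutoffs would have to be read off the plot for the array in use.

```python
from dataclasses import dataclass

# Hypothetical cutoffs on mean normalized intensity; in practice they are
# chosen by inspecting the X-versus-Y intensity plot for the array in use.
X_CUTOFF = 1.0
Y_CUTOFF = 0.5


@dataclass
class Sample:
    sample_id: str
    reported_sex: str   # "F" or "M", as recorded in the phenotype data
    x_intensity: float  # mean normalized intensity over X-chromosome probes
    y_intensity: float  # mean normalized intensity over Y-chromosome probes


def infer_karyotype(s: Sample) -> str:
    """Assign a sample to a plot corner: high X/low Y, low X/high Y, or both high."""
    high_x = s.x_intensity >= X_CUTOFF
    high_y = s.y_intensity >= Y_CUTOFF
    if high_x and not high_y:
        return "XX"      # bottom right corner
    if high_y and not high_x:
        return "XY"      # top left corner
    if high_x and high_y:
        return "XXY?"    # upper right corner, needs follow-up
    return "unclear"     # low on both axes, needs manual review


def flag_discordant(samples):
    """Return (id, reported sex, inferred karyotype) where the two disagree."""
    expected = {"F": "XX", "M": "XY"}
    flagged = []
    for s in samples:
        inferred = infer_karyotype(s)
        if inferred != expected.get(s.reported_sex):
            flagged.append((s.sample_id, s.reported_sex, inferred))
    return flagged
```

Flagged samples are candidates for review rather than automatic exclusion, since a mismatch can reflect either a labeling error or a genuine sex chromosome anomaly.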

Figure 1.19.3 Copy number and allelic variation to detect anomalies on the X chromosome. The top plot shows the B‐allele frequencies for all probes for one sample with total loss of heterozygosity (LOH) on the X chromosome. The bottom plot shows the copy number variation for the same sample on the X chromosome. Together, the two plots help detect regions of LOH and/or copy number variation such as deletion and amplification.
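
One way to screen for the pattern in the top plot is to measure how many probes fall in the heterozygous band of B‐allele frequency. This is a rough sketch under stated assumptions: the band limits (0.25 to 0.75) and the 2% cutoff are illustrative values, not thresholds taken from the protocol, and the function names are hypothetical.

```python
def het_fraction(bafs, lower=0.25, upper=0.75):
    """Fraction of probes whose B-allele frequency lies in the heterozygous band.

    `bafs` is a sequence of B-allele frequencies (0.0 to 1.0) for one sample
    over one region; None marks a probe with no call. The band limits are
    illustrative assumptions.
    """
    informative = [b for b in bafs if b is not None]
    if not informative:
        raise ValueError("no informative probes in region")
    n_het = sum(1 for b in informative if lower < b < upper)
    return n_het / len(informative)


def looks_like_loh(bafs, max_het_fraction=0.02):
    """Flag a region as possible LOH when almost no probes look heterozygous.

    The 2% cutoff is a hypothetical default. Males are expected to show no
    heterozygosity on most of the X chromosome, so this check is only
    meaningful there for samples expected to carry two X chromosomes.
    """
    return het_fraction(bafs) <= max_het_fraction
```

A flagged region still needs the copy number plot to distinguish a deletion from copy‐neutral LOH, which this sketch does not attempt.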

Figure 1.19.4 Points in this plot show pairs of individuals plotted by their degree of relatedness: the proportion of loci where the pair shares one allele IBD (Z1) against the proportion of loci where the pair shares zero alleles IBD (Z0). These values are obtained from PLINK using the --genome option. Pairs are color‐coded by the type of relationship determined by the pedigree information embedded in the PED file (also reported by PLINK). For clarity, this plot omits pairs of individuals having an overall kinship coefficient ≤ 0.05. There is a pair of monozygotic twins represented by a point in the lower left at (0,0), because they share two alleles IBD at every locus across the genome.
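
The relationship between a point's position and the type of relative can be sketched in code. The example below reads the whitespace‐delimited table that PLINK writes for --genome (columns such as Z0, Z1, Z2, and PI_HAT) and assigns each pair to the theoretical IBD‐sharing pattern it lies closest to. The nearest‐pattern rule and all function names are illustrative assumptions, not the procedure used to color the figure, and treating PI_HAT as the "kinship coefficient" of the captions is likewise an assumption.

```python
# Expected (Z0, Z1, Z2) IBD-sharing proportions for standard relationships.
EXPECTED_IBD = {
    "monozygotic twins or duplicate": (0.0, 0.0, 1.0),
    "parent-offspring": (0.0, 1.0, 0.0),
    "full siblings": (0.25, 0.5, 0.25),
    "second-degree": (0.5, 0.5, 0.0),
    "third-degree": (0.75, 0.25, 0.0),
    "unrelated": (1.0, 0.0, 0.0),
}


def pi_hat(z1, z2):
    """Overall proportion of alleles shared IBD: P(IBD=2) + 0.5 * P(IBD=1)."""
    return z2 + 0.5 * z1


def classify_pair(z0, z1):
    """Return the relationship whose expected sharing is nearest (squared distance)."""
    z2 = max(0.0, 1.0 - z0 - z1)
    observed = (z0, z1, z2)

    def distance(name):
        return sum((o - e) ** 2 for o, e in zip(observed, EXPECTED_IBD[name]))

    return min(EXPECTED_IBD, key=distance)


def read_genome(lines):
    """Parse lines of a PLINK .genome file into one dict per pair of samples."""
    rows = [line.split() for line in lines if line.strip()]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, fields)) for fields in body]
```

For example, `classify_pair(0.0, 0.0)` lands on the monozygotic twin pattern, matching the point at (0,0) described in the caption. Estimated sharing is noisy, so pairs far from every pattern deserve manual inspection.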

Figure 1.19.5 Histogram showing the distribution of pairwise kinship coefficients (where the kinship coefficient is greater than 0.05). The peak over 0.5 represents first‐degree relatives (parent‐offspring, full siblings). The peak over 0.25 represents second‐degree relatives (half siblings, avuncular, grandparent‐grandchild). Third‐ and fourth‐degree relatives begin to blend into more distantly related samples between zero and 0.125.

Figure 1.19.6 Proportion of SNPs or samples remaining as call rate threshold increases. The green line shows the proportion of SNPs remaining when SNPs are discarded if they fall below the given genotyping efficiency threshold. The blue line shows the proportion of samples remaining, while the red line shows the proportion of samples remaining if a 99% call rate threshold is applied to eliminate poor‐quality markers first.
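
The three curves in this figure can be reproduced from a genotype matrix. The sketch below assumes a hypothetical in‐memory layout (a list of per‐sample rows, with `None` for a missing call) and hypothetical function names; it illustrates the order‐of‐filtering effect rather than any particular software's implementation.

```python
def call_rate(calls):
    """Fraction of non-missing genotype calls (None marks a missing call)."""
    return sum(1 for c in calls if c is not None) / len(calls)


def snp_call_rates(genotypes):
    """Call rate of each SNP (column) across all samples (rows)."""
    return [call_rate(column) for column in zip(*genotypes)]


def sample_call_rates(genotypes, keep_snps=None):
    """Call rate of each sample, optionally restricted to a subset of SNP indices."""
    if keep_snps is None:
        keep_snps = range(len(genotypes[0]))
    keep = list(keep_snps)
    if not keep:
        raise ValueError("no SNPs left after filtering")
    return [call_rate([row[j] for j in keep]) for row in genotypes]


def proportion_remaining(rates, threshold):
    """Proportion of SNPs or samples whose call rate meets the threshold."""
    return sum(1 for r in rates if r >= threshold) / len(rates)


def samples_remaining_after_snp_filter(genotypes, snp_threshold, sample_threshold):
    """Drop poorly called SNPs first, then evaluate sample call rates."""
    rates = snp_call_rates(genotypes)
    keep = [j for j, r in enumerate(rates) if r >= snp_threshold]
    return proportion_remaining(sample_call_rates(genotypes, keep), sample_threshold)
```

Evaluating `proportion_remaining` over a grid of thresholds gives the green and blue lines; `samples_remaining_after_snp_filter` with a SNP threshold of 0.99 corresponds to the red line, where removing poor‐quality markers first keeps more samples.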

Figure 1.19.7 Power to detect an association at genome‐wide significance (p < 5×10^‐8), assuming the actual causal SNP is genotyped in a case‐control study consisting of 5000 cases and 5000 controls of a common disease with 10% prevalence under an additive model at several different odds ratios. Note that when the MAF is low, power is extremely low even for very large effects (odds ratio = 1.7).
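
The qualitative behavior in this figure can be approximated with a short calculation. The sketch below is not the method used to draw the figure: it swaps in a normal approximation to a two‐sided allelic test, assumes Hardy‐Weinberg proportions in the general population, and interprets "additive" as a per‐allele multiplicative effect on the odds of disease. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def _phi(x):
    """Standard normal CDF, written with erfc for accuracy in the far tails."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def case_control_allele_freqs(maf, odds_ratio, prevalence):
    """Risk-allele frequency in cases and in controls under the assumed model."""
    q = 1.0 - maf
    geno = (q * q, 2.0 * maf * q, maf * maf)  # population genotype frequencies

    def penetrances(f0):
        base_odds = f0 / (1.0 - f0)
        odds = [base_odds * odds_ratio ** g for g in range(3)]
        return [o / (1.0 + o) for o in odds]

    # Bisection on the baseline penetrance so the model matches the prevalence.
    lo, hi = 1e-12, 1.0 - 1e-12
    for _ in range(200):
        mid = (lo + hi) / 2.0
        k = sum(p * f for p, f in zip(geno, penetrances(mid)))
        if k < prevalence:
            lo = mid
        else:
            hi = mid
    f = penetrances((lo + hi) / 2.0)
    k = sum(p * fg for p, fg in zip(geno, f))
    p_case = (0.5 * geno[1] * f[1] + geno[2] * f[2]) / k
    p_ctrl = (0.5 * geno[1] * (1 - f[1]) + geno[2] * (1 - f[2])) / (1.0 - k)
    return p_case, p_ctrl


def allelic_test_power(maf, odds_ratio, n_cases=5000, n_controls=5000,
                       prevalence=0.10, alpha=5e-8):
    """Approximate power of a two-sided allelic test at significance level alpha."""
    p_case, p_ctrl = case_control_allele_freqs(maf, odds_ratio, prevalence)
    se = math.sqrt(p_case * (1 - p_case) / (2 * n_cases)
                   + p_ctrl * (1 - p_ctrl) / (2 * n_controls))
    z = (p_case - p_ctrl) / se
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return _phi(z - z_crit) + _phi(-z - z_crit)
```

Calling `allelic_test_power` over a grid of MAF values for each odds ratio gives curves of the same general shape, with power collapsing as the MAF approaches zero. Exact values will differ from a dedicated power calculator.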

Figure 1.19.8 AB and BB individuals are split into subclusters AB and AB', BB and BB', while the AA cluster is unaffected. The AB/AB' split results in some AB samples miscalled as AA (diagnosed by Mendelian inconsistencies in the genotypes), as well as deviation from HWE due to excess homozygosity. Since only samples with at least one B allele demonstrate the splitting, one consistent explanation is the presence of a cryptic polymorphism near rs2301237 on a haplotype that contains the B allele. In this case, a second polymorphism (rs3114267) lies eight bases upstream from the typed polymorphism, and is in complete LD (D' = 1, r² = 0.2) with rs2301237.

Figure 1.19.9 Unexpected number of clusters resulting in departure from HWE consistent with copy loss. Hemizygous individuals cluster at AO and BO. Individuals with homozygous deletions cluster at OO and their genotype calls are missing. The AB cluster remains intact, since these individuals are ipso facto diploid at the locus. Parent‐parent‐child Mendelian errors are present when at least one parent is hemizygous and produces hemizygous offspring. The deletion results in excess homozygosity. In this case, the “copy loss” appears to be a six‐nucleotide insertion (rs71578153) coincident with rs11591064 that disrupts both A and B probes.

Figure 1.19.10 The five observed clusters are most consistent with a segmental duplication, although none is curated around the locus. A copy number variant would be expected to produce additional clusters above the AA and BB clusters (i.e., AAA and BBB), as opposed to the splits being confined to strictly the heterozygous clusters. Regardless, the artifact results in excess heterozygosity.
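
The clustering artifacts in Figures 1.19.8 through 1.19.10 are all detected through departure from HWE, and the direction of the departure (excess homozygosity versus excess heterozygosity) helps narrow down the cause. A minimal sketch of that check follows, using a one‐degree‐of‐freedom chi‐square test on genotype counts. The function name and return layout are hypothetical, and an exact test is generally preferable when genotype counts are small.

```python
import math


def hwe_summary(n_aa, n_ab, n_bb):
    """Chi-square HWE test plus the direction of any departure.

    Returns (chi2, p_value, f, direction), where f = 1 - observed/expected
    heterozygotes: f > 0 means excess homozygosity, f < 0 excess heterozygosity.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotype calls")
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    f = 1.0 - n_ab / expected[1] if expected[1] > 0 else 0.0
    if f > 0:
        direction = "excess homozygosity"
    elif f < 0:
        direction = "excess heterozygosity"
    else:
        direction = "none"
    return chi2, p_value, f, direction
```

With counts of 300 AA, 400 AB, and 300 BB, the expected heterozygote count is 500, so the marker is flagged for excess homozygosity, the pattern described for miscalled heterozygotes and copy loss. Counts of 200, 600, and 200 give the opposite sign, as described for Figure 1.19.10.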

Literature Cited

	Aulchenko, Y.S., de Koning, D.J., and Haley, C. 2007. Genomewide rapid association using mixed model and regression: A fast and simple method for genomewide pedigree‐based quantitative trait loci association analysis. Genetics 177:577‐585.
	Barber, M.J., Mangravite, L.M., Hyde, C.L., Chasman, D.I., Smith, J.D., McCarty, C.A., Li, X., Wilke, R.A., Rieder, M.J., Williams, P.T., Ridker, P.M., Chatterjee, A., Rotter, J.I., Nickerson, D.A., Stephens, M., and Krauss, R.M. 2010. Genome‐wide association of lipid‐lowering response to statins in combined study populations. PLoS One 5:e9763.
	Broman, K.W. 1999. Cleaning genotype data. Genet. Epidemiol. 17:S79‐S83.
	Cardon, L.R. and Palmer, L.J. 2003. Population stratification and spurious allelic association. Lancet 361:598‐604.
	Carlson, C.S., Smith, J.D., Stanaway, I.B., Rieder, M.J., and Nickerson, D.A. 2006. Direct detection of null alleles in SNP genotyping data. Hum. Mol. Genet. 15:1931‐1937.
	Chanock, S.J., Manolio, T., Boehnke, M., Boerwinkle, E., Hunter, D.J., Thomas, G., Hirschhorn, J.N., Abecasis, G., Altshuler, D., Bailey‐Wilson, J.E., Brooks, L.D., Cardon, L.R., Daly, M., Donnelly, P., Fraumeni, J.F. Jr., Freimer, N.B., Gerhard, D.S., Gunter, C., Guttmacher, A.E., Guyer, M.S., Harris, E.L., Hoh, J., Hoover, R., Kong, C.A., Merikangas, K.R., Morton, C.C., Palmer, L.J., Phimister, E.G., Rice, J.P., Roberts, J., Rotimi, C., Tucker, M.A., Vogan, K.J., Wacholder, S., Wijsman, E.M., Winn, D.M., and Collins, F.S. 2007. Replicating genotype‐phenotype associations. Nature 447:655‐660.
	Dadd, T., Weale, M.E., and Lewis, C.M. 2009. A critical evaluation of genomic control methods for genetic association studies. Genet. Epidemiol. 33:290‐298.
	Daly, A.K., Donaldson, P.T., Bhatnagar, P., Shen, Y., Pe'er, I., Floratos, A., Daly, M.J., Goldstein, D.B., John, S., Nelson, M.R., Graham, J., Park, B.K., Dillon, J.F., Bernal, W., Cordell, H.J., Pirmohamed, M., Aithal, G.P., Day, C.P.; DILIGEN Study; International SAE Consortium. 2009. HLA‐B*5701 genotype is a major determinant of drug‐induced liver injury due to flucloxacillin. Nat. Genet. 41:816‐819.
	Devlin, B. and Roeder, K. 1999. Genomic control for association studies. Biometrics 55:997‐1004.
	Devlin, B., Bacanu, S.A., and Roeder, K. 2004. Genomic Control to the extreme. Nat. Genet. 36:1129‐1130.
	Dumitrescu, L.C., Ritchie, M.D., Brown‐Gentry, K., Pulley, J.J., Basford, M., Denny, J., Oksenberg, J.R., Roden, D.M., Haines, J.L., and Crawford, D.C. 2010. Assessing the accuracy of observer‐reported ancestry in a biorepository linked to electronic medical records. Genet. Med. In press.
	Frayling, T.M. 2007. Genome‐wide association studies provide new insights into type 2 diabetes aetiology. Nat. Rev. Genet. 8:657‐662.
	Gauderman, W.J. 2002. Sample size requirements for matched case‐control studies of gene‐environment interaction. Stat. Med. 21:35‐50.
	Gorlov, I.P., Gorlova, O.Y., Sunyaev, S.R., Spitz, M.R., and Amos, C.I. 2008. Shifting paradigm of association studies: Value of rare single‐nucleotide polymorphisms. Am. J. Hum. Genet. 82:100‐112.
	Grady, B.J., Torstenson, E., Dudek, S.M., Giles, J., Sexton, D., and Ritchie, M.D. 2010. Finding unique filter sets in PLATO: A precursor to efficient interaction analysis in GWAS data. Pac. Symp. Biocomput. 2010:315‐326.
	Hindorff, L.A., Sethupathy, P., Junkins, H.A., Ramos, E.M., Mehta, J.P., Collins, F.S., and Manolio, T.A. 2009. Potential etiologic and functional implications of genome‐wide association loci for human diseases and traits. Proc. Natl. Acad. Sci. U.S.A. 106:9362‐9367.
	International HapMap Consortium. 2003. The International HapMap Project. Nature 426:789‐796.
	International HapMap Consortium. 2007. A second generation human haplotype map of over 3.1 million SNPs. Nature 449:851‐861.
	Kathiresan, S., Willer, C.J., Peloso, G.M., Demissie, S., Musunuru, K., Schadt, E.E., Kaplan, L., Bennett, D., Li, Y., Tanaka, T., Voight, B.F., Bonnycastle, L.L., Jackson, A.U., Crawford, G., Surti, A., Guiducci, C., Burtt, N.P., Parish, S., Clarke, R., Zelenika, D., Kubalanza, K.A., Morken, M.A., Scott, L.J., Stringham, H.M., Galan, P., Swift, A.J., Kuusisto, J., Bergman, R.N., Sundvall, J., Laakso, M., Ferrucci, L., Scheet, P., Sanna, S., Uda, M., Yang, Q., Lunetta, K.L., Dupuis, J., de Bakker, P.I., O'Donnell, C.J., Chambers, J.C., Kooner, J.S., Hercberg, S., Meneton, P., Lakatta, E.G., Scuteri, A., Schlessinger, D., Tuomilehto, J., Collins, F.S., Groop, L., Altshuler, D., Collins, R., Lathrop, G.M., Melander, O., Salomaa, V., Peltonen, L., Orho‐Melander, M., Ordovas, J.M., Boehnke, M., Abecasis, G.R., Mohlke, K.L., and Cupples, L.A. 2009. Common variants at 30 loci contribute to polygenic dyslipidemia. Nat. Genet. 41:56‐65.
	Klein, R.J., Zeiss, C., Chew, E.Y., Tsai, J.Y., Sackler, R.S., Haynes, C., Henning, A.K., Sangiovanni, J.P., Mane, S.M., Mayne, S.T., Bracken, M.B., Ferris, F.L., Ott, J., Barnstable, C., and Hoh, J. 2005. Complement factor H polymorphism in age‐related macular degeneration. Science 308:385‐389.
	Laurie, C., Mirel, D., Pugh, E., Bierut, L., Bhangale, T., Boehm, F., Caporaso, N., Edenberg, H., Gabriel, S., Harris, E., Hu, F.B., Jacobs, K.B., Kraft, P., Landi, M.T., Lumley, T., Manolio, T.A., McHugh, C., Painter, I., Paschall, J., Rice, J.P., Rice, K.M., Zheng, X., Weir, B.S.; GENEVA Investigators. 2010. Quality control and quality assurance in genotypic data for genome‐wide association studies. Genet. Epidemiol. 34:591‐602.
	Link, E., Parish, S., Armitage, J., Bowman, L., Heath, S., Matsuda, F., Gut, I., Lathrop, M., and Collins, R. 2008. SLCO1B1 variants and statin‐induced myopathy: A genomewide study. N. Engl. J. Med. 359:789‐799.
	Mailman, M.D., Feolo, M., Jin, Y., Kimura, M., Tryka, K., Bagoutdinov, R., Hao, L., Kiang, A., Paschall, J., Phan, L., Popova, N., Pretel, S., Ziyabari, L., Lee, M., Shao, Y., Wang, Z.Y., Sirotkin, K., Ward, M., Kholodov, M., Zbicz, K., Beck, J., Kimelman, M., Shevelev, S., Preuss, D., Yaschenko, E., Graeff, A., Ostell, J., and Sherry, S.T. 2007. The NCBI dbGaP database of genotypes and phenotypes. Nat. Genet. 39:1181‐1186.
	Manolio, T.A. 2009. Collaborative genome‐wide association studies of diverse diseases: Programs of the NHGRI's office of population genomics. Pharmacogenomics 10:235‐241.
	Marchini, J., Cardon, L.R., Phillips, M.S., and Donnelly, P. 2004. The effects of human population structure on large genetic association studies. Nat. Genet. 36:512‐517.
	McCarty, C., Chisholm, R., Chute, C., Kullo, I., Jarvik, G., Larson, E., Li, R., Masys, D., Ritchie, M., Roden, D. et al. 2010. The eMERGE Network: A consortium of biorepositories linked to electronic medical records data for conducting genomic studies. BMC Med. Genomics In press.
	Miyagawa, T., Nishida, N., Ohashi, J., Kimura, R., Fujimoto, A., Kawashima, M., Koike, A., Sasaki, T., Tanii, H., Otowa, T., Momose, Y., Nakahara, Y., Gotoh, J., Okazaki, Y., Tsuji, S., and Tokunaga, K. 2008. Appropriate data cleaning methods for genome‐wide association study. J. Hum. Genet. 53:886‐893.
	Newton‐Cheh, C., Johnson, T., Gateva, V., Tobin, M.D., Bochud, M., Coin, L., Najjar, S.S., Zhao, J.H., Heath, S.C., Eyheramendy, S., Papadakis, K., Voight, B.F., Scott, L.J., Zhang, F., Farrall, M., Tanaka, T., Wallace, C., Chambers, J.C., Khaw, K.T., Nilsson, P., van der Harst, P., Polidoro, S., Grobbee, D.E., Onland‐Moret, N.C., Bots, M.L., Wain, L.V., Elliott, K.S., Teumer, A., Luan, J., Lucas, G., Kuusisto, J., Burton, P.R., Hadley, D., McArdle, W.L.; Wellcome Trust Case Control Consortium, Brown, M., Dominiczak, A., Newhouse, S.J., Samani, N.J., Webster, J., Zeggini, E., Beckmann, J.S., Bergmann, S., Lim, N., Song, K., Vollenweider, P., Waeber, G., Waterworth, D.M., Yuan, X., Groop, L., Orho‐Melander, M., Allione, A., Di Gregorio, A., Guarrera, S., Panico, S., Ricceri, F., Romanazzi, V., Sacerdote, C., Vineis, P., Barroso, I., Sandhu, M.S., Luben, R.N., Crawford, G.J., Jousilahti, P., Perola, M., Boehnke, M., Bonnycastle, L.L., Collins, F.S., Jackson, A.U., Mohlke, K.L., Stringham, H.M., Valle, T.T., Willer, C.J., Bergman, R.N., Morken, M.A., Döring, A., Gieger, C., Illig, T., Meitinger, T., Org, E., Pfeufer, A., Wichmann, H.E., Kathiresan, S., Marrugat, J., O'Donnell, C.J., Schwartz, S.M., Siscovick, D.S., Subirana, I., Freimer, N.B., Hartikainen, A.L., McCarthy, M.I., O'Reilly, P.F., Peltonen, L., Pouta, A., de Jong, P.E., Snieder, H., van Gilst, W.H., Clarke, R., Goel, A., Hamsten, A., Peden, J.F., Seedorf, U., Syvänen, A.C., Tognoni, G., Lakatta, E.G., Sanna, S., Scheet, P., Schlessinger, D., Scuteri, A., Dörr, M., Ernst, F., Felix, S.B., Homuth, G., Lorbeer, R., Reffelmann, T., Rettig, R., Völker, U., Galan, P., Gut, I.G., Hercberg, S., Lathrop, G.M., Zelenika, D., Deloukas, P., Soranzo, N., Williams, F.M., Zhai, G., Salomaa, V., Laakso, M., Elosua, R., Forouhi, N.G., Völzke, H., Uiterwaal, C.S., van der Schouw, Y.T., Numans, M.E., Matullo, G., Navis, G., Berglund, G., Bingham, S.A., Kooner, J.S., Connell, J.M., Bandinelli, S., Ferrucci, L., Watkins, H., Spector, T.D., Tuomilehto, J., Altshuler, D., Strachan, D.P., Laan, M., Meneton, P., Wareham, N.J., Uda, M., Jarvelin, M.R., Mooser, V., Melander, O., Loos, R.J., Elliott, P., Abecasis, G.R., Caulfield, M., and Munroe, P.B. 2009. Genome‐wide association study identifies eight loci associated with blood pressure. Nat. Genet. 41:666‐676.
	Novembre, J., Johnson, T., Bryc, K., Kutalik, Z., Boyko, A.R., Auton, A., Indap, A., King, K.S., Bergmann, S., Nelson, M.R., Stephens, M., and Bustamante, C.D. 2008. Genes mirror geography within Europe. Nature 456:98‐101.
	Patterson, N., Price, A.L., and Reich, D. 2006. Population structure and eigenanalysis. PLoS Genet. 2:e190.
	Price, A.L., Patterson, N.J., Plenge, R.M., Weinblatt, M.E., Shadick, N.A., and Reich, D. 2006. Principal components analysis corrects for stratification in genome‐wide association studies. Nat. Genet. 38:904‐909.
	Pritchard, J.K., Stephens, M., and Donnelly, P. 2000. Inference of population structure using multilocus genotype data. Genetics 155:945‐959.
	Purcell, S., Neale, B., Todd‐Brown, K., Thomas, L., Ferreira, M.A., Bender, D., Maller, J., Sklar, P., de Bakker, P.I., Daly, M.J., and Sham, P.C. 2007. PLINK: A tool set for whole‐genome association and population‐based linkage analyses. Am. J. Hum. Genet. 81:559‐575.
	Reich, D.E. and Goldstein, D.B. 2001. Detecting association in a case‐control study while correcting for population stratification. Genet. Epidemiol. 20:4‐16.
	Sherry, S.T., Ward, M.H., Kholodov, M., Baker, J., Phan, L., Smigielski, E.M., and Sirotkin, K. 2001. dbSNP: The NCBI database of genetic variation. Nucleic Acids Res. 29:308‐311.
	Simon‐Sanchez, J., Scholz, S., Fung, H.C., Matarin, M., Hernandez, D., Gibbs, J.R., Britton, A., de Vrieze, F.W., Peckham, E., Gwinn‐Hardy, K., Crawley, A., Keen, J.C., Nash, J., Borgaonkar, D., Hardy, J., and Singleton, A. 2007. Genome‐wide SNP assay reveals structural genomic variation, extended homozygosity and cell‐line induced alterations in normal individuals. Hum. Mol. Genet. 16:1‐14.
	Skol, A.D., Scott, L.J., Abecasis, G.R., and Boehnke, M. 2006. Joint analysis is more efficient than replication‐based analysis for two‐stage genome‐wide association studies. Nat. Genet. 38:209‐213.
	Tang, H., Quertermous, T., Rodriguez, B., Kardia, S.L., Zhu, X., Brown, A., Pankow, J.S., Province, M.A., Hunt, S.C., Boerwinkle, E., Schork, N.J., and Risch, N.J. 2005. Genetic structure, self‐identified race/ethnicity, and confounding in case‐control association studies. Am. J. Hum. Genet. 76:268‐275.
	Thompson, J.F., Hyde, C.L., Wood, L.S., Paciga, S.A., Hinds, D.A., Cox, D.R., Hovingh, G.K., and Kastelein, J.J. 2009. Comprehensive whole‐genome and candidate gene analysis for response to statin therapy in the Treating to New Targets (TNT) cohort. Circ. Cardiovasc. Genet. 2:173‐181.
	Willer, C.J., Sanna, S., Jackson, A.U., Scuteri, A., Bonnycastle, L.L., Clarke, R., Heath, S.C., Timpson, N.J., Najjar, S.S., Stringham, H.M., Strait, J., Duren, W.L., Maschio, A., Busonero, F., Mulas, A., Albai, G., Swift, A.J., Morken, M.A., Narisu, N., Bennett, D., Parish, S., Shen, H., Galan, P., Meneton, P., Hercberg, S., Zelenika, D., Chen, W.M., Li, Y., Scott, L.J., Scheet, P.A., Sundvall, J., Watanabe, R.M., Nagaraja, R., Ebrahim, S., Lawlor, D.A., Ben‐Shlomo, Y., Davey‐Smith, G., Shuldiner, A.R., Collins, R., Bergman, R.N., Uda, M., Tuomilehto, J., Cao, A., Collins, F.S., Lakatta, E., Lathrop, G.M., Boehnke, M., Schlessinger, D., Mohlke, K.L., and Abecasis, G.R. 2008. Newly identified loci that influence lipid concentrations and risk of coronary artery disease. Nat. Genet. 40:161‐169.
	Wittke‐Thompson, J.K., Pluzhnikov, A., and Cox, N.J. 2005. Rational inferences about departures from Hardy‐Weinberg equilibrium. Am. J. Hum. Genet. 76:967‐986.
	Zhang, F., Wang, Y., and Deng, H.W. 2008. Comparison of population‐based association study methods correcting for population stratification. PLoS One 3:e3392.
Internet Resources
	http://censtats.census.gov/data/WI/1605549675.pdf
	Census 2000. Profile of Demographic Characteristics, Marshfield, Wisconsin.
	http://pngu.mgh.harvard.edu/~purcell/plink/
	Illumina Technical Note: “TOP/BOT” Strand and “A/B” Allele (2009).
	http://www.illumina.com/Documents/products/technotes/technote_gen_call_data_analysis_software.pdf
	Illumina GenCall Data Analysis Software (2008).
	http://www.R-project.org
	R Development Core Team: R: A language and environment for statistical computing. ISBN 3900051070, Vienna, Austria: R Foundation for Statistical Computing (2005).
	http://pritch.bsd.uchicago.edu/structure.html
	STRUCTURE (2009).
	https://www.mc.vanderbilt.edu/victr/dcc/projects/acc/index.php/Visualizing_relatedness
	Turner, S.D. 2009. Visualizing sample relatedness in a GWAS using PLINK and R.