Genome‐Scale Sequencing to Identify Genes Involved in Mendelian Disorders
Abstract
The analysis of genome‐scale sequence data can be defined as the interrogation of a complete set of genetic instructions in a search for individual loci that produce or contribute to a pathological state. Bioinformatic analysis of sequence data requires sufficient discriminant power to find this needle in a haystack. Current approaches make choices about selectivity and specificity thresholds, and about the quality, quantity, and completeness of the data used in these analyses. Many software tools, both commercial and open‐source, are available for individual analytic component tasks. Three major types of techniques have been included in most published exome projects to date: frequency/population genetic analysis, inheritance state consistency, and predictions of deleteriousness. The required infrastructure and the use of each technique during analysis of genomic sequence data for clinical and research applications are discussed. Future developments that will alter the strategies and sequence of using these tools are also discussed. Curr. Protoc. Hum. Genet. 79:6.13.1‐6.13.19. © 2013 by John Wiley & Sons, Inc.
Keywords: exome; Mendelian inheritance; next generation sequencing; bioinformatics; clinical sequencing
Figures
Figure 6.13.1 Selected components of the NIH UDP analysis pipeline. The NIH Undiagnosed Diseases Program (UDP) analysis pipeline combines exome data with high‐density SNP array data. This has been found to be a cost‐effective way to combine deep coverage of coding regions with a genome‐spanning structural survey. SNP chips are checked for quality and then analyzed for copy number variations (CNVs) with PennCNV (http://www.openbioinformatics.org/penncnv/). The list of CNVs is manually curated and combined with manual analysis for homozygosity and verification of parentage. If sufficient family members are available, Boolean searches and further manual curation are used to map recombination sites. CNVs, recombination sites, and other regions of interest are defined in Browser Extensible Data (BED) file format for incorporation into later analysis. Subsequent exome analysis uses two primary programs: IGV and VarSifter. The former is used to visualize pile‐ups in the assembled BAM file; the latter is used to incorporate BED file filters, allele frequency data, pathogenicity data, and gene lists. VarSifter also allows the construction of arbitrary Boolean filters, providing fine control over searches for subsets of interest.
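The way BED‐defined regions and allele frequency data combine into a single filter can be illustrated with a minimal Python sketch. It is not VarSifter's implementation; the field names (`chrom`, `pos`, `maf`, `effect`), the example regions, and the 1% frequency cutoff are all hypothetical choices made for illustration. The only format fact relied on is that BED intervals are 0‐based and half‐open, whereas variant positions are conventionally 1‐based.

```python
from collections import defaultdict


def load_bed(lines):
    """Parse BED lines into {chrom: sorted list of (start, end)}.

    BED coordinates are 0-based and half-open: [start, end).
    """
    regions = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip blank lines and optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        regions[chrom].append((int(start), int(end)))
    for chrom in regions:
        regions[chrom].sort()
    return regions


def in_regions(regions, chrom, pos):
    """Return True if the 1-based position `pos` lies in any region on `chrom`."""
    zero_based = pos - 1
    return any(start <= zero_based < end for start, end in regions.get(chrom, []))


# Hypothetical regions of interest (e.g., a curated CNV and a homozygous run).
bed = [
    "chr1\t1000\t2000\tCNV_del",
    "chr2\t500\t800\thomozygous_run",
]

# Hypothetical annotated variants; `maf` is a population allele frequency.
variants = [
    {"chrom": "chr1", "pos": 1500, "maf": 0.001, "effect": "stop"},
    {"chrom": "chr1", "pos": 2001, "maf": 0.0, "effect": "nonsynonymous"},
    {"chrom": "chr2", "pos": 600, "maf": 0.20, "effect": "nonsynonymous"},
]

regions = load_bed(bed)

# Boolean filter: inside a region of interest AND rare (assumed cutoff of 1%).
hits = [
    v for v in variants
    if in_regions(regions, v["chrom"], v["pos"]) and v["maf"] < 0.01
]
```

A linear scan over regions is used for clarity; for genome‐scale BED files an interval tree or sorted‐interval search would likely be preferable.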
Figure 6.13.2 Integrative Genomics Viewer screenshot. The Integrative Genomics Viewer (IGV, http://www.broadinstitute.org/igv/) is a lightweight yet powerful tool for viewing short‐read pileups. The example shown includes pileups from six individuals: two parents, one affected child, and three unaffected children. For convenience, a case was selected that shows two variants that are physically close to one another and therefore fit on the same screen. At the top of the display is a diagram of the chromosome being reviewed, with a small vertical red bar (between q12.1 and q13) marking the region displayed below. The bulk of the display is taken up by six rows of pile‐up data. Each row is an individual; each short read is a thin, gray horizontal line. Base positions that have been genotyped as non‐reference are highlighted in blue or red. In this case, the mother is heterozygous for two DNA variants. The father is heterozygous for one of the same variants and also for one different variant. Each parent's pair of variants is known to be cis‐oriented because there are short reads with both variants and short reads with neither variant. The affected child carries DNA variants on both alleles, unlike any of the unaffected siblings.
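The read‐based reasoning in this caption (reads carrying both variants plus reads carrying neither imply a cis orientation) can be sketched in Python. This is an illustrative simplification, not part of IGV: the function name, the input representation (one pair of booleans per read spanning both sites), and the `min_support` threshold are assumptions.

```python
from collections import Counter


def phase_from_reads(reads, min_support=2):
    """Infer the orientation of two nearby heterozygous variants.

    `reads` is an iterable of (has_variant_a, has_variant_b) pairs, one per
    short read that spans both variant positions. `min_support` is an
    assumed minimum read count for a pattern to be believed.
    """
    counts = Counter((bool(a), bool(b)) for a, b in reads)
    both = counts[(True, True)]
    neither = counts[(False, False)]
    only_a = counts[(True, False)]
    only_b = counts[(False, True)]

    # cis: one haplotype carries both variants, the other carries neither.
    if both >= min_support and neither >= min_support and only_a + only_b < min_support:
        return "cis"
    # trans: each haplotype carries exactly one of the two variants.
    if only_a >= min_support and only_b >= min_support and both < min_support:
        return "trans"
    return "undetermined"


# Hypothetical read patterns for a parent (cis) and an affected child (trans).
parent_reads = [(True, True)] * 5 + [(False, False)] * 6
child_reads = [(True, False)] * 4 + [(False, True)] * 5

parent_phase = phase_from_reads(parent_reads)
child_phase = phase_from_reads(child_reads)
```

Read‐based phasing of this kind only works when both variants fall within the length of a single read or read pair; more distant variants require parental genotypes, as in the figure that follows.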
Figure 6.13.3 Boolean filter for finding compound‐heterozygote “half hets”. Boolean filtration can be used to find variant subsets of interest within the called genotypes in a genome‐scale sequencing data set. The schematic shows the criteria an allele must meet to be one of a pair that fits a compound heterozygous recessive Mendelian model. After application of this filter, the resulting variant list is sorted by locus name. Variants of certain classes are prioritized, including those that result in stop, splice site, frame shift, and non‐synonymous amino acid changes. A typical exome yields ∼300 to 900 such variants in total. At any one locus there are at most a very small number of these types of variants, and typically only a very few loci have two or more. Loci with two or more variants must be inspected individually to determine whether any pair is oppositely phased, with one variant inherited from each parent. Pairs of variants that occur at the same locus, are of a type expected to change protein function, and are correctly phased (typically no more than 0 to 5 pairs) constitute the compound heterozygous candidate variant pairs.
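The filter logic described in this caption can be sketched for a single trio as follows. This is a minimal illustration under stated assumptions, not the schematic's exact Boolean expression: genotypes are encoded as alternate‐allele counts (0, 1, or 2), the field names and gene labels are hypothetical, and unaffected siblings are not used to exclude pairs.

```python
from collections import defaultdict
from itertools import combinations

# Variant classes prioritized in the caption (labels are assumed spellings).
DAMAGING = {"stop", "splice", "frameshift", "nonsynonymous"}


def parent_of_origin(v):
    """Return 'mother' or 'father' if `v` is a 'half het', else None.

    A half het here means: child heterozygous, exactly one parent
    heterozygous, and the other parent homozygous reference.
    """
    if v["child"] != 1:
        return None
    if v["mother"] == 1 and v["father"] == 0:
        return "mother"
    if v["father"] == 1 and v["mother"] == 0:
        return "father"
    return None


def compound_het_pairs(variants):
    """List (gene, pos1, pos2) for pairs with one variant from each parent."""
    by_gene = defaultdict(list)
    for v in variants:
        origin = parent_of_origin(v)
        if origin is not None and v["effect"] in DAMAGING:
            by_gene[v["gene"]].append((origin, v))

    pairs = []
    for gene in sorted(by_gene):  # sorted by locus name, as in the caption
        for (o1, v1), (o2, v2) in combinations(by_gene[gene], 2):
            if o1 != o2:  # oppositely phased: one maternal, one paternal
                pairs.append((gene, v1["pos"], v2["pos"]))
    return pairs


# Hypothetical trio genotypes (alternate-allele counts per individual).
trio_variants = [
    {"gene": "GENE_A", "pos": 100, "effect": "stop", "child": 1, "mother": 1, "father": 0},
    {"gene": "GENE_A", "pos": 250, "effect": "nonsynonymous", "child": 1, "mother": 0, "father": 1},
    {"gene": "GENE_B", "pos": 900, "effect": "nonsynonymous", "child": 1, "mother": 1, "father": 0},
    {"gene": "GENE_B", "pos": 950, "effect": "frameshift", "child": 1, "mother": 1, "father": 0},
    {"gene": "GENE_C", "pos": 40, "effect": "synonymous", "child": 1, "mother": 0, "father": 1},
    {"gene": "GENE_C", "pos": 60, "effect": "stop", "child": 1, "mother": 1, "father": 0},
]

candidates = compound_het_pairs(trio_variants)
```

In this toy data only `GENE_A` survives: both `GENE_B` variants are maternal, and one `GENE_C` variant is synonymous. In a family with unaffected siblings, a further filter could discard pairs that any unaffected sibling also carries in trans.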
Figure 6.13.4 De Finetti diagram. A De Finetti diagram is used to graph genotype frequencies in populations. It presumes two alleles, and can be used to plot the genotype frequencies at which Hardy‐Weinberg equilibrium (HWE) is satisfied. The figure shows a triangular prism with surfaces plotted in its interior. The vertices of the triangles on the ends of the prism correspond to genotypes as shown: AA, AB, and BB. The length of the prism is a scale of the number of individuals in the population, from 1 (far left) to ≥400 (far right). The region between the upper and lower internal plot surfaces defines the combinations of genotypes that are consistent with HWE given a particular population size. As the population size increases, an increasingly small proportion of all possible genotype combinations is in HWE. However, the boundary between the in‐HWE and out‐of‐HWE regions changes increasingly gradually as the population size reaches hundreds of individuals. For this reason, a data set of hundreds of individuals allows stringent criteria to be used in assessing whether a set of genotypes is out of HWE, potentially due to misalignment.
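A simple numerical check for departure from HWE at a biallelic site can be sketched with a one‐degree‐of‐freedom chi‐square test. This is a generic textbook test offered as an illustration, not the method used to draw the figure; the function name and example counts are hypothetical. The chi‐square approximation is unreliable for small samples or rare alleles, where an exact test would be a better choice.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg equilibrium for one biallelic site.

    Returns (chi2, p_value), with the p-value taken from a chi-square
    distribution with 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotype is required")

    # Allele frequencies estimated from the observed genotype counts.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p

    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)

    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical site in equilibrium: 25 AA, 50 AB, 25 BB.
chi2_ok, p_ok = hwe_chi_square(25, 50, 25)

# Hypothetical site where every individual is called heterozygous, one
# pattern that could arise if reads from similar sequences are misaligned.
chi2_bad, p_bad = hwe_chi_square(0, 100, 0)
```

With 100 individuals the all‐heterozygous site gives a chi‐square statistic of 100, far beyond any conventional threshold, which is consistent with the caption's point that hundreds of samples permit stringent HWE criteria.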