Analysis and Annotation of Whole-Genome or Whole-Exome Sequencing–Derived Variants for Clinical Diagnosis

UNIT 9.24

Elizabeth A. Worthey1,2,3

1 Department of Pediatrics, Medical College of Wisconsin, Milwaukee, Wisconsin
2 The Human and Molecular Genetics Center, Medical College of Wisconsin, Milwaukee, Wisconsin
3 Department of Computer Science, University of Wisconsin, Milwaukee, Wisconsin

ABSTRACT

Over the last several years, next-generation sequencing (NGS) has transformed genomic research through substantial advances in technology and reduction in the cost of sequencing, and also in the systems required for analysis of these large volumes of data. This technology is now being used as a standard molecular diagnostic test under particular circumstances in some clinical settings. The advances in sequencing have come so rapidly that the major bottleneck in identification of causal variants is no longer the sequencing but rather the analysis and interpretation. Interpretation of genetic findings in a clinical setting is scarcely a new challenge, but the task is increasingly complex in clinical genome-wide sequencing given the dramatic increase in dataset size and complexity. This increase requires the development of novel or repositioned analysis tools, methodologies, and processes. This unit provides an overview of these items. Specific challenges related to implementation in a clinical setting are discussed. Curr. Protoc. Hum. Genet. 79:9.24.1-9.24.24. © 2013 by John Wiley & Sons, Inc.

Keywords: sequencing • genome variant identification • genome variant annotation • genome variant interpretation

INTRODUCTION

WGS Background

The human genome sequencing project, completed over 10 years ago, generated the reference human genome: approximately 3.2 billion base pairs of reference sequence containing between 25,000 and 30,000 genes (Lander et al., 2001; Venter et al., 2001). Over the following decade, availability of this and additional human genomes and access to the sequencing technology used to produce it supported many scientific discoveries (Waterston et al., 2003; de Bakker et al., 2004; McCarroll et al., 2006; ENCODE Project Consortium, 2007; Weir et al., 2007). Over the last several years, a rapid increase in the speed of technological advancement [beginning with the development of next-generation sequencing (NGS) methods] has forever altered the rate of biological discovery (see, e.g., Shapiro and Hofreiter, 2010; Bras et al., 2012; Mardis, 2012; Gardy, 2013). Genomic research has been transformed, with ever-increasing throughput and reduction in cost, resulting in the current situation where a human genome can be sequenced for a few thousand dollars in a single experiment (sequencing run) in a few days (Saunders et al., 2012). Through subsequent application of a variety of bioinformatics tools, differences between the genome of the individual being sequenced and the human reference genome can be identified, providing the data for clinical interpretation and the identification of alleles associated with human disease and health (e.g., Choi et al., 2009; Lupski et al., 2010; Worthey et al., 2011; Saunders et al., 2012; Jacob et al., 2013). Progress has been so rapid that, at the time this unit went to press, it was reported anecdotally that genome-wide sequencing had been performed in >100,000 individuals. Although many of these individuals were presumed healthy at the time of

Current Protocols in Human Genetics 9.24.1-9.24.24, October 2013. Published online October 2013 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/0471142905.hg0924s79. Copyright © 2013 John Wiley & Sons, Inc.


sequencing (or at least not selected due to the presence of any specific disorder), a number of publications have presented variation data from individuals selected for genome-wide sequencing because they suffered from a disease believed to be genetic in nature (Choi et al., 2009; Bilguvar et al., 2010; Lalonde et al., 2010; Berg et al., 2011; Bick and Dimmock 2011; Mayer et al., 2011; Worthey et al., 2011; Chen, Y.K. et al., 2012; Goh et al., 2012; Green et al., 2012; Ng, S.B. et al., 2013). In such studies, the goal has been to identify the variants causal for the altered phenotype of the individual or individuals studied. Whole Genome Sequencing (WGS) and Whole Exome Sequencing (WES) are already being used to support the identification of causative mutations for rare presumed monogenic disorders as discussed above, identification of driver mutations for appropriate therapeutics and treatment selection in cancer (Lander, 2011; Roukos and Ku, 2012; Wang, L. et al., 2012; Ku et al., 2013a), and rapid pathogen identification in clinical settings (Didelot et al., 2012; Shen, H. et al., 2013; reviewed in Bertelli and Greub, 2013). More recently, such approaches have been used to study the genetic underpinnings of various complex disorders (Rosenberg and Hastings, 2004; Bras et al., 2012; Hastings, R. et al., 2012; Ku et al., 2013b). The technique is also emerging as an appropriate technology for presymptomatic prediction of risk in healthy individuals (e.g., Korf and Rehm, 2013; Bras et al., 2012; Snape et al., 2012). Despite significant advances in our understanding of the genetic basis of disease, genome-wide identification of variants that alter the function of the underlying molecule and lead to altered phenotypes represents the most significant challenge in modern human genetics. 
With the current state of technology, the major bottleneck in identification of such causal variants is no longer the sequencing or even the variant identification or annotation, but rather the downstream analysis of variants—particularly the use of variant annotations to allow clinical interpretation of the genome.

Variant identification, annotation, and interpretation in a clinical setting

In a clinical setting, greater understanding of the underpinnings of genetic disease provides an opportunity to uncover not just single gene-to-phenotype associations but entire pathways involved in pathology. As mentioned above, in some cases the findings guide selection of novel treatments or therapeutics that would not have been selected without the availability of the sequencing results; in both Bainbridge et al. (2011) and Worthey et al. (2011), for example, WES rendered a diagnosis that directed selection of successful treatments. Interpretation of genetic findings in a clinical setting is scarcely a new challenge, but with application of genome-wide sequencing comes both a dramatic increase in the size of the dataset under interrogation and an altered method of identifying variants of clinical significance, particularly because the genes under analysis are no longer preselected for association with the disease under study (Schrijver et al., 2012; Biesecker and Peay, 2013; Korf and Rehm, 2013; Moorthie et al., 2013; also see UNIT 9.22). These changes require development of novel or repositioned tools, methodologies, and processes. A commonly stated obstacle to implementing WGS clinically is the complexity of analysis and the subsequent requirement for clinical interpretation of huge volumes of sequence variant data (e.g., Hastings, R. et al., 2012; Tucker et al., 2012). Sequencing of each patient’s genome can be expected to generate more than 4 million variants, each of which must be considered for association with or causality for the phenotype under study. It is important to note, however, that even in diagnostic odysseys, where there is no knowledge of an underlying pathway and where all genes and intergenic regions must initially be considered, the majority of the variants can be ruled out because insufficient data exist to link them to clinically actionable outcomes.
Simply put, the complexity of WGS (or WES) interpretation is drastically reduced by excluding from consideration variants with insufficient annotation data for clinical interpretation. Since we have insufficient data to interpret the vast majority of the variants, the task of performing a clinical analysis and interpretation under these circumstances becomes entirely manageable. It is also worthwhile to point out in this introduction to clinical genome-wide sequence analysis that a second commonly stated hurdle to clinical WGS, turnaround time, has also been alleviated if not removed by recent technological advances. The ability to provide a diagnosis within weeks (within a day in some settings) is critical, as it shortens time-to-diagnosis, reduces stress for families and physicians, and reduces the impact of application of less appropriate treatments. Recent advances in the instrumentation for sequencing, in protocols for preparation and sequencing of the DNA, and in analysis and interpretation have reduced the time required to generate and analyze sequence data to a clinically appropriate timeline, supporting clinical application of WGS approaches (Saunders et al., 2012).

BIOINFORMATICS ANALYSIS OF CLINICAL WGS DATA

Bioinformatics analysis of genome-wide sequence data has a number of steps, starting with instrument-specific processing of the sequence image data and proceeding through mapping and variant calling to variant annotation and interpretation (e.g., Ajay et al., 2011; O’Rawe et al., 2013). These steps can be categorized into four phases: primary, secondary, and tertiary analysis, and post-tertiary analysis/interpretation (see Fig. 9.24.1). This unit will focus mainly on the secondary and tertiary phases, but will provide an overview so that the steps can be seen in their place in the entire bioinformatics process.

[Figure 9.24.1 panel text, in workflow order]
Sample preparation: DNA preparation; sequencing; image collection.
Image analysis; base calling; read production; quality assignment. Generally performed using on-instrument workstation hardware. Generation of FASTQ files.
Demultiplexing; quality filtering; mapping/assembly; post-mapping reprocessing. Generally performed using off-instrument server hardware. Generation of SAM, BAM, other files.
Variant calling (SNVs); variant calling (IDIs); variant calling (SVs); post-calling realignment. Generally performed using off-instrument server hardware. Generation of VCF, gVCF, VCFclin, other files.
Data aggregation; variant annotation; classification; prioritization. Always performed on off-instrument server hardware. Generation of complex data structure (often in database).
Post-tertiary analysis: integration of clinical data; interpretation for specific patient; reporting. Generally performed on off-instrument workstation hardware. Generation of clinical report(s).

Figure 9.24.1 This figure outlines the analysis steps undertaken during analysis of genome sequence data: primary, secondary, and tertiary analysis, and post-tertiary analysis/interpretation. These steps transform the sequence image data produced by a sequencing instrument through read construction, read mapping, variant calling, variant annotation, and finally interpretation to extract clinically useful information. The types of systems that the analyses are performed on and the file types generated are also provided.

Primary Analysis

The first set of analysis steps performed on sequencing data transforms intensity data derived from the raw image files generated during sequencing into base calls and their associated quality scores. It extends to compilation of these base calls into sequence read files (short contiguous sequence data files) with an associated quality score for each base. The FASTQ file format is used to store both the nucleotide sequence and the associated quality scores, which encode estimates of the probability that each base is correctly called (Martinez-Alcantara et al., 2009). In general, the various sequencing instruments are supplied with preinstalled bundled software to perform this phase of analysis, and the analysis is performed as part of the sequencing run, although the base calling can be performed offline on another server to free up time on the sequencer (Quail et al., 2008; Wheeler et al., 2008). If multiple samples have been sequenced together in a single run, the last step in primary analysis (or sometimes the first step of secondary analysis) is demultiplexing, which separates samples using index tags that were ligated to the fragmented DNA prior to pooling for sequencing.
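The relationship between FASTQ quality characters and base-call confidence can be sketched in a few lines of Python. This is a minimal illustration, not the code used by any instrument vendor: it assumes the common four-line FASTQ record layout and the Phred+33 ASCII encoding (both assumptions, since the unit does not specify an encoding), and the function names read_fastq and error_probabilities are hypothetical.

```python
import io
from typing import Iterator, List, Tuple


def read_fastq(handle) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality_string) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # separator line beginning with '+'
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality  # drop the leading '@'


def error_probabilities(quality: str, offset: int = 33) -> List[float]:
    """Convert encoded quality characters to per-base error probabilities.

    Phred scale: Q = -10 * log10(P_error), so P_error = 10 ** (-Q / 10).
    The offset of 33 is an assumption (Phred+33 encoding).
    """
    return [10 ** (-(ord(ch) - offset) / 10) for ch in quality]


# Toy record: 'I' encodes Q40, '5' encodes Q20, '!' encodes Q0 under Phred+33.
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
probabilities = error_probabilities(records[0][2])
```

Under these assumptions a quality of Q20 corresponds to an estimated 1-in-100 chance that the base is wrong, and Q40 to 1 in 10,000.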

Secondary Analysis

Read mapping and alignment

The secondary analysis phase first takes these sequence reads, filters them for quality and removes duplicates (to reduce bias due to overrepresentation in coverage downstream), and then uses one of two different strategies to reassemble the individual’s genome. The first (but rarely attempted) approach is de novo assembly, where the reads are reassembled without use of a reference assembly. A number of de novo assembly tools specifically designed for assembly of short-read next-generation sequencing data exist, including Velvet, ABySS, ALLPATHS, and SOAPdenovo (Butler et al., 2008; Zerbino and Birney, 2008; Simpson et al., 2009; Luo et al., 2012). For a relatively recent review of these and other software tools, see Lin et al. (2011) and Zhang, W. et al. (2011). De novo assembly, given short-read technology, is not a simple process (Gnerre et al., 2009; Pop, 2009). It nevertheless has many benefits relating to the ability to accurately reconstruct regions of the genome dissimilar to the reference genome, as well as the potential for assembling regions that are not present or are poorly constructed in the reference genome (Butler et al., 2008; Nagarajan and Pop, 2009). Because of this, although currently not widely used, de novo assembly of genomes for identification of variants will likely become commonplace in the not-too-distant future. Further advances in sequencing technology providing longer read lengths, and the development of algorithms supporting more accurate alignment, will fuel this change (e.g., Carneiro et al., 2012). Because of the current limitations of de novo assembly, the human genome resequencing performed to date has relied almost exclusively on a mapping strategy, which reassembles the patient’s genome by mapping sequence reads against the human genome reference (e.g., Butler et al., 2008; Langmead et al., 2009; Li and Durbin, 2009; Hastings, R. et al., 2012). Although this is the most commonly applied approach, it is not without challenges, as outlined below.
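The read clean-up that precedes mapping or assembly (quality filtering and duplicate removal) can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the mean-quality cutoff of 20 is an arbitrary illustrative value, duplicates are defined here as reads with identical sequence (production tools commonly identify duplicates from alignment coordinates instead), and the function name filter_and_deduplicate is hypothetical.

```python
from typing import Iterable, List, Tuple

# (read_id, sequence, per-base Phred quality scores)
Read = Tuple[str, str, List[int]]


def filter_and_deduplicate(reads: Iterable[Read],
                           min_mean_quality: float = 20.0) -> List[Read]:
    """Drop low-quality reads, then keep only the first read of each sequence."""
    kept: List[Read] = []
    seen_sequences = set()
    for read_id, sequence, qualities in reads:
        # Quality filter: discard reads whose mean base quality is too low.
        if not qualities or sum(qualities) / len(qualities) < min_mean_quality:
            continue
        # Duplicate removal (simplified): identical sequence counts as a duplicate.
        if sequence in seen_sequences:
            continue
        seen_sequences.add(sequence)
        kept.append((read_id, sequence, qualities))
    return kept


reads = [
    ("r1", "ACGTAC", [30] * 6),
    ("r2", "ACGTAC", [32] * 6),  # same sequence as r1, treated as a duplicate
    ("r3", "TTGACA", [10] * 6),  # mean quality below the cutoff
    ("r4", "GGCATC", [25] * 6),
]
filtered = filter_and_deduplicate(reads)
kept_ids = [read_id for read_id, _, _ in filtered]
```

In this toy input only r1 and r4 survive; removing r2 prevents a single over-amplified fragment from inflating the apparent coverage downstream.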

Complications with read mapping

Some problems relate to the existing reference genome, which is neither complete nor completely accurate. The current genome build has many gaps, and with no reference sequence to match to, reads corresponding to these sections cannot be mapped. These sections are often not amenable to existing sequencing methodologies due to the presence of low-complexity regions or repeats capable of forming secondary structures that hinder polymerase activity (Treangen and Salzberg, 2011; Wetzel et al., 2011). In addition, a mapping strategy is transitive in that it passes on errors in the construction of the human reference genome to the newly sequenced genome. The reference genome contains many (some large) poorly assembled sections (Church et al., 2011). These can be caused by regions that were hard to assemble accurately during construction of the reference due to the presence of interspersed or longer tandem repeats. Such poorly constructed regions cause problems when short reads from a newly sequenced genome are mapped correctly, but against an incorrect reference. Even in regions where the reference genome is of extremely high quality, it is often difficult to accurately map a short sequence read to a single position on the reference (Ruffalo et al., 2011; English et al., 2012). For example, the presence of new or ancient chromosomal duplications and other homologous regions of high sequence identity causes uncertainty as to which is the correct mapping location in the genome (Bailey et al., 2001; Marques-Bonet et al., 2008). This has always been a problem, but becomes more so with shorter reads, which carry less information for accurate placement. There are numerous different ways in which the algorithms deal with placement of reads (Butler et al., 2008; Pop and Salzberg, 2008; Homer et al., 2009; Simpson et al., 2009), but none is perfect and all result in mapping problems. This method is also complicated by the presence, in the newly sequenced genome, of regions that are absent in or notably different from the reference (Vetro et al., 2012). These differences between the patient and reference genomes lead to poor reassembly, particularly for some analyses, given that the reference genome is not representative of the structure found in genomes from individuals with different ethnic backgrounds. Finally, due to the large size of the datasets being compared, heuristic algorithms are required, and these may be subject to biases that can result in errors in alignment (English et al., 2012). In all of these instances, the presence of non-randomly distributed sequencing errors in these regions can further confound mapping, giving rise to mapping errors for a particular set of reads. To attempt to address some of these issues, many mapping tools perform realignment of reads around regions containing apparent variants prior to moving on to the next stage of analysis (discussed further below).
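The placement ambiguity caused by duplicated sequence, and the way longer reads reduce it, can be demonstrated with a toy Python example. This is only a sketch: it uses exact string matching rather than any real mapping algorithm, and the reference sequence, reads, and function name exact_match_positions are all hypothetical.

```python
from typing import List


def exact_match_positions(reference: str, read: str) -> List[int]:
    """Return every 0-based position at which the read matches the reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


# Toy reference containing the same segment twice, with different flanking sequence.
segment = "GATTACAGATTC"
reference = "AACCGGTT" + segment + "CCATGGAA" + segment + "TTGGCCAT"

short_read = segment[2:10]             # lies wholly inside the duplicated segment
long_read = "GGTT" + segment + "CCAT"  # extends into unique flanking sequence

short_hits = exact_match_positions(reference, short_read)
long_hits = exact_match_positions(reference, long_read)
```

The short read matches both copies of the segment, so its true origin cannot be determined from sequence alone, whereas the longer read reaches unique flanking sequence and has a single placement.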

Tools for short read mapping

Development of methods and tools for accurate read mapping has been, and continues to be, an area of active work (Jiang et al., 2009; Li and Durbin, 2009; McKenna et al., 2010; DePristo et al., 2011; Carnevali et al., 2012; English et al., 2012; Hastings et al., 2012). Just a few years ago, the number of options available for read mapping was large, and deciding which tool to use was a commonly stated challenge (Homer et al., 2009; Li and Durbin, 2009; Miller et al., 2010; Wetzel et al., 2011). Comparisons of the results provided by use of these different algorithms on the same dataset have shown significant differences in the variants called, with some differences clearly attributable to the mapping algorithms used (McKenna et al., 2010; Haug et al., 2013). Over the last couple of years, there has been general coalescence on a few tools, most commonly BWA and ELAND (Bentley et al., 2008; Li and Durbin, 2009). Platform-specific differences should be kept in mind both when selecting a mapping tool and when undertaking validation (McKenna et al., 2010; Watt et al., 2013). When selecting a tool for use in a clinical setting, it is critical to know not only the accuracy of the algorithm, but also (1) the level of likely support moving forward and (2) the degree to which the tool has been internally and externally validated (Baker et al., 2012). In many cases, academic tools cannot be proven to meet stringent clinical validation requirements (Baker et al., 2012).

Sequence alignment file formats

A full discussion of file formats for storage and transfer of the aligned sequence read data is outside the scope of this unit. In brief, the most widely used are the SAM and BAM formats (Li and Durbin, 2009). The BAM format is a binary format that can be used to store sequence data, aligned as well as unaligned. The Sequence Alignment/Map (SAM) format provides the same utility, but in a tab-delimited, human-readable text format; its larger size leads to greater overhead, making BAM the standard (Wolfe et al., 1996; Li and Durbin, 2009).
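Because SAM is tab-delimited text, a single alignment line can be decoded with a few lines of Python. The sketch below is illustrative only: it reads the 11 mandatory SAM fields and three of the standard FLAG bits (0x4 unmapped, 0x10 reverse strand, 0x400 duplicate), ignores optional tags, and uses a made-up record; the names SamRecord and parse_sam_line are hypothetical, and production code would normally rely on an established SAM/BAM library.

```python
from dataclasses import dataclass


@dataclass
class SamRecord:
    qname: str   # read name
    flag: int    # bitwise FLAG
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    seq: str     # read sequence
    qual: str    # encoded base qualities

    @property
    def is_unmapped(self) -> bool:
        return bool(self.flag & 0x4)

    @property
    def is_reverse(self) -> bool:
        return bool(self.flag & 0x10)

    @property
    def is_duplicate(self) -> bool:
        return bool(self.flag & 0x400)


def parse_sam_line(line: str) -> SamRecord:
    """Parse the mandatory fields of one SAM alignment line (optional tags ignored)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has 11 mandatory fields")
    return SamRecord(
        qname=fields[0], flag=int(fields[1]), rname=fields[2],
        pos=int(fields[3]), mapq=int(fields[4]), cigar=fields[5],
        seq=fields[9], qual=fields[10],
    )


# Hypothetical record: FLAG 1040 = 0x400 (duplicate) + 0x10 (reverse strand).
line = "read1\t1040\tchr1\t100\t60\t6M\t*\t0\t0\tACGTAC\tIIIIII"
record = parse_sam_line(line)
```

BAM stores the same fields in compressed binary form, which is why it is preferred for storage even though SAM is easier to inspect by eye.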

Secondary Analysis (Variant Calling)

Following read mapping comes detection (calling) of the variants (or differences) that exist between the genome under analysis and the reference genome, a process generally referred to as variant calling (Challis et al., 2012; Li, H., 2012; Zhang, L. et al., 2013). Up to now, each phase of the analysis could be accomplished through use of a single tool; that is not the case in variant calling, where various tools are required to call different classes of variants. Prior to discussing the tools, these variant classes require some definition, since the terminology used in the literature is often confusing. Single nucleotide variants (SNVs; often incorrectly called SNPs, where the “P” specifically implies a lack of a detectable phenotypic alteration) are the most commonly studied class of variants (Chepelev et al., 2009; De Baets et al., 2012; Hastings et al., 2012; Zhang, L. et al., 2013). This term is generally applied only to substitutions and not to single nucleotide insertions or deletions. Small insertions or deletions are sometimes referred to as indels, which is a term used to describe an alteration where there has been an addition or deletion of nucleotides with respect to the reference. Indels can thus be broken down into component insertions, substitutions, and deletions, but doing so often adds complexity when storing, describing, and analyzing these events, frequently resulting in combination of individual changes into an indel with no guarantee that the changes described reflect the molecular changes that have occurred. Finally, the term structural variants (SVs) has been used to describe a number of classes of larger variants (e.g., McKernan et al., 2009; Kidd et al., 2010; Stankiewicz and Lupski, 2010; Xi et al., 2010). These include larger duplications and deletions [those leading to genomic imbalances often being referred to as Copy Number Variants (CNVs)], transversions, balanced translocations, and inversions of genomic regions varying from many hundred bases to many megabases in size. These can affect individual exons, entire genes, or larger chromosomal regions (Merikangas et al., 2009; Sharp, 2009; Stankiewicz and Lupski, 2010). No one algorithm can currently be used to accurately call all these classes of variants; in general, when undertaking genomic analysis, several variant-calling algorithms are used (Li, H., 2012; Zhang, L. et al., 2013).
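The small-variant classes defined above can be expressed as a simple classification rule on the reference and alternate alleles. The Python sketch below is an illustration under assumptions: it presumes a VCF-style representation in which a pure insertion or deletion shares its leading base(s) with the other allele, it treats any remaining combined change as an indel, and the function name classify_small_variant is hypothetical. Structural variants are not covered.

```python
def classify_small_variant(ref: str, alt: str) -> str:
    """Classify a small variant from its reference and alternate alleles.

    Assumes VCF-style alleles, where pure insertions and deletions are
    anchored on shared leading bases.
    """
    if len(ref) == 1 and len(alt) == 1:
        return "SNV" if ref != alt else "no change"
    if len(ref) < len(alt) and alt.startswith(ref):
        return "insertion"   # bases added relative to the reference
    if len(ref) > len(alt) and ref.startswith(alt):
        return "deletion"    # bases removed relative to the reference
    if len(ref) == len(alt):
        return "multi-nucleotide substitution"
    return "indel"           # co-located deletion and insertion


examples = {
    ("A", "G"): classify_small_variant("A", "G"),
    ("A", "ATG"): classify_small_variant("A", "ATG"),
    ("ATG", "A"): classify_small_variant("ATG", "A"),
    ("ATG", "AC"): classify_small_variant("ATG", "AC"),
}
```

The last example shows why indels are awkward to describe: replacing ATG with AC could equally be reported as a deletion plus a substitution, and the stored representation need not reflect the molecular event that occurred.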

Complications impacting variant calling


Many of the issues discussed above that impact the ability to accurately map sequence reads to the reference genome also lead to complexities in variant calling. When mismapped reads contain sequence differences among “copies” of the genomic region, variants can be called that are nothing more than the differences present between the paralogous regions. These variants, referred to as mapping errors, occur with high frequency in genome-wide sequencing analyses (Sundquist et al., 2007; Smith, A.D. et al., 2008; Lee and Schatz, 2012; Chen, Y. et al., 2013; Vijay et al., 2013). Mapping issues can give rise to incorrect variant calls for all of the classes of variants discussed above. Complications arising from insufficient depth of coverage, particularly in genomic regions where sequencing errors occur non-randomly, can also give rise to false variant calls if the number of reads that carry the error passes the read-count threshold specified by the variant-calling software.
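How a fixed read-count threshold can let recurrent errors through is easy to see with a deliberately naive caller. The Python sketch below is not how any of the callers named in this unit work; it simply counts non-reference bases at each position and reports a call when the count reaches a hypothetical threshold (min_alt_reads, set to 3 here). The reference, the pileup, and the function name call_snvs are all made up for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

# (position, ref_base, alt_base, alt_read_count, depth)
Call = Tuple[int, str, str, int, int]


def call_snvs(reference: str, pileup: Dict[int, List[str]],
              min_alt_reads: int = 3) -> List[Call]:
    """Report the most frequent non-reference base wherever it has enough supporting reads."""
    calls: List[Call] = []
    for position in sorted(pileup):
        bases = pileup[position]
        ref_base = reference[position]
        alt_counts = Counter(base for base in bases if base != ref_base)
        if not alt_counts:
            continue
        alt_base, alt_count = alt_counts.most_common(1)[0]
        if alt_count >= min_alt_reads:
            calls.append((position, ref_base, alt_base, alt_count, len(bases)))
    return calls


reference = "ACGTACGT"
pileup = {
    2: ["G"] * 10 + ["T"] * 9,  # well-supported site, roughly half the reads differ
    5: ["C"] * 27 + ["A"] * 3,  # only 3 of 30 reads differ, possibly a recurrent error
    6: ["G"] * 4 + ["A"] * 1,   # a single non-reference read
}
calls = call_snvs(reference, pileup)
```

Position 5 is reported even though only 10% of reads support it, because three reads carrying the same systematic error are enough to pass the threshold; this is the failure mode described above, and it is one reason real callers weigh base quality, mapping quality, and allele fraction rather than raw counts alone.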

SNV callers

It is widely acknowledged that algorithms for SNV detection from WGS or WES data are the most mature of the variant callers; these were some of the earliest algorithms developed and validated (e.g., Chen, K. et al., 2007; Craig et al., 2008; Smith, D.R. et al., 2008; Koboldt et al., 2009). Identification of single nucleotide substitutions where there are no complications related to accurate alignment is the simplest of variant-calling tasks. At this point in time, there are a few widely used SNV callers; CASAVA, VarScan, and GATK represent the most commonly used (Bentley et al., 2008; Koboldt et al., 2009; McKenna et al., 2010). Despite the relative maturity of these techniques, there remains discordance in the variants called by the different methods when applied to analysis of the same dataset (e.g., O’Rawe et al., 2013). For example, Bauer (2011) showed, in an in-depth comparison of CASAVA and GATK analysis of the same dataset, that although both algorithms call ∼70,000 SNVs, there is only an 80% overlap between the calls. Similarly, Rosenfeld et al. (2012) have shown that discordance ranged between 4% and 14% in comparisons of SNV calls using different tools on the same datasets. Their review of the Ti/Tv ratio of the variants called by only a single tool led them to conclude that many of these variants were unlikely to be sequencing errors. It is worth noting, however, that some (perhaps many) of these variant-call discrepancies are likely to result from the different mapping algorithms used and not the variant callers. Here, review of the Ti/Tv ratio would not be expected to discern mapping errors from true variants, and thus some of the unique differences among pipelines may be due to algorithmic differences leading to mismapping.
Because of these issues, some groups have recommended use of a combination of variant callers, pooling all variants called by each of the tools into a single set of candidates (e.g., O’Rawe et al., 2013). Others have reasoned that the overlap between a set of tools represents the most likely correct set of variants, and thus limit themselves to consideration of variants that all or the majority of pipelines have called. Clearly, such differences in strategy have significant implications for use of these tools in a clinical setting. At this point in time, it seems anecdotally that CASAVA or GATK are more frequently used as standalone callers, but also that many labs are now considering or developing algorithms to support a combination approach.
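The two combination strategies (pooling versus overlap) and the Ti/Tv check correspond to simple set operations. The Python sketch below uses made-up call sets from two hypothetical callers; the concordance measure shown (shared calls divided by pooled calls) is one possible definition chosen for illustration and is not necessarily the one used in the studies cited above. The function names are likewise hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """Transitions stay within purines or within pyrimidines; all other SNVs are transversions."""
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES


def ti_tv_ratio(variants: Iterable[Variant]) -> float:
    """Ratio of transitions to transversions among SNVs (ref and alt assumed to differ)."""
    transitions = transversions = 0
    for _, _, ref, alt in variants:
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    return transitions / transversions if transversions else float("inf")


def compare_call_sets(a: Set[Variant], b: Set[Variant]) -> Dict[str, object]:
    shared = a & b   # "overlap" strategy: calls made by both tools
    pooled = a | b   # "pooling" strategy: calls made by either tool
    return {
        "shared": shared,
        "pooled": pooled,
        "only_a": a - b,
        "only_b": b - a,
        "concordance": len(shared) / len(pooled) if pooled else 1.0,
    }


caller_a = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
            ("chr1", 300, "A", "C"), ("chr2", 50, "G", "A")}
caller_b = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
            ("chr1", 300, "A", "C"), ("chr2", 75, "T", "G")}

comparison = compare_call_sets(caller_a, caller_b)
shared_ti_tv = ti_tv_ratio(comparison["shared"])
```

Pooling maximizes sensitivity at the cost of carrying forward every tool's false positives, while restricting to the overlap does the reverse; which trade-off is acceptable in a clinical setting is the policy question raised in the text.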

Small indel, insertion, and deletion callers

The next most robust set of algorithms are those used for small indel, insertion, and deletion detection. As mentioned previously, the term indel is properly used to describe a special class of variant where an insertion and deletion are colocalized and thus distinct from either an insertion or deletion. This section deals with the calling of insertions, deletions, and indels (IDIs). IDI calling from WGS or WES is well known to be less robust than SNV calling (e.g., Challis et al., 2012; Li, H., 2012; Li, S. et al., 2013; O’Rawe et al., 2013). These methods remain under active algorithm development; a large number of distinct algorithms are commonly used (Bentley et al., 2008; Koboldt et al., 2009; McKenna et al., 2010). Although many tools are available that have been described anecdotally in a research setting, there has been a notable trend toward use of the GATK package from the Broad Institute for both IDI and SNV calling (McKenna et al., 2010). For clinical analysis, the instrument vendor’s software remains marginally ahead, in part awaiting release of validation statistics from use of GATK in a clinical setting. As might be expected, estimates of concordance among the most commonly used IDI callers are much lower than for SNVs, having been reported as being in the 30% to 50% range (Bauer, 2011; O’Rawe et al., 2013). In recent years, this has led to active debate as to the comparative sensitivity and selectivity of the various IDI callers; it is widely acknowledged that none of the existing tools is unequivocally best (O’Rawe et al., 2013). The take-home message is that even for small-variant calling, which is often presented as a somewhat simple undertaking, significant caution should be exercised when analyzing WGS and WES in a clinical setting. In particular, the interpretation of findings based on either identification of variants or absence of variants should be carried out very carefully, especially in the case of IDIs.

Structural variants

It can be argued that the larger a variant, the more likely it is to impact a functional region of the genome and the more likely it is to have impacted normal processes at some stage in development or afterwards. It is therefore unsurprising that many disease-causing variants, including a number that occur in non-protein-coding regions of the genome, fall into this category (e.g., Lupski, 1998; Koolen et al., 2004; Lupski et al., 2010). Tools have been developed to use WGS or WES data for detection of structural variations, and these are slowly being improved to reduce false positive and false negative rates (e.g., Hall and Quinlan, 2012; Karakoc et al., 2012; Mijuskovic et al., 2012; Priest et al., 2012; Lindberg et al., 2013). The specificity and sensitivity of SV calling remain significantly poorer than those of the smaller variants discussed above, leading to a higher false discovery rate (Hall and Quinlan, 2012; Karakoc et al., 2012; Mijuskovic et al., 2012; Rosenfeld et al., 2012). Some challenges relate to the ability to accurately identify the boundaries of a variant with a size that exceeds the length of the sequencing read being produced (Sharp, 2009). This is especially true when trying to identify SVs by mapping short reads in repetitive areas of the genome, which are often the sites for the breakpoints and joins in such rearrangements (Hormozdiari et al., 2009). The lack of unique differentiating sequence on either end of the majority of reads spanning a duplicated region in some cases causes the reads to collapse onto one another, and the lack of uniformity in sequence coverage across the genome makes it difficult to identify regions with altered coverage suggestive of some SVs. It also makes it harder to map variant boundaries accurately.
It is likewise difficult to detect other types of complex rearrangements such as transversions and inversions with short reads because there is often insufficient sequence flanking the breakpoints to map them to the second genomic location. In such cases, SVs may be missed or miscalled as smaller deletions or insertions. For these reasons, significant follow-up is required to confirm the validity of the variant predictions (Hormozdiari et al., 2009; Chou et al., 2012; Karakoc et al., 2012; Lindberg et al., 2013; Wang, S.K. et al., 2013). As with the IDIs, there is also an issue with inability to map reads whose sequence diverges from the reference due to the presence of novel insertions relative to the reference genome. For all these reasons, SV calling from NGS data is currently suitable in most instances for research, and not clinical applications, although progress is being made (Chou et al., 2012; Karakoc et al., 2012; Lindberg et al., 2013; Wang, S.K. et al., 2013). The two areas where these variants are being considered for the production of clinically actionable data are in oncology and for identification of known causally associated SVs. In oncology, many studies have shown the utility of WGS or WES for improvement of patient outcomes in a variety of different cancers, and in many cases the variants identified have been SVs (Weir et al., 2007; Borge et al., 2011; Dulak et al., 2013). In the setting of rare and other complex diseases, labs are evaluating use of genome-wide sequencing to identify known and validated SVs initially identified through other molecular diagnostic approaches, but work remains before clinical utility is proven. Interestingly, we are now seeing availability of ultra-high-throughput sequencing technologies that generate much longer reads that will be increasingly useful for identification of SVs (Gargis et al., 2012). In order to circumvent some of the mapping and reference genome assembly issues discussed previously, these technologies need to produce reads greater than 10 kb. At this length, the reads will be capable of spanning some of the low-complexity, tandemly repetitive, and interspersed-repeat regions of the genome that limit development of methods for accurate SV identification. As discussed previously, advances are being made to allow de novo assembly as opposed to mapping strategies for sequencing of entire human genomes; clearly, the ability to perform de novo assembly should greatly assist in accurate identification of SVs (e.g., Li, R. et al., 2010; Paszkiewicz and Studholme, 2010). As these tools and methodologies mature, we can expect to see their further expansion into a clinical setting.
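The idea of looking for regions with altered coverage, mentioned earlier in this section as one signal suggestive of some SVs, can be illustrated with a crude read-depth scan. The Python sketch below is a toy, not a validated SV caller: it compares the mean depth of fixed windows with the median across windows and flags large departures. The ratio cutoffs (0.6 and 1.4), the depth values, and the function name flag_depth_outliers are all hypothetical.

```python
from statistics import median
from typing import List, Tuple


def flag_depth_outliers(window_depths: List[float],
                        low: float = 0.6,
                        high: float = 1.4) -> List[Tuple[int, float, str]]:
    """Flag windows whose depth, relative to the median window depth, is unusually low or high."""
    baseline = median(window_depths)
    if baseline <= 0:
        raise ValueError("median depth must be positive")
    flagged = []
    for index, depth in enumerate(window_depths):
        ratio = depth / baseline
        if ratio <= low:
            flagged.append((index, ratio, "possible deletion"))
        elif ratio >= high:
            flagged.append((index, ratio, "possible duplication"))
    return flagged


# Mean read depth in ten consecutive windows of a hypothetical sample.
depths = [30, 31, 29, 15, 14, 30, 32, 46, 45, 30]
flagged = flag_depth_outliers(depths)
flagged_labels = [(index, label) for index, _, label in flagged]
```

With perfectly even coverage such a scan would be informative, but, as noted above, coverage across a real genome is far from uniform, so windows can be flagged for reasons unrelated to copy number; this is part of why SV predictions require significant follow-up.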

Integrated variant calling


SNV and IDI calling are often performed using separate tools, but tools such as those presented in Challis et al. (2012) and Saunders et al. (2012), which unify the two into a single process with improved local realignment, are more likely to accurately identify the variants present in a genome (Rosenfeld et al., 2012). Combining these methods with SV callers would further improve accuracy and ease of interpretation of results.

Variant data file formats

At the end of secondary analysis, these approaches will have produced (from a WGS experiment) a list with perhaps ∼4,500,000 variants: ∼3,900,000 SNVs, ∼290,000 insertions, ∼300,000 deletions, and ∼2,000 high-quality SVs. The variant calls generally take the form of genomic coordinates for the variant, the nucleotide change, an indication of the number of reads spanning the variant position that support the call, and an indication of the quality or likelihood of the call. The current standard format for storage and transmission of these variant calls is the Variant Call Format (VCF) specification (Reese et al., 2010). Originally developed for the 1000 Genomes Project, this format seeks to store only the data describing variants relative to the reference genome, and not the remainder of the newly sequenced genome (Danecek et al., 2011). The standard is updated frequently as knowledge of what needs to be stored develops; a number of modifications to the main specification have been developed for SVs, which were difficult to fit into the existing schema (http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41).
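As a minimal sketch of how the fields described above appear in practice, the following Python fragment parses one tab-delimited VCF data line into coordinates, nucleotide change, call quality, and read depth. The example record and the `parse_vcf_line` helper are illustrative assumptions, not part of any published pipeline; `DP` is the INFO key conventionally used for read depth, and a clinical pipeline would rely on a validated VCF library rather than hand-written parsing.

```python
from typing import List, NamedTuple, Optional


class VariantCall(NamedTuple):
    chrom: str
    pos: int                 # 1-based position on the reference
    ref: str                 # reference allele
    alt: str                 # alternate (variant) allele
    qual: Optional[float]    # Phred-scaled call quality, None if missing
    depth: Optional[int]     # reads spanning the position (INFO DP), if given


def parse_vcf_line(line: str) -> List[VariantCall]:
    """Parse one VCF data line (hypothetical helper for illustration).

    Returns one VariantCall per ALT allele listed on the line.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alts, qual, _filter, info = fields[:8]

    # INFO is a semicolon-separated list of KEY=VALUE pairs or bare flags.
    info_map = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        info_map[key] = value

    depth = int(info_map["DP"]) if "DP" in info_map else None
    quality = None if qual == "." else float(qual)
    return [
        VariantCall(chrom, int(pos), ref, alt, quality, depth)
        for alt in alts.split(",")
    ]


# Invented example record with two alternate alleles at one position.
example = "1\t12345\t.\tA\tG,T\t50.0\tPASS\tDP=32"
calls = parse_vcf_line(example)
```

Splitting multi-allelic lines into one record per alternate allele keeps later annotation steps simple, since each record then describes exactly one nucleotide change.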

Tertiary Analysis (Variant Annotation)

Following production of this variant list, the next analysis step, termed tertiary analysis, is to annotate each variant with the information required to distinguish causative variants from the milieu of polymorphisms and errors present in a WGS dataset. These annotations are used to deprioritize sequencing and mapping errors, polymorphisms, and, in most cases, variants associated with phenotypes not directly related to the disease under study, and to prioritize causative variants for the disease under analysis (Ng, S.B. et al., 2009, 2010; Bick and Dimmock, 2011; O’Roak et al., 2011; Weedon et al., 2011; Worthey et al., 2011; Yamaguchi et al., 2011; Wang et al., 2013; Yu et al., 2013). In a clinical genome-wide sequence analysis, a particular variant can have upwards of 200 annotations. These annotations can broadly be grouped into five distinct classes:

1. Data that can be used to deprioritize variants that are likely sequencing or mapping errors.
2. Data that can be used to determine the known clinical impact of a variant.
3. Data that can be used to determine the potential functional impact of a variant.
4. Data that can be used to determine the function of the genomic feature in which the variant is located.
5. Data that can be used to estimate the allele frequency of the variant in a nondisease or disease population.

These classes are each discussed under the following headers.

Class 1: Annotations supporting identification and deprioritization of likely sequencing or mapping errors

All polymerases produce sequencing errors that, following assembly, are distributed across the genome (Furey et al., 2004; Denisov et al., 2008; Smith, A.D. et al., 2008; Shen, Y. et al., 2010). Randomly distributed errors are relatively easy to identify given sufficient depth of coverage, and are unlikely to be called as variants. Sequence errors that occur preferentially at a particular position in the genome may well be incorrectly identified as variants, as may errors arising from the previously discussed complications that occur during mapping. Extrapolating from data generated in house, we predict that, in any given resequencing, many hundreds of thousands of such errors will exist. Much time would be wasted following up on these errors if deprioritization steps were not taken. A variety of data can be produced to support this deprioritization (Choi et al., 2009; Ng, S.B. et al., 2009, 2010; Becker et al., 2011; Worthey et al., 2011; O’Roak et al., 2012; Bainbridge et al., 2013). These include depth of coverage, quality scores, number of reads supporting the variant call, location in pseudogenes or genes with pseudogenes, location within paralogous gene families, presence in low-complexity regions, presence within confirmed genomic duplications, presence in regions that are overly covered (for instance using a standard threshold of 3× average regional chromosomal coverage), and ability of the sequence containing the variant to map to more than one genomic region, as predicted by the GEM algorithm (Lee and Schatz, 2012). Variants that are located in genomic regions that have paralogous counterparts within the genome are more likely to be subject to mapping problems. Similarly, variants that are found within low-complexity regions or simple sequence repeats are more likely to represent sequencing errors or to be subject to mapping issues (Wetzel et al., 2011; Lee and Schatz, 2012). In addition to these external annotation sources, the importance of building an in-house dataset of variants that have previously been annotated as likely sequencing or mapping errors cannot be overestimated. Providing the ability to share such datasets would be a boon to labs performing these analyses.
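The kinds of evidence listed above can be combined into simple flags that mark a call as a likely sequencing or mapping error. The sketch below is illustrative only: the `CallEvidence` fields and the `likely_error_flags` function are hypothetical names, and every threshold except the 3× regional-coverage ratio mentioned in the text is an assumed placeholder that a laboratory would need to set and validate for its own platform.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CallEvidence:
    depth: int                   # reads covering the variant position
    alt_reads: int               # reads supporting the variant allele
    qual: float                  # Phred-scaled call quality
    regional_mean_depth: float   # average coverage of the surrounding region
    in_low_complexity: bool      # lies in a low-complexity or repeat region
    maps_to_multiple_loci: bool  # flanking sequence maps to more than one locus


def likely_error_flags(
    ev: CallEvidence,
    min_depth: int = 10,         # assumed placeholder threshold
    min_alt_reads: int = 3,      # assumed placeholder threshold
    min_qual: float = 30.0,      # assumed placeholder threshold
    max_cov_ratio: float = 3.0,  # the 3x regional-coverage rule from the text
) -> List[str]:
    """Return the reasons a call might be an error; an empty list means none."""
    flags = []
    if ev.depth < min_depth:
        flags.append("low_depth")
    if ev.alt_reads < min_alt_reads:
        flags.append("few_supporting_reads")
    if ev.qual < min_qual:
        flags.append("low_quality")
    if ev.depth > max_cov_ratio * ev.regional_mean_depth:
        flags.append("excess_coverage")
    if ev.in_low_complexity:
        flags.append("low_complexity")
    if ev.maps_to_multiple_loci:
        flags.append("multi_mapping")
    return flags


suspect = CallEvidence(200, 2, 45.0, 40.0, True, False)
clean = CallEvidence(40, 18, 60.0, 38.0, False, False)
suspect_flags = likely_error_flags(suspect)
clean_flags = likely_error_flags(clean)
```

Returning the list of reasons, rather than a single pass or fail value, lets a reviewer see why a call was deprioritized and lets a flagged variant be added to an in-house dataset of recurrent errors.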

Class 2: Annotations supporting determination of a known clinical impact

Clearly, it is essential to know whether a candidate variant has been reported as causative or otherwise associated with a particular disease phenotype. The chromosome and genomic coordinates, together with the nucleotide change, can be used to associate variants that are found in the newly sequenced genome with previously identified variants. Many sources of genotype/disease phenotype associations or causal relationships exist. Some of these sources provide association at the gene level (i.e., associating a gene or other genomic feature with a particular disease phenotype), while others provide data on the specific variants that have been associated with disease. In many if not most labs, disease associations are annotated based on presence in the Human Gene Mutation Database (HGMD; Stenson et al., 2003; Cooper et al., 2006; http://www.hgmd.org). HGMD represents the most comprehensive collection of curated variant-level data relating variants in particular genes with human inherited disease (Cooper et al., 2006). Many different classes of variants are included in this resource, including SNVs, small deletions and insertions, indels, and triplet repeat expansions, as well as larger SVs such as deletions, insertions, duplications, and complex rearrangements. HGMD incorporates variants in nuclear genomic regions; it does not generally encompass mitochondrial genome variants (which are handled by MITOMAP; Kogelnik et al., 1997, 1998) or somatic mutations (which are perhaps best handled by COSMIC; Forbes et al., 2011; also see UNIT 10.11).

HGMD is not the only available variant-level disease database; the Human Variome Project (HVP) compendium of databases stores, integrates, and disseminates variant data collected worldwide (Ring et al., 2006; Howard et al., 2010; Patrinos et al., 2012). These resources are often organized as locus-specific mutation databases (LSMDs) with varied types and levels of curation and support. Funding models for these databases also differ; many are maintained to a high standard and will be more comprehensive and up to date on recent literature for a specific gene or disease than any of the other sources discussed above, but some suffer from data-curation problems that are discussed in a little more detail below. In some cases, no bulk downloads are available for the variant data at these sites, making it difficult to incorporate these data into the tools that are required for clinical whole-genome analysis. Many of these resources have been compiled by the Human Genome Variation Society (HGVS), and a maintained listing is available at the HGVS Web site: http://www.hgvs.org/dblist/glsdb.html. There has been a more recent movement to use the Leiden Open (source) Variation Database (LOVD) structure to support further development of these locus-specific databases (Fokkema et al., 2011). Other variant data resources include MutaDATABASE (Bale et al., 2011) and dbVar and DGVa, from the NCBI and EBI, respectively (Sneddon and Church, 2012; Lappalainen et al., 2013). MutaDATABASE aims not only to list variants identified through sequencing but also to provide clinical information for the sequenced individuals. DGVa and dbVar share a data model and provide access to study, genomic region, and variant call data (Lappalainen et al., 2013). These systems process direct submissions and make the data publicly available online.
Another long-standing source of genotype-to-phenotype associations or causal relationships for specific variants is the Online Mendelian Inheritance in Man database (OMIM; UNIT 9.13; Amberger et al., 2009, 2011), a manually curated catalog of human genes, genetic disorders, and trait information. For most genes, OMIM provides data on a limited set of specific causative variants, shown in the allelic variants section of the gene reports. In addition to causative variants for Mendelian disorders, OMIM provides limited data on variants with positive correlations with particular supposed non-Mendelian disorders. OMIM, however, is not comprehensive at the variant level. OMIM data have undergone a high level of manual curation, with experts reading through the original papers. Clearly, this type of data is invaluable.

However, when using any of these data sources, the clinical genomics expert has to bear in mind that there are many issues with the accuracy or clinical utility of the data presented. A major issue is errors in the original publications that made the causal association between a variant and a disease. When these are researched in depth, a significant proportion are found to be either wrong due to inappropriate selection of controls (for example, where the patients and controls come from notably distinct ethnic groups or otherwise from distinct subpopulations) or possibly correct but lacking a clinically appropriate level of evidence (Tong et al., 2011). These incorrectly reported data are then propagated into the mutation database, in some cases with no additional review. Sometimes data will have been retracted or otherwise updated; these updates do not necessarily make it into the databases if the record is already listed as curated. The actual process of entering data into these data sources is often manual or semi-manual, on occasion leading to typos or other data-entry errors (e.g., use of an incorrect disease term). These databases do not have the resources to regularly re-curate all the associations, making it important to understand the annotation data-quality issues. Before any findings are reported clinically, the genomics expert must therefore go back and verify all associations, looking for issues with the data source. Most of these datasets were not initially intended for clinical use, and thus standards for clinical use may not be met.
Looking forward, there is a clear need for re-curation of the associations in the literature and existing data sources; high-throughput analysis of large volumes of WGS and WES data in a clinical setting will require investment in improving the accuracy of these data sources to the point where they are deemed appropriate for high-throughput clinical use.
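The coordinate-based matching described at the start of this class, in which chromosome, position, and nucleotide change are used to associate a new call with a previously reported variant, can be sketched as a keyed lookup. Everything in the example is an assumption for illustration: the table contents are invented, `normalize_key` and `lookup_known_variant` are hypothetical helpers, and the `verified` field simply models the requirement that an association be checked against its source before clinical reporting. The sketch also assumes that the calls and the database use the same reference genome build.

```python
from typing import Dict, Optional, Tuple

VariantKey = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)


def normalize_key(chrom: str, pos: int, ref: str, alt: str) -> VariantKey:
    """Build a comparable key; strips a leading 'chr' and upper-cases alleles."""
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    return (chrom, pos, ref.upper(), alt.upper())


# Toy stand-in for a curated variant-disease table; the entry is invented.
KNOWN_VARIANTS: Dict[VariantKey, dict] = {
    normalize_key("7", 1000, "G", "A"): {
        "disease": "example disorder",
        "source": "hypothetical_db",
        "verified": False,  # association not yet re-checked against the literature
    },
}


def lookup_known_variant(chrom: str, pos: int, ref: str, alt: str) -> Optional[dict]:
    """Return the stored association for an exact variant match, else None."""
    return KNOWN_VARIANTS.get(normalize_key(chrom, pos, ref, alt))


hit = lookup_known_variant("chr7", 1000, "g", "a")
miss = lookup_known_variant("7", 1000, "G", "T")
```

An exact match on position alone is not sufficient, because a different nucleotide change at the same position may have a different clinical impact; this is why the alternate allele is part of the key.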

Class 3: Annotations supporting determination of the potential functional impact of a variant

If a variant has not previously been shown to be causal for a disease, there are additional annotations that can provide evidence that the variant is likely to negatively impact the function of the gene or other genomic feature within which it is contained. In some cases, the effect of the variant can clearly be defined as something likely to alter correct functioning. For example, variants of a type that would be assumed to alter protein function (for example, those leading to formation of premature stop codons, read-through of the correct stop codon, alteration of the start codon, frameshifts during translation to amino acids, and alteration of canonical splice sites) can be annotated as such and used during clinical interpretation. It is relatively easy to predict the effect of variants that alter known splice sites (Sheth et al., 2006; Xiong et al., 2009), but it is also important to know that a variant lies close to a splice site, since such variants can affect the machinery of splicing, leading to disease (Wappenschmidt et al., 2012). Deep intronic variants can also be associated with disease, but these are much more difficult to identify (Wappenschmidt et al., 2012). In addition, splice-site prediction tools can be used to determine whether a variant leads to formation of a novel splice site, potentially altering the protein sufficiently to lead to disease (Wappenschmidt et al., 2012), though these methods are currently error prone and unlikely to be suitable for use in a clinical setting. Variants that are within protein-coding regions and not accounted for in the categories above can be classified as synonymous or nonsynonymous changes, with nonsynonymous changes further characterized based on changes in the biophysical properties of the reference and variant amino acid. Replacement of an amino acid with a similar amino acid is likely to have a lesser impact than a change involving a significant change in size, charge, hydrophobicity, etc.
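The simplest of the consequence types described above can be assigned by translating the reference and variant codons with the standard genetic code. The following sketch is a simplified illustration under stated assumptions: `classify_codon_change` is a hypothetical helper that handles only single-codon substitutions on the coding strand, and it ignores frameshifts, splice-site changes, and transcript selection, all of which a real annotation tool must handle.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon: str, alt_codon: str,
                          is_start_codon: bool = False) -> str:
    """Classify a substitution within one codon (hypothetical helper)."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if is_start_codon and ref_aa != alt_aa:
        return "start_lost"       # alteration of the start codon
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop_gained"      # premature stop codon
    if ref_aa == "*":
        return "stop_lost"        # read-through of the correct stop codon
    return "nonsynonymous"


consequences = [
    classify_codon_change("TGG", "TGA"),                       # Trp -> stop
    classify_codon_change("TAA", "CAA"),                       # stop -> Gln
    classify_codon_change("CTT", "CTC"),                       # Leu -> Leu
    classify_codon_change("GAA", "GTA"),                       # Glu -> Val
    classify_codon_change("ATG", "ACG", is_start_codon=True),  # Met -> Thr
]
```

A nonsynonymous result would then be passed to a further step that compares the biophysical properties of the two amino acids; that comparison is not shown here.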
These changes can be further analyzed for potential impact on protein structure or function using tools such as PolyPhen (UNIT 7.20), MutationTaster (Schwarz et al., 2010), Condel (González-Pérez and López-Bigas, 2011), and SIFT (Kumar et al., 2009), which use sequence-conservation data (and sometimes structural information) to predict the effect of the change on protein function. It is worth noting here that all the algorithms that predict the impact of a non-synonymous amino acid change have significant false-positive and false-negative rates, as determined through analysis of their ability to accurately predict deleterious changes that have been confirmed clinically to be causative for disease (Wang, L.L. et al., 2009; Dorfman et al., 2010). It should also be noted that these algorithms share features, making it likely that they will produce similar predictions in many cases. Agreement among the algorithms does not necessarily mean that the predictions are correct if they are all affected by the same bias.

Variants within important functional regions of the genome that are not genic can also be annotated through localization of the variant on the genome. Examples include annotations to known promoter and enhancer regions and specific binding sites, as well as certain classes of repetitive elements with well-defined functions (Finnis et al., 2005; ENCODE Project Consortium, 2007). Prediction of the impact of a previously unseen variant in these non-genic regions remains a more difficult task than prediction of the likely effect of a premature-stop-forming variant or damaging non-synonymous change; a huge amount of work remains to be carried out to improve our ability to accurately predict the impact of the majority of variants in such regions. One example of an annotation that can be employed is the conservation score, produced by algorithms that assign a score to each aligned nucleotide calculated across (supposed) orthologous nucleotides from multiple species. The assumption here is that conservation of the nucleotide is related to the functional importance of it or its resultant amino acid, with higher importance leading to maintenance of the encoded feature without change over time. Some of the more commonly used algorithms are PhastCons and PhyloP (Dingel et al., 2008; Pollard et al., 2010). It is worthwhile noting that prioritization or deprioritization of a variant based on such a conservation score is only appropriate if the score has been calculated across species where the region under analysis performs the same role.
Comparison of conservation scores in cases where, for example, the orthologous protein or domain has different roles in the different species studied will likely lead to misleading conclusions. As our understanding of the functional significance and role of genomic regions without genes increases, we will be better able to predict the functional impact of variants in these regions (ENCODE Project Consortium, 2007). Currently, however, we can predict the effect of few intergenic or intronic variants. Even where variants are included within elements with known functions, we do not know the significance of all of the components of the element; thus, few variants even in these regions will have the level of supporting evidence required as the basis for a clinical decision. Variants in most positions within the human genome are currently clinically uninterpretable based on existing knowledge alone.

Class 4: Annotations supporting identification of informative phenotype associations through identification of the function of the genomic feature in which the variant is located


In class 3, the annotations produced seek to determine whether a previously unknown variant might impact the function of a gene or other genomic element. Determining that a previously unknown variant is deleterious does not provide evidence that the variant is causally associated with the disease under analysis. To make that association, interpretation must consider the functions of the genomic element. A variety of annotations related to known functions can be assigned to a variant at the level of its gene or transcript. These annotations, in combination with damage-likelihood predictions, allow evaluation of the relevance of the variant to the patient’s phenotype and to the processes that might contribute to the observed phenotype. This functional information can be extracted from a variety of sources including RefSeq (http://www.ncbi.nlm.nih.gov/RefSeq/), GeneCards (Safran et al., 2010; Belinky et al., 2013), and GeneReviews (Pagon et al., 2002; Pagon, 2006), as well as through use of gene-level associations from some of the variant disease databases including OMIM (Amberger et al., 2011) and HGMD BioBase (Stenson et al., 2009). In addition to annotations derived from databases that maintain functional data for particular genomic features, functional data can also be extracted in an ordered manner through consideration of ontology terms that have been mapped to these features (Ashburner and Lewis, 2002; Thomas et al., 2007; Osborne et al., 2009; Costanzo et al., 2011). Structured controlled vocabularies and ontologies provide an alternative means of recording phenotype associations, and a number of different ontologies have been developed (Ashburner and Lewis, 2002; Thomas et al., 2007; Osborne et al., 2009; Costanzo et al., 2011). A review of these different entities is beyond the scope of this unit; the Open Biological Ontologies (OBO) provides organization for many life-sciences ontologies and provides information on these various ontologies (Smith, B. et al., 2007).

In general, these databases or resources have collated functional information either by manual or semi-manual expert curation of functional associations extracted from the literature or through application of automated functional annotation pipelines (as reviewed in Valencia, 2005). With either approach, there is an attempt to determine and present the reliability of the data, often through use of evidence codes. It is important to acknowledge, particularly when implementing in a clinical setting, that issues with the data relating to errors in initial functional annotation, inappropriate transfer of annotations among species, or a lack or delay in updating functional data from experiments can lead to omission or error. In one study of the quality of functional annotation for a set of well-characterized enzyme families, there was significant disagreement between data sources as well as significant misannotation of the molecular function of family members (Schnoes et al., 2009). There are many possible causes of such errors, including errors in the initial publications, misinterpretation of the data by curators, and paralog-ortholog misclassifications (Bork and Bairoch, 1996; Smith, T.F. and Zhang, 1997; Galperin and Koonin, 2000, 2001; Sasson et al., 2006). Furthermore, the process of functional annotation does not necessarily take into account divergent functions of orthologous proteins in different species or presence of alternative pathways in different species (Zhang, Z. et al., 2004; Milenkovic et al., 2008; Kurzweil et al., 2009; Frost et al., 2012). Because of all of these potential issues, the potential functional impact of a presumed deleterious change needs to be carefully considered.
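One way to use the evidence codes mentioned above is to separate functional annotations backed by direct experiments from those that were inferred computationally or transferred from other species. The sketch below assumes Gene Ontology style evidence codes, in which EXP, IDA, IPI, IMP, IGI, and IEP denote experimental evidence and IEA denotes an unreviewed electronic annotation; the gene names, terms, and the `partition_by_evidence` helper are invented for illustration.

```python
from typing import Dict, List, NamedTuple

# Gene Ontology evidence codes that denote direct experimental support.
EXPERIMENTAL_CODES = frozenset({"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"})


class FunctionalAnnotation(NamedTuple):
    gene: str
    term: str
    evidence_code: str


def partition_by_evidence(
    annotations: List[FunctionalAnnotation],
) -> Dict[str, List[FunctionalAnnotation]]:
    """Split annotations into experimentally supported and all others."""
    result: Dict[str, List[FunctionalAnnotation]] = {"experimental": [], "other": []}
    for annotation in annotations:
        if annotation.evidence_code in EXPERIMENTAL_CODES:
            result["experimental"].append(annotation)
        else:
            result["other"].append(annotation)
    return result


# Invented annotations for two hypothetical genes.
annotations = [
    FunctionalAnnotation("GENE_A", "ion transport", "IDA"),
    FunctionalAnnotation("GENE_A", "kinase activity", "IEA"),
    FunctionalAnnotation("GENE_B", "DNA repair", "IMP"),
    FunctionalAnnotation("GENE_B", "DNA binding", "ISS"),
]
groups = partition_by_evidence(annotations)
```

Annotations in the second group are not necessarily wrong, but they are the ones most exposed to the cross-species transfer and misclassification problems described above, so they warrant closer manual review.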

Class 5: Annotations supporting estimation of variant frequency

Allele frequencies are extremely useful in prioritization of variants (e.g., Ng, S.B. et al., 2009, 2010; Choi et al., 2009; Li, Y. et al., 2010; Worthey et al., 2011; O’Roak et al., 2012; Stubbs et al., 2012; Shen, H. et al., 2013). This is because, in general, the individual performing the analysis and interpretation has an idea of how frequently the allele is expected to be found based on the occurrence of the phenotype. In the case of rare or ultra-rare diseases, prioritization is generally carried out such that all variants that are found at higher allele frequency than expected in population samples are excluded. The assumption made is that if the disease were caused by a frequent variant, the disease would be more common (Ng, S.B. et al., 2009, 2010; Lupski et al., 2010; Worthey et al., 2011; Chen, Y.Z. et al., 2012; Li, M. et al., 2013). Such prioritizations should take into account both the mode of inheritance and the degree of penetrance, if this is known or has been estimated (Kenna et al., 2013; Reis et al., 2013; Wirth et al., 2013). Application of this type of deprioritization can exclude the majority of variants under consideration when a low-frequency threshold is applied, and this is thus an important step in reducing the time and work required to reach diagnosis in these types of analysis. Prioritization in cases where the disease is not expected to be associated with a rare allele can still make use of allele frequency data, although a much higher threshold does not assist in the deprioritization process to the same degree. It is interesting to note, however, that in an ever-increasing number of cases, analysis using a rare allele frequency prioritization step has successfully identified rare familial variants associated with what were previously described as complex or non-monogenic diseases (O’Roak et al., 2011; Bras et al., 2012; Park et al., 2012; Wagner, 2013; Yu et al., 2013). Allele frequency annotations are generally derived from large publicly available variant repositories, with dbSNP (Sherry et al., 2001), the Exome Variant Server of the National Heart, Lung, and Blood Institute (NHLBI) Exome Sequencing Project (Fu et al., 2013), and the 1000 Genomes Project dataset (Clarke et al., 2012) being the most common sources of these data. In addition to these datasets, many labs also consolidate the many individual genomes or small groups of genomes that have been published or otherwise released for public consumption into their allele-frequency datasets.
Each of these datasets has strengths and weaknesses that should be considered when performing a clinical analysis. For example, although the dbSNP database is the most comprehensive in terms of the types of individuals sequenced, the data in this database have been accumulated through submission by researchers over a period of many years, with little consistent QC among datasets. In addition, some of the data pre-date the availability of the human genome reference. Because of these issues and the volume of submissions, it is widely acknowledged that dbSNP contains misannotations or errors relating both to whether the variant occurs as depicted and to the calculated frequency. Estimates of the false-positive rate in dbSNP have ranged between 2% and 36%, depending on the analysis (Reich et al., 2003; Mitchell et al., 2004), although recent changes at dbSNP have improved accuracy.

The Exome Variant Server (EVS) is a very powerful resource, in particular because, as a single project, the data generation, collection, analysis, and dissemination were performed systematically (Fu et al., 2013). All exomes were sequenced to high average coverage (>100× depth), and variant calls derived from simultaneous, multi-sample genotyping across all samples provided the largest high-coverage exome dataset available. However, the hosted data are clearly not representative of a “healthy” population; the individuals sequenced were in many cases selected because they suffered from an NHLBI disorder of interest. The EVS includes samples sequenced from studies of heart, lung, and blood disorders, and although phenotype data from the individuals sequenced were gathered, it is not possible at this time to access this information (Fu et al., 2013). This dataset is likely to have over-representation of variants that are associated with or causal for these phenotypes, leading to inflation of the allele frequency when compared to the numbers in a “healthy” dataset. This dataset must therefore be used carefully when the phenotype of the individual under analysis overlaps with one of these areas.

The 1000 Genomes Project sequenced many individuals at relatively low coverage to provide data on variants present in these individuals at frequencies around or greater than 1% (1000 Genomes Project Consortium et al., 2010). It also includes a smaller number of individuals sequenced to a higher depth of coverage (∼50× to 60×). This project specifically set out to provide access to data from diverse ethnicities.
Both dbSNP- and EVS-hosted datasets are likely to under-represent some ethnic minorities, potentially leading to issues with accurate frequency representation. One limitation of the low-coverage, pooled strategy used by the 1000 Genomes Project is that rarer alleles are less likely to be accurately reported because they will be identified in only a small subset of samples. Clearly then, there is no one best dataset to use for determination of allele frequency; it is best to consider the frequency of the variant in many independent datasets, but before a variant is deprioritized based on these data, there has to be consideration of likely biases for each allele-frequency data source discussed above.

Perhaps the most useful dataset for querying of allele frequency is a locally gathered dataset; most clinical laboratories gather this type of data for this purpose. Not only is the ethnic diversity of such a dataset more likely to be representative of the local population, but at least some metadata relating to the patient’s phenotype and reported ethnicity are often available for querying. This allows the clinical genomics expert to exclude variants from individuals who may have an overlapping phenotype when calculating allele frequencies for a “normal” non-disease population. These data are seldom available in the larger public databases, and if available, they are generally limited. Of course, internal datasets would initially be limited in size, leading to correspondingly skewed allele frequencies. The importance of access to allele-frequency data from as many samples as possible for groups undertaking these types of WES- or WGS-based analysis (and also other molecular diagnostics) cannot be overstated. It is critical that such teams define methods, tools, and policies that support sharing not only of the genotype data but also the associated phenotype data. Many separate groups are working on this problem with somewhat diverse solutions (Smith, T.D. et al., 2012).
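The frequency-based deprioritization discussed in this class can be sketched as two small steps: taking the highest frequency reported for a variant across several independent datasets, and computing a local allele frequency that leaves out individuals with an overlapping phenotype. The function names, dataset labels, sample identifiers, and the 1% threshold are all assumptions made for illustration; an appropriate threshold depends on the disease prevalence, mode of inheritance, and penetrance, and the genotype counts assume a diploid autosomal site.

```python
from typing import Dict, Optional, Set


def max_observed_frequency(freqs: Dict[str, Optional[float]]) -> Optional[float]:
    """Highest allele frequency across datasets; None if the variant is absent from all."""
    observed = [f for f in freqs.values() if f is not None]
    return max(observed) if observed else None


def passes_rare_filter(freqs: Dict[str, Optional[float]],
                       threshold: float = 0.01) -> bool:
    """Keep a variant if no dataset reports it above the (assumed) threshold."""
    top = max_observed_frequency(freqs)
    return top is None or top <= threshold


def local_allele_frequency(alt_allele_counts: Dict[str, int],
                           exclude_ids: Set[str]) -> Optional[float]:
    """Allele frequency in a local cohort, skipping excluded samples.

    alt_allele_counts maps a sample id to 0, 1, or 2 copies of the variant allele.
    """
    kept = [n for sample, n in alt_allele_counts.items() if sample not in exclude_ids]
    if not kept:
        return None
    return sum(kept) / (2 * len(kept))


rare = passes_rare_filter({"dbSNP": None, "EVS": 0.002, "1000G": 0.004})
common = passes_rare_filter({"dbSNP": 0.2, "EVS": None, "1000G": 0.15})
cohort = {"s1": 0, "s2": 1, "s3": 2, "s4": 0}
freq_all = local_allele_frequency(cohort, set())
freq_filtered = local_allele_frequency(cohort, {"s3"})  # s3 has an overlapping phenotype
```

Using the maximum across datasets is a conservative choice for exclusion, but it also inherits any inflation present in a single source, which is why the biases of each dataset discussed above still need to be weighed before a variant is dropped.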

CONTRASTING RESEARCH AND CLINICAL VARIANT ANALYSES


One of the most critical differentiators between the analyses discussed above when performed in a research setting and similar analyses performed in a clinical setting is that the process must be validated and absolutely defined when performed clinically. In a research setting, the analysis is often performed by a genomics expert using the latest tools and data sources available, in a semiautomated process with many manual steps (Richards et al., 2008). The genomics expert has the ability to use the most up-to-date version of each tool. In a clinical setting, there is obviously the desire to use the best variant calling and annotation tools, but sometimes this is at odds with the need to have the process locked down following clinical validation (Richards et al., 2008). Making use of collections of scripts that have to be run manually is not appropriate in a clinical setting due to the propensity for human error. In the clinic, the process is an end-to-end pipeline that is fixed and not subject to any alteration until the next validated release (Richards et al., 2008).
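One simple way to enforce such a lock-down is to record the component versions of the validated release in a read-only manifest and refuse to run if the installed environment differs. The component names and version strings below are invented placeholders, and `check_pipeline` is a hypothetical helper; this is a sketch of the idea rather than a description of any particular laboratory's system.

```python
from types import MappingProxyType
from typing import Dict, List

# Read-only manifest of the (hypothetical) clinically validated release.
VALIDATED_RELEASE = MappingProxyType({
    "aligner": "1.2.0",
    "variant_caller": "3.4.1",
    "annotation_db": "2013-06",
})


def check_pipeline(installed: Dict[str, str]) -> List[str]:
    """List every difference between the installed and validated components."""
    problems = []
    for component, expected in VALIDATED_RELEASE.items():
        found = installed.get(component)
        if found != expected:
            problems.append(f"{component}: expected {expected}, found {found}")
    for extra in sorted(set(installed) - set(VALIDATED_RELEASE)):
        problems.append(f"{extra}: not part of the validated release")
    return problems


matching = check_pipeline(dict(VALIDATED_RELEASE))
drifted = check_pipeline({
    "aligner": "1.2.0",
    "variant_caller": "3.5.0",   # newer than the validated version
    "annotation_db": "2013-06",
})
```

A clinical run would proceed only when the returned list is empty; adopting a newer tool version then requires a new validated release rather than an ad hoc update.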

INTERPRETATION

At the end of the variant annotation process outlined above, each of the variants identified in the secondary-analysis phase should be sufficiently annotated for a clinical genomics expert to interpret the genome data. This interpretation phase involves determining which variants can be clinically reported as causative in the specific case under analysis. It requires consideration of annotations that highlight the known or likely effect of a variant on the gene or other genomic feature within which it lies (allowing the statement that the variant impacts function), as well as annotations relating to the known function of the feature (allowing the statement that perturbation of the genomic feature is likely to be causative for a particular phenotype). In this way, particular variants can be interpreted as impacting the function of a genomic feature whose role, when perturbed, is likely to give rise to a particular phenotype. This process allows each variant to be placed into a category for clinical reporting [following the American College of Medical Genetics (ACMG) guidelines (Richards et al., 2007; Green et al., 2013)]. The variants are then annotated as:

1. Previously reported and a recognized cause of the disorder
2. Previously unreported and of the type which is expected to cause the disorder
3. Previously unreported and of the type which may or may not be causative of the disorder
4. Previously unreported and probably not causative of disease
5. Previously reported and a recognized neutral variant

The specific process for clinical interpretation of a WGS- or WES-derived dataset is beyond the scope of this unit. Specific discussion of this implementation is provided in UNIT 9.22, and a number of articles and excellent reviews detailing this topic have been published (Worthey et al., 2011; Choi et al., 2009; Gonzaga-Jauregui et al., 2012; Green et al., 2012; Biesecker and Peay, 2013; Korf and Rehm, 2013).
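As a bookkeeping aid, the five reporting categories listed in this section can be represented as an enumeration so that interpreted variants are grouped consistently in a report. The Python sketch below is an assumption-laden illustration: the class, function, and variant identifiers are hypothetical, and the assignment of a category to a variant remains an expert judgment that the code does not attempt to make.

```python
from collections import defaultdict
from enum import IntEnum
from typing import Dict, Iterable, List, Tuple


class ReportingCategory(IntEnum):
    """The five reporting categories listed in this section (names are hypothetical)."""
    KNOWN_CAUSATIVE = 1
    NOVEL_EXPECTED_CAUSATIVE = 2
    NOVEL_UNCERTAIN = 3
    NOVEL_PROBABLY_NOT_CAUSATIVE = 4
    KNOWN_NEUTRAL = 5


# Wording copied from the list in the text.
DESCRIPTIONS = {
    ReportingCategory.KNOWN_CAUSATIVE:
        "Previously reported and a recognized cause of the disorder",
    ReportingCategory.NOVEL_EXPECTED_CAUSATIVE:
        "Previously unreported and of the type which is expected to cause the disorder",
    ReportingCategory.NOVEL_UNCERTAIN:
        "Previously unreported and of the type which may or may not be causative of the disorder",
    ReportingCategory.NOVEL_PROBABLY_NOT_CAUSATIVE:
        "Previously unreported and probably not causative of disease",
    ReportingCategory.KNOWN_NEUTRAL:
        "Previously reported and a recognized neutral variant",
}


def group_by_category(
    assignments: Iterable[Tuple[str, int]]
) -> Dict[ReportingCategory, List[str]]:
    """Group expert-assigned (variant_id, category) pairs, ordered by category number."""
    grouped = defaultdict(list)
    for variant_id, category in assignments:
        # ReportingCategory(...) raises ValueError for numbers outside 1-5.
        grouped[ReportingCategory(category)].append(variant_id)
    return {category: grouped[category] for category in sorted(grouped)}


# Hypothetical variant identifiers, for illustration only.
report = group_by_category([("var_b", 3), ("var_a", 1), ("var_c", 3)])
for category, variants in report.items():
    print(f"{int(category)}. {DESCRIPTIONS[category]}: {', '.join(variants)}")
```

Using a closed enumeration rather than free-text labels means a mistyped category is rejected at the point of entry instead of appearing in a clinical report.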

CONCLUSIONS

Interpretation of genetic findings from WES or WGS has been used to identify the genetic underpinnings of disease (Richards et al., 2008; Ng, S.B. et al., 2009, 2010; Lupski et al., 2010; Worthey et al., 2011; Chen, Y.Z. et al., 2012; Jacob et al., 2013; Li, M. et al., 2013). The major bottleneck in identification of causal variants is interpretation: the process of differentiating between sequencing and mapping errors, polymorphisms, and disease-associated or causative variants. At present, there is no single solution for any of the steps in this process, and this unit has therefore provided an overview rather than describing a specific protocol. Importantly, few of the existing analysis tools are clinically validated. Causal variants in novel genes are being discovered at an ever-increasing rate through these approaches. The cost of identifying these associations is significantly lower than was attainable with pre-millennium molecular technologies. In addition to significantly altering the pace and cost of discovery, the approaches discussed here also provide new perspectives on solving genetic problems that would have been difficult to tackle with conventional approaches.

LITERATURE CITED

1000 Genomes Project Consortium; Abecasis, G.R., Altshuler, D., Auton, A., Brooks, L.D., Durbin, R.M., Gibbs, R.A., Hurles, M.E., and McVean, G.A. 2010. A map of human genome variation from population-scale sequencing. Nature 467:1061-1073.

Ajay, S.S., Parker, S.C., Abaan, H.O., Fajardo, K.V., and Margulies, E.H. 2011. Accurate and comprehensive sequencing of personal genomes. Genome Res. 21:1498-1505.

Amberger, J., Bocchini, C.A., Scott, A.F., and Hamosh, A. 2009. McKusick’s Online Mendelian Inheritance in Man (OMIM). Nucleic Acids Res. 37:D793-D796.

Amberger, J., Bocchini, C., and Hamosh, A. 2011. A new face and new challenges for Online Mendelian Inheritance in Man (OMIM®). Hum. Mutat. 32:564-567.

Ashburner, M. and Lewis, S. 2002. On ontologies for biologists: The Gene Ontology–untangling the web. Novartis Found Symp. 247:66-80; discussion 80-83, 84-90, 244-252.

Bailey, J.A., Yavor, A.M., Massa, H.F., Trask, B.J., and Eichler, E.E. 2001. Segmental duplications: Organization and impact within the current human genome project assembly. Genome Res. 11:1005-1017.

Bainbridge, M.N., Wiszniewski, W., Murdock, D.R., Friedman, J., Gonzaga-Jauregui, C., Newsham, I., Reid, J.G., Fink, J.K., Morgan, M.B., Gingras, M.C., Muzny, D.M., Hoang, L.D., Yousaf, S., Lupski, J.R., and Gibbs, R.A. 2011. Whole-genome sequencing for optimized patient management. Sci. Transl. Med. 3:87re3.

Bainbridge, M.N., Hu, H., Muzny, D.M., Musante, L., Lupski, J.R., Graham, B.H., Chen, W., Gripp, K.W., Jenny, K., Wienker, T.F., Yang, Y., Sutton, V.R., Gibbs, R.A., and Ropers, H.H. 2013. De novo truncating mutations in ASXL3 are associated with a novel clinical phenotype with similarities to Bohring-Opitz syndrome. Genome Med. 5:11.

Baker, S., Joecker, A., Church, G., Snyder, M., West, J., Salzberg, S., Worthey, E., Smith, T., Wang, J., and Reid, J.G. 2012. Genome interpretation and assembly-recent progress and next steps. Nat. Biotechnol. 30:1081-1083.

Bale, S., Devisscher, M., Van Criekinge, W., Rehm, H.L., Decouttere, F., Nussbaum, R., Dunnen, J.T., and Willems, P. 2011. MutaDATABASE: a centralized and standardized DNA variation database. Nat. Biotechnol. 29:117-118.

Bauer, D.C. 2011. Variant calling comparison CASAVA1.8 and GATK. Nature Precedings. http://precedings.nature.com/documents/6107/version/1.

Becker, J., Semler, O., Gilissen, C., Li, Y., Bolz, H.J., Giunta, C., Bergmann, C., Rohrbach, M., Koerber, F., Zimmermann, K., de Vries, P., Wirth, B., Schoenau, E., Wollnik, B., Veltman, J.A., Hoischen, A., and Netzer, C. 2011. Exome sequencing identifies truncating mutations in human SERPINF1 in autosomal-recessive osteogenesis imperfecta. Am. J. Hum. Genet. 88:362-371.

Belinky, F., Bahir, I., Stelzer, G., Zimmerman, S., Rosen, N., Nativ, N., Dalah, I., Iny Stein, T., Rappaport, N., Mituyama, T., Safran, M., and Lancet, D. 2013. Non-redundant compendium of human ncRNA genes in GeneCards. Bioinformatics 29:255-261.

Bentley, D.R., Balasubramanian, S., Swerdlow, H.P., Smith, G.P., Milton, J., Brown, C.G., Hall, K.P., Evers, D.J., Barnes, C.L., Bignell, H.R., Boutell, J.M., Bryant, J., et al., 2008. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456:53-59.

Berg, J.S., Khoury, M.J., and Evans, J.P. 2011. Deploying whole genome sequencing in clinical practice and public health: Meeting the challenge one bin at a time. Genet. Med. 13:499-504.

Bertelli, C. and Greub, G. 2013. Rapid bacterial genome sequencing: Methods and applications in clinical microbiology. Clin. Microbiol. Infect. doi: 10.1111/1469-0691.12217.

Bick, D. and Dimmock, D. 2011. Whole exome and whole genome sequencing. Curr. Opin. Pediatr. 23:594-600.

Biesecker, B.B. and Peay, H.L. 2013. Genomic sequencing for psychiatric disorders: Promise and challenge. Int. J. Neuropsychopharmacol. 16:1667-1672.

Bilguvar, K., Ozturk, A.K., Louvi, A., Kwan, K.Y., Choi, M., Tatli, B., Yalnizoğlu, D., Tüysüz, B., Cağlayan, A.O., Gökben, S., Kaymakçalan, H., Barak, T., Bakircioğlu, M., Yasuno, K., Ho, W., Sanders, S., Zhu, Y., Yilmaz, S., Dinçer, A., Johnson, M.H., Bronen, R.A., Koçer, N., Per, H., Mane, S., Pamir, M.N., Yalçinkaya, C., Kumandaş, S., Topçu, M., Ozmen, M., Sestan, N., Lifton, R.P., State, M.W., and Günel, M. 2010. Whole-exome sequencing identifies recessive WDR62 mutations in severe brain malformations. Nature 467:207-210.

Borge, K.S., Borresen-Dale, A.L., and Lingaas, F. 2011. Identification of genetic variation in 11 candidate genes of canine mammary tumour. Vet. Comp. Oncol. 9:241-250.

Bork, P. and Bairoch, A. 1996. Go hunting in sequence databases but watch out for the traps. Trends Genet. 12:425-427.

Bras, J., Guerreiro, R., and Hardy, J. 2012. Use of next-generation sequencing and other whole-genome strategies to dissect neurological disease. Nat. Rev. Neurosci. 13:453-464.

Butler, J., MacCallum, I., Kleber, M., Shlyakhter, I.A., Belmonte, M.K., Lander, E.S., Nusbaum, C., and Jaffe, D.B. 2008. ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Res. 18:810-820.

Carneiro, M.O., Russ, C., Ross, M.G., Gabriel, S.B., Nusbaum, C., and DePristo, M.A. 2012a. Pacific Biosciences sequencing technology for genotyping and variation discovery in human data. BMC Genomics 13:375.

Carnevali, P., Baccash, J., Halpern, A.L., Nazarenko, I., Nilsen, G.B., Pant, K.P., Ebert, J.C., Brownley, A., Morenzoni, M., Karpinchyk, V., Martin, B., Ballinger, D.G., and Drmanac, R. 2012b. Computational techniques for human genome resequencing using mated gapped reads. J. Comput. Biol. 19:279-292.

Challis, D., Yu, J., Evani, U.S., Jackson, A.R., Paithankar, S., Coarfa, C., Milosavljevic, A., Gibbs, R.A., and Yu, F. 2012. An integrative variant analysis suite for whole exome next-generation sequencing data. BMC Bioinformatics 13:8.

Chen, K., McLellan, M.D., Ding, L., Wendl, M.C., Kasai, Y., Wilson, R.K., and Mardis, E.R. 2007. PolyScan: An automatic indel and SNP detection approach to the analysis of human resequencing data. Genome Res. 17:659-666.

Chen, Y., Schmidt, B., and Maskell, D.L. 2013. A hybrid short read mapping accelerator. BMC Bioinformatics 14:67.

Chen, Y.Z., Matsushita, M.M., Robertson, P., Rieder, M., Girirajan, S., Antonacci, F., Lipe, H., Eichler, E.E., Nickerson, D.A., Bird, T.D., and Raskind, W.H. 2012. Autosomal dominant familial dyskinesia and facial myokymia: Single exome sequencing identifies a mutation in adenylyl cyclase 5. Arch. Neurol. 69:630-635.

Chepelev, I., Wei, G., Tang, Q., and Zhao, K. 2009. Detection of single nucleotide variations in expressed exons of the human genome using RNA-Seq. Nucleic Acids Res. 37:e106.

Choi, M., Scholl, U.I., Ji, W., Liu, T., Tikhonova, I.R., Zumbo, P., Nayir, A., Bakkaloğlu, A., Ozen, S., Sanjad, S., Nelson-Williams, C., Farhi, A., Mane, S., and Lifton, R.P. 2009. Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proc. Natl. Acad. Sci. U.S.A. 106:19096-19101.

Chou, J., Ohsumi, T.K., and Geha, R.S. 2012. Use of whole exome and genome sequencing in the identification of genetic causes of primary immunodeficiencies. Curr. Opin. Allergy Clin. Immunol. 12:623-628.

Church, D.M., Schneider, V.A., Graves, T., Auger, K., Cunningham, F., Bouk, N., Chen, H.C., Agarwala, R., McLaren, W.M., Ritchie, G.R., Albracht, D., Kremitzki, M., Rock, S., Kotkiewicz, H., Kremitzki, C., Wollam, A., Trani, L., Fulton, L., Fulton, R., Matthews, L., Whitehead, S., Chow, W., Torrance, J., Dunn, M., Harden, G., Threadgold, G., Wood, J., Collins, J., Heath, P., Griffiths, G., Pelan, S., Grafham, D., Eichler, E.E., Weinstock, G., Mardis, E.R., Wilson, R.K., Howe, K., Flicek, P., and Hubbard, T. 2011. Modernizing reference genome assemblies. PLoS Biol. 9:e1001091.

Clarke, L., Zheng-Bradley, X., Smith, R., Kulesha, E., Xiao, C., Toneva, I., Vaughan, B., Preuss, D., Leinonen, R., Shumway, M., Sherry, S., Flicek, P.; 1000 Genomes Project Consortium. 2012. The 1000 Genomes Project: Data management and community access. Nat. Methods 9:459-462.

Cooper, D.N., Stenson, P.D., and Chuzhanova, N.A. 2006. The Human Gene Mutation Database (HGMD) and its exploitation in the study of mutational mechanisms. Curr. Protoc. Bioinformatics 12:1.13.1-1.13.20 [archived version available at http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0113s12/full].

Costanzo, M.C., Park, J., Balakrishnan, R., Cherry, J.M., and Hong, E.L. 2011. Using computational predictions to improve literature-based Gene Ontology annotations: A feasibility study. Database (Oxford) 2011:bar004.

Craig, D.W., Pearson, J.V., Szelinger, S., Sekar, A., Redman, M., Corneveaux, J.J., Pawlowski, T.L., Laub, T., Nunn, G., Stephan, D.A., Homer, N., and Huentelman, M.J. 2008. Identification of genetic variants using bar-coded multiplexed sequencing. Nat. Methods 5:887-893.

Danecek, P., Auton, A., Abecasis, G., Albers, C.A., Banks, E., DePristo, M.A., Handsaker, R.E., Lunter, G., Marth, G.T., Sherry, S.T., McVean, G., Durbin, R., and 1000 Genomes Project Analysis Group. 2011. The variant call format and VCFtools. Bioinformatics 27:2156-2158.

De Baets, G., Van Durme, J., Reumers, J., Maurer-Stroh, S., Vanhee, P., Dopazo, J., Schymkowitz, J., and Rousseau, F. 2012. SNPeffect 4.0: Online prediction of molecular and structural effects of protein-coding variants. Nucleic Acids Res. 40:D935-D939.

de Bakker, P.I., Saxena, R., and Graham, R.R. 2004. Variation in the human genome and risk to common disease. Keystone Symposium on ‘Human Genome Sequence Variation and the Inherited Basis of Common Diseases’. January 8-13, Breckenridge, Colorado, U.S.A. Pharmacogenomics 5:157-161.

Denisov, G., Walenz, B., Halpern, A.L., Miller, J., Axelrod, N., Levy, S., and Sutton, G. 2008. Consensus generation and variant detection by Celera Assembler. Bioinformatics 24:1035-1040.

DePristo, M.A., Banks, E., Poplin, R., Garimella, K.V., Maguire, J.R., Hartl, C., Philippakis, A.A., del Angel, G., Rivas, M.A., Hanna, M., McKenna, A., Fennell, T.J., Kernytsky, A.M., Sivachenko, A.Y., Cibulskis, K., Gabriel, S.B., Altshuler, D., and Daly, M.J. 2011. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43:491-498.

Didelot, X., Bowden, R., Wilson, D.J., Peto, T.E., and Crook, D.W. 2012. Transforming clinical microbiology with bacterial genome sequencing. Nat. Rev. Genet. 13:601-612.

Dingel, J., Hanus, P., Leonardi, N., Hagenauer, J., Zech, J., and Mueller, J.C. 2008. Local conservation scores without a priori assumptions on neutral substitution rates. BMC Bioinformatics 9:190.

Dorfman, R., Nalpathamkalam, T., Taylor, C., Gonska, T., Keenan, K., Yuan, X.W., Corey, M., Tsui, L.C., Zielenski, J., and Durie, P. 2010. Do common in silico tools predict the clinical consequences of amino-acid substitutions in the CFTR gene? Clin. Genet. 77:464-473.
Dulak, A.M., Stojanov, P., Peng, S., Lawrence, M.S., Fox, C., Stewart, C., Bandla, S., Imamura, Y., Schumacher, S.E., Shefler, E., McKenna, A., Carter, S.L., Cibulskis, K., Sivachenko, A., Saksena, G., Voet, D., Ramos, A.H., Auclair, D., Thompson, K., Sougnez, C., Onofrio, R.C., Guiducci, C., Beroukhim, R., Zhou, Z., Lin, L., Lin, J., Reddy, R., Chang, A., Landrenau, R., Pennathur, A., Ogino, S., Luketich, J.D., Golub, T.R., Gabriel, S.B., Lander, E.S., Beer, D.G., Godfrey, T.E., Getz, G., and Bass, A.J. 2013. Exome and whole-genome sequencing of esophageal adenocarcinoma identifies recurrent driver events and mutational complexity. Nat. Genet. 45:478-486.

ENCODE Project Consortium. 2007. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447:799-816.

English, A.C., Richards, S., Han, Y., Wang, M., Vee, V., Qu, J., Qin, X., Muzny, D.M., Reid, J.G., Worley, K.C., and Gibbs, R.A. 2012. Mind the gap: Upgrading genomes with Pacific Biosciences RS long-read sequencing technology. PLoS One 7:e47768.

Finnis, M., Dayan, S., Hobson, L., Chenevix-Trench, G., Friend, K., Ried, K., Venter, D., Woollatt, E., Baker, E., and Richards, R.I. 2005. Common chromosomal fragile site FRA16D mutation in cancer cells. Hum. Mol. Genet. 14:1341-1349.

Fokkema, I.F., Taschner, P.E., Schaafsma, G.C., Celli, J., Laros, J.F., and den Dunnen, J.T. 2011. LOVD v.2.0: The next generation in gene variant databases. Hum. Mutat. 32:557-563.

Forbes, S.A., Bindal, N., Bamford, S., Cole, C., Kok, C.Y., Beare, D., Jia, M., Shepherd, R., Leung, K., Menzies, A., Teague, J.W., Campbell, P.J., Stratton, M.R., and Futreal, P.A. 2011. COSMIC: Mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer. Nucleic Acids Res. 39:D945-D950.

Frost, A., Elgort, M.G., Brandman, O., Ives, C., Collins, S.R., Miller-Vedam, L., Weibezahn, J., Hein, M.Y., Poser, I., Mann, M., Hyman, A.A., and Weissman, J.S. 2012. Functional repurposing revealed by comparing S. pombe and S. cerevisiae genetic interactions. Cell 149:1339-1352.

Fu, W., O’Connor, T.D., Jun, G., Kang, H.M., Abecasis, G., Leal, S.M., Gabriel, S., Rieder, M.J., Altshuler, D., Shendure, J., Nickerson, D.A., Bamshad, M.J.; NHLBI Exome Sequencing Project, and Akey, J.M. 2013. Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature 493:216-220.

Furey, T.S., Diekhans, M., Lu, Y., Graves, T.A., Oddy, L., Randall-Maher, J., Hillier, L.W., Wilson, R.K., and Haussler, D. 2004. Analysis of human mRNAs with the reference genome sequence reveals potential errors, polymorphisms, and RNA editing. Genome Res. 14:2034-2040.

Galperin, M.Y. and Koonin, E.V. 2000. Who’s your neighbor? New computational approaches for functional genomics. Nat. Biotechnol. 18:609-613.

Galperin, M.Y. and Koonin, E.V. 2001. Comparative genome analysis. Methods Biochem. Anal. 43:359-392.

Gardy, J.L. 2013. Investigation of disease outbreaks with genome sequencing. Lancet Infect. Dis. 13:101-102.
Gargis, A.S., Kalman, L., Berry, M.W., Bick, D.P., Dimmock, D.P., Hambuch, T., Lu, F., Lyon, E., Voelkerding, K.V., Zehnbauer, B.A., Agarwala, R., Bennett, S.F., Chen, B., Chin, E.L., Compton, J.G., Das, S., Farkas, D.H., Ferber, M.J., Funke, B.H., Furtado, M.R., Ganova-Raeva, L.M., Geigenmüller, U., Gunselman, S.J., Hegde, M.R., Johnson, P.L., Kasarskis, A., Kulkarni, S., Lenk, T., Liu, C.S., Manion, M., Manolio, T.A., Mardis, E.R., Merker, J.D., Rajeevan, M.S., Reese, M.G., Rehm, H.L., Simen, B.B., Yeakley, J.M., Zook, J.M., and Lubin, I.M. 2012. Assuring the quality of next-generation sequencing in clinical laboratory practice. Nat. Biotechnol. 30:1033-1036.

Gnerre, S., Lander, E.S., Lindblad-Toh, K., and Jaffe, D.B. 2009. Assisted assembly: How to improve a de novo genome assembly by using related species. Genome Biol. 10:R88.

Goh, V., Helbling, D., Biank, V., Jarzembowski, J., and Dimmock, D. 2012. Next-generation sequencing facilitates the diagnosis in a child with twinkle mutations causing cholestatic liver failure. J. Pediatr. Gastroenterol. Nutr. 54:291-294.

Gonzaga-Jauregui, C., Lupski, J.R., and Gibbs, R.A. 2012. Human genome sequencing in health and disease. Annu. Rev. Med. 63:35-61.

González-Pérez, A. and López-Bigas, N. 2011. Improving the assessment of the outcome of nonsynonymous SNVs with a consensus deleteriousness score, Condel. Am. J. Hum. Genet. 88:440-449.

Green, R.C., Berg, J.S., Berry, G.T., Biesecker, L.G., Dimmock, D.P., Evans, J.P., Grody, W.W., Hegde, M.R., Kalia, S., Korf, B.R., Krantz, I., McGuire, A.L., Miller, D.T., Murray, M.F., Nussbaum, R.L., Plon, S.E., Rehm, H.L., and Jacob, H.J. 2012. Exploring concordance and discordance for return of incidental findings from clinical sequencing. Genet. Med. 14:405-410.

Green, R.C., Berg, J.S., Grody, W.W., Kalia, S.S., Korf, B.R., Martin, C.L., McGuire, A.L., Nussbaum, R.L., O’Daniel, J.M., Ormond, K.E., Rehm, H.L., Watson, M.S., Williams, M.S., and Biesecker, L.G. 2013. ACMG recommendations for reporting of incidental findings in clinical exome and genome sequencing. Genet. Med. 15:565-574.

Hall, I.M. and Quinlan, A.R. 2012. Detection and interpretation of genomic structural variation in mammals. Methods Mol. Biol. 838:225-248.

Hastings, R., de Wert, G., Fowler, B., Krawczak, M., Vermeulen, E., Bakker, E., Borry, P., Dondorp, W., Nijsingh, N., Barton, D., Schmidtke, J., van El, C.G., Vermeesch, J., Stol, Y., Carmen Howard, H., and Cornel, M.C. 2012. The changing landscape of genetic testing and its impact on clinical and laboratory services and research in Europe. Eur. J. Hum. Genet. 20:911-916.

Haug, K., Salek, R.M., Conesa, P., Hastings, J., de Matos, P., Rijnbeek, M., Mahendraker, T., Williams, M., Neumann, S., Rocca-Serra, P., Maguire, E., González-Beltrán, A., Sansone, S.A., Griffin, J.L., and Steinbeck, C. 2013. MetaboLights: An open-access general-purpose repository for metabolomics studies and associated meta-data. Nucleic Acids Res. 41:D781-D786.

Hirsch, V., Adger-Johnson, D., Campbell, B., Goldstein, S., Brown, C., Elkins, W.R., and Montefiori, D.C. 1997. A molecularly cloned, pathogenic, neutralization-resistant simian immunodeficiency virus, SIVsmE543-3. J. Virol. 71:1608-1620.

Homer, N., Merriman, B., and Nelson, S.F. 2009. BFAST: An alignment tool for large scale genome resequencing. PLoS One 4:e7767.

Hormozdiari, F., Alkan, C., Eichler, E.E., and Sahinalp, S.C. 2009. Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes. Genome Res. 19:1270-1278.

Howard, H.J., Horaitis, O., Cotton, R.G., Vihinen, M., Dalgleish, R., Robinson, P., Brookes, A.J., Axton, M., Hoffmann, R., and Tuffery-Giraud, S. 2010. The Human Variome Project (HVP) 2009 forum towards establishing standards. Hum. Mutat. 31:366-367.

Jacob, H.J., Abrams, K., Bick, D.P., Brodie, K., Dimmock, D.P., Farrell, M., Geurts, J., Harris, J., Helbling, D., Joers, B.J., Kliegman, R., Kowalski, G., Lazar, J., Margolis, D.A., North, P., Northup, J., Roquemore-Goins, A., Scharer, G., Shimoyama, M., Strong, K., Taylor, B., Tsaih, S.W., Tschannen, M.R., Veith, R.L., Wendt-Andrae, J., Wilk, B., and Worthey, E.A. 2013. Genomics in clinical practice: Lessons from the front lines. Sci. Transl. Med. 5:194cm5.

Jiang, Z., Rokhsar, D.S., and Harland, R.M. 2009. Old can be new again: HAPPY whole genome sequencing, mapping and assembly. Int. J. Biol. Sci. 5:298-303.

Karakoc, E., Alkan, C., O’Roak, B.J., Dennis, M.Y., Vives, L., Mark, K., Rieder, M.J., Nickerson, D.A., and Eichler, E.E. 2012. Detection of structural variants and indels within exome data. Nat. Methods 9:176-178.

Kenna, K.P., McLaughlin, R.L., Hardiman, O., and Bradley, D.G. 2013. Using reference databases of genetic variation to evaluate the potential pathogenicity of candidate disease variants. Hum. Mutat. 34:836-841.

Kidd, J.M., Graves, T., Newman, T.L., Fulton, R., Hayden, H.S., Malig, M., Kallicki, J., Kaul, R., Wilson, R.K., and Eichler, E.E. 2010. A human genome structural variation sequencing resource reveals insights into mutational mechanisms. Cell 143:837-847.

Koboldt, D.C., Chen, K., Wylie, T., Larson, D.E., McLellan, M.D., Mardis, E.R., Weinstock, G.M., Wilson, R.K., and Ding, L. 2009. VarScan: Variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics 25:2283-2285.

Kogelnik, A.M., Lott, M.T., Brown, M.D., Navathe, S.B., and Wallace, D.C. 1997. MITOMAP: An update on the status of the human mitochondrial genome database. Nucleic Acids Res. 25:196-199.

Kogelnik, A.M., Lott, M.T., Brown, M.D., Navathe, S.B., and Wallace, D.C. 1998. MITOMAP: A human mitochondrial genome database–1998 update. Nucleic Acids Res. 26:112-115.

Koolen, D.A., Veltman, J.A., Renier, W.O., Droog, R.P., van Kessel, A.G., and de Vries, B.B. 2004. Chromosome 22q11 deletion and pachygyria characterized by array-based comparative genomic hybridization. Am. J. Med. Genet. A 131:322-324.

Korf, B.R. and Rehm, H.L. 2013. New approaches to molecular diagnosis. JAMA 309:1511-1521.

Ku, C.S., Cooper, D.N., Ziogas, D.E., Halkia, E., Tzaphlidou, M., and Roukos, D.H. 2013a. Research and clinical applications of cancer genome sequencing. Curr. Opin. Obstet. Gynecol. 25:3-10.

Ku, C.S., Polychronakos, C., Tan, E.K., Naidoo, N., Pawitan, Y., Roukos, D.H., Mort, M., and Cooper, D.N. 2013b. A new paradigm emerges from the study of de novo mutations in the context of neurodevelopmental disease. Mol. Psychiatry 18:141-153.

Kumar, P., Henikoff, S., and Ng, P.C. 2009. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat. Protoc. 4:1073-1081.

Kurzweil, V.C., Getman, M., NISC Comparative Sequencing Program, Green, E.D., and Lane, R.P. 2009. Dynamic evolution of V1R putative pheromone receptors between Mus musculus and Mus spretus. BMC Genomics 10:74.

Lalonde, E., Albrecht, S., Ha, K.C., Jacob, K., Bolduc, N., Polychronakos, C., Dechelotte, P., Majewski, J., and Jabado, N. 2010. Unexpected allelic heterogeneity and spectrum of mutations in Fowler syndrome revealed by next-generation exome sequencing. Hum. Mutat. 31:918-923.

Lander, E.S. 2011. Initial impact of the sequencing of the human genome. Nature 470:187-197.

Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., Funke, R., Gage, D., et al., 2001. Initial sequencing and analysis of the human genome. Nature 409:860-921.

Langmead, B., Trapnell, C., Pop, M., and Salzberg, S.L. 2009. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10:R25.

Lappalainen, I., Lopez, J., Skipper, L., Hefferon, T., Spalding, J.D., Garner, J., Chen, C., Maguire, M., Corbett, M., Zhou, G., Paschall, J., Ananiev, V., Flicek, P., and Church, D.M. 2013. DbVar and DGVa: Public archives for genomic structural variation. Nucleic Acids Res. 41:D936-D941.

Lee, H. and Schatz, M.C. 2012. Genomic dark matter: The reliability of short read mapping illustrated by the genome mappability score. Bioinformatics 28:2097-2105.

Li, H. 2012. Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly. Bioinformatics 28:1838-1844.

Li, H. and Durbin, R. 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25:1754-1760.

Li, M., Pang, S.Y., Song, Y., Kung, M.H., Ho, S.L., and Sham, P.C. 2013. Whole exome sequencing identifies a novel mutation in the transglutaminase 6 gene for spinocerebellar ataxia in a Chinese family. Clin. Genet. 83:269-273.

Li, R., Fan, W., Tian, G., Zhu, H., He, L., Cai, J., Huang, Q., Cai, Q., Li, B., Bai, Y., Zhang, Z., Zhang, Y., et al., 2010. The sequence and de novo assembly of the giant panda genome. Nature 463:311-317.

Li, S., Li, R., Li, H., Lu, J., Bolund, L., Schierup, M.H., and Wang, J. 2013. SOAPindel: Efficient identification of indels from short paired reads. Genome Res. 23:195-200.

Li, Y., Vinckenbosch, N., Huerta-Sanchez, E., Jiang, T., Jiang, H., Albrechtsen, A., Andersen, G., Cao, H., Korneliussen, T., Grarup, N., Guo, Y., Hellman, I., Jin, X., Li, Q., Liu, J., Liu, X., Sparsø, T., Tang, M., Wu, H., Wu, R., Yu, C., Zheng, H., Astrup, A., Bolund, L., Holmkvist, J., Jørgensen, T., Kristiansen, K., Schmitz, O., Schwartz, T.W., Zhang, X., Li, R., Yang, H., Wang, J., Hansen, T., Pedersen, O., Nielsen, R., and Wang, J. 2010. Resequencing of 200 human exomes identifies an excess of low-frequency non-synonymous coding variants. Nat. Genet. 42:969-972.

Lin, Y., Li, J., Shen, H., Zhang, L., Papasian, C.J., and Deng, H.W. 2011. Comparative studies of de novo assembly tools for next-generation sequencing technologies. Bioinformatics 27:2031-2037.

Lindberg, J., Klevebring, D., Liu, W., Neiman, M., Xu, J., Wiklund, P., Wiklund, F., Mills, I.G., Egevad, L., and Grönberg, H. 2013. Exome sequencing of prostate cancer supports the hypothesis of independent tumour origins. Eur. Urol. 63:347-353.

Luo, R., Liu, B., Xie, Y., Li, Z., Huang, W., Yuan, J., He, G., Chen, Y., Pan, Q., Liu, Y., Tang, J., Wu, G., Zhang, H., Shi, Y., Liu, Y., Yu, C., Wang, B., Lu, Y., Han, C., Cheung, D.W., Yiu, S.M., Peng, S., Xiaoqian, Z., Liu, G., Liao, X., Li, Y., Yang, H., Wang, J., Lam, T.W., and Wang, J. 2012. SOAPdenovo2: An empirically improved memory-efficient short-read de novo assembler. Gigascience 1:18.

Lupski, J.R. 1998. Genomic disorders: Structural features of the genome can lead to DNA rearrangements and human disease traits. Trends Genet. 14:417-422.

Lupski, J.R., Reid, J.G., Gonzaga-Jauregui, C., Rio Deiros, D., Chen, D.C., Nazareth, L., Bainbridge, M., Dinh, H., Jing, C., Wheeler, D.A., McGuire, A.L., Zhang, F., Stankiewicz, P., Halperin, J.J., Yang, C., Gehman, C., Guo, D., Irikat, R.K., Tom, W., Fantin, N.J., Muzny, D.M., and Gibbs, R.A. 2010. Whole-genome sequencing in a patient with Charcot-Marie-Tooth neuropathy. N. Engl. J. Med. 362:1181-1191.

Mardis, E.R. 2012. Applying next-generation sequencing to pancreatic cancer treatment. Nat. Rev. Gastroenterol. Hepatol. 9:477-486.

Marques-Bonet, T., Cheng, Z., She, X., Eichler, E.E., and Navarro, A. 2008. The genomic distribution of intraspecific and interspecific sequence divergence of human segmental duplications relative to human/chimpanzee chromosomal rearrangements. BMC Genomics 9:384.

Martinez-Alcantara, A., Ballesteros, E., Feng, C., Rojas, M., Koshinsky, H., Fofanov, V.Y., Havlak, P., and Fofanov, Y. 2009. PIQA: Pipeline for Illumina G1 genome analyzer data quality assessment. Bioinformatics 25:2438-2439.

Mayer, A.N., Dimmock, D.P., Arca, M.J., Bick, D.P., Verbsky, J.W., Worthey, E.A., Jacob, H.J., and Margolis, D.A. 2011. A timely arrival for genomic medicine. Genet. Med. 13:195-196.

McCarroll, S.A., Hadnott, T.N., Perry, G.H., Sabeti, P.C., Zody, M.C., Barrett, J.C., Dallaire, S., Gabriel, S.B., Lee, C., Daly, M.J., Altshuler, D.M.; International HapMap Consortium. 2006. Common deletion polymorphisms in the human genome. Nat. Genet. 38:86-92.

McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., Garimella, K., Altshuler, D., Gabriel, S., Daly, M., and DePristo, M.A. 2010. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20:1297-1303.

McKernan, K.J., Peckham, H.E., Costa, G.L., McLaughlin, S.F., Fu, Y., Tsung, E.F., Clouser, C.R., Duncan, C., Ichikawa, J.K., Lee, C.C., Zhang, Z., Ranade, S.S., Dimalanta, E.T., Hyland, F.C., Sokolsky, T.D., Zhang, L., Sheridan, A., Fu, H., Hendrickson, C.L., Li, B., Kotler, L., Stuart, J.R., Malek, J.A., Manning, J.M., Antipova, A.A., Perez, D.S., Moore, M.P., Hayashibara, K.C., Lyons, M.R., Beaudoin, R.E., Coleman, B.E., Laptewicz, M.W., Sannicandro, A.E., Rhodes, M.D., Gottimukkala, R.K., Yang, S., Bafna, V., Bashir, A., MacBride, A., Alkan, C., Kidd, J.M., Eichler, E.E., Reese, M.G., De La Vega, F.M., and Blanchard, A.P. 2009. Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Res. 19:1527-1541.

Merikangas, A.K., Corvin, A.P., and Gallagher, L. 2009. Copy-number variants in neurodevelopmental disorders: Promises and challenges. Trends Genet. 25:536-544.

Mijuskovic, M., Brown, S.M., Tang, Z., Lindsay, C.R., Efstathiadis, E., Deriano, L., and Roth, D.B. 2012. A streamlined method for detecting structural variants in cancer genomes by short read paired-end sequencing. PLoS One 7:e48314.

Milenkovic, V.M., Langmann, T., Schreiber, R., Kunzelmann, K., and Weber, B.H. 2008. Molecular evolution and functional divergence of the bestrophin protein family. BMC Evol. Biol. 8:72.

Miller, J.R., Koren, S., and Sutton, G. 2010. Assembly algorithms for next-generation sequencing data. Genomics 95:315-327.

Mitchell, A.A., Zwick, M.E., Chakravarti, A., and Cutler, D.J. 2004. Discrepancies in dbSNP confirmation rates and allele frequency distributions from varying genotyping error rates and patterns. Bioinformatics 20:1022-1032.

Moorthie, S., Hall, A., and Wright, C.F. 2013. Informatics and clinical genome sequencing: Opening the black box. Genet. Med. 15:165-171.

Nagarajan, N. and Pop, M. 2009. Parametric complexity of sequence assembly: Theory and applications to next generation sequencing. J. Comput. Biol. 16:897-908.

NCBI Resource Coordinators. 2013. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 41:D8-D20.

Ng, B.G., Buckingham, K.J., Raymond, K., Kircher, M., Turner, E.H., He, M., Smith, J.D., Eroshkin, A., Szybowska, M., Losfeld, M.E., Chong, J.X., Kozenko, M., Li, C., Patterson, M.C., Gilbert, R.D., Nickerson, D.A., Shendure, J., Bamshad, M.J.; University of Washington Center for Mendelian Genomics, and Freeze, H.H. 2013. Mosaicism of the UDP-galactose transporter SLC35A2 causes a congenital disorder of glycosylation. Am. J. Hum. Genet. 92:632-636.

Ng, S.B., Turner, E.H., Robertson, P.D., Flygare, S.D., Bigham, A.W., Lee, C., Shaffer, T., Wong, M., Bhattacharjee, A., Eichler, E.E., Bamshad, M., Nickerson, D.A., and Shendure, J. 2009. Targeted capture and massively parallel sequencing of 12 human exomes. Nature 461:272-276.

Ng, S.B., Buckingham, K.J., Lee, C., Bigham, A.W., Tabor, H.K., Dent, K.M., Huff, C.D., Shannon, P.T., Jabs, E.W., Nickerson, D.A., Shendure, J., and Bamshad, M.J. 2010. Exome sequencing identifies the cause of a mendelian disorder. Nat. Genet. 42:30-35.

O’Rawe, J., Guangqing, S., Sun, G., Wu, Y., Wang, W., Hu, J., Bodily, P., Tian, L., Hakonarson, H., Johnson, W.E., Wei, Z., Wang, K., and Lyon, G.J. 2013. Low concordance of multiple variant-calling pipelines: Practical implications for exome and genome sequencing. Genome Med. 5:28.

O’Roak, B.J., Deriziotis, P., Lee, C., Vives, L., Schwartz, J.J., Girirajan, S., Karakoc, E., Mackenzie, A.P., Ng, S.B., Baker, C., Rieder, M.J., Nickerson, D.A., Bernier, R., Fisher, S.E., Shendure, J., and Eichler, E.E. 2011. Exome sequencing in sporadic autism spectrum disorders identifies severe de novo mutations. Nat. Genet. 43:585-589. O’Roak, B.J., Vives, L., Fu, W., Egertson, J.D., Stanaway, I.B., Phelps, I.G., Carvill, G., Kumar, A., Lee, C., Ankenman, K., Munson, J., Hiatt, J.B., Turner, E.H., Levy, R., O’Day, D.R., Krumm, N., Coe, B.P., Martin, B.K., Borenstein, E., Nickerson, D.A., Mefford, H.C., Doherty, D., Akey, J.M., Bernier, R., Eichler, E.E., and Shendure, J. 2012. Multiplex targeted sequencing identifies recurrently mutated genes in autism spectrum disorders. Science 338:1619-1622. Osborne, J.D., Flatow, J., Holko, M., Lin, S.M., Kibbe, W.A., Zhu, L.J., Danila, M.I., Feng, G., and Chisholm, R.L. 2009. Annotating the human genome with Disease Ontology. BMC Genomics 10:S6. Pagon, R.A. 2006. GeneTests: An online genetic information resource for health care providers. J. Med. Libr. Assoc. 94:343-348. Pagon, R.A., Tarczy-Hornoch, P., Baskin, P.K., Edwards, J.E., Covington, M.L., Espeseth, M., Beahler, C., Bird, T.D., Popovich, B., Nesbitt,

C., Dolan, C., Marymee, K., Hanson, N.B., Neufeld-Kaiser, W., Grohs, G.M., Kicklighter, T., Abair, C., Malmin, A., Barclay, M., and Palepu, R.D. 2002. GeneTests-GeneClinics: Genetic testing information for a growing audience. Hum. Mutat. 19:501-509. Park, D.J., Lesueur, F., Nguyen-Dumont, T., Pertesi, M., Odefrey, F., Hammet, F., Neuhausen, S.L., John, E.M., Andrulis, I.L., Terry, M.B., Daly, M., Buys, S., Le Calvez-Kelm, F., Lonie, A., Pope, B.J., Tsimiklis, H., Voegele, C., Hilbers, F.M., Hoogerbrugge, N., Barroso, A., Osorio, A.; Breast Cancer Family Registry; Kathleen Cuningham Foundation Consortium for Research into Familial Breast Cancer, Giles, G.G., Devilee, P., Benitez, J., Hopper, J.L., Tavtigian, S.V., Goldgar, D.E., and Southey, M.C. 2012. Rare mutations in XRCC2 increase the risk of breast cancer. Am. J. Hum. Genet. 90:734-739. Paszkiewicz, K. and Studholme, D.J. 2010. De novo assembly of short sequence reads. Brief. Bioinform. 11:457-472. Patrinos, G.P., Smith, T.D., Howard, H., AlMulla, F., Chouchane, L., Hadjisavvas, A., Hamed, S.A., Li, X.T., Marafie, M., Ramesar, R.S., Ramos, F.J., de Ravel, T., El-Ruby, M.O., Shrestha, T.R., Sobrido, M.J., Tadmouri, G., Witsch-Baumgartner, M., Zilfalil, B.A., Auerbach, A.D., Carpenter, K., Cutting, G.R., Dung, V.C., Grody, W., Hasler, J., Jorde, L., Kaput, J., Macek, M., Matsubara, Y., Padilla, C., Robinson, H., Rojas-Martinez, A., Taylor, G.R., Vihinen, M., Weber, T., Burn, J., Qi, M., Cotton, R.G., Rimoin, D.; International Confederation of Countries Advisory Council. 2012. Human Variome Project country nodes: Documenting genetic information within a country. Hum. Mutat. 33:1513-1519. Pollard, K.S., Hubisz, M.J., Rosenbloom, K.R., and Siepel, A. 2010. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 20:110-121. Pop, M. 2009. Genome assembly reborn: Recent computational challenges. Brief. Bioinform. 10:354-366. Porter, J.D. and Baker, R.S. 1997. 
Absence of oculomotor and trochlear motoneurons leads to altered extraocular muscle development in the Wnt-1 null mutant mouse. Brain Res. Dev. Brain Res. 100:121-126. Priest, J.R., Girirajan, S., Vu, T.H., Olson, A., Eichler, E.E., and Portman, M.A. 2012. Rare copy number variants in isolated sporadic and syndromic atrioventricular septal defects. Am. J. Med. Genet. A 158:1279-1284. Quail, M.A., Kozarewa, I., Smith, F., Scally, A., Stephens, P.J., Durbin, R., Swerdlow, H., and Turner, D.J. 2008. A large genome center’s improvements to the Illumina sequencing system. Nat. Methods 5:1005-1010. Reese, M.G., Moore, B., Batchelor, C., Salas, F., Cunningham, F., Marth, G.T., Stein, L., Flicek, P., Yandell, M., and Eilbeck, K. 2010. A standard variation file format for human genome sequences. Genome Biol. 11:R88.

Reich, D.E., Gabriel, S.B., and Altshuler, D. 2003. Quality and completeness of SNP databases. Nat. Genet. 33:457-458. Reis, L.M., Tyler, R.C., Muheisen, S., Raggio, V., Salviati, L., Han, D.P., Costakos, D., Yonath, H., Hall, S., Power, P., and Semina, E.V. 2013. Whole exome sequencing in dominant cataract identifies a new causative factor, CRYBA2, and a variety of novel alleles in known genes. Hum. Genet. 132:761-770. Richards, C.S., Bale, S., Bellissimo, D.B., Das, S., Grody, W.W., Hegde, M.R., Lyon, E., Ward, B.E.; Molecular Subcommittee of the ACMG Laboratory Quality Assurance Committee. 2007. ACMG recommendations for standards for interpretation and reporting of sequence variations: Revisions 2007. Genet. Med. 10:294-300. Richards, C.S., Bale, S., Bellissimo, D.B., Das, S., Grody, W.W., Hegde, M.R., Lyon, E., Ward, B.E.; Molecular Subcommittee of the ACMG Laboratory Quality Assurance Committee. 2008. ACMG recommendations for standards for interpretation and reporting of sequence variations: Revisions 2007. Genet. Med. 10:294-300. Ring, H.Z., Kwok, P.Y., and Cotton, R.G. 2006. Human Variome Project: An international collaboration to catalogue human genetic variation. Pharmacogenomics 7:969-972. Rosenberg, S.M. and Hastings, P.J. 2004. Rebuttal: Adaptive mutation in Escherichia coli (Foster). J. Bacteriol. 186:4853. Rosenfeld, J.A., Mason, C.E., and Smith, T.M. 2012. Limitations of the human reference genome for personalized genomics. PLoS One 7:e40294. Roukos, D.H. and Ku, C.S. 2012. Clinical cancer genome and precision medicine. Ann. Surg. Oncol. 19:3646-3650. Ruffalo, M., LaFramboise, T., and Koyutürk, M. 2011. Comparative analysis of algorithms for next-generation sequencing read alignment. Bioinformatics 27:2790-2806. Safran, M., Dalah, I., Alexander, J., Rosen, N., Iny Stein, T., Shmoish, M., Nativ, N., Bahir, I., Doniger, T., Krug, H., Sirota-Madi, A., Olender, T., Golan, Y., Stelzer, G., Harel, A., and Lancet, D. 2010. 
GeneCards Version 3: The human gene integrator. Database (Oxford) 2010:baq020. Sasson, O., Kaplan, N., and Linial, M. 2006. Functional annotation prediction: All for one and one for all. Protein Sci. 15:1557-1562. Saunders, C.J., Miller, N.A., Soden, S.E., Dinwiddie, D.L., Noll, A., Alnadi, N.A., Andraws, N., Patterson, M.L., Krivohlavek, L.A., Fellis, J., Humphray, S., Saffrey, P., Kingsbury, Z., Weir, J.C., Betley, J., Grocock, R.J., Margulies, E.H., Farrow, E.G., Artman, M., Safina, N.P., Petrikin, J.E., Hall, K.P., and Kingsmore, S.F. 2012. Rapid whole-genome sequencing for genetic disease diagnosis in neonatal intensive care units. Sci. Transl. Med. 4:154ra135.

Schnoes, A.M., Brown, S.D., Dodevski, I., and Babbitt, P.C. 2009. Annotation error in public databases: Misannotation of molecular function in enzyme superfamilies. PLoS Comput. Biol. 5:e1000605. Schrijver, I., Aziz, N., Farkas, D.H., Furtado, M., Gonzalez, A.F., Greiner, T.C., Grody, W.W., Hambuch, T., Kalman, L., Kant, J.A., Klein, R.D., Leonard, D.G., Lubin, I.M., Mao, R., Nagan, N., Pratt, V.M., Sobel, M.E., Voelkerding, K.V., and Gibson, J.S. 2012. Opportunities and challenges associated with clinical diagnostic genome sequencing: A report of the Association for Molecular Pathology. J. Mol. Diagn. 14:525-540. Schwarz, J.M., Rödelsperger, C., Schuelke, M., and Seelow, D. 2010. MutationTaster evaluates disease-causing potential of sequence alterations. Nat. Methods 7:575-576. Shapiro, B. and Hofreiter, M. 2010. Analysis of ancient human genomes: using next generation sequencing, 20-fold coverage of the genome of a 4,000-year-old human from Greenland has been obtained. Bioessays 32:388-391. Sharp, A.J. 2009. Emerging themes and new challenges in defining the role of structural variation in human disease. Hum. Mutat. 30:135-144. Shen, H., Li, J., Xu, C., Jiang, Y., Wu, Z., Zhao, F., Liao, L., Chen, J., Lin, Y., Tian, Q., Papasian, C.J., and Deng, H.W. 2013. Comprehensive characterization of human genome variation by high coverage whole-genome sequencing of forty four Caucasians. PLoS One 8:e59494. Shen, Y., Wan, Z., Coarfa, C., Drabek, R., Chen, L., Ostrowski, E.A., Liu, Y., Weinstock, G.M., Wheeler, D.A., Gibbs, R.A., and Yu, F. 2010. A SNP discovery method to assess variant allele probability from next-generation resequencing data. Genome Res. 20:273-280. Sheth, N., Roca, X., Hastings, M.L., Roeder, T., Krainer, A.R., and Sachidanandam, R. 2006. Comprehensive splice-site analysis using comparative genomics. Nucleic Acids Res. 34:3955-3967. Sherry, S.T., Ward, M.H., Kholodov, M., Baker, J., Phan, L., Smigielski, E.M., and Sirotkin, K. 2001. 
dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29:308-311.

Simpson, J.T., Wong, K., Jackman, S.D., Schein, J.E., Jones, S.J., and Birol, I. 2009. ABySS: A parallel assembler for short read sequence data. Genome Res. 19:1117-1123. Smith, A.D., Xuan, Z., and Zhang, M.Q. 2008. Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinformatics 9:128. Smith, B., Ashburner, M., Rosse, C., Bard, J., Bug, W., Ceusters, W., Goldberg, L.J., Eilbeck, K., Ireland, A., Mungall, C.J.; OBI Consortium, Leontis, N., Rocca-Serra, P., Ruttenberg, A., Sansone, S.A., Scheuermann, R.H., Shah, N., Whetzel, P.L., and Lewis, S. 2007. The OBO Foundry: Coordinated evolution of ontologies to support biomedical data integration. Nat. Biotechnol. 25:1251-1255. Smith, D.R., Quinlan, A.R., Peckham, H.E., Makowsky, K., Tao, W., Woolf, B., Shen, L., Donahue, W.F., Tusneem, N., Stromberg, M.P., Stewart, D.A., Zhang, L., Ranade, S.S., Warner, J.B., Lee, C.C., Coleman, B.E., Zhang, Z., McLaughlin, S.F., Malek, J.A., Sorenson, J.M., Blanchard, A.P., Chapman, J., Hillman, D., Chen, F., Rokhsar, D.S., McKernan, K.J., Jeffries, T.W., Marth, G.T., and Richardson, P.M. 2008. Rapid whole-genome mutational profiling using next-generation sequencing technologies. Genome Res. 18:1638-1642. Smith, T.D., Robinson, H.M., and Cotton, R.G. 2012. The Human Variome Project Beijing meeting. J. Med. Genet. 49:284-289. Smith, T.F. and Zhang, X. 1997. The challenges of genome sequence annotation or the devil is in the details. Nat. Biotechnol. 15:1222-1223. Snape, K., Ruark, E., Tarpey, P., Renwick, A., Turnbull, C., Seal, S., Murray, A., Hanks, S., Douglas, J., Stratton, M.R., and Rahman, N. 2012. Predisposition gene identification in common cancers by exome sequencing: Insights from familial breast cancer. Breast Cancer Res. Treat. 134:429-433. Sneddon, T.P. and Church, D.M. 2012. Online resources for genomic structural variation. Methods Mol. Biol. 838:273-289. Stankiewicz, P. and Lupski, J.R. 2010. Structural variation in the human genome and its role in disease. Annu. Rev. Med. 61:437-455. Stenson, P.D., Ball, E.V., Mort, M., Phillips, A.D., Shiel, J.A., Thomas, N.S., Abeysinghe, S., Krawczak, M., and Cooper, D.N. 2003. Human Gene Mutation Database HGMD: 2003 update. Hum. Mutat. 21:577-581. Stenson, P.D., Mort, M., Ball, E.V., Howells, K., Phillips, A.D., Thomas, N.S., and Cooper, D.N. 2009. The Human Gene Mutation Database: 2008 update. Genome Med. 1:13. Stubbs, A., McClellan, E.A., Horsman, S., Hiltemann, S.D., Palli, I., Nouwens, S., Koning, A.H., Hoogland, F., Reumers, J., Heijsman, D., Swagemakers, S., Kremer, A., Meijerink, J., Lambrechts, D., and van der Spek, P.J. 2012. 
Huvariome: A web server resource of whole genome next-generation sequencing allelic frequencies to aid in pathological candidate gene selection. J. Clin. Bioinforma. 2:19. Sundquist, A., Ronaghi, M., Tang, H., Pevzner, P., and Batzoglou, S. 2007. Whole-genome sequencing and assembly with high-throughput, short-read technologies. PLoS One 2:e484. Thomas, P.D., Mi, H., and Lewis, S. 2007. Ontology annotation: Mapping genomic regions to biological function. Curr. Opin. Chem. Biol. 11:4-11. Tong, M.Y., Cassa, C.A., and Kohane, I.S. 2011. Automated validation of genetic variants from large databases: Ensuring that variant references refer to the same genomic locations. Bioinformatics 27:891-893.

Treangen, T.J. and Salzberg, S.L. 2011. Repetitive DNA and next-generation sequencing: Computational challenges and solutions. Nat. Rev. Genet. 13:36-46. Tucker, E.J., Mimaki, M., Compton, A.G., McKenzie, M., Ryan, M.T., and Thorburn, D.R. 2012. Next-generation sequencing in molecular diagnosis: NUBPL mutations highlight the challenges of variant detection and interpretation. Hum. Mutat. 33:411-418. Valencia, A. 2005. Automatic annotation of protein function. Curr. Opin. Struct. Biol. 15:267-274. Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., Sutton, G.G., Smith, H.O., Yandell, M., Evans, C.A., Holt, R.A., Gocayne, J.D., Amanatides, P., et al., 2001. The sequence of the human genome. Science 291:1304-1351. Vijay, N., Poelstra, J.W., Kunstner, A., and Wolf, J.B. 2013. Challenges and strategies in transcriptome assembly and differential gene expression quantification. A comprehensive in silico assessment of RNA-seq experiments. Mol. Ecol. 22:620-634. Wagner, M.J. 2013. Rare-variant genome-wide association studies: A new frontier in genetic analysis of complex traits. Pharmacogenomics 14:413-424. Wang, K., Li, M., and Hakonarson, H. 2010. ANNOVAR: Functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38:e164. Wang, L., Tsutsumi, S., Kawaguchi, T., Nagasaki, K., Tatsuno, K., Yamamoto, S., Sang, F., Sonoda, K., Sugawara, M., Saiura, A., Hirono, S., Yamaue, H., Miki, Y., Isomura, M., Totoki, Y., Nagae, G., Isagawa, T., Ueda, H., Murayama-Hosokawa, S., Shibata, T., Sakamoto, H., Kanai, Y., Kaneda, A., Noda, T., and Aburatani, H. 2012. Whole-exome sequencing of human pancreatic cancers and characterization of genomic instability caused by MLH1 haploinsufficiency and complete deficiency. Genome Res. 22:208-219. Wang, L.L., Li, Y., and Zhou, S.F. 2009. A bioinformatics approach for the phenotype prediction of nonsynonymous single nucleotide polymorphisms in human cytochromes P450. Drug Metab. Dispos. 37:977-991. 
Wang, S.K., Hu, Y., Simmer, J.P., Seymen, F., Estrella, N.M., Pal, S., Reid, B.M., Yidirim, M., Bayram, M., Bartlett, J.D., and Hu, J.C. 2013. Novel KLK4 and MMP20 mutations discovered by whole-exome sequencing. J. Dent. Res. 92:266-271. Wappenschmidt, B., Becker, A.A., Hauke, J., Weber, U., Engert, S., Köhler, J., Kast, K., Arnold, N., Rhiem, K., Hahnen, E., Meindl, A., and Schmutzler, R.K. 2012. Analysis of 30 putative BRCA1 splicing mutations in hereditary breast and ovarian cancer families identifies exonic splice site mutations that escape in silico prediction. PLoS One 7:e50800. Waterston, R.H., Lander, E.S., and Sulston, J.E. 2003. More on the sequencing of the human genome. Proc. Natl. Acad. Sci. U.S.A. 100:3022-3024; author reply 3025-3026. Watt, S., Jiao, W., Brown, A.M., Petrocelli, T., Tran, B., Zhang, T., McPherson, J.D., Kamel-Reid, S., Bedard, P.L., Onetto, N., Hudson, T.J., Dancey, J., Siu, L.L., Stein, L., and Ferretti, V. 2013. Clinical genomics information management software linking cancer genome sequence and clinical decisions. Genomics Apr 17. pii: S0888-7543(13)00070-0. doi: 10.1016/j.ygeno.2013.04.007. [Epub ahead of print]. Weedon, M.N., Hastings, R., Caswell, R., Xie, W., Paszkiewicz, K., Antoniadi, T., Williams, M., King, C., Greenhalgh, L., Newbury-Ecob, R., and Ellard, S. 2011. Exome sequencing identifies a DYNC1H1 mutation in a large pedigree with dominant axonal Charcot-Marie-Tooth disease. Am. J. Hum. Genet. 89:308-312. Weir, B.A., Woo, M.S., Getz, G., Perner, S., Ding, L., Beroukhim, R., Lin, W.M., Province, M.A., Kraja, A., Johnson, L.A., Shah, K., Sato, M., et al., 2007. Characterizing the cancer genome in lung adenocarcinoma. Nature 450:893-898. Wetzel, J., Kingsford, C., and Pop, M. 2011. Assessing the benefits of using mate-pairs to resolve repeats in de novo short-read prokaryotic assemblies. BMC Bioinformatics 12:95. Wheeler, D.A., Srinivasan, M., Egholm, M., Shen, Y., Chen, L., McGuire, A., He, W., Chen, Y.J., Makhijani, V., Roth, G.T., Gomes, X., Tartaro, K., Niazi, F., Turcotte, C.L., Irzyk, G.P., Lupski, J.R., Chinault, C., Song, X.Z., Liu, Y., Yuan, Y., Nazareth, L., Qin, X., Muzny, D.M., Margulies, M., Weinstock, G.M., Gibbs, R.A., and Rothberg, J.M. 2008. The complete genome of an individual by massively parallel DNA sequencing. Nature 452:872-876. Wirth, B., Garbes, L., and Riessland, M. 2013. How genetic modifiers influence the phenotype of spinal muscular atrophy and suggest future therapeutic approaches. Curr. Opin. Genet. Dev. 23:330-338. Wolfe, A.L., Felock, P.J., Hastings, J.C., Blau, C.U., and Hazuda, D.J. 1996. 
The role of manganese in promoting multimerization and assembly of human immunodeficiency virus type 1 integrase as a catalytically active complex on immobilized long terminal repeat substrates. J. Virol. 70:1424-1432. Worthey, E.A., Mayer, A.N., Syverson, G.D., Helbling, D., Bonacci, B.B., Decker, B., Serpe, J.M., Dasu, T., Tschannen, M.R., Veith, R.L., Basehore, M.J., Broeckel, U., Tomita-Mitchell, A., Arca, M.J., Casper, J.T., Margolis, D.A., Bick, D.P., Hessner, M.J., Routes, J.M., Verbsky, J.W., Jacob, H.J., and Dimmock, D.P. 2011. Making a definitive diagnosis: Successful clinical application of whole exome sequencing in a child with intractable inflammatory bowel disease. Genet. Med. 13:255-262. Xi, R., Kim, T.M., and Park, P.J. 2010. Detecting structural variations in the human genome using next generation sequencing. Brief. Funct. Genomics 9:405-415.

Xiong, F., Gao, J., Li, J., Liu, Y., Feng, G., Fang, W., Chang, H., Xie, J., Zheng, H., Li, T., and He, L. 2009. Noncanonical and canonical splice sites: A novel mutation at the rare noncanonical splice-donor cut site IVS4+1A>G of SEDL causes variable splicing isoforms in X-linked spondyloepiphyseal dysplasia tarda. Eur. J. Hum. Genet. 17:510-516. Yamaguchi, T., Hosomichi, K., Narita, A., Shirota, T., Tomoyasu, Y., Maki, K., and Inoue, I. 2011. Exome resequencing combined with linkage analysis identifies novel PTH1R variants in primary failure of tooth eruption in Japanese. J. Bone Miner. Res. 26:1655-1661. Yu, T.W., Chahrour, M.H., Coulter, M.E., Jiralerspong, S., Okamura-Ikeda, K., Ataman, B., Schmitz-Abe, K., Harmin, D.A., Adli, M., Malik, A.N., D’Gama, A.M., Lim, E.T., Sanders, S.J., Mochida, G.H., Partlow, J.N., Sunu, C.M., Felie, J.M., Rodriguez, J., Nasir, R.H., Ware, J., Joseph, R.M., Hill, R.S., Kwan, B.Y., Al-Saffar, M., Mukaddes, N.M., Hashmi, A., Balkhy, S., Gascon, G.G., Hisama, F.M., LeClair, E., Poduri, A., Oner, O., Al-Saad, S., Al-Awadi, S.A., Bastaki, L., Ben-Omran, T., Teebi, A.S., Al-Gazali, L., Eapen, V., Stevens, C.R., Rappaport, L., Gabriel, S.B., Markianos, K., State, M.W., Greenberg, M.E., Taniguchi, H., Braverman, N.E., Morrow, E.M., and Walsh, C.A. 2013. Using whole-exome sequencing to identify inherited causes of autism. Neuron 77:259-273. Zerbino, D.R. and Birney, E. 2008. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18:821-829. Zhang, L., Zhang, J., Yang, J., Ying, D., Lau, Y.L., and Yang, W. 2013. PriVar: A toolkit for prioritizing SNVs and indels from next-generation sequencing data. Bioinformatics 29:124-125. Zhang, W., Chen, J., Yang, Y., Tang, Y., Shang, J., and Shen, B. 2011. A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies. PLoS One 6:e17915. doi: 10.1371/journal.pone.0017915. Zhang, Z., Burch, P.E., Cooney, A.J., Lanz, R.B., Pereira, F.A., Wu, J., Gibbs, R.A., Weinstock, G., and Wheeler, D.A. 2004. Genomic analysis of the nuclear receptor family: New insights into structure, regulation, and evolution from the rat genome. Genome Res. 14:580-590.
