Hum Genet DOI 10.1007/s00439-013-1396-y

Review Paper

Molecular genetic epidemiology of human diseases: from patterns to predictions

Carolin Knecht · Michael Krawczak

Received: 1 November 2013 / Accepted: 7 November 2013 © Springer-Verlag Berlin Heidelberg 2013

Abstract  Databases of disease-associated or disease-causing mutations allow the study not only of the molecular mechanisms underlying the primary lesions at the DNA level, but also of the functional consequences of mutation at the phenotypic level. The Human Gene Mutation Database (HGMD) and the bioinformatics analyses of its content provide an illustrative example of this indirect approach to molecular genetic epidemiology. In fact, the Bayesian type of reasoning underlying previous scientific analyses of HGMD data is also reflected in current software tools used to predict the likely disease relevance of a newly detected genetic variant. After a brief résumé of the past scientific utility of HGMD, we therefore briefly review three representative and commonly used examples of these tools, namely SIFT, PolyPhen-2 and NNSplice.

Introduction

Beginning with the first discovery of mutational hotspots in phage T4 (Benzer 1961), an ever-increasing amount of evidence has accumulated for a relationship between the sequence characteristics of a piece of DNA and its propensity to undergo mutation. Indeed, the idea that, for example, repeat-mediated DNA slippage, methylation-induced deamination of cytosine residues or inefficient DNA mismatch repair represent major causes of both in vitro and in vivo mutagenesis has become commonplace in twenty-first century molecular genetics. Nearly 35 years ago, one of us (MK) was fortunate

C. Knecht · M. Krawczak (*)  Institute of Medical Informatics and Statistics, Christian-Albrechts University of Kiel, Brunswiker Strasse 10, 24105 Kiel, Germany e-mail: [email protected]‑kiel.de

to become involved in what would nowadays be called “bioinformatics” research into these mutational mechanisms, an endeavor that resulted in three publications that (surprisingly) count among the most cited papers in the 50-year history of Human Genetics (Cooper and Krawczak 1990; Krawczak and Cooper 1991; Krawczak et al. 1992). Instead of banking on hypothesis-driven laboratory experiments in systems of sufficiently high mutational turnover, this work was aimed at the in vivo situation in humans, following an approach that was greatly inspired by classical medical statistics. If any of the putative mutational mechanisms inferred from bacteria, fruit flies or rats were relevant to human cells as well, they should leave characteristic footprints in the observable pattern of naturally occurring human mutations. Fortunately, a data resource suitable for addressing this research question from a particular germline angle had started to become available in the 1980s. Thus, at the beginning of the decade, Kwok et al. (1981) had published the first report of an inherited single base-pair substitution that was causally related to a human disease state, namely a C to G transversion (Phe to Leu) at codon 24 of the insulin (INS) gene causing hyperproinsulinemia. Some 10 years later, the number of mutations reportedly underlying an inherited human disease state had increased considerably, including 139 non-synonymous single base-pair substitutions, 101 point mutations near mRNA splice sites and 60 microdeletions (20 years ago (Cooper et al. 2011). In particular, with the advent of high-throughput genotyping technologies, the relationship between the functional characteristics of a mutation and its likely disease relevance has gained enormous attention. Both SNP-based genome-wide association studies (GWAS) and whole exome or whole genome sequencing experiments


are hypothesis-free “fishing expeditions” that aim to unravel the molecular basis of genetic diseases. However, since the disease-relevant mutations present in the current human population are likely to number several hundreds of millions, with most of them being very rare or even private (Cooper et al. 2010), statistical approaches to assessing the impact of a newly found mutation, namely case–control comparisons or a prospective follow-up of carriers, are bound to fail. Therefore, it is not surprising that genetic epidemiology and bioinformatics have jointly picked up the idea of “predicting” the functional consequences of a mutation from auxiliary information that is either readily available or easy to obtain (Thusberg and Vihinen 2009; Frousios et al. 2013). The background and current state of the art of these endeavors will be reviewed briefly in the following section.

Predicting the functional consequences of mutation

According to the website of the International HapMap Project (see http://www.genome.gov), the human genome contains approximately 10 million common single nucleotide polymorphisms (SNPs), defined as single base-pair substitutions with a minor allele frequency >1 % in one or more human populations. Over the last 10 years, these SNPs have been the main target of genome-wide association studies (GWAS) into the genetic basis of common as well as rare diseases. Increasingly, these association-based studies are either supplemented or even replaced by sequencing-based variant searches using next-generation sequencing technologies. In any case, all these endeavors have generated comprehensive lists of genetic variants of possible disease relevance. Since the pathogenicity of a mutation will often not be immediately clear, particularly if the lesion in question has not been reported in a disease context before, additional information is required to distinguish between (potential) causality and neutrality. One major approach to this problem is in silico analysis using ‘prediction tools’.

The basic idea

Various software tools have been developed to assess the likelihood of a certain genetic variant having disadvantageous functional consequences. Since mutations resulting in an amino acid replacement, henceforth referred to as ‘non-synonymous single nucleotide variants (nsSNVs)’, are the best understood in terms of their biochemical and biophysical impact on the gene product, it is not surprising that considerable progress has been made in developing tools for these lesions in particular. The majority of nsSNV prediction tools are available as freeware and are comparatively easy to use. They are usually based on a combination of primary DNA sequence features,

Fig. 1  Generic framework of nsSNV prediction tools (adapted from Wu and Jiang 2013). An amino acid sequence variant is characterized by sequence features (primary structure, evolutionary conservation, biochemistry), structural features (secondary structure, tertiary structure, biochemistry) and annotation data (function, molecular pathology); an algorithm combines these into an output score.

structural features of the encoded mRNA or protein product, or auxiliary annotation data (e.g. whether the affected amino acid residue is located in a ligand binding site, or has been reported as being susceptible to mutation in a disease context before). A generic framework of nsSNV prediction tools is depicted in Fig. 1. The input of each tool usually comprises (1) a suitable description of the variant of interest and (2) the tool-specific amino acid sequence information required. The latter can be either the primary sequence data or a pointer to these data such as, for example, a SwissProt identifier. The features employed and the algorithms used for prediction (e.g. neural network, support vector machine) differ substantially between the tools, so that it is not readily clear which tool should be used. In fact, the optimal choice of a tool is likely to depend on the study in question (Thusberg et al. 2011). Therefore, and because of their popularity and methodological representativeness, we will confine our short review to just two nsSNV prediction tools, namely SIFT and PolyPhen-2 (Table 1).

SIFT (sorting intolerant from tolerant)

Sorting intolerant from tolerant (SIFT), a publicly available prediction tool developed by Ng and Henikoff (2001), uses amino acid sequence homology to assess the potential effect of an nsSNV, based on the assumption that lesions in evolutionarily conserved regions are more likely to affect protein function than lesions in variable regions (Ng and Henikoff 2006). SIFT entails a multistep procedure rating both the position and the type of the amino acid substitution of interest. First, it generates a multiple alignment of closely related sequences from SwissProt that may have similar function, followed by the calculation of the probabilities of all substitutions possible at the affected position. These probabilities are then normalized relative to the most probable substitution and the results compared


to a certain threshold. If the normalized probability of the observed nsSNV is below the threshold, it is classified as damaging; otherwise, the substitution is predicted to be tolerated. In addition to the binary decision about functionality and the underlying normalized probability, SIFT also outputs the median conservation of the sequences involved in the underlying alignment so as to allow appraisal of the quality of the latter. Moreover, together with the number of sequences involved in the alignment, this value also gives an indication of the reliability of the prediction.

Table 1  Characteristics of three selected in silico prediction tools
Characteristic | SIFT | PolyPhen-2 | NNSplice
Target | nsSNV | nsSNV | Splice site mutations
Algorithm | Sequence alignment | Bayes classifier | Neural network
Features | Amino acid sequence | Amino acid sequence, secondary and tertiary structure | DNA sequence
Input | Amino acid sequence or SwissProt ID or rs number or location, amino acid substitution | Amino acid sequence or SwissProt ID or rs number or location, amino acid substitution | DNA sequence (up to 100,000 bp)
Classification | Tolerated, damaging | Probably damaging, possibly damaging, benign, unknown | Exon/intron boundaries
Additional output | Number and median conservation of aligned sequences | False and true positive rate, protein structure | n.a.
URL | http://sift.jcvi.org/ | http://genetics.bwh.harvard.edu/pph2/ | http://www.fruitfly.org/seq_tools/splice.html
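The SIFT decision rule described above (normalize the substitution probabilities at the affected position by the most probable one, then compare the value for the observed substitution with a threshold) can be sketched as follows. This is only a minimal illustration, not SIFT's actual implementation: the probabilities are made up, the function names are hypothetical, and the cutoff of 0.05 is an assumption, since the text speaks only of "a certain threshold".

```python
def normalized_scores(position_probs):
    """Scale substitution probabilities so the most probable residue scores 1.0."""
    top = max(position_probs.values())
    if top <= 0:
        raise ValueError("at least one probability must be positive")
    return {aa: p / top for aa, p in position_probs.items()}


def classify_substitution(position_probs, mutant_aa, threshold=0.05):
    """Return (normalized score, label) for the mutant residue at one position.

    The threshold is an assumed value chosen for illustration.
    """
    score = normalized_scores(position_probs)[mutant_aa]
    label = "damaging" if score < threshold else "tolerated"
    return score, label


# Made-up probabilities for four residues at one conserved alignment position
# (illustrative only; a real alignment would cover all 20 amino acids).
example_probs = {"F": 0.80, "L": 0.02, "Y": 0.15, "W": 0.03}

for residue in ("L", "Y"):
    score, label = classify_substitution(example_probs, residue)
    print(f"{residue}: normalized score {score:.3f} -> {label}")
```

In this toy example, a residue that is rare in the alignment relative to the dominant residue falls below the cutoff, whereas a moderately frequent residue does not.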

Table 2  Performance of three selected in silico prediction tools
Tool | SIFT | PolyPhen-2 (a) | NNSplice
Sensitivity | 0.68 | 0.73 (0.86) | ~0.85
Specificity | 0.62 | 0.70 (0.51) | ~0.86
Matthews correlation coefficient | 0.30 | 0.43 (0.39) | n.a.

(a)  Values refer to either the Mendelian or the common disease (in brackets) training set of PolyPhen-2. SIFT and PolyPhen-2 were evaluated by Thusberg et al. (2011). NNSplice was assessed by Beth Hellen in her 2009 report to the National Genetics Reference Laboratory, Manchester UK (see text)
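The three performance measures reported in Table 2 are all derived from the four cells of a confusion matrix. The following sketch shows the standard formulas; the counts are hypothetical and were chosen only so that sensitivity and specificity equal the SIFT values in the table. They are not the counts used by Thusberg et al. (2011).

```python
import math


def performance(tp, fp, tn, fn):
    """Return (sensitivity, specificity, Matthews correlation coefficient)."""
    sensitivity = tp / (tp + fn)  # fraction of pathogenic variants detected
    specificity = tn / (tn + fp)  # fraction of neutral variants recognized
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, the MCC is set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


# Hypothetical balanced test set: 100 pathogenic and 100 neutral variants.
sens, spec, mcc = performance(tp=68, fp=38, tn=62, fn=32)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} MCC={mcc:.2f}")
```

Unlike sensitivity and specificity, the Matthews correlation coefficient depends on the ratio of pathogenic to neutral variants in the test set, so the same pair of rates can yield different coefficients for differently composed datasets.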

PolyPhen-2 (Polymorphism Phenotyping v2)

The PolyPhen-2 tool by Adzhubei et al. (2010) uses eight sequence-based and three structure-based features for prediction, with the most predictive features being (1) the probability of the two nsSNV alleles occupying the affected position in orthologous sequences (PSIC score), (2) the congruence of the mutant allele and the alignment, and (3) whether the nsSNV is located in a hypermutable CpG dinucleotide. The functional effect of the amino acid substitution is predicted by PolyPhen-2 using a naïve Bayes classifier that categorizes the nsSNV as “probably damaging”, “possibly damaging”, “benign” or “unknown”. The actual output score also serves to indicate the confidence in this prediction. PolyPhen-2 was trained on two different datasets to detect either (1) mutations with a drastic effect comparable to a Mendelian disease or (2) mildly deleterious variants such as those presumably underlying complex diseases. The PolyPhen-2 user must decide between these two settings when applying the tool.

Tool performance and further developments

Thusberg et al. (2011) compared some of the most widely used prediction tools with regard to their


sensitivity, specificity and concordance between predicted and true effect (measured by the Matthews correlation coefficient). As expected, no tool emerged as uniformly best. However, although SIFT and PolyPhen-2, like most of the other tools examined, were found to perform quite well (Table 2), it has been suggested repeatedly that more than one tool should be used to optimize prediction (Thusberg et al. 2011; Wu and Jiang 2013). Recently, the development of novel prediction tools has slowed down considerably, and the focus of current efforts is on the creation of pipelines such as, for example, PON-P (Olatubosun et al. 2012) or ANNOVAR (Wang et al. 2010) that allow many tools to be used simultaneously.

Point mutations at splice sites

In contrast to tools for nsSNVs, available tools to predict the functional effect of mutations affecting splice sites do not address this question directly. Instead, they evaluate whether a certain DNA sequence represents a potential splice site or not. If the probability of being a splice site changes notably due to the mutation under study, this may indicate a functional effect on splicing itself. In line


with the early work of Krawczak et al. (1992), splice site prediction tools focusing on conserved sequence elements of splice sites, such as the immediate context of the invariant dinucleotides, are more accurate than others (Baralle et al. 2009). In fact, most tools compare these ‘consensus sequences’ to the sequence affected by mutation and quantify the similarity of the two, albeit with different algorithms (e.g. neural networks, weight matrix models). As with nsSNVs, integrative tools combining different prediction methods for splice sites are available as well (e.g. Alamut from Interactive Biosoftware). However, although such combined tools are intuitively appealing, according to a 2009 report to the National Genetics Reference Laboratory (NGRL), Manchester UK (http://www.academia.edu/209605/Splice_Site_Tool_Analysis_Report), it is not entirely clear whether they really have improved prediction accuracy.

NNSplice (Neural Network Splice site prediction tool)

NNSplice, a popular splice site prediction tool developed by Reese et al. (1997), is based on a neural network trained by backpropagation (Table 1). It screens DNA sequences of between 41 and 100,000 bp in length for potential splice sites. Depending on the chosen threshold, it outputs the position of these putative splice sites together with a site-specific score. NNSplice works for both acceptor and donor splice sites, and does so reasonably well (Table 2).
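The indirect logic described above, namely scoring the wild-type and the mutant sequence as candidate splice sites and interpreting the difference, can be illustrated with a simple weight matrix model. Note that this is a swapped-in technique and not the neural network used by NNSplice; the nucleotide frequencies below are invented for illustration and do not come from any real splice site collection, and all names are hypothetical.

```python
import math

# Made-up nucleotide frequencies for a five-position donor-site window
# (positions -1, +1, +2, +3, +4 relative to the exon/intron boundary).
TOY_DONOR_FREQS = [
    {"A": 0.10, "C": 0.05, "G": 0.80, "T": 0.05},  # -1, last exon base
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},  # +1, near-invariant G
    {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},  # +2, near-invariant T
    {"A": 0.60, "C": 0.05, "G": 0.30, "T": 0.05},  # +3
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},  # +4
]
BACKGROUND = 0.25  # assumed uniform background nucleotide frequency


def log_odds_score(site):
    """Sum of per-position log2 odds (site model versus background)."""
    if len(site) != len(TOY_DONOR_FREQS):
        raise ValueError("site must match the window length of the matrix")
    return sum(
        math.log2(column[base] / BACKGROUND)
        for column, base in zip(TOY_DONOR_FREQS, site.upper())
    )


def score_change(wild_type, mutant):
    """Mutant score minus wild-type score; negative values weaken the site."""
    return log_odds_score(mutant) - log_odds_score(wild_type)


wild_type = "GGTAA"
mutant = "GATAA"  # G>A substitution at position +1
delta = score_change(wild_type, mutant)
print(f"wild type {log_odds_score(wild_type):.2f}, change {delta:.2f}")
```

A strongly negative change at one of the near-invariant positions would suggest loss of the splice site, but, as with any such score, whether splicing is actually disrupted remains an experimental question.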

Conclusions

Since their publication in the early 1990s, the basic idea underlying the three honored bioinformatics papers published in Human Genetics (Cooper and Krawczak 1990; Krawczak and Cooper 1991; Krawczak et al. 1992) has been corroborated in a myriad of ways (Cooper et al. 2011). Moreover, their Bayesian type of reasoning is echoed by current tools for predicting the functional consequences of mutation. These tools may be extremely helpful in further elucidating the genetic basis of human disease. However, it must never be forgotten that the outcome of these algorithms is statistical in nature, as were the conclusions of the three original Human Genetics papers. Moreover, mutation prediction tools are only as good as the data that have been used in their construction, yet mutation datasets vary greatly in terms of their relevance, accuracy and comprehensiveness (Johnston and Biesecker 2013; Peterson et al. 2013). Thus, even if a variant looks damaging in silico with one or more prediction tools, careful medical and experimental verification will always be required to establish whether it is indeed damaging in the individual case.

Acknowledgments  The authors are most grateful to Amke Caliebe, Kiel, for her support and for helpful comments on the manuscript.

References

Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, Kondrashov AS, Sunyaev SR (2010) A method and server for predicting damaging missense mutations. Nat Methods 7:248–249
Ball EV, Stenson PD, Abeysinghe SS, Krawczak M, Cooper DN, Chuzhanova NA (2005) Microdeletions and microinsertions causing human genetic disease: common mechanisms of mutagenesis and the role of local DNA sequence complexity. Hum Mutat 26:205–213
Baralle D, Lucassen A, Buratti E (2009) Missed threads. The impact of pre-mRNA splicing defects on clinical practice. EMBO Rep 10:810–816
Benzer S (1961) On the topography of the genetic fine structure. Proc Natl Acad Sci USA 47:403–415
Cooper DN, Krawczak M (1990) The mutational spectrum of single base-pair substitutions causing human genetic disease: patterns and predictions. Hum Genet 85:55–74
Cooper DN, Chen JM, Ball EV, Howells K, Mort M, Phillips AD, Chuzhanova N, Krawczak M, Kehrer-Sawatzki H, Stenson PD (2010) Genes, mutations, and human inherited disease at the dawn of the age of personalized genomics. Hum Mutat 31:631–655
Cooper DN, Bacolla A, Férec C, Vasquez KM, Kehrer-Sawatzki H, Chen JM (2011) On the sequence-directed nature of human gene mutation: the role of genomic architecture and the local DNA sequence environment in mediating gene mutations underlying human inherited disease. Hum Mutat 32:1075–1099
Frousios K, Iliopoulos CS, Schlitt T, Simpson MA (2013) Predicting the functional consequences of non-synonymous DNA sequence variants: evaluation of bioinformatics tools and development of a consensus strategy. Genomics 102:223–228
Grantham R (1974) Amino acid difference formula to help explain protein evolution. Science 185:862–864
Johnston JJ, Biesecker LG (2013) Databases of genomic variation and phenotypes: existing resources and future needs. Hum Mol Genet 22(R1):R27–R31
Krawczak M, Cooper DN (1991) Gene deletions causing human genetic disease: mechanisms of mutagenesis and the role of the local DNA sequence environment. Hum Genet 86:425–441
Krawczak M, Reiß J, Cooper DN (1992) The mutational spectrum of single base-pair substitutions in mRNA splice junctions of human genes: causes and consequences. Hum Genet 90:41–54
Krawczak M, Ball EV, Cooper DN (1998) Neighboring nucleotide effects on the rates of germline single base-pair substitution in human genes. Am J Hum Genet 63:474–488
Kwok SC, Chan SJ, Rubenstein AH, Poucher R, Steiner DF (1981) Loss of a restriction endonuclease cleavage site in the gene of a structurally abnormal human insulin. Biochem Biophys Res Commun 98:844–849
Ng PC, Henikoff S (2001) Predicting deleterious amino acid substitutions. Genome Res 11:863–874
Ng PC, Henikoff S (2006) Predicting the effects of amino acid substitutions on protein function. Annu Rev Genomics Hum Genet 7:61–80
Olatubosun A, Väliaho J, Härkönen J, Thusberg J, Vihinen M (2012) PON-P: integrated predictor for pathogenicity of missense variants. Hum Mutat 33:1166–1174
Peterson TA, Doughty E, Kann MG (2013) Towards precision medicine: advances in computational approaches for the analysis of human variants. J Mol Biol 425:4047–4063
Reese MG, Eeckman FH, Kulp D, Haussler D (1997) Improved splice site detection in Genie. J Comp Biol 4:311–323
Stenson PD, Mort M, Ball EV, Shaw K, Phillips AD, Cooper DN (2013) The Human Gene Mutation Database: building a comprehensive mutation repository for clinical and molecular genetics, diagnostic testing and personalized genomic medicine. Hum Genet (in press). doi:10.1007/s00439-013-1358-4
Thusberg J, Vihinen M (2009) Pathogenic or not? And if so, then how? Studying the effects of missense mutations using bioinformatics methods. Hum Mutat 30:703–714
Thusberg J, Olatubosun A, Vihinen M (2011) Performance of mutation pathogenicity prediction methods on missense variants. Hum Mutat 32:358–368
Wang K, Li M, Hakonarson H (2010) ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 38:e164
Wu J, Jiang R (2013) Prediction of deleterious nonsynonymous single-nucleotide polymorphism for human diseases. ScientificWorldJournal 2013:675851
