Hum Genet DOI 10.1007/s00439-015-1551-8

ORIGINAL INVESTIGATION

Prevalence estimation for monogenic autosomal recessive diseases using population‑based genetic data

Steven J. Schrodi1,2 · Andrea DeBarber3 · Max He1,2 · Zhan Ye1 · Peggy Peissig1 · Jeffrey J. Van Wormer1 · Robert Haws1 · Murray H. Brilliant1,2 · Robert D. Steiner1,4

Received: 5 January 2015 / Accepted: 5 April 2015 © Springer-Verlag Berlin Heidelberg 2015

Abstract

Genetic methods can complement epidemiological surveys and clinical registries in determining prevalence of monogenic autosomal recessive diseases. Several large population-based genetic databases, such as the NHLBI GO Exome Sequencing Project, are now publicly available. By assuming Hardy–Weinberg equilibrium, the frequency of individuals homozygous in the general population for a particular pathogenic allele can be directly calculated from a sample of chromosomes where some harbor the pathogenic allele. Further assuming that the penetrance of the pathogenic allele(s) is known, the prevalence of recessive phenotypes can be determined. Such work can inform public health efforts for rare recessive diseases. A Bayesian estimation procedure has yet to be applied to the problem of estimating disease prevalence from large population-based genetic data. Here, a Bayesian framework is developed to derive the posterior probability density of the prevalence of monogenic, autosomal recessive phenotypes. Explicit equations are presented for the credible intervals of these disease prevalence estimates. A primary impediment to performing accurate disease prevalence calculations is the determination of truly pathogenic alleles. This issue is discussed, but in many instances remains a significant barrier to investigations solely reliant on statistical interrogation; functional studies can provide important information for solidifying evidence of variant pathogenicity. We also discuss several challenges to these efforts, including the population structure in the sample of chromosomes, the treatment of allelic heterogeneity, and reduced penetrance of pathogenic variants. To illustrate the application of these methods, we utilized recently published genetic data collected on a large sample from the Schmiedeleut Hutterites. We estimate prevalence and calculate 95 % credible intervals for 13 autosomal recessive diseases using these data. In addition, the Bayesian estimation procedure is applied to data from a central European study of hereditary fructose intolerance. The methods described herein show a viable path to robustly estimating both the expected prevalence of autosomal recessive phenotypes and corresponding credible intervals using population-based genetic databases that have recently become available. As these genetic databases increase in number and size with the advent of cost-effective next-generation sequencing, we anticipate that these methods and approaches may be helpful in recessive disease prevalence calculations, potentially impacting public health management, health economic analyses, and treatment of rare diseases.

* Steven J. Schrodi [email protected]
Robert D. Steiner [email protected]
Andrea DeBarber [email protected]
Peggy Peissig [email protected]
Jeffrey J. Van Wormer [email protected]
Max He [email protected]
Zhan Ye [email protected]
Robert Haws [email protected]
Murray H. Brilliant [email protected]

1 Center for Human Genetics, Marshfield Clinic Research Foundation, 1000 N Oak Ave‑MLR, Marshfield, WI 54449, USA
2 Computation and Informatics in Biology and Medicine, University of Wisconsin-Madison, 425 Henry Mall Rm 3420, Madison, WI 53706, USA
3 Physiology and Pharmacology Department, Oregon Health and Science University, Portland, USA
4 Pediatrics Department, University of Wisconsin School of Medicine and Public Health, Madison, WI, USA

Background

For many rare inherited diseases, registries and a variety of survey-based approaches have traditionally been the primary sources of information concerning disease prevalence. With the advent and commercialization of massively parallel sequencing (i.e., next-generation sequencing), large-scale population-based sequencing studies are increasingly common (e.g., Tennessen et al. 2012). Such studies have expanded our knowledge of the frequency of loss-of-function variants in the human genome (e.g., MacArthur et al. 2012). With these studies, it is now possible to utilize allele frequency information from large numbers of sequenced chromosomes or genotyped samples to obtain estimates of disease prevalence using genetic data and specialized statistical techniques (e.g., Kim et al. 2008; Fitterer et al. 2014). This provides prevalence information that may complement epidemiological and clinical research, potentially revealing the extent of undiagnosed cases that are not typically captured by registries or population-screening approaches. In some instances, this information may also be useful in understanding the extent of embryonic lethality and neonatal fatalities prior to diagnosis ascertainment, which affect prevalence estimates. This use of genetic data is particularly useful for phenotypes inherited in an autosomal recessive manner. The extent to which monogenic recessive diseases are underdiagnosed is of considerable clinical importance. Studies have suggested that the number of undiagnosed cases may be high for several disorders, for example Smith–Lemli–Opitz syndrome and spinal muscular atrophy (Battaile et al. 2001; Nowaczyk et al. 2006; Lyahyai et al. 2012). More accurate quantification of disease prevalence could lead to more efficient allocation of disease prevention and early treatment resources, such as screening techniques. The discovery of alleles that rescue these phenotypes in asymptomatic individuals may inform novel therapeutic approaches and targeted treatment for


all patients affected with these diseases. That said, many biases present in registry and epidemiological studies can also be present in sequence databases, distorting conclusions that may be drawn from such data. Therefore, caution in the selection of population-based sequence data is of critical importance. The general idea of the genetic approach is to make use of Hardy–Weinberg Equilibrium (HWE) (Hardy 1908; Weinberg 1908) to estimate the prevalence of a monogenic, autosomal recessive phenotype based on pathogenic allele frequencies from a population-based sample. Under the random mating assumption in a large population, the frequency of homozygotes for a disease-causing allele is simply the square of the frequency of that allele. Hence, from a sample of chromosomes, one can determine the frequency of chromosomes harboring at least one pathogenic allele segregating at a gene that causes a disease being investigated. Providing that pathogenic variants are fully penetrant, the frequency of individuals with two pathogenic alleles (pathogenic homozygotes and compound heterozygotes—i.e., affected individuals) can then be inferred. This approach can provide accurate disease prevalence estimates even in the absence of directly observing affected cases in that sample. However, there are several cautionary aspects to account for which can produce erroneous prevalence estimates: (i) the accuracy of assignment of variant pathogenicity is critical for appropriate calculations. The inclusion/exclusion of variants known to be pathogenic is of paramount importance, for the inclusion of falsely pathogenic variants will overestimate disease prevalence and, conversely, the exclusion of truly pathogenic variants will underestimate disease prevalence (MacArthur et al. 2014). (ii) Equating the frequency of individuals homozygous for pathogenic alleles with the frequency of the recessive disease implicitly assumes that the alleles are fully penetrant. 
Relaxing this assumption lowers the estimated disease prevalence (Cooper et al. 2013). (iii) If the sample set used in allele frequency calculation is not representative of the target population (or is composed of individuals derived from different subpopulations), prevalence estimates will not be appropriate. (iv) Rigorous estimates will not rely on the assumption that pathogenic alleles are mutually exclusive on the chromosomes on which they reside. For example, if there is linkage disequilibrium between pathogenic alleles, then this correlation needs to be accounted for when calculating the frequency of affected individuals. (v) The genetic technique used to capture the pathogenic variants in the sample set should be unbiased with respect to the calling of such variants. Lastly, (vi) a statistically rigorous approach should be employed when calculating point estimates, confidence intervals/credible intervals, and the probability density of the disease prevalence. We attempt to address these challenges by developing a Bayesian estimation procedure for the disease prevalence, presenting quantitative expressions for the pitfalls, and then applying these methods to several disease prevalence estimation examples.

Methods

The approach proposed here is to use Bayes’ theorem to express the posterior probability of the prevalence of a monogenic, autosomal recessive phenotype. We assume a sample is analyzed from the general population through an unbiased genotyping method to determine pathogenic alleles in the sample; sequencing approaches are largely unbiased with regard to the variants detected in the sample set. Credible intervals can then be directly calculated from the posterior probability density.

Definitions

Assume initially that the monogenic disease is caused by a fully penetrant single allele in homozygous form. Denote n as the number of diploid individuals in the sample set sequenced. Let q be the population frequency of the causative allele. Denote X as a random variable representing the number of chromosomes in the sample of 2n chromosomes harboring the causative allele, where x is the observed number of causative allele chromosomes in the sample.

Basic point estimate

Under HWE, a simple point estimate for the frequency of affected individuals in the general population is $\left(\frac{x}{2n}\right)^2$. However, the estimation can be improved, and additional statistical properties revealed, through calculation of the posterior probability densities of $q$ and $q^2$.

Bayesian estimate

Using Bayes’ theorem, we can write the posterior density of the allele frequency as

$$P(q \mid X = x; 2n) = P(X \mid q)\,P(q)\left[\int_0^1 P(X \mid q)\,P(q)\,dq\right]^{-1}. \tag{1}$$

We model the prior probability of the population allele frequency as a beta variate for two primary reasons: the beta density has the appropriate characteristics for a frequency (e.g., bounded by the unit interval), and most population genetic models for a site frequency spectrum of a biallelic marker are well modeled by a beta variate (e.g., Wright 1931, 1937; Ewens 2004). Let the parameters of the beta variate be v and w. Hence, Eq. (1) can be rewritten

$$P(q \mid X = x; 2n) = \frac{\binom{2n}{x}\,\beta^{-1}(v, w)\,q^{x+v-1}(1-q)^{2n-x+w-1}}{\int_0^1 \binom{2n}{x}\,\beta^{-1}(v, w)\,q^{x+v-1}(1-q)^{2n-x+w-1}\,dq} \tag{2}$$

$$= \frac{\Gamma(2n+v+w)}{\Gamma(2n-x+w)\,\Gamma(x+v)}\,q^{x+v-1}(1-q)^{2n-x+w-1}. \tag{3}$$

Since the limit of the posterior probability, as the beta parameters converge to 0, is

$$\lim_{v,w \to 0} \frac{\Gamma(2n+v+w)}{\Gamma(2n-x+w)\,\Gamma(x+v)}\,q^{x+v-1}(1-q)^{2n-x+w-1} = \frac{(2n-1)!}{(2n-x-1)!\,(x-1)!}\,q^{x-1}(1-q)^{2n-x-1}, \tag{4}$$

the following form is suggested for situations where v and w are negligible:

$$P(q \mid X = x; 2n) \approx \frac{(2n-1)!}{(2n-x-1)!\,(x-1)!}\,q^{x-1}(1-q)^{2n-x-1}. \tag{5}$$

If v and/or w are not negligible, then they can be estimated from large databases of autosomal pathogenic alleles (e.g., Stenson et al. 2014) using standard maximum likelihood or moment-matching methods (Evans et al. 1993). To explore the effect of ignoring the beta parameters, v and w, we calculated the ratio of the expectation of q with v and w to the expectation with the parameters set equal to zero:

$$\frac{E[q]}{E[q]_{v,w \to 0}} = \frac{2n(x+v)}{x(2n+v+w)}. \tag{6}$$
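As a rough numerical illustration of Eq. (6), the following sketch evaluates the ratio of the posterior means with and without the beta prior. The function name `expectation_ratio` is hypothetical, the default prior values are the estimates quoted in this section (0.008828 and 3.564552), and the chosen (x, n) pairs are arbitrary; the sketch assumes x ≥ 1.

```python
def expectation_ratio(x, n, v=0.008828, w=3.564552):
    """Eq. (6): E[q] under a Beta(v, w) prior divided by E[q] with v, w -> 0.

    x: observed causative-allele chromosomes (x >= 1); n: diploid sample size.
    """
    two_n = 2 * n
    mean_with_prior = (x + v) / (two_n + v + w)  # mean of Beta(x + v, 2n - x + w)
    mean_without_prior = x / two_n               # mean of Beta(x, 2n - x)
    return mean_with_prior / mean_without_prior


# Percent departure from unity for a few illustrative (x, n) pairs
for n in (200, 1000, 100_000):
    for x in (1, 10, 100):
        departure = 100 * abs(expectation_ratio(x, n) - 1)
        print(f"n={n:>6} x={x:>3} departure={departure:.4f} %")
```

A departure close to zero indicates that the simplified posterior of Eq. (5) is an adequate stand-in for Eq. (3) at that sample size.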

Using the NCBI Variation Viewer (http://www.ncbi.nlm.nih.gov/variation/view/), we found 23,139 autosomal variants that were in ClinVar and annotated as being pathogenic. Extracting frequency data that were available from 1000 Genomes and GO-ESP, we estimated the beta parameters as $\hat{v} = 0.008828$ and $\hat{w} = 3.564552$. Using these values, we calculated Eq. (6) for varying values of the number of chromosomes carrying a causative allele (x), for different numbers of diploid samples ($n = 200, 500, 10^3, 10^4, 10^5, 10^6$). The results are presented in Fig. 1. We calculated less than a 1 % departure between the expectation using the beta parameters and the expectation ignoring these parameters across all diploid sample sizes examined and all values of x, and in much of this parameter space we observed a less than 0.05 % departure between the expectations. Values of n smaller than those evaluated will be cause for concern when ignoring the beta parameters. To obtain an understanding of the dispersion of the disease-causing allele frequency in the general population, a

100(1 − α) % Bayesian credible interval can be calculated using Eq. (5). The upper extent of the interval, U, is obtained by solving for the lower limit of integration such that the posterior probability to its right is equal to α/2. Similarly, the lower extent, L, is obtained by solving for the upper limit of integration such that the posterior probability to its left also equals α/2. Hence, the 100(1 − α) % credible interval for the population allele frequency is (L, U) such that

$$\int_U^1 q^{x-1}(1-q)^{2n-x-1}\,dq = \frac{\alpha}{2}\,\frac{(2n-x-1)!\,(x-1)!}{(2n-1)!} \tag{7}$$

and

$$\int_0^L q^{x-1}(1-q)^{2n-x-1}\,dq = \frac{\alpha}{2}\,\frac{(2n-x-1)!\,(x-1)!}{(2n-1)!}. \tag{8}$$

Standard numerical methods can be employed to rapidly converge upon the values of L and U.

Fig. 1  Sensitivity analysis of the beta parameters

Assuming HWE, one can then calculate the posterior probability density of the frequency of homozygous individuals in the population by applying the square transformation to Eq. (5):

$$P(q^2 \mid X = x; 2n) \approx \frac{(2n-1)!}{2\,(2n-x-1)!\,(x-1)!}\,q^{\frac{x}{2}-1}\left(1-\sqrt{q}\right)^{2n-x-1}, \tag{9}$$

where, on the right-hand side of Eqs. (9)–(12), q denotes the value of the homozygote frequency (the squared allele frequency) rather than the allele frequency itself. We use the expectation of this posterior probability density as the Bayesian point estimate for the disease prevalence, assuming full penetrance:

$$E\left[q^2 \mid x; 2n\right] = \frac{(2n-1)!}{2\,(2n-x-1)!\,(x-1)!}\int_0^1 q^{\frac{x}{2}}\left(1-\sqrt{q}\right)^{2n-x-1}\,dq. \tag{10}$$

Typically, this estimate is fairly close to the basic point estimate shown previously. Similar to the credible interval calculation for the population allele frequency in Eqs. (7) and (8), the 100(1 − α) % credible interval can be calculated for the frequency of homozygous individuals in the population:

$$\int_U^1 q^{\frac{x}{2}-1}\left(1-\sqrt{q}\right)^{2n-x-1}\,dq = 2\left(\frac{\alpha}{2}\right)\frac{(2n-x-1)!\,(x-1)!}{(2n-1)!} \tag{11}$$

and

$$\int_0^L q^{\frac{x}{2}-1}\left(1-\sqrt{q}\right)^{2n-x-1}\,dq = 2\left(\frac{\alpha}{2}\right)\frac{(2n-x-1)!\,(x-1)!}{(2n-1)!}. \tag{12}$$

Again, standard numerical methods can be used to rapidly solve for (L, U). Under the additional assumption of full penetrance, the above results for homozygous frequencies, Eqs. (9)–(12), can be used to model the prevalence of a monogenic autosomal recessive disease. In situations where the investigator would like to perform the above calculations using the full Bayesian model with the beta variate parameters, the corresponding expressions can be obtained in a manner analogous to the above derivation. One important factor not considered in the above Bayesian derivations is the incorporation of an inbreeding coefficient. As the density of inbreeding coefficients is dependent on complicated population genetic models, formal treatment of the inbreeding coefficient in the Bayesian framework is outside the scope of this work. Thus, the above Bayesian results are the most applicable in situations where consanguinity is negligible. We do apply these results to a founder population in this work, but that is done primarily for the purpose of illustrating the application of these methods. We do incorporate the inbreeding coefficient below in a Fisherian/frequentist point estimate of prevalence.

Allelic heterogeneity and linkage disequilibrium

Most monogenic recessive phenotypes can be caused by multiple alleles segregating at the causative gene. For example, CFTR, the gene that, when dysfunctional at both chromosomal copies, causes cystic fibrosis, has many hundreds of reported variants with putative pathogenic effects (CFTR2 website http://www.cftr2.org/index.php; OMIM, Cystic fibrosis transmembrane conductance regulator CFTR http://omim.org/entry/602421). Appropriate consolidation of the effects of multiple pathogenic alleles requires some care. Simply summing the frequency of pathogenic alleles in the sample to use for the calculation of q will


generate an overestimate of the disease prevalence, for chromosomes harboring two or more pathogenic variants will be erroneously counted at least twice. To obtain the proper count of pathogenic chromosomes, one would need to subtract the number of chromosomes carrying two or more pathogenic alleles such that the total number of pathogenic chromosomes in the sample counts those with one or greater than one disease-causing allele(s). Hence, with cystic fibrosis, as well as the majority of monogenic recessive phenotypes, any combination of two pathogenic alleles in trans form (i.e., homozygotes or compound heterozygotes for pathogenic variants) can produce the autosomal recessive disorder. If we consider that q in the above equations represents the frequency of any chromosome harboring at least one fully penetrant, pathogenic allele and x is the number of chromosomes with at least one fully penetrant, pathogenic allele, then the posterior probability of the disease prevalence can be calculated with Eq. (9). Likewise, Eqs. (11) and (12) can be used to obtain a specified credible interval. Phase of these pathogenic alleles often cannot be robustly determined in deep sequencing studies, as appropriate family members may not be recruited as part of the experimental design or the alleles segregate at frequencies that are very small. Therefore, identification of chromosomes that harbor more than one pathogenic variant may be difficult. However, this may not be as problematic as it appears. Firstly, the effect on the proper number that constitutes x in the sample set should only be mildly affected by this miscount if the frequency of pathogenic alleles in the population is small. Secondly, rare pathogenic alleles are typically—but not always—in relative linkage equilibrium with other rare pathogenic alleles. 
The chromosome upon which a pathogenic mutation arises will be randomized and the probability of two or more rare events (pathogenic mutations) occurring on the same chromosome is small per unit time. Forward simulation models, and particularly those under mutation– selection balance substantiate these claims (King et al. 2010; Browning and Thompson 2012; Thornton et al. 2013). If we therefore model the pathogenic alleles as being in very low linkage disequilibrium, then an appropriate point estimate of the frequency of chromosomes carrying at least one pathogenic allele is the sum of the individual pathogenic variant counts with various products of the counts across variants subtracted. In the instance of two pathogenic alleles (x1 and x2), the appropriate point estimate for the frequency of individuals homozygous for at least one pathogenic allele is



$$\left[\frac{1}{2n}x_1 + \frac{1}{2n}x_2 - D - \left(\frac{1}{2n}\right)^2 x_1 x_2\right]^2; \tag{13}$$

where D is the standard measure of pairwise linkage disequilibrium, $D = P_{12} - q_1 q_2$, $P_{12}$ is the frequency of the haplotype carrying both pathogenic alleles, $q_1$ is the frequency of one pathogenic allele, and $q_2$ is the frequency of the other pathogenic allele. Under linkage equilibrium, Eq. (13) will simplify to

$$\left[\frac{1}{2n}\left(x_1 + x_2\right) - \frac{1}{4n^2}x_1 x_2\right]^2. \tag{14}$$
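The linkage-equilibrium calculation of Eq. (14), and its extension to more than two variants, can be sketched as follows. This is an illustration only: the function names are hypothetical, the counts are made up, and the multi-variant form assumes that all variants are mutually independent (D = 0 for every pair and no higher-order association).

```python
from math import prod


def carrier_chromosome_freq(freqs):
    """Frequency of chromosomes carrying at least one pathogenic allele,
    assuming all variants are in linkage equilibrium (independent)."""
    return 1.0 - prod(1.0 - q for q in freqs)


def homozygote_freq_eq14(x1, x2, two_n):
    """Eq. (14): two variants typed on the same 2n chromosomes, with D = 0."""
    return ((x1 + x2) / two_n - x1 * x2 / two_n**2) ** 2


# Hypothetical counts, for illustration only
x1, x2, two_n = 120, 32, 3000
q = carrier_chromosome_freq([x1 / two_n, x2 / two_n])
print(q**2, homozygote_freq_eq14(x1, x2, two_n))  # the two values agree
```

For two variants the product form reduces to q1 + q2 − q1q2, which is the bracketed term of Eq. (14); working with frequencies rather than counts also allows variants typed on different numbers of chromosomes to be combined.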

The linkage equilibrium assumption carries considerable utility in circumventing phased data, and one can extend this line of reasoning to an arbitrary number of variants.

Population structure and inbreeding

Numerous instances exist where the frequency of a pathogenic variant varies considerably across subpopulations. Indeed, many variants are present in one subpopulation, but entirely absent in others. This pattern of variation is inherent with rare variants that are likely to be mutations occurring in the recent past (Kimura and Ohta 1973; Griffiths and Tavare 1998; Fu et al. 2013). If the sample of individuals sequenced is derived from two or more subpopulations, then errors can occur with the estimation of the number of individuals homozygous for pathogenic variants. To analytically explore this effect, suppose that the sequenced sample set is composed of two different subsets, each derived from a different ancestral subpopulation. One example of this is the NHLBI Exome Sequencing Project (ESP), which has samples from European Americans and African Americans (Tennessen et al. 2012; Tabor et al. 2014). Implicitly treating such sample sets as being obtained from a panmictic population can lead to poor estimates of the frequency of homozygotes. Modeling this situation, let $q_1$ and $q_2$ be the frequency of a fully penetrant, pathogenic allele in subpopulations 1 and 2, respectively. Similarly, denote $n_1$ and $n_2$ as the number of diploid samples obtained from subpopulations 1 and 2, respectively. Lastly, let $x_1$ and $x_2$ be the number of chromosomes harboring the pathogenic allele in subpopulations 1 and 2, respectively. Provided that the sample set is representative of a larger subdivided population, then the point estimate of the disease prevalence should be



$$\frac{n_1}{n_1+n_2}\left(\frac{x_1}{2n_1}\right)^2 + \frac{n_2}{n_1+n_2}\left(\frac{x_2}{2n_2}\right)^2. \tag{15}$$

However, if one is ignorant of the underlying structure, and calculates the point estimate of the disease prevalence from the pooled allele frequency, the result would be

$$\left(\frac{x_1+x_2}{2(n_1+n_2)}\right)^2. \tag{16}$$

The ratio of Eq. (16) to Eq. (15), which, when compared with unity, would offer information regarding the extent of the bias in the estimation, is therefore

$$\frac{n_1 n_2\,(x_1+x_2)^2}{(n_1+n_2)\left(n_2 x_1^2 + n_1 x_2^2\right)}. \tag{17}$$
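A small numerical check of Eqs. (15)–(17) is sketched below. The function names and the counts are hypothetical and chosen only to make the bias visible.

```python
def stratified_prevalence(x1, n1, x2, n2):
    """Eq. (15): sample-size-weighted average of subpopulation homozygote frequencies."""
    total = n1 + n2
    return (n1 / total) * (x1 / (2 * n1)) ** 2 + (n2 / total) * (x2 / (2 * n2)) ** 2


def pooled_prevalence(x1, n1, x2, n2):
    """Eq. (16): homozygote frequency from the pooled allele frequency."""
    return ((x1 + x2) / (2 * (n1 + n2))) ** 2


def bias_ratio(x1, n1, x2, n2):
    """Eq. (17): ratio of Eq. (16) to Eq. (15)."""
    return n1 * n2 * (x1 + x2) ** 2 / ((n1 + n2) * (n2 * x1**2 + n1 * x2**2))


# Hypothetical sample: equal subsample sizes, threefold difference in allele counts
counts = (10, 1000, 30, 1000)
print(pooled_prevalence(*counts) / stratified_prevalence(*counts))
print(bias_ratio(*counts))
```

With these made-up counts both printed values are approximately 0.8, i.e. the pooled estimate is about 20 % lower than the stratified one.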

As expected, the ratio is unity when $n_1 = n_2$ and $x_1 = x_2$. Setting $n_1 = n_2$, $x_1 = x$ and $x_2 = x + \delta$, and differentiating, one can show that the first derivative with respect to δ,

$$-\frac{\delta x(2x+\delta)}{\left(2x^2+2\delta x+\delta^2\right)^2},$$

is 0 when the ratio is unity (i.e., where δ = 0), and the corresponding second derivative with respect to δ, evaluated at that point, is negative, $-\frac{1}{2x^2}$. Hence, we conclude that Eq. (16) always underestimates the frequency of homozygotes for the pathogenic allele for nontrivial cases. This effect is akin to the Wahlund effect on heterozygosity (Wahlund 1928).

One critical assumption of Hardy–Weinberg equilibrium is random mating of individuals. If one incorporates a nonzero level of inbreeding, then the number of homozygotes is increased as compared to that expected under Hardy–Weinberg equilibrium. If p is the frequency of chromosomes with at least one fully penetrant, pathogenic allele and F is the inbreeding coefficient, then the frequency of individuals homozygous for pathogenic chromosomes is $p^2 + p(1-p)F$ (Wright 1921). This suggests the basic estimator of homozygote frequency:

$$\left(\frac{x}{2n}\right)^2 + \frac{1}{4n^2}\left[x(2n-x)\right]\hat{F}; \tag{18}$$

where $\hat{F}$ is an estimator of the inbreeding coefficient for the population investigated. The selection of $\hat{F}$ depends on several factors including sample size and the extent and type of genetic data available. Several studies have compared different types of $\hat{F}$ estimators (e.g., Li et al. 2011).
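The estimator in Eq. (18) is simple to evaluate. The sketch below uses the oculocutaneous albinism type 1A counts (186 of 2562 chromosomes) and the inbreeding coefficient estimate of 0.0327 quoted in the Results of this paper; the function name is hypothetical.

```python
def homozygote_freq_with_inbreeding(x, two_n, f_hat):
    """Eq. (18): (x/2n)^2 + x(2n - x) * F_hat / (2n)^2."""
    p = x / two_n
    return p**2 + p * (1 - p) * f_hat


# Counts and F_hat as quoted in this paper for OCA1A (TYR p.Cys91Tyr)
x, two_n, f_hat = 186, 2562, 0.0327
hwe = homozygote_freq_with_inbreeding(x, two_n, 0.0)        # F_hat = 0 recovers (x/2n)^2
adjusted = homozygote_freq_with_inbreeding(x, two_n, f_hat)
print(f"HWE={hwe:.5f} adjusted={adjusted:.5f} fold={adjusted / hwe:.3f}")
```

The fold increase depends on the allele frequency: because the added term is p(1 − p)F, the relative effect of inbreeding is largest for the rarest alleles.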

Reduced penetrance

A critical assumption in this process of estimating disease prevalence is the presumption that the pathogenic alleles are fully penetrant. Several recent empirical studies have provided evidence that variants previously thought to be fully penetrant can be found in unaffected individuals (e.g., Cooper et al. 2013; Xue et al. 2012). Different mechanisms can account for these effects: for example, the original studies suggesting fully penetrant effects were largely based on evaluation of those individuals with disease without extensive investigation in the general population, an ascertainment bias; and in some instances, other variants or additional gene copy numbers can rescue disease phenotypes (e.g., Zhang et al. 2013; Ricard et al. 2010). To account for reduced penetrance of individuals homozygous for a pathogenic allele, one simply needs to scale the prevalence estimate by the penetrance (e.g., assuming a single pathogenic allele with penetrance of, say 0.50, the expected prevalence would be half the expected prevalence of a similar, fully penetrant allele).

Results and discussion

To demonstrate the use of the Bayesian estimation methods developed above, we first apply these techniques to genetic data collected on the Schmiedeleut Hutterite population in the United States (Chong et al. 2012). Following the Hutterite application, we also apply these methods to data collected on 2000 central European samples investigating hereditary fructose intolerance (Santer et al. 2005). For small, isolated populations, the variance in pathogenic allele frequencies is expected to be larger than the variance in large, outbred populations. That is, some pathogenic alleles will exhibit much higher frequencies in the small, isolated population than a larger population; and, conversely, other pathogenic alleles will be much lower or absent in the small, isolated population, compared to a larger, outbred population. This increased variance is postulated to be the result of founder effects, and to a lesser extent, more pronounced genetic drift. For example, as the results below will illustrate, the 508del allele at CFTR (rs113993960) segregates at a frequency approximately two- to threefold lower in the Schmiedeleut Hutterite population compared to Northern European populations. However, the 1101Lys variant in the same gene is present on roughly 4 % of the chromosomes in the Schmiedeleut Hutterites, compared to less than 0.1 % of European individuals used in the CS Agilent ClinSeq population (1 in 1324 chromosomes) (http://www.ncbi.nlm.nih.gov/SNP/snp_viewTable.cgi?pop=15248). The Schmiedeleut Hutterite population is composed of 177 colonies located in Manitoba, North Dakota, South Dakota, and Minnesota. They are one of three groups of Hutterites in North America, with a total of 462 colonies (http://www.hutterites.org/). The Hutterites are an Anabaptist religious community that originated in the Tyrolean Alps (Hostetler 1974).
The North American population of Hutterites has grown to approximately 45,000 individuals (http://www.hutterites.org/). For the sake of illustrating the methods developed here, let us assume that all colonies are of similar size and therefore estimate the current population of Schmiedeleut Hutterites to number 17,240. This value will be used to estimate the prevalence of 13 autosomal recessive diseases studied by Chong and colleagues. In the Chong et al. study, direct genotyping was used to determine the number of chromosomes carrying pathogenic mutations for a variety of autosomal recessive diseases in 1644 subjects. Table 1 presents a summary of data found in Chong et al. (2012). We use this summary information on the number of chromosomes carrying pathogenic alleles in the sample set to determine the posterior probability of the disease prevalence and 95 % credible interval using Eqs. (9)–(12), thereby demonstrating the use of these equations in the Schmiedeleut Hutterite population (Table 2). We present these analyses simply as an example of how these Bayesian methods can be applied to population-based genetic data; there are other effects such as the distribution of kinship coefficients among mating pairs and the extent of geographical heterogeneity in pathogenic allele frequencies among the three Hutterite groups that should be accounted for to obtain accurate Bayesian credible intervals.

Table 1  Summary of allelic counts in the Chong et al. publication

| Disease | Gene/mutation | Total Chr | Mut Chr |
| --- | --- | --- | --- |
| Limb-girdle muscular dystrophy 2H | TRIM32 p.Asp487Asn | 2986 | 246 |
| Oculocutaneous albinism type 1A | TYR p.Cys91Tyr | 2562 | 186 |
| Spinal muscular atrophy type III | SMN1 exon 7 del | 2830 | 183 |
| Limb-girdle muscular dystrophy 2I | FKRP p.Leu276Ile | 2254 | 127 |
| Sitosterolemia | ABCG8 p.Ser107Ter | 3030 | 135 |
| Joubert syndrome | TMEM237 p.Arg18Ter | 3040 | 122 |
| Cystic fibrosis | CFTR p.Met1101Lys | 2946 | 120 |
| Nonsyndromic mental retardation | TECR p.Pro182Leu | 2992 | 113 |
| Restrictive dermopathy | ZMPSTE24 c.1085dupT | 2722 | 87 |
| Nonsyndromic deafness | GJB2 c.35delG | 3020 | 54 |
| Dilated cardiomyopathy with ataxia | DNAJC19 IVS3-1G>C | 3008 | 42 |
| Bardet–Biedl syndrome | BBS2 IVS3-2A>G | 3036 | 42 |
| Usher syndrome type 1F | PCDH15 c.147delT | 3048 | 38 |
| Cystic fibrosis | CFTR p.Phe508del | 2964 | 32 |

Table 2  Calculated disease prevalence and 95 % credible intervals, without inbreeding

| Disease | Point estimate (95 % credible interval) |
| --- | --- |
| Limb-girdle muscular dystrophy 2H | 0.00681 (0.00530–0.00856) |
| Oculocutaneous albinism type 1A | 0.00530 (0.00395–0.00688) |
| Spinal muscular atrophy type III | 0.00420 (0.00312–0.00548) |
| Limb-girdle muscular dystrophy 2I | 0.00320 (0.00223–0.00439) |
| Sitosterolemia | 0.00200 (0.00121–0.00272) |
| Joubert syndrome | 0.00162 (0.00112–0.00225) |
| Cystic fibrosis (Met1101Lys) | 0.00167 (0.00115–0.00232) |
| Nonsyndromic mental retardation | 0.00144 (0.000976–0.00201) |
| Restrictive dermopathy | 0.00103 (0.000660–0.00151) |
| Nonsyndromic deafness | 0.000326 (0.000181–0.000524) |
| Dilated cardiomyopathy with ataxia | 0.000200 (0.000102–0.000340) |
| Bardet–Biedl syndrome | 0.000196 (0.0000998–0.000334) |
| Usher syndrome type 1F | 0.000159 (0.0000781–0.000279) |
| Cystic fibrosis (Phe508del) | 0.000120 (0.0000547–0.000220) |
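The following sketch shows one way the Table 2 entries might be approximated from the Table 1 counts. It rests on several assumptions that are not stated in the paper: the simplified posterior of Eq. (5) is treated as a Beta(x, 2n − x) density, the point estimate uses the standard closed form for the second moment of a beta variate, the interval is taken to be equal-tailed, and the quantiles are found by a plain trapezoid integration on a grid (the paper does not specify its numerical method). The function names are hypothetical.

```python
import math


def beta_quantiles(a, b, probs, grid=200_000):
    """Approximate quantiles of Beta(a, b), a > 1 and b > 1, by trapezoid
    integration of the density on a uniform grid over (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    h = 1.0 / grid
    targets = sorted(probs)
    found, k = [], 0
    cdf, prev_pdf = 0.0, 0.0  # the density is 0 at q = 0 when a > 1
    for i in range(1, grid):
        q = i * h
        pdf = math.exp(log_norm + (a - 1) * math.log(q) + (b - 1) * math.log1p(-q))
        step = 0.5 * (prev_pdf + pdf) * h
        while k < len(targets) and cdf + step >= targets[k]:
            # linear interpolation inside the current grid cell
            found.append(q - h + h * (targets[k] - cdf) / step)
            k += 1
        if k == len(targets):
            break
        cdf, prev_pdf = cdf + step, pdf
    return found


def prevalence_summary(x, two_n, alpha=0.05):
    """Posterior mean of q^2 and an equal-tailed credible interval,
    assuming q | x ~ Beta(x, 2n - x) and full penetrance."""
    point = x * (x + 1) / (two_n * (two_n + 1))  # E[q^2] for Beta(x, 2n - x)
    lo_q, hi_q = beta_quantiles(x, two_n - x, [alpha / 2, 1 - alpha / 2])
    return point, lo_q**2, hi_q**2  # q -> q^2 is monotone on (0, 1)


# Three rows of Table 1: (disease, mutant chromosomes, total chromosomes)
for name, x, two_n in [("LGMD2H", 246, 2986), ("OCA1A", 186, 2562), ("BBS", 42, 3036)]:
    point, lo, hi = prevalence_summary(x, two_n)
    print(f"{name}: {point:.3g} ({lo:.3g}-{hi:.3g})")
```

Squaring the allele-frequency quantiles is equivalent to solving for the limits on the homozygote-frequency scale, because the square transformation is monotone on the unit interval.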

Example 1a: oculocutaneous albinism type 1A

Oculocutaneous albinism type 1A (OCA1A) occurs in individuals who are homozygous or compound heterozygous for pathogenic mutations in the gene encoding tyrosinase (TYR) (Tomita et al. 1989; Tripathi et al. 1992). The compromised oxidative activity of pathogenic tyrosinase results in little or no production of melanin from tyrosine, which affects skin, hair and eye pigmentation and decreases visual acuity. The Cys91Tyr missense variant segregates at a relatively high frequency in the Schmiedeleut Hutterite population (Chong et al. 2012). Chong and colleagues do not present confidence intervals or credible sets for disease prevalence. Assuming full penetrance, the posterior probability of OCA1A in this population is displayed in Fig. 2. The basic frequentist point estimate, $\left(\frac{x}{2n}\right)^{2}$, for OCA1A in this population would be 0.00527, and if we adjust by the inbreeding coefficient ($\hat{F} = 0.0327$; Ober et al. 1998) using Eq. (18), this prevalence estimate increases to 0.00747. Using the Bayesian approach, we calculate the expected prevalence to be 1 in 189 individuals, with the 95 % credible interval extending from 1 in 253 to 1 in 145 individuals (Table 2). Across the entire Schmiedeleut Hutterite population, we anticipate between 68 and 119 individuals to be affected with OCA1A due to homozygosity for Cys91Tyr, assuming no consanguinity. Further, given the estimated inbreeding coefficient, the number may be between 96 and 169. Table 3 shows the fold increase in the expected prevalence using Eq. (18).

Fig. 2  Posterior probability of oculocutaneous albinism type 1A

Table 3  Fold increase in the expected prevalence incorporating inbreeding

Disease                                   Fold increase in expected prevalence
Limb-girdle muscular dystrophy 2H         1.364
Oculocutaneous albinism type 1A           1.418
Spinal muscular atrophy type III          1.473
Limb-girdle muscular dystrophy 2I         1.548
Sitosterolemia                            1.701
Joubert syndrome                          1.782
Cystic fibrosis (Met1101Lys)              1.770
Nonsyndromic mental retardation           1.833
Restrictive dermopathy                    1.990
Nonsyndromic deafness                     2.796
Dilated cardiomyopathy with ataxia        3.309
Bardet–Biedl syndrome                     3.331
Usher syndrome type 1F                    3.590
Cystic fibrosis (Phe508del)               3.996

Example 1b: Bardet–Biedl syndrome

Bardet–Biedl syndrome (BBS) is an autosomal recessive ciliopathy affecting multiple organ systems, with features including renal dysfunction, retinal degeneration, postaxial polydactyly, and central obesity (Beales et al. 1999). BBS2 encodes a protein in the BBSome complex, which is critical in ciliogenesis, cell migration and division (Hernandez-Hernandez et al. 2013). Using the allele frequency in the Schmiedeleut Hutterite sample from the work of Chong and colleagues, we calculate the posterior probability of Bardet–Biedl syndrome attributable to the BBS2 IVS3-2A>G variant (Fig. 3). The expected prevalence, without consanguinity, was calculated to be 1 in 5105 individuals, with a 95 % credible set of 1 in 10,017 to 1 in 2992. Given our assumptions, within the total Schmiedeleut Hutterite population, this credible interval spans 1.72 to 5.76 individuals with BBS due to homozygosity for IVS3-2A>G. Using Eq. (18), the estimated inbreeding coefficient increases these numbers considerably: a 3.3-fold increase in the expected prevalence is projected. Hence, the number of individuals with BBS could be as high as 6 to 19.
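The inbreeding adjustment quoted for OCA1A can be checked numerically. The sketch below assumes that Eq. (18) takes the standard form q² + Fq(1 − q), where q is the pathogenic allele frequency; this form is an assumption here (it is consistent with Eq. (19) and reproduces the reported values), and the function name is illustrative.

```python
import math

def inbred_homozygote_freq(q, f):
    """Homozygote frequency with inbreeding coefficient f.

    Assumed form of Eq. (18): q^2 + f * q * (1 - q).
    """
    return q * q + f * q * (1.0 - q)

hwe_prevalence = 0.00527   # point estimate (x / 2n)^2 reported for OCA1A
f_hat = 0.0327             # estimated inbreeding coefficient (Ober et al. 1998)

q_hat = math.sqrt(hwe_prevalence)            # implied allele frequency x / 2n
adjusted = inbred_homozygote_freq(q_hat, f_hat)
fold = adjusted / hwe_prevalence

print(f"adjusted prevalence: {adjusted:.5f}")   # 0.00747
print(f"fold increase: {fold:.3f}")             # 1.418
```

Under this assumed form, the adjusted estimate and the fold increase agree with the 0.00747 quoted in the text and the OCA1A entry of Table 3.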




Example 1c: cystic fibrosis

Cystic fibrosis is an obstructive exocrine gland disorder affecting multiple organ systems (Rowe et al. 2005). The disease is the most common high-mortality inherited disease in those of European ancestry. The segregation of cystic fibrosis follows a monogenic, autosomal recessive pattern, and the disease was the focus of one of the earliest efforts in linkage mapping, resulting in the discovery of the CFTR gene on chromosome 7q31 (Riordan et al. 1989). Two pathogenic variants were genotyped in the Schmiedeleut Hutterites by Chong and colleagues. As phase was not directly measured in the sample set, we follow our previously described guidelines and assume linkage equilibrium between Met1101Lys and Phe508del. The two alleles can then be combined and the resulting prevalence of cystic fibrosis due to these two variants can be calculated (Fig. 4). Compound heterozygotes are included in the calculation of the cystic fibrosis prevalence. The expected prevalence is 0.002617, or 1 in 382 individuals, with a 95 % credible interval of 1 in 532 to 1 in 285 individuals.

Fig. 3  Posterior probability of Bardet–Biedl syndrome

Fig. 4  Cystic fibrosis

Example 2: hereditary fructose intolerance

The previous examples used data from the isolated Schmiedeleut Hutterite population. However, we anticipate that much of the utility of the methods described above will be in the application to randomly selected individuals from much larger populations. Let us therefore analyze data from a large population. Santer and colleagues investigated pathogenic mutations in ALDOB in a mutation screening study of 2000 randomly selected newborns, primarily of central European ancestry (Santer et al. 2005). Loss of function of ALDOB, the gene that encodes aldolase B, causes hereditary fructose intolerance (HFI), a defect in the metabolism of fructose. In the study, A150P, A175D, and N335K were the most frequent mutations. The counts of chromosomes harboring these specific mutations, among the 4000 chromosomes screened, totalled 21, all of them found in heterozygotes (Santer et al. 2005). From these data, the authors calculate a point estimate of HFI prevalence due to these three mutations of 1 in 36,300, with a 95 % confidence interval of 1 in 17,500 to 1 in 110,000. Although the authors do not state how these confidence intervals were calculated, a reasonable assumption is that a normal approximation to the binomial, the Wilson score interval, or a similar method was employed. Applying our Bayesian framework to these data generates somewhat different results, with an expectation of 1 in 34,641 and a 95 % credible interval of 1 in 16,807 to 1 in 94,471. Therefore, the Bayesian results depart moderately from those presented in Santer et al.: the expected HFI prevalence is higher by 4 %, and the low end of the credible set is 14 % higher than the low end of the confidence interval reported in that study.

Mirroring the history of human genetics, the populations sequenced for genetic studies are expected to be diverse: some studies may focus on small, founder populations, whereas other studies may use samples from very large outbred populations. Under the neutral model, it is well known from diffusion, recursion, and coalescent approaches that heterozygosity is equal to a curvilinear function of θ, $\frac{\theta}{1+\theta}$, where θ is the product of 4, the effective population size (N), and the neutral mutation rate (μ) for the region being examined, θ = 4Nμ (Kimura and Crow 1964).
The homozygosity is therefore $1 - \frac{\theta}{1+\theta} = \frac{1}{1+\theta}$. Substituting this expression for the inbreeding coefficient in Eq. (18) yields the following estimate of the homozygote frequency in terms of θ:

$$\frac{x(2n + \theta x)}{4n^{2}(1 + \theta)}. \qquad (19)$$
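The algebra behind Eq. (19) can be verified with exact rational arithmetic. This sketch again assumes that Eq. (18) has the form q² + Fq(1 − q) with q = x/(2n), and sets F to the neutral-model homozygosity 1/(1 + θ). The counts and the value of θ are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

def homozygote_freq_inbreeding(x, n, f):
    """Assumed form of Eq. (18): q^2 + F*q*(1 - q), with q = x / (2n)."""
    q = Fraction(x, 2 * n)
    return q * q + f * q * (1 - q)

def homozygote_freq_theta(x, n, theta):
    """Eq. (19): x(2n + theta*x) / (4 n^2 (1 + theta))."""
    return Fraction(x) * (2 * n + theta * x) / (4 * n * n * (1 + theta))

x, n = 21, 2000            # hypothetical: 21 alleles among 2n = 4000 chromosomes
theta = Fraction(3, 2)     # hypothetical value of theta = 4*N*mu
f = 1 / (1 + theta)        # homozygosity under the neutral model

via_eq18 = homozygote_freq_inbreeding(x, n, f)
via_eq19 = homozygote_freq_theta(x, n, theta)

print(via_eq18 == via_eq19)                                   # True
print(homozygote_freq_theta(x, n, 0) == Fraction(x, 2 * n))   # True
```

The second check shows one limit: at θ = 0 the expression reduces to x/(2n), the allele frequency itself. As θ grows large it approaches the Hardy–Weinberg value (x/2n)².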

Another classical result from population genetics is that mutation–selection balance on recessive phenotypes generates standing pathogenic allele frequencies equal to the square root of the mutation rate to pathogenic alleles, $\sqrt{\mu}$, at equilibrium (e.g., Hartl and Clark 1989). This result assumes that the selection against recessive genotypes is complete (i.e., the selection coefficient is unity). Hence, assuming mutation–selection balance, the expectation for the frequency of homozygote individuals [e.g., Eq. (10)] is also an estimate of the mutation rate to the pathogenic alleles driving that recessive phenotype, specifically the pathogenic alleles used in the prevalence calculation.
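The equilibrium can be illustrated with a simple deterministic recursion. The sketch below is a textbook-style model, not taken from this paper: each generation applies selection against recessive homozygotes (selection coefficient s) followed by one-way mutation to the pathogenic allele at rate μ. The starting frequency, mutation rate, and number of generations are hypothetical.

```python
import math

def next_generation(q, mu, s=1.0):
    """One generation of selection against recessive homozygotes, then mutation."""
    # Allele frequency after selection (genotype fitnesses 1, 1, 1 - s).
    q_sel = q * (1.0 - s * q) / (1.0 - s * q * q)
    # One-way mutation from normal to pathogenic alleles at rate mu.
    return q_sel + mu * (1.0 - q_sel)

def equilibrium_frequency(mu, q0=0.05, generations=20_000):
    """Iterate the recursion from q0 for a fixed number of generations."""
    q = q0
    for _ in range(generations):
        q = next_generation(q, mu)
    return q

mu = 1e-6                       # hypothetical mutation rate to pathogenic alleles
q_eq = equilibrium_frequency(mu)

print(f"equilibrium allele frequency: {q_eq:.6e}")
print(f"sqrt(mu):                     {math.sqrt(mu):.6e}")
print(f"homozygote frequency q^2:     {q_eq ** 2:.6e}")
```

With s = 1, the fixed point of this recursion satisfies q² = μ, so the homozygote frequency at equilibrium equals the mutation rate, as stated above.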

Conclusions

Estimating the prevalence of monogenic autosomal recessive phenotypes can add considerably to our understanding of the burden of rare diseases in a population. Population-screening approaches to obtain recessive disease prevalence estimates can be augmented by such genetic calculations to quantify undiagnosed or asymptomatic cases. These methods are particularly timely as large population-based genetic databases are created and expanded. The genetic approach based on existing information in databases is very accessible and provides prevalence estimates that are complementary to screening efforts. As has been discussed routinely in the statistical literature, Bayesian methods have many desirable properties, including interpretability, the incorporation of prior information, consistency, and the treatment of unknown parameters as random variables (Jaynes 1976). Previous estimation attempts using genetic data have focused primarily on point estimates of homozygote frequency; however, a few have gone further and employed frequentist procedures and asymptotic distributions to obtain confidence intervals. As with any asymptotic procedure, error is introduced by the assumption that non-limiting distributions behave as limiting distributions. Further, the rate of convergence to the limiting distribution is not always well understood, and therefore the error may be largely unknown.

Here we present a Bayesian derivation for the probability density of the population allele frequency conditioned on the observed count of the allele in a sample set of chromosomes. Equations are then presented for the expectation and posterior probability of the homozygote frequency in the sourced population. From this posterior density, expressions for the 95 % credible set are shown and calculated in the applications. Indeed, any 100(1−α) % credible interval can be calculated with Eqs. (11) and (12). In addition, we present and discuss several challenges to these estimation procedures in the setting of monogenic autosomal recessive phenotypes: we show the appropriate treatment of allelic heterogeneity, the utility of the linkage equilibrium assumption, sampling from structured populations, and the inverse effect of reduced penetrance. Together, these facets of the prevalence estimation approach give researchers a more complete view of these procedures. We apply these probabilistic techniques to data collected by Chong and colleagues from a large sample of the Schmiedeleut Hutterites and draw conclusions concerning the expected prevalence (and 95 % credible interval) of 13 diseases in this population. Considerable increases in the expected disease prevalence resulted from the incorporation of an estimated inbreeding coefficient in the Schmiedeleut Hutterite population. Large, outbred populations would not be subject to such dramatic increases. Results of analyses using three of these diseases are presented in detail, showing the posterior probability densities of the disease prevalence using a robust probabilistic model. In addition, we compare results from our methods to those presented in a study of HFI. These results demonstrate the utility of performing the Bayesian calculations described herein. In turn, these Bayesian calculations will enable accurate disease prevalence density calculations based on existing genetic data.
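As a rough numerical check on the HFI comparison (21 pathogenic alleles among 4000 chromosomes; Santer et al. 2005), the sketch below assumes that the posterior of the allele frequency q is Beta(x, 2n − x). This form is an assumption made here because it reproduces the reported expectation of 1 in 34,641; it is not a restatement of Eqs. (11) and (12). The credible interval is approximated by Monte Carlo sampling instead of the explicit expressions, and all function names are illustrative.

```python
import random
from fractions import Fraction

def expected_homozygote_freq(x, chromosomes):
    """E[q^2] when q ~ Beta(x, chromosomes - x) (assumed posterior)."""
    return Fraction(x * (x + 1), chromosomes * (chromosomes + 1))

def credible_interval_mc(x, chromosomes, level=0.95, draws=200_000, seed=1):
    """Equal-tailed interval for q^2, approximated by sampling q from the posterior."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(x, chromosomes - x) ** 2 for _ in range(draws))
    tail = (1.0 - level) / 2.0
    lower = samples[int(tail * draws)]
    upper = samples[int((1.0 - tail) * draws) - 1]
    return lower, upper

x, chromosomes = 21, 4000                      # counts from Santer et al. (2005)
point = Fraction(x, chromosomes) ** 2          # frequentist point estimate (x / 2n)^2
mean = expected_homozygote_freq(x, chromosomes)
lower, upper = credible_interval_mc(x, chromosomes)

print(f"point estimate: 1 in {float(1 / point):,.0f}")
print(f"posterior mean: 1 in {float(1 / mean):,.0f}")
print(f"95% interval:   1 in {1 / upper:,.0f} to 1 in {1 / lower:,.0f}")
```

The point estimate and posterior mean evaluate to about 1 in 36,281 and 1 in 34,641, respectively. The Monte Carlo interval should fall close to the reported 1 in 16,807 to 1 in 94,471, up to sampling error.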

References

Battaile KP, Battaile BC, Merkens LS, Maslen CL, Steiner RD (2001) Carrier frequency of the common mutation IVS8-1G>C in DHCR7 and estimate of the expected incidence of Smith–Lemli–Opitz syndrome. Mol Genet Metab 72(1):67–71
Beales PL, Elcioglu N, Woolf AS, Parker D, Flinter FA (1999) New criteria for improved diagnosis of Bardet–Biedl syndrome: results of a population survey. J Med Genet 36:437–446
Browning SR, Thompson EA (2012) Detecting rare variant associations by identity-by-descent mapping in case–control studies. Genetics 190:1521–1531
Chong JX, Ouwenga R, Anderson RL, Waggoner DJ, Ober C (2012) A population-based study of autosomal-recessive disease-causing mutations in a founder population. Am J Hum Genet 91:608–620
Cooper DN, Krawczak M, Polychronakis C, Tyler-Smith C, Kehrer-Sawatzki H (2013) Where genotype is not predictive of phenotype: towards an understanding of the molecular basis of reduced penetrance in human inherited disease. Hum Genet 132:1077–1130
Evans M, Hastings N, Peacock B (1993) Statistical Distributions, 2nd edn. Wiley, USA
Ewens WJ (2004) Mathematical Population Genetics, 2nd edn. Springer-Verlag, New York
Fitterer B, Hall P, Antonishyn N, Desikan R, Gelb M, Lehotay D (2014) Incidence and carrier frequency of Sandhoff disease in Saskatchewan determined using novel substrate with detection by tandem mass spectrometry and molecular genetic analysis. Mol Genet Metab 111(3):382–389
Fu W, O’Connor TD, Jun G, Kang HM et al (2013) Analysis of 6515 exomes reveals the recent origin of most human protein-coding variants. Nature 493(7431):216–220
Griffiths RC, Tavare S (1998) The age of a mutation in a general coalescent tree. Commun Stat Stoch Models 14(1–2):273–295
Hardy GH (1908) Mendelian proportions in a mixed population. Science 28(706):49–50
Hartl DL, Clark AG (1989) Principles of population genetics, 2nd edn. Sinauer Associates, Sunderland
Hernandez-Hernandez V, Pravincumar P, Diaz-Font A et al (2013) Bardet–Biedl syndrome proteins control the cilia length through regulation of actin polymerization. Hum Mol Genet 22(19):3858–3868
Hostetler JA (1974) Hutterite Society. Johns Hopkins University Press, Baltimore
Jaynes ET (1976) Confidence intervals vs Bayesian intervals. In: Harper AL, Hooker CA (eds) Foundations of probability, statistical inference, and statistical theories of science
Kim GH, Yang JY, Park JY, Lee JJ, Kim JH, Yoo HW (2008) Estimation of Wilson’s disease incidence and carrier frequency in the Korean population by screening ATP7B major mutations in newborn filter papers using the SYBR green intercalator method on the amplification refractory mutation system. Genet Test 12(3):395–399
Kimura M, Crow JF (1964) The number of alleles that can be maintained in a finite population. Genetics 49:725–738
Kimura M, Ohta T (1973) The age of a neutral mutant persisting in a finite population. Genetics 75:199–212
King CR, Rathouz PJ, Nicolae DL (2010) An evolutionary framework for association testing in resequencing studies. PLoS Genet 6:e1001202
Li M-H, Stranden I, Tiirikka T, Sevon-Aimonen M-L, Kantanen J (2011) A comparison of approaches to estimate the inbreeding coefficient and pairwise relatedness using genomic and pedigree data in a sheep population. PLoS One 6(11):e26256
Lyahyai J, Sbiti A, Barkat A, Ratbi I, Sefiani A (2012) Spinal muscular atrophy carrier frequency and estimated prevalence of the disease in Moroccan newborns. Genet Testing Mol Biomark 16(3):215–218
MacArthur DG, Balasubramanian S, Frankish A, Huang N et al (2012) A systematic survey of loss-of-function variants in the human protein-coding genes. Science 335(6070):823–828
MacArthur DG, Manolio TA, Dimmock DP, Rehm HL et al (2014) Guidelines for investigating causality of sequence variants in human disease. Nature 508:469–476
Nowaczyk MJM, Waye JS, Douketis JD (2006) DHCR7 mutation carrier rates and prevalence of the RSH/Smith–Lemli–Opitz syndrome: where are the patients? Am J Med Genet Part A 140A:2057–2062
Ober C, Cox NJ, Abney M, DiRienzo A et al (1998) Genome-wide search for asthma susceptibility loci in a founder population. The Collaborative Study on the Genetics of Asthma. Hum Mol Genet 7(9):1393–1398
Ricard G, Molina J, Chrast J, Gu W et al (2010) Phenotypic consequences of copy number variation: insights from Smith–Magenis and Potocki–Lupski Syndrome mouse models. PLoS Biol 8(11):e1000543
Riordan JR, Rommens JM, Kerem B, Alon N et al (1989) Identification of the cystic fibrosis gene: cloning and characterization of complementary DNA. Science 245:1066–1073
Rowe SM, Miller S, Sorscher EJ (2005) Cystic fibrosis. N Engl J Med 352:1992–2001
Santer R, Rischewski J, von Weihe M, Niederhaus M et al (2005) The spectrum of Aldolase B (ALDOB) mutations and the prevalence of hereditary fructose intolerance in Central Europe. Hum Mut 25(6):594
Stenson PD, Mort M, Ball EV, Shaw K, Phillips AD, Cooper DN (2014) The Human Gene Mutation Database: building a comprehensive mutation repository for clinical and molecular genetics, diagnostic testing and personalized genomic medicine. Hum Genet 133(1):1–9
Tabor HK, Auer PL, Jamal SM, Chong JX et al (2014) Pathogenic variants for Mendelian and complex traits in exomes of 6517 European and African Americans: implications for the return of incidental results. Am J Hum Genet 95(2):183–193
Tennessen JA, Bigham AW, O’Connor TD, Fu W, Kenny EE, Gravel S, McGee S, Do R, Liu X, Jun G, Kang HM, Jordan D, Leal SM, Gabriel S, Rieder MJ, Abecasis G, Altshuler D, Nickerson DA, Boerwinkle E, Sunyaev S, Bustamante CD, Bamshad MJ, Akey JM, Broad GO, Seattle GO, on behalf of the NHLBI Exome Sequencing Project (2012) Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 337(6069):64–69 (PMID: 22604720)
Thornton KR, Foran AJ, Long AD (2013) Properties and modeling of GWAS when complex disease risk is due to non-complementing, deleterious mutations in genes of large effect. PLoS Genet 9(2):e1003258
Tomita Y, Takeda A, Okinaga S, Tagami H, Shibahara S (1989) Human oculocutaneous albinism caused by single base insertion in the tyrosinase gene. Biochem Biophys Res Commun 164:990–996
Tripathi RK, Droetto S, Spritz RA (1992) Many patients with ‘tyrosinase-positive’ oculocutaneous albinism have tyrosinase gene mutations (abstract). Am J Hum Genet 51(suppl):A179
Wahlund S (1928) Zusammensetzung von Population und Korrelationserscheinung vom Standpunkt der Vererbungslehre aus betrachtet. Hereditas 11:65–106
Weinberg W (1908) Über den Nachweis der Vererbung beim Menschen. Jahreshefte des Vereins für vaterländische Naturkunde in Württemberg 64:368–382
Wright S (1921) Systems of mating, I-V. Genetics 6:111–178
Wright S (1931) Evolution in Mendelian populations. Genetics 16:97–159
Wright S (1937) The distribution of gene frequencies in populations. Proc Natl Acad Sci 31(12):382–389
Xue Y, Chen Y, Ayub Q, Huang N et al (2012) Deleterious- and disease-allele prevalence in healthy individuals: insights from current predictions, mutation databases, and population-scale resequencing. Am J Hum Genet 91(6):1022–1032
Zhang L, Karsten P, Hamm S, Pogson JH et al (2013) TRAP1 rescues PINK1 loss-of-function phenotypes. Hum Mol Genet 22(14):2829–2841
