Genetic Epidemiology

RESEARCH ARTICLE Bayesian Latent Variable Collapsing Model for Detecting Rare Variant Interaction Effect in Twin Study 1,6 ¨ 2,3 Samuli Ripatti,1,4,5 and Janne Pitkaniemi ¨ Liang He,1 ∗ Mikko J. Sillanpa¨ a, 1

Department of Public Health, Hjelt Institute, University of Helsinki, Finland; 2 Department of Mathematical Sciences, University of Oulu, Oulu, Finland; 3 Department of Biology and Biocenter Oulu, University of Oulu, Oulu, Finland; 4 Institute for Molecular Medicine Finland FIMM, University of Helsinki, Finland; 5 Human Genetics, Wellcome Trust Sanger Institute, United Kingdom; 6 Finnish Cancer Registry, Institute for Statistical and Epidemiological Cancer Research, Helsinki, Finland

Received 13 October 2013; Revised 28 February 2014; accepted revised manuscript 28 February 2014. Published online 9 April 2014 in Wiley Online Library (wileyonlinelibrary.com). DOI 10.1002/gepi.21804

ABSTRACT: By analyzing more next-generation sequencing data, researchers have affirmed that rare genetic variants are widespread among populations and likely play an important role in complex phenotypes. Recently, a handful of statistical models have been developed to analyze rare variant (RV) association in different study designs. However, due to the scarce occurrence of minor alleles in data, appropriate statistical methods for detecting RV interaction effects are still difficult to develop. We propose a hierarchical Bayesian latent variable collapsing method (BLVCM), which circumvents the obstacles by parameterizing the signals of RVs with latent variables in a Bayesian framework and is parameterized for twin data. The BLVCM can tackle nonassociated variants, allow both protective and deleterious effects, capture SNP-SNP synergistic effect, provide estimates for the gene level and individual SNP contributions, and can be applied to both independent and various twin designs. We assessed the statistical properties of the BLVCM using simulated data, and found that it achieved better performance in terms of power for interaction effect detection compared to the Granvil and the SKAT. As proof of practical application, the BLVCM was then applied to a twin study analysis of more than 20,000 gene regions to identify significant RVs associated with low-density lipoprotein cholesterol level. The results show that some of the findings are consistent with previous studies, and we identified some novel gene regions with significant SNP–SNP synergistic effects. Genet Epidemiol 38:310–324, 2014. © 2014 Wiley Periodicals, Inc.

KEY WORDS: Bayesian collapsing model; genetic association; LDL-C; rare variant; twin study

Introduction Genome-wide association (GWA) studies have turned out successful in identifying variants that are significantly related to various complex diseases [McCarthy et al., 2008]. The current GWA studies and methods are founded on the common disease-common variant (CD-CV) hypothesis [Becker, 2004], which predicts that CD-associating alleles, or variants, will be found in all human populations that manifest a given disease. However, although over 1,500 genetic loci have been associated with CD and traits, they only account for a small proportion of heritability for most diseases [Eichler et al., 2010; Frazer et al., 2009; Hindorff et al., 2011]. Many potential sources of missing heritability have been examined [Manolio et al., 2009], including another competing so-called CD, rare variant (CD-RV) hypothesis, which states that causal DNA sequence variation at any gene could encompass a wide range of possibilities, with the most extreme being that each mutation is only found once in the population [Iyengar and Elston, 2007]. Simulations have shown that such Supporting Information is available in the online issue at wileyonlinelibrary.com. ∗

Correspondence to: Liang He, Department of Public Health, Hjelt Institute, Univer-

sity of Helsinki, P.O. Box 41, FI-00014, Finland. E-mail: [email protected]

hypothesis is possible [Robinson, 2010]. By analyzing more next-generation sequencing data, researchers have affirmed that rare genetic variants are widespread among populations and probably play an important role in complex phenotypes [Tennessen et al., 2012]. To elucidate the missing heritability, the genetic association studies have been focused on RVs. Due to low minor allele frequencies (MAFs) of RVs, even with high penetrance, most existing statistical methods for CV detection might drastically lose power or not work properly. In order to improve the power, many recently proposed analyses focus on the idea of pooling and collapsing instead of individually investigating each variant [Han and Pan, 2010; Li and Leal, 2008; Madsen and Browning, 2009; Morris and Zeggini, 2010; Price et al., 2010]. In these methods, multiple RVs are collapsed together, which increase the frequencies of occurrence so that they outperform single variant-based tests. These methods differ in the way of aggregating RVs, e.g., with or without explicit weighting and criterion for weighting. Thus, given different threshold values or variant information such as MAF, such algorithms can give rather different results. However, the pooling strategy has some drawbacks and limitations. If the proportion of noncausal RVs unrelated to the phenotype, is high, pooling such RVs will lower the  C 2014 WILEY PERIODICALS, INC.

statistical power to detect association. Furthermore, causal RVs can be either protective or deleterious, which means that an RV can affect the phenotype either positively or negatively. In such case the pooling method may worsen the statistical power. Recently, some novel approaches have been proposed to tackle this situation. For example, model selection using stepwise regression can be utilized in order to determine whether an RV should be incorporated into a pooling group and if pooled, what is the direction of association [Hoffmann et al., 2010]. The kernel-based adaptive clustering (KBAC) method models the effect of pooled RVs using a mixture of kernels [Liu and Leal, 2010]. The KBAC framework combines databased adaptive variant classification and testing of association so that it avoids misclassification of variants, which is found in the naive pooling methods. Neale et al. [2011] introduced the C-alpha test statistic for testing the presence of a mixture of effects across a set of RVs. Basu and Pan [2011] provided a comprehensive review and comparison of these statistical methods. In a recent study [Ladouceur et al., 2012], a more detailed power comparison using Sanger sequencing data evaluated the pros and cons of these methods in different scenarios. More recently, Bayesian framework has been introduced into RV association detection. Yi et al. introduced a Bayesian model of multiple RVs, which performed better than the Collapsing and Weight-sum methods [Yi and Zhi, 2011], however, it cannot assess the variant-based contribution. Another approach based on the Bayesian framework has been introduced to answer not only the question of a global association, but also the question of which variants are driving that association [Quintana et al., 2011]. Such an approach searches all possible models of different combinations and directions of RVs so that it is powerful both globally and marginally. However, due to its complexity of model space and sampling procedures, it suffers from much more computation time than other tests restricting its practical use for a large number of RVs. The CD-RV hypothesis claims that most of the variance for certain complex diseases is due to many RVs of large effect. However, such model along with CD-CV is still insufficient to explain the missing heritability. A broad sense heritability model posits that genotype-by-genotype interactions (G × G interactions) and genotype-by-environment interactions (G × E interactions) play an important role in quantitative genetics research [Gibson, 2012]. Some reports [Gerke et al., 2009; Meyers et al., 2010] also suggest that synergistic combinations may carry information that cannot be discovered by interrogating main effects of individual SNPs alone. Although many approaches have been proposed for identifying CV interactions, nonadditive or synergistic interactions between RVs are difficult to detect in GWASs because of the low power. With the increase of variants available from the next generation sequencing technologies, there is an urgent demand for the development of an efficient approach able to detect whether a large group of candidate variants affect a trait in a synergistic manner. Despite the proposal of some methods for family-based RV association study [Oualkacha

et al., 2013], the methods dedicated to identifying significant RVs in twin data based on the classical twin model have not been widely developed, and the impact of twin-based methods has not been investigated. It has been shown that family-based design can be more powerful than analysis of independent observations. Classical twin design still makes an important contribution for identifying the genetic variation that underlies complex traits in many ways [van Dongen et al., 2012]. In this study, we introduce a novel Bayesian latent variable collapsing model (BLVCM) to identify the association between a trait of interest and a group of RVs for both additive and synergistic effects. The paper is organized as follows. First, we introduce the model and methodology, by which it combines various features into an integrated framework. We show how the strategy deals with nonassociated variants, allows both protective and deleterious effects, captures potential SNP-SNP interactions, identifies the global effect with the consideration of variant classification, and pinpoints underlying RVs that contribute most and can be applied to twin data in reasonable computational time. We then present the simulation study, and compare the BLVCM to other methods under different settings. Finally, we analyze a real dataset from a Finnish twin study for testing the association between RVs and low-density lipoprotein cholesterol (LDL-C) level.

Methods Model Specification Suppose that we have observed data on M twin samples (including monozygotic (MZ) twins and dizygotic (DZ) twins) indexed by i, each of which contains a set of j = 1, . . . , ni (Although multiple birth subjects such as triplets or quadruplets are very rare, they also can be included if available, and in that case, ni is larger than 2. However, nontwin siblings are not allowed in the model)  individuals so that the total number of individuals is M i =1 n i = N. The type of twin sample i is defined by T (i) ∈ {MZ , DZ }. The model can be readily extended to handle other phenotypic forms, but for simplicity, we assume that the phenotype y ij , i = 1, . . . , M, j = 1, . . . , ni measured on these individuals is of quantitative value. Let there be K RVs genotyped in some genomic regions of interest along with other genetic information, such as MAF. Although the rigid definition of an RV is one present with a MAF of less than 1%, the boundary varies under different scenarios. Throughout this article, a variant of MAF less than 5% is defined as an RV. We denote the genotypes of variant k by A k A k , A k ak , or ak ak , where A k is the minor allele. Thus, the observed K RV genotypic values for N individuals can be coded in a matrix G i×j ×k . There are several ways to encode the genotypes. For example, in an additive model, G ij k = 2, 1, 0 corresponds to the number of copies of minor allele A k for subject j in twin sample i at variant site k. In a dominant model, which we adopt in this article, G ij k = 0, 1 corresponds to the occurrence of the minor allele A k . In the case of RVs, the additive model is almost equivalent to the Genetic Epidemiology, Vol. 38, No. 4, 310–324, 2014

311

dominant one due to the extremely low frequency of copies of minor alleles. Given the fact that RVs show less correlation in data than CVs, we collapse the genotypic values across all RVs for individual j in twin sample i, and define new variables x ij and z ij as: ⎧ K  ⎪ ⎪ ⎪ ⎪ 1, if Gij k > 0 ⎪ ⎨ k=1 x ij = , (1) K ⎪  ⎪ ⎪ ⎪ Gij k = 0 ⎪ ⎩0, if k=1

⎧ ⎪ ⎪ ⎪ 1, ⎪ ⎪ ⎨ z ij =

⎪ ⎪ ⎪ ⎪ ⎪ ⎩0,

if

K 

Gij k > 1

k=1

if

K 

.

(2)

Gij k ≤ 1

k=1

In a twin study, individuals are not independent but correlated within each family. Based on the classical twin design, we decompose the phenotypic variation into additive genetic (A), shared environmental (C), and specific environmental (E) components using a random effects model, often denoted as the acronym “ACE” model. Thus, for MZ and DZ twins, the relation between the observed quantitative phenotypic value y ij and the collapsed indicators x ij and z ij can be ex-

pressed by the following hierarchical mixed effects regression model: ⎛ ⎞  αk ⎠ · x ij + γ y ij = μ + β · ⎝ k∈V(ij )

· 1{T(i)=MZ } + θijDZ · 1{T(i)=DZ } · z ij + ϕMZ i DZ 1

DZ 2 · 1{T(i)=DZ } + c i + εij , (3) ·1{T(i)=MZ } + ϕi + ϕij ·



θiMZ

where μ is the intercept representing the mean level of all subjects’ phenotypic values, and the coefficients β and γ are representative of the main gene effect and the synergistic effect. Throughout this paper, synergistic refers to any nonlinear effect of combination of two or more RVs. Here, αk , θiMZ , and θijDZ are latent variables that represent the unobservable single variant-specific effect and individual-specific synergistic effect of a twin (Fig. 1). These effects can obtain values of {–1,0,1}, i.e., they are signed indicator variables controlling whether the particular RV (twin group or individual) is contributing to the phenotype or not, and if it is contributing, then the effect can be positive or negative. For each individual, V(ij ) is defined as the set of variant sites at which at least one copy of a minor allele is located, i.e., V (ij ) = {k|G ij k > 0}. The shared environmental effect c i , the additive genetic effect 1 for MZ, and ϕDZ for DZ twins are twin-level random ϕMZ i i effects that follow the following independent and identical

Figure 1. DAG of BLVCM. The relationship between parameters and latent variables within the hierarchical Bayesian model is depicted in a diagram.

312

Genetic Epidemiology, Vol. 38, No. 4, 310–324, 2014

zero mean normal distributions:

c i ∼ i.i.d. N 0, σc2

∼ i.i.d. N 0, σϕ2 ϕMZ i  1 ϕDZ ∼ i.i.d. N 0, σϕ2 2 . i

2 The individual-level random effect ϕDZ only for DZ twins ij DZ 1 follows the same distribution as ϕi , i.e.,  2 ϕDZ ∼ i.i.d. N 0, σϕ2 2 . ij

In model (3), εij is the random error term for the unique environmental component, which follows a normal distribution with zero mean and variance σε2

εij ∼ i.i.d. N 0, σε2 . We set inverse Gamma(0.001,0.001) as an noninformative prior for all these variance hyper-parameters above because the conjugate prior leads to analytical form of the posterior so that Gibbs sampling can be applied. Our model can also be applied to independent studies when ϕM Z , ϕDZ1 , ϕDZ2 , and c are fixed at zero. If only the shared environmental effect c is set to be zero, model (3) becomes an AE twin model. Some mutations can change the amino acids and the structure of the protein making the protein dysfunctional. Model (3) is based on the biological assumption that causal RVs in the same gene or pathway would contribute a similar effect size to a phenotype, although they can affect the phenotype in opposite directions. Therefore, in our model, the magnitude of effect is assumed to be the same for all causal variants. Some mutations in a gene region or a pathway may not modify the produced protein (synonymous mutation), and this is the biological interpretation for noncausal variants. Because a protein is a functional unit, destroying the same protein or pathway would lead to the similar effect on the phenotype, which is captured by the main gene effect β in formula (3). Thus, the overall association between the phenotype and the set of RVs can be identified by investigating the posterior distribution of β. Each RV may have an unobserved unique contribution to the phenotype, which can be regarded as missing data and modeled as a latent variable. Thus, αk is a variable modulating the global effect β, and reflects the contribution and direction of the minor allele. The setup is based on the fact that although the gene can affect the trait in a collective way, each variant may play a unique role, such as functional or nonfunctional, deleterious, or protective. For synergistic effects, to enumerate all possible combinations of RVs would lead to a huge number of parameters, so, instead of variant-level effects, we assign twin-level components for MZ twins and individual-level components for DZ twins. Notice that because only a small number of individuals can harbor copies of a minor allele in more than one site for a set of RVs, we collapse the synergistic effect, denoted by γ, to boost the power. Furthermore, any individual with a copy of minor allele in more than one RV site may have an unobserved RV combination effect. Similar to the main effect, an RV combination may or may not affect the phenotype in a synergistic way. Therefore, we introduce latent variables

θiMZ and θijDZ to model this unobserved RV combination effect. Because the genotypes are the same for MZ twins, they must share a twin-level synergistic effect θiMZ , whereas for DZ twins, θijDZ is an individual-wise latent variable because they can have different patterns of minor alleles. For individuals with z ij = 0, there is no need to update their θiMZ or θijDZ , so we set them to be 0. In doing so, there are actually only a small number of θiMZ and θijDZ to be updated in the model, which prevents the overfitting problem. If a large number of RVs are under investigation, a shrinkage-inducing prior distribution can be applied to restrict the diversity of θiMZ and θijDZ . Prior Distributions for Latent Variables In model (3), the multiplicative relations between β and αk , γ and θiMZ or θijDZ may result in a nonidentifiable model [Gelman and Hill, 2006]. Several approaches to circumvent an identifiability problem have been proposed [Bafumi et al., 2005]. To ensure the identifiability of these two pairs of parameters, the distributions over the latent variables hence have to be restrictive and carefully selected. In this study, categorical distributions are adopted, and their effects in analyses are investigated in detail. In this paper, αk , θiMZ , and θijDZ are sampled from threedimensional categorical distributions with hyperparameters Z DZ p k , qM i , and qij , respectively (Fig. 1). More specifically, to represent protective, neutral, and deleterious effect, we allow them to be sampled in a discrete space {–1, 0, 1} as follows: αk |p k ∈ (–1, 0, 1) ∼ Cat(3, p k )  Z

Z θiMZ qM ∈ (–1, 0, 1) ∼ Cat 3, qM i i  DZ θijDZ qDZ . ij ∈ (–1, 0, 1) ∼ Cat 3, qij The proof for the identifiability is provided in Supplementary (Appendix 1). So, the kth RV is treated as associated when αk  = 0; otherwise it is neutral. p k is the predictor for the variant-level latent variable αk , and can be regarded as the probability for the kth variant to be protective, neutral, or deleterious. In doing so, this model takes into account the variant-specific main effect of each RV in a group. Because usually we do not have such information or estimation beforehand, we may not determine the values for p k subjectively, but instead construct a hierarchical prior and estimate it in a data-driven way. Therefore, we place a Dirichlet prior Z DZ distribution on p k , qM i , and qij : p k ∼ Dir(0.1, 0.1, 0.1) Z qM i

∼ Dir(0.1, 0.1, 0.1)

qDZ ij

∼ Dir(0.1, 0.1, 0.1).

Such a distribution maintains the conjugate property for the multinomial distribution, which makes it straightforward to apply the Gibbs sampler. Small hyper-parameters can reduce the influence of the prior and allow the posterior distribution more dependent on the data, yet the hyperparameters can be tuned based on other prior information to improve the performance, which is one of the advantages of Bayesian Genetic Epidemiology, Vol. 38, No. 4, 310–324, 2014

313

framework. For example, if there is existing evidence to show deleterious effect for some RVs, a larger weight can be assigned to that category. Sometimes, it may be reasonable and more beneficial to put more weight on zero for those RVs in a noncoding region. Moreover, it is obvious that the naive collapsing method [Li and Leal, 2008] becomes a special case of our model by fixing αk = 1 and θiMZ , θijDZ = 0 through the priors. Correspondingly, for the purpose of avoiding aliasing, which arises in models with symmetries where there are two symmetric posterior modes, p (β) and p (γ) are given the following truncated normal distributions,

β ∼ N 0, σβ2 I (β ≥ 0)

γ ∼ N 0, σγ2 I (γ ≥ 0), where σβ2 and σγ2 are prespecified hyperparameters, whose selection will be discussed later. Extension for Other Trait Types and Covariates So far, the attention has been restricted to phenotypes defined on a continuous scale. We can also consider other phenotypes, in particular a binary outcome, which is commonly found in case-control design. For the purpose of simplicity, we omit the variance components here, and the model can be slightly modified as E (y ij ) = ξij



g (ξij ) = μ + β · ⎝



⎞ αk ⎠ · x ij + γ

k∈V(ij )

· θiMZ · 1{T(i)=MZ } + θijDZ · 1{T(i)=DZ } · z ij , where g (·) is a link function, and y ij is assumed to be generated from a particular distribution in the exponential family [Nelder and Wedderburn, 1972]. We keep the other parameters, latent variables, variance components and their priors unchanged. Particularly, when y ij conforms to a Bernoulli distribution, and logit or probit is adopted as the link function, the model can deal with popular case-control design in genetic epidemiology. In such cases, some changes in the MCMC sampling scheme have to be made, for example, introducing auxiliary variables. Other covariates are readily included in the following model, ⎛ ⎞  αk ⎠ · x ij E (y ij ) = μ + β · ⎝

on model (3), the inference on the association between a phenotype and a group of RVs can be made from the posterior distributions to draw a conclusion whether there exists a significant main or synergistic effect. To investigate the genelevel main effect association, we wish to test the following interval hypotheses: H 0 : 0 < β ≤ λ or ∀1 ≤ k ≤ K , H1 : β > λ and ∃1 ≤ k ≤ K ,

where cv ij is the covariate vector, and w is the weight vector for the covariates to be estimated.

 β>λ∩

P

The Gibbs sampler that enables drawing inference under model (3) is described in Supplementary (Appendix 2). Based

314

Genetic Epidemiology, Vol. 38, No. 4, 310–324, 2014

K 



0 0

k=1 K 

 |αk | = 0

k=1

P (β > λ)P

 K 

=

1 – P (β > λ)P 



 |αk | > 0

k=1  K 

 |αk | > 0

k=1

λ



(1 – (1/3)K ) √ 2σβ   = ,

λ K K erf √ 1 – (1/3) + (1/3) 2σβ 1 – erf

Inference on Posterior Quantities

αk  = 0,

where λ is a small positive number giving definition of nonnegligible gene effect. Although in principle the choice of λ is arbitrary, a too small λ might inflate the estimate error resulting from the numerical approximation. So we set λ as 0.2 σε , where σε is the estimated standard deviation for random error. It is noted that we need to consider both β and αk altogether because, as shown in Supplementary (Appendix 1), β = 0 is equivalent to ∀1 ≤ k ≤ K , αk = 0. Moreover, to sustain the identifiability of β and αk , k = 1, . . . , K , we give a more informative prior to restrict β to be positive. However, such constrain incurs the difficulty of setting an appropriate credible interval for the posterior distribution to test β = 0 because the interval encounters the boundary of parameter space. Therefore, we adopt the interval hypothesis instead of a point one, and propose the Bayes factor (BF) [Kass and Raftery, 1995] instead of using credible interval. BFs give a measurement on the evidence provided by a dataset in favor of one model against the other while taking into account model complexity. Such an interval null hypothesis leads to a hybrid BF [Morey and Rouder, 2011]. The BF can be expressed using the ratio of the odds of posterior distribution, which can be approximated by the Markov chain Monte Carlo (MCMC) method, to the prior odds. For 1) the prior distribution described above, the prior odds PP (H (H0 ) are calculated as:

k∈V(ij )

+ γ · θiMZ · 1{T(i)=MZ } + θijDZ · 1{T(i)=DZ } · z ij + w ·cv ij ,

αk = 0,

where (·) is the error function, defined as: erf (x) =  x erf –t2 √2 e dt. Thus, the hybrid BF can be obtained by π 0  β>λ∩



0λ∩

P

K 

K 

0 λ ∩ ∃1 ≤ i ≤ n, θi  = 0) . P (0 < γ ≤ λ ∪ ∀1 ≤ i ≤ n, θi = 0)



|αk | > 0

k=1

 P

|αk | > 0|Data

k=1



⎞       θMZ  + θDZ  > 0|Data⎠ Pˆ ⎝γ > λ ∩  i ij



k=1



BF(H1 : H0 ) =

K 

BF(H1 : H0 ) ⎛

K 

.

(4)

|αk | = 0

Results Simulation

k=1

Evaluation of power in twin design Under general assumptions, for small P-values (P < 1/e, where e ≈ 2.72), the connection between P-values and BFs is BF < –1/(e · P · log (P )) [Stephens and Balding, 2009]. Jeffreys also suggested a classification of BFs [Jeffreys, 1998]. More accurate inference can be determined through simulation or permutation, which we discuss in the next section. Once given the evidence of a global association, we can further predict the underlying RVs by investigating the K marginal posterior distributions of αk to test αk = 0 vs. αk  = 0, and a good way to quantify and summarize the posterior probabilities is also to calculate the marginal BF, which is the ratio of the posterior odds to the prior odds of the same variable, and defined as: BF(M1 (αk  = 0) : M0 (αk = 0)) =

p (αk  = 0|Data)/p (αk  = 0) . p (αk = 0|Data)/p (αk = 0)

Note that because the two models share the same parameter space, the marginal model posterior distribution for variant k is the posterior distribution p (αk |Data ), which can also be approximately obtained from the output using the MCMC method. Given the same prior probability for the two models, the BF simply becomes the ratio of the posterior distributions of αk . Finally, we derive a similar inference strategy for the test on significant synergistic effect association. For identifying the synergistic effect association, we need to test the following hypotheses: H0 : 0 < γ ≤ λ ∪

      θMZ  + θDZ  = 0, i ij T(i)=MZ

T(i)=DZ

      θMZ  + θDZ  > 0, H1 : γ > λ ∩ i ij T(i)=MZ

T(i)=DZ

where λ is a small positive number. Similar to the main effect testing, the hybrid BF for this hypothesis can be expressed by:

We conducted simulations to evaluate the performance of the BLVCM for detecting main and synergistic effects. A group of K = 40 RVs were generated for each replicate with 600 individuals, half of whom were MZ twins and the rest DZ twins, and no CVs were included in the simulations. Although a wider MAF spectrum is involved in real data, for simplicity, we set uniformly MAF = 0.005 for all RVs. The genotypes were coded using the dominant model, that is, the genotype was 1 when at least one copy of a minor allele occurred, otherwise, it was 0. Thus, by assuming Hardy–Weinberg equilibrium, the probability of G ij = 1 was approximately 0.01. To generate the genotype matrix G ij , in each twin i, we first randomly and independently drew samples based on B ern(0.01) for all RVs of member j = 1. For the other (j = 2) twin, we copied the genotypic value if T (i) = MZ , otherwise we sampled it from B ern (0.01) · (1 – G i1 ) + B ern (0.5) · G i1 . Then, we simulated phenotypic data according to formula (3) with the consideration of the following scenarios:

r Proportion of associated variants: To evaluate the effects of misclassification resulted from noncausal variants, for each group of RVs, the scenarios for main effect were considered with different proportions of associated variants p asso main = 60%, 40%, 20%. For the synergistic effect, because the combination of copies of minor alleles at more than two RVs for a certain person was very rare, we only considered the two-RV combination here so that there were total 40 × 40/2 = 800 combinations. We investigated the situations where p asso syn = 100%, 50%, 25%, 12.5% of these combinations had true contributions. r Direction of associated variant effects: We assumed that the main or synergistic effect of associated variants could be either deleterious or protective. In the bidirectional effect simulations, we always fixed half of associated RVs as deleterious, and the rest as protective. r Effect size and residual variance: We simulated the phenotypes with two main effect sizes β = 0.6, 0.8, and synergistic effects γ = 0.6, 0.8, 1.0, 1.2, 1.4. We fixed the total variance at 1, out of which the variance for A, C, and E component was 1/3, respectively. Genetic Epidemiology, Vol. 38, No. 4, 310–324, 2014

315

Figure 2. Power curves for detecting main effect of BLVCM under different effect sizes and proportion of causal combinations.

Figure 3. Power curves for detecting synergistic effect of BLVCM

To evaluate the power, we first treated the BF as a test statistic found BF cut-offs to control the type I error below 0.05. Such cut-offs are dependent on sample sizes, priors for parameters, number of RVs and MAF of RVs, and should be determined through simulations. The BF is largely affected by the prior distributions exerted on the interrogated parameters, and overly diverse prior tends to yield a very small BF [Kass and Raftery, 1995]. Therefore, we fixed the hyperparameters σβ2 and σγ2 to be 1. Under such settings, we found through the simulations that the BF threshold was 2, which was used to evaluate the power. The empirical powers for detecting main effect association calculated from 1,000 replicates are shown in Figure 2. When effect size β equals 0.6, the powers are above 0.8 regardless of whether the effects are in the same direction or not. It can be seen that when RVs with bidirectional main effects are involved, the power is slightly higher. It may be due to the fact that αk follows a three-dimensional categorical distribution in our model, and missing a category could lead to a redundant estimate, which means that we use a overly complex model to fit a simple dataset. When β rises to 0.8, the powers are all above 0.8. Then, we investigated the powers for interaction effect association detection under various scenarios, including different effect sizes, small proportion of interaction effect, and bidirectional interaction effect. To evaluate the power, we set up a cut-off of BF to control the empirical type I error rate below 0.05. In this simulation, there was no main effect for any single RV, and RVs could only contribute to the phenotype in a synergistic manner. More specifically, when copies of minor alleles were present in at least two RVs, there could be a synergistic effect. In contrast, the powers for detecting synergistic effect, plotted in Figure 3, are lower than those

for the main effect. When the effect size is smaller than 1.0, the powers are all below 0.5. Although the co-occurrence of copies of minor allele in a person among a moderate group of RVs is fairly rare, after all the combinations are collapsed, the results indicate that the large effect size is essential to achieve the high power if there are low proportion of causal RV combinations.

316

Genetic Epidemiology, Vol. 38, No. 4, 310–324, 2014

under different effect sizes and proportion of causal combinations.

Comparison of different methods We compared the performance with other methods widely used for RV association detection. For the main effect association detection, the BLVCM is compared to the Granvil (http://www.well.ox.ac.uk/GRANVIL/) [Morris and Zeggini, 2010] and the sequence kernel association test (SKAT) with the linear kernel [Wu et al., 2011]. The Granvil is a representative of those models based on so-called burden tests and the SKAT is a common variance-based test, which turns out more flexible and robust for neutral and bidirectional RV effects. Because these methods are not designed to deal with twin data, we simulated the data based on the configuration above, except all individuals became independent rather than twins. Note that independent design is a special case in our model with the A and C components being zero. Thus, we excluded these components from the Gibbs sampler and fixed them to be zero for comparison with other models. For each scenario, 1,000 replicates were run to obtain the empirical false-positive rates (FPRs) and true-positive rate (TPRs). As a means of comparison for these methods that utilize different statistical frameworks, we plotted the receiver operating characteristic (ROC) curves from the FPRs and TPRs, and the results are shown in Figure 4.

Figure 4. ROC curves of three competitive methods under different settings: (A) 10% causal RVs, effect size 0.6; (B) 10% causal RVs, effect size 0.8; (C) 30% causal RVs, effect size 0.6; (D) 10% deleterious and 10% protective RVs, effect size 0.6; (E) 10% deleterious and 10% protective RVs, effect size 0.8; and (F) 20% deleterious and 20% protective RVs, effect size 0.6

Based on a visual comparison, as shown in Figure 4C, when there are 30% causal RVs, all three methods have nearly the same area under curve (AUC), which means that they share similar performances when causal RVs affect phenotypes in the same direction. This is consistent with the previous re-

view, which shows that burden tests have strong power when relatively a large percentage of RVs are causal [Wu et al., 2011]. When only 10% causal RVs are involved (Fig. 4A), all methods suffer from power loss; however, the Granvil has the poorest performance while the SKAT and the BLVCM Genetic Epidemiology, Vol. 38, No. 4, 310–324, 2014

317

Figure 5. ROC curves of three competitive methods when effect sizes of causal variants are nonuniform : (A) 10% deleterious causal RVs, (B) 10% bidirectional causal RVs.

perform almost in the same pattern. We also notice that larger effect size increases the power (Fig. 4B), and the SKAT is slightly better than the BLVCM. Such results demonstrate that the BLVCM has the competitive capability in terms of detecting main RV effect compared to the SKAT, and is better than the Granvil in the case of lower proportion of causal RVs. Then, we investigated the performance when both deleterious and protective RVs were involved. From Figure 4E and F, it is evident that the AUC of Granvil is barely larger than 0.5, indicating that it hardly has power in this case. This is because the bidirectional effects cancel out each other when pooled together, which is quite consistent with the previous reviews. It seems that the power of the BLVCM is approximately equal to the SKAT in these cases. Our method assumes that effect sizes are of the same magnitude across all RVs, but this may not be true in real data. We further investigated the performance of the BLVCM when such assumptions do not hold. Instead of simulating the uniform effect size for each causal variant, we sampled the effect sizes from a normal distribution N(0, 1) so that each causal variant had a unique contribution. We considered two situations: (1) 10% RVs were deleterious were involved, (2) 10% RVs were deleterious or protective. The power comparisons are shown in Figure 5. It seems that in this situation, SKAT is more powerful than the BLVCM, but the margin is modest. This implies that the BLVCM is robust and violence of such assumption does not significantly lower the power. Although the Granvil and SKAT with the linear kernel are not designed for interaction effect detection, we still included them into the simulations to see their performances when only interaction effect association is involved. First, we investigated the case where all combinations of minor alleles produce a deleterious effect. The results (Fig. 6) show that the Granvil achieves the highest power, and the BLVCM is in the second place when all the combinations of RVs interact with each other. It seems that in this case the strong sensi-

318

Genetic Epidemiology, Vol. 38, No. 4, 310–324, 2014

tivity of the burden test still holds even if only a synergistic effect is involved, though this is hardly the reality. Then, when 50% of combinations have synergistic effect, the BLVCM is more powerful than the Granvil with large effect size. When less than 25% of combinations have a synergistic effect, the BLVCM is more powerful than the Granvil and SKAT. Similar to the main effect, the Granvil has no any power when bidirectional RV effects are involved, although the BLVCM maintains its performance. By investigating Figures 3 and 6A–C, it seems that the power of BLVCM is comparable to that in the previous twin design. This implies that although by using larger pedigrees, we increase the likelihood of seeing a variant that is rare in the population, there is no benefit in terms of statistical power for twin design. As a sensitivity analysis, we also tested different alternatives for the threshold value λ and the hyper-parameters σβ2 and σγ2 apart from the prespecified values. Based on the total variance fixed at 1, we ran the simulations with λ = 0.1, and compared it with 0.2 under various effect sizes. It turns out that they basically share similar performance in terms of power though fewer samples are found within (0, 0.1) when the effect size is large. Previously, it has been shown that too flat variance prior leads to small BFs. In our simulations, we found that different values for σβ2 and σγ2 did change the BFs a lot, but the values within a moderate range have small influence on the power.

LDL-C Study We applied the BLVCM to real data of an LDL-C study. The study included 664 Finnish individuals from two sequenced twin sets (FTNICO and FTALKO) with their LDL-C levels, out of whom 208 were MZ twins. The genotypic data contained SNPs across 22 chromosomes for these individuals. Genotyping of these two sets was performed on

Figure 6. Empirical power curves of three competitive methods: (A) 100% causal RV combinations, (B) 50% causal RV combinations, (C) 25% causal RV combinations, (D) 12.5% causal RV combinations, (E) 25% causal RV combinations (bidirectional), and (F) 12.5% causal RV combinations (bidirectional).

the Human670-QuadCustom Illumina BeadChip (Illumina, Inc., San Diego, CA). Imputation was performed using the software IMPUTE v2.1.0 [Howie et al., 2009]. Before performing the RV association analysis, the OpenMx R package [Boker et al., 2011] was used to fit the variance components, and determine the appropriate heritability model. We compared the ACE and AE model, and the result from LRT suggested that the AE model is more suitable for this data than the ACE model. We wished to focus on rarer variants in

this analysis so that we only included the RVs of MAF less than 2%, and those with higher MAF were excluded. Nearly all RVs were investigated, including those in noncoding regions. We selected 51,018 genes of a variety of types, along with their boundaries from the Ensembl database. After filtering for RVs, there were 27,990 gene regions left, within the boundaries of which at least two RVs resided, and 19,086 regions had possible interaction effects. Although correlation was not commonly observed in RVs, for the purpose of Genetic Epidemiology, Vol. 38, No. 4, 310–324, 2014

319

Figure 7. BFs with log transformation for main effect of genome-wide gene regions. X axis: the start bp of a gene. Y axis: log10(BF). efficiency, we selected one representative RV for collinear RVs in a region and eliminated the others. We standardized the LDL-C levels to make the total variance as 1, and in this study, we set σβ2 and σγ2 at 4. The BFs with logarithm transformation for main and synergistic effect are plotted in Figures 7 and 8. Although the Bayesian framework entails a loss function to make a decision, in the context of GWA analysis, it may be preferable to control the type I error rate in the classical framework [Good, 1992]. Here, we would like to find a cut-off to control the type I error below 10–4 , and any gene region with its BF greater than its cut-off is treated as significant. When the MCMC method is adopted, a threshold value can be determined by permutation within or between chains. How-

ever, the within-chain phenotype permutation does not have theoretical justification and gives overly conservative results [Che and Xu, 2010], although the between-chain method is computationally markedly intensive for large scale analysis. Instead, in this study, we obtain the cut-offs through simulations. It is worth noting that a simple model tends to produce higher BFs under null hypothesis due to the Occam’s razor so that the BFs for the main effect detection rely on the number of RVs. Therefore, we first generate random genotypic data for some given numbers of RVs with all hyperparameters fixed, and run the model for hundreds of thousands of replications using real phenotypic data to collect the estimated BFs. Such BFs provide us with the empirical distributions of BF for those numbers of RVs under the null model, from

Figure 8. BFs with log transformation for interaction effect of gene regions on ch22. X axis: the start bp of a gene region. Y axis: log10(BF).

320

Genetic Epidemiology, Vol. 38, No. 4, 310–324, 2014

Table 1.

Significant gene regions with main effect of RVs associated with LDL-C

Ensembl Gene ID ENSG00000159189 ENSG00000174348 ENSG00000232993 ENSG00000183929 ENSG00000251348 ENSG00000145777 ENSG00000069011 ENSG00000213187 ENSG00000260151 ENSG00000205869 ENSG00000255599 ENSG00000260254 ENSG00000089234 ENSG00000188655 ENSG00000137807 ENSG00000259807 ENSG00000197689 ENSG00000168597 ENSG00000260389 ENSG00000104866 ENSG00000213889 ENSG00000167554 ENSG00000223608

Gene Region

Start bp

End bp

Chr

Gene Biotype

RV

log10(BF)

Cut-off (< 10–4 )

C1QC PODN RP11–334A14.5 DUSP5P HSPD1P11 TSLP PITX1 RP1–23E21.2 RP11--748L13.6 KRTAP5–1 AP000997.1 AP000997.2 BRAP RNASE9 KIF23 RP11--426C22.4 TBC1D29 C17orf65 WBP11P1 LRRC68 PPM1N ZNF610 AP001415.1

22970123 53527854 53535610 228744885 95104703 110405760 134362615 93801694 27632365 1605572 115498783 115509281 112080797 21024252 69706585 29228491 28886584 42253354 30091626 45595050 45992035 52839498 39378846

22974603 53551174 53551174 228788150 95106404 110413722 134370503 93802478 27633846 1606513 115500461 115517680 112123790 21029090 69740764 29231352 28890507 42264082 30094597 45651335 46005768 52870375 39382920

1 1 1 1 5 5 5 6 10 11 11 11 12 14 15 16 17 17 18 19 19 19 21

protein coding protein coding antisense pseudogene pseudogene protein coding protein coding pseudogene lincRNA protein coding lincRNA lincRNA protein coding protein coding protein coding lincRNA NA NA pseudogene protein coding protein coding protein coding sense intronic

9 27 18 23 3 6 10 2 2 3 2 8 39 9 40 4 6 9 4 74 19 36 8

1.548 0.972 1.167 1.067 2.426 2.899 3.369 3.778 3.369 3.540 3.778 1.922 0.791 1.789 0.737 3.431 2.120 2.090 1.984 0.823 1.091 0.7503 2.047

1.444 0.868 1.081 0.952 2.189 1.663 1.389 2.639 2.639 2.189 2.639 1.506 0.673 1.444 0.659 1.931 1.663 1.444 1.931 0.330 1.053 0.716 1.506

which we can easily determine the significance level of interest. Then, we found a function to interpolate the cut-offs for other numbers of RVs. It turns out that given the pre-specified hyper-parameters σβ2 and σγ2 = 4 and λ = 0.2, following the principle depicted in formula (4), the cut-off for a moderate number (K ∈ [1, 100]) of RVs under such type I error rate can be approximately fitted  using thefollowing  function, 0.1  1 – erf √ (1 – (1/3)K ) 4, 500 – K 1.2 2   . cut-of f = 0.1 K 1.2 K K erf √ (1 – (1/3) ) + (1/3) 2 The derivation of this cut-off function can be found in Supplementary (Appendix 3) in more detail. Table 1 lists all gene regions with its main effect BF greater than its simulation-based cut-off. We notice that many significant regions are clustered on chromosome 1, 11, and 19, which is consistent with the previous Genome-wide linkage analysis in the Qu´ebec Family Study in 2004 [Boss´e et al., 2004]. However, it is the first time to show the potential involvement of RVs in these regions. Among these results, two regions (in ENSG00000255599 and ENSG00000260254) on chromosome 11 located on 11q23.2 adjacently replicate the findings in that Canadian study, which reports the suggestive genome-wide regions related to LDL-C level. More recently, the variants in APOA1/APOC3/APOA4/APOA5 gene cluster located near these regions have been found associated with LDL-C level [Al-Bustan et al., 2013; Pollin et al., 2008]. Furthermore, the finding of two gene regions on chromosome 19 (in ENSG00000104866 and ENSG00000213889) located from 45.6 to 46.0 Mb also supports the region identified in the Canadian study. It is already known that there are a handful of genes related to LDL-C within this area [Willer et al., 2008], so there possibly is some linkage between the RVs in these regions and those causal gene variations. Besides,

ENSG00000183929 located at 22.87 Mb on chromosome 1 is next to GALNT2 (23.0 Mb) gene, whose polymorphism has been shown to play a vital role in serum lipid levels in Mulao and Han population [Li et al., 2011], and is associated with HDL cholesterol and Triglycerides [Kathiresan et al., 2008]. Two regions in ENSG00000174348 and ENSG00000232993 on chromosome 1 are close to gene PCSK9, in which a recent whole-exome sequencing study on rare coding variants identified novel RVs significantly associated with LDL-C [Lange et al., 2014]. Synergistic effect BFs above its cut-off level are listed in Table 2 for all gene regions. Among these, frequent mutations in gene ENSG00000091262 (ABCC6) on chromosome 16 were previously found to be significantly associated with an increase in the prevalence of coronary artery disease and some characteristics of cholesterol level [Trip et al., 2002]. Our result indicates that the interaction of RVs in this gene appears to significantly affect the LDL-C level in the Finnish population, which further corroborates such findings. Another region in gene ENSG00000104866, which is also in Table 1, suggests that this region has both main and synergistic effects. In conclusion, despite its exploratory nature, our results confirm several previous findings in the Finnish population, and provide further insights into the potential involvement or linkage of RVs in LDL-C level or its related genes.

Discussion There is urgent demand for the exploration of the effect of a group of RVs within a gene or pathway, and more importantly, potential RV–RV interactions within a gene, or different genes involved in the same pathway. As shown in our simulation and real data study, the BLVCM is an effective Genetic Epidemiology, Vol. 38, No. 4, 310–324, 2014

321

Table 2.

Significant gene regions with synergistic effect of RVs associated with LDL-C

Ensembl Gene ID ENSG00000204344 ENSG00000258871 ENSG00000177699 ENSG00000091262 ENSG00000156873 ENSG00000260465 ENSG00000081665 ENSG00000104866 #

Gene Region

Start bp

End bp

Chr

Gene Biotype

RV

log10(BF)

Cut-off# (< 10–4 )

STK19 RP3--514A23.2 AC011944.2 ABCC6 PHKG2 RP11--63M22.2 ZNF93 LRRC68

31938868 73019288 79271231 16242785 30759591 66754800 19902620 45595050

31949228 73061833 79278268 16317379 30772490 66765688 19932560 45651335

6 14 15 16 16 16 19 19

protein coding lincRNA antisense protein coding protein coding antisense protein coding protein coding

15 51 13 87 15 17 32 74

2.327 2.254 2.223 2.222 2.331 2.390 2.590 3.369

2.192 2.192 2.192 2.192 2.192 2.192 2.192 2.192

The cut-off of BF for the synergistic effect is the same across all gene regions.

method for the association analysis for RV synergistic effects. As far as we know, there is no available study that tries to detect synergistic effects between RVs, and there is no definitive study that shows such effects exist. However, our method provides a more powerful tool to probe them if they truly exist. The methodology allows for the uncertainty of variant effects by introducing latent variables in a multiplicative way to approximate and capture the main RV effect and the RV– RV synergistic effect. One benefit of the Bayesian hierarchical modelling is that it can deal with more sophisticated models than the classical statistical framework to better reflect relationships in reality. By treating αk and θi as latent variables and using the hierarchical structure, we can collapse a set of RVs and combinations of RVs separately, and simultaneously estimate the gene-level regression coefficient β along with the variant-specific random variable αk , and the synergistic effect under the classical twin design. Moreover, in the Bayesian framework, various prior distributions can be assigned to αk for estimating the coefficient β. When the hyperparameters for the prior distributions are properly selected, both global and marginal effects are identifiable, and the estimation of model parameters is stable. Furthermore, various biological information and potential correlation of RVs can be built and reflected by some joint prior distribution of latent variables. For synergistic effect detection, by assigning the categorical distribution to θi , the hierarchical model is able to deal with the situation where protective or deleterious interaction effects are simultaneously involved. The model treats the effects of combinations as latent variables, and predicts their associations in a data-driven way. For a moderate group of RVs, the number of θi to be estimated is small compared to the sample size. If there are too many RVs collapsed, the model may deal with too many latent variables so that it can over-fit the data. Nevertheless, the Bayesian framework automatically penalizes those models with more parameters, known as Occam’s razor, so including more RVs does not necessarily improve the sensitivity. The expansion of the BLVCM to the Collapsing method [Li and Leal, 2008] is clear and straightforward. The collapsing of RV combinations boosts the power for detecting RV–RV interaction effects. The results from our simulations indicate that with well controlled type-I error rate, our model outperforms the SKAT and Granvil in detecting RV–RV interaction effects in most simulated scenarios. The gain of power results from the fact that our model collapses synergistic effects and takes into account the occurrence of

322

Genetic Epidemiology, Vol. 38, No. 4, 310–324, 2014

neutral RVs or RV–RV combinations and different directions of synergistic effects in a gene or pathway, which is a common phenomenon in current sequencing data. The BLVCM can not only test the hypothesis of the evidence of a global association effect, but also predict the marginal effect of each RV through interrogating the posterior of αk given the significant global association. This is an appealing characteristic of random effects in the BLVCM: they can be used as the basis of making inference of individual effects, in which way our model shows both robustness and flexibility. For the marginal effects, due to the incorporation of location information of minor allele occurrence, the BLVCM manages to pinpoint the most contribution of individual effects. Compared to BFs, P-values have a striking and fundamental limitation [Ioannidis, 2008] because they heavily depend on some factors that affect the power of the test, such as low MAF of the variant and the sample size. The BLVCM employs the Bayesian framework which provides an alternative approach to assessing RV associations. Unlike other model selection methods [Quintana et al., 2011], in formula (3), V(ij ) is usually a small set due to rare copies of minor allele each person can carry, which drastically reduces the number of predictors in the regression equation, and therefore avoids large scale matrix computation in implementation, which makes the Gibbs sampler computationally much faster. In doing so, we include only one slope coefficient β on the right-hand side of formula (3) so that it takes much less time to compute than other Bayesian models [Yi and Zhi, 2011; Yi et al., 2011]. We implemented the BLVCM in C++ using the Gibbs sampling procedure described in Supplementary (Appendix 2), and an R package for the BLVCM is currently available from the first author. The computational time depends on the number of sample size, the number of RVs under investigation and the number of iterations. Twin design is generally more computationally time-consuming than independent design due to the additional estimation of ACE components. On an Intel Core 2 3.2 GHz PC, the computational time for analyzing one gene containing 40 RVs using 600 individuals with 30,000 iterations is approximately 3 sec for independent design, and 10 sec for twin design. Even for a sample size as huge as 10,000, our simulation results show that it takes around 25 and 118 sec, respectively. Thus, compared to other Bayesian model selection methods, our method drastically improves the efficiency without the loss of power and makes it directly applicable to large-scale genome sequencing data.

Finally, some important limitations need to be considered. One problem is that the model does not provide adjustment for population structure. A recent study [Mathieson and McVean, 2012] demonstrates that RVs can show a stratification that is systematically different from CVs. Failing to account for population stratification may lead to spurious findings. Further research on this topic could explore a way to incorporate such information. Acknowledgements Liang He was supported by the EU FP7 (EpiMigrant, 279143). Samuli Ripatti was supported by the Academy of Finland (251217), EY FP7 project BioSHaRE (261433), the Finnish foundation for Cardiovascular Research, and the Sigrid Juselius Foundation. We thank Professor Jaakko Kaprio for providing access to the twin data. Phenotypic and genotypic data collection in the twin cohort has been supported by the Wellcome Trust Sanger Institute, ENGAGE – European Network for Genetic and Genomic Epidemiology, FP7-HEALTH-F4-2007, grant agreement number 201413, National Institute of Alcohol Abuse and Alcoholism (grants AA-12502, AA-00145, and AA-09203 to R.J. Rose and AA15416 and K02AA018755 to D.M. Dick) and the Academy of Finland (grants 100499, 205585, 118555, 141054, and 264146 to J. Kaprio). The authors have declared no conflict of interest.

References Al-Bustan SA, Al-Serri AE, Annice BG, Alnaqeeb MA, Ebrahim GA. 2013. Resequencing of the APOAI promoter region and the genetic association of the –75G > A polymorphism with increased cholesterol and low density lipoprotein levels among a sample of the Kuwaiti population. BMC Med Genet 14:90. Bafumi J, Gelman A, Park DK, Kaplan N. 2005. Practical issues in implementing and understanding Bayesian ideal point estimation. Polit Anal 13:171–187. Basu S, Pan W. 2011. Comparison of statistical tests for disease association with rare variants. Genet Epidemiol 35:606–619. Becker KG. 2004. The common variants/multiple disease hypothesis of common complex genetic disorders. Med Hypotheses 62:309–317. Boker S, Neale M, Maes H, Wilde M, Spiegel M, Brick T, Spies J, Estabrook R, Kenny S, Bates T and others 2011. OpenMx: an open source extended structural equation modeling framework. Psychometrika 76:306–317. Boss´e Y, Chagnon YC, Despr´es J-P, Rice T, Rao DC, Bouchard C, P´erusse L, Vohl M-C. 2004. Genome-wide linkage scan reveals multiple susceptibility loci influencing lipid and lipoprotein levels in the Quebec Family Study. J Lipid Res 45:419–426. Che X, Xu S. 2010. Significance test and genome selection in Bayesian shrinkage analysis. Int J Plant Genomics 2010:893206. doi:10.1155/2010/893206. Eichler EE, Flint J, Gibson G, Kong A, Leal SM, Moore JH, Nadeau JH. 2010. Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet 11:446–450. Frazer KA, Murray SS, Schork NJ, Topol EJ. 2009. Human genetic variation and its contribution to complex traits. Nat Rev Genet 10:241–251. Frigyik BA, Kapila A, Gupta MR. 2010. Introduction to the Dirichlet Distribution and Related Processes. Seattle, WA: University of Washington. Gelman A, Hill J. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models, 1st ed. Cambridge, UK: Cambridge University Press. Gerke J, Lorenz K, Cohen B. 2009. Genetic interactions between transcription factors cause natural variation in yeast. Science 323:498–501. Gibson G. 2012. Rare and common variants: twenty arguments. Nat Rev Genet 13:135– 145. Good IJ. 1992. The Bayes/non-Bayes compromise: a brief review. J Am Stat Assoc 87(419):597–606. Han F, Pan W. 2010. A data-adaptive sum test for disease association with multiple common or rare variants. Hum Hered 70:42–54. Hindorff LA, MacArthur J, Morales J, Junkins HA, Hall PN, Klemm AK, Manolio TA. 2011. A catalog of published genome-wide association studies. Available at: www.genome.gov/gwastudies. Hoffmann TJ, Marini NJ, Witte JS. 2010. Comprehensive approach to analyzing rare genetic variants. PloS One 5:e13584. Howie BN, Donnelly P, Marchini J. 2009. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet 5:e1000529.

Ioannidis JPA. 2008. Effect of formal statistical significance on the credibility of observational associations. Am J Epidemiol 168:374–383. Iyengar SK, Elston RC. 2007. The genetic basis of complex traits: rare variants or “common gene, common disease”? Methods Mol Biol Clifton NJ 376: 71–84. Jeffreys H. 1998. Theory of Probability, 3rd ed. New York, NY: Oxford University Press. Kass RE, Raftery AE. 1995. Bayes factors. J Am Stat Assoc 90(430):773– 795. Kathiresan S, Melander O, Guiducci C, Surti A, Burtt NP, Rieder MJ, Cooper GM, Roos C, Voight BF, Havulinna AS and others 2008. Six new loci associated with blood low-density lipoprotein cholesterol high-density lipoprotein cholesterol or triglycerides in humans. Nat Genet 40:189–197. Ladouceur M, Dastani Z, Aulchenko YS, Greenwood CMT, Richards JB. 2012. The empirical power of rare variant association methods: results from sanger sequencing in 1,998 individuals. PLoS Genet 8:e1002496. Lange LA, Hu Y, Zhang H, Xue C, Schmidt EM, Tang Z-Z, Bizon C, Lange EM, Smith JD, Turner EH and others 2014. Whole-exome sequencing identifies rare and low-frequency coding variants associated with LDL cholesterol. Am J Hum Genet 94:233–245. Li B, Leal SM. 2008. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet 83: 311–321. Li Q, Yin R-X, Yan T-T, Miao L, Cao X-L, Hu X-J, Aung LHH, Wu D-F, Wu J-Z, Lin W-X. 2011. Association of the GALNT2 gene polymorphisms and several environmental factors with serum lipid levels in the Mulao and Han populations. Lipids Health Dis 10:160. Liu DJ, Leal SM. 2010. A novel adaptive method for the analysis of next-generation sequencing data to detect complex trait associations with rare variants due to gene main effects and interactions. PLoS Genet 6:e1001156. Madsen BE, Browning SR. 2009. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet 5:e1000384. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A and others 2009. Finding the missing heritability of complex diseases. Nature 461:747–753. Mathieson I, McVean G. 2012. Differential confounding of rare and common variants in spatially structured populations. Nat Genet 44:243–246. McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JPA, Hirschhorn JN. 2008. Genome-wide association studies for complex traits: consensus uncertainty and challenges. Nat Rev Genet 9:356–369. Meyers KJ, Chu J, Mosley TH, Kardia SL. 2010. SNP-SNP interactions dominate the genetic architecture of candidate genes associated with left ventricular mass in african-americans of the GENOA study. BMC Med Genet 11:160. Morey RD, Rouder JN. 2011. Bayes factor approaches for testing interval null hypotheses. Psychol Methods 16:406–419. Morris AP, Zeggini E. 2010. An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genet Epidemiol 34: 188–193. Neale BM, Rivas MA, Voight BF, Altshuler D, Devlin B, Orho-Melander M, Kathiresan S, Purcell SM, Roeder K, Daly MJ. 2011. Testing for an unusual distribution of rare variants. PLoS Genet 7:e1001322. Nelder JA, Wedderburn RWM. 1972. Generalized linear models. J R Stat Soc Ser Gen 135:370–384. Oualkacha K, Dastani Z, Li R, Cingolani PE, Spector TD, Hammond CJ, Richards JB, Ciampi A, Greenwood CMT. 2013. Adjusted sequence kernel association test for rare variants controlling for cryptic and family relatedness. Genet Epidemiol 37:366–376. Pollin TI, Damcott CM, Shen H, Ott SH, Shelton J, Horenstein RB, Post W, McLenithan JC, Bielak LF, Peyser PA and others 2008. A null mutation in human APOC3 confers a favorable plasma lipid profile and apparent cardioprotection. Science 322:1702–1705. Price AL, Kryukov GV, de Bakker PIW, Purcell SM, Staples J, Wei L-J, Sunyaev SR. 2010. Pooled association tests for rare variants in exon-resequencing studies. Am J Hum Genet 86:832–838. Quintana MA, Berstein JL, Thomas DC, Conti DV. 2011. Incorporating model uncertainty in detecting rare variants: the Bayesian risk index. Genet Epidemiol 35:638– 649. Robert CP. 1995. Simulation of truncated normal variables. Stat Comput 5:121–125. Robinson R. 2010. Common disease multiple rare (and distant) variants. PLoS Biol 8:e1000293. Stephens M, Balding DJ. 2009. Bayesian statistical methods for genetic association studies. Nat Rev Genet 10:681–690. Tennessen JA, Bigham AW, O’Connor TD, Fu W, Kenny EE, Gravel S, McGee S, Do R, Liu X, Jun G, and others. 2012. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 337: 64–69. Trip MD, Smulders YM, Wegman JJ, Hu X, Boer JMA, ten Brink JB, Zwinderman AH, Kastelein JJP, Feskens EJM, Bergen AAB. 2002. Frequent mutation in the ABCC6 Genetic Epidemiology, Vol. 38, No. 4, 310–324, 2014

323

gene (R1141X) is associated with a strong increase in the prevalence of coronary artery disease. Circulation 106:773–775. van Dongen J, Slagboom PE, Draisma HHM, Martin NG, Boomsma DI. 2012. The continuing value of twin studies in the omics era. Nat Rev Genet 13:640–653. Willer CJ, Sanna S, Jackson AU, Scuteri A, Bonnycastle LL, Clarke R, Heath SC, Timpson NJ, Najjar SS, Stringham HM and others 2008. Newly identified loci that influence lipid concentrations and risk of coronary artery disease. Nat Genet 40: 161–169.

324

Genetic Epidemiology, Vol. 38, No. 4, 310–324, 2014

Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. 2011. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet 89:82–93. Yi N, Liu N, Zhi D, Li J. 2011. Hierarchical generalized linear models for multiple groups of rare and common variants: jointly estimating group and individualvariant effects. PLoS Genet 7:e1002382. Yi N, Zhi D. 2011. Bayesian analysis of rare variants in genetic association studies. Genet Epidemiol 35:57–69.

Bayesian latent variable collapsing model for detecting rare variant interaction effect in twin study.

By analyzing more next-generation sequencing data, researchers have affirmed that rare genetic variants are widespread among populations and likely pl...
1MB Sizes 0 Downloads 2 Views