Empirical Bayes estimates for correlated hierarchical data with overdispersion.

MAIN PAPER (wileyonlinelibrary.com) DOI: 10.1002/pst.1635

Published online 2 September 2014 in Wiley Online Library

Empirical Bayes estimates for correlated hierarchical data with overdispersion Samuel Iddi,a,b Geert Molenberghs,c,b * Mehreteab Aregay,d,b and George Kalemab An extension of the generalized linear mixed model was constructed to simultaneously accommodate overdispersion and hierarchies present in longitudinal or clustered data. This so-called combined model includes conjugate random effects at observation level for overdispersion and normal random effects at subject level to handle correlation, respectively. A variety of data types can be handled in this way, using different members of the exponential family. Both maximum likelihood and Bayesian estimation for covariate effects and variance components were proposed. The focus of this paper is the development of an estimation procedure for the two sets of random effects. These are necessary when making predictions for future responses or their associated probabilities. Such (empirical) Bayes estimates will also be helpful in model diagnosis, both when checking the fit of the model as well as when investigating outlying observations. The proposed procedure is applied to three datasets of different outcome types. Copyright © 2014 John Wiley & Sons, Ltd. Keywords: beta-binomial; combined model; conjugacy; empirical bayes; generalized linear mixed model; logistic-normal model; maximum likelihood; negative-binomial; partial marginalization; posterior; prediction; random effects; strong conjugacy

1. INTRODUCTION

316

Random effects have been used in many multilevel or hierarchical models to address association. Such association arises from measuring outcomes repeatedly on the same unit, or from clustering of responses, in life science studies and beyond. Commonly, normal random effects are used in generalized linear mixed models (GLMM) for non-Gaussian outcomes [1–3] and the linear mixed model as a special case for continuous Gaussian outcomes [4]. A key assumption is a conditional independence of these outcomes given the random effects. Relying on conditional independence, the random effects variance components are used to quantify the amount of dependency. For example, in the random intercept model for a continuous outcome, the random effects variance captures the between-subject variability, whereas the error variance quantifies the within-subject variability. The ratio of the between variability and the total variability yields the so-called intra-class correlation [4]. For non-Gaussian outcomes, the nonlinearity of the GLMM makes this less straightforward, although some progress has been made [5]. Overdispersion refers to the phenomenon where the variability in the data is not fully captured by the posited model. Even though the normal random effects capture overdispersion to some extent [6–9], additional model extension may be needed to fully capture variability. Such an extension was proposed by [10,11], who proposed a so-called combined model, which introduces an additional set of random effects at the level of the observation within a subject, often conjugate to the outcome model in the sense of [12]. The authors showed that the two sets of random effects simultaneously improve the ability to capture both overdispersion and correlation. Consequently, for a number of applications, the fit considerably ameliorated.

Pharmaceut. Statist. 2014, 13 316–326

Molenberghs [10,11] used maximum likelihood estimation by analytically integrating their model over the conjugate random effect, which is possible precisely thanks to the conjugacy property and then further numerically integrating over the normal random effects. This enables the use of standard statistical software such as the SAS procedure NLMIXED or the R function nlme. However, estimation of the two random effects has not been studied for the combined model. Nevertheless, so-called predictions of the random effects find several uses, including but not limited to prediction of expected (future) responses, their corresponding probabilities, checking the fit of the model, and identifying outlying subjects or clusters [13]. Using empirical Bayes prediction principles, we develop a procedure to predict these random effects and their accompanying standard errors. Comparison is made with a fully Bayesian prediction approach. The methodology is applied to three sets of data, with different outcome types. Here is how this paper is organized. In Section 2, three case studies are presented, with count, time-to-event, and binary outcomes, respectively. These serve as backdrop for the methodology presented in Section 3. The data are analyzed in Section 4. Finally, Section 5 is devoted to conclusions.

a

Department of Statistics, University of Ghana, Legon-Accra, Ghana

b

I-BioStat, KU Leuven - University of Leuven, B-3000 Leuven, Belgium

c

I-BioStat, Universiteit Hasselt, B-3590 Diepenbeek, Belgium

d Division of Biostatistics & Epidemiology, Medical University of South Carolina, Charleston, SC, USA

*Correspondence to: Geert Molenberghs, I-BioStat, Universiteit Hasselt, B-3590 Diepenbeek, Belgium. E-mail: [email protected]

Copyright © 2014 John Wiley & Sons, Ltd.

S. Iddi et al.

2. CASE STUDIES 2.1. A clinical trial in epileptic patients This study is described in full detail in [14]. In summary, the data are obtained from a randomized, double-blind, parallel group multi-center study for the comparison of placebo with a new anti-epileptic drug (AED), in combination with one or two other AED’s. Patients were randomized after a 12-week stabilization period for the use of AEDs during which the number of epileptic seizures were counted. After this period, 45 patients were assigned to the placebo group and 44 to the new active treatment group. For 16 weeks, patients were followed (double-blind) and the number of seizures counted. Some were followed for up to 27 weeks over a long-term open-extension study. The outcome of interest was the number of weekly epileptic seizures experienced since the last time the outcome was measured. Researchers were interested in knowing whether or not addition of the new treatment helps in reducing the number of epileptic seizures. Research interest is in the treatment effect; in particular, one wants to ascertain whether it evolves more rapidly in one treatment arm than the other. Moreover, there are strong individual differences in the number of epileptic seizures per unit of time. Not only does this generate overdispersion, one is interested in these individual rates, because they may have implications beyond the clinical trials, for when and how particular patients are treated. Evidently, such individual predictions hinge on the adequacy and hence flexibility of the statistical model used. As shown in [11], using a GLMM [9] leads to inadequate fit because variance and correlation in the repeated sequence cannot be modeled sufficiently flexibly. Evidently, the impact of using too simple a model may lead to incorrect individual-level predictions. 2.2. Recurrent asthma attacks in children Asthma attacks are commonly occurring among young children between the ages of 6 and 24 months. To prevent the occurrence of asthma among high-risk children, a prevention trial was set up to identify a new way of applying existing anti-allergic drugs. The data from this study has previously been studied in [15]. The trial was conducted with such children randomly assigned to either a placebo or drug, and the asthma events that developed over time were recorded in a diary. Frequently, a patient has more than one asthma event, typical numbers lying between 1 and 5. The different events are thus clustered within a patient and ordered over time. This ordering can be taken into account in the model. The data are presented in a calendar time format, where the time at risk for a particular event is the time from the end of the previous event (asthma attack) to the start of the next event (start of the next asthma attack). A particular patient has different periods at risk throughout follow-up, which are separated either by an asthmatic event that lasts one or more days or by a period in which the patient was not under observation. The start and end dates of each such risk period are required, together with the status indicator to denote whether the end of the risk period corresponds to an asthma attack or not. Like in the previous case study, also here, it is of interest to gauge individual severity of the diseases by constructing appropriate individual measures. 2.3. Jimma longitudinal family survey of youth

Pharmaceut. Statist. 2014, 13 316–326

3. METHODOLOGY In Section 3.1, a general formulation of the combined model covering all types of non-Gaussian outcomes is presented. Sections 3.2 and 3.3 are devoted to likelihood-based estimation machinery for the fixed and random effects, respectively. A Bayesian approach is the subject of Section 3.4. 3.1. Combined model Non-Gaussian outcomes are conveniently modeled using the exponential family of distributions [12]. Over the years, it has been extended to accommodate overdispersion on the one hand and data hierarchies on the other. Molenberghs et al. [10] flexibly brought these together in a so-called combined model. For the response Yij of subject i (i D 1, : : : , N) measured at occasion j (j D 1, : : : , ni ), the combined model is generally expressed as n o fi Yij jbi , , ij , D exp 1 yij ij ‰.ij / C c.yij , / . Note that the parameters bi , , and ij parameterize ij and hence the mean and variance functions in the following way. The conditional mean E.Yij jbi , , ij , / D ‰ 0 .ij / D cij D ij ij for g.ij / D x 0ij C z0ij bi . Here, x ij is a p-dimensional vector of covariates with fixed coefficients , zij is a q-dimensional vector of covariates with random coefficients bi N.0, D/, ij is a random variable with mean and variance 2 , and g./ is a link function. Although the overdispersion parameter is formally present, the use of ij may obviate its need in practice. The advantage of the latter is the ability to apply (empirical) Bayes estimation, the theme of this manuscript. Note that what distinguishes the combined model from the classical GLMM is the introduction of the additional random variable ij , in the conditional mean, to fully address extra-model dispersion. Generally, the ij can have a subject-specific and/or measurement-specific mean; they can but do not have to be assumed correlated within a subject. Also, ij can assume a distributional form depending on the type of response distribution involved. The beta and gamma

Copyright © 2014 John Wiley & Sons, Ltd.

317

The data for this study were collected from the Jimma Longitudinal Family Survey of Youth (JLFSY) in the relatively large Ethiopian

town of Jimma, as well as the smaller towns of Yebu, Serbo, and Sheki, and nearby rural areas. The study started in 2005. Adolescents from 3700 households were interviewed by means of a questionnaire administered by trained interviewers. Two rounds of the study were carried out, 2 years apart. A total of about 700 adolescents were included. More than 90% of the study subjects responded to the second round of the study. The binary outcome used here is the adolescents’ current school attendance coded as 0 (not currently attending) or 1 (currently attending). Current school attendance was 90.2% (male) and 91.1% (female) in the first round of the survey; the corresponding figures for the second round were 93.5% and 92.8%, respectively. We examine whether or not the percentage of school attendance depends on adolescents’ labor status to support themselves or their families, their residential status, age, and gender. Once again, we are interested, not only in establishing which covariables explain the outcomes sequence but also in individual-level predictions, for example, to identify and further scrutinize subjects that are at elevated risk of not attending school.

S. Iddi et al. distributions are obvious choices for ij when dealing with binary and count data, respectively, because of their conjugate status relative to the Bernoulli and Poisson models, respectively. Conjugacy allows for explicit expressions for the marginal moments and distributions. However, in the presence of the normal random effect, strong conjugacy as introduced by Molenberghs et al. [10] is required to still allow for such closed forms. Strong conjugacy applies in the normal, Poisson, and Weibull cases but not in the binary and binomial cases. In the latter cases, no closed forms exist when the logit link is used. An exception is the probit link, which allows for closed forms, even though no (strong) conjugacy applies [10]. The random effects bi handle correlation and some part of the overdispersion. All features taken together, this model provides enough flexibility to adequately describe the mean, variance, and covariance structures. The absence of bi and ij , respectively, reduces the model to the beta-binomial and binomial normal model for binary data and to the negative-binomial and Poisson-normal model for count data, respectively. 3.2. Estimation of fixed effects Molenberghs et al. [10] suggested a number of estimation strategies, including pseudo-likelihood and Bayesian estimation but settled for marginal maximum likelihood to estimate all model parameters. Conditioned on the two random effects, the observed data likelihood contribution is specified as follows:

L. / D

8 ni Z N Z

Empirical Bayes Estimation of Coalescence Times from Nucleotide Sequence Data.

Two modeling strategies for empirical Bayes estimation.

Bayes and empirical Bayes estimators of abundance and density from spatial capture-recapture data.

Empirical Bayes methods for smoothing data and for simultaneous estimation of many parameters.

Empirical Bayes analysis of RNA-seq data for detection of gene expression heterosis.

Powers of the likelihood ratio test and the correlation test using empirical bayes estimates for various shrinkages in population pharmacokinetics.

Empirical-Bayes adjustments for multiple comparisons are sometimes useful.

Flexible Modeling of Epidemics with an Empirical Bayes Framework.

Semiparametric Bayes hierarchical models with mean and variance constraints.

Comparison of safety effect estimates obtained from empirical Bayes before-after study, propensity scores-potential outcomes framework, and regression model with cross-sectional data.

An efficient empirical Bayes method for genomewide association studies.

Empirical Bayes for DCM: A Group Inversion Scheme.

Graphs and stochastic relaxation for hierarchical Bayes modelling.

SURE Estimates for a Heteroscedastic Hierarchical Model.

Compound hierarchical correlated beta mixture with an application to cluster mouse transcription factor DNA binding data.

An Empirical Bayes Method for Multivariate Meta-analysis with an Application in Clinical Trials.

Estimation of heterogeneous variances using empirical Bayes methods: theoretical considerations.

Empirical Bayes method for reducing false discovery rates of correlation matrices with block diagonal structure.

Safety effects of section control--an empirical Bayes evaluation.

An empirical Bayes approach to network recovery using external knowledge.

An empirical Bayes formulation of cohort models in cancer epidemiology.

An Empirical Bayes Mixture Model for Effect Size Distributions in Genome-Wide Association Studies.

A Comparison of Methods for RNA-Seq Differential Expression Analysis and a New Empirical Bayes Approach.

Empirical Bayes scan statistics for detecting clusters of disease risk variants in genetic studies.