Biometrics 70, 910–919 December 2014

DOI: 10.1111/biom.12208

Semiparametric Estimation in Generalized Linear Mixed Models with Auxiliary Covariates: A Pairwise Likelihood Approach Li Liu1 and Liming Xiang2, * 2

1 School of Mathematics and Statistics, Wuhan University, Hubei 430072, P.R. China School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371 ∗ email: [email protected]

Summary. Auxiliary covariates are often encountered in biomedical research settings where the primary exposure variable is measured only for a subgroup of study subjects. This article is concerned with generalized linear mixed models in the presence of auxiliary covariate information for clustered data. We propose a novel semiparametric estimation method based on a pairwise likelihood function and develop an estimating equation-based inference procedure by treating both the error structure and random effects as nuisance parameters. This method is robust against misspecification of either error structure or random-effects distribution and allows for dependence between random effects and covariates. We show that the resulting estimators are consistent and asymptotically normal. Extensive simulation studies evaluate the finite sample performance of the proposed estimators and demonstrate their advantage over the validation set based method and the existing method. We illustrate the method with two real data examples. Key words:

Auxiliary; Generalized linear mixed model; Pairwise likelihood; Semiparametric.

1. Introduction Incomplete covariate data are common in biomedical studies in which some variables may be missing or cannot be measured precisely owing to technical difficulties or practical considerations. Auxiliary covariate data typically arise from twostage validation designs where the primary exposure variable is often assessed only for a subset of the study subjects (often referred to as a validation set) and an easily obtained auxiliary variable may be measured to provide surrogate information for the subjects without primary exposure assessments. For example, in the nurses health study to understand risk factors (such as fat intake) for breast cancer (Rosner, Willett, and Spiegelman, 1989), detailed and accurate dietary information was collected only from the subjects in the validation set, while a self-administered quantitative food questionnaire was used for the whole study cohort to obtain surrogate measurements for long-term fat intake. The issue of missing data has been well recognized in the literature for decades. Little and Rubin (2002) reviewed various missing mechanisms. When covariate data are missing completely at random, the simplest way of dealing with the auxiliary data problem is to consider the validation set only, excluding the non-validation set from the analysis. This naive method is valid in terms of consistency, but is not efficient. Various methods have been developed to improve efficiency by including the non-validation set in the analysis, such as the parametric models discussed by Schafer (1987) and the conditional score approach of Tsiatis and Davidian (2001). However, either these methods may be sensitive to model misspecification or their robustness against non-normal measurement error remains unexplored. Other than the auxiliary covariate information, another complexity is the presence of clustered data due to the data

910

collection procedure employed in medical settings, where the study subjects may be correlated within each cluster. For example, in a retrospective study of recurrent urinary tract infections (UTI), elderly women living in the same nursing home are expected to be correlated in terms of contracting UTI due to their exposure to the same environmental risk factors. Given the hierarchical nature of clustered data, random effects are typically introduced to account for possible dependence among subjects within clusters. As a class of important statistical models, generalized linear mixed models (GLMMs) provide a flexible approach for analysis of clustered data by incorporating both inter-cluster heterogeneity and intra-cluster association. With some parametric assumptions made for the error structure, GLMMs have been widely studied in the literature based on likelihood inference (Breslow and Clayton, 1993; McCulloch and Searle, 2001). However, the full-likelihood based method may become intractable in the presence of auxiliary covariates because the probability mechanism of the missingness of covariates has to be specified explicitly, which is often infeasible in practice. By treating the error structure nonparametrically, Zhou, Chen, and Cai (2002) studied the auxiliary covariate problem in a random effects logistic regression model and proposed a semiparametric method for statistical inference through the conditional joint distribution of the outcome variable and random effects given covariates. Chen and Lin (2010) extended this semiparametric MLE method by using a kernel smoother in the context of GLMMs with only continuous auxiliary covariates allowed. To obtain the likelihood based on the conditional joint distribution, both existing studies imposed the condition of independence between covariates and random effects. However, the random effects may in practice © 2014, The International Biometric Society

Pairwise Likelihood Approach for GLMMs with Auxiliary Covariates depend on covariates. For example, institute (cluster) level characteristics such as poor hygiene and a lack of resources have been associated with both institute level means of UTI recurrence and the history of UTI. The GLMM ignoring these correlations between the random effects and covariates would yield inconsistent estimators, as recognized by Neuhaus and McCulloch (2006). Although the previous two methods were nonparametric with regard to the error structure, they imposed the normal distribution assumption on the random effects, which may result in misspecification of the mixing distribution and thus lead to very biased estimators of regression parameters (Verbeke and Lesaffre, 1996; Zhang and Davidian, 2001; Agresti, Caffo, and Ohman-Strickland, 2004). Moreover, it is notable that both methods of Zhou et al. (2002) and Chen and Lin (2010) lack theoretical justifications for their resulting estimators. In this article, we propose a novel semiparametric estimation method with well-established theoretical properties and being applicable to more general and practical situations. We consider the auxiliary covariate problem within a GLMM framework where both the random error structure and random effects are treated nonparametrically. As it is difficult to verify the correctness of the high-dimensional joint distribution of data in practice, we propose to deduce the regression parameters through a semiparametric pairwise likelihood, which is a pseudolikelihood constructed by multiplying the likelihood contributions from all pairs of observations within clusters and incorporating information on the non-validation set. The pairwise likelihood belongs to the broader class of composite likelihoods, defined by Lindsay (1988), which have good theoretical properties as well as many successful applications. For an overview of the composite likelihood methods, please refer to Varin, Reid, and Firth (2011). The pairwise likelihood approach has been applied in various areas concerning spatial data (Besag, 1974; Cox and Reid, 2004), regression under non-standard situations (Liang and Qin, 2000), correlated binary data (Kuk and Nott, 2000; Kuk, 2007), correlated count data with overdispersion (Henderson and Shimakura, 2003; Chatelain, Lambert-Lacroix, and Tourneret, 2009), multivariate survival data (Parner, 2001) and recurrent event data (Huang, Qin, and Wang, 2010). In line with the conditional likelihood derived by Kalbfleisch (1978), we construct the pairwise likelihood conditioning on the order statistics of paired outcomes. With the conditional error distribution estimated by an empirical density function, we develop semi-parametric estimators for regression coefficients by maximizing the pairwise likelihood directly. This method avoids full likelihood analysis, thus reducing computational cost significantly. In view that the intercept is canceled out in the pairwise likelihood, we propose an estimating equation-based procedure to estimate it in a similar spirit to the approach of Lindsay (1988). Since the random effects are eliminated from the estimation procedure and regarded as nuisance, our method does not impose any restrictions on the mixing distribution. The resulting estimators are thus robust to misspecification of the random effect structure and allow the exposure variable to be associated with the random effects, opening up a wide range of applications. Furthermore, we show that the proposed estimators have good asymptotic properties under mild conditions.

911

The rest of the article is organized as follows. Section 2 introduces notations and the model formulation. In Section 3, we develop the semi-parametric pairwise likelihood estimation for the regression coefficients and the estimating equationbased procedure for the intercept. Theoretical properties of the resulting estimators are discussed in this section as well. We conduct intensive simulations in Section 4 and illustrate the methodology with two real data examples in Section 5. We conclude the article with a few remarks in Section 6. 2. Notation and Model Formulation Consider a study including K independent clusters with nk  subjects each for k = 1, . . . , K. Let {Xki , Zki } , i = 1, . . . , nk , k = 1, . . . , K, be the covariate vector for individual i in cluster k, where the component Xki is not always observed with dimension p and the rest of the covariate vector Zki is completely observed with dimension q. Without loss of generality, we can assume Xki to be a scalar variable, that is, p = 1. We assume in addition that there is an easily observed discrete auxiliary measurement denoted by a variable Wki for the missing covariate Xki . Taking the hierarchical structure of data into account, we introduce cluster-specific random effects uk and assume that they are independent and identically distributed with mean 0 and variance σT2 . We treat the distribution of uk as a nuisance parameter throughout this article. Furthermore, we allow for correlation between the random effects and the covariates, providing more flexible model structures. For example, in many practical studies covariate Xki could be decomposed to cluster component Xka , subject component c Xib and interactive component Xki , where Xka is a function of random effects uk . Given (Xki , Zki , Wki ) and uk , the conditional distribution of the outcome variable Yki is assumed to belong to the canonical exponential family: Pβ (Yki |Xki , Zki , Wki , θ) = exp{[Yki ηki − b(ηki )]/a(φ) + c(Yki , φ)}, (1) where a(·), b(·) and c(·, ·) are known functions, φ is a dispersion parameter, ηki is related to the random effects and covariates by ηki = α + uk + β1 Xki + β2 Zki with intercept α and slope vector β = (β1 , β2 ) to be estimated, and θ = (α, u1 , . . . , uk ) . The parameters in the model above are identifiable when 1, Xki , and Zki are linearly independent. In fact, if (α(1) , β(1) ) and (α(2) , β(2) ) are two sets of parameters satisfying model (1), that is, Pβ(1) (Yki |Xki , Zki , Wki , θ (1) ) = Pβ(2) (Yki |Xki , Zki , Wki , θ (2) ), where θ (1) = (α(1) , u1 , . . . , uk ) and θ (2) = (α(2) , u1 , . . . , uk ) , then it (1) (1) (2) is easy to see that α(1) + β1 Xki + β2 Zki = α(2) + β1 Xki + (2) β2 Zki , which gives that (α(1) , β(1) ) = (α(2) , β(2) ). For cluster k, we denote the validation data set by Vk , the non-validation data set by V k and their corresponding sizes by nk1 and nk2 , respectively. nk1 + nk2 = nk . Then for individual i, the observed data of cluster k are (Yki , Xki , Zki , Wki ) if i ∈

912

Biometrics, December 2014

Vk and (Yki , Zki , Wki ) if i ∈ V k . Note that model (1) implies that given (Xki , Zki , uk ) , the auxiliary variable Wki provides no extra information about the regression model. Two examples in the GLMM context for the analysis of binary and count outcomes, respectively, are given below. Example 1 (random effects logistic regression model) Given the random effects uk and covariates {Xki , Zki } , the regression model for binary outcomes is postulated as Pβ (Yki = 1|Xki , Zki , θ) = K   k=1 i

Semiparametric estimation in generalized linear mixed models with auxiliary covariates: a pairwise likelihood approach.

Auxiliary covariates are often encountered in biomedical research settings where the primary exposure variable is measured only for a subgroup of stud...
163KB Sizes 0 Downloads 6 Views