HHS Public Access Author Manuscript

Can J Stat. Author manuscript; available in PMC 2016 December 01. Published in final edited form as: Can J Stat. 2015 December ; 43(4): 498–518. doi:10.1002/cjs.11268.

Variable Selection and Inference Procedures for Marginal Analysis of Longitudinal Data with Missing Observations and Covariate Measurement Error Grace Y. Yi, Department of Statistics and Actuarial Science University of Waterloo, Waterloo, Ontario, Canada N2L 3G1

Xianming Tan, and The Methodology Center Pennsylvania State University University Park, PA16802 Runze Li Department of Statistics and The Methodology Center, Pennsylvania State University, University Park, PA16802-2111

Summary

In contrast to the extensive attention paid to model selection for univariate data, model selection for longitudinal data remains largely unexplored. This is particularly the case when data are subject to missingness and measurement error. To address this important problem, we propose marginal methods that simultaneously carry out model selection and estimation for longitudinal data with missing responses and error-prone covariates. Our methods have several appealing features: their applicability is broad because they are developed within a unified framework of marginal generalized linear models; model assumptions are minimal in that no full distribution is required for the response process and the distribution of the mismeasured covariates is left unspecified; and their implementation is straightforward. To justify the proposed methods, we provide both theoretical properties and numerical assessments.

Keywords Longitudinal data; Marginal analysis; Measurement error; Missing data; Model selection; Simulation-extrapolation

1 Introduction Longitudinal studies have proven useful for studying changes in response over time, and have been widely conducted in applications. A variety of modeling approaches have been developed for the analysis of longitudinal data. It is often of primary interest to characterize how the mean response is related to covariates, as this relationship features the average response at the population level (e.g., Lin and Carroll 2006). To facilitate modeling the dependence of the mean response on covariates, marginal approaches, such as generalized estimating equations (GEE), are often used (e.g., Liang and Zeger 1986, Lin and Carroll 2001). However, the use of existing methods is impeded by challenging issues associated with longitudinal data.

It is common for longitudinal studies to collect a large number of covariates, some of which have no effect on the response. Including such covariates in modeling and inferential procedures can greatly degrade the quality of the results. Variable selection thus becomes necessary and critical for valid analysis of longitudinal data. For univariate data, a large body of variable selection methods has been developed. Miller (2002) provides a comprehensive account of earlier work on variable selection, including variable selection criteria such as BIC (Schwarz 1978) and algorithms for best subset selection. Recent developments in variable selection techniques, including the LASSO (Tibshirani 1996), the SCAD (Fan and Li 2001) and their extensions, are summarized by Fan and Li (2006). Modern regularization methods have also been proposed for longitudinal data. Fu (2003) proposed penalized marginal estimating equations with a bridge penalty for longitudinal data. Fan and Li (2004) extended the SCAD procedure to partially linear models with longitudinal data. Bondell, Krishna and Ghosh (2010) proposed simultaneous selection of the fixed and random factors using a penalized joint log-likelihood for the linear mixed model. Ni, Zhang and Zhang (2010) proposed a double-penalized likelihood approach for simultaneous model selection and estimation in semiparametric mixed models.

The validity of these methods, however, relies on two important conditions: (1) data are complete, and (2) variables are precisely measured. Nevertheless, these conditions are frequently violated in reality (e.g., Tsiatis, DeGruttola and Wulfsohn 1995; Wang et al. 2008). Although longitudinal studies are designed to collect data on every individual at each assessment, missing observations arise ubiquitously.
With estimation being the primary focus, many authors have shown that seriously biased results can arise if missing data are not properly accounted for (e.g., Robins, Rotnitzky and Zhao 1995).

Measurement error in covariates typically arises in data collection for various reasons (Carroll et al. 2006). For instance, variables may be difficult to observe precisely because of physical location or cost (e.g., exposure to radiation). A covariate may represent an average of a certain quantity over time (e.g., cholesterol level), and any practical way of measuring such a quantity necessarily involves measurement error. It has been demonstrated that ignoring covariate measurement error in model selection procedures may give rise to seriously misleading results. Liang and Li (2009) and Ma and Li (2010) proposed SCAD-based procedures for independent and identically distributed (i.i.d.) data with measurement error in covariates. These methods cannot, however, be directly applied to longitudinal data, where the i.i.d. assumption breaks down because longitudinal measurements are typically correlated, nor can they handle data with missing observations.

The goal of this paper is to address these practical and important problems. We develop simultaneous variable selection and estimation procedures that can handle "imperfect" longitudinal data with missingness and measurement error, and we establish the theoretical properties of these procedures. Although extensive research has been directed at each individual aspect of model selection, incomplete longitudinal data, and measurement error, those methods cannot simply be applied to handle all these features at the same time. Addressing these issues simultaneously is more challenging because these characteristics can interactively affect inference about the response parameters, and developing asymptotic results becomes more difficult when

combining all the features in the inferential procedures. To the best of our knowledge, there is no precedent work addressing these features simultaneously. Our method is developed within a marginal framework that includes a broad class of mean structures, and it has several appealing features: its implementation is straightforward; it requires minimal model assumptions for the response process; and the distribution of the true covariates is left unspecified, hence protecting us from misspecification of the response and covariate processes. Owing to its simplicity, our method can be applied in a wide range of settings.

The remainder of the paper is organized as follows. In Section 2 we introduce the basic model setup and notation. Section 3 presents the methodology that allows simultaneous estimation and model selection for longitudinal data with incomplete observations and measurement error. Asymptotic results are established in Section 4. We apply the proposed method to the longitudinal data arising from the National Population Health Survey, and report the results in Section 5.1. To assess the performance of the proposed method, we conduct simulation studies in Sections 5.2 and 5.3. General remarks are included in the last section.

2 Notation and Framework We consider the setting where the response variable is subject to missingness and some covariates are error-contaminated. For subject i at time point j, let Yij be the response variable that is subject to missingness, and let Xij and Zij be the associated covariate vectors, where the Xij are subject to measurement error and the Zij are precisely measured, i = 1, 2, …, n, and j = 1, 2, …, m. Denote Yi = (Yi1, Yi2, …, Yim)T, Xi = (XTi1, XTi2, …, XTim)T, and Zi = (ZTi1, ZTi2, …, ZTim)T.

2.1 Response Process Let μij = E(Yij|Xi, Zi) and υij = var(Yij|Xi, Zi) be the conditional expectation and variance of Yij, respectively, given covariates Xi and Zi. Consider the regression model

g(μij) = XTij βx + ZTij βz, (1)

where β = (βTx, βTz)T is the vector of regression parameters, and g(․) is a link function. If necessary, an intercept may be included in βz by adding a unit component to the covariates Zi. Furthermore, it is assumed that υij = h(μij; ϕ), where h(․;․) is a given function and ϕ is a dispersion parameter that is known or may be estimated. We treat ϕ as known here, with emphasis on estimation of the parameter β.

The model setup implicitly implies that E(Yij|Xi, Zi) = E(Yij|Xij, Zij) and var(Yij|Xi, Zi) = var(Yij|Xij, Zij)

for i = 1, 2, …, n and j = 1, 2, …, m. Given the model assumptions for the marginal mean and variance structures, it is natural to estimate β using generalized estimating equations (GEE) (Liang and Zeger 1986). For i = 1, 2, …, n, let μi = (μi1, …, μim)T, and let Di = ∂μi/∂βT be the matrix of derivatives of the mean vector μi with respect to β. Let Vi = Ai1/2 Ci Ai1/2 be the (conditional) covariance matrix of Yi, given covariates Xi and Zi, where Ai = diag(υij, j = 1, 2, …, m), Ci = [ρijk] is the correlation matrix with diagonal elements being 1, and ρijk is the (conditional) correlation coefficient of the response components Yij and Yik for j ≠ k, given covariates Xi and Zi. For i = 1, 2, …, n, define Ui(β) = DTi Vi−1 (Yi − μi), and

Φ(β) = Σni=1 Ui(β), (2)

where the dependence on the parameters of the correlation matrix is suppressed in the notation. If model (1) is correctly specified, then Φ(β) is a vector of unbiased estimating functions for β. Under regularity conditions, solving Φ(β) = 0 leads to a consistent estimator of β, where the correlation matrix Ci may be replaced by a moment estimate or, under a working independence assumption, Vi may be replaced by Ai (e.g., Liang and Zeger 1986).
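For illustration, solving Φ(β) = 0 can be sketched in a few lines under a working-independence correlation, a logit link, and complete, error-free data; with a canonical logit link and working independence, DTi Vi−1 (Yi − μi) reduces to XTi (Yi − μi), which the sketch exploits. Function and variable names below are our own, not the paper's implementation:

```python
import numpy as np

def gee_independence_logit(Y, X, n_iter=25):
    """Solve the GEE Phi(beta) = sum_i D_i^T V_i^{-1} (Y_i - mu_i) = 0
    under working independence and a logit link, by Fisher scoring.
    Y: (n, m) binary responses; X: (n, m, p) covariate array."""
    n, m, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        eta = X @ beta                       # (n, m) linear predictors
        mu = 1.0 / (1.0 + np.exp(-eta))      # marginal means mu_ij
        v = mu * (1.0 - mu)                  # Bernoulli variances (A_i diagonal)
        # With the canonical link, the estimating function collapses to X^T (Y - mu)
        score = np.einsum('ijp,ij->p', X, Y - mu)
        info = np.einsum('ijp,ij,ijq->pq', X, v, X)
        beta = beta + np.linalg.solve(info, score)
    return beta
```

The same skeleton extends to non-diagonal working correlations by inserting a moment estimate of Ci into Vi.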

2.2 Missing Data Process Let Rij be 1 if Yij is observed, and 0 otherwise, and let Ri = (Ri1, Ri2, …, Rim)T be the vector of (non-)missing data indicators, i = 1, 2, …, n. Drop-outs, or monotone missing data patterns, are considered here; that is, Rij = 0 implies Rij′ = 0 for all j′ > j. We assume that Ri1 = 1 for every subject i. To reflect the dynamic nature of the observation process over time, we assume that

P(Rij = 1 | R̃ij, Xi, Zi, Yi) = P(Rij = 1 | R̃ij, Xi, Zi, Ỹij),

where Ỹij = {Yi1, …, Yi,j−1} is the vector of the observed response components prior to time j, and R̃ij = {Ri1, …, Ri,j−1} is the history of the missing data indicators up to (but not including) time j. This assumption implies that the missingness probability depends on the observed responses but not on the unobserved response components, given the covariates, thus leading to a missing at random (MAR) mechanism, as defined by Little and Rubin (2002). This assumption is not essential to the development here, but merely ensures model identifiability for the missing data process; more discussion is included in Section 6. Denote λij = P(Rij = 1 | R̃ij, Xi, Zi, Yi) and πij = P(Rij = 1 | Xi, Zi, Yi). Note that, for the monotone pattern, πij = ∏jt=1 λit with λi1 = 1. Logistic regression models are commonly used to model the drop-out process (e.g., Diggle and Kenward 1994). Namely,
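Under the monotone pattern, the identity πij = ∏t≤j λit translates directly into a cumulative product over time. A small sketch (the function name is ours):

```python
import numpy as np

def marginal_observation_probs(lam):
    """Convert the one-step (conditional) observation probabilities
    lambda_ij into the unconditional probabilities pi_ij = prod_{t<=j} lambda_it,
    which supply the inverse-probability weights. lam: (n, m) array with
    lam[:, 0] == 1, since every subject is observed at time 1."""
    return np.cumprod(lam, axis=1)
```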

logit(λij) = uTij α, (3)

where uij is the vector consisting of the information from the covariates Xi and Zi and the observed responses prior to time j, and α is the vector of regression parameters.

Let Mi be the random drop-out time for subject i, and mi a realization, i = 1, 2, …, n. Define the likelihood contribution of the drop-out process for subject i as Li(α) = {∏mit=2 λit}(1 − λi,mi+1)I(mi<m), where λit is determined by model (3). Let

Si(α) = ∂ log Li(α)/∂α (4)

be the vector of score functions contributed by subject i. Write θ = (αT, βT)T and d = dim(θ).

2.3 Measurement Error Model Let Wij be an observed measurement of the covariate Xij, i = 1, 2, …, n; j = 1, 2, …, m, and write Wi = (WTi1, WTi2, …, WTim)T. Suppose the measurement error is postulated by a classical additive measurement error model; that is, conditional on the covariates (Xi, Zi) and the responses Yi,

Wij = Xij + eij, (5)

where the eij are assumed to follow N(0, Σe) with covariance matrix Σe (e.g., Carroll et al. 2006).

In the presence of measurement error, one often needs additional data sources, such as a validation subsample or repeated surrogate measurements, to estimate the parameters associated with the measurement error model (Carroll et al. 2006). If no such data are available, one may conduct sensitivity analyses based on background information about the measurement process to assess the impact of different degrees of measurement error on the estimation of β (e.g., Yi and Lawless 2007, Yi 2008). To focus the discussion on the parameter β of interest, we treat the error distribution parameters as known here; extensions accommodating estimation of these parameters are discussed in Section 6.
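For a sensitivity analysis one must be able to generate surrogates from model (5) under a posited Σe; a minimal sketch (the function name is ours) is:

```python
import numpy as np

def add_classical_error(X, Sigma_e, rng):
    """Generate surrogates W_ij = X_ij + e_ij with e_ij ~ N(0, Sigma_e),
    i.e. the classical additive error model (5). X: (n, m, q) array of
    true covariate values; returns an array of the same shape."""
    n, m, q = X.shape
    e = rng.multivariate_normal(np.zeros(q), Sigma_e, size=(n, m))
    return X + e
```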

3 Methodology

In this section we describe an algorithm that simultaneously conducts model selection and estimation while adjusting for the effects of missingness and measurement error. In particular, we use the inverse probability weighted GEE method to accommodate the effects of response incompleteness, and employ the simulation-extrapolation (SIMEX) procedure (Cook and Stefanski 1994) to reduce the bias induced by covariate measurement error. The algorithm consists of the following four steps.

Step 1. Simulation

Let K be a given positive integer (say, K = 500) and Ψ = {ψ1, ψ2, …, ψM} be a sequence of M (say, M = 10) non-negative numbers taken from [0, ψM] with ψ1 = 0 (say, ψM = 1). For i = 1, 2, …, n and j = 1, 2, …, m, generate eijk ~ N(0, Σe) for k = 1, 2, …, K, and set

Wij(k, ψ) = Wij + ψ1/2 eijk for ψ ∈ Ψ.
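Step 1 amounts to inflating the measurement error variance from Σe to (1 + ψ)Σe. A sketch of the remeasurement step (the function name is ours):

```python
import numpy as np

def simex_remeasure(W, Sigma_e, psi, K, rng):
    """Step 1 of SIMEX: create K remeasured covariate sets
    W_ij(k, psi) = W_ij + sqrt(psi) * e_ijk with e_ijk ~ N(0, Sigma_e),
    so that the total error variance becomes (1 + psi) * Sigma_e.
    W: (n, m, q) surrogates; returns an array of shape (K, n, m, q)."""
    n, m, q = W.shape
    e = rng.multivariate_normal(np.zeros(q), Sigma_e, size=(K, n, m))
    return W[None] + np.sqrt(psi) * e
```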

Step 2. Estimation Let Δi = diag{I(Rij = 1)/πij, j = 1, 2, …, m} be the weight matrix accommodating missingness, where I(․) is the indicator function. Define

Ui(θ) = DTi Vi−1 Δi (Yi − μi)

and

Qi(θ) = (STi(α), UTi(θ))T,

where the Si(α) are determined by (4). Let Qi(k, ψ; θ) be Qi(θ) with Xij replaced with the simulated covariates Wij(k, ψ), then solve

Σni=1 Qi(k, ψ; θ) = 0

for θ, and let θ̂(k, ψ) denote the resulting solution. Let Γn(k, ψ) = n−1 Σni=1 ∂Qi(k, ψ; θ)/∂θT and Bn(k, ψ) = n−1 Σni=1 Qi(k, ψ; θ)QTi(k, ψ; θ), evaluated at θ̂(k, ψ); then the covariance matrix of θ̂(k, ψ) is estimated by the sandwich formula

Σ̂(k, ψ) = n−1 Γn−1(k, ψ) Bn(k, ψ) {Γn−1(k, ψ)}T.

Consequently, we calculate

θ̂(ψ) = K−1 ΣKk=1 θ̂(k, ψ) and Σ̂(ψ) = K−1 ΣKk=1 Σ̂(k, ψ).

Step 3. Extrapolation For r = 1, 2, …, d, let θ̂r(ψ) be the rth element of θ̂(ψ). Then for each r fit a regression model to the sequence {(ψ, θ̂r(ψ)) : ψ ∈ Ψ} and extrapolate it to ψ = −1. Let θ̃r denote the extrapolated value θ̂r(−1), and write θ̃ = (θ̃1, …, θ̃d)T.
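For a quadratic regression function, the extrapolation step can be sketched as follows (our own helper, using np.polyfit for the fit):

```python
import numpy as np

def quadratic_extrapolate(psis, estimates):
    """Step 3 of SIMEX: fit a quadratic in psi to the averaged estimates
    and extrapolate to psi = -1, the no-measurement-error limit.
    psis: (M,) grid of psi values; estimates: (M,) values theta_hat_r(psi)."""
    a, b, c = np.polyfit(psis, estimates, deg=2)  # a*psi^2 + b*psi + c
    return a - b + c                               # the fitted value at psi = -1
```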

Next, apply the Cholesky decomposition to each Σ̂(ψ): Σ̂(ψ) = L̂(ψ)D̂(ψ)L̂T(ψ), where L̂(ψ) is a lower triangular matrix with diagonal entries 1, and D̂(ψ) is a diagonal matrix. Then fit, elementwise, a regression model to each of the sequences {(ψ, L̂(ψ)) : ψ ∈ Ψ} and {(ψ, D̂(ψ)) : ψ ∈ Ψ}, and extrapolate to ψ = −1. Let L̂(−1) and D̂(−1) respectively denote the lower triangular and diagonal matrices whose elements are the extrapolated values, and define

Σ̃ = L̂(−1) D̂(−1) L̂T(−1).

This procedure guarantees the positive definiteness of Σ̃.

Step 4. Variable Selection We define a quadratic loss function:

ℓ(θ) = (θ − θ̃)T Vn (θ − θ̃), (6)

where Vn is a pre-specified weight matrix. The choice of Vn is not unique: in principle, any positive definite matrix can serve as the weight matrix in forming (6), but in applications certain matrices are more suitable than others. When Σ̃ is nonsingular, we set Vn = Σ̃−1 in (6). If Σ̃ is nearly singular, we consider a ridge-type weight matrix, Vn = (Σ̃ + τnId)−1, where τn is a ridge parameter and Id is the d × d identity matrix.

The quadratic loss function is a second-order approximation to the negative of the true log-likelihood function, and plays a role similar to that of the log-likelihood in the penalized likelihood method. Let pλ(․) be a penalty function with tuning parameter λ. Define the penalized quadratic loss function

ℓp(θ) = ℓ(θ) + n Σdj=1 pλ(|θj|).

Motivated by the one-step sparse maximum likelihood estimate proposed by Zou and Li (2008), we use the weighted L1-penalty

n Σdj=1 p′λ(|θ̃j|) |θj|,

where θ̃j is the jth element of θ̃ and p′λ(u) is the first-order derivative of the penalty function pλ(u). Thus, the penalized quadratic loss function ℓp(θ) is convex, and the resulting minimization can be carried out easily with existing algorithms, such as the LARS algorithm (Efron et al. 2004). With a proper choice of the tuning parameter, the minimizer of ℓp(θ) with a weighted L1 penalty automatically has some zero components, hence achieving the purpose of variable selection. In the following numerical studies, we focus on the LASSO penalty (Tibshirani 1996), pλ(|θj|) = λ|θj|, and the SCAD penalty, whose first-order derivative is defined to be

p′λ(u) = λ sign(u) {I(|u| ≤ λ) + (aλ − |u|)+ / ((a − 1)λ) I(|u| > λ)},

where sign(u) is the sign function of u, and a = 3.7 (Fan and Li 2001). As demonstrated in the usual context with neither missing data nor measurement error, it is critical to choose a suitable value of the tuning parameter λ in order to achieve satisfactory performance of the selection procedure. We now describe an algorithm for choosing an optimal tuning parameter within a given set of candidates. To emphasize the dependence on λ, we denote θ̂(λ) = arg minθ ℓp(θ), and let dfλ be the number of nonzero elements in θ̂(λ). Define

BIC(λ) = ℓ(θ̂(λ)) + dfλ · log n.

Then the optimal tuning parameter λ* is chosen as the one with the smallest BIC(λ), and the corresponding estimate θ̂(λ*), denoted θ̂, is taken as the estimate of θ. The BIC tuning parameter selector can select a proper tuning parameter to achieve the oracle property in penalized least squares for linear models (Wang, Li and Tsai 2007) and in penalized likelihood for generalized linear models (Zhang, Li and Tsai 2010).
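Step 4 can be sketched end to end: a minimiser for the weighted-L1 penalized quadratic loss (we substitute cyclic coordinate descent with soft-thresholding for the LARS algorithm the paper points to), the SCAD derivative used to build the weights, and the BIC tuning-parameter selector. All function names are ours, and the sketch is illustrative rather than the paper's implementation:

```python
import numpy as np

def scad_deriv(u, lam, a=3.7):
    """First-order derivative of the SCAD penalty (Fan and Li 2001) for u >= 0."""
    return lam * ((u <= lam) + np.maximum(a * lam - u, 0.0) / ((a - 1) * lam) * (u > lam))

def penalized_quadratic_loss(theta_tilde, V, weights, n_sweeps=200):
    """Minimise (theta - theta_tilde)^T V (theta - theta_tilde)
    + sum_j weights[j] * |theta_j| by coordinate descent with
    soft-thresholding; weights[j] built from p'_lambda(|theta_tilde_j|)
    gives the weighted-L1 scheme."""
    theta = theta_tilde.astype(float).copy()
    b = V @ theta_tilde
    for _ in range(n_sweeps):
        for j in range(theta.size):
            # partial residual for coordinate j, holding the others fixed
            zj = b[j] - V[j] @ theta + V[j, j] * theta[j]
            theta[j] = np.sign(zj) * max(abs(zj) - weights[j] / 2.0, 0.0) / V[j, j]
    return theta

def select_by_bic(theta_tilde, V, weight_fn, lambdas, n):
    """Choose the lambda minimising BIC(lambda) = loss + df_lambda * log(n)."""
    best = None
    for lam in lambdas:
        th = penalized_quadratic_loss(theta_tilde, V, weight_fn(lam))
        loss = (th - theta_tilde) @ V @ (th - theta_tilde)
        bic = loss + np.count_nonzero(th) * np.log(n)
        if best is None or bic < best[0]:
            best = (bic, lam, th)
    return best[1], best[2]
```

A very small tuning parameter leaves all components nonzero, while an overly large one shrinks true signals; the BIC sweep trades the two off.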

Conceptually, the proposed method has the appeal of robustness: it requires minimal model assumptions for the response process, for which only the marginal mean structure needs to be modeled, and the distribution of the true covariates Xi is left unspecified. The proposed method extends ordinary SIMEX methods by including additional steps of missingness correction and model selection. Our approach is straightforward to implement but requires specifying the values of K, M and ψM. In principle, large values of K and M help reduce the Monte Carlo sampling error induced in the simulation step. Another source of uncertainty is the choice of regression function in the extrapolation step. Experience shows that the proposed method often works reasonably well when ψM is specified to be 1 or 2 and a quadratic regression function is used for extrapolation.

4 Asymptotic Results To establish asymptotic results for the penalized quadratic loss estimator θ̂, certain conditions are required. Since the penalized quadratic loss function ℓp(θ) involves a weight matrix Vn, a penalty function pλ(·) and the SIMEX estimator θ̃, the required conditions pertain to these quantities. Specifically, in addition to standard conditions, the following assumptions are imposed for establishing the asymptotic properties of θ̂:

A. As n → ∞, n−1 Vn converges in probability to V, where V is a positive definite matrix;

B. As n → ∞, n1/2(θ̃ − θ0) converges in distribution to N(0, Σ), where Σ is a positive definite matrix;

C. Define an = max{p′λn(|θr0|) : θr0 ≠ 0} and bn = max{|p″λn(|θr0|)| : θr0 ≠ 0}, where θ0 = (θ10, …, θd0)T is the true value of θ, and the dependence of λ on the sample size n is explicitly indicated by λn. Assume that both an and bn tend to 0 as n → ∞.

Assumption A ensures proper behavior of the weight matrix, and is easily satisfied by a suitable choice of Vn. Assumption C is a standard condition for model selection and is assumed by many authors (e.g., Fan and Li 2001, Zou and Li 2008). Assumption B requires the SIMEX estimator θ̃ to have an asymptotic normal distribution. This assumption is not as stringent as it appears: Carroll et al. (1996) established the asymptotic normality of the SIMEX estimator under the condition that the extrapolation function is known.

Using the techniques of Fan and Li (2001), we can show that, with probability tending to 1, the estimator θ̂ satisfies

‖θ̂ − θ0‖ = Op(n−1/2 + an).

This indicates how the convergence rate of θ̂ depends on the tuning parameter λn as well as on the penalty function. To ensure n1/2-consistency of the estimator, one needs an = Op(n−1/2), often achieved by specifying a very small tuning parameter λn. We next derive the asymptotic distribution of θ̂ and establish its oracle property.

Suppose some components of θ0 are zero, and write θ0 = (θTI0, θTII0)T so that θI0 contains all nonzero elements and θII0 contains all zero elements. For ease of exposition, we assume θI0 contains the first d1 elements and θII0 the last d2 components, with d1 + d2 = d. We partition V into 2 × 2 blocks with Vkl being a dk × dl matrix for k, l = 1, 2, and define V1 = [V11, V12]. Theorem 1 Assume the preceding regularity conditions. If

λn → 0 and n1/2 λn → ∞ as n → ∞, then with probability tending to 1, any n1/2-consistent estimator θ̂ = (θ̂TI, θ̂TII)T obtained from the foregoing algorithm must have the properties that

1. θ̂II = 0;

2. n1/2(θ̂I − θI0) converges in distribution to N(0, V11−1 V1 Σ VT1 V11−1) as n → ∞.

The theorem suggests that the proposed procedure can correctly identify the significant variables and remove the irrelevant ones. Furthermore, it shows that the proposed estimator θ̂ is often more efficient than the SIMEX estimator θ̃, which is particularly the case when the limit matrix V equals Σ−1. In this instance, V1Σ = [Id1, 0d1d2], where 0d1d2 is a d1 × d2 zero matrix, and hence the asymptotic covariance matrix of n1/2(θ̂I − θI0) reduces to V11−1. If we partition Σ in the same way as V, and let Σkl denote the corresponding blocks for k, l = 1, 2, then

V11−1 = Σ11 − Σ12 Σ22−1 Σ21.

Therefore, Σ11 − V11−1 = Σ12 Σ22−1 Σ21 is positive definite, implying that θ̂I is more efficient than θ̃I, where θ̃I is the SIMEX estimator of θI. We conclude this section with a simple but informative example to illustrate the rationale for constructing the quadratic loss function (6) for variable selection.

Example Suppose that {(Xi, Yi) : i = 1, …, n} is a random sample from the linear model

Yi = XTi θ + εi,

where θ is the regression parameter vector of dimension d, and the error term εi ~ N(0, σ2). Without loss of generality, assume σ2 = 1 for simplicity. Let Y = (Y1, …, Yn)T and let X = (X1, …, Xn)T be the n × d matrix of covariates. Using the least squares method to estimate θ, the resulting estimator is θ̃ = (XTX)−1XTY, which has the properties

E(θ̃) = θ0 and cov(θ̃) = (XTX)−1,

where θ0 is the true value of θ.

Note that

‖Y − Xθ‖2 = ‖Y − Xθ̃‖2 + (θ − θ̃)T XTX (θ − θ̃). (7)

Since the first term ‖Y − Xθ̃‖2 on the right-hand side of (7) does not involve the parameter θ, to perform variable selection it is natural to approximate the left-hand side of (7) by the second term on the right-hand side, which is the quadratic loss function defined in (6) with the weight matrix taken as Vn = {cov(θ̃)}−1 = XTX.
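The decomposition (7) holds exactly, by the orthogonality of the least squares residuals to the columns of X, and can be checked numerically. The following sketch (our own code) returns the difference between the two sides, which is zero up to rounding:

```python
import numpy as np

def identity_gap(seed=1):
    """Numerically verify decomposition (7):
    ||Y - X theta||^2 = ||Y - X theta_tilde||^2
                      + (theta - theta_tilde)^T X^T X (theta - theta_tilde),
    so minimising the quadratic loss with V_n = X^T X is equivalent to
    minimising the residual sum of squares. Returns lhs - rhs."""
    rng = np.random.default_rng(seed)
    n, d = 50, 3
    X = rng.normal(size=(n, d))
    y = X @ np.array([1.0, 0.0, -2.0]) + rng.normal(size=n)
    theta_ls = np.linalg.solve(X.T @ X, X.T @ y)  # the estimator theta_tilde
    theta = rng.normal(size=d)                     # an arbitrary theta
    lhs = (y - X @ theta) @ (y - X @ theta)
    quad = (theta - theta_ls) @ (X.T @ X) @ (theta - theta_ls)
    rhs = (y - X @ theta_ls) @ (y - X @ theta_ls) + quad
    return lhs - rhs
```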

We then define the penalized quadratic loss function and follow the Step 4 procedure of Section 3 to perform variable selection. Let θ̂ denote the resulting estimator. This estimator has the properties stated in Theorem 1. In fact, according to Fan and Li (2001), the oracle estimator is θ̂II = 0 and

θ̂I = (XTI XI)−1 XTI Y,

with asymptotic normality

θ̂I − θI0 ~ N(0, (XTI XI)−1),

where XI is the n × d1 submatrix of X corresponding to θI in the linear model, and V11 = XTI XI is the d1 × d1 submatrix of XTX corresponding to θI.

5 Numerical Studies 5.1 Analysis of NPHS Data

The National Population Health Survey (NPHS) is a longitudinal study conducted every second year, beginning in 1994/1995. The survey questionnaire is designed to collect health information and related socio-demographic information by following a group of Canadian household residents. The NPHS questions cover many aspects of in-depth health information, such as health status, use of health services, chronic conditions and activity restrictions. Social background questions, including age, sex and income level, are also contained in the questionnaire. A primary objective of the NPHS is to understand how population health is influenced by multiple covariates or risk factors. For instance, distress is an important variable associated with an individual's health profile. It is measured by the Distress Scale K6 in the NPHS data set, ranging from 0 to 24, with a higher value indicating more severe distress. A binary variable is created to capture distress status, with 0 indicating being distress free and 1 otherwise.

We analyze the data from the first 5 cycles, carried out in 1994/1995 (Cycle 1), 1996/1997 (Cycle 2), 1998/1999 (Cycle 3), 2000/2001 (Cycle 4), and 2002/2003 (Cycle 5). The data set contains the measurements for 1089 subjects. Covariates include: (1) body mass index (BMI); (2) health status, measured by the Health Utilities Index Mark (HUI) from eight attributes: vision, hearing, speech, ambulation, dexterity, emotion, cognition, and pain and discomfort. The HUI score ranges from −0.36 to 1.00, with 1.00 standing for being completely healthy and −0.36 representing the lowest level of health; (3) household income (INC), measured by the provincial level of household income. INC ranges from 1 to 10, where 1 denotes household income in the poorest province,

while 10 denotes household income in the richest province; (4) education, characterized by the highest education level of a subject. Dummy variables are used to represent education level, with "Lower than Middle School" as the baseline, EDU1 representing "Middle School" and EDU2 corresponding to "College Level"; (5) marital status, indicated by a binary variable MA, with value 1 for being married or living common-law, and 0 for other situations, including being widowed, separated, divorced or single (never married); and (6) age. Let Yij be the distress status for subject i at Cycle j, and μij be the mean of Yij conditional on the covariates, i = 1, …, 1089; j = 1, …, 5. The responses and covariates are postulated by the logistic regression model

logit(μij) = β0 + βx1Xij1 + βx2Xij2 + βx3Xij3 + βz1Zij1 + βz2Zij2 + βz3Zij3 + βz4Zij4,

where for subject i at Cycle j, Xij1 is the standardized body mass index, Xij2 is the standardized health utility index, Xij3 is the standardized provincial income rank, Zij1 is the marital status, Zij2 is high-school-level education (0-No; 1-Yes), Zij3 is college-level education (0-No; 1-Yes), and Zij4 is age. Among all the covariates, the Z variables are precisely measured, but the X covariates, i.e., BMI, HUI, and INC, are subject to measurement error. Let Xij = (Xij1, Xij2, Xij3)T denote the vector of true covariate values, and Wij = (Wij1, Wij2, Wij3)T be the vector of the observed

measurements. Assume the error model is given by (5) with covariance matrix Σe = D1/2 C D1/2, where D = diag(σ21, σ22, σ23) holds the marginal error variances and C is the correlation matrix with diagonal elements 1. As there is no additional data source, such as a validation sample, available to estimate this covariance matrix, we conduct sensitivity analyses to understand the effects of error on model selection. In particular, we specify each error variance σ2k as (0.15 × a)2 and scale the off-diagonal elements of C by a factor c, for the values c = 0, 1/4, 1, 4 and a = 1/4, 1/2, 1, 2. When c = a = 1, Σe is the empirical estimate obtained from the data.

In this data set, a portion of subjects dropped out of the study: the drop-out percentages are 14.1%, 24.7%, 37.4%, and 46.6% at Cycles 2, 3, 4 and 5, respectively. The drop-out process is characterized by the logistic regression model

logit(λij) = α0 + αyYi,j−1 + αx1Xij1 + αx2Xij2 + αx3Xij3 + αz1Zij1 + αz2Zij2 + αz3Zij3 + αz4Zij4,

where α0, αy, αx1, αx2, αx3, αz1, αz2, αz3 and αz4 are regression parameters, and the dependence of the drop-out probability at Cycle j on the response at the previous cycle features a possible missing at random mechanism.

We analyze the data with the selection and inference method described in Section 3. In particular, we use K = 200 and Ψ = {0, 0.25, 0.50, 0.75, 1.0, 1.25, 1.50, 1.75, 2.0}, which equally divides [0, 2] into 8 subintervals. A quadratic regression function is used in the extrapolation step. We consider both LASSO-type and SCAD-type penalty functions when applying the proposed method; for comparison, we also report the estimation results obtained from the SIMEX method alone (i.e., the first three steps in Section 3). Table 1 displays the analysis results for the case with a = c = 1.

Although the selection methods based on the SCAD and LASSO penalties result in different estimates, they select the same covariates, HUI Xij2 and age Zij4, for the response model, and the associated standard errors are quite close. The results suggest that the health utility index has a positive effect on reducing distress: individuals with a higher HUI are less likely to experience distress, and older subjects are also less likely to experience distress. The results obtained from the SIMEX method alone agree fairly well with the selection results, except for the estimate of the marital effect; this agreement is reflected in the insignificance of the estimates for the unselected variables. Regarding variable selection for the drop-out process, the SCAD and LASSO penalties yield the same results, indicating that none of the covariates is significant. Both methods suggest that the dependence on the previous response measurement is not significant, which is consistent with data missing completely at random. Again, the results from the SIMEX method alone are in reasonable agreement with these findings, except for slight evidence of significance for EDU1.

In Figure 1, we further display the sensitivity analysis results for the HUI covariate. The dark lines represent the estimates; the grey lines denote the corresponding upper and lower bounds of the 95% confidence intervals. The results for the age covariate are not reported to save space; the pattern is quite similar to that of HUI, with smaller variation in magnitude. Finally, we comment that measurement error model (5) is used to describe the error in the income covariate even though this variable is categorical in the data analysis. Since the income variable Xij3 is defined by categorizing a latent continuous variable into 10 classes, we conducted additional simulation studies to assess the impact of using error model (5) for Xij3 on the inference results. Our simulation studies (not reported here) offer some reassurance that assuming error model (5) for this rank-type covariate does not noticeably affect model selection or estimation of the covariate effects, although the variance estimate for the income variable appears somewhat conservative.

5.2 Simulation Study: Performance of the Proposed Method We conduct simulation studies to assess the performance of the proposed methods under various circumstances. The implementation code is available from the second author upon request. We consider the configuration of m = 5 and d1 = d2 = 5. Two sample sizes with n = 200 and 500 are considered. One thousand simulations are performed for each parameter setting.

Error-contaminated covariates Xij are generated from a d1-dimensional multivariate normal distribution N(μx, Σx), and error-free covariates Zij are independently generated from a d2-dimensional multivariate normal distribution N(μz, Σz). Surrogates Wij are generated by adding an error term to Xij: Wij = Xij + eij with eij ~ N(0, Σe). We set μx = 0.5 · 1d1 and μz = −0.2 · 1d2, where 1r is the r × 1 unit vector. Σx is taken as a d1 × d1 matrix with (i, j) element σx^2 ρx^|i−j|, Σz is taken as a d2 × d2 matrix with (i, j) element σz^2 ρz^|i−j|, and Σe is taken as a d1 × d1 matrix with (i, j) element σe^2 ρe^|i−j|, where we set σx = σz = 1 and ρx = ρz = 0.5. We set σe = 0.15, 0.50, 0.70, and ρe = 0, 0.1, 0.5 to feature various scenarios of measurement error. Binary response vectors Yi = (Yi1, Yi2, …, Yim)T are generated from the joint density function

f(yi | Xi, Zi) = {∏_{j=1}^m μij^{yij} (1 − μij)^{1−yij}} · {1 + Σ_{j<k} ρijk (yij − μij)(yik − μik)/√(υij υik)},  (8)

where ρijk is the correlation coefficient of Yij and Yik (j ≠ k), given by ρijk = (μijk − μij μik)/√(υij υik); the mean μij = E(Yij | Xi, Zi) is modeled as

logit(μij) = β0 + Xij^T βx + Zij^T βz;  (9)

the variance is υij = μij(1 − μij); and μijk = P(Yij = 1, Yik = 1 | Xi, Zi) is determined by the marginal probabilities μij and μik, together with the odds ratio ψijk between Yij and Yik, which is modeled as

log ψijk = ϕ.  (10)

Set β0 = −1.6, βx = (3.0, 0, 0, 1.5, 0)T, βz = (0, 0, −1.5, 0, 0)T, and ϕ = 0.1. For the missing data process, we generate monotone missing data indicators Ri = (Ri1, Ri2, …, Rim)T according to the conditional probability λij = P(Rij = 1 | Ri,j−1 = 1, Yi, Xi, Zi). Specifically, we consider the model

logit(λij) = α0 + α1 yi,j−1 + Xij^T αx + Zij^T αz,

where the parameters are set as α0 = 0.5, α1 = 1.5, αx = (0, −1.5, 0, 0, 0)T, and αz = (3.0, 0, 0, 0, 0)T. In performing the SIMEX approach, we use K = 200 and Ψ = {0, 0.25, 0.50, 0.75, 1.0, 1.25, 1.50, 1.75, 2.0}, which equally divides [0, 2] into 8 subintervals, and a quadratic regression function is used in the extrapolation step.
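As a rough illustration of the simulation and extrapolation steps just described, the following sketch applies SIMEX to a naive slope estimator on toy data. The grid Ψ and K = 200 match the text; the function `naive_estimate`, the toy data, and the true slope of 2.0 are our own illustrative assumptions, not the paper's penalized GEE estimator.

```python
import numpy as np

rng = np.random.default_rng(7)

def naive_estimate(w):
    # Stand-in for the real estimator applied to error-prone data;
    # a least-squares slope whose attenuation grows with the error variance.
    return np.cov(w, y_obs)[0, 1] / np.var(w)

# Toy data: true covariate x, surrogate w = x + e with known error variance.
n, sigma_e2 = 500, 0.5 ** 2
x = rng.normal(size=n)
y_obs = 2.0 * x + rng.normal(size=n)
w = x + rng.normal(scale=np.sqrt(sigma_e2), size=n)

lambdas = np.arange(0.0, 2.25, 0.25)   # the grid Psi = {0, 0.25, ..., 2.0}
K = 200                                # remeasurements per grid point

theta_bar = []
for lam in lambdas:
    # Simulation step: inflate the measurement error variance by (1 + lam).
    ests = [naive_estimate(w + np.sqrt(lam * sigma_e2) * rng.normal(size=n))
            for _ in range(K)]
    theta_bar.append(np.mean(ests))

# Extrapolation step: quadratic in lam, evaluated at lam = -1 (no error).
coefs = np.polyfit(lambdas, theta_bar, deg=2)
theta_simex = np.polyval(coefs, -1.0)
print(theta_simex)  # closer to the true slope 2.0 than the naive fit at lam = 0
```

The naive estimate at λ = 0 is attenuated toward zero; extrapolating the fitted quadratic back to λ = −1 largely removes that attenuation.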


To evaluate the performance of model selection, we compute the square root of the mean squared error (RMSE) for both the reduced model (i.e., the model with the selected variables only) and the full model using the SIMEX method, and then calculate the ratio of these two RMSE values. The median of the ratios over the 1000 simulations is reported under the column “MRMSE”; a smaller value indicates better performance in terms of parameter estimation. We also calculate the specificity and sensitivity for each simulation. Specificity is defined as the proportion of zero coefficients that are correctly estimated to be zero; sensitivity is the proportion of nonzero coefficients that are estimated to be nonzero. The averages of both indices over the 1000 simulations are listed under the corresponding column headings in the tables. In addition, under-fit is defined as the proportion of replications that exclude at least one significant coefficient, whereas exact fit is the proportion of replications that select exactly the true sub-model. The proportions of including all significant variables together with some (1, 2, or ≥ 3) nonsignificant variables are reported in the columns under over-fit. The results are reported in Tables 2 and 3. As expected, the model selection performance, in terms of sensitivity, specificity, and the exact fit rate, decays as the degree of measurement error becomes more substantial. The performance of model selection is also affected by the sample size: in good agreement with the theoretical results, a larger sample size tends to yield better results. This trend is observed in all situations, regardless of whether the measurement errors are correlated or uncorrelated.
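The criteria above are mechanical to compute once the true and estimated coefficient vectors are in hand. The sketch below is a minimal illustration for a single replication; the function name and the numerical values of `beta_hat` are ours, with `beta_true` taken from the simulation design.

```python
import numpy as np

def selection_metrics(beta_true, beta_hat, tol=1e-8):
    """Specificity, sensitivity, and exact-fit indicator for one replication."""
    true_zero = np.abs(beta_true) < tol
    est_zero = np.abs(beta_hat) < tol
    specificity = np.mean(est_zero[true_zero])       # zeros estimated as zero
    sensitivity = np.mean(~est_zero[~true_zero])     # nonzeros kept nonzero
    exact_fit = bool(np.all(est_zero == true_zero))  # exactly the true sub-model
    return specificity, sensitivity, exact_fit

beta_true = np.array([3.0, 0.0, 0.0, 1.5, 0.0])  # beta_x from the design
beta_hat  = np.array([2.7, 0.0, 0.2, 1.4, 0.0])  # a hypothetical fitted value
print(selection_metrics(beta_true, beta_hat))    # specificity 2/3, sensitivity 1, no exact fit
```

Averaging these quantities over the 1000 replications gives the entries reported in the tables.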


Next, we evaluate how measurement error and missingness may differently affect the performance of model selection. The results are displayed in Table 4, where Setting I refers to scenarios with measurement error only and Setting II corresponds to situations with missingness only. Although the results for Setting I show the same trends as those observed before, they are fairly insensitive to the degree of measurement error and are quite satisfactory. On the other hand, the results for Setting II demonstrate that, for the configurations we consider, missingness has a more noticeable impact on model selection than measurement error does.

5.3 Simulation Study: Comparison of the Proposed Method with a Two-Stage Method

When the dimension of covariates is not very large, one may pursue an alternative method, as pointed out by a referee. This method consists of two steps: in the first step, we apply the SIMEX algorithm to the full dataset with all covariates included; in the second step, we remove those covariates with p-values greater than a threshold, say 0.05, and then re-apply the SIMEX algorithm to the data with the remaining covariates.
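The screening performed in the second stage amounts to Wald tests on the stage-one output. Below is a minimal sketch of that step; the covariate names, estimates, and standard errors are hypothetical values of our own, not results from the paper, and the stage-two SIMEX refit is not shown.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical stage-one SIMEX output: estimates and standard errors.
names = np.array(["BMI", "HUI", "INC", "Marital", "EDU1"])
est   = np.array([-0.02, -0.55, -0.08, -0.31, -0.14])
se    = np.array([ 0.04,  0.07,  0.05,  0.10,  0.13])

# Wald p-values; retain covariates significant at the 0.05 level,
# then stage two re-runs SIMEX on this reduced covariate set.
pvals = 2 * norm.sf(np.abs(est / se))
keep = names[pvals < 0.05]
print(keep)
```

With these illustrative numbers, only the covariates whose estimates exceed roughly two standard errors survive the screening.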


We run simulation studies to compare the performance of the proposed method and this two-step algorithm. Using the same data generation procedure as in Section 5.2, we generate data with sample size n = 200 and run 300 simulations for each of three model configurations: (A) ρe = 0.5, σe = 0.24; (B) ρe = 0.5, σe = 0.33; and (C) ρe = 0.5, σe = 0.41, with both drop-out and measurement error involved. We compare the model selection performance of the SIMEX approach and the 2-stage approach. The simulation results are reported in Table 5. In terms of RMSE, the 2-stage


approach seems to outperform the SIMEX approach: it tends to produce smaller RMSE values. However, the SIMEX approach performs better than the 2-stage procedure in terms of the rate of exact fit; the exact fit rate of the 2-stage approach can be about 30% lower than that of the proposed SIMEX approach. Overall, both methods work generally well.

6 Discussion


Missing observations and measurement error arise frequently in longitudinal studies and present considerable challenges for inference. In this paper, we develop simultaneous model selection and estimation procedures for longitudinal data with these features. The development is cast in a unified framework of marginal generalized linear models, which is attractive in that the full distribution of the response process is not needed. For the case where both missing data and measurement error exist, we explore a simulation-based approach that applies to many applications, and we establish theoretical properties of the resulting estimator. In contrast to naive analyses that ignore measurement error or missingness, numerical studies indicate reasonable performance of the proposed method.


In terms of model selection consistency, Leeb and Pötscher (2008) raised the super-efficiency issue of sparse estimators in the linear model framework and implied that a model-selection-consistent procedure must suffer from a lack of uniform consistency. As pointed out by a referee, in spite of this aspect, it is debatable whether model selection consistency is a useful criterion. The Akaike information criterion (AIC) (Akaike 1974) and the Bayesian information criterion (BIC) (Schwarz 1978) are perhaps the most popular classical variable selection criteria. Model selection criteria are commonly classified into two categories: efficient criteria, such as the AIC, and consistent criteria, such as the BIC. An efficient criterion selects the model whose average squared error is asymptotically equivalent to the minimum offered by the candidate models, which are assumed to approximate the true model. A consistent criterion identifies the true model with probability approaching 1 as the sample size increases, under the assumption that the set of candidate models contains the true model. A detailed discussion of efficiency and consistency is provided by Shao (1997).
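The contrast between the two penalties can be made concrete for Gaussian linear models, where AIC charges 2 per parameter and BIC charges log(n) per parameter on the −2 log-likelihood scale. The sketch below is a self-contained illustration of our own, not an analysis from the paper; the data and candidate models are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1, x2 = rng.normal(size=(2, n))
y = 1.0 + 2.0 * x1 + rng.normal(size=n)  # true model involves x1 only

def ic(X, y):
    """Gaussian AIC and BIC for a linear model with design matrix X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    k = X.shape[1]
    loglik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    return (-2 * loglik + 2 * k,          # AIC: penalty 2k (efficient)
            -2 * loglik + np.log(n) * k)  # BIC: penalty k log(n) (consistent)

ones = np.ones((n, 1))
candidates = {
    "x1":      np.column_stack([ones, x1]),
    "x1 + x2": np.column_stack([ones, x1, x2]),
}
for name, X in candidates.items():
    print(name, ic(X, y))
```

Because log(100) ≈ 4.6 > 2, BIC charges the superfluous covariate x2 more heavily than AIC does, which is the mechanism behind its selection consistency.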


Generally speaking, an efficient model selection procedure such as the AIC leads to an estimate possessing uniform consistency, while a consistent selection criterion (e.g., the BIC) yields an estimate that lacks uniform consistency. The oracle property established in this paper implies that our proposed variable selection procedure is model-selection consistent and thus suffers from the lack of uniform consistency. The tuning parameter λ in the penalized quadratic loss function controls the model complexity. When implementing the BIC tuning parameter selector, it is important to set a suitable range for the tuning parameter so that the selected λ is located around the middle of the range; otherwise, significant risk inflation may arise, as observed by Leeb and Pötscher (2008). In the algorithm described in Section 3, the measurement error covariance matrix Σe is assumed to be known. This treatment is particularly useful in many practical settings: often, no additional data sources are available to determine the parameters associated with the


measurement error model, and one must conduct sensitivity analyses to evaluate the impact of varying degrees of measurement error on inference. In situations where the parameters in the covariance matrix Σe can be estimated from additional data, such as a validation subsample or repeated surrogate measurements, the proposed methods can be readily adapted by modifying the simulation step, as suggested by Carroll et al. (2006).
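When two replicate surrogate measurements are available, a standard components-of-variance argument gives an estimate of Σe: the difference of replicates has covariance 2Σe. The sketch below illustrates this on synthetic data; the dimensions, sample size, and the particular Σe are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 400, 3
Sigma_e = np.array([[0.25, 0.05, 0.00],
                    [0.05, 0.25, 0.05],
                    [0.00, 0.05, 0.25]])
x = rng.normal(size=(n, d))             # true covariates (unobserved)
L = np.linalg.cholesky(Sigma_e)
w1 = x + rng.normal(size=(n, d)) @ L.T  # replicate surrogate 1
w2 = x + rng.normal(size=(n, d)) @ L.T  # replicate surrogate 2

# W1 - W2 = e1 - e2 has covariance 2 * Sigma_e, so half of the
# sample covariance of the differences estimates Sigma_e.
diff = w1 - w2
Sigma_e_hat = np.cov(diff, rowvar=False) / 2.0
print(np.round(Sigma_e_hat, 2))
```

The unobserved Xij cancels in the difference, which is why no validation data on the true covariates are needed for this estimate.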


In formulating the methods, we use the inverse probability weighting scheme to adjust for the bias induced by missing observations, and we model the missing data indicators with the widely used logistic regression under the MAR assumption. The MAR assumption allows us to express the missingness probability as a function of the observed responses together with the covariates. This assumption is commonly adopted in marginal analysis to ensure model identifiability for the missing data process, a necessary condition for estimating the associated parameters (e.g., Robins, Rotnitzky and Zhao 1995). The MAR assumption is, however, not essential to the development of our methods. If the probability of missingness depends on unobserved response components, nonidentifiability may arise, and in this case conducting sensitivity analyses is a viable strategy. Our methods can be applied to handle this scenario as well.
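For monotone drop-out, the inverse weights are formed from the fitted visit-wise conditional probabilities: the probability of still being observed at visit j is the running product of the fitted λik, and observed visits receive its reciprocal as a weight. The function and toy numbers below are a sketch of our own, not the authors' implementation.

```python
import numpy as np

def ipw_weights(r, lam_hat):
    """Inverse-probability weights for monotone drop-out under MAR.

    r:       (n, m) 0/1 observation indicators, monotone within rows
    lam_hat: (n, m) fitted P(R_ij = 1 | R_{i,j-1} = 1, history)
    """
    pi = np.cumprod(lam_hat, axis=1)  # P(still observed at visit j)
    return r / pi                     # zero weight for missing visits

# Toy example with m = 3 visits: subject 2 drops out after visit 2.
r = np.array([[1, 1, 1],
              [1, 1, 0]])
lam_hat = np.array([[1.0, 0.9, 0.8],
                    [1.0, 0.8, 0.5]])
print(ipw_weights(r, lam_hat))
```

Visits with low fitted retention probability receive large weights, so that the observed subjects stand in for similar subjects who dropped out.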


In circumstances where only the marginal structures are modeled, correction of missingness and measurement error effects is commonly carried out sequentially. In this paper our methods correct first for missingness effects and then for error effects. This ordering is driven by the way the missing data process is modeled: the missingness probability is modeled as a function of the error-prone covariates Xi, together with the responses Yi and the precisely measured covariates Zi. Such a modeling scheme for the missing data process is, however, not essential for our proposed method. In some applications, the missingness probability is modeled as a function of the surrogate measurements Wi, along with Yi and Zi. For example, Yi, Ma and Carroll (2012) considered this modeling scheme and developed marginal methods that first adjust for error effects and then correct for missingness effects. Our model selection procedures can be modified to accommodate the case where the missingness probabilities are modeled as functions of the surrogate measurements Wi as well as Yi and Zi. It may be interesting to develop a sensible measure to compare the performance of model selection under these two modeling schemes for missing data processes.

Acknowledgements


The authors thank the referees for their helpful comments, which improved the presentation of this work. The authors are grateful to Statistics Canada/The Research Data Centre Network for permission to use the data from the National Population Health Survey (NPHS), and thank Haocheng Li for his help in accessing the data. Yi’s research was supported by the Natural Sciences and Engineering Research Council of Canada. Tan’s research was supported by NIDA grant P50-DA10075. Li’s research was supported by NSF grant DMS 1512422 and NIDA, NIH grants P50 DA036107 and P50 DA039838. The content is solely the responsibility of the authors and does not necessarily represent the official views of NIDA, NIH or NSF.

References

1. Akaike H. A new look at the statistical model identification. IEEE Transactions on Automatic Control. 1974; 19:716–723.


2. Bondell HD, Krishna A, Ghosh SK. Joint variable selection for fixed and random effects in linear mixed-effects models. Biometrics. 2010; 66:1069–1077. [PubMed: 20163404]
3. Carroll RJ, Lombard F, Küchenhoff H, Stefanski LA. Asymptotics for the SIMEX estimator in structural measurement error models. Journal of the American Statistical Association. 1996; 91:242–250.
4. Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. Measurement Error in Nonlinear Models. 2nd ed. Chapman & Hall; 2006.
5. Cook JR, Stefanski LA. Simulation-extrapolation in parametric measurement error models. Journal of the American Statistical Association. 1994; 89:1314–1328.
6. Diggle P, Kenward MG. Informative drop-out in longitudinal data analysis (with discussion). Applied Statistics. 1994; 43:49–93.
7. Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Annals of Statistics. 2004; 32:407–499.
8. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001; 96:1348–1360.
9. Fan J, Li R. New estimation and model selection procedures for semiparametric modeling in longitudinal data analysis. Journal of the American Statistical Association. 2004; 99:710–723.
10. Fan J, Li R. Statistical challenges with high dimensionality: Feature selection in knowledge discovery. In: Sanz-Solé M, Soria J, Varona JL, Verdera J, editors. Proceedings of the International Congress of Mathematicians. Vol. III. Zurich: European Mathematical Society; 2006. p. 595–622.
11. Fu WJ. Penalized estimating equations. Biometrics. 2003; 59:126–132. [PubMed: 12762449]
12. Geyer C. On the asymptotics of constrained M-estimation. Annals of Statistics. 1994; 22:1993–2010.
13. Knight K, Fu W. Asymptotics for lasso-type estimators. Annals of Statistics. 2000; 28:1356–1378.
14. Lai TL, Small D. Marginal regression analysis of longitudinal data with time-dependent covariates: a generalized method of moments approach. Journal of the Royal Statistical Society, Ser. B. 2007; 69:79–99.
15. Liang H, Li R. Variable selection for partially linear models with measurement errors. Journal of the American Statistical Association. 2009; 104:234–248.
16. Liang K-Y, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986; 73:13–22.
17. Lin X, Carroll RJ. Semiparametric regression for clustered data using generalized estimating equations. Journal of the American Statistical Association. 2001; 96:1045–1056.
18. Lin X, Carroll RJ. Semiparametric estimation in general repeated measures problems. Journal of the Royal Statistical Society, Ser. B. 2006; 68:69–88.
19. Little RJA, Rubin DB. Statistical Analysis with Missing Data. 2nd ed. New York: Wiley; 2002.
20. Ma Y, Li R. Variable selection in measurement error models. Bernoulli. 2010; 16:274–300. [PubMed: 20209020]
21. Miller A. Subset Selection in Regression. 2nd ed. Boca Raton, Florida: Chapman & Hall/CRC; 2002.
22. Ni X, Zhang D, Zhang HH. Variable selection for semiparametric mixed models in longitudinal studies. Biometrics. 2010; 66:79–88. [PubMed: 19397585]
23. Robins JM, Rotnitzky A, Zhao LP. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association. 1995; 90:106–121.
24. Schwarz G. Estimating the dimension of a model. Annals of Statistics. 1978; 6:461–464.
25. Shao J. An asymptotic theory for linear model selection. Statistica Sinica. 1997; 7:221–264.
26. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B. 1996; 58:267–288.
27. Tsiatis AA, DeGruttola V, Wulfsohn MS. Modeling the relationship of survival to longitudinal data measured with error. Applications to survival and CD4 counts in patients with AIDS. Journal of the American Statistical Association. 1995; 90:27–37.


28. Wang CY, Huang Y, Chao EC, Jeffcoat MK. Expected estimating equations for missing data, measurement error, and misclassification, with application to longitudinal nonignorable missing data. Biometrics. 2008; 64:85–95. [PubMed: 17608787]
29. Wang H, Li R, Tsai C-L. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika. 2007; 94:553–568. [PubMed: 19343105]
30. Yi GY. A simulation-based marginal method for longitudinal data with dropout and mismeasured covariates. Biostatistics. 2008; 9:501–512. [PubMed: 18199691]
31. Yi GY, Lawless JF. A corrected likelihood method for the proportional hazards model with covariates subject to measurement error. Journal of Statistical Planning and Inference. 2007; 137:1816–1828.
32. Yi GY, Reid N. A note on mis-specified estimating functions. Statistica Sinica. 2010; 20:1749–1769.
33. Yi GY, Ma Y, Carroll RJ. A functional generalized method of moments approach for longitudinal studies with missing responses and covariate measurement error. Biometrika. 2012; 99:151–165.
34. Zhang Y, Li R, Tsai C-L. Regularization parameter selections via generalized information criterion. Journal of the American Statistical Association. 2010; 105:312–323.
35. Zou H, Li R. One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics. 2008; 36:1509–1533. [PubMed: 19823597]


Appendix

Proof of Theorem 1

We use techniques related to those in Zou and Li (2008) to prove Theorem 1. Let u = (u1, ⋯, ud)T and consider

. Define


Then

Let û(n) = arg minu[δn(u) − δn(0)], then

.

Let w denote the limiting distribution of . Thus, w ~ N(0, Σ) by Assumption B given in Section 3. Using techniques related to those in Zou and Li (2008), we can show that as n → ∞,


where u2 is the d2-dimensional subvector of u, which is written as u = (u1T, u2T)T with u1 being a d1-dimensional subvector.

Noting that V1 = [V11,V12] and δn(u) − δn(0) = T1 + T2 + T3, we have that as n → ∞,


Since δn(u) − δn(0) is a convex function of u, the unique minimum of δ(u) is

Partition û(n) = (û1(n)T, û2(n)T)T with û1(n) being d1-dimensional. By epiconvergence (Geyer 1994; Knight and Fu 2000), we conclude that as n → ∞,

By the definition of w,


As n → ∞, , so we have that . We next show that θ̂II = 0 with probability tending to one. It suffices to prove that for j = d1 + 1,…, d,

If θ̂j ≠ 0, then by the Karush–Kuhn–Tucker (KKT) conditions we must have

(11)


where {a}j stands for the jth element of a. By Assumption B, θ̃j → 0 in probability as n → ∞. The assumption in Theorem 1 then implies that the right-hand side of (11) goes to ∞ in probability as n → ∞. However, the left-hand side of (11) can be shown to be of order Op(1). Thus,


Fig. 1. Sensitivity analysis results for the HUI covariate using the SIMEX procedure alone, using the SCAD penalty in combination with the SIMEX procedure, and using the LASSO penalty in combination with the SIMEX procedure.


Table 1. Analysis of NPHS Data with Three Methods: Estimates and Standard Errors (in parentheses) (a = 1, c = 1)

Response Model:

Method | Int. | BMI | HUI | INC | Marital | EDU1 | EDU2 | Age
SCAD | 0.581 (0.091) | 0.000 | −0.559 (0.065) | 0.000 | 0.000 | 0.000 | 0.000 | −0.030 (0.006)
LASSO | 0.601 (0.091) | 0.000 | −0.312 (0.065) | 0.000 | 0.000 | 0.000 | 0.000 | −0.036 (0.006)
SIMEX | 1.064 (0.168) | −0.016 (0.043) | −0.553 (0.067) | −0.076 (0.047) | −0.309 (0.100) | −0.138 (0.132) | 0.125 (0.178) | −0.045 (0.007)

Drop-out Model:

Method | Int. | Response | BMI | HUI | INC | Marital | EDU1 | EDU2 | Age
SCAD | 1.751 (0.047) | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
LASSO | 1.754 (0.047) | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
SIMEX | 1.444 (0.200) | 0.046 (0.102) | −0.083 (0.049) | 0.088 (0.047) | 0.112 (0.056) | 0.006 (0.113) | 0.305 (0.131) | 0.332 (0.194) | 0.005 (0.008)

Table 2. Simulation Results for Response Model Parameters β. Rows correspond to combinations of n ∈ {200, 500}, ρe ∈ {0.0, 0.1, 0.5}, and σe ∈ {.15, .50, .70}; columns report MRMSE, Specif., Sensit., Under-fit, Exact fit, and Over-fit (1, 2, ≥ 3).

Table 3. Simulation Results for Drop-out Model Parameters α. Rows correspond to combinations of n ∈ {200, 500}, ρe ∈ {0.0, 0.1, 0.5}, and σe ∈ {.15, .50, .70}; columns report MRMSE, Specif., Sensit., Under-fit, Exact fit, and Over-fit (1, 2, ≥ 3).

Table 4. Effect on Model Selection: Measurement Error versus Drop-out. Setting I: only measurement error is present; Setting II: only missingness is present. Rows correspond to Settings I and II with n ∈ {200, 500}, ρe ∈ {0.0, 0.1, 0.5}, and σe ∈ {.15, .50, .70}; columns report MRMSE, Specif., Sensit., Under-fit, Exact fit, and Over-fit (1, 2, ≥ 3).

Table 5. Model Selection: 2-stage versus SIMEX. Model A: ρe = 0.5, σe = 0.24; Model B: ρe = 0.5, σe = 0.33; Model C: ρe = 0.5, σe = 0.41. Rows correspond to the Response and Drop-out parts under the SIMEX and 2-stage methods; columns report MRMSE, Specif., Sensit., Under-fit, Exact fit, and Over-fit (1, 2, ≥ 3).
