Psychological Methods 2014, Vol. 19, No. 2, 281–299

© 2013 American Psychological Association 1082-989X/14/$12.00 DOI: 10.1037/a0034274

Generalized Linear Models With Coarsened Covariates: A Practical Bayesian Approach

Timothy R. Johnson and Michelle M. Wiest

This document is copyrighted by the American Psychological Association or one of its allied publishers. This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.

University of Idaho

Coarsened covariates are a common and sometimes unavoidable phenomenon encountered in statistical modeling. Covariates are coarsened when their values or categories have been grouped. This may be done to protect privacy or to simplify data collection or analysis, sometimes by researchers who are not aware of the drawbacks. Analyses with coarsened covariates based on ad hoc methods can compromise the validity of inferences. One valid method for accounting for a coarsened covariate is to use a marginal likelihood derived by summing or integrating over the unknown realizations of the covariate. However, algorithms for estimation based on this approach can be tedious to program and can be computationally expensive. These are significant obstacles to their use in practice. To overcome these limitations, we show that when expressed as a Bayesian probability model, a generalized linear model with a coarsened covariate can be posed as a tractable missing data problem where the missing data are due to censoring. We also show that this model is amenable to widely available general-purpose software for simulation-based inference for Bayesian probability models, providing researchers a very practical approach for dealing with coarsened covariates.

Keywords: coarsened variables, censoring, missing data, generalized linear models, Bayesian statistics

Coarsened variables are commonly encountered in research. Variables may be coarsened during the process of data collection, maintenance, or dissemination for a variety of reasons (Haitovsky, 1983). Coarsening may be used to simplify the task for respondents or observers, or when measurement at a finer resolution is not possible or practical. It may also be used to "simplify" data analysis. Coarsening is also used as a method of statistical disclosure control to avoid compromising the confidentiality of respondents (Willenborg & de Waal, 1996, 2000). A variable is said to be coarsened if its value is known only up to a range or set of possible values. For example, a respondent's age might be coarsened into ranges such as up to 20 years, 20-25 years, and more than 25 years. A respondent's highest degree earned may be coarsened as no more than high school (high school or less than high school), associate/junior college, and college (bachelor's or graduate degree). Another example is the practice of "top-coding" or "bottom-coding" for statistical disclosure control, where variables are censored above or below a certain value, respectively, so that respondents corresponding to outliers cannot be as easily identified.

The problem of, and solutions to, coarsened or censored response variables in regression is well known in statistics (McCool, 1982). However, the issue of coarsened covariates is less well understood. What is known is that ad hoc approaches to handling coarsened covariates, such as dropping cases with coarsening or using the coarsened variable as a surrogate, can lead to less efficient inferences at best and incorrect inferences at worst. Although methods for properly handling coarsened covariates have been proposed, they are not as well known or commonly used. This may be due, in part, to significant programming difficulty and computational issues. To address this problem, we propose a practical and general approach to handling coarsened covariates using Bayesian probability models in the context of generalized linear models. We show that these models can be implemented with existing software.

The rest of this article is organized as follows. We first introduce the issue of coarsened covariates in the context of generalized linear models and review the limitations of ad hoc methods for dealing with them. We then discuss model-based methods for making inferences in the presence of coarsened covariates and highlight some of the difficulties of implementing these methods in practice. To address these difficulties, we propose the use of Bayesian probability models, building on the likelihood of the model-based approach and treating coarsened covariates as a missing-data problem. Finally, we demonstrate the implementation and utility of this methodology with real and simulated data, showing that it yields more accurate results than ad hoc methods and can be done using existing software.

Generalized Linear Models (GLMs) With Coarsened Covariates

This article was published Online First December 23, 2013.
Timothy R. Johnson and Michelle M. Wiest, Department of Statistical Science, University of Idaho.
Correspondence concerning this article should be addressed to Timothy R. Johnson, Department of Statistical Science, University of Idaho, 875 Perimeter Drive, MS 1104, Moscow, ID 83844-1104. E-mail: trjohns@uidaho.edu

Consider a GLM (Nelder & Wedderburn, 1972) where the mean of a response variable Yi is a function of a vector of p observed quantitative covariates, x′i = (xi1, xi2, . . . , xip), and another covariate Zi, such that


g[E(Yi | xi, Zi = z)] = β0 + β1xi1 + β2xi2 + · · · + βpxip + δz,   (1)

if Zi is a quantitative variable, or

g[E(Yi | xi, Zi = z)] = β0 + β1xi1 + β2xi2 + · · · + βpxip + δ_z,   (2)

if Zi is categorical, so that Zi = 1, 2, . . . , K indicates the category. For a GLM, the distribution of Yi, conditional on xi and Zi = z, is a member of the exponential family such as a normal, binomial, or Poisson distribution. The parameters of interest are usually β′ = (β0, β1, . . . , βp) and δ in Equation 1 or δ′ = (δ1, δ2, . . . , δK−1) in Equation 2, where usually δK = 0 for identification. The issue here is how to arrive at valid inferences concerning β and δ, or functions thereof, when Zi has been coarsened.

Coarsening can be described as a many-to-one mapping of the values of a variable into two or more sets. Here we let Si denote a set of values such that Zi ∈ Si. If Zi is continuous, then Si is usually an interval. If Zi is discrete or categorical, then Si is a finite set of values or categories. Figure 1 shows how a continuous and a discrete or categorical variable might be coarsened. Here, Zi is the original variable that is coarsened into a set Si. Specific values of Si are denoted as s1, s2, and s3. Figure 1a shows how a continuous variable Zi is coarsened into three consecutive intervals: s1 = (−∞, zl], s2 = (zl, zr], and s3 = (zr, ∞), where zl and zr denote the left and right cut points, respectively. In the age example mentioned earlier, these intervals would be s1 = (−∞, 20], s2 = (20, 25], and s3 = (25, ∞). Thus, for a coarsened continuous variable, only the interval that contains the value of Zi is known. This is also known as interval-censoring. Figure 1b shows how a discrete or categorical variable Zi is coarsened into sets of values. Here, there are five distinct values or categories, but these are coarsened by grouping them into the sets s1 = {1, 2}, s2 = {3}, and s3 = {4, 5}. For the highest educational degree example mentioned earlier, these sets are s1 = {less than high school, high school degree}, s2 = {associate/junior college degree}, and s3 = {bachelor's degree, graduate degree}. Thus, for a coarsened discrete or categorical variable, one may only know that one of several possible values/categories was realized.

Another common and related case is a quantitative variable that has been right- or left-censored, as in the practice of top- and bottom-coding, respectively. Figure 2 illustrates the case of a right-censored or top-coded quantitative variable. For a right-censored continuous variable, where Zi is censored at zr and above,

Si = [zr, ∞) if Zi ≥ zr, and Si = Zi otherwise,

so that sr = [zr, ∞). Similarly, for the discrete case,

Si = {4, 5} if Zi ≥ 4, and Si = Zi otherwise,

so that s1 = 1, s2 = 2, s3 = 3, and s4 = {4, 5}. Note that this notation can account for situations when a variable is not coarsened over certain values or categories, such as si = Zi or, for example, s1 = {associate/junior college degree}.

Figure 1. Illustrations of the coarsening of a continuous variable (a) and a categorical or discrete variable (b). For the continuous variable, the coarsening is due to interval-censoring at the left and right cut-points zl and zr, respectively.
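The many-to-one coarsening maps just described are simple to state programmatically. The following Python sketch (not from the article) maps a realization of Zi to the coarsened set Si, using the cut points from the age example and the grouped categories from Figure 1b:

```python
import math

def coarsen_continuous(z, cuts=(20.0, 25.0)):
    """Map a continuous value z to the interval that contains it.

    With the default cuts, the intervals are those of the age example:
    s1 = (-inf, 20], s2 = (20, 25], s3 = (25, inf).
    Returns the interval as a (lower, upper] tuple.
    """
    lo = -math.inf
    for c in cuts:
        if z <= c:
            return (lo, c)
        lo = c
    return (lo, math.inf)

def coarsen_categorical(z, groups=({1, 2}, {3}, {4, 5})):
    """Map a discrete value z to the set s_k that contains it."""
    for s in groups:
        if z in s:
            return s
    raise ValueError(f"{z} is not in any coarsening set")

print(coarsen_continuous(23.5))  # (20.0, 25.0), i.e., the interval (20, 25]
print(coarsen_categorical(4))    # {4, 5}
```

Note that the map is many-to-one: the original value cannot be recovered from its coarsened set, which is what makes the covariate effectively missing within a known interval or set.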

Ad Hoc Methods for Coarsened Covariates

A common ad hoc approach to dealing with coarsened covariates is to use Si as a surrogate for Zi by treating the coarsened variable Si as either a quantitative variable (e.g., s1 = 1, s2 = 2, etc.) or as a categorical variable with as many categories as there are sets. There is considerable research, however, showing that this practice can compromise inferences. Austin and Hoch (2004) reported the results of a simulation study that showed bias in the estimates of the effects of covariates, both observed and those subject to censoring, in a linear regression model when using a surrogate for the censored covariate. They observed that the bias increases with the magnitude of the effect of the coarsened covariate, the degree of association between the censored and uncensored covariates, and the degree of censoring. Johnson (2006) observed similar results in a simulation study with interval-censoring of a quantitative covariate in the case of Poisson regression, with the additional result that the coarsening can induce over-dispersion, resulting in invalid confidence intervals and p values even in situations where there is no bias in the estimates. Johnson also discussed some theoretical work in generalized linear models with measurement error in covariates (e.g., Carroll, 1989; Stefanski, 1985), which shows that coarsening can bias inferences not just for the effect of that covariate but also for other noncoarsened covariates.

Lipsitz, Parzen, Natarajan, Ibrahim, and Fitzmaurice (2004) considered the case where a covariate is coarsened for only a

subset of observations. This can occur, for example, when two data sets collected using different protocols for coding a covariate are merged, or when a covariate is coarsened for only sub-samples for statistical disclosure control. In those cases, an option is to base inferences on only those observations free of covariate coarsening, an approach referred to as complete-case analysis in the missing data literature. They identified two limitations of this approach. One is potential bias in the estimates. This may occur if the covariate is not coarsened completely at random (CCAR; Heitjan & Rubin, 1991), meaning that the probability of coarsening does not depend on Zi or any other variables in the data set, missing or observed. The other limitation is lower statistical efficiency. The reduction in the sample size by omitting incomplete cases reduces statistical precision (e.g., wider confidence intervals and less powerful tests).

The practice of "dichotomizing" covariates, a special case of coarsening where a quantitative covariate is coarsened into a two-category covariate, is relatively common and appears to be based on some misconceptions concerning the consequences of coarsening (DeCoster, Iselin, & Gallucci, 2009). Dichotomizing can result in spurious main effects and interactions, a loss of power in statistical tests concerning the effect of the covariate, and the failure to detect important relationships between the response variable and the uncoarsened covariate, such as nonlinearity (e.g., Cohen, 1983; Irwin & McClelland, 2003; MacCallum, Zhang, Preacher, & Rucker, 2002; Maxwell & Delaney, 1993; Vargha, Rudas, Delaney, & Maxwell, 1996). Vargha et al. (1996) showed that problems also arise for inferences concerning other covariates that are not dichotomized. A related literature has found similar problems with censored covariates (e.g., Austin & Brunner, 2003; Lynn, 2001; Rigobon & Stoker, 2007; Thompson & Nelson, 2003).

For these reasons, it has been consistently recommended that covariate coarsening be avoided. In practice, however, it may at times be necessary to make use of data where coarsening has already occurred and is irreversible. The issue then is how to arrive at correct inferences with coarsened covariates. That is the problem addressed here.

Figure 2. Illustrations of the coarsening of a continuous variable (a) and a discrete variable (b) by right-censoring. For the continuous case (a), zr represents the right-censoring limit.

Model-Based Approach to Account for Coarsened Covariates

A model-based approach to accounting for coarsened covariates is to use the joint distribution of the response variable Yi, the unobserved covariate Zi, and the coarsened covariate Si to derive the marginal distribution of Yi and Si, since Zi is unobserved. Here, we assume coarsening at random (CAR; Heitjan & Rubin, 1991). CAR is a weaker assumption than CCAR in that the probability of coarsening may depend on observed data but not on the unobserved value of Zi. Let f(yi; xi, z) and h(z; xi) denote the probability distribution functions corresponding to the conditional distribution of Yi given Zi and the marginal distribution of Zi, respectively.1 As is shown below, the distribution of Zi allows one to model associations between Zi and the observed covariates, xi. The conditional distribution of Si given Zi = z is the indicator function

I(z ∈ s) = 1 if z ∈ s, and 0 if z ∉ s,

since Si = s is possible if and only if z ∈ s. The joint distribution of Yi, Zi, and Si is then given by the product of these distributions,

f(yi; xi, z) h(z; xi) I(z ∈ s).   (3)

Since Zi is not observed, it is natural to consider the marginal distribution of Yi and Si and to use this to construct a marginal likelihood function for the purposes of inference. It is convenient to consider the cases of continuous and discrete/categorical covariates separately.

Coarsened Continuous Covariates

The likelihood function for a GLM with a coarsened continuous covariate can be written as

L(θ) = ∏_{i=1}^{n} ∫_{si} f(yi; xi, z) h(z; xi) dz.   (4)

This likelihood is obtained by integrating Equation 3 over z ∈ si to obtain the marginal likelihood of Yi and Si. For a continuous covariate, Zi could, for example, be assumed to be normally distributed with mean

E(Zi | xi) = α0 + α1xi1 + · · · + αpxip   (5)

and variance Var(Zi | xi) = σ², so that the regression of Zi on xi is a normal linear model. In practice, one effectively specifies two regression models, one for Yi and one for Zi, both with unknown parameters. The likelihood function depends on θ, which contains all unknown parameters, namely β and δ from Equation 1 or 2, α = (α0, α1, . . . , αp)′ from Equation 5, and also any additional parameters such as variance or scale parameters. Since the uncoarsened covariate Zi is unobserved, the marginal likelihood in Equation 4 integrates over the possible values of Zi within the known interval Si. The marginal likelihood is like that for a mixed model where Zi is a random effect, except here it represents a specific missing covariate value rather than a general respondent-specific effect (Ibrahim, Chen, Lipsitz, & Herring, 2005).

1 To keep the notation simple, the dependence of the distribution of Yi on β and δ and the distribution of Zi on α (defined later) has been omitted in the expressions for the distributions.
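One factor of the marginal likelihood in Equation 4 can be approximated numerically. The Python sketch below (an illustration, not the authors' implementation) assumes, for concreteness, a normal linear model for both Yi and Zi with hypothetical parameter values, and approximates the integral over the known interval si with a midpoint rule:

```python
import math

def norm_pdf(x, mu, sigma):
    """Normal density, used for both f(y; x, z) and h(z; x) in this toy model."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def marginal_lik_i(y_i, x_i, s_i, beta0, beta1, delta,
                   alpha0, alpha1, sigma_y, sigma_z, n_grid=2000):
    """One factor of Equation 4: integrate f(y_i; x_i, z) h(z; x_i) over z in s_i.

    Toy model (assumed for illustration): Y_i | x_i, z ~ N(beta0 + beta1*x_i + delta*z, sigma_y^2)
    and Z_i | x_i ~ N(alpha0 + alpha1*x_i, sigma_z^2). The integral over the
    known interval s_i = (lo, hi] is approximated with a midpoint rule.
    """
    lo, hi = s_i
    width = (hi - lo) / n_grid
    total = 0.0
    for k in range(n_grid):
        z = lo + (k + 0.5) * width
        f = norm_pdf(y_i, beta0 + beta1 * x_i + delta * z, sigma_y)  # f(y_i; x_i, z)
        h = norm_pdf(z, alpha0 + alpha1 * x_i, sigma_z)              # h(z; x_i)
        total += f * h * width
    return total

# Example: the covariate is known only to lie in the interval (0, 1]
print(marginal_lik_i(y_i=1.2, x_i=0.5, s_i=(0.0, 1.0),
                     beta0=0.0, beta1=1.0, delta=1.0,
                     alpha0=0.0, alpha1=1.0, sigma_y=1.0, sigma_z=1.0))
```

Repeating this for each observation and multiplying the factors gives L(θ); the point of the article is precisely that this integration, repeated at every likelihood evaluation, is what makes direct maximization expensive.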

Coarsened Discrete and Categorical Covariates


The likelihood function for a GLM with a coarsened categorical or discrete covariate can be written as

L(θ) = ∏_{i=1}^{n} ∑_{z∈si} f(yi; xi, z) h(z; xi),   (6)

which is similar to the continuous case in Equation 4, except that, because Zi is discrete, the integration is replaced by summation over z ∈ si. If Zi is categorical with K categories, a reasonable model for the covariate might be a multinomial logistic model of the form

h(z; xi) ∝ exp(α0z + α1zxi1 + · · · + αpzxip),

where all αjK = 0 for identification. Alternatively, an ordinal regression model could be used. If Zi is discrete, then h is some assumed probability mass function, such as a log-linear Poisson model if Zi is a count, so that

h(z; xi) = λi^z e^{−λi} / z!,

where log(λi) = α0 + α1xi1 + · · · + αpxip. Again, a model is specified for Zi as well as Yi. Here, the marginal likelihood is obtained by summing over the possible values of the missing covariate Zi, as determined by Si, weighted by the probabilities P(Zi = z | xi) = h(z; xi). In this sense, Equation 6 is a kind of finite mixture model.
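The sum in Equation 6 is straightforward to compute for one observation. The Python sketch below is illustrative only: it assumes a Bernoulli-logit response model f and a multinomial-logit covariate model h with K = 3 categories, and all coefficient values are hypothetical:

```python
import math

def inv_logit(t):
    return 1.0 / (1.0 + math.exp(-t))

def f_bernoulli(y, x, z, beta=(0.0, 1.0), delta=(0.5, -0.5, 0.0)):
    """f(y; x, z): Bernoulli-logit response; delta[z-1] is the category effect (delta_K = 0)."""
    p = inv_logit(beta[0] + beta[1] * x + delta[z - 1])
    return p if y == 1 else 1.0 - p

def h_multinomial(z, x, alpha=((0.3, 1.0), (-0.2, 0.5), (0.0, 0.0))):
    """h(z; x): multinomial-logit covariate model, h(z; x) ∝ exp(a0z + a1z*x), alpha_K = 0."""
    scores = [math.exp(a0 + a1 * x) for a0, a1 in alpha]
    return scores[z - 1] / sum(scores)

def marginal_lik_discrete_i(y_i, x_i, s_i):
    """One factor of Equation 6: sum of f(y_i; x_i, z) h(z; x_i) over z in s_i."""
    return sum(f_bernoulli(y_i, x_i, z) * h_multinomial(z, x_i) for z in s_i)

# Z_i has K = 3 categories but was coarsened to the set {1, 2}
print(marginal_lik_discrete_i(y_i=1, x_i=0.4, s_i={1, 2}))
```

Each term of the sum weights the response density by P(Zi = z | xi), which is what makes Equation 6 a finite mixture over the categories in si.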

Maximum Likelihood Estimation

The likelihoods given in Equations 4 and 6 are computed by integrating and summing, respectively, over the possible values of the missing covariate Zi within Si, weighting by the probability density or mass of the covariate. However, the summation or integration over Zi ∈ Si makes these likelihoods cumbersome. Direct numerical maximization of the likelihoods is feasible for the case of coarsened categorical or discrete covariates (Lipsitz et al., 2004), or when the covariate and response variable are jointly normally distributed (e.g., Rigobon & Stoker, 2007; Ronning & Kukuk, 1996; Winship & Mare, 1984). In other situations, however, direct numerical maximization can be computationally much more expensive, particularly for the case of coarsened continuous covariates. Furthermore, even in cases that are not as challenging computationally, software is not widely available.

To understand the issue, consider an approach that has been proposed to circumvent direct summation or integration of a likelihood by using an expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977). EM algorithms are based on the likelihood of the complete data, including all observations, missing and observed. Here, this complete-data likelihood is

Lc(θ) = ∏_{i=1}^{n} f(yi; xi, zi) h(zi; xi) I(zi ∈ si),   (7)

where Z1, Z2, . . . , Zn are the missing data, and Y1, Y2, . . . , Yn and S1, S2, . . . , Sn are the observed data. The EM algorithm maximizes the conditional expectation of the log of this likelihood, given the observed data, and thereby circumvents the summation or integration necessary in Equations 4 and 6 with the calculation of an expectation. For a coarsened categorical or discrete covariate, the expectation can be expressed in closed form, as shown by Lipsitz et al. (2004). However, for coarsened continuous covariates, the expectation cannot in general be expressed in closed form. Johnson (2006) proposed using Monte Carlo EM (MCEM; Wei & Tanner, 1990) algorithms for generalized linear models and generalized linear mixed models with normally distributed interval-censored covariates. Giovanini (2008) and May, Ibrahim, and Chu (2011) proposed similar MCEM algorithms for censored covariates. These methods use simulations to approximate the expectation. However, this approach, while effective in theory, can be prohibitive in practice. May et al. noted that the limitations of maximum likelihood methods are that (a) a unique algorithm may need to be programmed for each application and (b) the simulations necessary in the case of continuous coarsened covariates can be very expensive computationally. Other researchers have noted the computational limitation for the continuous coarsened covariate case (e.g., Gómez, Espinal, & Lagakos, 2003; Tsimikas, Bantis, & Georgiou, 2012). Furthermore, MCEM algorithms tend to be difficult to monitor for convergence, and EM algorithms do not easily provide standard errors or likelihood ratios for confidence intervals and tests. The additional effort necessary to program and execute these algorithms may unfortunately dissuade researchers from using them.
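For the discrete case, the closed-form E-step amounts to computing posterior category probabilities P(Zi = z | yi, xi, Si = si) ∝ f(yi; xi, z) h(z; xi), normalized over z ∈ si. The Python sketch below outlines only this weight computation (the model pieces f and h and their coefficients are hypothetical, not those of Lipsitz et al.):

```python
import math

def f(y, x, z):
    """Illustrative Bernoulli-logit response model f(y; x, z) (hypothetical coefficients)."""
    p = 1.0 / (1.0 + math.exp(-(x + 0.5 * z)))
    return p if y == 1 else 1.0 - p

def h(z, x):
    """Illustrative covariate pmf h(z; x): here simply a fixed pmf over z = 1, 2, 3."""
    return {1: 0.5, 2: 0.3, 3: 0.2}[z]

def e_step_weights(y_i, x_i, s_i):
    """E-step weights for a coarsened discrete covariate:
    P(Z_i = z | y_i, x_i, S_i = s_i) ∝ f(y_i; x_i, z) h(z; x_i), normalized over z in s_i.
    """
    raw = {z: f(y_i, x_i, z) * h(z, x_i) for z in s_i}
    total = sum(raw.values())
    return {z: w / total for z, w in raw.items()}

w = e_step_weights(1, 0.2, {1, 2})
print(w)  # weights over the coarsening set {1, 2}; they sum to 1
```

In the continuous case, the analogous conditional expectation involves an integral with no closed form in general, which is exactly where the Monte Carlo approximations (and their cost) enter.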

Bayesian Probability Models for Coarsened Covariates

To address the limitations of likelihood-based inference for GLMs with coarsened covariates, we propose a simulation-based approach with Bayesian probability models. The computations, although simulation-based, are relatively quick. Furthermore, these models can be implemented using available software without the need to program the simulation algorithm itself. As in the EM algorithms that have been proposed for coarsened covariates, the Zi are treated as missing data, but in a Bayesian probability model, missing data can be treated simply as additional unknown parameters (Ibrahim et al., 2005; Tanner & Wong, 1987). A Bayesian probability model for a GLM with a coarsened covariate Zi is defined in terms of the posterior distribution k(θ, z | y) of the model parameters θ and the missing values of the covariate z = (z1, z2, . . . , zn)′. This can be written as

k(θ, z | y) ∝ m(θ) ∏_{i=1}^{n} f(yi; xi, zi) h(zi; xi) I(zi ∈ si),   (8)

where I(zi ∈ si) is an indicator function for the inclusion of the coarsened covariate zi in set si, so that

I(zi ∈ si) = 1 if zi ∈ si, and 0 if zi ∉ si,

and m(θ) is the prior distribution for θ.2 This model applies to both continuous and discrete/categorical coarsened covariates, since Si can be an interval or countable set of values. In the Bayesian approach to a coarsened/missing covariate, zi appears in the model as a parameter, whereas in Equations 4 and 6, z was only used as a variable of integration or summation, respectively. Viewing zi as a parameter in Equation 8, the term I(zi ∈ si) reflects the fact that, given Si = si, zi has zero probability/density when zi is not in si. The role that zi plays here is analogous to that of a random effect with a truncated distribution. Including an unobserved variable in a Bayesian probability model has become a standard technique (Tanner & Wong, 1987). Such methods have proven to be useful in other applications, such as mixed effects models (Zeger & Karim, 1991), latent variable models (Arminger & Muthén, 1998), and limited response variable models (Albert & Chib, 1993).
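To make the data-augmentation idea concrete, the sketch below shows one possible update for a single missing zi within an MCMC sampler. It is an illustration, not the article's algorithm (the article relies on JAGS to build the sampler): it assumes normal response and covariate models with hypothetical parameter values, proposes zi from h truncated to si, and accepts with the response-likelihood ratio, which is a valid independence Metropolis-Hastings step for the zi factor of Equation 8:

```python
import math
import random

def mh_update_z(z_cur, y, x, s_i, beta, delta, alpha, sigma_z, sigma_y):
    """One Metropolis-within-Gibbs update for a missing covariate z_i.

    Target: proportional to f(y; x, z) h(z; x) I(z in s_i), the z_i factor of
    Equation 8. Proposal: draw z' from h(z; x) (normal here, an illustrative
    assumption) truncated to s_i by rejection. Because the proposal density is
    the truncated h itself, the acceptance ratio reduces to f(y; x, z') / f(y; x, z).
    """
    lo, hi = s_i
    while True:  # rejection step: propose from h restricted to s_i
        z_prop = random.gauss(alpha[0] + alpha[1] * x, sigma_z)
        if lo < z_prop <= hi:
            break
    def log_f(z):  # normal response model, log f(y; x, z) up to a constant
        mu = beta[0] + beta[1] * x + delta * z
        return -0.5 * ((y - mu) / sigma_y) ** 2
    if math.log(random.random()) < log_f(z_prop) - log_f(z_cur):
        return z_prop
    return z_cur

random.seed(1)
z = 0.5  # a starting value inside s_i = (0, 1]
for _ in range(100):
    z = mh_update_z(z, y=1.2, x=0.5, s_i=(0.0, 1.0),
                    beta=(0.0, 1.0), delta=1.0, alpha=(0.0, 1.0),
                    sigma_z=1.0, sigma_y=1.0)
print(z)  # the sampled z never leaves the censoring interval
```

Every draw respects the indicator I(zi ∈ si) by construction, which is the programmatic counterpart of the truncated distribution described above.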

Simulation-Based Bayesian Inference

Including missing/latent data as unobserved parameters can make models more amenable to Markov chain Monte Carlo (MCMC) methods for simulating realizations from a posterior distribution (Gilks, Richardson, & Spiegelhalter, 1996). The summation or integration over the possible values of the coarsened covariate, which can be difficult for maximum likelihood inference, is avoided. Inferences are based entirely on simulated realizations of the parameters of interest from the posterior distribution.

Bayesian inference is based on the posterior distribution of the parameters of a model. An MCMC algorithm simulates realizations of parameters from a posterior distribution for the purpose of inference. The idea is that the posterior distribution can be analyzed descriptively using an appropriate sample of realizations. For example, the mean, median, or mode of the posterior distribution of a parameter can serve as its point estimate. These can be approximated by the corresponding statistic from a sample from the posterior distribution. An interval estimate of a parameter can be made using quantiles of the posterior distribution, such as the 2.5% and the 97.5% quantiles to produce a 95% credibility interval. These can also be approximated by using the corresponding sample quantiles. One can also approximate posterior probabilities, such as the probability that a parameter is greater than zero, using the proportion of observations of that parameter in the sample that exceed zero.

When inferences are based on approximations using descriptive statistics of a simulated sample of realizations from posterior distributions, it is critical that the sample be sufficiently representative of the posterior distribution. The iterative nature of MCMC algorithms is such that often the initial samples are not representative of the posterior distribution and may depend on initial values. These are discarded prior to analysis as a "burn-in" period.
A variety of diagnostic methods can be used to determine whether a burn-in period is sufficiently large and whether the algorithm has converged to a point where the simulated sample is representative of the posterior distribution (Cowles & Carlin, 1996). One simple method is to plot the simulated realizations of parameters over


iterations in what is commonly called a trace plot. Another method is to compute one or more statistics to assess convergence. In our analyses, we relied largely on plots and a simple diagnostic measure due to Gelman and Rubin (1992; see also Brooks & Gelman, 1998).3 Another issue is the size of the sample after the burn-in period. The size of this sample depends on how dependent the sampled realizations are over iterations and on how much accuracy is needed in using sample statistics to approximate properties of the posterior distribution. We relied on inspection of cumulative means, quantiles, and approximate Monte Carlo standard errors of posterior means to determine when samples were sufficiently large, and in most cases we used sample sizes in excess of what appeared to be minimally sufficient.
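The Gelman-Rubin diagnostic compares between-chain and within-chain variability. A minimal Python sketch of its basic (non-split) form is below; this is an illustration of the statistic, not the implementation the authors used:

```python
def gelman_rubin(chains):
    """Potential scale reduction factor R-hat (Gelman & Rubin, 1992), basic form.

    `chains` is a list of m post-burn-in sample lists of equal length n, one per
    chain. Values near 1 suggest the chains have mixed; the article's analyses
    used a cutoff of R-hat < 1.05.
    """
    m = len(chains)
    n = len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    B = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)   # between-chain variance
    W = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m               # mean within-chain variance
    var_hat = (n - 1) / n * W + B / n                          # pooled variance estimate
    return (var_hat / W) ** 0.5

import random
random.seed(7)
# Four independent, well-mixed "chains" drawn from the same distribution
chains = [[random.gauss(0, 1) for _ in range(1000)] for _ in range(4)]
print(gelman_rubin(chains))  # close to 1 for well-mixed chains
```

When chains are stuck in different regions, the between-chain variance B dominates W and R-hat grows well above 1, which is what flags non-convergence.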

Software Implementation

In addition to the computational advantages of Bayesian probability models using MCMC, generalized linear models with coarsened covariates are amenable enough to simulation that, while some familiarity with MCMC methods and software is required, users need not develop and program their own MCMC algorithms. A variety of packages have been developed that will, given a user-specified model, build and implement an MCMC algorithm for the specified model automatically. This removes the burden of developing and programming the algorithm so that users can focus on model specification and inference. As with other statistical packages for analyses ranging from regression to structural equation modeling, the user specifies only the model, not the computational algorithm that is used to arrive at inferences. Packages of this sort include WinBUGS (Lunn, Spiegelhalter, Thomas, & Best, 2009), OpenBUGS (Thomas, O'Hara, Ligges, & Sturtz, 2006), and PROC MCMC in SAS/STAT (SAS Institute, 2008). We use Just Another Gibbs Sampler (JAGS; Plummer, 2003). JAGS is similar to WinBUGS and OpenBUGS, and it uses a nearly identical syntax for model specification. However, JAGS offers an unambiguous way to specify a censored variable through the dinterval command, and it interfaces very easily with R (R Development Core Team, 2012) using the rjags package (Plummer, 2011). In the following examples and in the appendices, we demonstrate the use of JAGS for models with coarsened covariates. The approach builds naturally on the use of JAGS and related packages for linear and generalized linear models without coarsened covariates (see Congdon, 2003, 2007; Kruschke, 2011).

2 Although Zi and z are conventionally used as the name and realization of the covariate, respectively, here for simplicity we use zi to represent both, since it represents an unknown realization of the covariate.
3 We used a cutoff of R̂ < 1.05 for all parameters of interest in analyses, although in most cases we were able to obtain values of R̂ very near 1.

Prior Specification

A component of any Bayesian probability model is the prior distribution, denoted generally as m(θ) in Equation 8. The prior distribution reflects the uncertainty concerning the model parameters prior to observation of any data so that, in principle, one could obtain, say, the prior mean or credibility interval of a parameter. Priors can be based on expert opinion or previous


research. In the latter case, the posterior distribution from a previous study could serve as the prior distribution in another study. However, in many cases little or no expert opinion or information may be available to guide the specification of a prior distribution. In these cases, it is usual to use a relatively vague or non-informative distribution as a prior distribution. In the examples in this article, we used non-informative distributions. Details on the specifications of these distributions are given in the appendices. How to specify a suitably non-informative distribution depends somewhat on the parameter. For a practical discussion of the specification of informative and non-informative prior distributions, see Gelman and Hill (2007).

It is useful to note that Bayesian models for missing data and coarsened covariates can be interpreted as specifying a prior distribution for the missing data. This is because the term h(zi; xi) I(zi ∈ si) in Equation 8 can be viewed as being proportional to a prior for the unknown parameter zi. The Bayesian approach to missing data is to treat each missing realization as an unknown parameter. Here, zi is unknown due to the coarsening, but it is known to be within the interval or set si, and within this interval or set the distribution is specified to be proportional to h(zi; xi). This prior for zi is effectively a truncated distribution, and it could be interpreted as an informative distribution because the researchers are using their knowledge of the coarsening and their expert opinion or previous empirical evidence concerning its distribution to specify this prior distribution.

Examples

The following examples demonstrate the analysis of GLMs with a coarsened covariate using a Bayesian probability model as proposed in the previous section. In both examples, the specific model is a logistic regression model with one observed covariate and one coarsened covariate. For the first example, we constructed artificial data with known parameters, imposed different kinds and degrees of coarsening on a continuous covariate, and observed the results for different methods of dealing with the coarsened covariate. In the second example, we demonstrate the analysis of real data with a coarsened categorical covariate. Further details of the implementation of these analyses using JAGS are given in the appendices.

Example: Coarsened Continuous Covariate

We simulated data based on a logistic regression model for a response variable Yi = 0, 1, 2, . . . , m with a binomial distribution such that Yi ~ Binomial(πi, m), where πi and m = 50 are the event probability and the number of trials, respectively. The mean structure was defined as E(Yi | xi, zi) = mπi and

log[πi / (1 − πi)] = β0 + β1xi + δzi,

where β0 = 0 and β1 = δ = 1. The values of xi were uniformly distributed between −1 and 1. The distribution of Zi given xi was specified as a gamma distribution, Gamma(κ, λi), where κ and λi are the "shape" and "rate" parameters, respectively, so that E(Zi | xi) = κ/λi and Var(Zi | xi) = κ/λi². For this example, the shape parameter was fixed at κ = 0.25 to make the variability in the marginal

distribution of the two covariates similar. The relationship between Zi and xi was specified using a gamma regression model with a log link function, so that the rate parameter, and thus the mean, depends on xi such that log[E(Zi|xi)] = α0 + α1xi, where α0 = 0 and α1 = 1. This simulates a logistic regression analysis where the covariate Zi has a right-skewed distribution, and where the covariates are partially confounded, as might be the case in an observational study. Using this model, we simulated data sets with varying degrees of coarsening and varying percentages of coarsened observations (50%, 75%, and 100%). The coarsening was based on median, quartile, and quintile splits of the covariate to create a coarsened covariate with two, four, and five categories, respectively. The observations that were coarsened in the cases of 50% and 75% coarsened observations were selected at random. For each of the nine combinations of the levels of these factors, we simulated 100 data sets with n = 100 observations each. To each data set, we applied four analyses, described as follows.

Full data analysis. This analysis uses the observed covariate before coarsening. It is included here as a baseline for comparison to the other approaches, which have to accommodate the coarsened covariate.

Missing data analysis. The missing data analysis uses the methodology proposed in this article. The Bayesian approach to missing data treats a missing realization of a covariate as an unknown parameter with a distribution that is truncated according to the way the covariate was coarsened.

Coarsened data analysis. This analysis uses the coarsened covariate itself as a surrogate for the covariate. In the analyses reported below, this involved treating an interval-censored covariate as a factor. Note that this approach does not yield a meaningful estimate of δ in the case of an interval-censored covariate, since its surrogate is categorical.

Complete case analysis. A complete case approach to missing data uses only the cases that do not have missing values. Here, the complete case analysis was based on those observations for which the covariate was not coarsened, and so uses only a subset of the data. Note that this approach is not applicable when 100% of the cases are coarsened.

All four analyses were based on Bayesian probability models using JAGS. Diffuse priors were specified for the model parameters. The priors for β0, β1, α0, and α1 were independent normal distributions with mean zero and variance 1,000. The prior for the shape parameter κ was specified as a diffuse gamma distribution with shape and rate parameters of 0.001. The first 5,000 samples were used as a "burn-in" phase; inferences were based on the next 50,000 samples. Further details on the implementation of the missing data analysis using JAGS are given in Appendix A. Here, we focus on inferences concerning β1 and δ, which reflect the effects of xi and Zi, respectively. Figure 3 shows box plots depicting the empirical distribution from the simulation of the posterior means of β1 and δ for each level of coarsening, percentage of coarsened observations, and method of analysis. Again, note that the complete case analysis is not applicable when 100% of cases are coarsened, and estimates of δ are not meaningful for the coarsened data analysis. There are several results to note from the outcome of the simulation shown in Figure 3. The first is that there is relatively close agreement in the distribution of the estimates of β1 and δ for the full and missing data analyses, and the distributions are consistent with the known values of β1 = δ = 1.
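As an illustration, the data-generating process just described can be sketched in a few lines of Python with NumPy. This is our reconstruction of the simulation design, not the authors' code, and the variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 50                              # sample size and binomial trials
beta0, beta1, delta = 0.0, 1.0, 1.0
alpha0, alpha1, kappa = 0.0, 1.0, 0.25

x = rng.uniform(-1.0, 1.0, size=n)
mean_z = np.exp(alpha0 + alpha1 * x)        # gamma regression with a log link
z = rng.gamma(kappa, mean_z / kappa)        # scale = mean/shape, so E(Z|x) = mean_z
eta = beta0 + beta1 * x + delta * z
y = rng.binomial(m, 1.0 / (1.0 + np.exp(-eta)))

# quartile-split coarsening: record only the interval containing z
cuts = np.quantile(z, [0.25, 0.50, 0.75])
z_interval = np.searchsorted(cuts, z)       # interval categories 0, 1, 2, 3
coarsened = rng.random(n) < 0.75            # coarsen 75% of the cases at random
```

In the missing data analysis, only the interval membership and the cut points would enter the model for coarsened cases; z itself would be treated as an unknown parameter restricted to the recorded interval.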


Figure 3. Distributions of the posterior means of β1 (a) and δ (b) for data with an interval-censored covariate by method of analysis, percentage of coarsened cases, and degree of coarsening. The true value of the parameter (i.e., β1 = δ = 1) is shown by a horizontal line. The points and line segments within each box plot show the mean and 95% confidence interval for the posterior mean, respectively.

Estimates based on the missing data analysis show perhaps slightly more variability, which is consistent with the loss of information due to coarsening, but the increase in variability is relatively small, except for estimates of δ when the percentage of coarsened cases is large and the coarsening is severe, such as for a median split. These results show that the missing data approach proposed in this article for coarsened covariates can recover parameters despite coarsening. In comparison, the two ad hoc methods can be more problematic. The coarsened data analysis, which uses the coarsened covariate as a factor, exhibits a tendency to overestimate β1, particularly in more severe cases of coarsening. The bias diminishes, however, as the number of categories into which the covariate is coarsened increases. This is consistent with other simulation studies that have examined the effects of coarsened confounded covariates (e.g., Johnson, 2006). The complete case analysis does not incur bias but provides less precise estimates of β1. The variability of the posterior means of β1 increases as the percentage of coarsened observations increases, which would be expected since the effective sample size is decreasing. The credibility intervals (not shown) for β1 are also appreciably wider for the

complete case analysis for the same reason. These results show that typical ad hoc methods may not perform as well as a missing data approach to coarsened covariates, due to bias in the case of the coarsened data analysis and less precision in the case of the complete case analysis. We also considered the case of a right-censored or top-coded covariate, where values that exceed a certain value are not observable. Using the same model, we generated data where the covariate was right-censored at the quartiles (i.e., the 25th, 50th, and 75th percentiles), resulting in 75%, 50%, and 25% coarsened observations, respectively. We applied the same four methods of analysis. For the coarsened data analysis, the unobserved value of a coarsened covariate was replaced by the corresponding quartile as its surrogate.4 The distributions of the posterior means of β1 and δ are shown in Figure 4. In comparison to the previous simulation study, here estimates from all four methods are meaningful for each parameter and condition

4 Other values could also be used. The results vary depending on the value of the surrogate, but the general effect is the same.


Figure 4. Distributions of the posterior means of β1 (a) and δ (b) for data with a right-censored covariate by method of analysis and percentage of coarsened cases. The true value of the parameter (i.e., β1 = δ = 1) is shown by a horizontal line. The points and line segments within each box plot show the mean and 95% confidence interval for the posterior mean, respectively.

since only some of the cases are coarsened by definition, and the surrogate for the coarsened covariate is on the same scale as the covariate. Again, the full and missing data analyses are in close agreement with each other and with the known values of the parameters, although the posterior means show somewhat greater variability for the missing data analysis due to the loss of information. For the coarsened data analysis, the estimates of β1 and δ are upwardly biased, with the bias increasing with the percentage of coarsened observations. Finally, the estimates of β1 and δ based on the complete case analysis are expectedly less precise due to the smaller effective sample sizes. Overall, the simulations illustrate problems with and limitations of using coarsened or complete case analyses to accommodate a coarsened covariate. The former can result in biased estimates of the effect of an observed covariate, and does not allow one to make inferences concerning the effect of the coarsened covariate itself, only that of its surrogate. The latter neglects information from the cases with a coarsened covariate and can only be used when some complete cases are available. In comparison, the missing data approach can provide better estimates of the effects of both observed and coarsened covariates, and is more widely applicable.
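The top-coding scheme in this second simulation can be sketched as follows (again our illustration, with hypothetical variable names). Censoring at the median leaves 50% of cases coarsened, and the surrogate simply substitutes the censoring point:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.gamma(0.25, 1.0, size=1000)       # right-skewed covariate

c = np.quantile(z, 0.50)                  # right-censor at the 50th percentile
censored = z > c                          # these values are not observable
z_surrogate = np.where(censored, c, z)    # coarsened data analysis: substitute c
```

The missing data analysis would instead treat each censored zi as an unknown parameter known only to exceed c.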

Example: Coarsened Categorical Covariate

This example uses data from the General Social Survey (Smith, Marsden, & Kim, 2011) to model the relationship between type and strength of religious identity and responses to an attitude question concerning abortion. The sample was a subset of n = 1,981 cases of respondents who identified their religious preference as either Protestant or Catholic. The binary response variable was the answer to the question "Please tell me whether or not you think it should be possible for a pregnant woman to obtain a legal abortion if the woman wants it for any reason." Here, Yi = 1 if the answer is "yes," and Yi = 0 if the answer is "no." The covariates were the answers to questions concerning the respondent's religious identity and the strength of their religious identity. The religious identity question asked depended on whether a respondent identified their religious preference as Protestant or Catholic. Protestants were asked "When it comes to your religious identity, would you say you are a fundamentalist, evangelical, mainline, or liberal Protestant, or do none of these describe you?" Catholics were asked "When it comes to your religious identity, would you say you are a traditional, moderate, or liberal Catholic, or do none of these describe you?" Nesting the answer to the religious identity question within the religious preference question creates a nine-level categorical variable. This variable is a likely candidate for coarsening, since respondents might simply be identified as Protestant or Catholic rather than, say, "fundamentalist Protestant" or "liberal Catholic." We selected approximately 75% of the cases in the sample at random and coarsened the religious identity variable so that those respondents could only be identified as Protestant or Catholic. Unlike the case of a coarsened quantitative covariate, a coarsened categorical covariate requires some uncoarsened cases for identification (Lipsitz et al., 2004). The other covariate is the strength of religious affiliation in response to the question "Would you call yourself a strong [religious preference] or a not very strong [religious preference]?" where [religious preference] is either Protestant or Catholic. Responses were coded as simply "yes" or "no." To model the relationship between the response to the abortion question and the strength and identity of the respondents' religious preference, we considered the binary logistic regression model

log[πi/(1 − πi)] = β0 + β1xi + δ_{zi},

where πi = P(Yi = 1|xi, zi), xi is an indicator variable for whether the respondent stated that they had a strong religious affiliation, and zi is an integer-valued variable representing one of the nine


religious identity categories.5 Note that the term δ_{zi} models the effect of identity as a categorical variable. We imposed the identification constraint Σ_{j=1}^{9} δj = 0 to avoid specifying an arbitrary reference category. We applied the same four methods of analysis as were used in the simulations. For the missing data analysis, we assumed a multinomial logit model for Zi so that




P(Zi = z|xi) ∝ exp(αz0 + αz1xi). Diffuse normal priors with mean zero and variance 1,000 were specified for all model parameters. The first 5,000 samples were discarded as a burn-in phase, and inferences were based on the next 50,000 samples. Appendix B gives the details on how to implement this analysis using JAGS, including the JAGS model file for the missing data analysis. First, we focus on inferences concerning the effect of religious identity (δ1, δ2, . . . , δ9) on the response to the abortion question. Figure 5 shows the posterior means and 95% quantile-based credibility intervals for δ1, δ2, . . . , δ9. The pattern in the posterior distributions for the analyses is similar to what was observed in the previous example with the interval-censored covariate. The posterior means are in relatively close agreement, but the credibility intervals for the complete case and missing data analyses are wider, due to the smaller sample in the former and the loss of information in the latter. The overall trend appears to be that there is some effect of religious identity, even within religious preference. More conservative religious identities are less likely to respond "yes" to the question than less conservative religious identities. For the coarsened data analysis, the posterior distributions within the larger religious preference categories are the same, since the model can then be written as

log[πi/(1 − πi)] = β0 + β1xi + δ*_{z*i},

where z*i = 1 if zi < 5 and z*i = 2 if zi > 4, which is equivalent to the full data analysis but with the constraints δ*1 = δ1 = δ2 = δ3 = δ4 and δ*2 = δ5 = δ6 = δ7 = δ8 = δ9, since the religious identity variable is coarsened into just Catholic versus Protestant. This analysis cannot support any conclusions about the effect of differences in religious identity within religious preference; it only allows inferences about Catholics versus Protestants. Furthermore, the apparent effect of religious preference is affected by that of religious identity and by the distribution of religious identity within Catholics and Protestants. Thus, an analysis based on the coarsened covariate can give misleading results concerning the effect of religious preference. Now consider the posterior distribution of β1, which represents the effect of strength of religious identity. Figure 6 shows the posterior means and 95% credibility intervals for β1. The posterior means for the full and missing data analyses are in relatively close agreement. The width of the credibility intervals increases with the degree of missing information: wider for the missing data analysis than for the full data analysis, and wider still for the complete case analysis. The posterior distribution for the coarsened data analysis appears to show a somewhat smaller value of β1.
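The multinomial logit model used for Zi in the missing data analysis, P(Zi = z|xi) ∝ exp(αz0 + αz1xi), is a softmax over the nine categories; for a coarsened case, the distribution is additionally renormalized over the known subset si (e.g., the Protestant identities). A minimal sketch of this logic, with illustrative parameter values of our own choosing:

```python
import numpy as np

def category_probs(alpha0, alpha1, x):
    """Softmax: P(Z = z | x) proportional to exp(alpha0[z] + alpha1[z] * x)."""
    logits = alpha0 + alpha1 * x
    p = np.exp(logits - logits.max())   # subtract max for numerical stability
    return p / p.sum()

def truncated_probs(p, subset):
    """Restrict to the known subset s_i for a coarsened case and renormalize."""
    q = np.zeros_like(p)
    q[subset] = p[subset]
    return q / q.sum()

alpha0 = np.zeros(9)
alpha1 = np.linspace(-1.0, 1.0, 9)
p = category_probs(alpha0, alpha1, x=1.0)
p_trunc = truncated_probs(p, subset=np.arange(0, 4))  # case known only by preference
```

In JAGS, the same renormalization is what the sampler effectively performs when a categorical realization is known only up to a subset of its categories.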


Recall that coarsening of one covariate can bias inferences concerning the effects of uncoarsened covariates, particularly when the covariates are associated. This may be due to a relationship between strength and type of religious identity. Figure 7 appears to show that more conservative religious identities tend to be associated with stronger religious identities. However, in making inferences concerning the strength of religious identity, the coarsened data analysis controls for only religious preference (e.g., Catholic), not religious identity (e.g., liberal Catholic). This suggests that inferences concerning the effect of strength of religious identity based on the coarsened data analysis may be biased.

Extensions: Multiple Coarsened Covariates and Interactions

Most discussions of coarsened covariates consider only the case of a single coarsened covariate with an additive effect. Johnson (2006), however, did discuss in detail the case of GLMs with two or more coarsened continuous covariates, and interactions between the observed and coarsened covariates. The Bayesian approach proposed in this article can be extended to cases with multivariate coarsened covariates and/or interactions involving observed and coarsened covariates.

Multiple Coarsened Covariates

We consider first the case of multivariate coarsened covariates. For illustration, consider a linear model with p observed covariates and q ≥ 1 coarsened quantitative covariates:

E(Yi|xi, Zi1 = z1, Zi2 = z2, . . . , Ziq = zq) = β0 + β1xi1 + β2xi2 + · · · + βpxip + δ1z1 + δ2z2 + · · · + δqzq,   (9)

where the β terms involve the p observed covariates and the δ terms the q coarsened covariates.
The relationship between the observed and coarsened covariates could be modeled using q linear models in which the mean structure of the k-th coarsened covariate Zik is

E(Zik|xi) = α0k + α1kxi1 + α2kxi2 + · · · + αpkxip.   (10)

The specification of the model is completed by making distributional assumptions concerning the marginal distribution of Zi1, Zi2, . . . , Ziq and the conditional distribution of Yi given Zi1 = z1, Zi2 = z2, . . . , Ziq = zq. In a normal linear model, the latter would be a normal distribution with mean defined in Equation 9 and variance σ². The marginal distribution of coarsened continuous covariates could be specified as a multivariate normal distribution with mean given in Equation 10 and covariance matrix Φ. A multivariate normal distribution is an attractive choice since it is a tractable multivariate distribution in which the associations among the variables are captured by Φ. The mean vector defined by Equation 10 and the covariance matrix Φ can be left unknown and estimated

5 The parameterization for the effect of religious identity is equivalent to that from using indicator/dummy variables, so that δ_{zi} = Σ_{j=1}^{9} δj dij, where dij = 1 if the i-th respondent has the j-th religious identity, and dij = 0 otherwise.




Figure 5. Posterior means and 95% credibility intervals for δ1, δ2, . . . , δ9 for data with a coarsened categorical covariate.

from the data.6 This model can be implemented in JAGS by specifying a multivariate normal distribution for the q coarsened covariates. Appendix C shows the JAGS model file for the model discussed above and discusses details concerning its implementation. The assumption that each coarsened covariate is normally distributed can be restrictive. Other distributions may better fit the data, such as the gamma-distributed covariate in the examples given earlier, but specifying a tractable joint distribution of two or more non-normal variables can be more difficult. The multivariate normal model can be generalized somewhat, however, by assuming that the transformed covariates are normally distributed. Consider replacing the last set of terms in Equation 9 with

δ1t1(z1) + δ2t2(z2) + · · · + δqtq(zq),

where t1, t2, . . . , tq are specified transformation functions for the q transformed coarsened covariates. For example, it might be reasonable to assume that a right-skewed covariate is normally distributed on the log scale, so that tj(zj) = e^{zj}.

The joint distribution of two or more categorical covariates is also categorical, and thus does not present any new issues. For a combination of coarsened continuous and categorical covariates, one approach would be to model the marginal distribution of the categorical covariates, and the conditional distribution of the continuous covariates given the categorical covariates, allowing the mean and/or covariance structure of the continuous covariates to vary over categories.

Interactions

Interactions between observed and coarsened covariates can be accommodated by generalizing the mean structure to include products of the observed and coarsened covariates. Consider a model in which the j-th covariate, xj, interacts with a single coarsened covariate. Here, we can generalize Equations 1 and 2 to include an interaction term such that

g[E(Yi|xi, Zi = z)] = β0 + β1xi1 + β2xi2 + · · · + βpxip + δz + γxjz,   (11)

if Zi is a quantitative covariate, or

g[E(Yi|xi, Zi = z)] = β0 + β1xi1 + β2xi2 + · · · + βpxip + δ_z + γ_z xj,   (12)

if Zi is a categorical covariate. Here, the terms γxjz and γ_z xj represent the effect of the interaction between the j-th observed covariate and the coarsened covariate, Zi. Interactions in the Bayesian approach proposed in this article can be addressed by modifying the mean structure appropriately, as shown above, and by specifying a prior distribution for γ or γ_z. Appendix D demonstrates how this can be done by modification of the JAGS model

Figure 6. Posterior means and 95% credibility intervals for β1 for data with a coarsened categorical covariate.

6 Note that for identification it is necessary to fix either δj or the variance of Zij, although this can be done post hoc by appropriately scaling the sample of simulated realizations.


Figure 7. Distribution of strength of religious identity over type of religious identity within religious preference.

files shown in Appendices A and B. This approach could be extended to interactions between the coarsened covariate and other observed covariates, or to the case discussed above with multiple coarsened covariates. From a modeling perspective, one can invoke the full flexibility of the (generalized) linear model framework, regardless of whether covariates are observed or coarsened.
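A sketch of the generalized mean structure with a single observed-by-coarsened interaction, as in Equation 11. For a coarsened case, z would be a sampled realization rather than an observed value; the function and variable names here are illustrative only.

```python
import numpy as np

def linear_predictor(beta0, beta, delta, gamma, x, z, j):
    """g[E(Y|x, Z = z)] = beta0 + x'beta + delta*z + gamma*x[j]*z (Equation 11)."""
    return beta0 + x @ beta + delta * z + gamma * x[j] * z

x = np.array([1.0, -1.0])
eta = linear_predictor(beta0=0.0, beta=np.array([1.0, 2.0]),
                       delta=1.0, gamma=0.5, x=x, z=2.0, j=0)
```

For a categorical coarsened covariate (Equation 12), delta and gamma would instead be vectors indexed by the sampled category z.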

Example: Spurious Effects From Coarsening

Maxwell and Delaney (1993) showed that covariate coarsening can result in spurious effects. They considered the case of a linear model with two covariates that are both coarsened using median splits, but only one of the covariates has an effect prior to coarsening. They showed that if the covariates are correlated, then it is possible to observe a spurious effect for a covariate that has no effect prior to coarsening. This is due to the confounding of the covariates and the fact that using a surrogate for a coarsened covariate introduces measurement error that does not allow the confounding to be properly taken into account. Here, we use a brief simulation study to replicate this phenomenon and to show that the missing data method can avoid such spurious effects. This also serves as an example of the implementation of this methodology with multiple coarsened covariates and with interactions involving coarsened covariates. First, consider the case of a linear model with a normally distributed response variable, an observed two-category covariate with n/2 observations per category, and two covariates with a bivariate normal distribution with zero means, unit variances, and correlation ρ. The mean structure of the model is

E(Yi|xi, Zi1 = z1, Zi2 = z2) = β0 + β1xi + δ1z1 + δ2z2,

with β0 = −1/2, β1 = δ1 = 1, and δ2 = 0, so that Yi depends on xi and Zi1 but not Zi2. The observed covariate xi is an indicator variable. The variance structure is such that Yi has unit variance given xi, Zi1 = z1, and Zi2 = z2. We generated 100 samples based on this model for each combination of sample size n = 50, 100, 200 and correlation ρ = 0, 0.7. In contrast to the previous simulations, here we included a range of sample sizes to show their effect on the posterior means. For each sample, the covariates Zi1 and Zi2 were coarsened using median splits. For each sample,

we applied a missing data analysis and a coarsened data analysis, where the latter assumed the above model but with z1 and z2 replaced by indicator variables based on the median splits as surrogates for the covariates. For the missing data analysis, a bivariate normal linear model was specified regressing Zi1 and Zi2 on xi, with unknown parameters, variances, and covariance. Diffuse priors were specified for all parameters, as described in the appendices. Figure 8 shows the distributions of the posterior means of the parameters of the mean structure. First, note that regardless of sample size or correlation, the posterior means from the missing data analysis are consistent with the known values from the underlying model. In particular, consider the distribution of the posterior mean of δ2, which represents the effect of Zi2 that is known to be null. Although the values of the parameters are not directly comparable across approaches due to the change in scale, the value of δ2 should be near zero in both cases. The posterior means of δ2 do not exhibit a systematic tendency to show a positive or negative effect of Zi2. Now compare this to the distribution of the posterior means of the model parameters from the coarsened data analysis that used the surrogates for Zi1 and Zi2. Note the distribution of the posterior means of δ2 relative to zero when ρ = 0.7. Here, there is a systematic tendency for the coarsened data analysis to exhibit a spurious effect for the covariate, as the posterior means tend to be systematically greater than zero. Thus, the coarsened data analysis can lead to an incorrect conclusion concerning the effect of Zi2, while the missing data analysis is robust to such spurious effects. Now consider a more complicated model to illustrate spurious effects in a model with an interaction between the observed covariate and a coarsened covariate.
The model is an extension of the previous model with

E(Yi|xi, Zi1 = z1, Zi2 = z2) = β0 + β1xi + δ1z1 + δ2z2 + γ1xiz1 + γ2xiz2,

with γ1 = 1 and γ2 = 0. Here, there is an interaction between xi and Zi1, meaning that the effect of xi depends on Zi1 (and vice versa), but there is no interaction between xi and Zi2, and furthermore there is no effect of Zi2 at either level of xi (i.e., no main effect for Zi2). We generated data from this model using the same distributions, parameters, and design as in the previous simulation, aside from the change in the mean structure.
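The spurious-effect phenomenon can be reproduced outside the Bayesian machinery with a quick least-squares check (our sketch, not the authors' simulation code): regressing Y on the median-split surrogates with correlated covariates (ρ = 0.7) yields estimates of the null coefficient δ2 that are systematically above zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho, reps = 200, 0.7, 200
d2_hats = []
for _ in range(reps):
    x = np.repeat([0.0, 1.0], n // 2)                     # two-category covariate
    z = rng.multivariate_normal([0.0, 0.0],
                                [[1.0, rho], [rho, 1.0]], size=n)
    y = -0.5 + x + z[:, 0] + rng.normal(size=n)           # delta2 = 0: no Z2 effect
    s1 = (z[:, 0] > np.median(z[:, 0])).astype(float)     # median-split surrogates
    s2 = (z[:, 1] > np.median(z[:, 1])).astype(float)
    X = np.column_stack([np.ones(n), x, s1, s2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    d2_hats.append(coef[3])                               # estimated "effect" of Z2
```

Because the median-split surrogate s1 captures Z1 only coarsely, the correlated surrogate s2 absorbs part of the uncontrolled Z1 effect, so the average estimate of the null coefficient is pushed above zero; with ρ = 0 this does not occur.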


covariates are difficult to implement, are computationally intensive, and are not widely available in statistical packages. We have shown that when generalized linear models with coarsened covariates are posed as Bayesian probability models with missing or censored data, these models can be readily implemented in available software such as JAGS. Those with some familiarity with simulation-based Bayesian inference will find the implementation relatively straightforward. We should comment on the issue of the specification of the distribution of the covariate prior to coarsening. Accounting for the coarsened covariate assumes a particular parametric model for the distribution of the covariate, which should be considered as carefully as one would consider the specified parametric distribution of the response variable in a GLM. In some cases, data from other sources, such as cases with uncoarsened values, may provide such information. For example, if the covariate has been observed to be right-skewed, then an appropriately parameterized gamma distribution, as we used in our examples, might be appropriate. The nature of the covariate can also help guide the specification of its distribution. For a covariate that is a count, for example, one might consider an appropriate distribution for count data such as the Poisson or negative binomial. Consideration of the distribution of the

Figure 8. Distributions of the posterior means of β0, β1, δ1, and δ2 by method of analysis. The true values are β0 = −1/2, β1 = δ1 = 1, and δ2 = 0. The points and line segments within each box plot show the mean and 95% confidence interval for the posterior mean, respectively.

Figure 9 shows the distributions of the posterior means of the parameters of the mean structure. We also report the results for the posterior means of two contrasts, c1 = δ1 + γ1 and c2 = δ2 + γ2, which are the effects of Zi1 and Zi2 at the second level of xi (note that δ1 and δ2 are these effects at the first level of xi). Again, the posterior means from the missing data analysis are consistent with the known parameter values. The missing data analyses appear to accurately capture the interaction between xi and Zi1, the null effect of Zi2, and its null interaction with xi. However, the coarsened data analysis shows a spurious effect/interaction for Zi2 when ρ = 0.7. Although Zi2 has no effect at either level of xi, an analysis based on coarsened data is subject to spurious effects, including spurious interactions. The missing data analysis is not subject to these spurious effects.

Discussion

MacCallum et al. (2002) suggested that a lack of knowledge of statistical methods might, in part, be responsible for the practice of dichotomization of variables. We would further suggest that a similar lack of knowledge might also be partially responsible for the failure of researchers to properly account for coarsened variables when the coarsening is unavoidable or irreversible. Moreover, methods that have been proposed to deal with coarsened

Figure 9. Distributions of the posterior means of β0, β1, δ1, δ2, γ1, γ2, c1 = δ1 + γ1, and c2 = δ2 + γ2 by method of analysis. The true values are β0 = −1/2, β1 = δ1 = 1, δ2 = 0, γ1 = 1, and γ2 = 0. The points and line segments within each box plot show the mean and 95% confidence interval for the posterior mean, respectively.


covariate is analogous to that of the distribution of the response variable, since the covariate is effectively regressed on the other regressors. For censored continuous covariates, similar issues arise in the specification of models for limited dependent variables (e.g., Long, 1997; Maddala, 1983). The class of models considered in this article is fairly broad since it is based on the theory of generalized linear models. We also considered generalizations with respect to the coarsened covariates including quantitative and categorical covariates, multiple coarsened covariates, and interactions involving coarsened covariates. There are some possible generalizations, however, that are beyond the scope of this article. One is the case of mixed multilevel models with coarsened covariates. To date, most discussions of coarsened covariates have focused on models with only fixed effects. An exception is Johnson (2006), who considered generalized linear mixed models with normally-distributed coarsened covariates. In these situations, a coarsened covariate can be conceptualized as a random effect, and failure to account for this random effect means that the model does not properly account for the conditional independence of the responses. In principle, such models could be implemented using JAGS, but given the complexity of these models, further work looking at these models specifically may be worthwhile. Another potential area of future work is models that are more flexible with regard to the distribution of the covariate. Earlier we mentioned considerations in specifying a distribution of the covariate. Another approach might be to consider mixture or semi-parametric distribution methods that have been developed for random effects (e.g., Zhang & Davidian, 2001).

References

Albert, J. H., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88, 669–679. doi:10.1080/01621459.1993.10476321
Arminger, G., & Muthén, B. O. (1998). A Bayesian approach to nonlinear latent variable models using the Gibbs sampler and the Metropolis–Hastings algorithm. Psychometrika, 63, 271–300. doi:10.1007/BF02294856
Austin, P. C., & Brunner, L. J. (2003). Type I error inflation in the presence of a ceiling effect. The American Statistician, 57, 97–104. doi:10.1198/0003130031450
Austin, P. C., & Hoch, J. S. (2004). Estimating linear regression models in the presence of a censored independent variable. Statistics in Medicine, 23, 411–429. doi:10.1002/sim.1601
Brooks, S. P., & Gelman, A. (1998). General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7, 434–455.
Carroll, R. J. (1989). Covariance analysis in generalized linear measurement error models. Statistics in Medicine, 8, 1075–1093. doi:10.1002/sim.4780080907
Cohen, J. (1983). The cost of dichotomization. Applied Psychological Measurement, 7, 249–253. doi:10.1177/014662168300700301
Congdon, P. (2003). Applied Bayesian modelling. New York, NY: Wiley. doi:10.1002/0470867159
Congdon, P. (2007). Bayesian statistical analysis (2nd ed.). New York, NY: Wiley.
Cowles, M. K., & Carlin, B. P. (1996). Markov chain Monte Carlo convergence diagnostics: A comparative review. Journal of the American Statistical Association, 91, 883–904. doi:10.1080/01621459.1996.10476956
DeCoster, J., Iselin, A. R., & Gallucci, M. (2009). A conceptual and empirical examination of justifications for dichotomization. Psychological Methods, 14, 349–366. doi:10.1037/a0016956


Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39, 1–22.
Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. New York, NY: Cambridge University Press.
Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7, 457–472. doi:10.1214/ss/1177011136
Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (Eds.). (1996). Markov chain Monte Carlo in practice. doi:10.1007/978-1-4899-4485-6
Giovanini, J. (2008). Generalized linear mixed models with censored covariates (Unpublished doctoral dissertation). Oregon State University, Corvallis.
Gómez, G., Espinal, A., & Lagakos, S. W. (2003). Inference for a linear regression model with an interval-censored covariate. Statistics in Medicine, 22, 409–425. doi:10.1002/sim.1326
Haitovsky, Y. (1983). Grouped data. In S. Kotz & N. L. Johnson (Eds.), Encyclopedia of statistical sciences (Vol. 3, pp. 527–536). New York, NY: Wiley.
Heitjan, D. F., & Rubin, D. B. (1991). Ignorability and coarse data. Annals of Statistics, 19, 2244–2253. doi:10.1214/aos/1176348396
Ibrahim, J. G., Chen, M. H., Lipsitz, S. R., & Herring, A. H. (2005). Missing-data methods for generalized linear models: A comparative review. Journal of the American Statistical Association, 100, 332–346. doi:10.1198/016214504000001844
Irwin, J. R., & McClelland, G. H. (2003). Negative consequences of dichotomizing continuous predictor variables. Journal of Marketing Research, 40, 366–371. doi:10.1509/jmkr.40.3.366.19237
Johnson, T. R. (2006). Generalized linear models with ordinally-observed covariates. British Journal of Mathematical and Statistical Psychology, 59, 275–300. doi:10.1348/000711005X65762
Kruschke, J. K. (2011). Doing Bayesian data analysis: A tutorial with R and BUGS. Burlington, MA: Academic Press.
Lipsitz, S., Parzen, M., Natarajan, S., Ibrahim, J., & Fitzmaurice, G. (2004). Generalized linear models with a coarsened covariate. Applied Statistics, 53, 279–292. doi:10.1046/j.1467-9876.2003.05009.x
Long, S. J. (1997). Regression models for categorical and limited dependent variables. Thousand Oaks, CA: Sage.
Lunn, D., Spiegelhalter, D., Thomas, A., & Best, N. (2009). The BUGS project: Evolution, critique and future directions. Statistics in Medicine, 28, 3049–3067. doi:10.1002/sim.3680
Lynn, H. S. (2001). Maximum likelihood inference for left-censored HIV RNA data. Statistics in Medicine, 20, 33–45. doi:10.1002/1097-0258(20010115)20:1<33::AID-SIM640>3.0.CO;2-O
MacCallum, R. C., Zhang, S., Preacher, K. J., & Rucker, D. D. (2002). On the practice of dichotomization of quantitative variables. Psychological Methods, 7, 19–40. doi:10.1037/1082-989X.7.1.19
Maddala, G. S. (1983). Limited-dependent and qualitative variables in econometrics. doi:10.1017/CBO9780511810176
Maxwell, S. E., & Delaney, H. D. (1993). Bivariate median-splits and spurious statistical significance. Psychological Bulletin, 113, 181–190. doi:10.1037/0033-2909.113.1.181
May, R. C., Ibrahim, J. G., & Chu, H. (2011). Maximum likelihood estimation in generalized linear models with multiple covariates subject to detection limits. Statistics in Medicine, 30, 2551–2561. doi:10.1002/sim.4280
McCool, J. I. (1982). Censored data. In S. Kotz & N. L. Johnson (Eds.), Encyclopedia of statistical sciences (Vol. 1, pp. 389–396). New York, NY: Wiley.
Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society: Series A, 135, 370–384. doi:10.2307/2344614
Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In K. Hornik, F. Leisch, & A. Zeileis (Eds.), Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003). Retrieved from http://www.ci.tuwien.ac.at/Conferences/DSC-2003/Proceedings/Plummer.pdf
Plummer, M. (2011). rjags: Bayesian graphical models using MCMC (R package Version 3-5) [Computer software]. Retrieved from http://CRAN.R-project.org/package=rjags
Plummer, M., Best, N., Cowles, K., & Vines, K. (2006). CODA: Convergence diagnosis and output analysis for MCMC. R News, 6, 7–11.
R Development Core Team. (2012). R: A language and environment for statistical computing [Computer software]. Retrieved from http://www.R-project.org
Rigobon, R., & Stoker, T. M. (2007). Estimation with censored regressors: Basic issues. International Economic Review, 48, 1441–1467. doi:10.1111/j.1468-2354.2007.00470.x
Ronning, G., & Kukuk, M. (1996). Efficient estimation of ordered probit models. Journal of the American Statistical Association, 91, 1120–1129. doi:10.1080/01621459.1996.10476982
SAS Institute. (2008). SAS/STAT 9.2 user's guide. Cary, NC: Author.
Smith, T. W., Marsden, M. H., & Kim, J. (2011). General Social Survey, 1972–2010, cumulative file. Retrieved from http://www3.norc.org/GSS+Website/
Stefanski, L. A. (1985). The effects of measurement error on parameter estimation. Biometrika, 72, 583–592. doi:10.1093/biomet/72.3.583
Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82, 528–540. doi:10.1080/01621459.1987.10478458
Thomas, A., O'Hara, B., Ligges, U., & Sturtz, S. (2006). Making BUGS open. R News, 6, 12–17.

Thompson, M., & Nelson, K. P. (2003). Linear regression with Type I interval- and left-censored response data. Environmental and Ecological Statistics, 10, 221–230. doi:10.1023/A:1023630425376
Tsimikas, J. V., Bantis, L. E., & Georgiou, S. D. (2012). Inference in generalized linear regression models with a censored covariate. Computational Statistics & Data Analysis, 56, 1854–1868. doi:10.1016/j.csda.2011.11.010
Vargha, A., Rudas, T., Delaney, H. D., & Maxwell, S. E. (1996). Dichotomization, partial correlation, and conditional independence. Journal of Educational and Behavioral Statistics, 21, 264–282.
Wei, G. C. G., & Tanner, M. A. (1990). A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. Journal of the American Statistical Association, 85, 699–704. doi:10.1080/01621459.1990.10474930
Willenborg, L., & de Waal, T. (1996). Statistical disclosure control in practice. doi:10.1007/978-1-4612-4028-0
Willenborg, L., & de Waal, T. (2000). Elements of statistical disclosure control. New York, NY: Springer.
Winship, C., & Mare, R. D. (1984). Regression models with ordinal variables. American Sociological Review, 49, 512–525. doi:10.2307/2095465
Zeger, S. L., & Karim, M. R. (1991). Generalized linear models with random effects: A Gibbs sampling approach. Journal of the American Statistical Association, 86, 79–86. doi:10.1080/01621459.1991.10475006
Zhang, D., & Davidian, M. (2001). Linear mixed models with flexible distributions of random effects for longitudinal data. Biometrics, 57, 795–802. doi:10.1111/j.0006-341X.2001.00795.x


Appendix A

Just Another Gibbs Sampler (JAGS) Model File and Analysis for a Coarsened Quantitative Covariate

According to the Bayesian probability model in Equation 8, the joint distribution of Zi and Si is


h(zi; xi, α) I(zi ∈ si).

Recall that Si is an observed interval. In the example with the gamma-distributed covariate with a median split, Si is either (0, median) or (median, ∞), depending on whether Zi is below or above the median, respectively. If Zi is not coarsened, then for convenience the interval can be defined as (zi − Δ, zi + Δ), where Δ is a sufficiently small non-negative value so that the distribution of Zi is (nearly) fixed at its observed value. For the right-censored covariate, the intervals for the 50% coarsened case can be defined as (zi − Δ, zi + Δ) or [median, ∞), depending on whether Zi is below or above the median, respectively. The limits of Si can be stored in an n × 2 matrix C, where the i-th row of C contains the lower and upper limits of Si. The following is the JAGS model file for the Bayesian probability model for the first example with the interval- or right-censored covariate.

data { dimx
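Only the opening of the original listing is reproduced above. The following is a minimal sketch of such a JAGS model file, assuming a gamma-distributed covariate z censored within the intervals stored in C, a binary response y modeled with a logit link, and a regressor matrix x whose first column is an intercept; the variable names, the Bernoulli response, and the diffuse priors are illustrative assumptions, not necessarily those of the original listing.

```
data {
  # dimensions of the regressor matrix x
  dimx <- dim(x)
  n <- dimx[1]
  p <- dimx[2]
}
model {
  for (i in 1:n) {
    # coarsening: z[i] is known only to lie between C[i,1] and C[i,2];
    # s[i] = 1 is supplied as observed data for every case
    s[i] ~ dinterval(z[i], C[i, ])
    # distribution of the coarsened covariate, regressed on the other
    # regressors with a log link
    z[i] ~ dgamma(shape, rate[i])
    log(mu.z[i]) <- inprod(x[i, ], alpha[1:p])
    rate[i] <- shape / mu.z[i]
    # generalized linear model for the response
    y[i] ~ dbern(pi[i])
    logit(pi[i]) <- inprod(x[i, ], beta[1:p]) + beta[p + 1] * z[i]
  }
  # diffuse priors (illustrative)
  shape ~ dgamma(0.01, 0.01)
  for (j in 1:p) { alpha[j] ~ dnorm(0, 0.001) }
  for (j in 1:(p + 1)) { beta[j] ~ dnorm(0, 0.001) }
}
```

In rjags, a file like this would be passed to jags.model() with a data list containing y, x, C, and a vector s of ones indicating that each z[i] falls within its interval; any unobserved z[i] should be initialized at a value strictly inside its interval so that the sampler starts from a state of positive probability.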
