This article was downloaded by: [University of Saskatchewan Library] On: 19 November 2014, At: 03:33 Publisher: Routledge Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK

Multivariate Behavioral Research Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/hmbr20

Augmenting Data With Published Results in Bayesian Linear Regression a

Christiaan de Leeuw & Irene Klugkist a

b

Radboud University Nijmegen

b

Utrecht University Published online: 15 Jun 2012.

To cite this article: Christiaan de Leeuw & Irene Klugkist (2012) Augmenting Data With Published Results in Bayesian Linear Regression, Multivariate Behavioral Research, 47:3, 369-391, DOI: 10.1080/00273171.2012.673957 To link to this article: http://dx.doi.org/10.1080/00273171.2012.673957

PLEASE SCROLL DOWN FOR ARTICLE Taylor & Francis makes every effort to ensure the accuracy of all the information (the “Content”) contained in the publications on our platform. However, Taylor & Francis, our agents, and our licensors make no representations or warranties whatsoever as to the accuracy, completeness, or suitability for any purpose of the Content. Any opinions and views expressed in this publication are the opinions and views of the authors, and are not the views of or endorsed by Taylor & Francis. The accuracy of the Content should not be relied upon and should be independently verified with primary sources of information. Taylor and Francis shall not be liable for any losses, actions, claims, proceedings, demands, costs, expenses, damages, and other liabilities whatsoever or howsoever caused arising directly or indirectly in connection with, in relation to or arising out of the use of the Content.

Downloaded by [University of Saskatchewan Library] at 03:33 19 November 2014

This article may be used for research, teaching, and private study purposes. Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is expressly forbidden. Terms & Conditions of access and use can be found at http://www.tandfonline.com/page/terms-and-conditions

Downloaded by [University of Saskatchewan Library] at 03:33 19 November 2014

Multivariate Behavioral Research, 47:369–391, 2012 Copyright © Taylor & Francis Group, LLC ISSN: 0027-3171 print/1532-7906 online DOI: 10.1080/00273171.2012.673957

Augmenting Data With Published Results in Bayesian Linear Regression Christiaan de Leeuw Radboud University Nijmegen

Irene Klugkist Utrecht University

In most research, linear regression analyses are performed without taking into account published results (i.e., reported summary statistics) of similar previous studies. Although the prior density in Bayesian linear regression could accommodate such prior knowledge, formal models for doing so are absent from the literature. The goal of this article is therefore to develop a Bayesian model in which a linear regression analysis on current data is augmented with the reported regression coefficients (and standard errors) of previous studies. Two versions of this model are presented. The first version incorporates previous studies through the prior density and is applicable when the current and all previous studies are exchangeable. The second version models all studies in a hierarchical structure and is applicable when studies are not exchangeable. Both versions of the model are assessed using simulation studies. Performance for each in estimating the regression coefficients is consistently superior to using current data alone and is close to that of an equivalent model that uses the data from previous studies rather than reported regression coefficients. Overall the results show that augmenting data with results from previous studies is viable and yields significant improvements in the parameter estimation.

Although science itself is cumulative in nature, this is often not reflected in statistical analysis. Even when numerous studies have already been published on a given topic, new data relating to that topic are usually still analysed in isolation. This is unfortunate as results from previous studies constitute a source Correspondence concerning this article should be addresssed to Christiaan de Leeuw. E-mail: [email protected]

369

Downloaded by [University of Saskatchewan Library] at 03:33 19 November 2014

370

DE

LEEUW AND KLUGKIST

of relevant and readily available information. By ignoring this information both the stability and the precision of the parameter estimates are lower than they could have been, and the conclusions that can be drawn are less certain and more likely to be affected by sampling variation. Due to the frequent combination of multiple testing with relatively low statistical power, this becomes especially problematic in the framework of null hypothesis significance testing. As Maxwell (2004) demonstrates, in such a situation even exact replications of studies are likely to yield conflicting conclusions. To improve both parameter estimates and consistency in the literature it would thus be beneficial if new studies could take into account the results of relevant studies that preceded them. In the social and behavioral sciences a commonly used model is multiple linear regression, which almost invariably includes separate significance tests for all regression coefficients. Combined with the often rather modest sample sizes, this makes it likely for different studies to find different sets of statistically significant predictors. Moreover, confidence intervals tend to be too wide to draw firm conclusions. The best way to address both these issues would be to use larger samples, but this is rarely a realistic option. Augmenting the data of a new study with previous results (the reported regression coefficients and standard errors only rather than the underlying data) from similar studies might therefore be an attractive alternative because this effectively increases the sample size as well. The goal of this article is therefore to develop and test a formal method for augmenting data in linear regression analyses in this manner. An obvious candidate for incorporating previous results in a new study is the prior density in a Bayesian analysis. As every textbook on Bayesian statistics is quick to point out, the posterior of one analysis can readily be used as the prior for another. It is therefore surprising that a search through the statistical literature turns up exceedingly few papers directly relevant to this issue. The topic of constructing informative priors based on published results appears to have received little attention in Bayesian statistical research. One rare exception is Yu et al. (2009), who used previously published studies to create informative priors for the coefficients of a logistic regression. Their efforts do not constitute a comprehensive or generalizable approach, however. A systematic method is not presented, and informative priors were specified for only some of the regression coefficients and separately for each of those. Spiegelhalter, Abrams, and Myles (2004) also briefly go into the topic, but their focus on the univariate case makes it relatively restricted in scope as well. Finally, Kuijper, Hoijtink, Buskens, and Raub (in press) present an approach that combines evidence for specific predictors from different linear regression analysis. This is based on Bayes Factors, however, and does not yield combined parameter estimates. In general, although it is likely that ad hoc ways of specifying informative priors using literature have been used in specific applications, more formal statistical approaches to this issue are almost entirely absent from the literature.

Downloaded by [University of Saskatchewan Library] at 03:33 19 November 2014

AUGMENTING DATA

371

Two approaches closely related to the present issue have received more attention. The first is combining published results through meta-analysis. Though meta-analysis does not directly incorporate data, it is conceivable that the reported regression coefficients of various linear regressions could be combined using meta-analysis to obtain a prior density for a new data set. However, most research on meta-analysis deals with the univariate case (but see Arends, Vokó, & Stijnen, 2003; Hripsime & Raudenbush, 1996; and Nam, Mengersen, & Garthwaite, 2003, among others, for examples of multivariate meta-analysis; Sutton and Higgins, 2008, also give a more general view of recent developments in meta-analysis) where an application to the present issue would call for simultaneously combining multiple regression parameters per study. Although separate meta-analyses could be done for each regression coefficient, this would mean ignoring the relation between the regression coefficients (see also Riley, 2009). The second related approach is the construction of informative priors using historical data, in which the data of previous studies are used to construct the prior (see, e.g., Ibrahim & Chen, 2000, or Chapter 5 of Gelman, Carlin, Stern, & Rubin, 2004). Using the data itself rather than the reported summary statistics is clearly preferable but may not be possible in practice. Various obstacles in obtaining the full data sets used in different studies may present themselves, including difficulties in contacting the original researchers and possibly their refusal to share their data. Wicherts, Borsboom, Kats, and Molenaar (2006) provide a clear illustration of difficulties of this kind. Even if a portion of all the full data sets can be obtained, it is likely that some data sets will remain unavailable. Because it would be undesirable to simply disregard those studies, it would be preferable to find some way of using the reported summary statistics for those studies instead. If a method using only those summary statistics can be shown to yield an adequate approximation of using the data itself, this would therefore provide a more acceptable solution when data from previous studies are not (realistically) obtainable. What is presented in this article is a Bayesian model for multiple linear regression for augmenting the data of a current study with published results of earlier studies.1 Two versions of the model are developed to suit two different situations. The first version incorporates previous studies through the prior and is applicable when studies are exchangeable. The second version has a two-level hierarchical structure with current and previous data on the lower level. This version of the model is appropriate for situations where the studies are not exchangeable and also allows study-level variables to be incorporated in the model. The two versions of the model are hereafter referred to as the replication model 1 In this article it is assumed that only one full data set, the “current study,” is available. However, and in line with the previous paragraph on using historic data, this can straightforwardly be extended to include multiple full data sets as needed.

372

DE

LEEUW AND KLUGKIST

and hierarchical replication model, respectively. For both versions it is assumed that all previous studies use the same set of predictors as the analysis on the current data. This assumption is further addressed in the discussion. Simulation studies are performed to assess the performance of both versions of the model.

Downloaded by [University of Saskatchewan Library] at 03:33 19 November 2014

REPLICATION MODEL In this section a Bayesian model is developed for incorporating results from earlier studies in a new linear regression analysis through the prior density of the regression coefficients. This model is for use in situations where studies are exchangeable, that is, when there is no relevant information other than the data or reported regression coefficients by which the studies can be distinguished. As noted, it is assumed that current data and previous studies all have the same set of predictors. In its general form, a Bayesian analysis consists of three parts. The first part is the probability density function for the data given the unknown model parameters; taken as a function of those model parameters, this is referred to as the likelihood function. The second part is the prior density function of the model parameters, which quantifies what is assumed to be known about the model parameters prior to observing the data. Together these yield the posterior density of the model parameters, which reflects the updated knowledge about the model parameters after having observed the data. For the replication model these three elements are described in the following three subsections. The simulation study used to evaluate the performance of the replication model is described in the next section. Exchangeability Because the assumption of exchangeability is central to the replication model, its implications are outlined here. A technical elaboration of the concept of exchangeability can be found in Bernardo and Smith (2000) and Bernardo (1996), but a more accessible treatment is also given by Gelman et al. (2004). Exchangeability of a set of observations A1 through An can be defined as stating that the joint distribution p.A1 ;    ; An / is invariant to permutation of the indices 1 through n. This concept is routinely used in statistical analysis on a single data set. Observations within the data set are considered to differ only with respect to the variables available in the data set. If not all those variables are used in a particular analysis, this implies the assumption that differences on the discarded variables are not relevant to the parameters estimated in that analysis. With respect to those discarded variables this assumption can be tested directly, but in general the exchangeability must usually simply be assumed. As Gelman et al. (2004) put it, ignorance implies exchangeability. If we know nothing about

Downloaded by [University of Saskatchewan Library] at 03:33 19 November 2014

AUGMENTING DATA

373

the observations beyond the variables that are recorded, we have nothing further to distinguish them and thus rationally should treat all those observations the same in the analysis. For the model developed in this section, this same principle can be applied to entire studies. Effectively, the previous and current studies can be treated as themselves being observations, which can be considered exchangeable. However, there are two main differences with the within-sample exchangeability of observations as outlined earlier. The first is that the new study is not exchangeable with the previous studies in that it provides a full data set instead of just summary statistics. But although it cannot be treated as exchangeable with the previous studies in that regard, it can be treated as such with respect to the underlying regression parameters that are estimated. The second difference with withinsample exchangeability is that here there is additional information available on each observation than just what is used as data here (i.e., the estimated regression coefficients, their standard errors, and the sample size). Information about the origin of the original study’s data and the manner in which it was collected are all reported along with a variety of other properties of that study by which it can be distinguished from the other studies. Consequently, the assumption of exchangeability of the estimated regression parameters (given sample size) of the studies entails the assumption that these estimates are not systematically affected by any of the additional known properties of those studies. The validity of this assumption will depend on what additional information on each study is available and the degree to which studies differ with regard to it. If the studies are exact replications of each other, or are sufficiently similar to be considered as such, the assumption would be reasonable. However, if this is not the case the present model would not be appropriate. In that situation the hierarchical model developed later in this article can be used because that model assumes heterogeneity between studies and also allows information about the differences between studies to be explicitly included. The exchangeability assumption in that model is shifted from the observed coefficients and data to the parameters estimated from them, possibly conditional on study-level variables. The Linear Regression Model Let the current data set consist of an outcome variable y and a predictor data matrix X containing v predictor variables x1 through xv . An additional dummy variable x0 of all 1’s is included in X to obtain the intercept. For i D 1;   ; n observations, a schematic of the data is shown in Table 1. The model used for this data is the standard multiple linear regression model. For the i th observation, yi D ’0 C ’1 x1;i C    C ’v xv;i C ©i D XiT ’ C ©i ;

(1)

374

DE

LEEUW AND KLUGKIST

Downloaded by [University of Saskatchewan Library] at 03:33 19 November 2014

TABLE 1 Schematic of Current Data (X and y) x0

x1

x2



xv

y

1 1 :: : 1

x1;1 x1;2 :: : x1;n

x2;1 x2;2 :: : x2;n

  :: : 

xv;1 xv;2 :: : xv;n

y1 y2 :: : yn

with ©i ∼ N.0; ¢e2 /;

(2)

where ©i is the error term and the parameters ’ D .’0 ; ’1 ;   ; ’v /T and ¢e2 are the vector of regression coefficients and the residual variance, respectively. In this model X is considered fixed, and under the usual assumption that the error terms are independent from each other it therefore follows that yjX;’;¢e2 ∼ MVN.X’; ¢e2 In /;

(3)

with In the n by n identity matrix. The full Bayesian regression model is given as p.’; ¢e2 jX; y/ / p.yjX; ’; ¢e2 /p.’; ¢e2 /:

(4)

With the probability density function for the data p.yjX; ’; ¢e2 / already defined in Equation (3), this leaves the prior density p.’; ¢e2 / to be specified. The Prior Density The reported results from previous studies are incorporated in the prior for ’. It is assumed that for each of j D 1;   ; m previous studies, the regression coefficients b0;j through bv;j as well as the associated standard errors s0;j through sv;j have been reported. These are combined into the m by v C 1 data matrices B and S , respectively. Schematics for both are given in Table 2. It is further assumed that the model used in the previous studies is the linear regression model described earlier (Equations (1) and (2)). Following common convention ’ and ¢e2 are assumed independent in the prior, thus p.’; ¢e2 / D p.’/p.¢e2 /:

(5)

AUGMENTING DATA

375

Downloaded by [University of Saskatchewan Library] at 03:33 19 November 2014

TABLE 2 Schematic of B and S b0

b1

b2



bv

s0

s1

s2



sv

b0;1 b0;2 :: : b0;m

b1;1 b1;2 :: : b1;m

b2;1 b2;2 :: : b2;m

  :: : 

bv;1 bv;2 :: : bv;m

s0;1 s0;2 :: : s0;m

s1;1 s1;2 :: : s1;m

s2;1 s2;2 :: : s2;m

  :: : 

sv;1 sv;2 :: : sv;m

For the residual variance it is assumed that no relevant prior information is available. Therefore, the prior for ¢e2 is specified as p.¢e2 / / .¢e2 / 1 ; a common choice of noninformative prior for the variance parameter of a normal distribution (for more details on common noninformative prior densities, see, e.g., Lynch, 2007, or Gelman et al., 2004). To incorporate B and S in the prior of ’, it is defined as the posterior of a second Bayesian model as follows: p.’/ D p.’jB; S / / p.BjS; ’/p  .’/;

(6)

where p.BjS; ’/ D

m Y

p.Bj jSj ; ’/:

(7)

j D1

The probability function p  .’/ is the initial prior for ’, which reflects what is known about ’ before B and S are observed. Nothing relevant is assumed to be known initially; therefore it is specified as a noninformative improper uniform density over the whole real line, p  .’/ / 1 (see, e.g., Lynch, 2007, or Gelman et al., 2004, for more details). As Equation (6) suggests, the data matrix S is assumed to be fixed in this model,2 leaving only the probability density for B to be specified. As the sampling distribution of regression coefficients in a Least Squares regression analysis is asymptotically normal, the multivariate normal distribution with mean vector ’ is used for the regression coefficients Bj for each study j . With S , only the estimated standard errors of the reported regression coefficients are known; the correlations between coefficients are rarely reported. These correlations could be included in the model as unknown parameters to be 2 Because S is also estimated from the data in the original study, it is subject to uncertainty as well. This uncertainty, however, is never reported and therefore cannot be incorporated in the present model. As S itself is not of interest, this has little impact on the results.

Downloaded by [University of Saskatchewan Library] at 03:33 19 November 2014

376

DE

LEEUW AND KLUGKIST

estimated, but the number of previous studies will typically be too small to obtain reliable estimates. Because furthermore the correlations are not themselves of interest, they are all restricted to 0. A simulation study (not reported in this article) was performed to determine the effect of this restriction relative to using more accurate values, but no notable differences in model performance were found. With this restriction the covariance matrix for the distribution of Bj becomes 0 2 1 s0;j 0 0  0 2 B 0 s1;j 0  0 C B C 2 B 0 0 s2;j    0 C Sj D B C; B : :: :: :: C :: @ :: : : : : A 2 0 0 0    sv;j and therefore

Bj jSj ;’ ∼ MVN.’; Sj /;

(8)

which completes the specification of the model in Equation (6). Posterior Density and Gibbs Sampler With the aforementioned specification for the prior the following joint posterior density is obtained: p.’; ¢e2 jX; y; B; S / / p.yjX; ’; ¢e2 /

m Y

p.Bj jSj ; ’/p  .’/p.¢e2 /

j D1

/ .¢e2 /

. n2 C1/

8
u, set Ÿk D Step 5 Return to Step 1.

p.Ÿ k jZ;“0 ;“1 ;;“m ;”;;Ÿ.

k/ /

.c 1/

385

.

jZ;“0 ;“1 ;;“m ;”;;Ÿ. k/ / .c/ .c 1/ Ÿk . Otherwise, set Ÿk D Ÿk . p.Ÿk

Downloaded by [University of Saskatchewan Library] at 03:33 19 November 2014

.c 1/

For proposal density ppr .:/ the normal distribution with mean at Ÿk and variance £2k is used. The value of £2k is tuned during the burn-in of the Gibbs sampler such that the acceptance rate for the Metropolis-Hastings step gets close to 0.44, the optimal acceptance rate for univariate Metropolis-Hastings sampling (see Gelman, Roberts, & Gilks, 1996).

EVALUATION OF HIERARCHICAL REPLICATION MODEL Method A simulation study was done to illustrate the performance of the hierarchical replication model. Data were generated with four predictor variables x1 to x4 and an outcome variable y, but unlike the previous set of simulations it was assumed that further relevant information about the current and previous studies was known by which they could be distinguished. In particular, the data were generated from a mixture of two populations with the relative proportions varying between studies. Consequently, if the two populations differed in the effects of x1 ;   ; x4 on y, the results of each study would depend on these proportions. For this simulation, as an example the two populations are denoted “male” and “female,” and the sample proportion of males z1 is used as a studylevel variable. Although the variable gender would typically itself be used as a predictor in each individual study, for the purpose of illustration this is assumed not to be the case in any of the studies in this example. For the simulation study, the number of previous studies was set to three. Sample sizes for both the current data and previous studies was set to 50. For the sake of simplicity, the values for z1 were not randomly sampled but were specified in advance with a value of 0.4 for the current data and 0.2, 0.6, and 0.9 for the previous studies. Two multivariate normal populations for the predictor and outcome variables were defined, one for males and one for females. Data for each of the studies were generated by drawing samples from these separate populations in the appropriate proportion as given by z1 . For the previous studies, Ordinary Least Squares linear regression was again used to obtain the regression coefficients and their standard errors. The populations were defined as follows: In both, the means and standard deviations of the x’s were 0 and 1, respectively, and the standard deviation of

386

DE

LEEUW AND KLUGKIST

TABLE 3 Correlations for the Male and Female Populations With Differences Highlighted in Bold Font

Downloaded by [University of Saskatchewan Library] at 03:33 19 November 2014

Male

Female

Predictor

x1

x2

x3

x4

y

x1

x2

x3

x4

y

x1 x2 x3 x4 y

1 0.1 0.1 0 0.1

0.1 1 0.1 0 0.3

0.1 0.1 1 0 0.5

0 0 0 1 0.1

0.1 0.3 0.5 0.1 1

1 0.4 0.4 0 0.1

0.4 1 0.4 0 0.3

0.4 0.4 1 0 0.5

0 0 0 1 0.4

0.1 0.3 0.5 0.4 1

y was 2. The mean of y was 5 for the male population and 3 for the female population. The correlations for the two populations are given in Table 3 with differences between the populations highlighted in bold font. The simulation was run for a total of 100 data sets. Two comparison models were run alongside the hierarchical replication model. The first was the current data model, the same as in the previous simulation. The second was the hierarchical full data model. This again represents the ideal situation where data from previous studies are available. This model has the same specification as the hierarchical replication model except that the data of the previous studies are used instead of just B and S , using the same data distribution Equation (13) as the current data. In the hierarchical full data model all studies estimated the same shared residual variance parameter ¢e2 .

Results For this simulation the same two measures of model performance were used as in the previous simulations. The first one was the difference in posterior means of “0 of the hierarchical replication model and current data model on the one hand and those of the hierarchical full data model on the other hand. Computing the mean and 95% interval of these differences over all data sets gives a measure of the accuracy of the hierarchical replication model and current data model. The second measure of model performance was the width of the 95% CI of “0 . For these the mean and 95% interval of these widths over all data sets is computed for each of the three models. This gives a measure of the certainty of the posterior estimates of the models. The results for the first measure of model performance, the difference of posterior means with those of the hierarchical full data model, are shown in Figure 4. The results are very similar to those for the nonhierarchical replication

Downloaded by [University of Saskatchewan Library] at 03:33 19 November 2014

AUGMENTING DATA

387

FIGURE 4 Interval plot for “0 by predictor of the difference between the posterior mean of the full data model and the posterior mean of the replication model (black lines) and the posterior mean of the current data model (gray lines). The error bars show the 95% interval of the difference; the dots are the mean difference.

model. Again the posterior means for both models are similar on average to those of the full data model, but as before the variability was much greater for the current data model than for the hierarchical replication model. No differences were found between the predictors. Thus, as with the nonhierarchical version of the replication model, the hierarchical replication model shows a clear advantage over using the current data only in this respect. In addition, it also reasonably approximates the performance of the hierarchical full data model. Results for the second measure of model performance, the widths of the 95% CI for “0 , are shown in Figure 5. Compared with the previous simulations

FIGURE 5 Bar chart for “0 of the mean width of the 95% credibility interval (CI) for each predictor for each of the three models. The error bars show the 95% interval of the CI width.

Downloaded by [University of Saskatchewan Library] at 03:33 19 November 2014

388

DE

LEEUW AND KLUGKIST

the improvement due to using the previous studies shows the same pattern as before, but the effect is not as large. This is in part a result of the fact that in this case the studies all come from somewhat different populations, as explained in the previous section. Consequently the results of each previous study reflects the particular population it is drawn from, and these results are therefore less informative about the population of the current study. As such the certainty of the estimates of “0 does not improve as much when including the previous results as it did for the estimates of ’ in the previous simulations. What does remain the same is that the hierarchical replication and full data models show little difference for both the mean and the variability of the CI width, and both do still perform better than the current data model for all predictors. Whatever conclusions follow from the estimates of “0 , these are essentially restricted to the specific population from which the current study is drawn, with a particular gender ratio. Clearly, if gender has no effect on the regression coefficients the conclusions would be more general in their scope. Yet in this case, when using just the current data it cannot be determined whether gender indeed has no effect (and in fact, in this simulation it does have an effect) because gender is not itself included as a predictor variable. It follows that when using the current data model, it is not really possible to determine what the scope of the conclusions actually is. The researcher would either have to draw very limited conclusions or must assume that gender has no effect. In this regard the hierarchical replication model offers an additional advantage. Besides superior estimates of “0 , the hierarchical replication model also provides information about the relation between the different studies. By looking at the estimates of ”, the research can gain an insight into how the regression coefficients of the different studies are related and what role the gender ratio plays in any differences between those studies. To assess the performance of the hierarchical replication model with respect to estimating ”, the posterior means of ”0 and ”0 C ”1 were compared with the posterior means of the hierarchical full data model as well as the true population means. The reason ”0 C ”1 rather than ”1 was used is that ”0 C ”1 estimates the mean regression coefficients for a completely male population in the same way that ”0 does so for a completely female population. Using ”0 C ”1 therefore makes it easier to compare the estimates with the true population means. In Figure 6 are shown for ”0 C ”1 the mean and 95% interval over all data sets of the difference between the posterior means with those true population means. As can be seen in the figure, neither of the models shows any bias, and the variability of the estimates was the same for both. The same applies to ”0 . No difference between the models was found for the widths of the 95% CI (not shown in a plot here) of both ”0 and ”0 C ”1 either. Thus, in addition to offering improved estimates of “0 compared with the current data model, the hierarchical replication model provides reliable information on the potential

Downloaded by [University of Saskatchewan Library] at 03:33 19 November 2014

AUGMENTING DATA

389

FIGURE 6 Interval plot for ”0 C ”1 by predictor of the difference between the true population means and both the posterior mean of the full data model (black lines) and the posterior mean of the replication model (gray lines). The error bars show the 95% interval of the difference; the dots are the mean difference. For all k, l, ”kl is the regression parameter of zl for the coefficients of the kth predictor variable xk .

differences between studies as well as to what extent those differences can be explained by study-level variables.

DISCUSSION The goal of this article was to develop a model for augmenting a multiple linear regression analysis on new data with published results from earlier studies. Two questions that were addressed with respect to the model were how much would be gained in comparison to using just the new data, and how large would the difference be with the ideal situations in which data from previous studies were available? For both questions the results of the simulation studies were encouraging. Performance of the two versions of the replication model was consistently superior to using current data alone with less variability in posterior means and smaller credibility intervals. Moreover, both versions of the model showed performance equivalent to models that used all the data of the previous studies rather than just reported results. It can thus be concluded that the model developed in this article offers the possibility of obtaining significantly better parameter estimates in a linear regression setting without needing to expend a prohibitive amount of time and energy trying to obtain data from the previous studies. Moreover, the hierarchical version of the model offers the advantage of being able to address questions about differences between studies and thus allows for explicit testing of the exchangeability assumption. In particular, it gives explicit information on the

Downloaded by [University of Saskatchewan Library] at 03:33 19 November 2014

390

DE

LEEUW AND KLUGKIST

extent to which the estimates based on the current data are representative for a more general population. This is in contrast to the usual approach of linear regression on just the current data where this question is usually implicitly ignored and findings are potentially generalized to a greater extent than would be appropriate. One important issue that still needs to be addressed is the restriction that all studies have the same set of predictors. It was imposed in order to facilitate model development, and the next logical step in the development would be to remove it in order to make the model applicable to a wider range of situations. Removing this restriction does pose a challenging problem as the reported regression coefficients (as well as their standard errors) in previous studies are affected by which other predictors were included in the regression. To be used in the present model the previous studies need to have used the same set of predictors as are of interest in the current study. When they do not this would require that, in addition to filling in missing regression coefficients, the reported regression coefficients are adjusted to the values they would have had, had the needed set of predictors been used. Although this restriction may seem to greatly limit the applicability of the model, this can to some extent be worked around in practice. A problem would arise only when predictor variables were correlated; only then would the regression coefficient of one predictor be affected by the presence or absence of the other predictor. If this correlation can be assumed to be insignificant the present model can still be used. For extraneous predictors this is as simple as removing their coefficients from the data set. For missing predictors, this can be done by setting the coefficient to 0 and the standard error to a very large value to reflect the fact that the regression coefficient for that study is unknown. Whether the correlation with other predictors is indeed small enough to justify this would need to be determined from the studies that do use that predictor or from the new data, but this does allow modest deviations from the restriction. There are other issues that remain to be addressed as well, but regardless of any limitations of the present model the conclusion is clear. Incorporating the results of previous studies in a linear regression on new data does yield significantly better parameter estimates, and it provides a more than adequate approximation of using the actual data of those previous studies. Expanding the model to make it more generally applicable in practice is thus certainly warranted.

REFERENCES Arends, L., Vokó, Z., & Stijnen, T. (2003). Combining multiple outcome measures in a meta-analysis: An application. Statistics in Medicine, 22, 1335–1353.

Downloaded by [University of Saskatchewan Library] at 03:33 19 November 2014

AUGMENTING DATA

391

Barnard, J., McCulloch, R., & Meng, X. (2000). Modeling covariance matrices in terms of standard deviations and correlations, with application to shrinkage. Statistica Sinica, 10, 1281–1311. Bernardo, J. M. (1996). The concept of exchangeability and its applications. Far East Journal of Mathematical Sciences, 4, 111–121. Bernardo, J. M., & Smith, A. F. M. (2000). Bayesian theory. New York, NY: Wiley. Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis. New York, NY: Chapman & Hall/CRC. Gelman, A., Roberts, G. O., & Gilks, W. R. (1996). Efficient metropolis jumping rules. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics 5: Proceedings of the fifth Valencia international meeting (pp. 599–607). Oxford, UK: Clarendon Press. Hripsime, A. K., & Raudenbush, S. W. (1996). A multivariate mixed linear model for meta-analysis. Psychological Methods, 1(3), 227–235. Ibrahim, J., & Chen, M. (2000). Power prior distributions for regression models. Statistical Science, 15(1), 46–60. Kuijper, R., Hoijtink, H., Buskens, V., & Raub, W. A. (in press). Combining statistical evidence from several studies: A method using Bayesian updating and an example from research on trust problems in social and economic exchange. Sociological Methods and Research. Lynch, S. M. (2007). Introduction to applied Bayesian statistics and estimation for social scientists. New York, NY: Springer. Maxwell, S. (2004). The persistence of underpowered studies in psychological research: Causes consequences, and remedies. Psychological Methods, 9(2), 147–163. Nam, I., Mengersen, K., & Garthwaite, P. (2003). Multivariate meta-analysis. Statistics in Medicine, 22, 2309–2333. Riley, R. D. (2009). Multivariate meta-analysis: The effect of ignoring within-study correlation. Journal of the Royal Statistical Society, 172, 789–811. Spiegelhalter, D., Abrams, K., & Myles, J. (2004). Bayesian approaches to clinical trials and healthcare evaluation. Hoboken, NJ: Wiley. Sutton, A. J., & Higgins, J. (2008). Recent developments in meta-analysis. Statistics in Medicine, 27, 625–650. Wicherts, J., Borsboom, D., Kats, J., & Molenaar, D. (2006). The poor availability of psychological research data for reanalysis. American Psychologist, 61, 726–728. Yu, X., Xun, P., Hu, Z., Liu, P., Shen, H., & Chen, F. (2009). Combining previously published studies with current data in Bayesian logistic regression model: An example for identifying susceptibility genes related to lung cancer in humans. Journal of Toxicology and Environmental Health, 72, 683–689.

Augmenting Data With Published Results in Bayesian Linear Regression.

In most research, linear regression analyses are performed without taking into account published results (i.e., reported summary statistics) of simila...
563B Sizes 1 Downloads 9 Views