Research Article

Received 13 October 2012, Accepted 4 February 2014, Published online 17 March 2014 in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/sim.6125

A new synthesis analysis method for building logistic regression prediction models

Elisa Sheng (a), Xiao Hua Zhou (a, b, *, †), Hua Chen (d), Guizhou Hu (c) and Ashlee Duncan (c)

Synthesis analysis refers to a statistical method that integrates multiple univariate regression models and the correlation between each pair of predictors into a single multivariate regression model. The practical application of such a method could be developing a multivariate disease prediction model where a dataset containing the disease outcome and every predictor of interest is not available. In this study, we propose a new version of synthesis analysis that is specific to binary outcomes. We show that our proposed method possesses desirable statistical properties. We also conduct a simulation study to assess the robustness of the proposed method and compare it to a competing method. Copyright © 2014 John Wiley & Sons, Ltd.

Keywords: synthesis analysis; logistic regression; risk prediction model; risk factors; risk assessment; multivariate analysis

1. Introduction

Building a comprehensive, multivariate regression model requires a complete dataset, one that includes all covariate variables of interest and the outcome variable of interest. In practice, the available data are often incomplete, with covariate variables and/or the outcome variable missing. Because of the limited nature of complete datasets and to capitalize on the abundance of incomplete datasets, a method, referred to as synthesis analysis, was developed to construct multivariate regression models by synthesizing information available from various sources of incomplete data. More specifically, synthesis analysis develops a multivariate regression model using the information from (1) incomplete regression models or models fit on datasets containing some, but not all, covariate variables of interest, and (2) correlations between each pair of the covariate variables of interest.

For example, suppose a researcher wants to develop a comprehensive chronic disease prediction model that integrates several predictors, each of which has been evaluated and reported in disparate studies with an incomplete regression model. Ideally, this researcher would need a complete dataset, with all predictor variables of interest and the outcome variable, to fit a multivariate regression prediction model. Alternatively and more practically, he or she could use synthesis analysis methods and existing incomplete regression models to construct a multivariate prediction model if a complete dataset is not available or is simply too difficult to obtain.

When time to event and data censoring are important in model development, Cox proportional hazards methods are typically used to build chronic disease prediction models; however, if event time is fixed and data censoring is not an issue, logistic regression methods are frequently employed. For example, many chronic disease prediction models, constructed using logistic regression due to a case–control study

(a) Department of Biostatistics, University of Washington, Seattle, WA, U.S.A.
(b) School of Statistics, Renmin University of China, Beijing, China
(c) BioSignia, Inc., Durham, NC, U.S.A.
(d) Institute of Applied Physics and Computational Mathematics, Beijing, 100088, China
* Correspondence to: Xiao Hua Zhou, Department of Biostatistics, University of Washington, Seattle, WA, U.S.A.
† E-mail: [email protected]

Statist. Med. 2014, 33 2567–2576


design, have been reported (e.g., [1–4]). The objective of the present study is to develop a synthesis analysis method specific to the logistic regression setting. Several practical examples exist in which synthesis analysis with logistic regression can be useful. For example, some versions of diabetes risk prediction models report fasting glucose as a key predictor (e.g., [5]), while other versions have replaced fasting glucose with hemoglobin A1c (HbA1c), as HbA1c is more commonly measured in clinical practice (e.g., [6, 7]). Although highly correlated with fasting glucose, HbA1c does not measure the same biological condition [8–10]. Researchers interested in exploring and building a prediction model using both measures might find synthesis analysis useful. To use multivariate logistic regression, it may be necessary to execute a new study that measures both parameters along with the outcome of interest; however, it may be sufficient to use synthesis analysis with existing sources of information to obtain a prediction model. Synthesis analysis could also be used to develop a disease prediction model that integrates a newly discovered genetic marker into an older clinical marker-based prediction model when a comprehensive study dataset is not yet available. The concept of synthesis analysis was first introduced by Samsa et al. [11] and further extended in Zhou et al. [12]. However, the methodology proposed in these papers is not easily extended to logistic regression models because of the non-linear relationship between the conditional expectation of the outcome and the covariates. In 2005, Hu and Root proposed a synthesis analysis method for a logistic regression modeling framework that incorporates both new and previously known risk factors [13]. 
Their method is limited in several ways, particularly (1) it is unable to provide coefficient estimates from a comprehensive multivariate regression model, and (2) it lacks theoretical justification and was not tested on a variety of datasets, which may concern researchers wanting to implement the method. In this paper, we propose a method of synthesis analysis as an alternative to Hu and Root’s method. Our method has two main advantages over Hu and Root’s method: (1) it estimates the coefficients of a comprehensive multivariate logistic regression model, and (2) it is mathematically derived, making it theoretically correct under certain assumptions. We test the robustness of our proposed method compared to Hu and Root’s method on a variety of data distributions in a simulation study and evaluate both methods in a real population dataset.

2. The statistical problem

We are concerned with the case where the true relationship between a disease (or binary outcome, $Y$) and a set of risk factors can be modeled using a multivariate logistic regression model. We wish to estimate a multivariate logistic regression model where a complete dataset containing all risk factors of interest and the disease outcome is not available. Suppose that there are multiple logistic regression models of the same binary outcome fit on different subsets of covariates, estimated on samples from the same superpopulation. Denote the number of logit models as $K$, each with $k_i$ independent variables, where $i = 1, \ldots, K$. Note that $1 \le k_i \le p$ for each $i$ and the number of covariates in each of the $K$ models may be different. Let $\mathbf{X}_i$ be a vector of $k_i$ independent variables corresponding to the $i$th model. Define the $i$th individual logit model by

$$P(Y = 1 \mid \mathbf{X}_i = \mathbf{x}_i) = \operatorname{expit}(\beta_{i0} + \beta_i^T \mathbf{x}_i), \tag{1}$$

where $\operatorname{expit}(z) = e^z / (1 + e^z)$ denotes the inverse logit function.

Throughout the text, we will refer to models of the form (1) as incomplete logit models. Our goal is to synthesize the $K$ individual logit models into a single, comprehensive logit model. The synthesized logit model with $p$ covariates is defined by

$$P(Y = 1 \mid \mathbf{X} = \mathbf{x}) = \operatorname{expit}(\alpha_0 + \alpha^T \mathbf{x}), \tag{2}$$

where $\mathbf{X} = (X_1, \ldots, X_p)$, $\mathbf{x} = (x_1, \ldots, x_p)$, and $\alpha = (\alpha_1, \ldots, \alpha_p)$. We will refer to model (2) as the complete logit model.
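As a small illustration of model (2), the following Python sketch evaluates the complete logit model for given coefficients. The function names and the coefficient values are hypothetical, chosen only for illustration; they do not come from the paper.

```python
import numpy as np

def expit(z):
    """Inverse logit e^z / (1 + e^z), written via tanh so large |z| does not overflow."""
    return 0.5 * (1.0 + np.tanh(np.asarray(z, dtype=float) / 2.0))

def complete_model_probability(alpha0, alpha, x):
    """P(Y = 1 | X = x) under model (2); x is one covariate vector (p,) or a matrix (n, p)."""
    x = np.asarray(x, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    return expit(alpha0 + x @ alpha)

# Hypothetical coefficients for p = 3 covariates (illustrative values only).
probs = complete_model_probability(-1.0, [0.5, 0.2, -0.3],
                                   [[1.0, 2.0, 0.5], [0.0, 0.0, 0.0]])
```

The second row is the all-zero covariate vector, so its predicted probability depends on the intercept alone.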

3. Our proposed method

Our proposed method is based on a special relationship between coefficients from incomplete logistic regression models and the complete logistic regression model that exists under the assumption that the underlying data of the superpopulation follow a multivariate normal distribution. This relationship was derived by Efron in 1975 [14]; we restate it here as a proposition.

Proposition 1
If we assume that $\mathbf{X} \mid Y = 1 \sim N_p(\mu_1, \Sigma)$ and that $\mathbf{X} \mid Y = 0 \sim N_p(\mu_0, \Sigma)$, we have the following result:

$$\alpha_0 = \ln\frac{\pi_1}{\pi_0} + \frac{1}{2}\mu_0^T \Sigma^{-1} \mu_0 - \frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1, \tag{3}$$

$$\alpha^T = (\mu_1 - \mu_0)^T \Sigma^{-1}, \tag{4}$$

where $\pi_1 = \Pr(Y = 1)$ and $\pi_0 = \Pr(Y = 0)$. To derive the coefficients in the combined multiple logistic regression model, we need to know $\pi_1$, estimate $\mu_0$, $\mu_1$, and $\Sigma$, and then apply Proposition 1. In the following subsections, we show that the parameter estimates can be obtained by using coefficients from the estimated incomplete logistic regression models and an estimate of the covariance matrix of all risk factors of interest.

We propose three steps to synthesizing a complete logistic regression model. First, estimate $\mu_0$ and $\mu_1$ (i.e., the means corresponding to the subpopulations defined by the binary outcome variable). Denote the estimated means by $\hat\mu_0$ and $\hat\mu_1$, respectively. Second, estimate the common covariance matrix of the subpopulations, denoted by $\hat\Sigma$. And third, estimate the coefficients of our complete logit model by plugging $\hat\mu_0$, $\hat\mu_1$, and $\hat\Sigma$ into Equations (3) and (4). For simplicity, we restrict our attention to the case where all of the incomplete logit models are univariate. We discuss generalization to the case where incomplete logit models may be multivariate and possibly have predictors in common in the Appendix.

In the following steps, assume that we know $\pi_1$ as well as have a sample of data from the superpopulation including each of the risk factors of interest. Note that the Hu and Root synthesis method also requires knowledge of $\pi_1$ and availability of a risk factor dataset. If we had a complete dataset containing both the disease outcome and measurements on each risk factor of interest, we could simply fit a multivariate logistic regression to obtain the complete logit model. In practice, $\pi_1$, the disease prevalence, might be obtained from census data or an epidemiological study. The risk factor dataset might be obtained from a hospital record database or a metadata derived sample.

3.1. Estimating subpopulation means

First, consider computing $\hat\mu_0$ and $\hat\mu_1$. Supposing each incomplete logit model of the form (1) is univariate, we have $K = p$, and the $i$th incomplete logit model is given by

$$P(Y = 1 \mid X_i = x_i) = \operatorname{expit}(\beta_{i0} + \beta_{i1} x_i). \tag{5}$$

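Using Bayes’ theorem, we can obtain the following result. For $\mu_0$ and $\mu_1$, the $i$th component of these vectors is given by

$$\mu_{0i} = E(X_i \mid Y = 0) = \frac{1}{\pi_0} E\left( \frac{X_i}{1 + e^{\beta_{i0} + \beta_{i1} X_i}} \right), \quad \text{and}$$

$$\mu_{1i} = E(X_i \mid Y = 1) = \frac{1}{\pi_1} E\left( \frac{X_i e^{\beta_{i0} + \beta_{i1} X_i}}{1 + e^{\beta_{i0} + \beta_{i1} X_i}} \right),$$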


for $i = 1, \ldots, p$. Let $X_i^{(j)}$ be the value of the $i$th risk factor for the $j$th subject in the risk factor dataset, that is, a sample comprising both diseased and non-diseased subjects that contains measurements of all $p$ risk predictors of interest but not the outcome. Let $\hat\beta_{i0}$ and $\hat\beta_{i1}$ be the estimates for the coefficients in the univariate logistic regression on the $i$th risk factor.

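Obtain the estimates $\hat\mu_0$ and $\hat\mu_1$ from the following equations, which simply substitute sample moments for population moments and make use of the known values of $\pi_1$ and $\pi_0 = 1 - \pi_1$. With $n$ denoting the number of subjects in the risk factor dataset,

$$\hat\mu_{0i} = \frac{1}{n \pi_0} \sum_{j=1}^{n} \frac{X_i^{(j)}}{1 + e^{\hat\beta_{i0} + \hat\beta_{i1} X_i^{(j)}}}, \quad \text{and} \tag{6}$$

$$\hat\mu_{1i} = \frac{1}{n \pi_1} \sum_{j=1}^{n} \frac{X_i^{(j)} e^{\hat\beta_{i0} + \hat\beta_{i1} X_i^{(j)}}}{1 + e^{\hat\beta_{i0} + \hat\beta_{i1} X_i^{(j)}}}. \tag{7}$$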
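The plug-in step of Equations (6) and (7) can be sketched in Python as follows. This is a minimal sketch under the univariate setting of Section 3.1, assuming the sums are sample averages over the risk factor dataset. The function name `subpopulation_means` and its argument layout are hypothetical, not part of the paper.

```python
import numpy as np

def subpopulation_means(X, beta0, beta1, pi1):
    """Plug-in estimates of E(X_i | Y = 0) and E(X_i | Y = 1), Equations (6) and (7).

    X     : (n, p) risk factor dataset (no outcome column)
    beta0 : (p,) estimated intercepts of the univariate logit models
    beta1 : (p,) estimated slopes of the univariate logit models
    pi1   : known prevalence P(Y = 1)
    """
    X = np.asarray(X, dtype=float)
    # Column i of eta holds the linear predictor of the i-th univariate model.
    eta = np.asarray(beta0, dtype=float) + np.asarray(beta1, dtype=float) * X
    p1 = 0.5 * (1.0 + np.tanh(eta / 2.0))  # expit(eta) = P(Y = 1 | X_i)
    mu1_hat = (X * p1).mean(axis=0) / pi1
    # X / (1 + e^eta) equals X * (1 - expit(eta)).
    mu0_hat = (X * (1.0 - p1)).mean(axis=0) / (1.0 - pi1)
    return mu0_hat, mu1_hat
```

One useful sanity check of any implementation is that $\pi_0 \hat\mu_0 + \pi_1 \hat\mu_1$ reproduces the overall sample mean of each risk factor, because the two weights inside the averages sum to one.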

3.2. Estimating the common covariance matrix

Denote the covariance matrix of the risk factor dataset by $\Sigma_{mix}$. Note that this covariance matrix is different from $\Sigma$ because the risk factor dataset is a mixed sample of diseased and non-diseased subjects in unknown proportions. We can estimate the common covariance matrix, $\Sigma$, using the equations in the following proposition.

Proposition 2
$$\Sigma = \Sigma_{mix} + \mu_1 \mu_1^T (\pi_1^2 - \pi_1) + \mu_1 \mu_0^T \pi_1 \pi_0 + \mu_0 \mu_1^T \pi_1 \pi_0 + \mu_0 \mu_0^T (\pi_0^2 - \pi_0).$$

This result is obtained by decomposing $\Sigma_{mix}$; a proof is given in the Appendix. We then calculate $\hat\Sigma$ by plugging $\hat\Sigma_{mix}$, $\hat\mu_1$, $\hat\mu_0$ and the known values of $\pi_1$, $\pi_0$ into the right side of the aforementioned equation.

Once we have obtained the estimates of $\mu_0$, $\mu_1$, and $\Sigma$, we use Equations (3) and (4) to estimate the coefficients of the complete logit model (2) by plugging $\hat\mu_0$, $\hat\mu_1$, and $\hat\Sigma$ in for $\mu_0$, $\mu_1$, and $\Sigma$, and plugging in the known values of $\pi_1$, $\pi_0$.
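The last two steps, Proposition 2 followed by the plug-in into Proposition 1, can be sketched in Python as below. This is a sketch rather than the authors' implementation. It assumes that model (2) parameterizes $P(Y = 1 \mid \mathbf{X} = \mathbf{x})$, so the slope vector is $\Sigma^{-1}(\mu_1 - \mu_0)$. The function names are hypothetical.

```python
import numpy as np

def common_covariance(sigma_mix, mu0, mu1, pi1):
    """Proposition 2: recover the within-class covariance from the mixed-sample covariance."""
    mu0 = np.asarray(mu0, dtype=float)
    mu1 = np.asarray(mu1, dtype=float)
    pi0 = 1.0 - pi1
    return (np.asarray(sigma_mix, dtype=float)
            + np.outer(mu1, mu1) * (pi1 ** 2 - pi1)
            + np.outer(mu1, mu0) * pi1 * pi0
            + np.outer(mu0, mu1) * pi1 * pi0
            + np.outer(mu0, mu0) * (pi0 ** 2 - pi0))

def complete_logit_coefficients(mu0, mu1, sigma, pi1):
    """Proposition 1: intercept and slopes of P(Y = 1 | X = x) = expit(alpha0 + alpha^T x)."""
    mu0 = np.asarray(mu0, dtype=float)
    mu1 = np.asarray(mu1, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    pi0 = 1.0 - pi1
    # Solve linear systems instead of forming the explicit inverse of sigma.
    sinv_mu0 = np.linalg.solve(sigma, mu0)
    sinv_mu1 = np.linalg.solve(sigma, mu1)
    alpha = sinv_mu1 - sinv_mu0
    alpha0 = np.log(pi1 / pi0) + 0.5 * mu0 @ sinv_mu0 - 0.5 * mu1 @ sinv_mu1
    return alpha0, alpha
```

Because $\pi_1^2 - \pi_1 = \pi_0^2 - \pi_0 = -\pi_0 \pi_1$, the correction in Proposition 2 collapses to $\Sigma = \Sigma_{mix} - \pi_0 \pi_1 (\mu_1 - \mu_0)(\mu_1 - \mu_0)^T$, which gives a convenient cross-check for the four-term form used in the code.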

4. Simulated data


The objective of the simulation study is to test the robustness of the proposed synthesis analysis method under various conditions in terms of the data distribution, sample size, and strength of the covariates’ correlations. We simulated data from normal, lognormal, and uniform distributions. We let the number of covariates be $p = 3$ and used three univariate models to estimate synthesized models.

To assess performance of both the Hu and Root method and our proposed method, we consider predicted probabilities from the synthesized models. We chose to report performance based on predicted probabilities instead of model coefficient estimates because Hu and Root’s method does not estimate the complete model coefficients, but rather only predicted probabilities. We compare predictions based on synthesized models to the ‘true’ predicted values. For the normally distributed data setting, the ‘true’ predictions are obtained from the complete model (2), where the coefficients are computed from Equations (3) and (4) using the simulation parameter values for $\pi_1$, $\mu_0$, $\mu_1$, and $\Sigma$. For the lognormal and uniform data settings, the ‘true’ predictions are actually estimates obtained by fitting a multivariate logistic regression.

In addition to performing simulations under different distributions, we also varied the covariance structure between the three variables, as well as the sample size from which the univariate model coefficients were estimated. For each simulation setting (each distribution and covariance structure), we first simulated a secondary dataset of 100,000 independent observations. Next, we simulated three datasets with 500, 1000, or 10,000 independent observations. We used these last three simulated datasets to fit univariate logistic regression models. To estimate the synthesized models, we used the univariate model coefficient estimates, the estimated covariance matrix based on the secondary dataset, and the fixed value $\pi_1 = 0.6$. We performed 1000 Monte Carlo simulations for each setting. For each synthesis analysis method, we computed the correlation, bias, and MSE of the model predicted probabilities with the ‘true’ probabilities for each subject in the secondary dataset.

Tables I–III show the simulation results for data simulated under normal, lognormal, and uniform distributions. The simulation results are summarized by averaging over the Monte Carlo simulations. For prediction bias, the Monte Carlo standard error is also included. The sample size refers to the sample size used to fit the univariate logit models. Under both normal and non-normally distributed data, the Hu and Root method performs comparably to or better than the new method.
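The three summary measures could be computed along the following lines. The text does not spell out their exact formulas, so this Python sketch assumes the Pearson correlation, the mean difference, and the mean squared difference between synthesized and ‘true’ predicted probabilities. The function name `prediction_metrics` is hypothetical.

```python
import numpy as np

def prediction_metrics(p_hat, p_true):
    """Compare synthesized predicted probabilities with reference ('true') probabilities."""
    p_hat = np.asarray(p_hat, dtype=float)
    p_true = np.asarray(p_true, dtype=float)
    diff = p_hat - p_true
    return {
        "correlation": float(np.corrcoef(p_hat, p_true)[0, 1]),  # Pearson correlation (assumed)
        "bias": float(diff.mean()),                              # mean difference (assumed)
        "mse": float(np.mean(diff ** 2)),                        # mean squared difference (assumed)
    }
```

In a Monte Carlo study, these values would be computed once per replicate on the secondary dataset and then averaged over replicates.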


Table I. Simulation results: normal data.

          Proposed method                              Hu and Root method
N         Correlation   Bias (mean, sd)     MSE        Correlation   Bias (mean, sd)     MSE
Independent covariates
500       0.9750        0.002, 0.01752      0.2262     0.9993        0.0006, 0.00056     0.2242
1000      0.9954        0.0003, 0.0003      0.2260     0.9996        0.0006, 0.00047     0.2241
10,000    0.9995        0.0002, 0.00008     0.2262     0.9998        0.0006, 0.00046     0.2240
Weakly correlated covariates
500       0.9334        0.0065, 0.03483     0.2338     0.9973        0.0017, 0.00035     0.2289
1000      0.9835        0.0016, 0.0168      0.2337     0.9976        0.0017, 0.00032     0.2289
10,000    0.9997        0.0001, 0.00008     0.2338     0.9979        0.0017, 0.00034     0.2287
Strongly correlated covariates
500       0.4517        0.0543, 0.08844     0.2502     0.9982        0.0008, 0.00021     0.2356
1000      0.6019        0.0394, 0.07921     0.2478     0.9984        0.0008, 0.00014     0.2356
10,000    0.9940        0.0006, 0.01085     0.2402     0.9985        0.0008, 0.00012     0.2357

Table II. Simulation results: lognormal data.

          Proposed method                              Hu and Root method
N         Correlation   Bias (mean, sd)     MSE        Correlation   Bias (mean, sd)     MSE
Independent covariates
500       0.9566        0.0067, 0.02578     0.2320     0.9942        0.0001, 0.00097     0.2168
1000      0.9921        0.0039, 0.00364     0.2317     0.9949        0.000004, 0.00068   0.2164
10,000    0.9954        0.0044, 0.00122     0.2322     0.9955        0.00001, 0.00068    0.2161
Weakly correlated covariates
500       0.8657        0.016, 0.04877      0.2391     0.9952        0.0006, 0.00097     0.2220
1000      0.9611        0.0066, 0.02543     0.2377     0.9961        0.0004, 0.00061     0.2219
10,000    0.9952        0.0034, 0.00111     0.2375     0.9968        0.0004, 0.00051     0.2219
Strongly correlated covariates
500       0.4282        0.0593, 0.08945     0.2510     0.9954        0.0014, 0.00109     0.2305
1000      0.6561        0.0364, 0.07453     0.2468     0.9960        0.0012, 0.0005      0.2306
10,000    0.9958        0.0022, 0.00038     0.2400     0.9965        0.0011, 0.00027     0.2309

Table III. Simulation results: uniform data.

          Proposed method                              Hu and Root method
N         Correlation   Bias (mean, sd)     MSE        Correlation   Bias (mean, sd)     MSE
Independent covariates
500       0.8566        0.0218, 0.01428     0.1569     0.9964        0.0021, 0.00022     0.2396
1000      0.8697        0.0235, 0.00883     0.1537     0.9962        0.0021, 0.00019     0.2397
10,000    0.8761        0.023, 0.00255      0.1539     0.9961        0.0021, 0.00016     0.2397
Weakly correlated covariates
500       0.9047        0.0244, 0.01551     0.1739     0.9977        0.0014, 0.00024     0.2398
1000      0.9164        0.025, 0.00688      0.1730     0.9976        0.0014, 0.00017     0.2398
10,000    0.9223        0.0249, 0.00212     0.1724     0.9975        0.0014, 0.0001      0.2398
Strongly correlated covariates
500       0.9058        0.01, 0.0351        0.2051     0.9968        0.0018, 0.00033     0.2398
1000      0.9562        0.0145, 0.01087     0.2050     0.9967        0.0018, 0.00028     0.2398
10,000    0.9650        0.0148, 0.00169     0.2052     0.9965        0.0019, 0.00011     0.2398

Figure 1. Hypertension data example: coefficient estimate comparison. [Figure not reproduced. Three panels (AGE, BMI, TCHOL) plot the coefficient estimate for each of four subgroups (female yes Rx, male yes Rx, female no Rx, male no Rx). Legend: multivariate (95% CI), univariate, synthesized.]

In the normal data setting, the new method attains similar levels of predictive accuracy and precision to the Hu and Root method with increased sample size. In the lognormal data setting, the new method also appears to converge to the predictive accuracy and precision of the Hu and Root method with increased sample size but converges at a slower rate. In the uniform data setting, in all sample sizes, the new method is notably less accurate (the correlations between the model predictions and ‘true’ predictions are much lower) compared to Hu and Root’s method.

5. Example: a real population dataset

We obtained a real population dataset from the 2007 National Health and Nutrition Examination Survey [15]. In this dataset, the occurrence of hypertension (systolic blood pressure > 120 mm Hg or diastolic blood pressure > 80 mm Hg) was the outcome, and sex, age, hypertensive medication status, body mass index, and total cholesterol were regarded as the complete list of predictors. Note that in this dataset, both the outcome and every predictor of interest are included; hence, this is not a practical example but an illustrative one. While we compared the correlations between the multivariate regression predicted values and the predicted values derived from synthesis analysis in the simulated data, here we only compare the multivariate regression coefficients with our synthesis analysis method’s estimated coefficients. No comparison was made to Hu and Root’s method, as their method does not estimate model coefficients.

In this example, sex and hypertension medication are binary variables, and our method assumes the predictors are multivariate normal; therefore, we divided the data into four subgroups prior to implementing our method. Univariate regressions and the complete multivariate regression model were fit on the same datasets as a simple way of obtaining results from the same superpopulation. The univariate, multivariate, and synthesized model coefficient estimates are shown in Figure 1. Results demonstrate that the synthesized model coefficients are typically closer to the multivariate model coefficients than the univariate model coefficients are. This observation suggests that our method provides better estimates of the multivariate model coefficients than naively using univariate model coefficients to represent them.

6. Discussion

Multivariate logistic regression models are commonly used in biomedical research. In predictive models, the outcome of interest is usually disease status. The objective of multivariate logistic regression modeling is twofold: (1) predict the likelihood of disease based on risk factor measurements and (2) describe the association between risk factors and the disease (i.e., the adjusted association between each risk factor and the disease). The synthesis method proposed by Hu and Root only addresses the first objective, while our method addresses both. Even when predicted values are of primary interest, multivariate model coefficients are still useful, as they indicate the relative contribution of each risk factor to the predicted outcome value.

Our proposed synthesis analysis method is statistically and mathematically justified under the assumption that the predictors follow a multivariate normal distribution. In practice, non-normally distributed continuous covariates can be transformed to follow a normal distribution. If a predictor variable happens to be binary, then a stratified analysis can be used as illustrated in Section 5.

The performance of our proposed method, according to the correlation between the true and estimated predicted values, was evaluated and compared to the performance of Hu and Root’s synthesis analysis method using simulated data. Results show Hu and Root’s method performs well regardless of the data distribution, and our proposed synthesis analysis method only performed well when the simulated data came from a normal distribution. The advantage of our synthesis analysis method over Hu and Root’s method is the estimation of multivariate regression coefficients, as illustrated in Section 5. Although the synthesized model coefficients were not necessarily close to the multivariate regression model coefficients, they were more accurate than the univariate coefficients. This example illustrates how synthesized model coefficients are less susceptible to confounding effects compared to univariate regression coefficients.

In practice, it is more common to find incomplete multivariate models instead of univariate models in the literature. In the present study, we chose to restrict our attention to the case where incomplete models were all univariate – both for simplicity and to compare the performance of our method to Hu and Root’s analogous method. Unlike Hu and Root’s method, our method is not restricted to the case where all incomplete models are univariate; therefore, an extension of the present study would include further development and evaluation of our method using multivariate incomplete models.

Synthesis analysis is a relatively new area, although it is related to meta-analysis. Methods in meta-analysis are well established and increasingly being used [16–18]. However, the problem setup where those techniques can be used is somewhat different from our problem setup. In meta-analysis, the primary goal is to combine results from studies with different populations. In synthesis analysis, the goal is to combine results from studies that consider different predictors, and whether the study populations differ is a secondary concern. With additional information, such as complete individual patient data or estimates of within-study predictor covariance, methods in meta-analysis could be used to build a comprehensive prediction model [19–21]. However, our proposed method differs in the type of data required for its implementation. In our problem setup, current meta-analysis methodology can only create a multivariate model when a complete model already exists; however, we are interested in the situation where such a model does not exist.

Despite differences between the aims of meta-analysis and synthesis analysis, techniques in meta-analysis could be used in combination with synthesis analysis. We assume that the studies from which we obtain incomplete model coefficients are from the same superpopulation; however, in practice, this is usually not the case. An extension of our study could utilize methods in random-effects meta-analysis to address between-study population differences.

The motivation for synthesis analysis is to be able to create or update multivariate disease prediction models when a complete dataset is not available or easily obtained. By using the proposed synthesis analysis method, one could take advantage of the wealth of data collected in previous studies. The application of the synthesized model may vary depending on the situation and is not intended to replace the traditional method of collecting a complete dataset and fitting a multivariate logistic regression. In the absence of a complete dataset, synthesized models could help inform decisions to invest in new studies or additional data collection.

Appendix

Proof of Proposition 2

In Section 3.2, the following proposition was introduced and used to estimate the common covariance matrix of two subpopulations. We now give the details of the proof of this proposition.

Proposition 2
$$\Sigma = \Sigma_{mix} + \mu_1 \mu_1^T (\pi_1^2 - \pi_1) + \mu_1 \mu_0^T \pi_1 \pi_0 + \mu_0 \mu_1^T \pi_1 \pi_0 + \mu_0 \mu_0^T (\pi_0^2 - \pi_0).$$

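Proof
First note that

$$\Sigma_{mix} = E(\mathbf{X}\mathbf{X}^T) - E(\mathbf{X})E(\mathbf{X})^T.$$

Also, we know

$$E(\mathbf{X}) = \mu_0 \pi_0 + \mu_1 \pi_1,$$

and

$$\begin{aligned}
E(\mathbf{X}\mathbf{X}^T) &= E(\mathbf{X}\mathbf{X}^T \mid Y = 0)\pi_0 + E(\mathbf{X}\mathbf{X}^T \mid Y = 1)\pi_1 \\
&= \left( \operatorname{Cov}(\mathbf{X} \mid Y = 0) + E(\mathbf{X} \mid Y = 0)E(\mathbf{X} \mid Y = 0)^T \right)\pi_0 \\
&\quad + \left( \operatorname{Cov}(\mathbf{X} \mid Y = 1) + E(\mathbf{X} \mid Y = 1)E(\mathbf{X} \mid Y = 1)^T \right)\pi_1 \\
&= (\Sigma + \mu_0 \mu_0^T)\pi_0 + (\Sigma + \mu_1 \mu_1^T)\pi_1 \\
&= \Sigma + \mu_0 \mu_0^T \pi_0 + \mu_1 \mu_1^T \pi_1.
\end{aligned}$$

Therefore,

$$\begin{aligned}
\Sigma_{mix} &= \Sigma + \mu_0 \mu_0^T \pi_0 + \mu_1 \mu_1^T \pi_1 - (\mu_0 \pi_0 + \mu_1 \pi_1)(\mu_0 \pi_0 + \mu_1 \pi_1)^T \\
&= \Sigma - \mu_1 \mu_1^T (\pi_1^2 - \pi_1) - \mu_1 \mu_0^T \pi_1 \pi_0 - \mu_0 \mu_1^T \pi_1 \pi_0 - \mu_0 \mu_0^T (\pi_0^2 - \pi_0).
\end{aligned}$$

Rearranging the aforementioned expression, we have

$$\Sigma = \Sigma_{mix} + \mu_1 \mu_1^T (\pi_1^2 - \pi_1) + \mu_1 \mu_0^T \pi_1 \pi_0 + \mu_0 \mu_1^T \pi_1 \pi_0 + \mu_0 \mu_0^T (\pi_0^2 - \pi_0). \qquad \square$$

Estimating subpopulation means

In Section 3.1, we considered the case where all of the individual logit models of the form in Equation (1) are univariate. That is, we assume that $K = p$, and the $i$th logit model is given by (5). Under this assumption, we have the following result.

Proposition 3
For $\mu_0$ and $\mu_1$, the $i$th component of these vectors is given by

$$\mu_{0i} = E(X_i \mid Y = 0) = \frac{1}{\pi_0} E\left( \frac{X_i}{1 + e^{\beta_{i0} + \beta_{i1} X_i}} \right), \quad \text{and}$$

$$\mu_{1i} = E(X_i \mid Y = 1) = \frac{1}{\pi_1} E\left( \frac{X_i e^{\beta_{i0} + \beta_{i1} X_i}}{1 + e^{\beta_{i0} + \beta_{i1} X_i}} \right),$$

for $i = 1, \ldots, p$.

Proof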

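$$\begin{aligned}
\mu_{1i} &= E(X_i \mid Y = 1) \\
&= \int x_i f(x_i \mid Y = 1)\, dx_i \\
&= \int x_i \frac{P(Y = 1 \mid x_i) f(x_i)}{P(Y = 1)}\, dx_i \quad \text{by Bayes' rule} \\
&= \frac{1}{\pi_1} \int x_i \frac{e^{\beta_{i0} + \beta_{i1} x_i}}{1 + e^{\beta_{i0} + \beta_{i1} x_i}} f(x_i)\, dx_i \quad \text{from Equation (5)} \\
&= \frac{1}{\pi_1} E\left( \frac{X_i e^{\beta_{i0} + \beta_{i1} X_i}}{1 + e^{\beta_{i0} + \beta_{i1} X_i}} \right).
\end{aligned}$$

Similarly,

$$\mu_{0i} = E(X_i \mid Y = 0) = \frac{1}{\pi_0} E\left( \frac{X_i}{1 + e^{\beta_{i0} + \beta_{i1} X_i}} \right).$$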

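To estimate $\mu_0$ and $\mu_1$, we use Proposition 3 by substituting sample moments for population moments and use the known values of $\pi_1$ and $\pi_0$. $\square$

Now consider the case where at least one of the incomplete logit models is multivariate and the $p$ independent variables may be contained in more than one incomplete logit model. Note that $1 \le K \le p$, and denote the $\nu$th logit model by Equation (1), with intercept $\beta_{\nu 0}$, coefficient vector $\beta_\nu$, and covariate vector $\mathbf{X}_\nu$. Let $I_i$ be the index set of logit models that have $X_i$ as an independent variable. Let $\omega_\nu$ be weights such that $\sum_{\nu \in I_i} \omega_\nu = 1$. Then the analog to Proposition 3 is given by the following corollary.

Corollary
For $\mu_0$ and $\mu_1$, the $i$th component of these vectors is given by

$$\mu_{0i} = \frac{1}{\pi_0} \sum_{\nu \in I_i} \omega_\nu E\left( \frac{X_i}{1 + e^{\beta_{\nu 0} + \beta_\nu^T \mathbf{X}_\nu}} \right), \quad \text{and}$$

$$\mu_{1i} = \frac{1}{\pi_1} \sum_{\nu \in I_i} \omega_\nu E\left( \frac{X_i e^{\beta_{\nu 0} + \beta_\nu^T \mathbf{X}_\nu}}{1 + e^{\beta_{\nu 0} + \beta_\nu^T \mathbf{X}_\nu}} \right),$$

for $i = 1, \ldots, p$. Using the corollary, we obtain estimates of $\mu_0$ and $\mu_1$ by substituting sample moments for population moments and using the known values of $\pi_1$ and $\pi_0$.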
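The corollary can be sketched in Python as follows. This is only an illustration under stated assumptions: each incomplete model is described by a hypothetical tuple `(cols, beta0, beta)` giving the columns of the risk factor dataset it uses together with its intercept and slopes, and per-model weights are renormalized for each covariate so that they sum to one over the models containing it (equal weights by default). None of these names or choices come from the paper.

```python
import numpy as np

def weighted_subpopulation_means(X, models, pi1, weights=None):
    """Sample analog of the corollary for possibly multivariate, overlapping incomplete models.

    X       : (n, p) risk factor dataset (no outcome column)
    models  : list of (cols, beta0, beta); cols are the column indices used by that model
    pi1     : known prevalence P(Y = 1)
    weights : optional per-model weights; renormalized per covariate (an assumption)
    """
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    pi0 = 1.0 - pi1
    num0, num1, wsum = np.zeros(p), np.zeros(p), np.zeros(p)
    for v, (cols, beta0, beta) in enumerate(models):
        cols = list(cols)
        eta = beta0 + X[:, cols] @ np.asarray(beta, dtype=float)
        p1 = 0.5 * (1.0 + np.tanh(eta / 2.0))  # expit(eta) = P(Y = 1 | X_v)
        w = 1.0 if weights is None else float(weights[v])
        num1[cols] += w * (X[:, cols] * p1[:, None]).mean(axis=0)
        num0[cols] += w * (X[:, cols] * (1.0 - p1)[:, None]).mean(axis=0)
        wsum[cols] += w
    if np.any(wsum == 0):
        raise ValueError("every covariate must appear in at least one incomplete model")
    # Dividing by wsum makes the weights sum to one over the models containing X_i.
    return num0 / (wsum * pi0), num1 / (wsum * pi1)
```

With all models univariate and each covariate appearing in exactly one model, this reduces to the plug-in estimates of Equations (6) and (7). How the weights should be chosen in practice is left open here.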

References

1. Lee YH, Bang H, Kim HC, Kim HM, Park SW, Kim DJ. A simple screening score for diabetes for the Korean population. Diabetes Care 2012; 35(8):1723–1730. DOI: 10.2337/dc11-2347.
2. Chen BD, Xu WY, Yu C, Ni ZY, Li XF, Cui DW. A logistic regression model for microalbuminuria prediction in overweight male population. Nature Precedings, 2010.
3. Ghanei M, Aslani J, AzizAbadi-Farahani M, Assari S, Saadat SH. Logistic regression model to predict chronic obstructive pulmonary disease exacerbation. Archives of Medical Science 2007; 3(4):360–366.
4. Sun QF, Ding JG, Xu DZ, Chen YP, Hong L, Ye ZY, Zheng MH, Fu RQ, Wu JG, Du QW, Chen W, Wang XF, Sheng JF. Prediction of the prognosis of patients with acute-on-chronic Hepatitis B liver failure using the model for end-stage liver disease scoring system and a novel logistic regression model. Journal of Viral Hepatitis 2009; 16(7):464–470. DOI: 10.1111/j.1365-2893.2008.01046.x.
5. Chae JS, Kang RW, Kwak JH, Paik JK, Kim OY, Kim MJ, Park JW, Jeon JY, Lee JH. Supervised exercise program, BMI, and risk of type 2 diabetes in subjects with normal or impaired fasting glucose. Diabetes Care 2012; 35(8):1680–1685. DOI: 10.2337/dc11-2074.
6. Chamnan P, Simmons RK, Forouhi NG, Luben RN, Khaw KT, Wareham NJ, Griffin SJ. Incidence of type 2 diabetes using proposed HbA1c diagnostic criteria in the European prospective investigation of cancer-Norfolk cohort. Diabetes Care 2011; 34(4):950–956. DOI: 10.2337/dc09-2326.
7. Cheng PY, Beugaard B, Foulis P, Conlin PR. Hemoglobin A1c as a predictor of incident diabetes. Diabetes Care 2011; 34(3):610–615. DOI: 10.2337/dc10-0625.
8. Sato KK, Hayashi T, Harita N, Yoneda T, Nakamura Y, Endo G, Kambe H. Combined measurement of fasting plasma glucose and A1c is effective for the prediction of type 2 diabetes. Diabetes Care 2009; 32(4):644–646. DOI: 10.2337/dc08-1631.
9. Wang WY, Lee ET, Howard BV, Fabsitz RR, Devereux RB, Welty TK. Fasting plasma glucose and hemoglobin A1c in identifying and predicting diabetes. Diabetes Care 2011; 34(2):363–368. DOI: 10.2337/dc10-1680.
10. Droumaguet C, Balkau B, Simon D, Caces E, Tichet J, Charles MA, Eschwege E. Use of HbA1c in predicting progression to diabetes in French men and women. Diabetes Care 2006; 29(7):1619–1625.
11. Samsa G, Hu G, Root MM. Combining information from multiple data sources to create multivariable risk models: illustration and preliminary assessment of a new method. Journal of Biomedicine and Biotechnology 2005; 2:113–123.
12. Zhou XH, Hu N, Hu G, Root MM. Synthesis analysis of regression models with a continuous outcome. Statistics in Medicine 2009; 28(11):1620–1635. DOI: 10.1002/sim.3563.
13. Hu G, Root M. Building prediction models for coronary heart disease by synthesizing multiple longitudinal research findings. European Journal of Cardiovascular Prevention and Rehabilitation 2005; 12:459–464. DOI: 10.1097/01.hjr.0000173109.14228.71.
14. Efron B. The efficiency of logistic regression compared to normal discriminant analysis. Journal of the American Statistical Association 1975; 70:892–898.
15. Centers for Disease Control and Prevention (CDC), National Center for Health Statistics (NCHS). National Health and Nutrition Examination Survey Data. Department of Health and Human Services, Centers for Disease Control and Prevention. Hyattsville, MD, 2000.
16. Hedges LV, Olkin I. Statistical Methods for Meta-analysis. Academic Press: San Diego, CA, 1985.
17. Lipsey M, Wilson D. Practical Meta-analysis. Sage: Thousand Oaks, CA, 2001.
18. Borenstein M, Hedges LV, Higgins JPT, Rothstein HR. Introduction to Meta-analysis. Wiley: Chichester, 2009.
19. White IR, Higgins JPT. Meta-analysis with missing data. The Stata Journal 2009; 9(1):57–69.
20. Jackson D, Riley R, White IR. Multivariate meta-analysis: potential and promise. Statistics in Medicine 2011; 30(20):2481–2498. DOI: 10.1002/sim.4172.
21. The Fibrinogen Studies Collaboration. Systematically missing confounders in individual participant data meta-analysis of observational cohort studies. Statistics in Medicine 2009; 28(8):1218–1237. DOI: 10.1002/sim.3540.
