Risk prediction model for in-hospital mortality in women with ST-elevation myocardial infarction: A machine learning approach.

Heart & Lung xxx (2017) 1–7


Hend Mansoor, PharmD, MS a,*, Islam Y. Elgendy, MD b, Richard Segal, PhD c, Anthony A. Bavry, MD, MPH b, Jiang Bian, PhD d

a Department of Health Services Research, University of Florida, College of Public Health, Gainesville, FL, USA
b Division of Cardiovascular Medicine, Department of Medicine, Gainesville, FL, USA
c Department of Pharmaceutical Outcomes and Policy, College of Pharmacy, Gainesville, FL, USA
d Department of Health Outcomes and Policy, College of Medicine, University of Florida, Gainesville, FL, USA

Article info

Article history: Received 20 April 2017; received in revised form 5 September 2017; accepted 9 September 2017; available online xxx

Keywords: Mortality; Myocardial infarction; Women; Risk model; Machine learning

Abstract

Background: Studies have shown that mortality due to ST-elevation myocardial infarction (STEMI) is higher in women than in men. The purpose of this study was to develop and validate prediction models for all-cause in-hospital mortality in women admitted with STEMI using logistic regression and random forest, and to compare the performance and validity of the different models.

Methods: Data from the National Inpatient Sample (NIS) for 2011–2013 were used to identify women admitted with STEMI. The main outcome was all-cause in-hospital mortality. Patients were divided into development and validation cohorts. Trained models were internally validated using 20% of the 2012 data and externally validated using the 2011 and 2013 NIS data.

Results: Three main models were developed and compared: a multivariate logistic regression model, a full random forest model, and a reduced random forest model. In the multivariate logistic regression, 11 variables were included in the final model based on backward elimination. The full random forest model contained 32 variables, and the reduced model contained 17 variables selected based on individual variable importance. In the internal validation cohort, the C-index was 0.84, 0.81, and 0.80 for the multivariate logistic regression, full random forest, and reduced random forest models, respectively. The models showed good stability in the external validation cohorts, with C-indexes for the logistic regression, full random forest, and reduced random forest models of 0.84, 0.85, and 0.81 for 2011, and 0.82, 0.81, and 0.81 for 2013, respectively.

Conclusions: Random forest was comparable to logistic regression in predicting in-hospital mortality in women with STEMI, and can be a useful and accurate tool in clinical practice.

© 2017 Elsevier Inc. All rights reserved.

* Corresponding author. Department of Health Services Research, University of Florida, College of Public Health, Gainesville, FL 32610, USA. E-mail address: hmansoor@ufl.edu (H. Mansoor).
0147-9563/$ – see front matter © 2017 Elsevier Inc. All rights reserved. https://doi.org/10.1016/j.hrtlng.2017.09.003

Introduction

Cardiovascular diseases are the leading cause of death in both women and men in the United States, accounting for about one third of all deaths.1 Studies have demonstrated that mortality from ST-elevation myocardial infarction (STEMI) is higher in women than in men.2–4 Some studies have suggested that women experience increased mortality despite undergoing timely primary percutaneous coronary intervention (PCI).5 Scoring systems are often used to predict short- and long-term mortality in acute coronary syndromes. Examples of currently available scoring systems for predicting 30-day mortality include the Thrombolysis in Myocardial Infarction (TIMI) and Global Registry of Acute Coronary Events (GRACE) risk scores.6,7 However, none of the available scores has explicitly focused on in-hospital mortality in women presenting with STEMI in the era of PCI.

In current practice, standard statistical methods for prediction have relied on parametric regression. Yet advances in computer science, especially in machine learning, allow the development of complex prediction models that can be converted to risk scores. Machine learning techniques such as random forest, neural network, and support vector machine have been introduced in epidemiologic studies for prediction.8–10 These techniques can capture interaction, nonlinear, and higher-order effects, and can estimate complex functions that are not well represented by individual covariate or interaction terms. They have been used to predict mortality in elderly patients with spontaneous intracerebral hemorrhage and in patients with heart failure.8–10

The objectives of this study were to: 1) develop and validate prediction models for all-cause in-hospital mortality in women admitted with STEMI using logistic regression and random forest with different variable selection methods; and 2) compare the performance and validity measures (i.e., precision or positive predictive value [PPV], accuracy, and area under the curve [AUC]) of the different prediction models.

Methods

Data source

Data from the National (Nationwide) Inpatient Sample (NIS) for 2011–2013 were used for this study. The NIS is the largest publicly available all-payer inpatient health care database in the United States, representing 95% of the US population. The NIS is part of the Healthcare Cost and Utilization Project, sponsored by the Agency for Healthcare Research and Quality (AHRQ). In 2012, the NIS was redesigned to include a random sample of patient discharges rather than all discharges from a random sample of hospitals. Each individual hospitalization is de-identified and maintained as a unique entry with one primary discharge diagnosis and 25 secondary diagnoses during the hospitalization. The NIS provides national estimates of hospital inpatient stays and contains data from more than 7 million hospital stays each year. It includes patients covered by Medicare, Medicaid, and private insurance, as well as uninsured individuals.11 The NIS contains clinical and resource-use information including primary and secondary diagnoses and procedures, patient demographic characteristics, hospital characteristics, expected payment source, discharge status, length of stay, and severity and comorbidity measures.11

Study population

The study population comprised women aged 18 years and older admitted to the hospital with a primary diagnosis of STEMI (ICD-9-CM diagnosis code 410.xx). The main outcome assessed in this study was in-hospital mortality.

Development of predictive models

Using the 2012 NIS data, women with a primary diagnosis of STEMI were randomly divided into two portions: 80% for model development and 20% for validation.
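A random 80/20 partition of this kind can be sketched in a few lines. The example below is only an illustration using the Python standard library; the function name, the seed, and the integer record identifiers are assumptions introduced here, not details reported by the authors.

```python
import random


def split_development_validation(records, development_fraction=0.8, seed=0):
    """Randomly partition records into development and validation lists."""
    shuffled = list(records)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility
    cut = int(len(shuffled) * development_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical record identifiers standing in for the 2012 hospitalizations.
record_ids = range(12047)
development, validation = split_development_validation(record_ids)
print(len(development), len(validation))
```

With 12,047 records, truncating the cut point to an integer gives cohorts of 9637 and 2410, which matches the cohort sizes in Table 1.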
Two main approaches were used to build the predictive models: logistic regression and random forest. Variables predicting in-hospital mortality were identified from previous literature and included age, race, comorbidities, procedures performed during the hospitalization, obesity, patient's zip code, expected primary payer, length of stay, and hospital characteristics. Accuracy, precision (PPV), and the C-index (the area under the receiver operating characteristic [ROC] curve, AUC) were calculated to evaluate the different predictive models.

Logistic regression

A logistic regression model predicts the probability of an event by modeling its log-odds as a linear function of a set of predictor variables. The association between each variable and in-hospital mortality was first tested using univariate analysis. Variables with a significant association (P < 0.1) with in-hospital mortality in the univariate analyses were entered into a multivariate model. The final model retained variables with P < 0.05 after stepwise backward elimination, and the respective odds ratios (ORs) and 95% confidence intervals (CIs) were reported for the final multivariate model (model 1).12–17 The Hosmer-Lemeshow goodness-of-fit test was used to assess model fit.
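For reference, the standard form of the logistic model and the way odds ratios follow from it can be written out. The symbols below (p for the probability of in-hospital death, x_j for the predictors, beta_j for the coefficients) are notation introduced here, and the Wald-type interval is shown only as one common choice; the source does not state how its CIs were computed.

```latex
\[
\log\frac{p}{1-p} = \beta_0 + \sum_{j=1}^{k} \beta_j x_j,
\qquad
p = \frac{1}{1 + \exp\!\bigl(-(\beta_0 + \sum_{j=1}^{k} \beta_j x_j)\bigr)}
\]
\[
\mathrm{OR}_j = e^{\beta_j},
\qquad
95\%\ \mathrm{CI} \approx \exp\!\bigl(\hat\beta_j \pm 1.96\,\mathrm{SE}(\hat\beta_j)\bigr)
\]
```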

Random forest

Random forest is a supervised learning classification algorithm that determines a consensus prediction for each observation by averaging the results of many individual recursive partitioning tree models. Each tree is fitted to a randomly selected subset of the observations and uses a random subset of the available predictors at each node as candidates for splitting. Cross-validation identified 120 as the optimal number of trees (estimators), which gave the best accuracy in our study. Thirty-two variables (features) were used to build the decision trees; this is referred to as the full random forest model (model 2).8,9 It is, however, difficult for physicians to consider all 32 variables in practical clinical settings. To simplify the full random forest model for clinical use, a reduced model (model 3) was derived from the full model, which included the top 17 variables ranked by individual variable importance. The cut-off point (i.e., 17 variables) was chosen to optimize predictive performance with the lowest number of variables.

Validation of predictive models

Validation of the trained predictive models involved two steps: internal and external validation. Internal validation was performed on the remaining 20% random sample from the 2012 data using stratified 3-fold cross-validation for logistic regression models and stratified 10-fold cross-validation for random forest models. Stratified k-fold sampling returns folds that contain approximately the same percentage of samples of each target class as the complete set. External validation was conducted on the 2011 and 2013 NIS data separately. In other words, the models trained on the 2012 NIS data were externally validated on the 2011 and 2013 NIS data, and the respective model performance metrics (i.e., accuracy, PPV, and C-index) were calculated and compared.
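The three performance metrics named above can be illustrated with a small, dependency-free sketch. This is not the study's code (the authors report using scikit-learn). The function names, the 0.5 classification threshold, and the toy labels and scores are all assumptions made for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of observations whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def positive_predictive_value(y_true, y_pred):
    """Precision: among predicted deaths (1), the fraction that truly died."""
    outcomes_of_predicted_positives = [t for t, p in zip(y_true, y_pred) if p == 1]
    if not outcomes_of_predicted_positives:
        return float("nan")  # undefined when nothing is predicted positive
    return sum(outcomes_of_predicted_positives) / len(outcomes_of_predicted_positives)


def c_index(y_true, y_score):
    """Probability that a random death has a higher score than a random
    survivor, counting ties as one half (equals the ROC AUC for binary outcomes)."""
    died = [s for t, s in zip(y_true, y_score) if t == 1]
    alive = [s for t, s in zip(y_true, y_score) if t == 0]
    concordant = 0.0
    for s_died in died:
        for s_alive in alive:
            if s_died > s_alive:
                concordant += 1.0
            elif s_died == s_alive:
                concordant += 0.5
    return concordant / (len(died) * len(alive))


# Toy data: 1 = died in hospital, 0 = discharged alive (made up for illustration).
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]  # assumed 0.5 threshold

print(accuracy(y_true, y_pred))
print(positive_predictive_value(y_true, y_pred))
print(c_index(y_true, y_score))
```

Accuracy and PPV depend on the chosen threshold, whereas the C-index uses only the ranking of the predicted scores. This is one reason the C-index is convenient for comparing models across cohorts.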
Statistical analyses

The multivariate analysis was based on the development dataset for 2012. The thresholds used to categorize age and number of chronic conditions were determined graphically, based on their distributions in the population. The two-sample t-test was used to compare means between the development and validation samples, and the chi-square test was used for categorical variables. All hospitalizations included in the analysis were weighted using the appropriate discharge weights provided by the NIS. The Brier score and calibration curves were used to assess the reliability of the prediction models. The Brier score quantifies how close predictions are to the actual outcome and is based on a quadratic scoring rule, in which the squared differences between actual binary outcomes Y and predictions p are calculated: $(Y - p)^2$. A Brier score for a model can range from 0 for a perfect model to 0.25 for a non-informative model with a 50% incidence of the outcome.18,19 The predictive ability of the models was assessed on the validation data for 2012 and the external validation data for 2011 and 2013. Analyses were performed using Python 3.4.3 (Continuum Analytics, Inc.) with scikit-learn 0.17, and SAS® 9.4 (Cary, NC).

Results

Development and internal validation cohort

A total of 9637 women were included in the development cohort for 2012. The baseline characteristics of the subjects included in this study are summarized in Table 1.
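As a concrete illustration of the Brier score defined under Statistical analyses, the sketch below averages the squared differences between outcomes and predicted probabilities. It is a minimal standard-library example on made-up values; the function name is an assumption, and it is not the authors' implementation.

```python
def brier_score(y_true, p_pred):
    """Mean squared difference between binary outcomes and predicted probabilities."""
    return sum((y - p) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


# Made-up outcomes with a 50% incidence of the event.
outcomes = [0, 1, 0, 1]

perfect = brier_score(outcomes, [0.0, 1.0, 0.0, 1.0])         # predictions equal outcomes
non_informative = brier_score(outcomes, [0.5, 0.5, 0.5, 0.5])  # always predicts 50%
print(perfect, non_informative)
```

The two cases reproduce the bounds quoted in the text: 0 for a perfect model, and 0.25 for a model that always predicts 0.5 when the incidence is 50%.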


Table 1 Baseline characteristics of the included subjects.

Characteristic                                        Development    Validation     Total          Alive          Died
                                                      cohort         cohort
                                                      (n = 9637)     (n = 2410)     (n = 12,047)   (n = 10,712)   (n = 1334)
Age, years a                                          68.3 ± 14.5    69.3 ± 14.4    68.5 ± 14.4    67.5 ± 14.3    76.0 ± 13.0
Age < 70 years                                        5003 (51.9)    1191 (49.4)    6194 (51.4)    5825 (54.4)    369 (27.7)
Age ≥ 70 years                                        4634 (48.1)    1219 (50.6)    5853 (48.6)    4888 (45.6)    965 (72.3)
Number of procedures on this record a                 6.0 ± 3.6      5.9 ± 3.7      6.0 ± 3.6      6.0 ± 3.4      5.3 ± 5.0
Number of days from admission to first procedure a    0.45 ± 1.6     0.55 ± 2.0     0.54 ± 1.9     0.53 ± 1.8     0.62 ± 2.1

P-values (development vs. validation cohort) present in this copy: 0.83, 0.03; their row assignment is not recoverable. No values are present for the P-value column comparing patients who were alive with those who died.

Rows listed without values in this copy:
Length of stay a; Number of chronic conditions a; Number of chronic conditions < 7; Number of chronic conditions ≥ 7
Race: White; Black; Hispanic; Asian or Pacific Islander; Native American; Other
Median household income quartiles: $1–$38,999; $39,000–$47,999; $48,000–$62,999; $63,000 or more
Hospital location: New England; Middle Atlantic; East North Central; West North Central; South Atlantic; East South Central; West South Central; Mountain; Pacific
Weekend admission
Primary expected payer: Medicare; Medicaid; Private including Health Maintenance Organization; Self-pay; No charge; Other
Comorbidities: Alcohol abuse; Deficiency anemias; Rheumatoid arthritis/collagen vascular diseases; Congestive heart failure; Chronic pulmonary disease; Coagulopathy; Depression; Diabetes; Diabetes with chronic complications; Drug abuse; Hypertension; Liver disease; Hypothyroidism; Obesity; Peripheral vascular disorders; Pulmonary circulation disorders; Renal failure; Valvular heart disease; Family history of CAD; History of PCI; Angiography during the hospitalization; PCI during the hospitalization; Dyslipidemia; CAD; Smoking; Cardiogenic shock
