Predicting Number of Hospitalization Days Based on Health Insurance Claims Data using Bagged Regression Trees Yang Xie, Student Member, IEEE, G¨unter Schreier, Senior Member, IEEE, David C.W. Chang, Sandra Neubauer, Stephen J. Redmond, Senior Member, IEEE, Nigel H. Lovell, Fellow, IEEE Abstract— Healthcare administrators worldwide are striving to both lower the cost of care whilst improving the quality of care given. Therefore, better clinical and administrative decision making is needed to improve these issues. Anticipating outcomes such as number of hospitalization days could contribute to addressing this problem. In this paper, a method was developed, using large-scale health insurance claims data, to predict the number of hospitalization days in a population. We utilized a regression decision tree algorithm, along with insurance claim data from 300,000 individuals over three years, to provide predictions of number of days in hospital in the third year, based on medical admissions and claims data from the first two years. Our method performs well in the general population. For the population aged 65 years and over, the predictive model significantly improves predictions over a baseline method (predicting a constant number of days for each patient), and achieved a specificity of 70.20% and sensitivity of 75.69% in classifying these subjects into two categories of ‘no hospitalization’ and ‘at least one day in hospital’.

I. I NTRODUCTION Administrators of healthcare systems worldwide are striving to both lower the cost of care whilst improving the quality of care given. To simultaneously improve healthcare quality while saving on unnecessary medical expenditure, better clinical and administrative decision making is needed. It has been shown that novel and deep insights gained by analyzing clinical data can make a significant contribution to the process of diagnosis, choice of treatment, and prognostic predictions [1]. Since the more recent rise of Big Data, the availability of large heterogeneous medical datasets has prompted the use of data mining techniques to discover important information contained within these complex data structures. Harper [2] and Yoo et al. [1] have written thorough reviews of popular data mining algorithms used in the context of healthcare and biomedicine. One of the most important applications, from both clinical and administrative viewpoints, is to predict individual patient outcomes. Anticipating outcomes such as length of stay (LoS) in hospital, hospital readmission, health costs, and onset of a particular disease, are among the most coveted of prediction capabilities [3], [4]. This research was supported under the Australian Research Council’s Linkage Projects funding scheme (project number LP0883728) and HCF Foundation. Y. Xie, D.C.W. Chang, S.J. Redmond and N.H. Lovell are with the Graduate School of Biomedical Engineering, UNSW Australia, Sydney, New South Wales, 2052. [email protected] G. Schreier and S. Neubauer are with AIT Austrian Institute of Technology GmbH, 8020 Graz, Austria. [email protected]

978-1-4244-7929-0/14/$26.00 ©2014 IEEE

Limited previous research has demonstrated promising results in predicting LoS days for patients in special disease groups using physiological measurement data. Silberbach et al. [5] predict LoS for congenital heart disease surgery; Suter-Widmer et al. [6] investigated the possibility of predicting hospital stays in patients with community-acquired pneumonia. On the other hand, Harper [2] applied classification algorithms to more general datasets (i.e., an intensive care unit (ICU) dataset and a hospital management system dataset). Multiple targets (i.e., LoS in ICU, LoS in hospital) were predicted based on these general datasets. However, sample sizes were small (with 582 records) or medium (with 17,974 records), which could be a limit in accurately characterizing the population. The most comparable work was conducted by Bertsimas et al. [7]. An excellent predictive model was built on a big dataset of 800,000 individuals to predict the possibility of a customer falling into five different health cost groups. They point out that medical information contributes more to predictions of high-cost members than lost-cost members. The aim of this paper is to build a model that predicts the number of hospitalization days for a general population, using large-scale health insurance claims data. Through this analysis, it is intended to capture the uncertainty and variability among a patient population by predicting individual patient outcomes. In addition, since customers of 65 years and older are more frequent users of medical resources, we are particularly interested in the performance of such a predictive model on this group. II. M ETHODS A. Data Set and Summary Statistics The dataset was obtained from 261,558 de-identified customers of the Hospitals Contribution Fund of Australia (HCF), one of Australia’s largest combined registered private health fund and life insurance organizations. Three consecutive years of data, from 2010 to 2012, were provided for analysis. These data contain tables specifying the patient demographics, hospital admission, and medical services provided: Customer demographics table includes information related to customers themselves, such as gender, age, the type of HCF product they are subscribed to, the date they joined the fund, and several other customer-level information items. In nearly every case a customer will have one entry in this table. Duplicates are aggregated to form a single unique customer entity.

2706

6000

Average hospitalisation days (days)

4.5

5000

Counts #

4000 3000 2000 1000

0

10

Fig. 1. (a)

20

30

40

50 60 Age (years)

70

80

90

100

Customer count segregated by age.

3

2 1.5 1 0.5 20

40

60 Age (years)

80

100

120

Fig. 3. Average hospitalization days per person segregated by age for each the three years of HCF data.

(b)

2.5

1.5 Year 2010 Year 2011 Year 2012

2

Counts # ( 105 )

Year 2010 Year 2011 Year 2012

2.5

0 0

110

Admission #i1

1 1.5

1 0.5

Counts # ( 104 )

0

4 3.5

Admission #i2

Medical service item #ij1

...

...

Admission #ij

Medical service item #ijk

Customer #i

0.5

0

Predicting number of hospitalization days based on health insurance claims data using bagged regression trees.

Healthcare administrators worldwide are striving to both lower the cost of care whilst improving the quality of care given. Therefore, better clinical...
925KB Sizes 3 Downloads 19 Views