Journal of Biomedical Informatics 53 (2015) 147–155


Towards actionable risk stratification: A bilinear approach

Xiang Wang*, Fei Wang, Jianying Hu, Robert Sorrentino
IBM T.J. Watson Research Center, Yorktown Heights, NY, USA

Article history: Received 29 April 2014; Accepted 7 October 2014; Available online 14 October 2014.

Keywords: Risk stratification; Bilinear model; Logistic regression; Matrix factorization; Dimensionality reduction

Abstract

Risk stratification is instrumental to modern clinical decision support systems. Comprehensive risk stratification should provide clinicians with not only an accurate assessment of a patient's risk but also the clinical context to be acted upon. However, existing risk stratification techniques mainly focus on predicting the risk score for individual patients; at the cohort level, they offer little insight beyond a flat score-based segmentation. This essentially reduces a patient to a score and thus removes him/her from his/her clinical context. To address this limitation, in this paper we propose a bilinear model for risk stratification that simultaneously captures the three key aspects of risk stratification: (1) it predicts the risk of each individual patient; (2) it stratifies the patient cohort based on not only the risk score but also the clinical characteristics; and (3) it embeds all patients into clinical contexts with clear interpretation. We apply our model to a cohort of 4977 patients, 1127 of whom were diagnosed with Congestive Heart Failure (CHF). We demonstrate that our model can not only accurately predict the onset risk of CHF but also provide rich and actionable clinical insights into the patient cohort.

© 2014 Elsevier Inc. All rights reserved.

1. Introduction

Risk stratification is indispensable to modern clinical decision support systems. By providing clinicians (or other healthcare practitioners) with an assessment of an individual's risk of an adverse outcome, risk stratification plays a central role in personalized medicine, care plan management, and cost estimation [1]. While the application area of risk stratification is broad, generally speaking, a comprehensive risk stratification method has three fundamental goals:

1. Risk Score Prediction: Given the clinical features, predict an individual's risk of a certain adverse outcome, such as disease onset, hospitalization, or mortality.
2. Patient Cohort Stratification: Segment a patient cohort into coherent groups based on the patients' risk as well as their clinical characteristics.
3. Clinical Context Discovery: Identify the clinical contexts that underpin the patients' risk assessment.

The vast majority of existing risk stratification techniques are based on multivariate regression analysis [2–6], especially linear regression and logistic regression. Given a set of training patients

and their clinical features, the regression model is fit to the training data such that the contribution of each individual clinical feature (also called a risk factor) to the overall risk can be estimated (the regression coefficient). The trained model is then applied to a group of test patients to compute their overall risk scores. Based on their risk scores, the patient cohort can be stratified into several tiers, e.g. high, medium, and low risk. Given sufficient amounts of training data [7], existing regression models can accurately predict the risk scores for individual patients (Goal 1 above). However, they offer limited insights when it comes to patient cohort stratification (Goal 2) and clinical context discovery (Goal 3). Imagine we have a cohort of patients who are at risk of Congestive Heart Failure (CHF). As illustrated in Fig. 1,¹ a traditional regression model will be able to identify two high-risk individuals (red dots). Based on the fact that they have similar risk scores, these two patients will be stratified into the same group regardless of their clinical conditions. In fact, these two individuals may have high risks of CHF for very different reasons, e.g. one with Chronic Obstructive Pulmonary Disease (COPD) and the other with Chronic Kidney Disease (CKD). These clinical contexts are crucial when the clinicians wish to act upon the risk stratification results, e.g. to devise a personalized treatment plan, yet they are not adequately addressed by existing regression models.

* Corresponding author. E-mail addresses: [email protected] (X. Wang), [email protected] (F. Wang), [email protected] (J. Hu), [email protected] (R. Sorrentino).
http://dx.doi.org/10.1016/j.jbi.2014.10.004
1532-0464/© 2014 Elsevier Inc. All rights reserved.

1 For interpretation of color in Fig. 1, the reader is referred to the web version of this article.


In this work we address the limitations of existing risk stratification techniques by proposing a novel bilinear risk stratification model. Our model aims to perform a comprehensive risk analysis of a given patient cohort, from which clinicians and healthcare practitioners can derive more actionable insights. It is designed to achieve all three of the goals stated above in a principled and integrated fashion. Specifically, our model learns an embedding of the patients into a low-dimensional risk space such that: (1) the distance from a patient to the origin becomes a measure of his/her risk (risk prediction); (2) patients with similar risk scores and clinical characteristics are close together in the space (cohort stratification); and (3) each dimension of the space provides an interpretation of the clinical context (context discovery). Therefore our model gives clinicians a full picture of not only the individual patients' risk but also the distinct phenotypes associated with that risk; not only the contribution of individual clinical features but also the clinical contexts they collectively define. We used a CHF patient cohort extracted from a real Electronic Health Record (EHR) database to test our model. The cohort consists of 1127 case patients with confirmed CHF and 3850 control patients. We extracted both diagnosis codes and medications as clinical features, applied our model to risk stratify this cohort, and predicted the risk of future CHF onset. We compared the results of our model to those of logistic regression and demonstrated that our model not only achieved better prediction accuracy but also provided rich and actionable clinical insights that were missing in traditional methods.

In this paper we propose a novel bilinear model that integrates risk prediction and patient cohort analysis. Our model can be interpreted both from a regression point of view and from an embedding point of view. From the regression perspective, our work is related to Bilinear Logistic Regression [13,14], which was recently proposed and applied to brain imaging analysis. The key difference is that in Bilinear Logistic Regression each data instance is naturally represented by a matrix, and the bilinear regressor actually consists of two vectors whose goal is to reduce the number of coefficients to be estimated. In contrast, in our model each data instance (a patient) is still represented by a vector, whereas the bilinear regressor is a matrix that captures the correlation between the medical features. From the embedding perspective, our work is related to supervised dimensionality reduction and discriminant analysis, such as Fisher Linear Discriminant Analysis [15] and its extensions [16], supervised Principal Component Analysis [17], and supervised metric learning [18]. What our model and these previous techniques have in common is that they aim to find a projection/embedding that maximally separates the training samples from different classes. The fundamental difference is that existing supervised dimensionality reduction methods are symmetric, i.e. their objective is to pull data points from different classes apart from each other, without distinguishing between the actual class labels. In contrast, our model deals with a binary outcome where the two classes are not interchangeable: our objective is to project the positive class as far away from the origin as possible while projecting the negative class as close to the origin as possible.

2. Background

Risk stratification is a fundamental technique in medical informatics [1]. Traditionally, the goal of risk stratification was to regress the risk of patients on a pre-selected set of risk factors [5,8]. Once the risk score is predicted, the patient is assigned to a certain tier based on the score. After EHRs were widely adopted, more advanced machine learning algorithms were introduced into risk stratification to deal with the high dimensionality and sparseness of EHR data [9]. These techniques were able to achieve automatic feature selection and significantly improved the accuracy of risk prediction, but their outputs remained a flat score-based stratification and offered little new insight into the clinical characteristics of the patient cohort. In a separate line of research, a variety of techniques have been proposed for cluster analysis of patient cohorts, or disease phenotyping [10–12]. These techniques were able to discover homogeneous patient groups with similar medical conditions as well as strongly correlated medical features. This type of analysis provided meaningful clinical contexts that medical experts can act upon. However, these disease phenotyping techniques were not integrated into the risk stratification framework, i.e. there was no direct correspondence between the identified disease phenotypes and the predicted risk of each individual patient.

3. Methods

In this section we formally introduce our objective function and the algorithm, and also discuss some implementation issues that arise in practice. Suppose we use a d-dimensional vector $x \in \mathbb{R}^d$ to encode the clinical records of a patient. A variety of encoding schemes can be used here; for instance, $x_i = 1$ $(i = 1, 2, \ldots, d)$ if the patient has feature i (e.g. a diagnosis code or a drug) and $x_i = 0$ otherwise; alternatively, $x_i$ could be the frequency count or weighted frequency count of feature i for this particular patient. Let y be the binary label associated with the outcome we want to predict: $y = +1$ means the patient had a certain outcome (case) and $y = -1$ otherwise (control). In traditional logistic regression, the risk for the outcome and the input feature vector are associated in a (generalized) linear form [19]:

$$\log \frac{p(y = +1 \mid x)}{p(y = -1 \mid x)} = w^T x + \omega_0, \qquad (1)$$

where $w \in \mathbb{R}^d$ is the regression coefficient vector. In our work, we extend the linear regression model in Eq. (1) to the following bilinear form:

Fig. 1. The limitation of the existing regression-based risk stratification techniques. The two high-risk patients have very different clinical conditions but are regarded as similar under the score-based stratification.


$$\log \frac{p(y = +1 \mid x)}{p(y = -1 \mid x)} = x^T W x + \omega_0. \qquad (2)$$

Now instead of a regression coefficient vector w, we have a $d \times d$ coefficient matrix W. Notice that when the input feature vectors are binary, the linear regression model becomes a special case of our bilinear model in which W is a diagonal matrix. Intuitively speaking, the linear regression model only considers the contribution of each individual risk factor to the outcome, whereas our bilinear regression model also considers the contribution of the co-occurrence of two different risk factors. Thus we extend the cardinality of the feature set from d to $d^2$, and with the enriched feature set we can build a bilinear risk space (see the comparison between Figs. 1 and 6). Such second-order information can have high predictive power in many cases. For instance, the occurrence of two comorbid diseases can lead to a larger increase in risk score than the sum of the risks they cause separately. Also, if we consider the use of medications as risk factors, the second-order information can capture the effect of drug combinations, which is common in the treatment of complex diseases such as CHF (see Fig. 5 for a quantitative comparison).
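To make the contrast concrete, the following toy computation (ours, with made-up feature names and coefficient values) shows how the bilinear score $x^T W x$ adds a co-occurrence term on top of the linear score $w^T x$; note that for binary features a diagonal W reproduces the linear model exactly:

```python
import numpy as np

# Hypothetical binary feature vector for one patient:
# x[0] = COPD diagnosis, x[1] = CKD diagnosis, x[2] = beta blocker use.
x = np.array([1.0, 0.0, 1.0])

# Linear model: each feature contributes independently.
w = np.array([0.8, 0.5, 0.3])   # per-feature coefficients (illustrative)
linear_log_odds = w @ x         # w^T x = 1.1

# Bilinear model: W[i, j] weights the co-occurrence of features i and j.
W = np.diag(w)                  # the diagonal part recovers the linear model
W[0, 2] = W[2, 0] = 0.6         # extra risk when features 0 and 2 co-occur
bilinear_log_odds = x @ W @ x   # x^T W x = 0.8 + 0.3 + 2 * 0.6 = 2.3

print(linear_log_odds, bilinear_log_odds)
```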

The regression coefficients of a linear model are usually estimated by Maximum Likelihood Estimation (MLE):

$$\arg\max_{w} \prod_{x} p(y \mid x; w).$$

MLE can achieve low asymptotic error [20], but the number of training samples n is expected to be much larger than the number of coefficients to estimate, d [7]. This requirement makes MLE infeasible for the bilinear model [13], where we have $d \times d$ coefficients to estimate. For EHR data, d is often on the scale of hundreds or thousands, which makes $n \gg d^2$ a prohibitively demanding requirement. Consequently, we consider an alternative objective function to estimate W. Our objective exploits the low-rank structure of W and allows an intuitive geometric interpretation. Formally, our objective function is:

$$\arg\max_{W} \operatorname{tr}\left(X_+^T W X_+\right)/n_+, \qquad \arg\min_{W} \operatorname{tr}\left(X_-^T W X_-\right)/n_- \qquad (3)$$

$X_+ \in \mathbb{R}^{d \times n_+}$ is the set of all case patients and $X_- \in \mathbb{R}^{d \times n_-}$ is the set of all control patients (each column is the record of an individual patient); tr denotes the matrix trace. From a regression point of view, our objective in Eq. (3) aims to find the W that maximizes the average risk score over all case patients while minimizing the average risk score over all control patients. Note that the objective in Eq. (3) is not well-posed: to solve it, we need to fold the multi-criteria optimization problem into a single-criterion problem. To do so, we constrain W to be positive semidefinite and decompose it as $W = UU^T$, where $U \in \mathbb{R}^{d \times r}$. This constraint essentially makes W a kernel matrix for the features. It is also easy to satisfy in practice since we can simply add a constant to the diagonal elements of W to make it diagonally dominant. Now we can rewrite Eq. (3) in the equivalent form:

$$\arg\max_{U} \operatorname{tr}\left(U^T \frac{X_+ X_+^T}{n_+}\, U\right), \qquad \arg\min_{U} \operatorname{tr}\left(U^T \frac{X_- X_-^T}{n_-}\, U\right) \qquad (4)$$
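The equivalence of Eqs. (3) and (4) is simply the decomposition $W = UU^T$ combined with the cyclic property of the trace; written out for the case term:

$$\operatorname{tr}\left(X_+^T W X_+\right)/n_+ = \operatorname{tr}\left(X_+^T U U^T X_+\right)/n_+ = \operatorname{tr}\left(U^T \frac{X_+ X_+^T}{n_+}\, U\right),$$

and analogously for the control term.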

Denote $P_+ \triangleq X_+ X_+^T / n_+$ and $P_- \triangleq X_- X_-^T / n_-$; it is easy to see that $(P_+)_{ij}$ is the empirical estimate of the conditional joint probability that features i and j $(i, j = 1, 2, \ldots, d)$ co-occur in a case patient, and $(P_-)_{ij}$ is the empirical estimate of the conditional joint probability that features i and j co-occur in a control patient. With the new notation, we transform Eq. (3) into

$$\arg\max_{U^T U = I} \operatorname{tr}\left(U^T Z U\right), \qquad (5)$$

where $Z_{ij} \triangleq (P_+)_{ij} / (P_-)_{ij}$. The constraint $U^T U = I$ is introduced to avoid arbitrary scaling of U. Using Bayes' theorem, we can show that Z is essentially a second-order odds ratio matrix multiplied by a constant:

$$Z_{ij} \triangleq \frac{(P_+)_{ij}}{(P_-)_{ij}} = \frac{p(x_i = 1, x_j = 1 \mid y = +1)}{p(x_i = 1, x_j = 1 \mid y = -1)} = c\, \frac{p(y = +1 \mid x_i = 1, x_j = 1)}{p(y = -1 \mid x_i = 1, x_j = 1)},$$

where $c = p(y = -1)/p(y = +1)$ is a prior ratio that does not depend on i, j. Notice that the odds ratio matrix Z is nonnegative. To improve the interpretability of the components in U, we enforce the nonnegativity constraint $U \geq 0$ and relax the orthogonality constraint $U^T U = I$. Then, as shown in [21], Eq. (5) is equivalent to

$$\arg\min_{U \geq 0} \|Z - UU^T\|_F^2. \qquad (6)$$

Here $\|\cdot\|_F$ is the Frobenius matrix norm. Therefore, our final objective function in Eq. (6) can be interpreted as finding a low-rank approximation to the empirical odds ratio matrix Z. Recall that we started by assuming $W = UU^T$; now it is easy to see why the coefficient matrix W is a good regressor: if features i and j are much more likely to co-occur in case patients than in control patients, then their co-occurrence in a test patient should contribute positively to the patient's risk. Similarly, if features i and j are much more likely to co-occur in control patients than in case patients, then their co-occurrence should contribute negatively to the patient's risk. In other words, the entries of $UU^T$ can be read as the contributions of feature co-occurrences to the target outcome. In practice, we expect Z to have a low-rank structure, i.e. the difference between the case patients and the control patients can be explained by a small number of disease phenotypes [12,11]. Consequently, we can use a low-rank matrix $U \in \mathbb{R}^{d \times r}$ $(r \ll d)$ to approximate Z (see Fig. 3 for an empirical validation on the real data). In this case, U can be interpreted as a projection from the original d-dimensional feature space to an r-dimensional risk space, where each dimension represents a disease phenotype that has a significant contribution to the risk score. From the patient perspective, $U^T x$ embeds each patient into the risk space in such a way that the case patients are as far away from the origin as possible and the control patients are as close to the origin as possible. This is essentially the geometric interpretation of Eq. (4). We argue that such an embedding U achieves the three fundamental goals of risk stratification as suggested in Section 1:

1. Risk Score Prediction: The distance between a patient's embedding and the origin in the risk space, namely $\|U^T x\|^2$, serves as the estimate of that patient's risk score. U is optimized such that case patients will have high risk scores and control patients will have low risk scores.

2. Patient Cohort Stratification: We can stratify patients into different groups by clustering them based on their embeddings in the risk space. Patients within the same cluster will have not only similar risk scores (their distances to the origin) but also similar clinical characteristics (since the risk space is spanned by disease phenotypes).

3. Clinical Context Discovery: After embedding, a patient's clinical context is explicitly defined by the r dimensions of the risk space. We can examine his/her position on the respective dimensions to determine which are the main contributors to his/her risk score. More interestingly, we can compare two patients who are at a similar distance to the origin (thus with similar risk scores) but are placed far away from each other in the risk

150

X. Wang et al. / Journal of Biomedical Informatics 53 (2015) 147–155

space. Unlike what we demonstrated in Fig. 1, our bilinear model will recognize and preserve their distinction in terms of clinical conditions.

Algorithm 1. Bilinear Risk Stratification

In practice, our algorithm is easy to implement and efficient to solve. It consists of three main steps (also summarized in Algorithm 1): (1) Compute the empirical odds ratio matrix Z:

$$Z_{ij} = \frac{\left(X_+ X_+^T\right)_{ij} / n_+ + \epsilon}{\left(X_- X_-^T\right)_{ij} / n_- + \epsilon}, \qquad (7)$$

where $\epsilon$ is a smoothing factor that can be set to meet the needs of a particular application; a larger $\epsilon$ will suppress the influence of clinical features with low incidence rates. (2) Approximate Z with a rank-r matrix: solve the symmetric nonnegative matrix factorization problem in Eq. (6). Many standard ready-to-use solvers exist for this problem [22]. The value of r is set manually according to domain knowledge or application needs, or by thresholding the approximation error. The columns of U can be interpreted as the top-r latent factors that discriminate the case patients from the control patients; $U_{ij}$ signifies the importance of the i-th feature in the j-th factor. (3) Risk prediction and patient stratification: project patient x into the r-dimensional risk space by $\tilde{x} = U^T x$. The risk score of x can then be estimated as:

$$\hat{y} = \frac{1}{1 + \exp(-\tilde{x}^T \tilde{x})}. \qquad (8)$$

Clustering $\{\tilde{x}\}$ will give us groups of patients with similar risk scores and clinical characteristics. The choice of clustering algorithm is independent of our model and should be decided by application needs.
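To make the three steps concrete, here is a minimal end-to-end sketch in Python (our illustration, not the authors' implementation). The symmetric nonnegative factorization of Eq. (6) is approximated here with simple projected gradient descent, whereas the paper points to dedicated solvers such as libNMF [22]; the step size, iteration count, and all function and variable names are our own assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def bilinear_risk_stratification(X_case, X_ctrl, r=3, eps=1.0, n_clusters=5,
                                 n_iter=500, step=1e-3, seed=0):
    """X_case: d x n+ matrix of case patients, X_ctrl: d x n- matrix of controls."""
    n_pos, n_neg = X_case.shape[1], X_ctrl.shape[1]

    # Step 1: smoothed empirical odds ratio matrix Z (Eq. (7)).
    Z = (X_case @ X_case.T / n_pos + eps) / (X_ctrl @ X_ctrl.T / n_neg + eps)

    # Step 2: symmetric nonnegative factorization Z ~ U U^T (Eq. (6))
    # via projected gradient descent on ||Z - U U^T||_F^2.
    rng = np.random.default_rng(seed)
    U = rng.random((Z.shape[0], r))
    for _ in range(n_iter):
        grad = 4.0 * (U @ U.T - Z) @ U        # gradient of the squared Frobenius loss
        U = np.maximum(U - step * grad, 0.0)  # project back onto U >= 0

    # Step 3: embed all patients (x_tilde = U^T x) and score them (Eq. (8)).
    X_all = np.hstack([X_case, X_ctrl])
    emb = U.T @ X_all                                       # r x n embedding
    risk = 1.0 / (1.0 + np.exp(-np.sum(emb ** 2, axis=0)))  # sigmoid of ||x_tilde||^2

    # Stratify by clustering in the risk space; any clustering method would do.
    groups = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(emb.T)
    return U, emb, risk, groups
```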

4. Experiments

4.1. Study design

Our study is based on a real-world EHR data warehouse containing the records of 319,650 patients over 4 years. We extracted a cohort of patients who are at risk of developing Congestive Heart Failure (CHF) and applied our model to predict the CHF onset risk of those patients based on their clinical observations.

We defined CHF diagnosis using the following criteria (which are similar to the criteria used in [9]): (1) an ICD-9 diagnosis of heart failure appeared in the EHR for two outpatient encounters, indicating consistency in clinical assessment, and (2) at least one medication was prescribed with an associated ICD-9 diagnosis of heart failure. The diagnosis date was defined as the first appearance of the heart failure diagnosis in the EHR. These criteria have also been previously validated as part of Geisinger Clinic's involvement in a Centers for Medicare & Medicaid Services (CMS) pay-for-performance pilot [23]. With these criteria, we extracted 1127 CHF case patients from the database. Following the case-control matching strategy in [9], a primary care patient was eligible as a control patient if they were not in the case list and had the same PCP as the case patient. Approximately 10 eligible clinic-, sex-, and age-matched (in five-year age intervals) controls were selected for each heart failure case; in situations where 10 matches were not available, all available matches were selected. Following this strategy, we got 3850 control patients, so on average each case patient was matched with approximately three controls. Demographic statistics of the case patients and the control patients are shown in Fig. 2.

For all patients we extracted their diagnosis codes in terms of ICD-9 and their medications in terms of drug classes from the EHR database. We only considered the clinical observations that occurred from 540 days prior to the diagnosis date until T days prior to the diagnosis date. Here T is called the prediction window, and in our experiments we varied it from 180 days to 0 days. Patients who had an insufficient number of records (fewer than 25 clinical observations) were excluded. For control patients, we set the last day of a patient's available records as the diagnosis date and followed the same rule. In total we had 4261 distinct medical features. After removing rare features that appeared in fewer than 2% of the patients, we were left with 298 features in our experiments.

4.2. Parameter settings

r is the parameter that decides the dimension of the bilinear risk space. In our experiments we set r to 3 for visualization purposes. From Fig. 3 we can see that with a 3-dimensional risk space we can already achieve a relative approximation error lower than 20%, where the error is defined as $\|Z - UU^T\|_F / \|Z\|_F$. In fact, with only the x–y plane (r = 2), the approximation error is merely 25%, which demonstrates the low-rankness of the risk space. If we keep the top 7 dimensions, the approximation error is reduced to under 10%. We set the number of clusters to 5 for patient stratification. A larger number of clusters can be used if more detailed clinical contexts are needed; for instance, the green group can be further partitioned into smaller groups with the z-axis as the differentiating factor. Since the clustering step is independent of the model training step, we can make it interactive and let the users freely explore any cluster of interest.

4.3. Prediction accuracy

We evaluated the predictive power of our model in terms of Area Under the receiver operating characteristic Curve (AUC) using 10-fold cross validation. We first randomly partitioned the patient cohort into 10 disjoint folds. For each trial we picked one fold as the test set and a portion of the remaining nine folds as the training set. As a result, the training-to-test ratio varied from 1:1 to 9:1.
We used two different feature sets: one was ICD-9 codes only and the other was ICD-9 + Medication. We also tried different prediction window lengths, namely 180, 90, and 0 days. Overall we had 54 different settings, and for each setting we report the mean AUC and standard deviation over 10-fold cross validation.
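A sketch of this evaluation protocol (ours; `predict_risk` is a hypothetical stand-in for fitting a model on the training portion and scoring the test fold, e.g. via Eq. (8)):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def cross_validated_auc(X, y, predict_risk, train_ratio=0.9, seed=0):
    """X: n x d feature matrix, y: binary labels.
    One fold serves as the test set; a random portion of the remaining
    nine folds serves as the training set (varying the training-test ratio)."""
    rng = np.random.default_rng(seed)
    aucs = []
    folds = StratifiedKFold(10, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X, y):
        keep = rng.permutation(train_idx)[: int(train_ratio * len(train_idx))]
        scores = predict_risk(X[keep], y[keep], X[test_idx])
        aucs.append(roc_auc_score(y[test_idx], scores))
    return np.mean(aucs), np.std(aucs)
```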

Fig. 2. The age and sex distribution of the case patients and the control patients (panels: (a) Age – Case; (b) Age – Control; (c) Sex – Case; (d) Sex – Control).

Fig. 3. The relative approximation error vs. the rank of the risk space (x-axis: number of components, 0–20; y-axis: relative error, 0–1).

We first compare our model to two closely related baselines. The first is Sparse Logistic Regression, i.e. logistic regression with $\ell_1$ regularization on the regression coefficients. It has been widely used for risk stratification and has demonstrated good accuracy for CHF onset prediction [9]; it represents the state-of-the-art performance attainable with the traditional linear model. The second is Principal Component Analysis (PCA) plus Logistic Regression. PCA is a popular dimensionality reduction technique. We apply it to the patient cohort to project the patients onto a lower-dimensional space, where each dimension represents a group of risk factors, similar to the risk factor groups we can derive from our model. Then we apply logistic regression in the lower-dimensional space to build a risk model.

From Fig. 4 we can observe that: (1) with more features, shorter prediction windows, or larger training-to-test ratios, the performance of all approaches consistently increased, which conforms to our intuition; (2) our bilinear model outperformed the two baselines across all settings, which suggests that introducing feature co-occurrence information can indeed improve prediction accuracy (we show later what the most predictive co-occurrences are); (3) the advantage of our model over Sparse Logistic Regression was particularly significant when the training-to-test ratio was small, which is a desirable property in practice where training samples are scarce. This is not surprising because our model is generative whereas logistic regression is discriminative, so our model can converge much faster to its asymptotic error [20]. It was also interesting to observe that LR performed better with PCA preprocessing when the training data were scarce. This is because, by preserving only the top principal components, PCA is able to remove noise from a small training set. However, as the training set becomes larger, the influence of noise becomes less significant and PCA becomes less beneficial. In fact, since it selects the principal components in an unsupervised fashion, it can remove useful information and hurt the prediction performance (as indicated in Fig. 4).
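For reference, the two baselines can be instantiated with scikit-learn along the following lines; the regularization strength and number of components are illustrative choices of ours, not values reported in the paper:

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Baseline 1: Sparse Logistic Regression (l1-regularized); C is illustrative.
sparse_lr = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)

# Baseline 2: unsupervised PCA followed by logistic regression
# in the reduced space; n_components is illustrative.
pca_lr = make_pipeline(PCA(n_components=10), LogisticRegression())

# Typical usage (X_train, y_train, X_test assumed to be numpy arrays):
# sparse_lr.fit(X_train, y_train)
# scores = sparse_lr.predict_proba(X_test)[:, 1]
```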


Fig. 4. Mean AUC and standard deviation over 10-fold cross validation for Bilinear Regression, Sparse Logistic Regression, and PCA + Logistic Regression as a function of the training set ratio (0.1–0.9). Panels: (a) Diagnosis, 180 days; (b) Diagnosis+Medication, 180 days; (c) Diagnosis, 90 days; (d) Diagnosis+Medication, 90 days; (e) Diagnosis, 0 days; (f) Diagnosis+Medication, 0 days. The star sign indicates a significant difference between our model and the traditional risk model (p < 0.05).

In Fig. 5 we present a more comprehensive comparison between our model and some state-of-the-art classification algorithms using the average ROC over 10-fold cross validation, at a training-to-test ratio of 3:1. In addition to Sparse Logistic Regression (SLR), we include Support Vector Machine (SVM) and Gradient Boosting (GB), two algorithms that have shown good performance in clinical risk prediction in the literature [9]. We can see that our model outperformed SLR and SVM by a large margin. GB delivered the second-best performance; however, it lacks the interpretability offered by the logistic regression model or our bilinear regression model.

4.4. Patient embedding and stratification

In addition to risk score prediction, our model can also embed the patient cohort into a low-dimensional risk space and stratify the patients based on their clinical contexts. To demonstrate this, we applied our model to the entire CHF cohort, embedded the patients into a 3-D risk space (we chose r = 3 for the sake of visualization), and stratified them into five groups. The prediction window was 180 days and the features were ICD-9 codes and drug classes. Note that here we summarized ICD-9 codes into ICD-9 group codes

Fig. 5. Average ROC over 10-fold cross validation at a 3:1 training-test ratio; the shaded area indicates standard deviation. Panels: (a) Diagnosis, 180 days; (b) Diagnosis+Medication, 180 days; (c) Diagnosis, 90 days; (d) Diagnosis+Medication, 90 days; (e) Diagnosis, 0 days; (f) Diagnosis+Medication, 0 days. Across the panels, the reported mean AUCs ranged from 0.702 to 0.832 for Bilinear Regression, 0.677 to 0.788 for Sparse Logistic Regression, 0.692 to 0.815 for Gradient Boosting, and 0.635 to 0.743 for Support Vector Machine.

(the first three digits of the ICD-9 codes) so that we can have a high-level interpretation of the risk factors. The stratification result of our model is illustrated in Fig. 6. Recall that each point in the 3-D risk space represents a patient, and the distance from that point to the origin represents the risk of that patient. Points are grouped together based on their positions in the risk space. Therefore, we can easily tell in Fig. 6 that the green cluster represents a group of low-risk patients whereas the red cluster represents a group of high-risk patients. If two patients are in the same cluster, they are similar not only in terms of risk score but also in clinical characteristics. If two patients are at the same distance to the origin but far away from each other in the risk space, e.g. a point from the yellow cluster and a point from

the blue cluster, it means they have similar risk scores but for different reasons. This is a typical case that should trigger personalized care plan design. Next we show that the stratification from our model indeed helped define meaningful and actionable clinical contexts. In Fig. 7 we summarize the clinical characteristics of patients based on the clusters identified in Fig. 6. We can observe that the predicted risk scores are consistent with the true outcome of the patients. For instance, the average predicted risk of the green group was 0.50 and 41% of the patients in this cluster developed CHF in 180 days; the average predicted risk of the red group was 0.73 and 74% of the patients in the cluster developed CHF in 180 days. When we investigated the top diagnosis codes and medications


Fig. 6. The optimal 3-D risk space learned by our model and the stratification (K = 5) of the patient cohort therein (best viewed in color). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

within each group in terms of prevalence, we noticed some distinct clinical contexts. For instance, patients from the blue group and the yellow group are similar in terms of risk scores, but the main contributing factors for the yellow group include Hypertension, Cardiac Dysrhythmias, and the use of Calcium Channel Blocking Agents and Beta Blockers. In comparison, the blue group is associated with different risk factors, such as Ill-Defined Heart Conditions (cardiomegaly and other unspecified cardiovascular diseases that impair cardiac output and commonly result in severe CHF) and related symptoms (shortness of breath, chest pain, etc.); patients from the blue group were treated with Beta Blockers but not Calcium Channel Blocking Agents, which can be attributed to the absence of Cardiac Dysrhythmias. Similarly, when comparing

the two high-risk groups, purple and red, we observed that their main difference in terms of diagnosis was the presence/absence of Ill-Defined Heart Conditions, while the main difference in terms of medication was the use of Calcium Channel Blocking Agents and Anticoagulants. Such distinctions in both diagnosis and medication can provide key insights into personalized care plan management. Our model highlights not only the differences between groups but also the transition from one group to another. For instance, when we compare the yellow group to the red group, we can see a significant increase in the prevalence of Cardiac Dysrhythmias (from 54% to 98%) and the introduction of a new medication, Anticoagulants, which are used to prevent the formation of blood clots, particularly for atrial arrhythmias. This is another aspect of clinical context: it shows how changes in the clinical characteristics of patients are associated with changes in their risk levels. Each axis of the risk space in Fig. 6 corresponds to a group of highly correlated risk factors identified by our model. For each column of U, we computed the relative weight of each feature for that group as $U_{ij} / \sum_{i=1}^{d} U_{ij}$. In Fig. 8 we then show, for each group, the clinical features whose weights are higher than 0.05. The top features for the x-axis are Cardiac Dysrhythmias and two drugs that are commonly used to treat this condition. Therefore, placing a patient farther from the origin along the x-axis implies that his/her top risk factors are Cardiac Dysrhythmias and the related medications. On the other hand, the top factors for the y-axis are Ill-Defined Heart Conditions and Chronic Ischemic Heart Disease. The difference between the two axes explains the different clinical conditions of the blue and yellow groups. The top factors for the z-axis include some additional clinical features, such as Diabetes and Diuretics. However, this axis is not a differentiating factor for our patient stratification in Fig. 6, as the five clusters are distributed along the x–y plane.
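The per-axis feature weights behind Fig. 8 follow directly from U; a small sketch of the computation just described (function and variable names are ours):

```python
import numpy as np

def top_features(U, feature_names, threshold=0.05):
    """Relative weight of feature i in factor j: U[i, j] / sum_i U[i, j].
    Returns, for each factor (axis), the features whose weight exceeds the threshold."""
    weights = U / U.sum(axis=0, keepdims=True)
    return [
        [(name, w) for name, w in zip(feature_names, weights[:, j]) if w > threshold]
        for j in range(U.shape[1])
    ]
```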

Fig. 7. The clinical characteristics of the patient groups identified in Fig. 6.


Fig. 8. Top clinical features for each of the axes of our risk space in Fig. 6 (pie charts). x-axis: Calcium Channel Blocking Agents, Cardiac Dysrhythmias, Anticoagulants, Others; y-axis: Chronic Ischemic Heart Disease, Others; z-axis: Cardiac Dysrhythmias, Chronic Ischemic Heart Disease, Anticoagulants, Diabetes, Beta-lactam Penicillins, Diuretics, Others.

5. Conclusions

In this paper we propose a bilinear model for risk stratification. The advantage of our model is that it not only accurately predicts the risk for individual patients but also captures the distinct clinical characteristics of different high-risk groups. This advantage is achieved by approximating the feature-wise odds ratio matrix with a low-rank decomposition. We tested our model on a CHF cohort with 1127 case patients and 3850 control patients. We demonstrated that our model consistently outperforms state-of-the-art risk prediction models in a variety of settings. More importantly, our model is able to identify meaningful clinical contexts associated with the onset of CHF. These clinical insights are easy to interpret and instrumental to personalized care plan design.

References

[1] Miller CC, Reardon MJ, Safi HJ. Risk stratification: a practical guide for clinicians. Cambridge University Press; 2001.
[2] Echouffo-Tcheugui JB, Kengne AP. Risk models to predict chronic kidney disease and its progression: a systematic review. PLoS Med 2012;9:e1001344.
[3] Oliver D, Daly F, Martin FC, McMurdo ME. Risk factors and risk assessment tools for falls in hospital in-patients: a systematic review. Age Ageing 2004;33:122–30.
[4] Fonarow GC, Adams Jr KF, Abraham WT, Yancy CW, Boscardin WJ, et al. Risk stratification for in-hospital mortality in acutely decompensated heart failure. JAMA 2005;293:572–80.
[5] Roques F, Michel P, Goldstone A, Nashef S. The logistic EuroSCORE. Eur Heart J 2003;24:881–2.
[6] Wilson PW, D'Agostino RB, Levy D, Belanger AM, Silbershatz H, Kannel WB. Prediction of coronary heart disease using risk factor categories. Circulation 1998;97:1837–47.
[7] Peng C-YJ, Lee KL, Ingersoll GM. An introduction to logistic regression analysis and reporting. J Educ Res 2002;96:3–14.
[8] Mahmood SS, Levy D, Vasan RS, Wang TJ. The Framingham Heart Study and the epidemiology of cardiovascular disease: a historical perspective. Lancet 2014;383:999–1008.
[9] Wu J, Roy J, Stewart WF. Prediction modeling using EHR data: challenges, strategies, and a comparison of machine learning approaches. Med Care 2010;48:S106–13.
[10] Wang X, Wang F, Wang J, Qian B, Hu J. Exploring patient risk groups with incomplete knowledge. In: ICDM; 2013. p. 1223–8.
[11] Roque FS, Jensen PB, Schmock H, Dalgaard M, Andreatta M, Hansen T, et al. Using electronic patient records to discover disease correlations and stratify patient cohorts. PLoS Comput Biol 2011;7:e1002141.
[12] Burgel PR, Paillasseur JL, Caillaud D, Tillie-Leblond I, Chanez P, Escamilla R, et al. Clinical COPD phenotypes: a novel approach using principal component and cluster analyses. Eur Respir J 2010;36:531–9.
[13] Dyrholm M, Christoforou C, Parra LC. Bilinear discriminant component analysis. JMLR 2007;8:1097–111.
[14] Basu S, Dunagan J, Duh K, Muniswamy-Reddy K-K. BLR-D: applying bilinear logistic regression to factored diagnosis problems. SIGOPS Oper Syst Rev 2012;45:31–8.
[15] Mika S, Ratsch G, Weston J, Scholkopf B, Muller K. Fisher discriminant analysis with kernels. In: IEEE Signal Processing Society Workshop; 1999. p. 41–8.
[16] Sugiyama M. Local Fisher discriminant analysis for supervised dimensionality reduction. In: ICML; 2006. p. 905–12.
[17] Barshan E, Ghodsi A, Azimifar Z, Zolghadri Jahromi M. Supervised principal component analysis: visualization, classification and regression on subspaces and submanifolds. Pattern Recogn 2011;44:1357–71.
[18] Wang F, Sun J, Hu J, Ebadollahi S. iMet: interactive metric learning in healthcare applications. In: SDM; 2011. p. 944–55.
[19] Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. Springer Series in Statistics. New York (NY, USA): Springer; 2001.
[20] Ng AY, Jordan MI. On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. In: NIPS; 2001. p. 841–8.
[21] Ding CH, He X, Simon HD. On the equivalence of nonnegative matrix factorization and spectral clustering. In: SDM, vol. 5; 2005. p. 606–10.
[22] Janecek AGK, Grotthoff SS, Gansterer WN. libNMF – a library for nonnegative matrix factorization. Comput Inform 2011;30:205–24.
[23] Pfisterer M, Buser P, Rickli H, Gutmann M, Erne P, Rickenbacher P, et al. BNP-guided vs symptom-guided heart failure therapy. JAMA 2009;301:383–92.
