Stable feature selection for clinical prediction: exploiting ICD tree structure using Tree-Lasso.

YJBIN 2259

No. of Pages 14, Model 5G

12 December 2014 Journal of Biomedical Informatics xxx (2014) xxx–xxx 1

Contents lists available at ScienceDirect

Journal of Biomedical Informatics journal homepage: www.elsevier.com/locate/yjbin 5 6

Stable feature selection for clinical prediction: Exploiting ICD tree structure using Tree-Lasso

3 4 7

Q1

8

Iman Kamkar ⇑, Sunil Kumar Gupta, Dinh Phung, Svetha Venkatesh Centre for Pattern Recognition and Data Analytics, Deakin University, Australia

10 9 11

a r t i c l e

1 2 3 5 14 15 16 17

i n f o

Article history: Received 7 July 2014 Accepted 28 November 2014 Available online xxxx

18 19 20 21 22 23 24

Keywords: Feature selection Lasso Tree-Lasso Feature stability Classification

a b s t r a c t Modern healthcare is getting reshaped by growing Electronic Medical Records (EMR). Recently, these records have been shown of great value towards building clinical prediction models. In EMR data, patients’ diseases and hospital interventions are captured through a set of diagnoses and procedures codes. These codes are usually represented in a tree form (e.g. ICD-10 tree) and the codes within a tree branch may be highly correlated. These codes can be used as features to build a prediction model and an appropriate feature selection can inform a clinician about important risk factors for a disease. Traditional feature selection methods (e.g. Information Gain, T-test, etc.) consider each variable independently and usually end up having a long feature list. Recently, Lasso and related l1 -penalty based feature selection methods have become popular due to their joint feature selection property. However, Lasso is known to have problems of selecting one feature of many correlated features randomly. This hinders the clinicians to arrive at a stable feature set, which is crucial for clinical decision making process. In this paper, we solve this problem by using a recently proposed Tree-Lasso model. Since, the stability behavior of Tree-Lasso is not well understood, we study the stability behavior of Tree-Lasso and compare it with other feature selection methods. Using a synthetic and two real-world datasets (Cancer and Acute Myocardial Infarction), we show that Tree-Lasso based feature selection is significantly more stable than Lasso and comparable to other methods e.g. Information Gain, ReliefF and T-test. We further show that, using different types of classifiers such as logistic regression, naive Bayes, support vector machines, decision trees and Random Forest, the classification performance of Tree-Lasso is comparable to Lasso and better than other methods. Our result has implications in identifying stable risk factors for many healthcare problems and therefore can potentially assist clinical decision making for accurate medical prognosis. Ó 2014 Published by Elsevier Inc.

26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48

49 50

1. Introduction

51

Recent advances in information technology has changed the way health care is carried out and documented [37]. Nowadays, not only traditional clinical narrative but also other types of data related to healthcare such as laboratory test results, medications and radiological images are automatically captured by databases in modern health centers. Clinical data describing the phenotypes and treatment of patients represents an underused data source that has a great research potential. Mining of Electronic Medical Records (EMR) can yield useful patterns that support clinical research and decision making [21]. The EMR contains rich information about a patient, including demographics, history of hospital

52 53 54 55 56 57 58 59 60 61

⇑ Corresponding author.

Q1

E-mail addresses: [email protected] (I. Kamkar), [email protected]. au (S.K. Gupta), [email protected] (D. Phung), svetha.venkatesh@deakin. edu.au (S. Venkatesh).

visits, diagnoses, physiological measurements and interventions. As these data come with considerable amount of irrelevant and redundant features where only a subset of these features are useful for prediction, feature selection plays an important role to identify important features for building predictive models. For example, it is crucial to identify risk factors of cancer mortality for designing care plan and prognosis of a cancer patient. Similarly, it may be useful to find risk factors that are responsible for avoidable hospital re-admissions to reduce the cost of healthcare. Feature selection methods can be broadly classified into three categories: (1) Filter methods such as T-test, Information Gain [7], ReliefF [52], and Chi Square [31] that assess the relevance of features by looking only at the intrinsic properties of the data. These methods consider each feature separately and ignore dependencies between features. Hence, comparing to other types of feature selection methods, they may cause long feature lists. (2) Wrapper methods such as Beam search [46], Sequential forward selection (SFS) [25], and Sequential backward elimination (SBE)

http://dx.doi.org/10.1016/j.jbi.2014.11.013 1532-0464/Ó 2014 Published by Elsevier Inc.

Q1 Please cite this article in press as: Kamkar I et al. Stable feature selection for clinical prediction: Exploiting ICD tree structure using Tree-Lasso. J Biomed Inform (2014), http://dx.doi.org/10.1016/j.jbi.2014.11.013

62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79

YJBIN 2259


12 December 2014 Q1 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120

2

I. Kamkar et al. / Journal of Biomedical Informatics xxx (2014) xxx–xxx

[25] that utilize a supervised learning algorithm in the process of selecting feature subsets. The downside of these techniques is their high cost of computation and the risk of an over-fitted model [23]. (3) Embedded methods such as Weighted naive Bayes [10] and Lasso [49] search for an optimal subset of features, which is built within the classifier construction, and can be seen as a search in the combined space of feature subsets (joint feature selection property) and hypotheses. In this context ‘1 -norm methods such as Lasso has received an increasing attention. Using ‘1 -norm penalty, Lasso regularizes linear models and achieves automatic feature selection by driving some coefficients toward zero. Beyond classifier performance, the other main objective of feature selection is to obtain a stable list of features [23,44]. The stability of feature selection methods has been used to examine their sensitivity to changes in input data and is defined as the degree of agreement of classification models produced by an algorithm when trained on different training sets. The stability is an important property in applications where the features carry intuitive meanings and actions are taken based on these features, e.g. risk factors for cancer survival or hospital readmission prediction must be stable to help a practitioner towards making clinical decisions. The need for stable feature sets has also been strongly felt in pattern recognition community [22,27]. Despite obtaining great success in many applications with high dimensional data [43,45,54], Lasso is known to be unstable when features exhibit strong correlations [56,59]. In these situations, Lasso shows acceptable predictive performance but selected predictors are quite sensitive to small changes in data and vary drastically. This variability is due to the property that among several correlated variables, Lasso penalty term (‘1 -norm) tends to select one of them randomly [49,60]. Different methods have been proposed to solve this problem. Elastic net is one of these remedies that reduce this randomness by adding a convex penalty term (squared ‘2 -norm) [60]. However, it is not capable of exploiting the correlation structure of the data. Other methods use sub-sampling techniques [2,35], where only those features are taken that are stable across several sub-samples. Group Lasso offers another solution when the features form different groups and the variables within a group are correlated. Feature selection is performed at group level by penalizing the sum of the ‘2 -norm of these groups, so that if a group is selected then all the features in that group are

selected [20,56]. In many problems, features can naturally be represented using certain tree structures e.g. ICD-10 codes used in healthcare data have an intrinsic tree structure, (Fig. 1 shows an example of ICD-10 codes for diseases of musculoskeletal system and connective tissue). For such problems, application of groupLasso is not straight forward. Addressing these problems, we propose to use Tree-Lasso algorithm – a technique which has been proposed as a prediction model for classifying images where its pixels are considered to be the features lying on a tree [32]. However, the stability behavior of Tree-Lasso is not well understood. In this paper we take up this problem and study the stability behavior of Tree-Lasso. To do so, we use two different stability measures Spearman’s rank correlation coefficient and Jaccard similarity measure. We compare its stability with stability of other feature selection methods such as T-test, Information Gain (IG), ReliefF, and Lasso using a synthetic and two real-world datasets: Cancer cohort and Acute Myocardial Infarction cohort. Furthermore, we evaluate predictive performance of Tree-Lasso with other feature selection methods using several classifiers namely, logistic regression, naive Bayes, SVM, decision trees and Random Forest. In summary, Our main contributions are:

121

Introducing novel application of Tree-Lasso algorithm to obtain stable feature sets for developing healthcare predictive models from ICD codes. An extensive experimental study that shows stability behavior of Tree-Lasso based feature selection is significantly better than Lasso and comparable with other feature selection algorithms. Assessing thepredictive performance of models with the corresponding feature sets using several classifiers, e.g. logistic regression, naive Bayes, SVM, decision trees and Random Forest and find that under the constraint of stable feature selection, Tree-Lasso prediction performance is always better than that of many feature selection algorithms, namely T-test, IG, ReliefF, and Lasso. Comparing the risk-factors obtained using Tree-Lasso with the list of features used by clinical experts and find that many risk factors found by Tree-Lasso are consistent with those used by domain experts.

143

Fig. 1. A sample of ICD-10 tree for diseases of musculoskeletal system and connective tissue.


122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142

144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160

YJBIN 2259


12 December 2014 Q1


3

161 162 163 164 165 166 167

Our findings have implications in identifying stable risk factors for many healthcare problems and therefore assist clinicians and patients to arrive at a better care plan and prognosis. Although, we have applied our model for healthcare data, it is applicable to other real-world problems where features are hierarchical in nature and stability of features is important.

168

2. Feature stability

169

The stability of a feature selection algorithm is the robustness of algorithm in selecting features in different training sets which are drawn from same distribution [29,33]. Different methods have been proposed to assess the stability of feature selection algorithms [22]. These methods can be categorized into three groups based on the representation of the selected features used by a specific feature selection algorithm. First group, known as stability by index, considers the indices of the selected features. In this category, the selected features have no particular order or corresponding relevance weight. In the second group, known as stability by weight degree of relevance of each feature is considered by a weight that is assigned to the feature. In the third group, which is called stability by rank, the features order is important in evaluation of stability. In this group, each feature is assigned by a rank that shows the feature importance. In order to measure similarity between subsets of features we use Jaccard index. Jaccard index is a metric that measures the similarity between two sets. Given two sets Sq and Sq0 , the Jaccard index J Sq ; Sq0 is defined as

170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187

188 190

191 192 193 194 195 196

197

T Sq Sq 0 JðSq ; Sq0 Þ ¼ S Sq Sq 0

To define a stability measure using Jaccard index, we generate Q sub-samples of the training data, indexed as q ¼ 1; . . . ; Q . For each sub-sample, we run feature selection model and obtain a feature set, denoted by Sq . Given feature sets S1 ; . . . ; SQ , the Jaccard stability measure (JSM) is defined as the average of Jaccard indices over each pair of feature sets, i.e. J Sq ; Sq0 . Formally, we have

JSM ¼ 199 200 201 202

203

205 206 207 208 209 210 211 212

213

Q1 X

Q X

2 J Sq ; Sq 0 : Q ðQ 1Þ q¼1 q0 ¼qþ1

ð2Þ

If we assume that a feature selection algorithm assigns a weight to each feature, to evaluate the similarity between two weight vectors b; b0 , we use Pearson’s correlation coefficient:

P

0 j ðbj lb Þðbj lb0 Þ : PCCðb; b0 Þ ¼ qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi P 2 2P 0 j ðbj lb Þ i ðbj lb0 Þ

ð3Þ

PCCðb; b0 Þ takes values in ½1; 1, where PCCðb; b0 Þ ¼ 1 implies that the feature weights are completely correlated, PCCðb; b0 Þ ¼ 0 implies that feature weights are uncorrelated and PCCðb; b0 Þ ¼ 1 implies that feature weights are anti correlated. Given weight vectors b1 ; . . . ; bQ , we define PCC as the average of Pearson’s correlation coefficient over each pair of weights of feature sets i.e. PCCðbq ; bq0 Þ as follows:

PCC ¼ 215

ð1Þ

Q1 X Q X 2 PCCðb; b0 Þ: Q ðQ 1Þ q¼1 q0 ¼qþ1

ð4Þ

219

We construct a ranking (denoted by r) over features by sorting the weight vector b. To measure rank based similarity between any two rankings r and r0 , we use Spearman’s rank correlation coefficient:

222

X ðrj r0j Þ SRCCðr; r 0 Þ ¼ 1 6 ; mðm2 1Þ j

216 217 218

220

ð5Þ

where rj and r 0j are the ranks of jth feature in rankings r and r0 and m is the size of the whole feature set. Similar to Pearson’s correlation, the possible range of values are ½1; 1, where 1 means that the two rankings are identical, 0 means that there is no correlation between two ranks, and a value of 1 means that rankings are in reverse order. Given rankings r 1 ; . . . ; rQ , we define SRCC as the average of Spearman’s rank correlation over each pair of ranks of feature sets, so we have: Q1 X

Q X

2 SRCC ¼ SRCCðr; r 0 Þ: Q ðQ 1Þ q¼1 q0 ¼qþ1

224 225 226 227 228 229 230 231

232

ð6Þ 234

Based on the fact that PCC works directly on the weight vectors that are obtained using each feature selection method, which may use different scales to assign weights and so its results may not be directly comparable across different methods. Therefore, in this paper we only use SRCC and JSM to assess stability of each feature selection method.

235

3. Methodologies

241

We study the stability behavior of Tree-Lasso and compare it with various feature selection methods, namely T-test, Information gain, ReliefF and Lasso. In the following, we briefly describe Tree-Lasso and the other feature selection methods. Furthermore, we evaluate the predictive performance of obtained features using each feature selection method by using different types of classifiers such as logistic regression (LR), naive Bayes (NB), support vector machines (SVM), decision trees (DT) and Random Forest (RF). In Section 3.1 we briefly introduce feature selection methods used in this paper and in Section 3.2 we introduce classifiers used for evaluating predictive performance of each feature selection method.

242

3.1. Feature selection methods

254

3.1.1. T-test In large datasets in order to determine which input features are more important, feature ranking is often used. One of the most important feature ranking measures is T-test. It calculates a ratio between the difference of two class means and the variability of the two classes. Using this ratio we can assess whether the means of two classes are statistically different from each other. In a binary classification problem, for T-test we compute the following test statistic for each feature f j

255

f j0 f j1 tðf j Þ ¼ rffiffiffiffiffiffiffiffiffiffiffiffiffiffi ; s2j0

s2

236 237 238 239 240

243 244 245 246 247 248 249 250 251 252 253

256 257 258 259 260 261 262 263

264

ð7Þ

þ Nj11 N0

266

where f j0 and fj1 are the feature means for class 0 and class 1, sj0 and sj1 are the standard deviation of feature f j from class 0 and class 1 and N 0 and N1 are size of class 0 and class 1, respectively. In this method after calculating t for each feature, best features (those who have p-value 0.05) are selected as final feature set.

267

3.1.2. Information Gain Information Gain (IG) [7] is one of the most important feature ranking methods, which measures dependency between a feature and a class label. IG of jth feature f j and class y is calculated as

272

IGðyjf j Þ ¼ HðyÞ Hðyjf j Þ;

ð8Þ

where HðÞ is the entropy and is a measure of the uncertainty of a random variable. If we assume that we have a two-class classification problem, HðyÞ and Hðyjf j Þ are defined as follows:


223

268 269 270 271

273 274 275

276 278 279 280 281

282

YJBIN 2259


12 December 2014 Q1

4


HðyÞ ¼ ½Pðy ¼ 0ÞlogPðy ¼ 0Þ þ Pðy ¼ 1ÞlogPðy ¼ 1Þ

ð9Þ

284

Hðyjf j Þ ¼ Pðy ¼ 0jf j Þlog Pðy ¼ 0jf j Þ þ Pðy ¼ 1jf j Þlog Pðy ¼ 1jf j Þ:

285

In this method, for each feature we evaluate IG independently and top K features are selected as the final feature set.

286

ð10Þ

293

3.1.3. ReliefF Relief [24] is a supervised feature selection algorithm for binary classification problems. It randomly samples instances from the training data and for each sample computes the nearest instance of the same class called ‘‘near-hit’’ and the nearest instance of the different class called ‘‘near-miss’’. The score SðjÞ of the jth feature is updated in each iteration of algorithm as follows:

296

St ðjÞ ¼ St1 ðjÞ

287 288 289 290 291 292

294

297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312

313 315 316 317 318 319

320

321 323 324 325 326 327 328 329 330

331 332 333 334 335 336 337 338 339 340 341 342

dðxt nearHit t Þ dðxt nearMisst Þ þ ; n n

ð11Þ

where xt is the random instance at iteration t; n is the number of randomly sampled examples, and dðÞ is the Euclidean distance measure. Kononeko et al. [26] proposed ReliefF by using Manhattan (l1 ) norm instead of Euclidean (l2 ) norm for finding near-hit and near-miss. For selecting final feature set using ReliefF, we compute S score for each feature and select top K features with best S score as the final selected features. 3.1.4. Lasso Lasso is a regularization method that is used to learn a regularized regression/classification model that is sparse in the feature space [49]. Consider a supervised learning problem consisting of N training instances denoted as ðxðiÞ ; yðiÞ ; i ¼ 1; . . . N , where each P xðiÞ 2 R is a P-dimensional feature vector and yðiÞ 2 f0; 1g is a class label. For classification problems, Lasso is used with logistic regression [19], which models the probability distribution of the class label yðiÞ given a feature vector xðiÞ as

pðyðiÞ ¼ 1jxðiÞ ; bÞ ¼ rðbT xðiÞ Þ ¼

1 1 þ expðbT xðiÞ Þ

;

ð12Þ

P

where b 2 R is a parameter of the logistic regression model and rðÞ is the sigmoid function. The parameter b is also known as classification weight vector. The Lasso regularization acts by penalizing the sum of absolute value of weights, i.e. ‘1 -norm of b, denoted as jjbjj1 . The combined optimization function can be written as

n o min RNi¼1 log p yðiÞ ¼ 1jxðiÞ ; b þ k Rpj¼1 bj b

ð13Þ

where k is a non-negative regularization term. The solution of the above optimization does not have a closed form and is usually found iteratively by minimizing the cost function using pathwise co-ordinate optimization [13]. By increasing the regularization parameter k, Lasso increasingly shrinks the coefficients toward 0. A large enough k, makes some of the weights to become exactly zero. 3.1.5. Hierarchical features and Tree-Lasso In many applications, the features can be naturally represented as a tree structure, e.g. ICD-10 features in healthcare data form a tree. ICD-10 is ‘‘standard diagnostic tool for epidemiology, health management and clinical purposes’’.1 Fig. 2 shows a part of ICD-10 tree relevant to Cancer dataset used in this paper. The set of diseases shown here relates to the musculoskeletal system (ICD-10 codes: M00 up to M99). According to ICD-10 hierarchy, these diseases are classified into 6 groups and each of these groups are further classified into several subgroups, giving rise to a treestructure. We note that the grouping of codes is mostly based on disease similarity and co-occurrences causing correlations among features. Due to using a flat l1 -penalty on features, Lasso randomly 1

http://www.who.int/classifications/icd/en/.

selects only one feature from every such correlated set. Although Lasso mechanism for feature selection results in selecting less features, it causes that this method to be unstable in selecting important features. This drawback of Lasso is undesirable in many real-world applications such as clinical prediction. For classification and regression problems having hierarchical features, a more suitable model is the Tree-Lasso [32] as it can exploit the feature correlations in the form of a tree-structure. In this context, the definition of a tree is as follows. For a tree T of depth d, all the nodes corresponding to depth i are in T i ¼ fGi1 ; Gi2 ; . . . ; Gini g, where Gij denotes the jth node at depth i; n0 ¼ 1; G01 ¼ f1; 2; . . . ; pg and ni P 1; i ¼ 1; 2; . . . ; d. The nodes must satisfy the following two conditions:

343

1. The nodes at the same depth should not have overlapping indices. 2. The index set of a child node is a subset of its parent node.

356

345 346 347 348 349 350 351 352 353 354 355

357 358 359

Given the above definition of feature tree, Tree-Lasso learns the classification weight vector b by minimizing the following cost function

n o min RNi¼1 log p yðiÞ ¼ 1jxðiÞ ; b þ k/ðbÞ b

ð14Þ

where k is a non-negative regularization parameter. The regularization term /ðbÞ is given by

ni /ðbÞ ¼ Rdi¼1 Rj¼1 wij bGi

ð15Þ

j

360 361 362

363 365 366 367

368 370

where b 2 RP is the weight for node Gij , and bGi is a vector composed j of the entries of b with the indices in Gij . The other parameter in the i regularization term is wj ði ¼ 0; 1; . . . ; d; j ¼ 1; 2; . . . ; ni Þ, which is a predefined weight for the node Gij . As mentioned in [32], this parameter can be set according to importance of feature groups. In our application, since we do not have any prior knowledge about importance of feature groups, we use xij ¼ 1 for all the groups. To solve the problem efficiently, the term /ðbÞ is reformulated through Moreau–Yosida regularization as /k ðv Þ ¼

o n

minb 12 kb v k2 þ kRi Rj wij bGi for some k > 0. It has been shown

371

that the above problem admits an analytical solution. For details of the minimization, we refer the reader to [32].

381

3.2. Classification methods

383

3.2.1. Logistic regression Logistic regression is a linear classifier that models the posterior probabilities of the K classes via linear function in an example x. In logistic regression, the parameters of the model can be interpreted as changes in log odds and also the results can be interpreted in terms of probabilities [19]. Hence, logistic regression is a widely used classifier in medical domain [1,18,47]. Lasso and Tree-Lasso has built-in logistic regression algorithms. Therefore, these methods can perform feature selection and prediction, simultaneously.

384

3.2.2. Naive Bayes Naive Bayes (NB) is a probabilistic classifier based on Bayes theorem [48]. It assumes that given a class label, all the features are independent and posterior probability that an example x 2 RP is classified to class c is obtained as

393

j

P Y PrðC ¼ cjxÞ / PrðC ¼ cÞ Prðxj jC ¼ cÞ:

372 373 374 375 376 377 378 379 380 382

385 386 387 388 389 390 391 392

394 395 396 397

398

ð16Þ

j¼1

Despite its unrealistic independence assumption, research shows that naive Bayes often works well in practice [42]. furthermore, due to its independence assumption, in naive Bayes the


344

400 401 402 403

YJBIN 2259


12 December 2014 Q1


5

Fig. 2. Part of the ICD-10 tree constructed for the Cancer dataset used in this paper.

404 405 406

407 408 409 410 411 412 413 414 415 416 417 418

419 420 421 422 423 424 425 426 427 428 429 430 431 432

433 434 435 436 437 438 439 440 441 442 443

number of parameters (which is equal to the number of features) do not depend on the number of examples. This property helps naive Bayes to scale well for large problems. 3.2.3. Support vector machines Support vector machines (SVM) are a type of classifiers that work based on the principle of structural risk minimization (SRM) [6,50,51]. They have some advantages such as ability to handle large feature spaces, and avoidance of overfitting [55]. SVM uses inner product to measure the similarity or distance between patterns, which is known as kernel function. In our experiments, we use Gaussian RBF kernels, where the kernel width r is of values f0:001; 0:005; 0:01; 0:05; 0:1; 0:5; 1; 2g and the value of box constraint C is varied between 109 to 105 by factors of ten. The best parameters of r and C are obtained using 5-fold cross validation. 3.2.4. Decision trees Decision trees (DT) are well-known classification methods in the field of machine learning. Popular decision tree algorithms include ID3, C4.5, C5, and CART [4,38,39]. Decision trees recursively partition the data based on its features to construct a tree for the purpose of improving prediction accuracy. To achieve this, they use mathematical algorithms such as information gain (used in ID3, C4.5, C5), Gini index (used in CART), and Chi-squared test (used in CHAID) to specify the variable and its threshold that splits the data into two or more subgroups. The splitting of the data is repeated until the complete tree is constructed. Based on the favorable predictive performance, obtained from preliminary runs, in this study we chose CART as our decision tree method. 3.2.5. Random Forest Random Forest (RF) is an ensemble classifier that generates multiple decision trees and aggregates their results [3]. Each tree is trained on a bootstrap sample of the training data. In addition, a subset of features is randomly selected to consider at each node of each decision tree. To classify an example, decisions (votes) of all trees in the forest are aggregated and the majority voting of the trees is considered as the output of the classifier. In this paper we grow 100 trees in the forest and the number of features at each split is chosen as the square root of the number of features.

4. Experiments

444

In our experiments, we have used both synthetic and real-world datasets and demonstrate the effectiveness of the proposed TreeLasso by carrying out the following comparisons.

445

We show that stability behavior of Tree-Lasso is better compared to severalbaseline feature selection algorithms namely, T-test, IG, ReliefF and Lasso. We compare the predictive performance of Tree-Lasso with other baseline feature selection algorithms by using them with different classifiers namely, logistic regression, naive Bayes, SVM, decision tree and Random Forest and show that under theconstraint of stable feature selection, Tree-Lasso prediction performance is constantly better than that of other baselines. As an extra evaluation, we compare stability and predictive performance of Tree-Lasso (with built-in logistic regression) with Random Forest, which is a well-known embedded type feature selection method in machine learning and show that Tree-Lasso achieves better results in terms of both stability and prediction. We show that the features obtained using Tree-Lasso for realworld datasets are consistent with the well-known risk factors used by experts in clinical domain.

448

447

449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465

4.1. Datasets

466

4.1.1. Synthetic datasets To illustrate the stability behavior of different feature selection algorithms, we generate a synthetic data where features are grouped hierarchically in a tree structure. To keep the matter simple, we confine ourselves to shallow 2-level trees. In order to generate data, we fix the number of leaf nodes, referred to as groups. Each such group (or node) contains a set of variables such that the variables within a group are correlated to one another while uncorrelated with the variables from other groups. This is done by defining a correlation matrix C such that its ði; jÞ-th element contains the correlation coefficient between i-th and j-th variables. Formally, the correlation matrix is defined as

467

C i;j ¼

q i; j belong to the same group 0

otherwise

ð17Þ

In order to generate upper layers of the tree, correlation between each pair of groups from the lower layer is calculated. For each group, we find the other group having the highest correlation with


446

468 469 470 471 472 473 474 475 476 477 478

479 481 482 483 484

YJBIN 2259


12 December 2014 Q1 485 486 487 488 489

6


it and connect them to construct the upper layer of the tree. Given the above correlation matrix, the feature vector xðiÞ is generated using a multivariate normal distribution having mean zero and covariance C, i.e. xðiÞ N ð0; CÞ; i ¼ 1; . . . N. The true parameter vector b is

490 492 493 494

50times

Given i-th data vector xðiÞ and the weight vector b, the label is generated as following

¼ sign bT xðiÞ þ ;

yðiÞ

498

For the results reported in this paper, we simulate 100 variables, grouped into 4 leaf nodes, i.e. the first 25 variables are part of group-1, the next 25 variables are part of group-2 and so on. Using these features, we generate 200 data samples. We generate two such datasets: one with low correlation (q ¼ 0) and the other with high correlation (q ¼ 0:8).

501 502 503

504 505 506 507 508

509 510 511 512 513 514 515 516 517 518

519 520 521 522 523 524 525

N ð0; 0:1Þ:

ð18Þ

4.1.2. Real-world datasets For experiments with real-world data, we used two hospital patient cohorts: Cancer and Acute Myocardial Infarction (AMI). A summary of statistics of the two datasets is provided in Table 1 and their details are described below: 4.1.2.1. Cancer dataset. This dataset is obtained from a large regional hospital in Australia.2 There are eleven different cancer types in this data recorded from patients visiting the hospital during 2010– 2012. Patient data is acquired from Electronic Medical Records (EMR). The dataset consists of 4293 patients and their disease condition is described using 439 ICD codes (or features). Using this dataset, our goal is to predict 1 year mortality of patients while ensuring the stable feature sets. We note that feature stability is crucial for clinical decision making towards cancer prognosis. This dataset has been previously used in [16]. 4.1.2.2. Acute Myocardial Infarction (AMI) Dataset. This dataset is also obtained from the same hospital in Australia. The dataset involves patients admitted with AMI conditions and discharged later between 2007–2011. The task is to predict if a patient will be re-admitted to the hospital within 30 days. The dataset consists of 2941 patients and their disease condition is described using 528 ICD codes. This dataset has been previously used in [41].

526

4.2. Evaluation measures

527

4.2.1. Stability measures In order to compare the stability of Tree-Lasso with other feature selection algorithms, we use two different stability measures, Spearman’s rank correlation coefficient (SRCC), and Jaccard similarity measure (JSM), These stability measures are described in Section 2.

528 529 530 531 532

533 534 535 536 537 538 539 540

# Samples

# ICD 10 Codes

Cancer AMI

4293 2941

439 528

50times

497

500

Name

b ¼ ð0; 0; . . . ; 0; 1; 1; . . . 1Þ: |fflfflfflfflfflfflffl{zfflfflfflfflfflfflffl} |fflfflfflfflfflffl{zfflfflfflfflfflffl}

495

499

Table 1 Statistics of the real-world datasets.

4.2.2. Classification performance measure To compare the prediction performance of Tree-Lasso with other feature selection methods, we use the area under the receiver operating characteristic (ROC) curve, further abbreviated as AUC [12]. Due to its robustness across both balanced and imbalanced datasets, AUC is commonly used in clinical decision making and is becoming increasingly popular in pattern recognition community [8,53]. 2

Ethics approval obtained through university and the hospital 12/83.

4.3. Experimental settings

541

The Tree-Lasso model built in our experiments is based on ICD-10 codes that have intrinsic hierarchical structure. For example as it can be seen from Fig. 2 of the paper, ICD-10 codes of musculoskeletal system is in range of M00 up to M99. According to ICD hierarchy, the next level of ICD tree further classifies these diseases (codes) into 6 groups ðM00—M25; M30—M36; M40—M54; M60—M79; M80—M94 and M95—M99Þ. These groups are also further classified into several subgroups. Essentially, at the leaf node, we have individual features that are grouped progressively in parent nodes as we move up in the ICD tree. So, feature encoding is such that if a patient record contains all the offspring of some parent node, we also set the parent node to 1 along with setting all offspring nodes to 1. To assess the variability of the experiment, we randomly divide our data into two sets: 70% of our data is considered as training set and 30% as test set. For each random split, we further split the training set into two sets: derivation set (80% of the training set) and a validation set (20% of the training set). In order to be able to select the best features, this second split is randomly repeated 100 times to generate 100 sets of derivation-validation pairs. We train all the models using each derivation set while selecting the best model parameters through model performance on the corresponding validation set. This process provides us 100 feature sets. Using the ensemble of 100 feature sets, we empirically estimate the probability of presence for each feature. Given these probability estimates, we re-train a model using derivation dataset and including only those features that occur with at least probability p (a threshold that we gradually increase). Using 30% held out test set, the predictive performance of the model is evaluated using AUC while the stability of the model is computed using SRCC, and JSM.

542

4.4. Experimental results

573

In this section, we evaluate the stability performance of Tree-Lasso and compare it with other baseline feature selection methods. We also investigate the classification performance of each algorithm with different classifiers and measure their performance in terms of AUC. In addition, we study the consistency of the features obtained using Tree-Lasso for Cancer and AMI datasets with well-known risk factors in clinical domain.

574

4.4.1. Stability performance of feature selection methods Table 2 shows the stability results in terms of SRCC and JSM, for Tree-Lasso and baseline algorithms. In case of synthetic data, TreeLasso achieves the best stability performance in terms of both SRCC and JSM. The high value of SRCC in Tree-Lasso means that for different training sets the ranks of features does not vary a lot. On the other hand, the high value of JSM means that the feature set selected does not change significantly. In terms of JSM TreeLasso achieves the stability of 0.7830 (when q ¼ 0Þ and 0.8777 (when q ¼ 0:8) for synthetic data which is higher compared to the other methods. In case of Cancer dataset, Tree-Lasso again shows the best stability performance, in terms of SRCC. The SRCC results for T-test, IG and Lasso is poor, while that of ReliefF is somewhat average.

581


543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572

575 576 577 578 579 580

582 583 584 585 586 587 588 589 590 591 592 593 594

YJBIN 2259


12 December 2014 Q1

Q2

600 601 602 603 604 605 606

AMI dataset

SRCC JSM

0.2853 0.5088

0.4878 0.5656

0.1204 0.4553

0.4537 0.5258

IG

SRCC JSM

0.2528 0.6621

0.5030 0.6453

0.2164 0.7863

0.5150 0.6378

ReliefF

SRCC JSM

0.5177 0.5184

0.5373 0.5008

0.5362 0.7576

0.5275 0.6635

Lasso

SRCC JSM

0.1278 0.5773

0.4374 0.4080

0.2330 0.5542

0.4738 0.5028

Tree-Lasso

SRCC JSM

0.6670 0.7830

0.6282 0.8777

0.7458 0.8850

0.6274 0.7147

When we turn to JSM Tree-Lasso is again the winner (0.7910) followed by IG (0.7863). The other methods achieve JSM value of 0.7576 (ReliefF), 0.5542 (Lasso) and 0.4553 (T-test). For the AMI dataset, Tree-Lasso is once again the winner with SRCC = 0.6274 and JSM = 0.7147, followed by ReliefF and IG. The SRCC and JSM for T-test and Lasso are significantly lower. In order to have a better understanding of the stability behavior of different feature selection methods for the datasets that containcorrelated variables i.e. synthetic data (q ¼ 0:8), Cancer data and AMI data, we show top ten features selected by various methods in different splits of data in Figs. 3–5, respectively. From these figures not only we can visually compare the stability of different

methods but also can infer which features are considered important by each algorithm. To better distinguish between stable features in these plots we use a threshold T and features that are selected with a probability more than T are shown in black color while others are in gray color. So more stable feature selection methods will have more number of black lines and less number of gray points. In our experiments, we set T ¼ 0:5. In Fig. 3 (results for synthetic data, q ¼ 0:8) features within a group are highly correlated to one another and as expected Lasso shows an unstable behavior in selecting features. On the other hand, Tree-Lasso is the most stable algorithm. Moreover, it could correctly infer the true features of the model. In this dataset,

T-Test

100 80

80

60 40 20 0

IG

100

Splits

599

Cancer dataset

60 40 20

0

20

40

60

80

0

100

0

20

40

Features ReliefF

100

80

100

80

100

Lasso

100

80

80

60 40 20 0

60

Features

Splits

598

Synthetic data (q ¼ 0:8)

60 40 20

0

20

40

60

80

0

100

0

20

40

Features

60

Features Tree-Lasso

100 80

Splits

597

Synthetic data (q ¼ 0Þ T-test

Splits

596

Table 2 Comparison of Tree-Lasso with baselines in terms of different stability measures for both synthetic and real-world datasets.

Splits

595

7


60 40 20 0

0

20

40

60

80

100

Features Fig. 3. Stability results for synthetic dataset ðq ¼ 0:8Þ for 10 selected features. In these plots we use a threshold T ¼ 0:5 and features that are selected with a probability more than T are shown in black color while others in gray color.


607 608 609 610 611 612 613 614 615 616 617 618

YJBIN 2259


12 December 2014 Q1

8


T-Test

100

80

Splits

Splits

80 60 40 20 0

IG

100

60 40 20

0

100

200

300

0

400

0

100

200

Features ReliefF

100

300

400

80

Splits

Splits

400

Lasso

100

80 60 40 20 0

300

Features

60 40 20

0

100

200

300

0

400

0

100

200

Features

Features Tree-Lasso

100

Splits

80 60 40 20 0

0

100

200

300

400

Features Fig. 4. Stability results for Cancer dataset for 10 selected features. In these plots we use a threshold T ¼ 0:5 and features that are selected with a probability more than T are shown in black color while others in gray color.

619 620 621 622 623 624 625 626 627 628 629 630

631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646

ReliefF, IG and T-test are in the next stages of stability after TreeLasso. However, they are unable to select true features of the model. Figs. 4 and 5 show the stability behavior of each feature selection method on Cancer and AMI datasets. As it is illustrated in Fig. 4 for Cancer dataset, again Tree-Lasso is the winner followed by IG and ReliefF. T-test and Lasso show the least stable behavior in selecting features in this dataset. The visual inspection of this figure also shows that the features selected by Tree-Lasso, IG and ReliefF are approximately similar. For AMI dataset (Fig. 5), Tree-Lasso shows the best stability followed by ReliefF and IG. Again, Lasso achieves the least stability followed by T-test. 4.4.2. Classification performance In order to compare discrimination performance of Tree-Lasso with other baseline feature selection methods, we apply the features obtained using each feature selection algorithm to different classifiers e.g. logistic regressin (LR), naive Bayes (NB), SVM, decision trees (DT) and Random Forest (RF). As we explained in Section 4.3, after estimating the probability of presence for each feature, we re-train the model using derivation set and include only those features that occur with at least probability p (a threshold that we gradually increase). In our experiments we consider features with p ¼ 0:6 to 1 with a step of 0.1. This is done to show the prediction performance under average-to-high stability constraints. We evaluate the predictive performance of each method by using 30% held out set and report it in terms of AUC. The classification performance of various algorithms is shown in Figs. 6–9. As it can be seen from these figures, irrespective of

the classifier type used, AUC of Tree-Lasso is always the best and in most of the cases followed by Lasso. In terms of classifier used for each feature selection method, we can see that on average the best predictive performance is obtained using Random Forest followed by SVM and logistic regression. In addition, when we increase the stability threshold from 0.6 to 1, the AUC performance of Tree-Lasso remains stable. However, the performance of other algorithms varies a lot and that of Lasso drops suddenly. The sudden drop in performance of Lasso is due to underfitting caused by its inability to select sufficient number of stable features.

647

4.4.3. Comparison with Random Forest (as an embedded feature selection method) Random Forest is a promising classification method that can perform feature selection and prediction, simultaneously. In order to study and compare the stability and predictive performance of Tree-Lasso with Random Forest, we grow 100 trees in the forest and the number of features at each split is chosen as the square root of the number of features. The experimental settings is identical to the settings described in Section 4.3. In order to compare the stability of Tree-Lasso with Random Forest in selecting features, we use SRCC and JSM. These metrics are explained in details in Section 2. Table 3 shows the stability results for TreeLasso and Random Forest. As it can be seen from this table, for all datasets (synthetic and real) the stability of Tree-Lasso is better than Random Forest, in terms of SRCC and JSM. The other comparison between Tree-Lasso and Random Forest is based on their predictive performance. To this end, obtaining the probability of presence of each feature as described in Section 4.3,

657


648 649 650 651 652 653 654 655 656

658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674

YJBIN 2259


12 December 2014 Q1

9


T-Test

100

80

Splits

Splits

80 60 40 20 0

IG

100

60 40 20

0

100

200

300

400

500

0

600

0

100

200

Features ReliefF

100

500

600

500

600

80

Splits

Splits

400

Lasso

100

80 60 40 20 0

300

Features

60 40 20

0

100

200

300

400

500

0

600

0

100

200

Features

300

400

Features Tree-Lasso

100

Splits

80 60 40 20 0

0

100

200

300

400

500

600

Features Fig. 5. Stability results for AMI dataset for 10 selected features. In these plots we use a threshold T ¼ 0:5 and features that are selected with a probability more than T are shown in black color while others are in gray color.

675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693

694 695 696 697 698 699 700 701 702

the model is trained again using derivation set by including only those features that occur with at least probability p (a threshold that we gradually increase). In our experiments we consider features with p ¼ 0:6 to 1 with a step of 0.1. Using 30% held out test set, the predictive performance of the model is reported in terms of AUC. Figs. 10 and 11 show the predictive performance of Tree-Lasso compared to Random Forest. As it can be seen from these figures, the predictive performance of Random Forest and Tree-Lasso are approximately the same when features with average stability are used (with p ¼ 0:6 and 0.7). However, when the stability threshold is increased, the AUC of Random Forest declines steadily while that of Tree-Lasso remains stable. In synthetic data with q ¼ 0, where the average correlation between groups of variables is around zero, reduction of AUC performance in Random Forest is less compared to other datasets (with correlated groups of variables). This shows that although Random Forest is a good classifier and shows good performance in many applications, its performance degrades in presence of correlated features. On the other hand, Tree-Lasso shows acceptable predictive performance in presence of correlated variables. 4.4.4. Comparison with Expanded-Lasso One way to deal with instability of Lasso in selecting informative features can be through expanding the selected feature set by including the features that are unselected but correlated to one or more features in the selected set. We refer to this heuristic-based method asExpanded-Lasso and compare its feature selection stability and predictive performance with those of Lasso and Tree-Lasso. Using our synthetic and real-world datasets (same as used above), we split data into training and test sets. The model

is trained on the training set and evaluated on the test set. In order to specify the tuning parameters of each method, we use 5-fold cross validation. In order to compare the feature selection stability of Expanded-Lasso with Lasso and Tree-Lasso, as before, we use two stability measures: JSM and SRCC. The explanation of these metrics can be found in Section 2. Based on the fact that for Expanded-Lasso method we need to specify the level of correlation between selected and unselected features, in our experiments we use three different thresholds, i.e. 0.7, 0.8, and 0.9. Tables 4–7 compare Expanded-Lasso with Tree-Lasso and Lasso in terms of feature stability and predictive performance. As seen from the tables, in terms of both JSM and SRCC Tree-Lasso is the winner, followed by Expanded-Lasso. Turning to predictive performance, again Tree-Lasso achieves the best AUC and ExpandedLasso is the runner-up. The reason behind better performance of Tree-Lasso compared to Expanded-Lasso is because of its ability to use intrinsic hierarchical information (correlation) between ICD-10 features, whereas Expanded-Lasso needs to estimate this information. However, in problems where no information about hierarchical structure of the features is available, Expanded-Lasso can be used as a remedy to increase the stability of Lasso in selecting informative features.

703

4.4.5. Risk factors obtained using Tree-Lasso Identifying stable features (risk factors) can assist clinical decision making towards accurate medical prognosis. In Tables 8 and 9, we show that risk factors selected using Tree-Lasso (with high probability) for both Cancer and AMI datasets are consistent with well-known risk factors used in clinical domain

725


704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724

726 727 728 729 730

YJBIN 2259


12 December 2014 Q1

10


LR

NB

0.8

AUC

AUC

0.7 0.7 0.6 0.5 0.6

0.7

0.8

0.9

0.5 0.6

1

0.8

0.9

probability of feature presence

SVM

DT

1

0.7

0.7

AUC

AUC

0.7


0.8

0.6 0.5 0.6

0.6

0.7

0.8

0.9

0.6 0.5 0.6

1


0.7

0.8

0.9

1

probability of feature presence RF TTest IG ReliefF Lasso TreeLasso

AUC

0.8 0.7 0.6 0.6

0.7

0.8

0.9

1

probability of feature presence Fig. 6. Predictive performance of different classification methods coupled with each feature selection method for synthetic data (q ¼ 0).

LR

NB

0.8

AUC

AUC

0.7 0.7 0.6

0.65 0.6 0.55

0.5 0.6

0.7

0.8

0.9

0.5 0.6

1


0.7

SVM

1

DT

0.7

AUC

AUC

0.9

0.8

0.8

0.6 0.5 0.6

0.8


0.7

0.8

0.9

1

0.7 0.6 0.5 0.6


0.7

0.8

0.9

1

probability of feature presence RF

AUC

0.8

TTest IG ReliefF Lasso TreeLasso

0.7 0.6 0.5 0.6

0.7

0.8

0.9

1

probability of feature presence Fig. 7. Predictive performance of different classification methods coupled with each feature selection method for synthetic data (q ¼ 0:8).


YJBIN 2259


12 December 2014 Q1

11


LR

NB

0.8 0.7

AUC

AUC

0.7 0.6 0.5 0.6

0.7

0.8

0.9

0.6 0.5

1

0.6

0.7


AUC

0.9

1

DT

SVM 0.8

0.8

0.7

AUC

0.7 0.6 0.6

0.8


0.7

0.8

0.9

0.6 0.5

1

0.6

0.7


0.8

0.9

1


AUC

0.8 0.7 0.6 0.6

0.7

0.8

0.9

1

probability of feature presence Fig. 8. Predictive performance of different classification methods coupled with each feature selection method for Cancer data.

LR

NB 0.6

0.6

AUC

AUC

0.65

0.55 0.5 0.6

0.7

0.8

0.9

0.55 0.5

1

0.6

probability of feature presence SVM

0.6

0.5 0.6

0.7

0.8

0.8

0.9

1

DT

0.7

AUC

AUC

0.7

0.7


0.9

0.6

0.5 0.6

1


0.7

0.8

0.9

1


AUC

0.7 0.6 0.5 0.6

0.7

0.8

0.9

1

probability of feature presence Fig. 9. Predictive performance of different classification methods coupled with each feature selection method for AMI data.

731 732 733 734 735

[5,9,11,28,30,36,40,41,58]. Columns 1 and 2 of the tables show the risk factors ICD-10 code and names, respectively and column 3 shows the probability of presence of the risk factor in each split of data. For example, acute respiratory failure (J96.0) in Cancer dataset with probability equal to one means that this important

risk factor (based on clinical research papers) is also considered important by Tree-Lasso and is selected in every splits of the data. We also study why it matters that selected features be stable when the prediction accuracy is good. To this end, we investigate importance of stability in two ways:


736 737 738 739 740

YJBIN 2259


12 December 2014 Q1

12


Table 3 Comparison of Tree-Lasso with Random Forest in terms of different stability measures for both synthetic and real-world datasets. Synthetic data (q ¼ 0)

Synthetic data (q ¼ 0:8)

Cancer dataset

AMI dataset

Random Forest

SRCC JSM

0.4972 0.6535

0.2635 0.4024

0.2273 0.5624

0.3175 0.5276

Tree-Lasso

SRCC JSM

0.6670 0.7830

0.6282 0.8788

0.7458 0.8850

0.6274 0.7147

(a) 0.8 0.78

AUC

0.76 0.74 0.72 Tree−Lasso Random Forest

0.7 0.68

0.6

0.65

0.7

0.75

0.8

0.85

0.9

0.95

1

0.9

0.95

1


(b) 0.8 AUC

0.75 0.7 0.65 Tree−Lasso Random Forest

0.6 0.55

0.6

0.65

0.7

0.75

0.8

0.85

probability of feature presence Fig. 10. Predictive performance of Tree-Lasso compared to Random Forest for synthetic datasets. (a) q ¼ 0 and (b) q ¼ 0:8.

(a) 0.8 AUC

0.75

0.7 Tree−Lasso Random Forest

0.65 0.6

0.65

0.7

0.75

0.8

0.85

0.9

0.95

1

0.9

0.95

1


AUC

(b) 0.7 0.65 0.6 0.55 0.5 0.6

Tree−Lasso Random Forest

0.65

0.7

0.75

0.8

0.85

probability of feature presence Fig. 11. Predictive performance of Tree-Lasso compared to Random Forest for real-world datasets. (a) Cancer data and (b) AMI data.

741 742 743 744 745

4.4.5.1. Consistency over time. In some applications such as healthcare, it is important that the obtained features to be interpretable over time. For example, we need to attribute the disease of a patient to certain risk factors consistently over time. However, in presence of correlated features, feature selection methods such

as Lasso may select some features off and on, causing confusions and suspicions about the model. As an example, consider correlated features I20, I21 and I25 in AMI data that are related toischaemic heart disease and we expect that these features are always selected together. However, as it is shown in Table 10, although


746 747 748 749 750

YJBIN 2259


12 December 2014 Q1

Table 4 Comparison of Tree-Lasso and Lasso with Expanded-Lasso in terms of feature stability and predictive performance on synthetic data (q ¼ 0). Method

Stability

Expanded_Lasso (Threshold = 0.7) Expanded_Lasso (Threshold = 0.8) Expanded_Lasso (Threshold = 0.9) Lasso Tree-Lasso

JSM

SRCC

Predictive performance AUC

0.6693

0.2435

0.7436

0.6247

0.1976

0.7281

0.5925

0.1624

0.7137

0.5773 0.7830

0.1278 0.6670

0.7081 0.8264

Table 5 Comparison of Tree-Lasso and Lasso with Expanded-Lasso in terms of feature stability and predictive performance on synthetic data (q ¼ 0:8). Method

Stability


JSM

SRCC


0.5753

0.5173

0.7463

0.5162

0.5027

0.7164

0.4587

0.4736

0.7041

0.4080 0.8777

0.4374 0.6282

0.6836 0.8038

Method


Stability

Predictive performance

JSM

SRCC


752 753

754 755 756

ICD-10 code

Risk factor

Probability of presence (obtained by Tree-Lasso)

D24 R11 R06.0 R63.0 R53 K59.0 R19.7 F32 G47.0 E11

Benign neoplasm of breast [36] Nausea and vomiting [30,58] Dyspnea [30,34] Anorexia [34] Fatigue [58] Constipation [30] Diarrhea [30] Depression [30,58] Insomnia [30] Type II diabetes mellitus [57,58]

1.00 1.00 1.00 0.98 0.95 0.91 0.90 0.87 0.85 0.85

both related to heart failure and so correlated. However, I50 is a

757

0.6425

0.3836

0.7538

0.6027

0.3178

0.7315

0.5726

0.2763

0.7136

0.5542 0.8850

0.2330 0.7458

0.6925 0.7982

ICD-10 code

Feature’s name

Probability of presence (obtained by Tree-Lasso)

I50 N17 I46 I20 I21 I25 J18 E11 I10 E83.4 R07 E78.0

Heart failure [5,11,14,41] Acute renal disorder [11,14,41] Cardiac arrest [28] Angina pectoris [5,14] Acute Myocardial Infarction [5] Chronic ischaemic heart disease [5] Pnemonia [11,14] Type II diabetes mellitus [14] Hypertension [5,9,28] Cardiorespiratory failure [11] Pain in chest [28] Hypercholesterolemia [9,41]

1.00 1.00 1.00 1.00 1.00 1.00 0.98 0.95 0.90 0.87 0.84 0.80

AUC

Table 7 Comparison of Tree-Lasso and Lasso with Expanded-Lasso in terms of feature stability and predictive performance on AMI data. Method

Table 8 Well-known risk factors for cancer reported by clinicians or other research papers, which are also obtained by Tree-Lasso with high probability.

Table 9 Well-known risk factors of readmission after AMI reported by clinicians or other research papers, which are also obtained by Tree-Lasso with high probability.

Table 6 Comparison of Tree-Lasso and Lasso with Expanded-Lasso in terms of feature stability and predictive performance on Cancer data.

751

13


Stability JSM

SRCC


0.6763

0.5531

0.5853

0.6228

0.5129

0.5561

0.5727

0.4836

0.5398

0.5028 0.7147

0.4738 0.6274

0.5342 0.6756

Lasso selects I21 and I25 consistently, it selects I20 only 38% of the times. This may lead to a confusion about predictive value of I20 for AMI related hospital readmissions. 4.4.5.2. Choosing the best explanatory features. By using an stable feature selection method, our goal is to choose thebest explanatoryfeatures. For example, in AMI dataset features I50 and I51 are

Table 10 An example that shows importance of stability in selecting correlated features. Features I20, I21 and I25 belong to ischaemic heart diseases. Features I50 and I51 are related to heart failure. ICD-10 code

Feature’s name

Probability of feature presence (Lasso)

Probability of feature presence (Tree-Lasso)

I20 I21

Angina pectoris Acute Myocardial Infarction Chronic ischaemic heart disease Heart failure Complications of heart disease

0.38 1

1 1

1

1

0 1

1 0

I25 I50 I51

basic feature used to code ‘‘heart failure’’ and I51 is a more specialized feature that gives details of heart failure i.e. ‘‘complications of heart disease’’. Based on the features used by clinicians I50 is more important feature than I51 and selecting latter where the former is not selected would be meaningless. As it can be seen from Table 10, Lasso chooses I51 and ignores I50 (that is more important feature). However, this is not the case in Tree-Lasso.

758

5. Conclusion

765

In this paper, we study the stability behavior of Tree-Lasso – a supervised learning model that is used when features are hierarchical in nature and form a tree structure. We compare its stability

766


759 760 761 762 763 764

767 768

YJBIN 2259


12 December 2014 Q1

14


786

and prediction performance with other feature selection algorithms, T-test, Information Gain, ReliefF and Lasso. Using a synthetic and two real-world datasets (Cancer and Acute Myocardial Infarction), we show that Tree-Lasso based feature selection is significantly more stable than Lasso and comparable to other methods e.g. Information Gain, ReliefF and T-test. We further show that, using different types of classifiers such as logistic regression, Naive Bayes, support vector machines, decision trees and Random Forest, the classification performance of Tree-Lasso is comparable to Lasso and better than other methods. Our result has implications in identifying stable risk factors for many healthcare problems and therefore assists clinical decision making towards accurate medical prognosis. As a future work, it would be interesting to explore the possibilities of extending our framework to a multiple hospital setting with an aim to reduce sample selection bias of a single hospital using the ideas of supervised and unsupervised transfer learning such as shared subspace learning and multi-task learning [15,17].

787

References

788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846

[1] Altman DG. Practical Statistics for Medical Research. CRC Press; 1990. [2] Bach FR. Bolasso: model consistent lasso estimation through the bootstrap. Proceedings of the 25th International Conference on Machine Learning. ACM; 2008. p. 33–40. [3] Breiman L. Random forests. Mach Learn 2001;45:5–32. [4] Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and Regression Trees. CRC press; 1984. [5] Brown JR, Conley SM, Niles NW. Predicting readmission or death after acute stelevation myocardial infarction. Clin Cardiol 2013;36:570–5. [6] Burges CJ. A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov 1998;2:121–67. [7] Cover TM. Ja thomas elements of information theory; 1991. [8] Davis J, Goadrich M. The relationship between precision-recall and roc curves. In: Proceedings of the 23rd international conference on machine learning. ACM; 2006. p. 233–40. [9] Desai MM, Stauffer BD, Feringa HH, Schreiner GC. Statistical models and patient predictors of readmission for acute myocardial infarction a systematic review. Circ Cardiovasc Qual Out 2009;2:500–7. [10] Duda RO, Hart PE, Stork DG. Pattern Classification. New York: John Wiley; 2001. Section 10, l. [11] Dunlay SM, Weston SA, Killian JM, Bell MR, Jaffe AS, Roger VL. Thirty-day rehospitalizations after acute myocardial infraction: a cohort study. Ann Internal Med 2012;157:11–8. [12] Egan JP. Signal detection theory and ROC analysis; 1975. [13] Friedman J, Hastie T, Höfling H, Tibshirani R. Pathwise coordinate optimization. Ann Appl Stat 2007;1:302–32. [14] Grady JN, Bhat KR, Desai MM, Grosso L, Lin Z, Parzynski C, Strait K, Wang Y. 2012 measures maintenance technical report: acute myocardial infarction, heart failure, and pneumonia 30-day risk-standardized readmission measure; 2012. [15] Gupta S, Phung D, Venkatesh S. Factorial multi-task learning: a bayesian nonparametric approach. In: Proceedings of the 30th international conference on machine learning (ICML-13); 2013. p. 657–65. [16] Gupta S, Tran T, Luo W, Phung D, Kennedy RL, Broad A, et al. Machine-learning prediction of cancer survival: a retrospective study using electronic administrative records and a cancer registry. BMJ Open 2014;4:e004007. [17] Gupta SK, Phung DQ, Venkatesh S. A bayesian nonparametric joint factor model for learning shared and individual subspaces from multiple data sources. In: SDM. SIAM; 2012. p. 200–11. [18] Harrell FE, Lee KL, Califf RM, Pryor DB, Rosati RA. Regression modelling strategies for improved prognostic prediction. Stat Med 1984;3:143–52. [19] Hastie T, Tibshirani R, Friedman J, Franklin J. The elements of statistical learning: data mining, inference and prediction. Math Intell 2005;27:83–5. [20] Jacob L, Obozinski G, Vert JP. Group lasso with overlap and graph lasso. In: Proc of the 26th international conference on machine learning. ACM; 2009. p. 433–40. [21] Jensen PB, Jensen LJ, Brunak S. Mining electronic health records: towards better research applications and clinical care. Nat Rev Genet 2012;13: 395–405. [22] Kalousis A, Prados J, Hilario M. Stability of feature selection algorithms: a study on high-dimensional spaces. Knowl Inform Syst 2007;12:95–116. [23] Khoshgoftaar TM, Fazelpour A, Wang H, Wald R. A survey of stability analysis of feature subset selection techniques. In: 2013 IEEE 14th international conference on information reuse and integration (IRI). IEEE; 2013. p. 424–31. [24] Kira K, Rendell LA. A practical approach to feature selection. In: Proceedings of the ninth international workshop on machine learning. Morgan Kaufmann Publishers Inc.; 1992. p. 249–56. [25] Kittler J et al. Feature set search algorithms. Pattern Recognit Signal Process 1978:41–60.

769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785

[26] Kononenko I, Šimec E, Robnik-Šikonja M. Overcoming the myopia of inductive learning algorithms with relieff. Appl Intell 1997;7:39–55. [27] Krˇízˇek P, Kittler J, Hlavácˇ V. Improving stability of feature selection methods. In: Computer analysis of images and patterns. Springer; 2007. p. 929–36. [28] Krumholz HM, Chen J, Chen YT, Wang Y, Radford MJ. Predicting one-year mortality among elderly survivors of hospitalization for an acute myocardial infarction: results from the cooperative cardiovascular project. J Am College Cardiol 2001;38:453–9. [29] Kuncheva LI. A stability index for feature selection. In: Artificial intelligence and applications; 2007. p. 421–7. [30] Laird BJ, Kaasa S, McMillan DC, Fallon MT, Hjermstad MJ, Fayers P, et al. Prognostic factors in patients with advanced cancer: a comparison of clinicopathological factors and the development of an inflammation-based prognostic system. Clin Cancer Res 2013;19:5456–64. [31] Liu H, Setiono R. Chi2: feature selection and discretization of numeric attributes. In: 1995 proceedings of seventh international conference on tools with artificial intelligence. IEEE; 1995. p. 388–91. [32] Liu J, Ye J. Moreau-yosida regularization for grouped tree structure learning. In: Advances in neural information processing systems; 2010. p. 1459–67. [33] Loscalzo S, Yu L, Ding C. Consensus group stable feature selection. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM; 2009. p. 567–76. [34] Maltoni M, Scarpi E, Pittureri C, Martini F, Montanari L, Amaducci E, et al. Prospective comparison of prognostic scores in palliative care cancer populations. Oncologist 2012;17:446–54. [35] Meinshausen N, Bühlmann P. Stability selection. J Roy Stat Soc Ser B (Stat Methodol) 2010;72:417–73. [36] Pfeiffer RM, Park Y, Kreimer AR, Lacey Jr JV, Pee D, Greenlee RT, et al. Risk prediction for breast, endometrial, and ovarian cancer in white women aged 50 y or older: derivation and validation from population-based cohort studies. PLoS Med 2013;10:e1001492. [37] Prokosch HU, Ganslandt T. Perspectives for medical informatics. Methods Inf Med 2009;48:38–44. [38] Quinlan JR. Induction of decision trees. Mach Learn 1986;1:81–106. [39] Quinlan JR. C4.5: programs for machine learning, vol. 1. Morgan kaufmann; 1993. [40] Ramchandran KJ, Shega JW, Von Roenn J, Schumacher M, Szmuilowicz E, Rademaker A, et al. A predictive model to identify hospitalized cancer patients at risk for 30-day mortality based on admission criteria via the electronic medical record. Cancer 2013;119:2074–80. [41] Rana S, Tran T, Luo W, Phung D, Kennedy R, Venkatesh S. Predicting unplanned readmission after myocardial infarction from routinely collected administrative hospital data. Aust Health Rev 2014. [42] Rish I. An empirical study of the naive bayes classifier. In: IJCAI 2001 workshop on empirical methods in artificial intelligence; 2001. p. 41–6. [43] Ryali S, Supekar K, Abrams DA, Menon V. Sparse logistic regression for wholebrain classification of FMRI data. NeuroImage 2010;51:752–64. [44] Saeys Y, Inza I, Larrañaga P. A review of feature selection techniques in bioinformatics. Bioinformatics 2007;23:2507–17. [45] Shi W, Wahba G, Irizarry R, Bravo H, Wright S. The partitioned lassopatternsearch algorithm with application to gene expression data. BMC Bioinformatics 2012;13:98. [46] Siedlecki W, Sklansky J. On automatic feature selection. Int J Pattern Recognit Artif Intell 1988;2:197–220. [47] Spiegelhalter DJ. Probabilistic prediction in patient management and clinical trials. Stat Med 1986;5:421–33. [48] Stuart Russell PN. Artificial intelligence: a modern approach. 2nd ed. Prentice Hall; 2003 [1995]. [49] Tibshirani R. Regression shrinkage and selection via the lasso. J Roy Stat Soc Ser B (Methodol) 1996:267–88. [50] Vapnik V. The nature of statistical learning theory. Springer; 2000. [51] Vapnik VN, Vapnik V. Statistical learning theory, vol. 2. New York: Wiley; 1998. [52] Witten IH, Frank E. Data mining: practical machine learning tools and techniques. Morgan Kaufmann; 2005. [53] Wu J, Roy J, Stewart WF. Prediction modeling using EHR data: challenges, strategies, and a comparison of machine learning approaches. Med Care 2010;48:S106–13. [54] Wu TT, Chen YF, Hastie T, Sobel E, Lange K. Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics 2009;25:714–21. [55] You ZH, Yin Z, Han K, Huang DS, Zhou X. A semi-supervised learning approach to predict synthetic genetic interactions by combining functional and topological properties of functional gene network. BMC Bioinformatics 2010;11:343. [56] Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J Roy Stat Soc Ser B (Stat Methodol) 2006;68:49–67. [57] Yuhara H, Steinmaus C, Cohen SE, Corley DA, Tei Y, Buffler PA. Is diabetes mellitus an independent risk factor for colon cancer and rectal cancer&quest. Am J Gastroenterol 2011;106:1911–21. [58] Zhao D, Weng C. Combining pubmed knowledge and EHR data to develop a weighted bayesian network for pancreatic cancer prediction. J Biomed Inform 2011;44:859–68. [59] Zhao P, Yu B. On model selection consistency of lasso. J Mach Learn Res 2006;7:2541–63. [60] Zou H, Hastie T. Regularization and variable selection via the elastic net. J Roy Stat Soc Ser B (Stat Methodol) 2005;67:301–20.

847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933


FREL: A Stable Feature Selection Algorithm.

Identification of biomarkers for esophageal squamous cell carcinoma using feature selection and decision tree methods.

Feature Selection Combined with Neural Network Structure Optimization for HIV-1 Protease Cleavage Site Prediction.

Feature selection in supervised saliency prediction.

Tree-structured feature extraction using mutual information.

Unbiased Prediction and Feature Selection in High-Dimensional Survival Regression.

Prediction of protein structural classes based on feature selection technique.

Drug activity prediction using multiple-instance learning via joint instance and feature selection.

Heartbeat classification using disease-specific feature selection.

Feature selection using genetic algorithms for fetal heart rate analysis.

Structured feature selection using coordinate descent optimization.

Robust Feature Selection Technique using Rank Aggregation.

Feature Subset Selection for Cancer Classification Using Weight Local Modularity.

Feature selection for chemical sensor arrays using mutual information.

Feature correlation hypergraph: exploiting high-order potentials for multimodal recognition.

Stable feature selection based on the ensemble L 1 -norm support vector machine for biomarker discovery.

Structure prediction for multicomponent materials using biminima.

Prediction of protein structure classes with flexible neural tree.

Self-adaptive MOEA feature selection for classification of bankruptcy prediction data.

Generating, Maintaining, and Exploiting Diversity in a Memetic Algorithm for Protein Structure Prediction.

Bagging survival tree procedure for variable selection and prediction in the presence of nonsusceptible patients.

Identifying potential clinical syndromes of hepatocellular carcinoma using PSO-based hierarchical feature selection algorithm.

Early prediction of clinical benefit of treating ovarian cancer using quantitative CT image feature analysis.

Multiclass classification of sarcomas using pathway based feature selection method.