Accepted Manuscript

A filter feature selection method based on the Maximal Information Coefficient and Gram-Schmidt Orthogonalization for biomedical data mining

Hongqiang Lyu, Mingxi Wan, Jiuqiang Han, Ruiling Liu, Cheng Wang

PII: S0010-4825(17)30280-9
DOI: 10.1016/j.compbiomed.2017.08.021
Reference: CBM 2761
To appear in: Computers in Biology and Medicine
Received Date: 29 March 2017
Revised Date: 19 August 2017
Accepted Date: 20 August 2017

Please cite this article as: H. Lyu, M. Wan, J. Han, R. Liu, C. Wang, A filter feature selection method based on the Maximal Information Coefficient and Gram-Schmidt Orthogonalization for biomedical data mining, Computers in Biology and Medicine (2017), doi: 10.1016/j.compbiomed.2017.08.021.
A filter feature selection method based on the Maximal Information Coefficient and Gram-Schmidt Orthogonalization for biomedical data mining

Hongqiang Lyu a,b,*, Mingxi Wan b, Jiuqiang Han a, Ruiling Liu a, Cheng Wang c

a School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, P.R. China
b School of Life Science and Technology, Xi'an Jiaotong University, Xi'an 710049, P.R. China
c Krieger School of Arts and Sciences, Johns Hopkins University, Baltimore, MD 21218, USA

* Corresponding author at: Room 158, 2nd East Building, 28 West Xianning Road, Xi'an 710049, China. Tel: +86 29 82668665. E-mail address: [email protected] (H. Lyu).
Abstract

Filter feature selection techniques have been widely used to mine biomedical data. Recently, a risk has been revealed in the classical filter method minimal-Redundancy-Maximal-Relevance (mRMR): a specific part of the redundancy, called irrelevant redundancy, may be involved in the minimal-redundancy component of this method. A few attempts have therefore been made to eliminate the irrelevant redundancy by attaching additional procedures to mRMR, such as Kernel Canonical Correlation Analysis based mRMR (KCCAmRMR). In the present study, a novel filter feature selection method based on the Maximal Information Coefficient (MIC) and Gram-Schmidt Orthogonalization (GSO), named Orthogonal MIC Feature Selection (OMICFS), is proposed to solve this problem. Different from other improved approaches under the max-relevance and min-redundancy criterion, the proposed method uses the MIC to quantify the degree of relevance between feature variables and the target variable, applies the GSO to calculate the orthogonalized variable of a candidate feature with respect to previously selected features, and indirectly optimizes the max-relevance and min-redundancy by maximizing the MIC relevance between the GSO orthogonalized variable and the target. This orthogonalization strategy allows OMICFS to exclude the irrelevant redundancy without any additional procedures. To verify its performance, OMICFS was compared with other filter feature selection methods in terms of both classification accuracy and computational efficiency in classification experiments on two types of biomedical datasets. The results show that OMICFS outperforms the other methods in most cases. In addition, the differences between these methods are analyzed, and the application of OMICFS to the mining of high-dimensional biomedical data is discussed. The Matlab code for the proposed method is available at https://github.com/lhqxinghun/bioinformatics/tree/master/OMICFS/.

Keywords
Filter feature selection
Maximal Information Coefficient (MIC)
Gram-Schmidt Orthogonalization (GSO)
Biomedical data mining
1. Introduction

Feature selection methods in machine learning play an important role in biomedical data analysis. With the development of modern measurement technologies in the biomedical field, large volumes of rapidly expanding and ever-changing biomedical data are generated, presenting a primary challenge for researchers conducting data mining and knowledge discovery [1, 2]. Thus, there is a practical need for feature selection methods, by which a compact subset of significant features can be determined to provide information concerning a phenotype of interest. Many different feature selection methods have been utilized in various clinical research studies, such as biomedical image segmentation for tumor region definition [3], electrocardiogram assessment for health inspection [4], molecular bioactivity prediction for drug design [5] and, particularly, biomarker discovery for cancer diagnosis [6].

Feature selection techniques can be broadly divided into three categories: filter methods, wrapper methods and embedded methods [7]. The filter technique performs feature selection by exploring the intrinsic properties of the data in an open-loop manner, independently of the classifier design. By contrast, wrapper and embedded techniques perform feature selection by interacting with classifiers: wrapper methods incorporate a classifier hypothesis into a closed-loop search for an optimal feature subset, whereas embedded methods build the search into the classifier construction. Wrapper and embedded techniques commonly achieve better accuracy than the filter technique, but the filter technique is less likely to over-fit and is much more computationally economical, particularly compared with the embedded technique [8, 9]. Thus, filter methods are commonly used for biomedical data mining, particularly for analyses in high-dimensional biomedical domains [10].
Herein, filter methods based on the max-relevance and min-redundancy criterion are examined. minimal-Redundancy-Maximal-Relevance (mRMR) [11] is a classical filter method based on Mutual Information (MI), which searches for a compact subset of informative features by reducing redundant features, and it has been widely used in biomedical data mining [12, 13]. However, mRMR only addresses the quantity of the redundancy, without considering its type. Thus, there is a risk that a specific part of the redundancy, called irrelevant redundancy, may be involved in the minimal-redundancy component of this method. Although the potential irrelevant redundancy is part of the redundancy between different feature variables, it is completely independent of the corresponding target variable [14, 15]. Thus, the covariates of features were fed into mRMR instead of the features themselves in an attempt to eliminate irrelevant redundancy [15]. An improved algorithm, called Kernel Canonical Correlation Analysis based mRMR (KCCAmRMR), was also developed, in which irrelevant redundancy is filtered out by an additional kernel canonical correlation analysis so that only the relevant redundancy is considered in the subsequent mRMR procedure [16].

In the present study, a novel filter feature selection method based on the Maximal Information Coefficient (MIC) [17] and Gram-Schmidt Orthogonalization (GSO), called Orthogonal MIC Feature Selection (OMICFS), is proposed for biomedical data mining. Different from other filter approaches under the max-relevance and min-redundancy criterion, the proposed method uses the MIC rather than MI to quantify the degree of relevance between features and target, applies the GSO to calculate the orthogonalized variable of a candidate feature with respect to previously selected features to remove the redundancy between them, and indirectly optimizes the max-relevance and min-redundancy by maximizing the MIC relevance between the GSO orthogonalized variable and the target, so that the most promising feature is determined. In this way, a compact subset of informative features can be retrieved by a stepwise search. This orthogonalization strategy allows OMICFS to exclude the irrelevant redundancy without any additional procedures.

The remainder of this paper is organized as follows: Section 2 presents the proposed feature selection method, OMICFS, in detail, and briefly introduces the classifiers and evaluation measures used in the present study. In Section 3, classification experiments are conducted, including a summary of the biomedical datasets and a comparison with other similar methods in terms of classification accuracy and computational efficiency. Section 4 discusses the differences between these methods and the application of OMICFS. Finally, conclusions are provided in Section 5.

2. Methods
2.1. Algorithm design
The irrelevant redundancy risk involved in mRMR has been analyzed in detail in the literature [15, 16]. To illustrate the problem, a simple graphical example is given. In Fig. 1, three variables are visualized: a target variable t, a previously selected feature x1 and a candidate feature x2. According to the max-relevance and min-redundancy criterion, the mRMR Score (mRMRS) of x2 with respect to x1 has the following form:

    mRMRS(x2, x1) = MI(x2, t) − MI(x2, x1) = (r3 + r4) − (r3 + r6) = r4 − r6,    (1)

where MI is the MI statistic between two variables, MI(x2, t) is the relevance between the candidate feature x2 and the target variable t, MI(x2, x1) is the redundancy of x2 with respect to the previously selected feature x1, and ri indicates the i-th information region in Fig. 1. It can be seen that MI(x2, x1) consists of two parts, r3 and r6, which are considered the relevant redundancy and the irrelevant redundancy, respectively, since r3 represents the redundant information that x1 and x2 carry about the target t, whereas r6 is completely irrelevant to the target. It has been suggested that the irrelevant redundancy r6 should be removed from the mRMR score, leaving only r4, which represents the unique information that x2 carries about the target t [16].
Fig. 1. Diagram of the relations between three variables: a target variable t, a previously selected feature x1 and a candidate feature x2. ri indicates the i-th information region.
Herein, a novel filter feature selection method, OMICFS, is designed to deal with this problem. The OMICFS Score (OMICFSS) is given by:

    OMICFSS(x2, x1) = MIC(GSO(x2, x1), t) = (r4 + r7) − r7 = r4,    (2)

where MIC is the MIC statistic between two variables, GSO(x2, x1) is the orthogonalized variable of the candidate feature x2 with respect to the previously selected feature x1 obtained by means of a GSO function, and MIC(GSO(x2, x1), t) indicates the MIC relevance between the orthogonalized variable GSO(x2, x1) and the target variable t. In Fig. 1, GSO(x2, x1) is represented by r4 + r7, which is the information carried by x2 but distinct from x1, and MIC(GSO(x2, x1), t) can be regarded as subtracting r7 from r4 + r7, leaving only r4. It can be seen that the irrelevant redundancy r6 is excluded from the OMICFSS. Thus, the proposed orthogonalization strategy is able to eliminate the irrelevant redundancy from the classical mRMR.
2.2. Proposed feature selection method
A data matrix X ∈ R^(N×P) is tabled as N samples and P features, where the i-th feature variable is xi ∈ R^N, 1 ≤ i ≤ P. The target variable is t ∈ Z^N, in which different integer values represent different classes of the corresponding samples. Under the max-relevance and min-redundancy criterion, the proposed feature selection method, OMICFS, aims to find a compact subset of informative features S ∈ R^(N×D), D < P, where the i-th promising feature variable is si ∈ R^N, 1 ≤ i ≤ D, and the target variable t can be optimally characterized by the feature variables in S.

2.2.1. Max-relevance
The max-relevance scheme is committed to selecting the most informative features, which effectively distinguish between different classes of samples. In this scheme, the degree of relevance between the feature variables and the target variable is quantified, and the informativeness of features is determined according to the relevance score. To implement this scheme, a newly explored information metric, the MIC, is used in this paper to score the relevance of a candidate feature variable with respect to the target variable. The Max-Relevance Score (MRelS) of feature variable xi has the following form:

    MRelS(xi) = MINE_MIC(xi, t),    (3)

where MINE_MIC is a MIC score function from the Maximal Information-based Nonparametric Exploration (MINE) application, which can be used to calculate the MIC and other statistical scores. The MIC is a statistical measure of the association between paired variables regardless of whether the relationship is linear or nonlinear. To obtain the MIC score, the values of the two variables are partitioned into different numbers of bins to form rectangular grids with different resolutions; the distribution on the cells of each grid is obtained by letting the probability mass in each cell be the fraction of points falling into that cell; the MI statistic for each grid is then calculated, and the maximum is chosen as the MIC score. The MIC of two variables x and t is defined as:

    MINE_MIC(x, t) = max_{xn·tn < N^0.6} [ MI_{xn,tn}(x, t) / log2 min{xn, tn} ],    (4)

where N is the sample size, xn and tn denote the numbers of bins imposed on the x and t axes, respectively, and MI_{xn,tn}(x, t) is the MI statistic between the two variables for an xn-by-tn rectangular grid. According to this definition, the MIC score can be used to quantify the degree of relevance between a continuous variable and a qualitative variable. However, the two existing MINE application packages, available at http://www.exploredata.net/ [17] and http://minepy.readthedocs.io/ [18], respectively, are not capable of receiving a qualitative target variable as input: the different integer values of the target variable represent different classes of the corresponding samples, and these qualitative values will be arbitrarily partitioned into bins whose number is mostly not equal to the number of sample classes. Thus, in the present study, the latter package was adjusted to ensure that the values of the target variable are only partitioned into specified bins whose number is consistent with the number of sample classes. The relevance between continuous feature variables and the qualitative target variable was then calculated using the adjusted package with default parameters; the greater the MRelS, the higher the distinguishing ability of the corresponding feature.
2.2.2. Max-relevance and min-redundancy
The max-relevance and min-redundancy criterion aims to find a compact subset of informative features by considering the max-relevance scheme and the min-redundancy simultaneously. A simple combination of individually informative features does not necessarily achieve good classification performance; thus, "the m best features are not the best m features" [11]. Therefore, both the informativeness of individual features and the redundancy between them should be considered. A number of feature selection methods work in this way, such as mRMR [11] and Quadratic Programming Feature Selection (QPFS) [19], in which the difference between the relevance and the redundancy is maximized to optimize the max-relevance and min-redundancy.

In the present study, the max-relevance and min-redundancy is indirectly optimized using an orthogonalization strategy that combines the GSO and the MIC. The GSO is devoted to calculating the orthogonalized variable of a candidate feature with respect to other features to remove the redundancy between them, and the MIC relevance between the orthogonalized variable and the target variable is maximized to indirectly optimize the max-relevance and min-redundancy. The proposed OMICFS score of a candidate feature variable xi with respect to previously selected features x1, x2, …, xj is given by:

    OMICFSS(xi; x1, x2, …, xj) = MINE_MIC(GSO(xi; x1, x2, …, xj), t),    (5)

where GSO(xi; x1, x2, …, xj) is the orthogonalized variable of feature xi with respect to the feature variables x1, x2, …, xj obtained using a GSO function. The greater the OMICFS score, the more promising the feature xi. Notably, the irrelevant redundancy is not included in the OMICFS score because of the involvement of the orthogonal transformation and the target variable.

2.2.3. Implementation
In the actual implementation, a stepwise search is used to select features that achieve a near-optimal maximization of the OMICFS score. In the first step, the MRelS of all the candidate features is calculated, and the feature with the maximum MRelS is determined as the first promising feature variable:

    s1 = arg max_{xi ∈ X} [MRelS(xi)],    (6)

and the corresponding orthogonalized variable is selected as q1 = s1 / ||s1||. Subsequently, the problem becomes how to incrementally select other promising features from the remaining features, where one feature represents one step forward. Suppose that a feature subset S_{m−1}, 2 ≤ m ≤ D, composed of m−1 promising features s1, s2, …, s_{m−1}, has been determined at step m−1, and the corresponding orthogonalized variables are q1, q2, …, q_{m−1}. The m-th promising feature can be selected from X − S_{m−1} at step m by optimizing the following condition:

    sm = arg max_{xi ∈ X − S_{m−1}} OMICFSS(xi, S_{m−1}) = arg max_{xi ∈ X − S_{m−1}} MINE_MIC(GSO(xi, S_{m−1}), t),    (7)

where GSO(xi, S_{m−1}) is the orthogonalized variable of the candidate feature xi with respect to the previously selected feature variables in S_{m−1}, calculated using a GSO function:

    GSO(xi, S_{m−1}) = ui / ||ui||,  ui = xi − (⟨xi, q1⟩ / ⟨q1, q1⟩) q1 − … − (⟨xi, q_{m−1}⟩ / ⟨q_{m−1}, q_{m−1}⟩) q_{m−1}.    (8)

With the determination of sm, the associated orthogonalized variable qm = GSO(sm, S_{m−1}) is simultaneously determined, which will be useful in subsequent steps. Thus, the promising feature variables can be incrementally retrieved until step D, where a total of D features have been selected.

For example, suppose 3 promising features need to be selected from 5 candidate features X = {x1, x2, x3, x4, x5}. In the first step, the MRelS of all the features is calculated. Assuming x2 has the maximal MRelS, we have s1 = x2, q1 = x2 / ||x2||. The second step begins with the computation of OMICFSS(x1, s1), OMICFSS(x3, s1), OMICFSS(x4, s1) and OMICFSS(x5, s1); the feature x3, which is assumed to have the maximal OMICFS score, is selected, leaving the other three features to be chosen later, that is, s2 = x3, q2 = GSO(s2, s1). In the third step, since OMICFSS(x1; s1, s2) is greater than OMICFSS(x4; s1, s2) and OMICFSS(x5; s1, s2), the last promising feature is determined as s3 = x1, q3 = GSO(s3; s1, s2). Thus, the three expected promising features S = {x2, x3, x1} are finally obtained.
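The stepwise search of Eqs. (6)-(8) can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the paper's implementation is in Matlab, and here the absolute Pearson correlation stands in for the MINE_MIC relevance so that the example stays self-contained; all names are illustrative.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def abs_pearson(v, t):
    # stand-in relevance score; the paper uses the MIC here instead
    n = len(v)
    mv, mt = sum(v) / n, sum(t) / n
    cov = sum((a - mv) * (b - mt) for a, b in zip(v, t))
    den = (sum((a - mv) ** 2 for a in v) * sum((b - mt) ** 2 for b in t)) ** 0.5
    return abs(cov / den) if den > 0 else 0.0

def gso(x, qs):
    # Eq. (8): residual of x after projecting out the orthogonal basis qs, normalized
    u = list(x)
    for q in qs:
        if dot(q, q) == 0:
            continue
        c = dot(x, q) / dot(q, q)
        u = [ui - c * qi for ui, qi in zip(u, q)]
    norm = dot(u, u) ** 0.5
    if norm < 1e-9 * (dot(x, x) ** 0.5 or 1.0):
        return [0.0] * len(x)  # x is (numerically) spanned by the selected features
    return [ui / norm for ui in u]

def omicfs(features, t, d, relevance):
    # features: list of feature columns; returns indices of d selected features
    remaining = list(range(len(features)))
    first = max(remaining, key=lambda i: relevance(features[i], t))  # Eq. (6)
    selected, qs = [first], [gso(features[first], [])]
    remaining.remove(first)
    while len(selected) < d:  # Eq. (7), one feature per step
        best = max(remaining, key=lambda i: relevance(gso(features[i], qs), t))
        qs.append(gso(features[best], qs))
        selected.append(best)
        remaining.remove(best)
    return selected
```

On a toy input containing a perfectly redundant twin of an already selected feature, the GSO residual of the twin collapses to the zero vector, so its score drops to zero and a genuinely complementary feature is chosen instead, which is exactly the behaviour Eq. (7) is designed to produce.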
In addition, the theory of Sure Independence Screening (SIS) [20] can be used to speed up OMICFS when dealing with high-dimensional biomedical data. In each step of non-accelerated OMICFS, the orthogonalized variables and MIC relevance for each candidate feature have to be recalculated to determine which has the maximum OMICFS score; thus, when the feature dimension is ultra-high, the computational burden is heavy. The SIS theory claims that, under the condition P >> N, the original set of features can be reduced to a small subset whose dimension is on the order of N / log N, where P and N are the dimension of the original features and the sample size, respectively, and feature selection can then be accomplished by some refined lower-dimensional methods [20, 21]. According to SIS, when OMICFS is used for high-dimensional and low-sample-size biomedical data, the candidate features can be ranked in descending order by MRelS, and only the top λN / log N, λ ≥ 1, features are fed into the stepwise search described above, reducing the computational burden. For example, suppose the sample size is N = 200, and D = 200 promising features need to be selected from a total of P = 15,000 (P >> N) candidate feature variables. This would be a time-consuming task for OMICFS without acceleration. Thanks to SIS, the 15,000 candidate features can be ranked in descending order by MRelS, and only the top λN / log N = 434 features (here λ is set to 5 and log is taken to base 10) are retained, so that the original selection becomes a computationally simpler one, in which 200 promising features need to be selected from 434 features one by one.
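The pre-screening step above can be sketched as follows (Python, illustrative; the function name is hypothetical). Note that the paper's worked example, 5 × 200 / log 200 = 434, only comes out with a base-10 logarithm, which is therefore assumed here:

```python
import math

def sis_prescreen(mrels_scores, n_samples, lam=5):
    # SIS screening: keep only the top lam * N / log10(N) features, ranked by MRelS
    keep = int(lam * n_samples / math.log10(n_samples))
    ranked = sorted(range(len(mrels_scores)), key=mrels_scores.__getitem__, reverse=True)
    return ranked[:min(keep, len(mrels_scores))]
```

The stepwise OMICFS search is then run only over the returned indices, which for N = 200 and P = 15,000 shrinks the candidate pool from 15,000 to 434 features.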
OMICFS:
Input: Candidate feature variables xi ∈ X, 1 ≤ i ≤ P, where X ∈ R^(N×P) is tabled as N samples and P features, and the target variable t ∈ Z^N
Output: Ordinal numbers of the D selected features
1: for i ← {1, …, P} do
2:     calculate MRelS(xi)
3: end for
4: if P >> N then
5:     Xr = rank[MRelS(xi)], xi ∈ X
…
[…] the top-ranking 200 features were fed into the classifiers with a step size of 5 features. The corresponding curves, as shown in Fig. 2 (e)-(h), are based on five repeats of 10-fold cross-validation due to the small sample size. In addition, the conditions of the peak points on the accuracy curves for the SVM, KNN, MLP and RF are shown in Tables 2, 3, 4 and 5, respectively. It is noted that QPFS is unable to handle the Lung_Cancer and GDS3875 datasets, since these two turn out to be non-convex problems, whereas QPFS is a convex quadratic program [19, 33].

Fig. 2. Change of the average classification accuracies versus the number of top-ranking features selected by mRMR, KCCAmRMR, QPFS and OMICFS using the SVM, KNN, MLP and RF classifiers. The first four datasets are based on 10-fold cross-validation, and the last four datasets are based on five repeats of 10-fold cross-validation due to small sample size. (a) Diabetic, (b) Heart, (c) Parkinson's, (d) Seeds, (e) Prostate_Tumor, (f) Lung_Cancer, (g) GDS2960 and (h) GDS3875.
Table 2
Comparison of the peak points on the accuracy curves corresponding to mRMR, KCCAmRMR, QPFS and OMICFS using the SVM. Each cell gives Fea b / ACC (%) / Kappa (%) / AUC (%).

Datasets | Original Fea a | mRMR | KCCAmRMR | QPFS | OMICFS
Diabetic | 19 | 19/75.67/51.59/84.01 | 19/75.67/51.59/84.01 | 17/76.54/53.33/83.64 | 14/76.63/53.57/84.94
Heart | 13 | 8/84.08/67.53/90.75 | 11/84.45/68.33/91.00 | 3/84.82/69.08/85.20 | 11/84.45/68.29/90.67
Parkinson's | 26 | 21/70.67/41.35/76.71 | 16/70.58/41.15/76.13 | 24/71.35/42.69/78.06 | 20/71.64/42.77/77.33
Seeds | 7 | 5/95.72/93.57/- | 6/94.76/92.14/- | 5/95.72/93.57/- | 5/96.19/94.29/-
Prostate_Tumor | 10,509 | 126/95.69/91.38/98.07 | 36/95.07/90.13/97.08 | 81/94.51/89.02/98.57 | 51/96.51/93.04/97.68
Lung_Cancer | 12,600 | 51/84.35/68.25/- | 186/82.76/64.73/- | - | 21/84.43/67.72/-
GDS2960 | 4,068 | 51/96.07/91.50/99.62 | 101/98.24/96.16/99.68 | 96/96.85/93.13/98.78 | 196/97.84/95.33/99.68
GDS3875 | 16,847 | 181/92.67/84.03/- | 116/96.98/93.62/- | - | 191/96.75/93.05/-

a Number of original features.
b Number of selected features.
Table 3
Comparison of the peak points on the accuracy curves corresponding to mRMR, KCCAmRMR, QPFS and OMICFS using the KNN. Each cell gives Fea b / ACC (%) / Kappa (%) / AUC (%).

Datasets | Original Fea a | mRMR | KCCAmRMR | QPFS | OMICFS
Diabetic | 19 | 9/66.29/33.21/71.35 | 6/68.37/37.13/70.08 | 6/66.38/33.49/72.76 | 4/69.15/39.12/76.05
Heart | 13 | 11/84.82/68.92/91.25 | 12/84.82/68.89/91.36 | 4/85.19/69.39/85.11 | 11/85.93/71.25/91.86
Parkinson's | 26 | 23/68.75/37.50/74.08 | 22/68.56/37.12/73.59 | 26/67.02/34.04/73.95 | 25/68.56/37.12/74.40
Seeds | 7 | 2/93.34/90.76/- | 6/92.86/89.29/- | 6/91.91/87.86/- | 5/93.33/90.00/-
Prostate_Tumor | 10,509 | 106/78.18/56.72/85.37 | 36/92.60/85.26/94.35 | 6/92.35/84.70/95.03 | 26/91.38/82.78/95.69
Lung_Cancer | 12,600 | 191/79.56/55.79/- | 156/82.62/61.24/- | - | 21/83.86/66.07/-
GDS2960 | 4,068 | 56/94.84/88.49/98.10 | 11/94.27/88.26/99.68 | 36/93.45/85.72/99.30 | 16/95.22/89.70/99.04
GDS3875 | 16,847 | 176/68.87/29.75/- | 181/84.69/57.39/- | - | 6/80.15/49.09/-

a Number of original features.
b Number of selected features.
Table 4
Comparison of the peak points on the accuracy curves corresponding to mRMR, KCCAmRMR, QPFS and OMICFS using the MLP. Each cell gives Fea b / ACC (%) / Kappa (%) / AUC (%).

Datasets | Original Fea a | mRMR | KCCAmRMR | QPFS | OMICFS
Diabetic | 19 | 19/75.15/50.44/82.80 | 18/75.24/50.56/82.94 | 16/75.58/51.17/82.66 | 15/76.20/52.50/82.95
Heart | 13 | 5/82.59/64.40/88.72 | 4/82.96/65.05/85.42 | 6/83.33/66.22/89.14 | 5/84.08/67.56/87.75
Parkinson's | 26 | 17/67.11/34.23/72.90 | 18/67.02/34.04/73.76 | 23/67.98/35.96/74.49 | 19/68.27/36.54/73.77
Seeds | 7 | 6/90.48/85.72/- | 6/90.00/85.00/- | 3/90.95/86.43/- | 4/92.38/88.57/-
Prostate_Tumor | 10,509 | 11/85.40/71.24/91.48 | 6/90.98/81.95/95.09 | 6/91.00/83.51/94.95 | 11/91.35/80.18/96.86
Lung_Cancer | 12,600 | 51/75.34/53.47/- | 21/73.08/50.40/- | - | 56/78.08/54.38/-
GDS2960 | 4,068 | 26/89.18/76.91/92.50 | 16/92.42/83.99/97.42 | 21/89.73/77.83/93.44 | 6/96.00/91.59/99.31
GDS3875 | 16,847 | 6/72.14/31.86/- | 6/74.51/41.20/- | - | 16/75.69/45.13/-

a Number of original features.
b Number of selected features.
Table 5
Comparison of the peak points on the accuracy curves corresponding to mRMR, KCCAmRMR, QPFS and OMICFS using the RF. Each cell gives Fea b / ACC (%) / Kappa (%) / AUC (%).

Datasets | Original Fea a | mRMR | KCCAmRMR | QPFS | OMICFS
Diabetic | 19 | 18/68.63/37.25/74.64 | 19/69.50/39.02/74.93 | 13/69.24/38.54/75.52 | 5/70.72/41.47/75.67
Heart | 13 | 13/84.45/68.45/90.78 | 19/84.45/68.29/89.67 | 3/84.45/68.37/86.42 | 12/84.82/69.04/89.53
Parkinson's | 26 | 23/72.02/44.04/78.85 | 14/72.40/44.81/79.20 | 26/72.50/45.00/79.15 | 21/71.64/43.27/78.66
Seeds | 7 | 2/94.29/91.43/- | 3/94.76/92.14/- | 6/94.76/92.14/- | 6/95.72/93.57/-
Prostate_Tumor | 10,509 | 31/95.25/90.51/97.81 | 11/95.07/90.15/98.19 | 26/95.07/90.15/96.89 | 31/95.27/90.55/95.96
Lung_Cancer | 12,600 | 46/85.29/69.31/- | 61/84.83/67.45/- | - | 181/84.81/67.78/-
GDS2960 | 4,068 | 51/95.67/90.79/98.68 | 26/96.64/92.74/98.85 | 86/96.04/91.43/98.53 | 36/97.82/95.31/99.23
GDS3875 | 16,847 | 11/85.23/61.68/- | 31/90.69/77.40/- | - | 66/91.11/75.66/-

a Number of original features.
b Number of selected features.
The proposed OMICFS was compared with mRMR, KCCAmRMR and QPFS by analyzing the general trend of the classification accuracy curves. With respect to the Diabetic, Seeds and Lung_Cancer datasets (Fig. 2 (a), (d) and (f)), the average ACC of the OMICFS curves is consistently higher than that of the other curves for all of the SVM, KNN, MLP and RF classifiers, except for Diabetic using the SVM and MLP as well as Lung_Cancer using the RF, in which the accuracy of OMICFS is close to that of the other methods, especially that of KCCAmRMR and QPFS. For the other datasets, the comparison results are not consistent across classifiers. In these cases, the average ACC of OMICFS is slightly better than or similar to that of the other three methods, except for the Heart dataset using the RF and the GDS3875 dataset using the KNN (Fig. 2 (b) and (h)), in which OMICFS is clearly not as good as the other methods, especially KCCAmRMR.

OMICFS was also compared with the other three methods by analyzing the peak points on the classification accuracy curves. In Table 2, using the SVM classifier, the average ACC and Kappa for OMICFS are the maximum in 5 and 4, respectively, out of a total of 8 biomedical datasets. For the average AUC, OMICFS has the maximum value in 2 out of the 5 associated datasets. Moreover, the proposed method achieves these peak points with the minimum number of selected features on 3 datasets. For example, with respect to the Diabetic dataset, a maximum average ACC of 76.63% is achieved via OMICFS with only 14 features, whereas a maximum average ACC of 75.67% is achieved through mRMR and KCCAmRMR with all 19 features, and 76.54% through QPFS with 17 features. In Table 3, using the KNN classifier, OMICFS has the maximum average ACC and Kappa in 4 out of 8 biomedical datasets, its average AUC is the maximum in 4 out of the 5 associated datasets, and the peak points are achieved with the minimum number of selected features on 3 datasets. In Table 4, using the MLP classifier, the average ACC and Kappa for OMICFS are the maximum on all 8 biomedical datasets except for 1 dataset in terms of average Kappa, and the maximum average AUC is achieved via OMICFS on 3 out of the 5 associated datasets. In Table 5, using the RF classifier, OMICFS has the maximum average ACC and Kappa in 6 and 5 datasets, respectively, and the maximum average AUC in 2 datasets. Thus, as shown in Fig. 2 and Tables 2, 3, 4 and 5, the proposed OMICFS outperforms mRMR, KCCAmRMR and QPFS in terms of accuracy in most cases.

3.4. Computational efficiency

The computational efficiency of mRMR, KCCAmRMR, QPFS and OMICFS differs. Table 6 shows the average time consumed by the four methods during feature selection on the datasets
involved in the present study. The running time was obtained on a Windows 7 notebook with an Intel Dual-Core i7-6500U 2.50 GHz processor and 8 GB RAM. The results show that mRMR is much more efficient than the other three methods. This is expected, since mRMR typically accepts discretized feature variables. Although KCCAmRMR uses a strategy similar to mRMR for calculating the relevance and redundancy terms, the additional kernel canonical correlation analysis is time-consuming, which makes KCCAmRMR the least computationally economical method. QPFS is the closest approach to mRMR in terms of computational efficiency, but it requires much more time when dealing with high-dimensional data due to the lack of an acceleration algorithm such as those used in the other three methods. The efficiency of OMICFS is clearly not as good as that of mRMR, but it is much better than that of KCCAmRMR, as the proposed method excludes irrelevant redundancy while calculating the max-relevance and min-redundancy without any additional procedures. Compared with QPFS, OMICFS is more efficient in dealing with high-dimensional data due to its acceleration algorithm based on SIS theory.

Table 6
additional procedures. Compared with QPFS, OMICFS is more efficient in dealing with high-dimensional data due to its acceleration algorithm based on SIS theory. Table 6
Comparison of the average computational time (second) consumed by mRMR, KCCAmRMR, QPFS and OMICFS. Datasets
mRMR
KCCAmRMR QPFS
Diabetic
0.03
101.23
0.07
Heart
0.02
2.23
0.04
Parkinson’s
0.05
132.35
0.08
16.55
Seeds
0.01
0.85
0.02
0.11
Prostate_Tumor
9.39
1,209.23
520.20
109.64
Lung_Cancer
15.59
11,109.65
-
480.95
GDS2960
6.59
319.13
233.03
41.37
GDS3875
24.08
9,611.28
-
299.65
EP
TE D
0.23
AC C
4. Discussion
OMICFS
8.12
4.1. Necessity of normalization

A normalization procedure was used to prepare the two types of datasets in the present study. To illustrate its necessity, all the datasets were processed in exactly the same way, except that the normalization procedure was omitted. The results of the proposed OMICFS without normalization are shown in Table 7, including the number of selected features and the average ACC of the peak points using the SVM, KNN, MLP and RF classifiers. Compared with the corresponding parts of Tables 2, 3, 4 and 5, OMICFS performs better in most cases with the help of the normalization procedure. There are at least two reasons for this. First, normalization changes the MIC value: without it, the distribution of points falling into the cells of the rectangular grid differs, which yields a different MIC value. Second, normalization is routinely used to bring different features onto the same scale before they are fed into a classifier (for example, via the svm-scale module in the LIBSVM package), because features with much larger scales can govern the output of classifiers, especially distance-based ones. Thus, the preceding normalization procedure helps OMICFS find promising features that achieve better classification performance.
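To make the scale effect concrete, consider a minimal toy sketch (our own example, not from the paper): two samples differ by 8 units in a large-scale feature and by almost the full range of a small-scale feature. The raw Euclidean distance that a distance-based classifier such as KNN relies on is governed almost entirely by the large-scale feature; after a min-max normalization (in the spirit of LIBSVM's svm-scale module), the small-scale feature contributes its share.

```python
import numpy as np

def min_max_scale(X, lo=0.0, hi=1.0):
    """Linearly rescale each column of X to [lo, hi]."""
    mn, mx = X.min(axis=0), X.max(axis=0)
    span = np.where(mx > mn, mx - mn, 1.0)  # guard against constant columns
    return lo + (hi - lo) * (X - mn) / span

# column 0 lives on a cm-like scale, column 1 on a ratio-like scale
X = np.array([[170.0, 0.02],
              [178.0, 0.95],
              [150.0, 0.50]])

raw_dist = np.linalg.norm(X[0] - X[1])               # dominated by column 0
scaled = min_max_scale(X)
scaled_dist = np.linalg.norm(scaled[0] - scaled[1])  # column 1 now matters
print(round(raw_dist, 3), round(scaled_dist, 3))
```

Before scaling, the distance between the first two samples is about 8.05, essentially the 8-unit gap in the first feature; after scaling it is about 1.04, dominated by the second feature, which spans nearly its whole range between the two samples.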
Table 7
Peak points on the accuracy curves by OMICFS without normalization.

Datasets         Original Fea.(a)   SVM(b)      KNN         MLP         RF
Diabetic         19                 19/76.11    13/70.20    18/75.58    11/70.54
Heart            13                 10/84.45    10/73.70    13/82.60    13/83.70
Parkinson's      26                 20/72.69    8/64.52     25/68.27    17/73.17
Seeds            7                  4/95.24     5/90.48     4/90.95     3/94.76
Prostate_Tumor   10,509             71/94.29    11/93.51    21/90.21    21/94.11
Lung_Cancer      12,600             126/82.82   136/82.16   11/79.09    76/84.01
GDS2960          4,068              156/97.00   71/95.05    6/91.09     76/96.42
GDS3875          16,847             146/96.25   161/82.18   11/76.58    136/88.20

(a) Number of original features.
(b) Number of selected features/average ACC (%).
4.2. Differences with other methods
Compared with other filter feature selection methods based on the max-relevance and min-redundancy criterion, such as mRMR, KCCAmRMR and QPFS, the proposed method, OMICFS, has two main differences that enable competitive data mining. First, an orthogonalization strategy is used to exclude irrelevant redundancy, unlike mRMR, in which the undesirable irrelevant redundancy is involved. Compared with KCCAmRMR, the orthogonalization strategy allows OMICFS to exclude the irrelevant redundancy while optimizing the max-relevance and min-redundancy, making the method much more computationally economical than KCCAmRMR, which requires an additional, time-consuming kernel canonical correlation analysis to filter out irrelevant redundancy. Second, MIC statistics is chosen to explore the relationships between pairs of variables. Compared with the MI statistics in mRMR and KCCAmRMR, as well as the Pearson correlation coefficient in QPFS, the MIC not only effectively captures a wider range of functional relationships (such as exponential, periodic and sinusoidal) but is also applicable to non-functional relationships. In addition, the MIC is general and, on functional relationships under noisy conditions, roughly equal to the coefficient of determination [17]. Thus, the choice of statistics benefits OMICFS, although more computational time is required for MIC.

4.3. Application of OMICFS

As a filter feature selection technique, OMICFS can be used to mine data from various fields. Herein,
we analyzed high-dimensional biomedical data for tasks such as cancer biomarker discovery, which is helpful for the accurate diagnosis of cancer. Heatmaps of the top 20 marker genes selected by OMICFS on the Prostate_Tumor and Lung_Cancer datasets are shown in Fig. 3 (a) and (b), respectively. These color maps show the level of gene expression versus the cancer type and marker gene name, where red and blue represent expression levels above and below the mean, respectively. The results suggest that OMICFS distinguishes between different cancers by identifying a set of key marker genes related to each type of cancer. On these two high-dimensional biomedical datasets, mRMR is considerably faster than OMICFS, but its accuracy is much lower with the KNN classifier. KCCAmRMR and QPFS achieve similar accuracy, but they are much less computationally economical, and QPFS is unable to handle the Lung_Cancer dataset. Thus, a feature selection method should be chosen according to its own characteristics and the specific application.
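The orthogonalization strategy at the heart of OMICFS can be sketched as a greedy loop: at each step, every remaining candidate is projected onto the orthogonal complement of the already-selected features (Gram-Schmidt), and the candidate whose orthogonalized component is most associated with the target wins. The sketch below is a simplified illustration under our own assumptions, not the authors' code: squared Pearson correlation stands in for MIC (minepy [18] would supply the real statistic), and the names `assoc` and `omicfs_sketch` are ours.

```python
import numpy as np

def assoc(u, y):
    """Squared Pearson correlation; a simple stand-in for MIC."""
    u, y = u - u.mean(), y - y.mean()
    denom = np.linalg.norm(u) * np.linalg.norm(y)
    return 0.0 if denom == 0 else float((u @ y) / denom) ** 2

def omicfs_sketch(X, y, k):
    """Greedy selection: orthogonalize each candidate against the selected
    features, then score the residual's association with the target."""
    n, p = X.shape
    selected, basis = [], []            # basis: orthonormal selected directions

    def residual(j):
        r = X[:, j].astype(float)
        for q in basis:                 # Gram-Schmidt: strip selected components
            r = r - (r @ q) * q
        return r

    for _ in range(k):
        scores = [(assoc(residual(j), y), j) for j in range(p) if j not in selected]
        best = max(scores)[1]           # highest-scoring orthogonalized candidate
        selected.append(best)
        r = residual(best)
        nrm = np.linalg.norm(r)
        if nrm > 1e-12:                 # extend the orthonormal basis
            basis.append(r / nrm)
    return selected

# toy data: feature 2 duplicates feature 0, so only one of them should survive
rng = np.random.default_rng(1)
x0, x1 = rng.normal(size=300), rng.normal(size=300)
x2 = x0 + 0.01 * rng.normal(size=300)           # redundant near-copy of x0
X = np.column_stack([x0, x1, x2, rng.normal(size=(300, 5))])
y = x0 + x1
sel = omicfs_sketch(X, y, 2)
print(sel)
```

Because feature 2 is an almost exact copy of feature 0, whichever of the pair is picked first leaves the other with a near-zero orthogonalized component, so the second pick goes to the genuinely complementary feature. This is the redundancy-excluding behavior the orthogonalization provides without any extra analysis step.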
Fig. 3. Heatmaps of the top 20 marker genes selected by OMICFS on the (a) Prostate_Tumor and (b) Lung_Cancer datasets. Cancer types and marker gene names are shown on the horizontal and vertical axes, respectively. Red and blue represent expression levels above and below the mean, respectively; the color scale indicates the Standard Deviation (SD) from the mean.
5. Conclusion

In the present study, a new filter feature selection method, called OMICFS, was proposed for biomedical data mining. Compared with other methods under the max-relevance and min-redundancy criterion, such as mRMR, KCCAmRMR and QPFS, the main advantage of the proposed method is that an orthogonalization strategy is used to exclude the undesired irrelevant redundancy, without any additional procedures, while optimizing the max-relevance and min-redundancy. In addition, MIC statistics has a promising ability to capture complex relationships between pairs of variables. As a result, the proposed method outperforms the other methods in most experiments. OMICFS could serve a variety of data mining applications, particularly the mining of high-dimensional biomedical data.
Acknowledgments
This work was financially supported by the National Natural Science Foundation of China under Grant 61602367 and the China Postdoctoral Science Foundation under Grant 2015M580851.

References
[1] E. Bender, Big data in biomedicine, Nature, 527 (2015) S1.
[2] K.H. Buetow, Cyberinfrastructure: empowering a "third way" in biomedical research, Science, 308 (2005) 821-824.
[3] P. Lambin, E. Rios-Velazquez, R. Leijenaar, S. Carvalho, R.G. van Stiphout, P. Granton, C.M. Zegers, R. Gillies, R. Boellard, A. Dekker, Radiomics: extracting more information from medical images using advanced feature analysis, European Journal of Cancer, 48 (2012) 441-446.
[4] T. Mar, S. Zaunseder, J.P. Martínez, M. Llamedo, R. Poll, Optimization of ECG classification by means of feature selection, IEEE Transactions on Biomedical Engineering, 58 (2011) 2168-2177.
[5] J. Weston, F. Pérez-Cruz, O. Bousquet, O. Chapelle, A. Elisseeff, B. Schölkopf, Feature selection and transduction for prediction of molecular bioactivity for drug design, Bioinformatics, 19 (2003) 764-771.
[6] Y. Saeys, L. Wehenkel, P. Geurts, Statistical interpretation of machine learning-based feature importance scores for biomarker discovery, Bioinformatics, 28 (2012) 1766-1774.
[7] Y. Saeys, I. Inza, P. Larrañaga, A review of feature selection techniques in bioinformatics, Bioinformatics, 23 (2007) 2507-2517.
[8] A. Sharma, K.K. Paliwal, S. Imoto, S. Miyano, A feature selection method using improved regularized linear discriminant analysis, Machine Vision and Applications, 25 (2014) 775-786.
[9] T. Abeel, T. Helleputte, Y. Van de Peer, P. Dupont, Y. Saeys, Robust biomarker identification for cancer diagnosis with ensemble feature selection methods, Bioinformatics, 26 (2010) 392-398.
[10] G. Wong, C. Leckie, A. Kowalczyk, FSR: feature set reduction for scalable and accurate multi-class cancer subtype classification based on copy number, Bioinformatics, 28 (2012) 151-159.
[11] H. Peng, F. Long, C. Ding, Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy, IEEE Transactions on Pattern Analysis and Machine Intelligence, 27 (2005) 1226-1238.
[12] A. Chin, A. Mirzal, H. Haron, H. Hamed, Supervised, unsupervised and semi-supervised feature selection: a review on gene selection, IEEE/ACM Transactions on Computational Biology and Bioinformatics, (2015).
[13] A. Balodi, M.L. Dewal, R.S. Anand, A. Rawat, Texture based classification of the severity of mitral regurgitation, Computers in Biology and Medicine, 73 (2016) 157-164.
[14] P.A. Estevez, M. Tesmer, C.A. Perez, J.M. Zurada, Normalized mutual information feature selection, IEEE Transactions on Neural Networks, 20 (2009) 189-201.
[15] O. Kurşun, C.O. Şakar, O. Favorov, N. Aydin, S.F. Gürgen, Using covariates for improving the minimum redundancy maximum relevance feature selection method, Turkish Journal of Electrical Engineering & Computer Sciences, 18 (2010) 975-989.
[16] C.O. Sakar, O. Kursun, F. Gurgen, A feature selection method based on kernel canonical correlation analysis and the minimum Redundancy-Maximum Relevance filter method, Expert Systems with Applications, 39 (2012) 3432-3437.
[17] D.N. Reshef, Y.A. Reshef, H.K. Finucane, S.R. Grossman, G. McVean, P.J. Turnbaugh, E.S. Lander, M. Mitzenmacher, P.C. Sabeti, Detecting novel associations in large data sets, Science, 334 (2011) 1518-1524.
[18] D. Albanese, M. Filosi, R. Visintainer, S. Riccadonna, G. Jurman, C. Furlanello, Minerva and minepy: a C engine for the MINE suite and its R, Python and MATLAB wrappers, Bioinformatics, 29 (2013) 407-408.
[19] I. Rodriguez-Lujan, R. Huerta, C. Elkan, C.S. Cruz, Quadratic programming feature selection, Journal of Machine Learning Research, 11 (2010) 1491-1516.
[20] J. Fan, J. Lv, Sure independence screening for ultrahigh dimensional feature space, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70 (2008) 849-911.
[21] Q. He, D.-Y. Lin, A variable selection method for genome-wide association studies, Bioinformatics, 27 (2011) 1-8.
[22] C.C. Chang, C.J. Lin, LIBSVM: a library for support vector machines, ACM Transactions on Intelligent Systems and Technology, 2 (2011).
[23] M. Jirina, Jr., Classifiers Based on Inverted Distances, INTECH Open Access Publisher, 2011.
[24] B.V. Dasarathy, Nearest neighbor (NN) norms: NN pattern classification techniques, (1991).
[25] R. Remesan, J. Mathew, Hydroinformatics and data-based modelling issues in hydrology, in: Hydrological Data Driven Modelling, Springer, 2015, pp. 19-39.
[26] C. Heylman, R. Datta, A. Sobrino, S. George, E. Gratton, Supervised machine learning for classification of the electrophysiological effects of chronotropic drugs on human induced pluripotent stem cell-derived cardiomyocytes, PLoS ONE, 10 (2015) e0144572.
[27] M. Kuhn, K. Johnson, Applied Predictive Modeling, Springer, 2013.
[28] J.-H. Kim, Estimating classification error rate: repeated cross-validation, repeated hold-out and bootstrap, Computational Statistics & Data Analysis, 53 (2009) 3735-3745.
[29] K. Bache, M. Lichman, UCI Machine Learning Repository, 2013.
[30] A. Statnikov, I. Tsamardinos, Y. Dosbayev, C.F. Aliferis, GEMS: a system for automated cancer diagnosis and biomarker discovery from microarray gene expression data, International Journal of Medical Informatics, 74 (2005) 491-503.
[31] T. Barrett, S.E. Wilhite, P. Ledoux, C. Evangelista, I.F. Kim, M. Tomashevsky, K.A. Marshall, K.H. Phillippy, P.M. Sherman, M. Holko, NCBI GEO: archive for functional genomics data sets - update, Nucleic Acids Research, 41 (2013) D991-D995.
[32] H. Bengtsson, aroma.light: light-weight methods for normalization and visualization of microarray data using only basic R data types, R package version 1, 2009, URL http://www.braju.com/R/.
[33] X.V. Nguyen, J. Chan, S. Romano, J. Bailey, Effective global approaches for mutual information based feature selection, in: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2014, pp. 512-521.
Highlights

A novel filter feature selection method named OMICFS is proposed.
MIC statistics is employed to quantify the relevance between features and target.
An orthogonalization strategy is used to deal with the irrelevant redundancy risk.
The performance is compared in terms of both accuracy and efficiency.
Hongqiang Lyu received the BS, MS and PhD degrees in Electronic and Information Engineering from Xi'an Jiaotong University, China, in 2007, 2010 and 2015, respectively. He is now a lecturer in the School of Electronic and Information Engineering of the university. His current research interests include bioinformatics and biomedical image processing.
Mingxi Wan received the MS and PhD degrees in Biomedical Engineering from Xi'an Jiaotong University, China, in 1985 and 1989, respectively. He is currently a professor in the Department of Biomedical Engineering, Xi'an Jiaotong University. His current research interests include biomedical ultrasound and biomedical signal processing.
Jiuqiang Han graduated from Xi'an Jiaotong University, China, where he joined the faculty in 1977; he is currently a professor in the School of Electronic and Information Engineering of the university. His current research interests include 3-D image measurement and fusion, bioinformatics and sensor networks.
Ruiling Liu received the BS, MS and PhD degrees in Electronic and Information Engineering from Xi'an Jiaotong University, China, in 2000, 2003 and 2010, respectively. She is currently a lecturer in the School of Electronic and Information Engineering of the university. Her research interests include bioinformatics, artificial intelligence and computer vision.
Cheng Wang received the MS and PhD degrees in Neuroscience from the Chinese Academy of Sciences, China, in 2008 and 2012, respectively. He is currently a postdoctoral fellow in the Krieger School of Arts and Sciences at Johns Hopkins University. His current research focuses on brain signal analysis and processing.