Accepted Manuscript

A filter feature selection method based on the Maximal Information Coefficient and Gram-Schmidt Orthogonalization for biomedical data mining

Hongqiang Lyu, Mingxi Wan, Jiuqiang Han, Ruiling Liu, Cheng Wang

PII: S0010-4825(17)30280-9
DOI: 10.1016/j.compbiomed.2017.08.021
Reference: CBM 2761
To appear in: Computers in Biology and Medicine
Received Date: 29 March 2017
Revised Date: 19 August 2017
Accepted Date: 20 August 2017

Please cite this article as: H. Lyu, M. Wan, J. Han, R. Liu, C. Wang, A filter feature selection method based on the Maximal Information Coefficient and Gram-Schmidt Orthogonalization for biomedical data mining, Computers in Biology and Medicine (2017), doi: 10.1016/j.compbiomed.2017.08.021.
A filter feature selection method based on the Maximal Information Coefficient and Gram-Schmidt Orthogonalization for biomedical data mining

Hongqiang Lyu a,b,*, Mingxi Wan b, Jiuqiang Han a, Ruiling Liu a, Cheng Wang c

a School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, P.R. China
b School of Life Science and Technology, Xi'an Jiaotong University, Xi'an 710049, P.R. China
c Krieger School of Arts and Sciences, Johns Hopkins University, Baltimore, MD 21218, USA

* Corresponding author at: Room 158, 2nd East Building, 28 West Xianning Road, Xi'an 710049, China. Tel: +86 29 82668665. E-mail address: [email protected] (H. Lyu).
Abstract

Filter feature selection techniques have been widely used to mine biomedical data. Recently, a risk has been revealed in the classical filter method minimal-Redundancy-Maximal-Relevance (mRMR): a specific part of the redundancy, called irrelevant redundancy, may be involved in the minimal-redundancy component of this method. A few attempts have therefore been made to eliminate the irrelevant redundancy by attaching additional procedures to mRMR, such as Kernel Canonical Correlation Analysis based mRMR (KCCAmRMR). In the present study, a novel filter feature selection method based on the Maximal Information Coefficient (MIC) and Gram-Schmidt Orthogonalization (GSO), named Orthogonal MIC Feature Selection (OMICFS), is proposed to solve this problem. Different from other improved approaches under the max-relevance and min-redundancy criterion, the proposed method uses the MIC to quantify the degree of relevance between feature variables and the target variable, applies the GSO to calculate the orthogonalized variable of a candidate feature with respect to previously selected features, and indirectly optimizes the max-relevance and min-redundancy by maximizing the MIC relevance between the GSO orthogonalized variable and the target. This orthogonalization strategy allows OMICFS to exclude the irrelevant redundancy without any additional procedures. To verify its performance, OMICFS was compared with other filter feature selection methods in terms of both classification accuracy and computational efficiency in classification experiments on two types of biomedical datasets. The results show that OMICFS outperforms the other methods in most cases. In addition, the differences between these methods are analyzed, and the application of OMICFS to the mining of high-dimensional biomedical data is discussed. The Matlab code for the proposed method is available at https://github.com/lhqxinghun/bioinformatics/tree/master/OMICFS/.

Keywords
Filter feature selection
Maximal Information Coefficient (MIC)
Gram-Schmidt Orthogonalization (GSO)
Biomedical data mining
1. Introduction

Feature selection methods in machine learning play an important role in biomedical data analysis. With the development of modern measurement technologies in the biomedical field, large volumes of rapidly expanding and ever-changing biomedical data are generated, presenting a primary challenge for researchers conducting data mining and knowledge discovery [1, 2]. Thus, there is a practical need for feature selection methods, by which a compact subset of significant features can be determined to provide information concerning a phenotype of interest. Many different feature selection methods have been utilized in various clinical research studies, such as biomedical image segmentation for tumor region definition [3], electrocardiogram assessment for health inspection [4], molecular bioactivity prediction for drug design [5] and, particularly, biomarker discovery for cancer diagnosis [6].

Feature selection techniques can be broadly divided into three categories: filter methods, wrapper methods and embedded methods [7]. The filter technique performs feature selection by exploring the intrinsic properties of the data in an open-loop manner, independently of the classifier design. By contrast, wrapper and embedded techniques perform feature selection by interacting with classifiers: wrapper methods incorporate a classifier hypothesis into a closed-loop search for an optimal feature subset, whereas embedded methods build the search into the classifier construction. Wrapper and embedded techniques commonly achieve better accuracy than the filter technique, but the filter technique is less likely to over-fit and is much more computationally economical, particularly compared with the embedded technique [8, 9]. Thus, filter methods are commonly used for biomedical data mining, particularly for analyses in high-dimensional biomedical domains [10].
Herein, filter methods based on the max-relevance and min-redundancy criterion are examined. minimal-Redundancy-Maximal-Relevance (mRMR) [11] is a classical filter method based on Mutual Information (MI), which searches for a compact subset of informative features by reducing redundant features, and it has been widely used in biomedical data mining [12, 13]. However, mRMR only addresses the quantity of the redundancy, without considering its type. Thus, there is a risk that a specific part of the redundancy, called irrelevant redundancy, may be involved in the minimal-redundancy component of this method. Although the potential irrelevant redundancy is part of the redundancy between different feature variables, it is completely independent of the corresponding target variable [14, 15]. Thus, the covariates of features were fed into mRMR instead of the features themselves in an attempt to eliminate irrelevant redundancy [15]. An improved algorithm, called Kernel Canonical Correlation Analysis based mRMR (KCCAmRMR), was also developed, in which irrelevant redundancy is filtered out by an additional kernel canonical correlation analysis so that only the relevant redundancy is considered in the subsequent mRMR procedure [16].

In the present study, a novel filter feature selection method based on the Maximal Information Coefficient (MIC) [17] and Gram-Schmidt Orthogonalization (GSO), called Orthogonal MIC Feature Selection (OMICFS), is proposed for biomedical data mining. Different from other filter approaches under the max-relevance and min-redundancy criterion, the proposed method uses the MIC rather than MI to quantify the degree of relevance between features and target, applies the GSO to calculate the orthogonalized variable of a candidate feature with respect to previously selected features to remove the redundancy between them, and indirectly optimizes the max-relevance and min-redundancy by maximizing the MIC relevance between the GSO orthogonalized variable and the target, so that the most promising feature is determined. In this way, a compact subset of informative features can be retrieved by a stepwise search. This orthogonalization strategy allows OMICFS to exclude the irrelevant redundancy without any additional procedures.

The remainder of this paper is organized as follows: Section 2 presents the proposed feature selection method, OMICFS, in detail, and briefly introduces the classifiers and evaluation measures used in the present study. In Section 3, classification experiments are conducted, including a summary of the biomedical datasets and a comparison with other similar methods in terms of classification accuracy and computational efficiency. Section 4 discusses the differences between these methods and the application of OMICFS. Finally, conclusions are provided in Section 5.

2. Methods
2.1. Algorithm design
The irrelevant redundancy risk involved in mRMR has been analyzed in detail in the literature [15, 16]. To illustrate the problem, a simple graphical example is given. In Fig. 1, three variables are visualized: a target variable t, a previously selected feature x1 and a candidate feature x2. According to the max-relevance and min-redundancy criterion, the mRMR Score (mRMRS) of x2 with respect to x1 has the following form:

    mRMRS(x2, x1) = MI(x2, t) − MI(x2, x1) = (r3 + r4) − (r3 + r6) = r4 − r6,    (1)

where MI is the MI statistic between two variables, MI(x2, t) is the relevance between the candidate feature x2 and the target variable t, MI(x2, x1) is the redundancy of x2 with respect to the previously selected feature x1, and ri indicates the i-th information region in Fig. 1. It can be seen that MI(x2, x1) consists of two parts, r3 and r6, which are considered the relevant redundancy and the irrelevant redundancy, respectively, since r3 represents the redundant information that x1 and x2 carry about the target t, whereas r6 is completely irrelevant to the target. It has been suggested that the irrelevant redundancy r6 should be removed from the mRMR score, leaving only r4, which represents the unique information that x2 carries about the target t [16].
Fig. 1. Diagram of the relations between three variables: a target variable t, a previously selected feature x1 and a candidate feature x2. ri indicates the i-th information region.
Herein, a novel filter feature selection method, OMICFS, is designed to deal with this problem. The OMICFS Score (OMICFSS) is given by:

    OMICFSS(x2, x1) = MIC(GSO(x2, x1), t) = (r4 + r7) − r7 = r4,    (2)

where MIC is the MIC statistic between two variables, GSO(x2, x1) is the orthogonalized variable of the candidate feature x2 with respect to the previously selected feature x1 obtained by means of a GSO function, and MIC(GSO(x2, x1), t) indicates the MIC relevance between the orthogonalized variable GSO(x2, x1) and the target variable t. In Fig. 1, GSO(x2, x1) is represented by r4 + r7, which is the information carried by x2 but distinct from x1, and MIC(GSO(x2, x1), t) can be regarded as subtracting r7 from r4 + r7, leaving only r4. It can be seen that the irrelevant redundancy r6 is excluded from the OMICFSS. Thus, the proposed orthogonalization strategy is able to eliminate the irrelevant redundancy from the classical mRMR.
2.2. Proposed feature selection method
A data matrix X ∈ R^(N×P) is tabled as N samples and P features, where the i-th feature variable is xi ∈ R^N, 1 ≤ i ≤ P. The target variable is t ∈ Z^N, in which different integer values represent different classes of the corresponding samples. Under the max-relevance and min-redundancy criterion, the proposed feature selection method, OMICFS, aims to find a compact subset of informative features S ∈ R^(N×D), D < P, where the i-th promising feature variable is si ∈ R^N, 1 ≤ i ≤ D, and the target variable t can be optimally characterized by the feature variables in S.

2.2.1. Max-relevance
The max-relevance scheme is committed to selecting the most informative features, which effectively distinguish between different classes of samples. In this scheme, the degree of relevance between the feature variables and the target variable is quantified, and the informativeness of features is determined according to the relevance score. To implement this scheme, a newly explored information metric, the MIC, is used in this paper to score the relevance of a candidate feature variable with respect to the target variable. The Max-Relevance Score (MRelS) of feature variable xi has the following form:

    MRelS(xi) = MINE_MIC(xi, t),    (3)

where MINE_MIC is a MIC score function from the Maximal Information-based Nonparametric Exploration (MINE) application, which can be used to calculate the MIC and other statistical scores. The MIC is a statistical measure of the association between paired variables regardless of whether the relationship is linear or nonlinear. To obtain the MIC score, the values of the two variables are partitioned into different numbers of bins to form rectangular grids with different resolutions; the distribution on the cells of each grid is obtained by letting the probability mass in each cell be the fraction of points falling into that cell; the MI statistic for each grid is then calculated, and the maximum is chosen as the MIC score. The MIC of two variables x and t is defined as:

    MINE_MIC(x, t) = max_{xn·tn < N^0.6} [ MI_{xn,tn}(x, t) / log2 min{xn, tn} ],    (4)

where N is the sample size, xn and tn denote the numbers of bins imposed on the x and t axes, respectively, and MI_{xn,tn}(x, t) is the MI statistic between the two variables for an xn-by-tn rectangular grid. According to this definition, the MIC score can be used to quantify the degree of relevance between a continuous variable and a qualitative variable. However, the two existing MINE application packages, available at http://www.exploredata.net/ [17] and http://minepy.readthedocs.io/ [18], respectively, are not capable of receiving a qualitative target variable as input: the different integer values of the target variable represent different classes of the corresponding samples, and these qualitative values will be arbitrarily partitioned into bins whose number is mostly not equal to the number of sample classes. Thus, in the present study, the latter package was adjusted to ensure that the values of the target variable are only partitioned into specified bins whose number is consistent with the number of sample classes. The relevance between continuous feature variables and the qualitative target variable was then calculated using the adjusted package with default parameters; the greater the MRelS, the higher the distinguishing ability of the corresponding feature.
2.2.2. Max-relevance and min-redundancy
The max-relevance and min-redundancy criterion aims to find a compact subset of informative features by considering the max-relevance scheme and the min-redundancy simultaneously. A simple combination of individually informative features does not necessarily achieve good classification performance; thus, "the m best features are not the best m features" [11]. Therefore, both the informativeness of individual features and the redundancy between them should be considered. A number of feature selection methods work in this way, such as mRMR [11] and Quadratic Programming Feature Selection (QPFS) [19], in which the difference between the relevance and the redundancy is maximized to optimize the max-relevance and min-redundancy.

In the present study, the max-relevance and min-redundancy is indirectly optimized using an orthogonalization strategy that combines the GSO and the MIC. The GSO is devoted to calculating the orthogonalized variable of a candidate feature with respect to other features to remove the redundancy between them, and the MIC relevance between the orthogonalized variable and the target variable is maximized to indirectly optimize the max-relevance and min-redundancy. The proposed OMICFS score of a candidate feature variable xi with respect to previously selected features x1, x2, …, xj is given by:

    OMICFSS(xi; x1, x2, …, xj) = MINE_MIC(GSO(xi; x1, x2, …, xj), t),    (5)

where GSO(xi; x1, x2, …, xj) is the orthogonalized variable of feature xi with respect to the feature variables x1, x2, …, xj obtained using a GSO function. The greater the OMICFS score, the more promising the feature xi. Notably, the irrelevant redundancy is not included in the OMICFS score because of the involvement of the orthogonal transformation and the target variable.

2.2.3. Implementation
In the actual implementation, a stepwise search is used to select features that achieve a near-optimal maximization of the OMICFS score. In the first step, the MRelS of all the candidate features is calculated, and the feature with the maximum MRelS is determined as the first promising feature variable:

    s1 = arg max_{xi ∈ X} [MRelS(xi)],    (6)

and the corresponding orthogonalized variable is selected as q1 = s1 / ||s1||. Subsequently, the problem becomes how to incrementally select other promising features from the remaining features, where one feature represents one step forward. Suppose that a feature subset S_{m−1}, 2 ≤ m ≤ D, composed of m−1 promising features s1, s2, …, s_{m−1}, has been determined at step m−1, and the corresponding orthogonalized variables are q1, q2, …, q_{m−1}. The m-th promising feature can be selected from X − S_{m−1} at step m by optimizing the following condition:

    sm = arg max_{xi ∈ X − S_{m−1}} OMICFSS(xi, S_{m−1}) = arg max_{xi ∈ X − S_{m−1}} MINE_MIC(GSO(xi, S_{m−1}), t),    (7)

where GSO(xi, S_{m−1}) is the orthogonalized variable of the candidate feature xi with respect to the previously selected feature variables in S_{m−1}, calculated using a GSO function:

    GSO(xi, S_{m−1}) = ui / ||ui||,  ui = xi − (⟨xi, q1⟩ / ⟨q1, q1⟩) q1 − … − (⟨xi, q_{m−1}⟩ / ⟨q_{m−1}, q_{m−1}⟩) q_{m−1}.    (8)

With the determination of sm, the associated orthogonalized variable qm = GSO(sm, S_{m−1}) is simultaneously determined, which will be useful in subsequent steps. Thus, the promising feature variables can be incrementally retrieved until step D, where a total of D features have been selected.

For example, suppose 3 promising features need to be selected from 5 candidate features X = {x1, x2, x3, x4, x5}. In the first step, the MRelS of all the features is calculated. Assuming x2 has the maximal MRelS, we have s1 = x2, q1 = x2 / ||x2||. The second step begins with the computation of OMICFSS(x1, s1), OMICFSS(x3, s1), OMICFSS(x4, s1) and OMICFSS(x5, s1); the feature x3, which is assumed to have the maximal OMICFS score, is selected, leaving the other three features to be chosen later, that is, s2 = x3, q2 = GSO(s2, s1). In the third step, since OMICFSS(x1; s1, s2) is greater than OMICFSS(x4; s1, s2) and OMICFSS(x5; s1, s2), the last promising feature is determined as s3 = x1, q3 = GSO(s3; s1, s2). Thus, the three expected promising features S = {x2, x3, x1} are finally obtained.
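The stepwise search of Eqs. (6)-(8) can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the paper's implementation is in Matlab, and here the absolute Pearson correlation stands in for the MINE_MIC relevance so that the example stays self-contained; all names are illustrative.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def abs_pearson(v, t):
    # stand-in relevance score; the paper uses the MIC here instead
    n = len(v)
    mv, mt = sum(v) / n, sum(t) / n
    cov = sum((a - mv) * (b - mt) for a, b in zip(v, t))
    den = (sum((a - mv) ** 2 for a in v) * sum((b - mt) ** 2 for b in t)) ** 0.5
    return abs(cov / den) if den > 0 else 0.0

def gso(x, qs):
    # Eq. (8): residual of x after projecting out the orthogonal basis qs, normalized
    u = list(x)
    for q in qs:
        if dot(q, q) == 0:
            continue
        c = dot(x, q) / dot(q, q)
        u = [ui - c * qi for ui, qi in zip(u, q)]
    norm = dot(u, u) ** 0.5
    if norm < 1e-9 * (dot(x, x) ** 0.5 or 1.0):
        return [0.0] * len(x)  # x is (numerically) spanned by the selected features
    return [ui / norm for ui in u]

def omicfs(features, t, d, relevance):
    # features: list of feature columns; returns indices of d selected features
    remaining = list(range(len(features)))
    first = max(remaining, key=lambda i: relevance(features[i], t))  # Eq. (6)
    selected, qs = [first], [gso(features[first], [])]
    remaining.remove(first)
    while len(selected) < d:  # Eq. (7), one feature per step
        best = max(remaining, key=lambda i: relevance(gso(features[i], qs), t))
        qs.append(gso(features[best], qs))
        selected.append(best)
        remaining.remove(best)
    return selected
```

On a toy input containing a perfectly redundant twin of an already selected feature, the GSO residual of the twin collapses to the zero vector, so its score drops to zero and a genuinely complementary feature is chosen instead, which is exactly the behaviour Eq. (7) is designed to produce.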
In addition, the theory of Sure Independence Screening (SIS) [20] can be used to speed up OMICFS when dealing with high-dimensional biomedical data. In each step of non-accelerated OMICFS, the orthogonalized variables and MIC relevance for each candidate feature have to be recalculated to determine which has the maximum OMICFS score; thus, when the feature dimension is ultra-high, the computational burden is heavy. The SIS theory claims that, under the condition P >> N, the original set of features can be reduced to a small subset whose dimension is on the order of N / log N, where P and N are the dimension of the original features and the sample size, respectively, and feature selection can then be accomplished by some refined lower-dimensional methods [20, 21]. According to SIS, when OMICFS is used for high-dimensional and low-sample-size biomedical data, the candidate features can be ranked in descending order by MRelS, and only the top λN / log N, λ ≥ 1, features are fed into the stepwise search described above, reducing the computational burden. For example, suppose the sample size is N = 200, and D = 200 promising features need to be selected from a total of P = 15,000 (P >> N) candidate feature variables. This would be a time-consuming task for OMICFS without acceleration. Thanks to SIS, the 15,000 candidate features can be ranked in descending order by MRelS, and only the top λN / log N = 434 features (here λ is set to 5 and log is taken to base 10) are retained, so that the original selection becomes a computationally simpler one, in which 200 promising features need to be selected from 434 features one by one.
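The pre-screening step above can be sketched as follows (Python, illustrative; the function name is hypothetical). Note that the paper's worked example, 5 × 200 / log 200 = 434, only comes out with a base-10 logarithm, which is therefore assumed here:

```python
import math

def sis_prescreen(mrels_scores, n_samples, lam=5):
    # SIS screening: keep only the top lam * N / log10(N) features, ranked by MRelS
    keep = int(lam * n_samples / math.log10(n_samples))
    ranked = sorted(range(len(mrels_scores)), key=mrels_scores.__getitem__, reverse=True)
    return ranked[:min(keep, len(mrels_scores))]
```

The stepwise OMICFS search is then run only over the returned indices, which for N = 200 and P = 15,000 shrinks the candidate pool from 15,000 to 434 features.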
OMICFS:
Input: Candidate feature variables xi ∈ X, 1 ≤ i ≤ P, where X ∈ R^(N×P) is tabled as N samples and P features, and the target variable t ∈ Z^N
Output: Ordinal numbers of the D selected features
1: for i ← {1, …, P} do
2:     calculate MRelS(xi)
3: end for
4: if P >> N then
5:     Xr = rank[MRelS(xi)], xi ∈ X
…
[…] the top-ranking 200 features were fed into the classifiers with a step size of 5 features. The corresponding curves, as shown in Fig. 2 (e)-(h), are based on five repeats of 10-fold cross-validation due to the small sample size. In addition, the conditions of the peak points on the accuracy curves for the SVM, KNN, MLP and RF are shown in Tables 2, 3, 4 and 5, respectively. It is noted that QPFS is unable to handle the Lung_Cancer and GDS3875 datasets, since these two turn out to be non-convex problems, whereas QPFS is a convex quadratic program [19, 33].

Fig. 2. Change of the average classification accuracies versus the number of top-ranking features selected by mRMR, KCCAmRMR, QPFS and OMICFS using the SVM, KNN, MLP and RF classifiers. The first four datasets are based on 10-fold cross-validation, and the last four datasets are based on five repeats of 10-fold cross-validation due to small sample size. (a) Diabetic, (b) Heart, (c) Parkinson's, (d) Seeds, (e) Prostate_Tumor, (f) Lung_Cancer, (g) GDS2960 and (h) GDS3875.
Table 2
Comparison of the peak points on the accuracy curves corresponding to mRMR, KCCAmRMR, QPFS and OMICFS using the SVM. Each cell gives Fea b / ACC (%) / Kappa (%) / AUC (%).

Datasets | Original Fea a | mRMR | KCCAmRMR | QPFS | OMICFS
Diabetic | 19 | 19/75.67/51.59/84.01 | 19/75.67/51.59/84.01 | 17/76.54/53.33/83.64 | 14/76.63/53.57/84.94
Heart | 13 | 8/84.08/67.53/90.75 | 11/84.45/68.33/91.00 | 3/84.82/69.08/85.20 | 11/84.45/68.29/90.67
Parkinson's | 26 | 21/70.67/41.35/76.71 | 16/70.58/41.15/76.13 | 24/71.35/42.69/78.06 | 20/71.64/42.77/77.33
Seeds | 7 | 5/95.72/93.57/- | 6/94.76/92.14/- | 5/95.72/93.57/- | 5/96.19/94.29/-
Prostate_Tumor | 10,509 | 126/95.69/91.38/98.07 | 36/95.07/90.13/97.08 | 81/94.51/89.02/98.57 | 51/96.51/93.04/97.68
Lung_Cancer | 12,600 | 51/84.35/68.25/- | 186/82.76/64.73/- | - | 21/84.43/67.72/-
GDS2960 | 4,068 | 51/96.07/91.50/99.62 | 101/98.24/96.16/99.68 | 96/96.85/93.13/98.78 | 196/97.84/95.33/99.68
GDS3875 | 16,847 | 181/92.67/84.03/- | 116/96.98/93.62/- | - | 191/96.75/93.05/-

a Number of original features.
b Number of selected features.
Table 3
Comparison of the peak points on the accuracy curves corresponding to mRMR, KCCAmRMR, QPFS and OMICFS using the KNN. Each cell gives Fea b / ACC (%) / Kappa (%) / AUC (%).

Datasets | Original Fea a | mRMR | KCCAmRMR | QPFS | OMICFS
Diabetic | 19 | 9/66.29/33.21/71.35 | 6/68.37/37.13/70.08 | 6/66.38/33.49/72.76 | 4/69.15/39.12/76.05
Heart | 13 | 11/84.82/68.92/91.25 | 12/84.82/68.89/91.36 | 4/85.19/69.39/85.11 | 11/85.93/71.25/91.86
Parkinson's | 26 | 23/68.75/37.50/74.08 | 22/68.56/37.12/73.59 | 26/67.02/34.04/73.95 | 25/68.56/37.12/74.40
Seeds | 7 | 2/93.34/90.76/- | 6/92.86/89.29/- | 6/91.91/87.86/- | 5/93.33/90.00/-
Prostate_Tumor | 10,509 | 106/78.18/56.72/85.37 | 36/92.60/85.26/94.35 | 6/92.35/84.70/95.03 | 26/91.38/82.78/95.69
Lung_Cancer | 12,600 | 191/79.56/55.79/- | 156/82.62/61.24/- | - | 21/83.86/66.07/-
GDS2960 | 4,068 | 56/94.84/88.49/98.10 | 11/94.27/88.26/99.68 | 36/93.45/85.72/99.30 | 16/95.22/89.70/99.04
GDS3875 | 16,847 | 176/68.87/29.75/- | 181/84.69/57.39/- | - | 6/80.15/49.09/-

a Number of original features.
b Number of selected features.
Table 4
Comparison of the peak points on the accuracy curves corresponding to mRMR, KCCAmRMR, QPFS and OMICFS using the MLP. Each cell gives Fea b / ACC (%) / Kappa (%) / AUC (%).

Datasets | Original Fea a | mRMR | KCCAmRMR | QPFS | OMICFS
Diabetic | 19 | 19/75.15/50.44/82.80 | 18/75.24/50.56/82.94 | 16/75.58/51.17/82.66 | 15/76.20/52.50/82.95
Heart | 13 | 5/82.59/64.40/88.72 | 4/82.96/65.05/85.42 | 6/83.33/66.22/89.14 | 5/84.08/67.56/87.75
Parkinson's | 26 | 17/67.11/34.23/72.90 | 18/67.02/34.04/73.76 | 23/67.98/35.96/74.49 | 19/68.27/36.54/73.77
Seeds | 7 | 6/90.48/85.72/- | 6/90.00/85.00/- | 3/90.95/86.43/- | 4/92.38/88.57/-
Prostate_Tumor | 10,509 | 11/85.40/71.24/91.48 | 6/90.98/81.95/95.09 | 6/91.00/83.51/94.95 | 11/91.35/80.18/96.86
Lung_Cancer | 12,600 | 51/75.34/53.47/- | 21/73.08/50.40/- | - | 56/78.08/54.38/-
GDS2960 | 4,068 | 26/89.18/76.91/92.50 | 16/92.42/83.99/97.42 | 21/89.73/77.83/93.44 | 6/96.00/91.59/99.31
GDS3875 | 16,847 | 6/72.14/31.86/- | 6/74.51/41.20/- | - | 16/75.69/45.13/-

a Number of original features.
b Number of selected features.
Table 5
Comparison of the peak points on the accuracy curves corresponding to mRMR, KCCAmRMR, QPFS and OMICFS using the RF. Each cell gives Fea b / ACC (%) / Kappa (%) / AUC (%).

Datasets | Original Fea a | mRMR | KCCAmRMR | QPFS | OMICFS
Diabetic | 19 | 18/68.63/37.25/74.64 | 19/69.50/39.02/74.93 | 13/69.24/38.54/75.52 | 5/70.72/41.47/75.67
Heart | 13 | 13/84.45/68.45/90.78 | 19/84.45/68.29/89.67 | 3/84.45/68.37/86.42 | 12/84.82/69.04/89.53
Parkinson's | 26 | 23/72.02/44.04/78.85 | 14/72.40/44.81/79.20 | 26/72.50/45.00/79.15 | 21/71.64/43.27/78.66
Seeds | 7 | 2/94.29/91.43/- | 3/94.76/92.14/- | 6/94.76/92.14/- | 6/95.72/93.57/-
Prostate_Tumor | 10,509 | 31/95.25/90.51/97.81 | 11/95.07/90.15/98.19 | 26/95.07/90.15/96.89 | 31/95.27/90.55/95.96
Lung_Cancer | 12,600 | 46/85.29/69.31/- | 61/84.83/67.45/- | - | 181/84.81/67.78/-
GDS2960 | 4,068 | 51/95.67/90.79/98.68 | 26/96.64/92.74/98.85 | 86/96.04/91.43/98.53 | 36/97.82/95.31/99.23
GDS3875 | 16,847 | 11/85.23/61.68/- | 31/90.69/77.40/- | - | 66/91.11/75.66/-

a Number of original features.
b Number of selected features.
The proposed OMICFS was compared with mRMR, KCCAmRMR and QPFS by analyzing the general trend of the classification accuracy curves. With respect to the Diabetic, Seeds and Lung_Cancer datasets (Fig. 2 (a), (d) and (f)), the average ACC of the OMICFS curves is consistently higher than that of the other curves for all of the SVM, KNN, MLP and RF classifiers, except for Diabetic using the SVM and MLP as well as Lung_Cancer using the RF, in which the accuracy of OMICFS is close to that of the other methods, especially that of KCCAmRMR and QPFS. For the other datasets, the comparison results are not consistent across classifiers. In these cases, the average ACC of OMICFS is slightly better than or similar to that of the other three methods, except for the Heart dataset using the RF and the GDS3875 dataset using the KNN (Fig. 2 (b) and (h)), in which OMICFS is clearly not as good as the other methods, especially KCCAmRMR.

OMICFS was also compared with the other three methods by analyzing the peak points on the classification accuracy curves. In Table 2, using the SVM classifier, the average ACC and Kappa for OMICFS are the maximum in 5 and 4, respectively, out of a total of 8 biomedical datasets. For the average AUC, OMICFS has the maximum value in 2 out of the 5 associated datasets. Moreover, the proposed method achieves these peak points with the minimum number of selected features on 3 datasets. For example, with respect to the Diabetic dataset, a maximum average ACC of 76.63% is achieved via OMICFS with only 14 features, whereas a maximum average ACC of 75.67% is achieved through mRMR and KCCAmRMR with all 19 features, and 76.54% through QPFS with 17 features. In Table 3, using the KNN classifier, OMICFS has the maximum average ACC and Kappa in 4 out of 8 biomedical datasets, its average AUC is the maximum in 4 out of the 5 associated datasets, and the peak points are achieved with the minimum number of selected features on 3 datasets. In Table 4, using the MLP classifier, the average ACC and Kappa for OMICFS are the maximum on all 8 biomedical datasets except for 1 dataset in terms of average Kappa, and the maximum average AUC is achieved via OMICFS on 3 out of the 5 associated datasets. In Table 5, using the RF classifier, OMICFS has the maximum average ACC and Kappa in 6 and 5 datasets, respectively, and the maximum average AUC in 2 datasets. Thus, as shown in Fig. 2 and Tables 2, 3, 4 and 5, the proposed OMICFS outperforms mRMR, KCCAmRMR and QPFS in terms of accuracy in most cases.

3.4. Computational efficiency

The computational efficiency of mRMR, KCCAmRMR, QPFS and OMICFS differs. Table 6 shows the average time consumed by the four methods during feature selection on the datasets
involved in the present study. The running time was obtained on a Windows 7 notebook with an Intel Dual-Core i7-6500U 2.50 GHz processor and 8 GB RAM. The results show that mRMR is much more efficient than the other three methods. This is expected, since mRMR typically accepts discretized feature variables. Although KCCAmRMR uses a strategy similar to mRMR for calculating the relevance and redundancy terms, the additional kernel canonical correlation analysis is time-consuming, which makes KCCAmRMR the least computationally economical method. QPFS is the closest approach to mRMR in terms of computational efficiency, but it requires much more time when dealing with high-dimensional data due to the lack of an acceleration algorithm such as those used in the other three methods. The efficiency of OMICFS is clearly not as good as that of mRMR, but it is much better than that of KCCAmRMR, as the proposed method excludes irrelevant redundancy while calculating the max-relevance and min-redundancy without any additional procedures. Compared with QPFS, OMICFS is more efficient in dealing with high-dimensional data due to its acceleration algorithm based on SIS theory.

Table 6
additional procedures. Compared with QPFS, OMICFS is more efficient in dealing with high-dimensional data due to its acceleration algorithm based on SIS theory. Table 6
Comparison of the average computational time (second) consumed by mRMR, KCCAmRMR, QPFS and OMICFS. Datasets
mRMR
KCCAmRMR QPFS
Diabetic
0.03
101.23
0.07
Heart
0.02
2.23
0.04
Parkinson’s
0.05
132.35
0.08
16.55
Seeds
0.01
0.85
0.02
0.11
Prostate_Tumor
9.39
1,209.23
520.20
109.64
Lung_Cancer
15.59
11,109.65
-
480.95
GDS2960
6.59
319.13
233.03
41.37
GDS3875
24.08
9,611.28
-
299.65
EP
TE D
0.23
AC C
4. Discussion
OMICFS
8.12
4.1. Necessity of normalization

A normalization procedure was used to prepare the two types of datasets in the present study. To illustrate its necessity, all the datasets were processed in exactly the same way, except that the normalization procedure was omitted. The results of the proposed OMICFS without normalization are shown in Table 7, including the number of selected features and the average ACC of the peak points using the SVM, KNN, MLP and RF classifiers. Compared with the corresponding parts of Tables 2, 3, 4 and 5, OMICFS performs better in most cases with the help of the normalization procedure. There are at least two reasons for this. First, normalization changes the MIC value: without it, the distribution of points falling into the cells of the rectangular grid differs, which yields a different MIC value. Second, normalization is routinely used to bring different features onto the same scale before they are fed into a classifier (for example, via the svm-scale module in the LIBSVM package), because features with much larger scales can govern the output of classifiers, especially distance-based ones. Thus, the preceding normalization procedure helps OMICFS find promising features that achieve better classification performance.
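To make the scale effect concrete, consider a minimal toy sketch (our own example, not from the paper): two samples differ by 8 units in a large-scale feature and by almost the full range of a small-scale feature. The raw Euclidean distance that a distance-based classifier such as KNN relies on is governed almost entirely by the large-scale feature; after a min-max normalization (in the spirit of LIBSVM's svm-scale module), the small-scale feature contributes its share.

```python
import numpy as np

def min_max_scale(X, lo=0.0, hi=1.0):
    """Linearly rescale each column of X to [lo, hi]."""
    mn, mx = X.min(axis=0), X.max(axis=0)
    span = np.where(mx > mn, mx - mn, 1.0)  # guard against constant columns
    return lo + (hi - lo) * (X - mn) / span

# column 0 lives on a cm-like scale, column 1 on a ratio-like scale
X = np.array([[170.0, 0.02],
              [178.0, 0.95],
              [150.0, 0.50]])

raw_dist = np.linalg.norm(X[0] - X[1])               # dominated by column 0
scaled = min_max_scale(X)
scaled_dist = np.linalg.norm(scaled[0] - scaled[1])  # column 1 now matters
print(round(raw_dist, 3), round(scaled_dist, 3))
```

Before scaling, the distance between the first two samples is about 8.05, essentially the 8-unit gap in the first feature; after scaling it is about 1.04, dominated by the second feature, which spans nearly its whole range between the two samples.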
Table 7
Peak points on the accuracy curves by OMICFS without normalization.

Datasets         Original Fea.(a)   SVM(b)      KNN         MLP         RF
Diabetic         19                 19/76.11    13/70.20    18/75.58    11/70.54
Heart            13                 10/84.45    10/73.70    13/82.60    13/83.70
Parkinson's      26                 20/72.69    8/64.52     25/68.27    17/73.17
Seeds            7                  4/95.24     5/90.48     4/90.95     3/94.76
Prostate_Tumor   10,509             71/94.29    11/93.51    21/90.21    21/94.11
Lung_Cancer      12,600             126/82.82   136/82.16   11/79.09    76/84.01
GDS2960          4,068              156/97.00   71/95.05    6/91.09     76/96.42
GDS3875          16,847             146/96.25   161/82.18   11/76.58    136/88.20

(a) Number of original features.
(b) Number of selected features/average ACC (%).
4.2. Differences with other methods
Compared with other filter feature selection methods based on the max-relevance and min-redundancy criterion, such as mRMR, KCCAmRMR and QPFS, the proposed method, OMICFS, has two main differences that enable competitive data mining. First, an orthogonalization strategy is used to exclude irrelevant redundancy, unlike mRMR, in which the undesirable irrelevant redundancy is involved. Compared with KCCAmRMR, the orthogonalization strategy allows OMICFS to exclude the irrelevant redundancy while optimizing the max-relevance and min-redundancy, making the method much more computationally economical than KCCAmRMR, which requires an additional, time-consuming kernel canonical correlation analysis to filter out irrelevant redundancy. Second, MIC statistics is chosen to explore the relationships between pairs of variables. Compared with the MI statistics in mRMR and KCCAmRMR, as well as the Pearson correlation coefficient in QPFS, the MIC not only effectively captures a wider range of functional relationships (such as exponential, periodic and sinusoidal) but is also applicable to non-functional relationships. In addition, the MIC is general and, on functional relationships under noisy conditions, roughly equal to the coefficient of determination [17]. Thus, the choice of statistics benefits OMICFS, although more computational time is required for MIC.

4.3. Application of OMICFS

As a filter feature selection technique, OMICFS can be used to mine data from various fields. Herein,
we analyzed high-dimensional biomedical data for tasks such as cancer biomarker discovery, which is helpful for the accurate diagnosis of cancer. Heatmaps of the top 20 marker genes selected by OMICFS on the Prostate_Tumor and Lung_Cancer datasets are shown in Fig. 3 (a) and (b), respectively. These color maps show the level of gene expression versus the cancer type and marker gene name, where red and blue represent expression levels above and below the mean, respectively. The results suggest that OMICFS distinguishes between different cancers by identifying a set of key marker genes related to each type of cancer. On these two high-dimensional biomedical datasets, mRMR is considerably faster than OMICFS, but its accuracy is much lower with the KNN classifier. KCCAmRMR and QPFS achieve similar accuracy, but they are much less computationally economical, and QPFS is unable to handle the Lung_Cancer dataset. Thus, a feature selection method should be chosen according to its own characteristics and the specific application.
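The orthogonalization strategy at the heart of OMICFS can be sketched as a greedy loop: at each step, every remaining candidate is projected onto the orthogonal complement of the already-selected features (Gram-Schmidt), and the candidate whose orthogonalized component is most associated with the target wins. The sketch below is a simplified illustration under our own assumptions, not the authors' code: squared Pearson correlation stands in for MIC (minepy [18] would supply the real statistic), and the names `assoc` and `omicfs_sketch` are ours.

```python
import numpy as np

def assoc(u, y):
    """Squared Pearson correlation; a simple stand-in for MIC."""
    u, y = u - u.mean(), y - y.mean()
    denom = np.linalg.norm(u) * np.linalg.norm(y)
    return 0.0 if denom == 0 else float((u @ y) / denom) ** 2

def omicfs_sketch(X, y, k):
    """Greedy selection: orthogonalize each candidate against the selected
    features, then score the residual's association with the target."""
    n, p = X.shape
    selected, basis = [], []            # basis: orthonormal selected directions

    def residual(j):
        r = X[:, j].astype(float)
        for q in basis:                 # Gram-Schmidt: strip selected components
            r = r - (r @ q) * q
        return r

    for _ in range(k):
        scores = [(assoc(residual(j), y), j) for j in range(p) if j not in selected]
        best = max(scores)[1]           # highest-scoring orthogonalized candidate
        selected.append(best)
        r = residual(best)
        nrm = np.linalg.norm(r)
        if nrm > 1e-12:                 # extend the orthonormal basis
            basis.append(r / nrm)
    return selected

# toy data: feature 2 duplicates feature 0, so only one of them should survive
rng = np.random.default_rng(1)
x0, x1 = rng.normal(size=300), rng.normal(size=300)
x2 = x0 + 0.01 * rng.normal(size=300)           # redundant near-copy of x0
X = np.column_stack([x0, x1, x2, rng.normal(size=(300, 5))])
y = x0 + x1
sel = omicfs_sketch(X, y, 2)
print(sel)
```

Because feature 2 is an almost exact copy of feature 0, whichever of the pair is picked first leaves the other with a near-zero orthogonalized component, so the second pick goes to the genuinely complementary feature. This is the redundancy-excluding behavior the orthogonalization provides without any extra analysis step.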
Fig. 3. Heatmaps of the top 20 marker genes selected by OMICFS on the (a) Prostate_Tumor and (b) Lung_Cancer datasets. Cancer types and marker gene names are shown on the horizontal and vertical axes, respectively. Red and blue represent expression levels above and below the mean, respectively; the color scale indicates the Standard Deviation (SD) from the mean.
5. Conclusion

In the present study, a new filter feature selection method, called OMICFS, was proposed for biomedical data mining. Compared with other methods under the max-relevance and min-redundancy criterion, such as mRMR, KCCAmRMR and QPFS, the main advantage of the proposed method is that an orthogonalization strategy is used to exclude the undesired irrelevant redundancy, without any additional procedures, while optimizing the max-relevance and min-redundancy. In addition, MIC statistics has a promising ability to capture complex relationships between pairs of variables. As a result, the proposed method outperforms the other methods in most experiments. OMICFS could serve a variety of data mining applications, particularly the mining of high-dimensional biomedical data.
Acknowledgments
This work was financially supported by the National Natural Science Foundation of China under Grant 61602367 and the China Postdoctoral Science Foundation under Grant 2015M580851.

References
[1] E. Bender, Big data in biomedicine, Nature, 527 (2015) S1.
[2] K.H. Buetow, Cyberinfrastructure: empowering a "third way" in biomedical research, Science, 308 (2005) 821-824.
[3] P. Lambin, E. Rios-Velazquez, R. Leijenaar, S. Carvalho, R.G. van Stiphout, P. Granton, C.M. Zegers, R. Gillies, R. Boellard, A. Dekker, Radiomics: extracting more information from medical images using advanced feature analysis, European Journal of Cancer, 48 (2012) 441-446.
[4] T. Mar, S. Zaunseder, J.P. Martínez, M. Llamedo, R. Poll, Optimization of ECG classification by means of feature selection, IEEE Transactions on Biomedical Engineering, 58 (2011) 2168-2177.
[5] J. Weston, F. Pérez-Cruz, O. Bousquet, O. Chapelle, A. Elisseeff, B. Schölkopf, Feature selection and transduction for prediction of molecular bioactivity for drug design, Bioinformatics, 19 (2003) 764-771.
[6] Y. Saeys, L. Wehenkel, P. Geurts, Statistical interpretation of machine learning-based feature importance scores for biomarker discovery, Bioinformatics, 28 (2012) 1766-1774.
[7] Y. Saeys, I. Inza, P. Larrañaga, A review of feature selection techniques in bioinformatics, Bioinformatics, 23 (2007) 2507-2517.
[8] A. Sharma, K.K. Paliwal, S. Imoto, S. Miyano, A feature selection method using improved regularized linear discriminant analysis, Machine Vision and Applications, 25 (2014) 775-786.
[9] T. Abeel, T. Helleputte, Y. Van de Peer, P. Dupont, Y. Saeys, Robust biomarker identification for cancer diagnosis with ensemble feature selection methods, Bioinformatics, 26 (2010) 392-398.
[10] G. Wong, C. Leckie, A. Kowalczyk, FSR: feature set reduction for scalable and accurate multi-class cancer subtype classification based on copy number, Bioinformatics, 28 (2012) 151-159.
[11] H. Peng, F. Long, C. Ding, Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy, IEEE Transactions on Pattern Analysis and Machine Intelligence, 27 (2005) 1226-1238.
[12] A. Chin, A. Mirzal, H. Haron, H. Hamed, Supervised, unsupervised and semi-supervised feature selection: a review on gene selection, IEEE/ACM Transactions on Computational Biology and Bioinformatics, (2015).
[13] A. Balodi, M.L. Dewal, R.S. Anand, A. Rawat, Texture based classification of the severity of mitral regurgitation, Computers in Biology and Medicine, 73 (2016) 157-164.
[14] P.A. Estevez, M. Tesmer, C.A. Perez, J.M. Zurada, Normalized mutual information feature selection, IEEE Transactions on Neural Networks, 20 (2009) 189-201.
[15] O. Kurşun, C.O. Şakar, O. Favorov, N. Aydin, S.F. Gürgen, Using covariates for improving the minimum redundancy maximum relevance feature selection method, Turkish Journal of Electrical Engineering & Computer Sciences, 18 (2010) 975-989.
[16] C.O. Sakar, O. Kursun, F. Gurgen, A feature selection method based on kernel canonical correlation analysis and the minimum Redundancy-Maximum Relevance filter method, Expert Systems with Applications, 39 (2012) 3432-3437.
[17] D.N. Reshef, Y.A. Reshef, H.K. Finucane, S.R. Grossman, G. McVean, P.J. Turnbaugh, E.S. Lander, M. Mitzenmacher, P.C. Sabeti, Detecting novel associations in large data sets, Science, 334 (2011) 1518-1524.
[18] D. Albanese, M. Filosi, R. Visintainer, S. Riccadonna, G. Jurman, C. Furlanello, Minerva and minepy: a C engine for the MINE suite and its R, Python and MATLAB wrappers, Bioinformatics, 29 (2013) 407-408.
[19] I. Rodriguez-Lujan, R. Huerta, C. Elkan, C.S. Cruz, Quadratic programming feature selection, Journal of Machine Learning Research, 11 (2010) 1491-1516.
[20] J. Fan, J. Lv, Sure independence screening for ultrahigh dimensional feature space, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70 (2008) 849-911.
[21] Q. He, D.-Y. Lin, A variable selection method for genome-wide association studies, Bioinformatics, 27 (2011) 1-8.
[22] C.C. Chang, C.J. Lin, LIBSVM: a library for support vector machines, ACM Transactions on Intelligent Systems and Technology, 2 (2011).
[23] M. Jirina, Jr., Classifiers Based on Inverted Distances, INTECH Open Access Publisher, 2011.
[24] B.V. Dasarathy, Nearest neighbor (NN) norms: NN pattern classification techniques, (1991).
[25] R. Remesan, J. Mathew, Hydroinformatics and data-based modelling issues in hydrology, in: Hydrological Data Driven Modelling, Springer, 2015, pp. 19-39.
[26] C. Heylman, R. Datta, A. Sobrino, S. George, E. Gratton, Supervised machine learning for classification of the electrophysiological effects of chronotropic drugs on human induced pluripotent stem cell-derived cardiomyocytes, PLoS ONE, 10 (2015) e0144572.
[27] M. Kuhn, K. Johnson, Applied Predictive Modeling, Springer, 2013.
[28] J.-H. Kim, Estimating classification error rate: repeated cross-validation, repeated hold-out and bootstrap, Computational Statistics & Data Analysis, 53 (2009) 3735-3745.
[29] K. Bache, M. Lichman, UCI Machine Learning Repository, 2013.
[30] A. Statnikov, I. Tsamardinos, Y. Dosbayev, C.F. Aliferis, GEMS: a system for automated cancer diagnosis and biomarker discovery from microarray gene expression data, International Journal of Medical Informatics, 74 (2005) 491-503.
[31] T. Barrett, S.E. Wilhite, P. Ledoux, C. Evangelista, I.F. Kim, M. Tomashevsky, K.A. Marshall, K.H. Phillippy, P.M. Sherman, M. Holko, NCBI GEO: archive for functional genomics data sets - update, Nucleic Acids Research, 41 (2013) D991-D995.
[32] H. Bengtsson, aroma.light: light-weight methods for normalization and visualization of microarray data using only basic R data types, R package version 1, 2009, URL http://www.braju.com/R/.
[33] X.V. Nguyen, J. Chan, S. Romano, J. Bailey, Effective global approaches for mutual information based feature selection, in: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2014, pp. 512-521.
Highlights

A novel filter feature selection method named OMICFS is proposed.
MIC statistics is employed to quantify the relevance between features and target.
An orthogonalization strategy is used to deal with the irrelevant redundancy risk.
The performance is compared in terms of both accuracy and efficiency.
Hongqiang Lyu received the BS, MS and PhD degrees in Electronic and Information Engineering from Xi'an Jiaotong University, China, in 2007, 2010 and 2015, respectively. He is now a lecturer in the School of Electronic and Information Engineering of the university. His current research interests include bioinformatics and biomedical image processing.
Mingxi Wan received the MS and PhD degrees in Biomedical Engineering from Xi'an Jiaotong University, China, in 1985 and 1989, respectively. He is currently a professor in the Department of Biomedical Engineering, Xi'an Jiaotong University. His current research interests include biomedical ultrasound and biomedical signal processing.
Jiuqiang Han graduated from Xi'an Jiaotong University, China, where he joined the faculty in 1977; he is currently a professor in the School of Electronic and Information Engineering of the university. His current research interests include 3-D image measurement and fusion, bioinformatics and sensor networks.
Ruiling Liu received the BS, MS and PhD degrees in Electronic and Information Engineering from Xi'an Jiaotong University, China, in 2000, 2003 and 2010, respectively. She is currently a lecturer in the School of Electronic and Information Engineering of the university. Her research interests include bioinformatics, artificial intelligence and computer vision.
Cheng Wang received the MS and PhD degrees in Neuroscience from the Chinese Academy of Sciences, China, in 2008 and 2012, respectively. He is currently a postdoctoral fellow in the Krieger School of Arts and Sciences at Johns Hopkins University. His current research focuses on brain signal analysis and processing.