
Oblique Decision Tree Ensemble via Multisurface Proximal Support Vector Machine Le Zhang, Student Member, IEEE, and Ponnuthurai N. Suganthan, Senior Member, IEEE

Abstract—A new approach to generate oblique decision tree ensembles is proposed, wherein each decision hyperplane in the internal nodes of the tree classifiers is not always orthogonal to a feature axis. All training samples in each internal node are grouped into two hyper-classes according to their geometric properties, based on a randomly selected feature subset. The multisurface proximal support vector machine is then employed to obtain two clustering hyperplanes, where each hyperplane is generated such that it is closest to one group of the data and as far as possible from the other group. One of the bisectors of these two hyperplanes is regarded as the test hyperplane for the internal node. Several regularization methods have been applied to handle the small sample size problem as the tree grows. The effectiveness of the proposed method is demonstrated on 44 real-world benchmark classification data sets from various research fields. The classification results show the advantage of the proposed approach in both computation time and classification accuracy.

Index Terms—Bias, ensemble, generalized eigenvectors, oblique, orthogonal, proximal support vector machine (SVM), Random Forest (RaF), Rotation Forest (RoF), variance.

I. INTRODUCTION

Recently, the perturb and combine strategy [1] has become an active research area in machine learning [2], pattern recognition, and computer vision [3]. Extensive research, both theoretical and empirical, has demonstrated the advantages of the combination paradigm over single classifier models. This strategy, also called ensemble classifiers [4] or multiple classifier systems [5], works in two steps. The perturb step applies a given learning algorithm (homogeneous or heterogeneous) to a set of perturbed training data sets. The combine step aggregates the outputs of the learned classifiers in a suitable manner. The success of ensemble classifiers stems from the significant variance reduction of the classifiers [1], [6], [7]. According to the bias and variance decomposition theory [1], [8], the classification error can be decomposed into bias and variance (as shown in Section IV-D). The bias measures how closely the learning algorithm's average guess over all possible training sets of the given training set size matches the target. The variance shows how much the learning algorithm's guess bounces around for the different training sets of the given size.

Manuscript received May 31, 2014; revised August 29, 2014; accepted October 20, 2014. This paper was recommended by Associate Editor X. Wang. The authors are with the Department of Electrical and Computer Engineering, Nanyang Technological University, Singapore 639798 (e-mail: [email protected]). The supplementary file of this paper can be obtained via request to the first author or downloaded from http://web.mysites.ntu.edu.sg/epnsugan/PublicSite/Shared%20Documents/T-Cyb-Oblique-RF-Supp-file.pdf Digital Object Identifier 10.1109/TCYB.2014.2366468

The decision tree is a divide and conquer algorithm known for its simplicity and ease of interpretability. Its recursive partitioning strategy is sensitive to perturbations of the training data, yet sufficiently accurate; thus, it is said to be a low-bias, high-variance classifier, and the performance of such a classifier can be significantly improved by ensemble methods. Two approaches for constructing decision tree ensembles, namely Random Forest (RaF) [9] and Rotation Forest (RoF) [10], seem to be perceived as the state of the art at present. In particular, a recent study [63] rates Random Forests as the best classifier among 179 classifiers over 121 datasets.

RaF combines the concepts of bagging [11] and random subspace [12] to build a classification ensemble with a set of decision trees that grow using randomly selected subspaces of data at each nonleaf node. Since each base classifier is trained on a bootstrap data set whose distribution is similar to that of the whole population, all base classifiers perform well. The random subspace strategy at each node further increases the variance of each base classifier. The rationale behind the random subspace strategy is that the explicit randomization of the attribute selection, combined with ensemble averaging, should be able to reduce variance more strongly than weaker randomization schemes. RaF is a well-known ensemble classifier which has demonstrated its effectiveness in solving problems such as micro-arrays [13], time series [14], spectral data [15], object recognition, and image segmentation [12], [16]. RaF is comparable in performance to many other nonlinear learning algorithms. It performs well with little parameter tuning [17], and is able to identify relevant feature subsets even in the presence of a large number of irrelevant features [13], [18], [19]. More recently, additional applications of RaF such as feature selection [15] and the explorative analysis of sample proximities [20] have gained interest.

RoF [10] draws upon the RaF idea but differs in two aspects. Firstly, in RoF each tree is trained in a rotated feature space of the whole data. Secondly, at each node of the tree, the best split is explicitly searched among all the features. The authors argue that, since the tree learning algorithm builds the classification regions using hyperplanes parallel to the feature axes, a small rotation of the axes may lead to a very different tree; under this circumstance, the diversity among the tree classifiers can be increased. In their study, the authors found that RoF works better than RaF, even more so for smaller ensemble sizes. Recently, RoF has been applied in various research areas such as economics and business, medicine [21], [22], and bioinformatics [23], [24].



Although RaF and RoF have been widely studied, most variants are based on univariate decision trees [25]. The approaches for learning decision trees can be classified into two broad categories: 1) univariate (or axis-parallel, or orthogonal) decision trees explicitly search for the best feature to split on among several features according to some impurity criterion and 2) multivariate (or oblique) decision trees perform a test at each node based on all or a part of the features [26]. Generally speaking, a univariate decision tree can approximate any oblique decision boundary by using a large number of stair-like decision boundaries. In a decision tree, each hyperplane at a nonleaf node separates the data so that the further separations in the children nodes become easier; the hyperplane itself does not have to be a good classifier at that stage [27].

Most decision tree induction algorithms work by finding the split with the lowest impurity score, such as the Gini index, entropy, or the twoing rule. The most commonly used are axis-parallel decision trees, in which, for each feature, the algorithm exhaustively searches for a cutting plane and scores it according to a predefined criterion. The purpose of an impurity criterion is to measure the skewness of the distribution of the different classes in the set of samples reaching the node: it gives a high impurity score to distributions that are near uniform and a low score to distributions in which the number of patterns of one class is much larger than that of the rest. Most of the popular decision tree induction algorithms work by optimizing such an impurity measure. However, many of the impurity measures are not differentiable with respect to the hyperplane parameters, so those algorithms employ search techniques for finding the best hyperplane at each node. For example, CART-LC [28] uses a deterministic hill-climbing algorithm, and OC1 [29] uses a randomized search based on CART-LC. Both of these approaches search one dimension at a time, which becomes computationally cumbersome in high-dimensional feature spaces. Moreover, both algorithms face the local optimum problem; multiple trials or restarts can be employed to decrease the possibility of ending up with a locally optimal solution. An alternative way to optimize over all dimensions simultaneously is to use evolutionary approaches [30], [31]. These are claimed to be tolerant to noisy evaluations of the rating function and also facilitate optimizing multiple rating functions simultaneously [32], [33]. Geurts et al. [6] proposed to introduce stronger randomization into the tree classifier, and Zhang et al. [34] generalized this idea to the oblique case. Other existing works relating to decision tree integration and its generalization can be found in [35]–[37]. However, besides their higher computational complexity, these search approaches involve several parameters which are difficult for practitioners to tune. In [38], a decision tree support vector machine is proposed, which uses a standard support vector machine with an RBF kernel to generate the optimal hyperplane at each node; however, besides careful parameter tuning via extra cross-validation, this method is rather slow. We refer the readers to [39] for more information about decision tree induction methods.

It is worth noting that the problem with all impurity measures, as reported in [27], is that they depend only on the distribution of the different classes on each side of the hyperplane. Hence, the impurity measures do not really capture the


geometric structure of the class regions: the impurity measure of a hyperplane will not change as long as the number of samples of each class on either side of the hyperplane is unchanged, regardless of where those samples actually lie. In contrast to considering only the label information, a geometric criterion takes the internal structure of the data into account by measuring the distances of the data samples to the decision hyperplane, so that a change in any relevant feature changes the hyperplane as well. In order to search for a hyperplane that captures the geometric structure of the class distributions, Manwani and Sastry [27] borrow ideas from a recent variant of the support vector machine (SVM) to generate the geometric decision tree. The multisurface proximal SVM (MPSVM) [40] finds two clustering hyperplanes, one for each class, where each hyperplane is generated such that it is closest to one class of data and as far as possible from the other class, and the data are classified based on their distances to the hyperplanes. In their work, at each node, they use the MPSVM to find two clustering hyperplanes and choose the angle bisector of these two hyperplanes with the better impurity measure. Since the MPSVM was originally proposed for binary problems, they transform the multiclass problem into a binary one by grouping the majority class into one class and the remaining classes into another. As the tree grows, the samples reaching the subsequent nodes become fewer and the small sample size problem arises; the authors propose to use the NULL space [41] method to tackle this.

In this paper, we propose to evaluate the performance of decision tree ensembles which employ geometric decision trees as base classifiers. We propose a different approach to handle the multiclass problem with better geometric properties. We integrate the oblique decision tree into the RaF and RoF frameworks to determine their performances. We also propose two different regularization approaches to address the small sample size problem, since the NULL space approach has been found to be unstable compared to other approaches [42].

The rest of this paper is organized as follows. Section II reviews RaF, RoF, and the MPSVM, as well as the proposed method. Section III presents the experimental study comparing RaF, RoF, and the proposed method. In Section IV, we analyze the experimental results. Section V offers the conclusion and presents our future work.

II. RELATED WORK

A. RaFs

Proposed by Breiman [9], RaF combines the concepts of bagging and random subspaces to increase the variance of the base classifiers. The ensemble consists of decision trees generated on bootstrap samples. RaF is presented in Table I. Each tree classifier in the RaF ensemble is trained on a bootstrap set of the original training set. At each node of the tree classifier, "mtry" features are randomly selected from the n available features. Then, one of the mtry features is selected to perform a partition along this feature axis according to an impurity criterion (e.g., information gain, Gini impurity, etc.) [43]. This decision tree, named the classification and regression tree (CART) by Breiman [9], is a univariate decision tree [26], since the test within each node is performed on only one feature.


TABLE I RAF
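For illustration, the following is a minimal sketch of the RaF training loop just described (a bootstrap sample per tree and a random mtry-sized feature subset evaluated at every node). The use of scikit-learn's CART-style trees with max_features, the integer-label assumption, and all names are illustrative assumptions rather than the pseudocode of Table I.

```python
# Sketch: RaF-style ensemble. Each tree is grown on a bootstrap sample and
# considers a random sqrt(n)-sized feature subset at every node (delegated
# here to scikit-learn's max_features option); prediction is a majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_raf(X, y, n_trees=50, rng=np.random.default_rng(0)):
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))            # bootstrap set
        tree = DecisionTreeClassifier(max_features="sqrt",    # mtry ~ round(sqrt(n))
                                      criterion="gini")       # impurity criterion
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_raf(trees, X):
    votes = np.stack([t.predict(X) for t in trees])           # (n_trees, n_samples)
    # majority vote; assumes integer class labels 0..K-1
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```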


B. RoFs

RoF is another well-known type of decision tree ensemble. It aims at building high-strength and low-correlation classifiers [9]. Bootstrap samples are taken as the training sets for the individual classifiers, as in bagging. The main heuristic is to apply feature extraction and to subsequently reconstruct a full feature set for each classifier in the ensemble. RoF works as follows. Let x = [x_1, . . . , x_n] be a data sample with n features and let X be the training set in the form of an N × n matrix.
1) Split the features into m disjoint subsets. Assume m is a factor of n so that there are M = n/m features in each subset.
2) Denote by F_{i,j} the jth subset of features for the training set of classifier D_i. For every such subset, PCA is employed on a bootstrap sample of all the data. Store all the components \alpha_{i,j}^{(1)}, \alpha_{i,j}^{(2)}, \ldots, \alpha_{i,j}^{(M_j)}, each of size M × 1. Note that M_j may be smaller than M since some of the eigenvalues may be zero.
3) Organize the obtained vectors in a sparse block-diagonal rotation matrix R_i:

R_i = \begin{bmatrix}
\alpha_{i,1}^{(1)}, \ldots, \alpha_{i,1}^{(M_1)} & [0] & \cdots & [0] \\
[0] & \alpha_{i,2}^{(1)}, \ldots, \alpha_{i,2}^{(M_2)} & \cdots & [0] \\
\vdots & \vdots & \ddots & \vdots \\
[0] & [0] & \cdots & \alpha_{i,m}^{(1)}, \ldots, \alpha_{i,m}^{(M_m)}
\end{bmatrix}    (1)

The columns of R_i are rearranged so that they correspond to the original features. Denote the rearranged rotation matrix by R_i^a; the training set for classifier D_i is then X R_i^a, and each tree classifier is generated from X R_i^a. Note that in each node of the tree, the hyperplane is searched over all the features. The details of RoF can be found in Table A-I in the supplementary file.
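As a concrete illustration of steps 1)-3), the following is a minimal sketch of building one rotation matrix R_i with PCA on random disjoint feature subsets. The random permutation used to form the subsets, the retention of all principal axes (the text notes that components with zero eigenvalues may be dropped), and all function names are assumptions made for illustration, not the authors' implementation.

```python
# Sketch: build one RoF rotation matrix R_i. The features are split into
# disjoint subsets, PCA is run on a bootstrap sample restricted to each
# subset, and the principal axes are placed in a block-diagonal matrix that
# is already aligned with the original feature order, so X @ R plays the
# role of X R_i^a in the text.
import numpy as np

def rotation_matrix(X, n_subsets, rng=np.random.default_rng(0)):
    N, n = X.shape
    perm = rng.permutation(n)                          # random grouping of the features
    subsets = np.array_split(perm, n_subsets)
    R = np.zeros((n, n))
    for cols in subsets:
        boot = X[rng.integers(0, N, size=N)][:, cols]  # bootstrap sample, this subset
        boot = boot - boot.mean(axis=0)
        # principal axes of the subset = right singular vectors of the centered data
        _, _, Vt = np.linalg.svd(boot, full_matrices=False)
        R[np.ix_(cols, cols)] = Vt.T                   # block for these features
    return R

# usage sketch: Xr = X @ rotation_matrix(X, n_subsets=max(1, X.shape[1] // 3))
```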

C. Multisurface Proximal Support Vector Machine

The MPSVM was originally proposed for binary classification problems. Let matrix A of size m_1 × n represent the samples of class 1 and let matrix B of size m_2 × n represent the samples of class 2. The MPSVM then seeks two planes in R^n

x^\top W_1 - b_1 = 0, \qquad x^\top W_2 - b_2 = 0    (2)

where the first plane (W_1, b_1) is closest to the samples of class 1 and furthest from the samples of class 2, while the second plane (W_2, b_2) is closest to the samples of class 2 and furthest from the samples of class 1. To obtain the first plane, the authors propose to minimize the sum of squared two-norm distances from the samples of class 1 to the plane, divided by the sum of squared two-norm distances from the samples of class 2 to the plane. Hence, the first plane can be obtained by solving the following optimization problem:

\min_{(W,b) \neq 0} \frac{\|AW - eb\|^2 / \|[W^\top\ b]^\top\|^2}{\|BW - eb\|^2 / \|[W^\top\ b]^\top\|^2}    (3)

and similarly, the second plane can be obtained by solving

\min_{(W,b) \neq 0} \frac{\|BW - eb\|^2 / \|[W^\top\ b]^\top\|^2}{\|AW - eb\|^2 / \|[W^\top\ b]^\top\|^2}    (4)

where \|\cdot\| denotes the two-norm and e is a vector of ones with the same dimension as AW and BW. By defining

G = [A \;\; -e]^\top [A \;\; -e], \quad H = [B \;\; -e]^\top [B \;\; -e], \quad z = [W^\top \;\; b]^\top    (5)

the optimization problem (3) becomes

\min_{z \neq 0} \frac{z^\top G z}{z^\top H z}    (6)

Similarly, by defining

L = [B \;\; -e]^\top [B \;\; -e], \quad M = [A \;\; -e]^\top [A \;\; -e]    (7)

the optimization problem (4) becomes

\min_{z \neq 0} \frac{z^\top L z}{z^\top M z}.    (8)

Thus, the two clustering hyperplanes can be found as the eigenvectors corresponding to the smallest eigenvalues of the following two generalized eigenvalue problems:

Gz = \lambda H z, \; z \neq 0, \qquad Lz = \gamma M z, \; z \neq 0.    (9)
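A minimal sketch of (5)-(9) follows. The small ridge added to both matrices is an assumption made for numerical stability (it anticipates the Tikhonov regularization (11) discussed later) rather than part of the bare formulation, and the function name is illustrative.

```python
# Sketch: MPSVM clustering hyperplanes via the generalized eigenvalue
# problems (9). Returns (W1, b1) and (W2, b2) of the planes x'W - b = 0
# that are closest to class 1 and class 2, respectively.
import numpy as np
from scipy.linalg import eigh

def mpsvm_planes(A, B, delta=1e-2):
    e1 = np.ones((A.shape[0], 1))
    e2 = np.ones((B.shape[0], 1))
    Ae = np.hstack([A, -e1])            # [A  -e]
    Be = np.hstack([B, -e2])            # [B  -e]
    G = Ae.T @ Ae                       # eq. (5); note L = H and M = G in (7)
    H = Be.T @ Be
    ridge = delta * np.eye(G.shape[0])  # assumed ridge, cf. (11)
    # smallest generalized eigenvector of G z = lambda H z -> plane for class 1
    _, V = eigh(G + ridge, H + ridge)
    z1 = V[:, 0]
    # smallest generalized eigenvector of H z = gamma G z -> plane for class 2
    _, V = eigh(H + ridge, G + ridge)
    z2 = V[:, 0]
    return (z1[:-1], z1[-1]), (z2[:-1], z2[-1])
```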

Due to the way they are defined, these clustering hyperplanes capture geometric properties of the data that are useful for discriminating the classes. Hence, a hyperplane that passes in between them could be good for splitting the feature space. In [27], the hyperplane that bisects the angle between the clustering hyperplanes is taken as the split rule at a nonleaf node of a decision tree.


TABLE II MULTI 2 BINARY

Since, in general, there would be two angle bisectors, they choose the better bisector based on an impurity measure, i.e., the Gini index. If the two clustering hyperplanes happen to be parallel to each other, they take a hyperplane midway between the two as the split rule. Promising results are reported in their work. Hence, we develop our method further from their idea. The following subsection describes the details of the proposed method.

D. Proposed Method

In order to employ the hyperplane that bisects the clustering hyperplanes generated by the MPSVM as a test function, at least two problems need to be addressed.
1) How to handle the multiclass problem, since the MPSVM can only solve binary classification problems.
2) In solving the generalized eigenvalue problems (9), it cannot be guaranteed that the matrices G and L are always positive definite. This small sample size problem is particularly likely to occur during decision tree induction: as the tree grows, the number of samples reaching lower nodes becomes smaller and smaller, while the matrices always remain of size (n + 1) × (n + 1).

Several approaches have been proposed to transform the multiclass problem into several binary ones, such as one-against-all [44], one-against-one [45], DAG [46], ECOC [47], and so on. Although they are reported to be promising, they are used as preprocessing approaches and they greatly increase the computational complexity. The recursive partitioning mechanism gives the decision tree an advantage over other binary algorithms. In [27], the multiclass problem is transformed into a binary one by taking the majority class as one hyper-class and all the remaining classes as the other. This approach fails to capture the geometric data structure. Here, we propose to separate all classes into two hyper-classes based on their separability. In statistics, the Bhattacharyya distance measures the similarity between two discrete or continuous probability distributions, and it is considered a good measure of separability between two normal classes w_1 ~ N(\mu_1, \Sigma_1) and w_2 ~ N(\mu_2, \Sigma_2). We adopt the multivariate Gaussian model for several reasons, as reported in [42]. First, it is the most natural distribution: the sum of a large number of independent random variables follows a Gaussian distribution.

TABLE III
NULL SPACE APPROACH

It also has the maximum uncertainty among all distributions with a given mean and variance. Moreover, it is an appropriate model in many situations. The Bhattacharyya distance is defined as

B(w_1, w_2) = \frac{1}{8}(\mu_2 - \mu_1)^\top \left[\frac{\Sigma_1 + \Sigma_2}{2}\right]^{-1} (\mu_2 - \mu_1) + \frac{1}{2}\ln\frac{\left|\frac{\Sigma_1 + \Sigma_2}{2}\right|}{\sqrt{|\Sigma_1|\,|\Sigma_2|}}.    (10)
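For illustration, the sketch below evaluates (10) from sample estimates and then forms two hyper-classes. Since Table II specifies the exact transformation, the greedy grouping used here (seed the hyper-classes with the most separable pair of classes, then attach every other class to the nearer seed) is only an assumed, plausible reading, and the small ridge eps and all names are likewise assumptions.

```python
# Bhattacharyya distance (10) between two classes modeled as Gaussians,
# and an assumed greedy grouping of the classes into two hyper-classes.
import numpy as np
from itertools import combinations

def bhattacharyya(X1, X2, eps=1e-6):
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    n = X1.shape[1]
    S1 = np.cov(X1, rowvar=False) + eps * np.eye(n)   # ridge: assumption for stability
    S2 = np.cov(X2, rowvar=False) + eps * np.eye(n)
    S = (S1 + S2) / 2.0
    d = mu2 - mu1
    term1 = d @ np.linalg.solve(S, d) / 8.0
    # 0.5 * ln(|S| / sqrt(|S1||S2|)) computed with log-determinants
    term2 = 0.5 * (np.linalg.slogdet(S)[1]
                   - 0.5 * (np.linalg.slogdet(S1)[1] + np.linalg.slogdet(S2)[1]))
    return term1 + term2

def two_hyperclasses(X, y):
    classes = list(np.unique(y))
    D = {(a, b): bhattacharyya(X[y == a], X[y == b])
         for a, b in combinations(classes, 2)}
    key = lambda a, b: (a, b) if classes.index(a) < classes.index(b) else (b, a)
    seed1, seed2 = max(D, key=D.get)                  # most separable pair of classes
    group1 = [c for c in classes if c == seed1 or
              (c != seed2 and D[key(c, seed1)] <= D[key(c, seed2)])]
    return np.isin(y, group1)                         # boolean hyper-class labels
```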

Our multiclass-to-two-class transformation approach is presented in Table II.

In order to handle the small sample size problem, Manwani and Sastry [27] proposed to use the NULL space approach to regularize the relevant matrix, followed by decision tree pruning. Their NULL space approach is presented in Table III. In this paper, we apply two different regularization approaches, because we also found that the NULL space approach often lowers the performance under perturbation of the data when there are limited data samples, as highlighted in [48]. Decision tree pruning may alleviate this problem because most of the nodes near the leaves may be pruned; those nodes are highly unstable because the number of data samples reaching them is small. However, there is no consensus as to whether pruning or growing to the largest extent is the best strategy for decision tree ensembles [10], and in decision tree ensembles all base classifiers are usually permitted to grow to the largest extent. Here, we apply two regularization approaches: 1) Tikhonov regularization [49] and 2) axis-parallel split regularization. Tikhonov regularization is often used to regularize least squares and mathematical programming problems [50], [51], among other applications. It works by simply adding a constant term to the diagonal entries of the matrix to be regularized. In our case, if G becomes rank deficient, it is regularized as

G \leftarrow G + \delta I.    (11)

A similar approach can be applied to regularize matrix H. If matrix G or H becomes singular at a node, we can always use an axis-parallel split to continue the tree induction process.


TABLE IV
CLASSIFICATION ACCURACY OF RAF AND ITS MPSVM-BASED VARIANTS

In that case, the decision tree grows using heterogeneous test functions: from the root node down to such a node, the tree uses MPSVM-based splits; from that node to the leaves, it switches to the axis-parallel splitting method. MPRaF-T, MPRaF-P, and MPRaF-N denote the MPSVM-based RaFs with Tikhonov, axis-parallel, and NULL space regularization, respectively. MPRoF-T, MPRoF-P, and MPRoF-N denote the MPSVM-based RoFs with Tikhonov, axis-parallel, and NULL space regularization, respectively.

III. EXPERIMENTAL VALIDATION

Experiments were set up to evaluate the performances of the MPSVM-based RaFs and RoFs with different regularization methods. For a fair comparison, CART [28] was used as the base classifier for all ensemble methods. The Gini impurity, defined as

\text{Gini}(t) = \frac{n_{t_l}}{n_t}\left[1 - \sum_{i=1}^{c}\left(\frac{n_{w_i l}}{n_{t_l}}\right)^2\right] + \frac{n_{t_r}}{n_t}\left[1 - \sum_{i=1}^{c}\left(\frac{n_{w_i r}}{n_{t_r}}\right)^2\right]    (12)

where n_t is the number of data samples reaching the node, n_{t_l} and n_{t_r} are the numbers of data samples that reach the left and right child nodes of the current node, respectively, and n_{w_i l} and n_{w_i r} are the numbers of samples of class w_i in the left and right child nodes, respectively, is employed as the impurity criterion in all cases. All methods in this paper are tested on 44 real-world benchmark classification data sets from various research fields. We select two face recognition data sets: 1) ORL and 2) YALE. Two bioinformatics data sets (AFP-Pred [52] and BLProt [53]) are also selected. The remaining data sets are from the UCI repository. The data sets used in this paper are summarized in Table A-II in the supplementary file.
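Relating this criterion back to the node test of Section II-D: the sketch below scores a candidate hyperplane with the impurity (12) and uses it to choose between the two angle bisectors of the clustering planes (for example, those returned by a routine such as mpsvm_planes above). The bisector construction from normalized plane coefficients is standard geometry; the helper names are illustrative assumptions.

```python
# Sketch: pick the node test as the angle bisector of the two MPSVM planes
# (w1, b1), (w2, b2) that yields the lower Gini impurity (12).
import numpy as np

def gini_of_split(X, y, w, b):
    side = (X @ w - b) <= 0                       # boolean split induced by the plane
    total = len(y)
    gini = 0.0
    for mask in (side, ~side):
        n_child = int(mask.sum())
        if n_child == 0:
            return np.inf                         # degenerate split
        _, counts = np.unique(y[mask], return_counts=True)
        p = counts / n_child
        gini += (n_child / total) * (1.0 - np.sum(p ** 2))
    return gini

def best_bisector(X, y, plane1, plane2):
    (w1, b1), (w2, b2) = plane1, plane2
    z1 = np.append(w1, b1) / np.linalg.norm(w1)   # normalize by the plane normals
    z2 = np.append(w2, b2) / np.linalg.norm(w2)
    candidates = [z1 + z2, z1 - z2]               # the two angle bisectors
    scores = [gini_of_split(X, y, z[:-1], z[-1]) for z in candidates]
    best = candidates[int(np.argmin(scores))]
    return best[:-1], best[-1]                    # (w, b) of the chosen test hyperplane
```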

We follow [54] and [55] to perform feature extraction from the face images. After this step, ORL has a size of 400 × 59 and Yale has a size of 165 × 14. The simulations of the different algorithms on all data sets are carried out in MATLAB 2010b on an Intel Core i5 3.20-GHz CPU with 4-GB RAM.

For decision trees, the parameter "minleaf," which controls the maximum number of samples within an impure node, is usually set to 1 by default. For all ensemble methods, the ensemble size L can be regarded as a hyperparameter. Generally, the performances of RaF and RoF do not become worse when L is increased. Considering the computational complexity, we conduct all experiments with L = 50. For RaF, another parameter, mtry, which represents the number of features randomly selected at a node, controls the randomness of the algorithm. This parameter is set to round(√n), where n is the dimensionality of a given data set. For RoF, each feature subset has three features [10]. If n is not divisible by 3, the final subset is completed with 1 or 2 features, as necessary, randomly selected from other feature subsets. The Tikhonov regularization term δ is set to 0.01 for all experiments in this paper. One may argue that it could be optimized by either a separate validation data set or out-of-bag data. However, this may become computationally infeasible since there are many nodes in the ensemble, especially for a very large number of training samples. We attempted to optimize this parameter in each node with out-of-bag data, but no significant improvement was observed. The nature of the ensemble method makes it robust with respect to this parameter.

IV. EXPERIMENTAL RESULTS AND DISCUSSION

A. Does MPSVM Improve the Decision Tree Ensemble?

In the first set of experiments, we compare the classification accuracies of the MPSVM-based RaF and RoF with three different regularization approaches against the standard RaF and RoF.


TABLE V
CLASSIFICATION ACCURACY OF ROF AND ITS MPSVM-BASED VARIANTS

The classification accuracies for each data set are presented in Tables IV and V. For each data set, 10 runs of threefold cross-validation were carried out. For data sets that come with a predefined training/testing partition, we train the model on the training set and evaluate it on the testing set, and this procedure is repeated ten times. Boldface indicates the best results.

From Tables IV and V, we can see that, in most cases, the MPSVM-based RaF outperforms the standard RaF or at least achieves comparable performance. However, for RoF, it can be observed that only the MPSVM regularized by the axis-parallel split works satisfactorily. The reason is that, in the induction of a CART in RoF, the dimensions of matrices G and H are always (n + 1) × (n + 1), where n is the dimension of a given data sample (for RaF, this dimension is only (mtry + 1) × (mtry + 1)). As the tree grows, the number of data samples reaching a given node becomes smaller than n (even under the strong assumption that the rows of G and H are linearly independent). Hence, more nodes in the decision tree need to be regularized than in RaF. The eigenvalue problem is somewhat sensitive to the Tikhonov regularization and NULL space approaches [48]. In our case, the problem tends to be more serious because there are numerous nodes to be regularized. In order to solve this problem, we propose to employ the random subspace in the base learner of RoF and name the result random RoF (RRoF): in each node, the test function is evaluated on a randomly selected feature subset instead of the whole feature set. In this case, RaF and RRoF differ in the way they perturb the data: RaF uses bagging to create a data subset, whereas RRoF employs a different rotation matrix for each tree. Experiments were carried out to evaluate the random subspace concept in RoF. The results are summarized in Table VI. From Table VI, we can see that the MPSVM-based RoF can be improved significantly by using the random subspace in each node.

B. Comparisons Among RaF, RoF, and RoF With Random Subspace (RRoF)

The main purpose here is to investigate whether the MPSVM improves the decision tree ensemble, rather than to compare the performances of RaF and RoF. Therefore, we have to determine whether the improvement, if observed, is caused by the different randomization strategies (random subspace or rotation matrix) or by the MPSVM. In order to assess the statistical significance of the results, we carry out a Friedman test as explained in [56]. The Friedman test has been shown to be more robust than other approaches and has been used by numerous researchers [56], [57]. It ranks the algorithms for each data set separately, the best performing algorithm getting the lowest rank. Let r_i^j be the rank of the jth of k algorithms on the ith of N data sets. The Friedman test compares the average ranks of the algorithms, R_j = \frac{1}{N}\sum_i r_i^j. Under the null hypothesis, which states that all the algorithms are equivalent and so their ranks R_j should be equal, the Friedman statistic

\chi_F^2 = \frac{12N}{k(k+1)}\left[\sum_j R_j^2 - \frac{k(k+1)^2}{4}\right]    (13)

is distributed according to \chi^2 with k − 1 degrees of freedom when N and k are big enough. Since this statistic is undesirably conservative, a better statistic

F_F = \frac{(N-1)\chi_F^2}{N(k-1) - \chi_F^2}    (14)

was derived, which is distributed according to the F-distribution with k − 1 and (k − 1)(N − 1) degrees of freedom. If the null hypothesis is rejected, a post-hoc test, the Nemenyi test [58], can be used to check whether the performances of two among the k classifiers are significantly different.
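The sketch below computes (13) and (14), together with the Nemenyi critical difference (15) introduced next, from a table of per-data-set accuracies; the scipy calls are standard, while the q_alpha value (e.g., 2.343 for three and 2.569 for four classifiers at alpha = 0.05) must be supplied from a studentized-range table.

```python
# Friedman statistic (13), corrected statistic (14), and Nemenyi critical
# difference (15), computed from an (N datasets x k algorithms) accuracy table.
import numpy as np
from scipy.stats import f as f_dist, rankdata

def friedman_nemenyi(acc, q_alpha):
    N, k = acc.shape
    # rank within each data set; higher accuracy -> lower (better) rank
    ranks = np.apply_along_axis(lambda a: rankdata(-a), 1, acc)
    R = ranks.mean(axis=0)                                  # average rank per algorithm
    chi2_f = 12.0 * N / (k * (k + 1)) * (np.sum(R ** 2) - k * (k + 1) ** 2 / 4.0)
    f_f = (N - 1) * chi2_f / (N * (k - 1) - chi2_f)         # eq. (14)
    p_value = f_dist.sf(f_f, k - 1, (k - 1) * (N - 1))      # tail of the F-distribution
    cd = q_alpha * np.sqrt(k * (k + 1) / (6.0 * N))         # eq. (15)
    return R, chi2_f, f_f, p_value, cd
```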


TABLE VI
CLASSIFICATION ACCURACIES OF THE PROPOSED ROF AND ITS MPSVM-BASED VARIANTS

The performances of two classifiers are significantly different if the corresponding average ranks differ by at least the critical difference

CD = q_\alpha \sqrt{\frac{k(k+1)}{6N}}    (15)

where the critical values q_\alpha are based on the studentized range statistic divided by \sqrt{2}. Thus, we first rank the performances of RaF, RoF, and RRoF on each data set and average the ranks across all the data sets. After simple calculations, the average ranks for RaF, RoF, and RRoF are 2.24, 2.07, and 1.69, respectively. Then

\chi_F^2 = \frac{12 \times 44}{3 \times 4}\left[2.24^2 + 2.07^2 + 1.69^2 - \frac{3 \times 4^2}{4}\right] = 6.98, \qquad F_F = \frac{43 \times 6.98}{44 \times 2 - 6.98} = 3.71.    (16)

With three algorithms and 44 data sets, F_F is distributed according to the F-distribution with 3 − 1 = 2 and (3 − 1)(44 − 1) = 86 degrees of freedom. The critical value of F(2, 86) for α = 0.05 lies within [2.68, 2.76], so we reject the null hypothesis. Based on the Nemenyi test, the critical difference here is CD = q_\alpha\sqrt{k(k+1)/(6N)} = 2.343\sqrt{3 \times 4/(6 \times 44)} \approx 0.5 for α = 0.05. Since 2.24 and 1.69 form the only pair of average ranks whose difference is larger than 0.5, we can conclude that RRoF is significantly better than RaF.

Note that Rodriguez et al. [10] reported that RoF is better than RaF in their experiments. We do observe that RoF is slightly better than RaF, but the difference is not statistically significant under the Friedman test. We list the potential reasons. Firstly, Rodriguez et al. [10] used J48 as the base classifier in RoF, whereas in this paper CART is used as the base classifier in all ensemble methods.

Secondly, in order to make the comparison more robust, we use data sets from various research fields. Thirdly, we use the nonparametric Friedman test, which has wider validity than other tests. Fourthly, as the authors suggest, RoF works better especially for small ensemble sizes; in their study, they fixed the ensemble size to 10. Based on the experience of practitioners of ensemble methods, 10 may not be large enough for bagging-based methods such as RaF, so we increase it to 50. Though RRoF works slightly better than RoF, there is no significant difference between them. As RRoF works well with the MPSVM, hereafter we investigate the performance of this RoF variant in various scenarios.

C. Which Regularization Works Better for the MPSVM-Based Decision Tree Ensemble?

In this section, we analyze the performance of the different regularizations for the MPSVM-based RaF and RoF. Similarly to the above, the average ranks for RaF, MPRaF-T, MPRaF-P, and MPRaF-N are 2.95, 1.97, 2.11, and 2.97, respectively. Then

\chi_F^2 = \frac{12 \times 44}{4 \times 5}\left[2.95^2 + 1.97^2 + 2.11^2 + 2.97^2 - \frac{4 \times 5^2}{4}\right] = 22.61, \qquad F_F = \frac{43 \times 22.61}{44 \times 3 - 22.61} = 8.89.    (17)

With four algorithms and 44 data sets, F_F is distributed according to the F-distribution with 4 − 1 = 3 and (4 − 1) × (44 − 1) = 129 degrees of freedom. The critical value of F(3, 129) for α = 0.05 is smaller than 2.68, so we reject the null hypothesis. In this case, the critical difference is CD = q_\alpha\sqrt{k(k+1)/(6N)} = 2.569\sqrt{4 \times 5/(6 \times 44)} \approx 0.71 for α = 0.05. It is therefore clear that MPRaF-T and MPRaF-P perform significantly better than the standard RaF and MPRaF-N.


TABLE VII
SIGNIFICANCE OF THE DIFFERENCE FOR RAF

Table VII summarizes the statistical test for RaF. In [27], the NULL space approach together with the MPSVM improves a single decision tree significantly. In this paper, however, we found that the NULL space approach can be unstable depending on the data, as observed in [48]. For a single tree, this problem can be partially solved by decision tree pruning: as mentioned above, the most unstable nodes can be eliminated from the decision tree by pruning. However, for a decision tree ensemble, pruning the trees only increases the computational complexity with negligible improvement.

Similarly, the average ranks for RRoF, MPRRoF-T, MPRRoF-P, and MPRRoF-N are 2.91, 2.19, 2.15, and 2.75, respectively. Then

\chi_F^2 = \frac{12 \times 44}{4 \times 5}\left[2.91^2 + 2.19^2 + 2.15^2 + 2.75^2 - \frac{4 \times 5^2}{4}\right] = 11.86, \qquad F_F = \frac{43 \times 11.86}{44 \times 3 - 11.86} = 4.25.    (18)

We again reject the null hypothesis. Thus, a similar conclusion can be drawn: MPRRoF-T and MPRRoF-P lead to significantly better performance than RRoF and MPRRoF-N, as shown in Table VIII.

D. Bias-Variance Analysis

To analyze the reason for the success of the MPSVM-based decision tree ensembles, we compute the bias and variance values of the ensembles. The bias-variance insight was originally borrowed from the field of regression with squared loss as the loss function [59]. For classification problems, this decomposition is inappropriate because class labels are categorical, which means it is not proper to transplant the decomposition of error in regression tasks to classification. Fortunately, several ways to decompose error into bias and variance terms in classification tasks have been proposed [60]–[62]. Each of these definitions is able to provide some insight into different aspects of a learning algorithm's performance. We consider classification problems with the 0–1 loss function as in [8] and give a brief review; more details can be found in Section C of the supplementary file. Let X and Y be the input and output spaces, respectively, with cardinalities |X| and |Y| and elements x and y, respectively. The target f is a conditional probability distribution P(Y_F = y_F | x), where Y_F is a Y-valued random variable. Then, for a single test sample,

E(C) = \sum_{x} P(x)\left[(\text{bias}_x)^2 + \sigma_x^2 + \text{variance}_x\right]    (19)

TABLE VIII
SIGNIFICANCE OF THE DIFFERENCE FOR RANDOM ROF

where

(\text{bias}_x)^2 = \frac{1}{2}\sum_{y \in Y}\left[P(Y_F = y) - P(Y_H = y)\right]^2, \quad
\text{variance}_x = \frac{1}{2}\left[1 - \sum_{y \in Y} P(Y_H = y)^2\right], \quad
\sigma_x^2 = \frac{1}{2}\left[1 - \sum_{y \in Y} P(Y_F = y)^2\right].    (20)
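A minimal sketch of estimating (20) from repeated runs is given below: P(Y_H = y | x) is approximated by the empirical distribution of the learner's predictions for x over several training replicates, while the unknown target P(Y_F = y | x) is assumed to be a point mass on the observed test label, which folds the noise term into the bias as discussed in the next paragraph; the callback name and integer-label assumption are illustrative.

```python
# Sketch: estimate the averaged (bias_x)^2 and variance_x of (20) for one
# learner, from repeated training sets and a fixed test set.
import numpy as np

def bias_variance(train_sets, test_X, test_y, fit_predict, n_classes):
    # fit_predict(X, y, test_X) -> predicted labels; integer labels 0..K-1 assumed.
    preds = np.stack([fit_predict(X, y, test_X) for X, y in train_sets])
    # empirical P(Y_H = y | x) over the repeated runs, shape (n_test, n_classes)
    p_h = np.stack([(preds == c).mean(axis=0) for c in range(n_classes)], axis=1)
    p_f = np.eye(n_classes)[test_y]                      # point-mass target P(Y_F | x)
    bias2 = 0.5 * np.sum((p_f - p_h) ** 2, axis=1)       # (bias_x)^2 per test point
    variance = 0.5 * (1.0 - np.sum(p_h ** 2, axis=1))    # variance_x per test point
    return bias2.mean(), variance.mean()
```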

The (bias_x)^2 and variance_x values are estimated for each method and each data set; (bias_x)^2 is abbreviated as bias_x. Note that, theoretically, the prediction error of a classifier should be decomposed into three terms: noise (also named the irreducible error), squared bias, and variance. However, it is usually difficult to estimate the noise in real-world learning tasks, for which the true underlying class distribution is unknown and there are generally too few instances at any given point in the input feature space to reliably estimate the class distribution at that point. In commonly used methods, the noise is therefore aggregated into both the bias and variance terms, or into the bias term only, since the noise is invariant across learning algorithms for a single learning task and hence not a significant factor in comparative evaluations. We refer the reader to Section III of the supplementary file for details.

Tables A-III and A-IV in the supplementary file report the estimated bias and variance values; boldface indicates the best bias/variance. In most cases, the MPSVM-based decision tree ensembles have the best bias/variance values. A statistical test for significant differences among the bias/variance values is carried out for each ensemble. In this test, the smallest bias/variance value gets rank 1 and the largest value gets the highest rank. The results are summarized in Fig. 1. Since no significant difference among the variances of the base classifiers is detected for RaF and its MPRaF variants, nor for RRoF, we do not offer the graphical visualization for these cases. In the graphics, the top line in each diagram is the axis on which the average ranks of the methods are plotted. The axis is reversed so that the lowest (best) ranks are to the right, since we perceive the methods on the right side as better. The analysis reveals the following.
1) For RaF and its MPSVM-based variants, though there is no significant difference among their variances, the MPRaF variants tend to reduce the variance to a larger extent, especially MPRaF-T and MPRaF-P. Though the Nemenyi test fails to detect a significant difference between MPRaF-P and RaF, the difference between them is very close to the critical difference.


TABLE IX
SIGNIFICANCE OF THE DIFFERENCE BETWEEN MPRAF AND MPRROF WITH THE SAME REGULARIZATION METHOD

TABLE X
AVERAGE RANK FOR DIFFERENT MINLEAF PARAMETER IN EACH ENSEMBLE METHOD

Fig. 1. Comparison of the bias or variance of the methods against each other with the Nemenyi test. Groups of methods whose values are not significantly different (at α = 0.05) are connected. (a) Bias-RaF. (b) Bias-base of RaF. (c) Bias-RRoF. (d) Bias-base of RRoF. (e) Variance-RaF. (f) Variance-RRoF.

In fact, a sign test between RaF and MPRaF-P does detect a significant difference. Exactly the same conclusion can be drawn for RRoF and its MPSVM-based variants.
2) For the bias part, MPRaF-T and MPRaF-P generate lower bias than the others, with MPRaF-P slightly better, which demonstrates that the MPSVM can better capture the geometric structure of the data. For RRoF, MPRRoF-N is significantly worse than the others. This further indicates that RoF tends to generate base classifiers with slightly lower bias than RaF.

E. For a Given Regularization Approach, Which Ensemble Method Is Better?

In order to investigate which ensemble strategy, namely RaF or RRoF, performs better with the MPSVM and a given regularization approach, we use a sign test to compare each pair of algorithms. If the two compared algorithms are, as assumed under the null hypothesis, equivalent, each should win on approximately N/2 out of the N data sets; if the number of wins is at least N/2 + 1.96\sqrt{N}/2, the algorithm is significantly better with p < 0.05. Since tied matches support the null hypothesis, we should not discount them but split them evenly between the two classifiers; if there is an odd number of them, we ignore one. The results are summarized in Table IX. From Table IX, we can see that, except for the NULL space regularization, there is no significant difference between MPRaF and MPRRoF with the same regularization method.

F. On the Effect of mtry

The parameter mtry denotes the number of features randomly selected at each node. For a given problem, the smaller mtry is, the stronger the randomization of the trees and the weaker the dependence of their structures on the output.

However, if mtry is small, the features randomly selected at a node may fail to capture the geometry of the data samples. In order to see how this parameter influences accuracy, and to support our default settings, we conducted a set of experiments on several data sets by varying the parameter over its range. Fig. A-II in the supplementary file shows the evolution of the accuracy with respect to different values of mtry for the Parkinsons and wine quality (red) data sets. For very small values of mtry, the accuracies of all the ensemble methods are very low, especially for the MPSVM-based ensembles. However, as mtry grows, the accuracies of all MPSVM-based ensembles improve significantly and become stable very quickly, except for the MPSVM with NULL space regularization; as mentioned before, the NULL space approach is unstable. Setting mtry to round(√n) leads to satisfactory results. For the other data sets, a similar trend can be observed; we omit those figures considering the limited space.

G. On the Effect of minleaf

In a decision tree ensemble, the parameter minleaf is the maximum number of data samples in an impure leaf node. Generally, a larger value of minleaf leads to smaller trees with higher bias and lower variance. Zhang and Zhang [7] reported that the accuracy of a decision tree ensemble is quite robust to this parameter, whereas others [19] advocate that the optimal value of this parameter varies in different situations. Here, we conduct another set of experiments with all ensemble methods on all 44 data sets, with minleaf varying from 1 to 3. Considering the limited space, we omit the exact accuracy values for each minleaf. The average rank for each method with different minleaf values is summarized in Table X. Table X reveals the following.
1) Though there is no significant difference in some cases, the smaller the minleaf, the better, since a smaller minleaf always leads to a lower average rank. It is therefore reasonable to set minleaf = 1. This further supports the view that the decision trees in the ensemble should grow as large as possible.
2) Except for the MPSVM regularized by the NULL space approach, RoF seems to be more robust to minleaf than RaF.


H. Computational Complexity

Here, we make no assumption about the tree structure. In the univariate decision tree case, for a given node, suppose there are m data samples with n features each. The axis-parallel split algorithm first ranks each feature and tries to find a threshold that optimizes the impurity criterion. Disregarding the computational time for calculating the impurity value, the complexity of finding such a best split is on the order of nm log m (m log m for each feature). For the MPSVM-based decision tree, according to [40], the complexity of the generalized eigenvalue problem is of order n^3. So, in most cases, the number of data samples within a node is far larger than n^2, which indicates that the MPSVM-based decision tree is much faster than the univariate one. Even for those nodes near the leaves, MPRaF-P and MPRRoF-P have exactly the same computational complexity as the univariate split approach.

Tables A-V and A-VI in the supplementary file summarize the average number of nodes in a decision tree in each case and the computation time for each ensemble. Fig. A-I in the supplementary file gives a graphical overview of the results in Tables A-V and A-VI. Here, we define the SpeedUp Factor for MPRaF-T as

\text{SpeedUp Factor for MPRaF-T} = \frac{\text{Training time of RaF}}{\text{Training time of MPRaF-T}}.    (21)

Similarly, we define the Node Ratio for MPRaF-T as

\text{Node Ratio for MPRaF-T} = \frac{N_{av}(\text{MPRaF-T})}{N_{av}(\text{RaF})}    (22)

where N_{av}(E) stands for the average number of nodes in the base classifiers of ensemble E (for example, E can be RaF). Here, only the training time is considered, since we find no significant difference among the testing times of the ensemble methods. From Fig. A-I, the MPSVM-based ensemble methods generate deeper trees than their standard counterparts, especially for the Tikhonov and axis-parallel regularizations. However, it can be observed that the MPSVM-based ensembles need less training time than their standard counterparts in most cases. This is because, in most cases, especially for those nodes near the root, the MPSVM method is much faster than the conventional exhaustive search.

V. CONCLUSION

We have proposed a novel approach to generate decision tree ensembles and applied it to the well-known RaF and RoF. It consists of splitting the data samples in a node by a hyperplane with unconstrained orientation, where the hyperplane is generated by the MPSVM. However, the optimization of the MPSVM problem needs more than n + 1 data samples whose feature vectors are linearly independent. Here, we use two novel regularization strategies: 1) Tikhonov and 2) axis-parallel regularization. Moreover, we use the Bhattacharyya distance to transform the multiclass classification problem into a binary classification problem. The performances of these two strategies, together with the NULL space approach, are tested on 44 benchmark data sets from various research fields. Several conclusions can be drawn from this investigation.
1) The MPSVM-based decision trees perform well with RaF. However, for RoF, the split at each node is based on all features, so more nodes need to be regularized in RoF than in RaF. Here, we propose an RRoF approach which employs the random subspace in each node of the RoF. This modification leads to a significant improvement for the MPSVM-based RoF in most cases.
2) The MPSVM-based ensembles (MPRaF-T, MPRaF-P, MPRRoF-T, and MPRRoF-P) have more significant variance reduction and lower bias than the standard ensemble methods.
3) RoF is slightly better than RaF. RRoF is slightly better than RoF and significantly better than RaF.
4) For a given regularization approach, there is no significant difference between RaF and RRoF except for the NULL space. Though the performances of MPRaF-N and MPRRoF-N are both poor, MPRRoF-N is significantly better than MPRaF-N.
5) Among the three regularization strategies, the NULL space method is unstable. In the decision tree ensemble, the unstable nodes cannot be pruned by decision tree pruning. Hence, MPRaF-N or MPRRoF-N yields lower performance in most cases.
6) The performances of MPRaF and MPRRoF are robust to their parameters. The classification accuracy is low if the mtry parameter, which controls the number of features to be randomly selected in a node, is extremely low or high. Setting mtry around round(√n) (where n is the number of features of a given data sample) leads to satisfactory performance. Setting the minleaf parameter, which stands for the maximum number of data samples in an impure leaf node, to 1 is the optimum.
7) The MPSVM-based ensemble methods have lower computational complexity than the standard ones in most cases. More specifically, only the MPSVM with axis-parallel regularization has slightly larger computational complexity than its standard counterpart in several cases.

Future studies and developments of the MPSVM-based decision tree ensemble include evaluating the performance of the ensembles with nonlinearities in some nodes and applying the MPSVM-based decision tree ensembles to solve real-world problems.

REFERENCES [1] L. Breiman, "Bias, variance, and arcing classifiers," Dept. Stat., Univ. California, Berkeley, CA, USA, Tech. Rep. 460, 1996. [2] M. A. Wiering and H. van Hasselt, "Ensemble algorithms in reinforcement learning," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4, pp. 930–936, Aug. 2008.


[3] J. S. Goerss, “Tropical cyclone track forecasts using an ensemble of dynamical models,” Mon. Weather Rev., vol. 128, no. 4, pp. 1187–1193, 2000. [4] T. G. Dietterich, “Ensemble methods in machine learning,” in Multiple Classifier Systems. Berlin, Germany: Springer, 2000, pp. 1–15. [5] Z.-H. Zhou, F. Roli, and J. Kittler, “Multiple classifier systems,” in Proc. 2013 11th Int. Workshop Mult. Classifier Syst. (MCS), Nanjing, China. [6] P. Geurts, D. Ernst, and L. Wehenkel, “Extremely randomized trees,” Mach. Learn., vol. 63, no. 1, pp. 3–42, 2006. [7] C.-X. Zhang and J.-S. Zhang, “RotBoost: A technique for combining rotation forest and AdaBoost,” Pattern Recognit. Lett., vol. 29, no. 10, pp. 1524–1536, 2008. [8] R. Kohavi and D. H. Wolpert, “Bias plus variance decomposition for zero-one loss functions,” in Proc. Int. Conf. Mach. Learn. (ICML), Bari, Italy, 1996, pp. 275–283. [9] L. Breiman, “Random forests,” Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001. [10] J. J. Rodriguez, L. I. Kuncheva, and C. J. Alonso, “Rotation forest: A new classifier ensemble method,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 10, pp. 1619–1630, Oct. 2006. [11] L. Breiman, “Bagging predictors,” Mach. Learn., vol. 24, no. 2, pp. 123–140, 1996. [12] T. K. Ho, “The random subspace method for constructing decision forests,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 8, pp. 832–844, Aug. 1998. [13] H. Jiang et al., “Joint analysis of two microarray gene-expression data sets to select lung adenocarcinoma marker genes,” BMC Bioinformat., vol. 5, no. 1, p. 81, 2004. [14] K.-Q. Shen, C.-J. Ong, X.-P. Li, Z. Hui, and E. P. Wilder-Smith, “A feature selection method for multilevel mental fatigue EEG classification,” IEEE Trans. Biomed. Eng., vol. 54, no. 7, pp. 1231–1237, Jul. 2007. [15] B. H. Menze et al., “A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data,” BMC Bioinformat., vol. 10, no. 1, p. 213, 2009. [16] T. Hastie, R. Tibshirani, and J. J. H. Friedman, The Elements of Statistical Learning, vol. 1. New York, NY, USA: Springer, 2001. [17] T. Hothorn, F. Leisch, A. Zeileis, and K. Hornik, “The design and analysis of benchmark experiments,” J. Comput. Graph. Stat., vol. 14, no. 3, pp. 675–699, 2005. [18] A. Liaw and M. Wiener, “Classification and regression by randomforest,” R News, vol. 2, no. 3, pp. 18–22, 2002. [19] Y. Lin and Y. Jeon, “Random forests and adaptive nearest neighbors,” J. Amer. Stat. Assoc., vol. 101, no. 474, pp. 578–590, 2006. [20] B. H. Menze, W. Petrich, and F. A. Hamprecht, “Multivariate feature selection and hierarchical classification for infrared spectroscopy: Serum-based detection of bovine spongiform encephalopathy,” Anal. Bioanal. Chem., vol. 387, no. 5, pp. 1801–1807, 2007. [21] K.-H. Liu and D.-S. Huang, “Cancer classification using rotation forest,” Comput. Biol. Med., vol. 38, no. 5, pp. 601–610, 2008. [22] A. Ozcift, “SVM feature selection based rotation forest ensemble classifiers to improve computer-aided diagnosis of Parkinson disease,” J. Med. Syst., vol. 36, no. 4, pp. 2141–2147, 2012. [23] J.-F. Xia, K. Han, and D.-S. Huang, “Sequence-based prediction of protein–protein interactions by means of rotation forest and autocorrelation descriptor,” Protein Peptide Lett., vol. 17, no. 1, pp. 137–145, 2010. [24] G. Stiglic and P. Kokol, “Effectiveness of rotation forest in metalearning based gene expression classification,” in Proc. 2007 20th IEEE Int. Symp. 
Comput.-Based Med. Syst. (CBMS), Maribor, Slovenia, pp. 243–250. [25] R. E. Banfield, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer, “A comparison of decision tree ensemble creation techniques,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 1, pp. 173–180, Jan. 2007. [26] K. V. S. Murthy and S. L. Salzberg, “On growing better decision trees from data,” Ph.D. dissertation, Dept. Comput. Sci., Johns Hopkins Univ., Baltimore, MD, USA, 1995. [27] N. Manwani and P. Sastry, “Geometric decision tree,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 1, pp. 181–192, Feb. 2012. [28] L. Breiman, Classification and Regression Trees. Boca Raton, FL, USA: CRC Press, 1993. [29] S. K. Murthy, S. Kasif, S. Salzberg, and R. Beigel, “OC1: A randomized algorithm for building oblique decision trees,” in Proc. 11th Nat. Conf. Artif. Intell. (AAAI), vol. 93. Washington, DC, USA, 1993, pp. 322–327. [30] W. Pedrycz and Z. A. Sosnowski, “Genetically optimized fuzzy decision trees,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 35, no. 3, pp. 633–641, Jun. 2005.


[31] S.-H. Cha and C. Tappert, “A genetic algorithm for constructing compact binary decision trees,” J. Pattern Recognit. Res., vol. 4, no. 1, pp. 1–13, 2009. [32] E. Cantu-Paz and C. Kamath, “Inducing oblique decision trees with evolutionary algorithms,” IEEE Trans. Evol. Comput., vol. 7, no. 1, pp. 54–68, Feb. 2003. [33] J. M. Pangilinan and G. K. Janssens, “Pareto-optimality of oblique decision trees from evolutionary algorithms,” J. Global Optim., vol. 51, no. 2, pp. 301–311, 2011. [34] L. Zhang, Y. Ren, and P. Suganthan, “Towards generating random forests via extremely randomized trees,” in Proc. IEEE 2014 Int. Joint Conf. Neural Netw. (IJCNN), Beijing, China, pp. 2645–2652. [35] X.-Z. Wang, J.-H. Zhai, and S.-X. Lu, “Induction of multiple fuzzy decision trees based on rough set technique,” Inf. Sci., vol. 178, no. 16, pp. 3188–3202, 2008. [36] X.-Z. Wang and C.-R. Dong, “Improving generalization of fuzzy IF–THEN rules by maximizing fuzzy entropy,” IEEE Trans. Fuzzy Syst., vol. 17, no. 3, pp. 556–567, Jun. 2009. [37] L. Zhang and P. Suganthan, “Random forests with ensemble of feature spaces,” Pattern Recognit., vol. 47, no. 10, pp. 3429–3437, 2014. [38] L. Zhang, W.-D. Zhou, T.-T. Su, and L.-C. Jiao, “Decision tree support vector machine,” Int. J. Artif. Intell. Tools, vol. 16, no. 1, pp. 1–15, 2007. [Online]. Available: http://search.ebscohost.com.ezlibproxy1.ntu.edu.sg/ login.aspx?direct=true&db=cph&AN=24099026&site=ehost-live [39] L. Rokach and O. Maimon, “Top-down induction of decision trees classifiers—A survey,” IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 35, no. 4, pp. 476–487, Nov. 2005. [40] O. L. Mangasarian and E. W. Wild, “Multisurface proximal support vector machine classification via generalized eigenvalues,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 1, pp. 69–74, Jan. 2006. [41] L.-F. Chen, H.-Y. M. Liao, M.-T. Ko, J.-C. Lin, and G.-J. Yu, “A new LDA-based face recognition system which can solve the small sample size problem,” Pattern Recognit., vol. 33, no. 10, pp. 1713–1726, 2000. [42] X. Jiang, “Linear subspace learning-based dimensionality reduction,” IEEE Signal Process. Mag., vol. 28, no. 2, pp. 16–26, Mar. 2011. [43] L. Breiman, Classification and Regression Trees. Belmont, CA, USA: Wadsworth Int. Group, 1984. [44] L. Bottou et al., “Comparison of classifier methods: A case study in handwritten digit recognition,” in Proc. 12th IEEE IAPR Int. Conf. Pattern Recognit. B, Comput. Vis. Image Process., vol. 2. Jerusalem, Israel, 1994, pp. 77–82. [45] S. Knerr, L. Personnaz, and G. Dreyfus, “Single-layer learning revisited: A stepwise procedure for building and training a neural network,” in Neurocomputing. Berlin, Germany: Springer, 1990, pp. 41–50. [46] J. C. Platt, N. Cristianini, and J. Shawe-Taylor, “Large margin DAGs for multiclass classification,” in Proc. Adv. Neural Inf. Process. Syst., vol. 12. Denver, CO, USA, 1999, pp. 547–553. [47] T. G. Dietterich and G. Bakiri, “Solving multiclass learning problems via error-correcting output codes,” J. Artif. Intell. Res., vol. 2, pp. 263–286, Jan. 1995. [48] X. Jiang, B. Mandal, and A. Kot, “Eigenfeature regularization and extraction in face recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 3, pp. 383–394, Mar. 2008. [49] J. Marroquin, S. Mitter, and T. Poggio, “Probabilistic solution of illposed problems in computational vision,” J. Amer. Stat. Assoc., vol. 82, no. 397, pp. 76–89, 1987. [50] O. L. Mangasarian and R. Meyer, “Nonlinear perturbation of linear programs,” SIAM J. 
Control Optim., vol. 17, no. 6, pp. 745–752, 1979. [51] T. Evgeniou, M. Pontil, and T. Poggio, “Regularization networks and support vector machines,” Adv. Comput. Math., vol. 13, no. 1, pp. 1–50, 2000. [52] K. K. Kandaswamy et al., “AFP-Pred: A random forest approach for predicting antifreeze proteins from sequence-derived properties,” J. Theor. Biol., vol. 270, no. 1, pp. 56–62, 2011. [53] K. Kandaswamy, G. Pugalenthi, M. Hazrati, K.-U. Kalies, and T. Martinetz, “BLProt: Prediction of bioluminescent proteins based on support vector machine and relieff feature selection,” BMC Bioinformat., vol. 12, no. 1, p. 345, 2011. [54] X. He, S. Yan, Y. Hu, P. Niyogi, and H.-J. Zhang, “Face recognition using Laplacianfaces,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 3, pp. 328–340, Mar. 2005. [55] D. Cai, X. He, J. Han, and H.-J. Zhang, “Orthogonal Laplacianfaces for face recognition,” IEEE Trans. Image Process., vol. 15, no. 11, pp. 3608–3614, Nov. 2006. [56] J. Demšar, “Statistical comparisons of classifiers over multiple data sets,” J. Mach. Learn. Res., vol. 7, pp. 1–30, Jan. 2006.


[57] L. I. Kuncheva and J. J. Rodriguez, “Classifier ensembles with a random linear oracle,” IEEE Trans. Knowl. Data Eng., vol. 19, no. 4, pp. 500–508, Apr. 2007. [58] P. Nemenyi, Distribution-Free Multiple Comparisons. Princeton, NJ, USA: Princeton Univ., 1963. [Online]. Available: http://books.google.com.sg/books?id=nhDMtgAACAAJ [59] S. Geman, E. Bienenstock, and R. Doursat, “Neural networks and the bias/variance dilemma,” Neural Comput., vol. 4, no. 1, pp. 1–58, 1992. [60] E. B. Kong and T. G. Dietterich, “Error-correcting output coding corrects bias and variance,” in Proc. Int. Conf. Mach. Learn. (ICML), Tahoe City, CA, USA, 1995, pp. 313–321. [61] J. H. Friedman, “On bias, variance, 0/1—Loss, and the curse-ofdimensionality,” Data Min. Knowl. Disc., vol. 1, no. 1, pp. 55–77, 1997. [62] G. M. James, “Variance and bias for general loss functions,” Mach. Learn., vol. 51, no. 2, pp. 115–135, 2003. [63] M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim, “Do we need hundreds of classifiers to solve real world classification problems?” J. Mach. Learn. Res., vol. 15, pp. 3133–3181, Oct. 2014. [Online]. Available: http://jmlr.org/papers/v15/delgado14a.html

Le Zhang (S’13) received the B.Eng. degree from the University of Electronic Science and Technology of China, Chengdu, China, and M.Sc. degree from Nanyang Technological University, Singapore, in 2011 and 2012, respectively. He is currently pursuing the Ph.D. degree from the School of Electrical and Electronic Engineering, Nanyang Technological University. His current research interests include various ensemble-based classification methods.

Ponnuthurai N. Suganthan (S’90–M’92– SM’00–F’15) received the B.A., Post-Graduate Certificate, and M.A. degrees in electrical and information engineering from the University of Cambridge, Cambridge, U.K., in 1990, 1992, and 1994, respectively, and the Ph.D. degree from the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore. Since 1999, he has been with the School of Electrical and Electronic Engineering, Nanyang Technological University. His current research interests include evolutionary computation, pattern recognition, multiobjective evolutionary algorithms, applications of evolutionary computation, and neural networks. He has 13 000 Google Scholar citations to his publications. His SCI indexed publications attracted over 1000 SCI citations in 2013 alone. Prof. Suganthan was the recipient of the Outstanding Paper Award for the IEEE T RANSACTIONS ON E VOLUTIONARY C OMPUTATION in 2012, for SaDE paper published in 2009. He is an Associate Editor of the IEEE T RANSACTIONS ON C YBERNETICS, the IEEE T RANSACTIONS ON E VOLUTIONARY C OMPUTATION, Information Sciences (Elsevier), Pattern Recognition (Elsevier), and the International Journal of Swarm Intelligence Research. He is a Founding Co-Editor-in-Chief of Swarm and Evolutionary Computation (Elsevier). He is an Editorial Board Member of the Evolutionary Computation Journal (MIT Press).
