Neural Networks 76 (2016) 29–38
http://dx.doi.org/10.1016/j.neunet.2015.10.006

A Fast Reduced Kernel Extreme Learning Machine

Wan-Yu Deng (a,b), Yew-Soon Ong (a,*), Qing-Hua Zheng (c)

(a) Rolls-Royce@NTU Corporate Lab, School of Computer Engineering, Nanyang Technological University, Singapore
(b) School of Computer, Xi'an University of Posts & Telecommunications, Shaanxi, China
(c) Department of Computer Science and Technology, Xi'an Jiaotong University, China
(*) Corresponding author. E-mail address: [email protected] (Y.-S. Ong).

Article history: Received 7 March 2015; Received in revised form 10 July 2015; Accepted 15 October 2015; Available online 6 January 2016.

Keywords: Extreme learning machine; Kernel method; Support vector machine; RBF network

Abstract

In this paper, we present a fast and accurate kernel-based supervised algorithm referred to as the Reduced Kernel Extreme Learning Machine (RKELM). In contrast to the work on the Support Vector Machine (SVM) or Least Square SVM (LS-SVM), which identifies the support vectors or weight vectors iteratively, the proposed RKELM randomly selects a subset of the available data samples as support vectors (or mapping samples). By avoiding the iterative steps of SVM, significant cost savings in the training process can be readily attained, especially on Big datasets. RKELM is established on a rigorous proof of universal learning involving the reduced kernel-based SLFN. In particular, we prove that RKELM can approximate any nonlinear function accurately under the condition of support vector sufficiency. Experimental results on a wide variety of real-world small and large instance size applications in the context of binary classification, multi-class problems and regression are then reported, showing that RKELM can perform at a competitive level of generalization performance to the SVM/LS-SVM at only a fraction of the computational effort incurred.

1. Introduction

Kernel based learning methods have been extensively used for solving classification and regression problems due to their high generalization performance and the mathematical rigor of the field (Neuvial, 2013). To date, a plethora of kernel based learning methods, like the support vector machine (SVM) (Vapnik, 1995) and its variants including LS-SVM (Suykens & Vandewalle, 1999), RSVM (Lee & Huang, 2007) and LLSVM (Zhang, Lan, Wang, & Moerchen, 2012), have been proposed for data analysis. The classical SVM involves a mapping of the training data into a high dimensional feature space through some nonlinear feature mapping function. A standard optimization method then follows so as to arrive at an appropriate solution that maximizes the margin of separation between the different classes in the nonlinear feature space, while minimizing the training errors. A least square version of the SVM classifier was subsequently proposed in Suykens and Vandewalle (1999). In contrast to the inequality constraints adopted in a classical SVM, equality constraints are considered in the LS-SVM (Suykens & Vandewalle,




1999). Instead of quadratic programming, one can thus implement the least square approach with ease by solving a set of linear equations. Notably, LS-SVM has been reported to exhibit superior generalization performance and low computational requirements in many applications. Recently, a unified learning framework for regression and classification, termed the Kernel Extreme Learning Machine (KELM), was also proposed as an extension of the Extreme Learning Machine (ELM) learning theory and the classification capability theorem. In the KELM unified learning framework, the bias term in the optimization constraints of SVM, LS-SVM and PSVM can be eliminated. This gives rise to a learning algorithm with milder optimization constraints that offers improved generalization performance and low computational complexity. Nonetheless, when dealing with large scale problems, the conventional SVM and LS-SVM, as well as the KELM, do not scale well with big datasets in general.

In the past decades, many large scale learning algorithms have been proposed. Lee and Huang (2007) proposed the reduced support vector machine (RSVM), which restricts the number of candidate support vectors. The main characteristic of this method is to reduce the kernel matrix from N × N to N × Ñ, where N is the number of training instances and Ñ denotes the size of a randomly selected subset of training data that serve as candidate support vectors. Wang, Crammer, and Vucetic (2012) proposed a budgeted stochastic gradient descent approach for training SVMs


(BSGD-SVM). The approach bounds the number of support vectors during training through several budget maintenance strategies, including removal, projection and merging. Chang, Guo, Lin, and Lu (2010) proposed an approach that employs a decision tree to decompose the given data space as a first step before training SVMs on the decomposed regions; the reported results indicated a notable speed-up in training time at competitive test accuracy. LASVM (Bordes, Ertekin, Weston, & Bottou, 2005), on the other hand, is a one-pass online SVM that involves iterations of sequential minimal optimization (SMO) during each model update so as to remove data samples that are deemed unlikely to serve as suitable SVs from the training set. Tsang, Kwok, and Cheung (2005) scale up the kernel SVM by reformulating the quadratic programming used in SVM as a minimum enclosing ball problem and then using an efficient approximation algorithm to attain a near-optimal solution. In spite of the extensive efforts in the area to cope with the increasing data instance size and dimensionality more elegantly, existing works have mainly focused on developing effective strategies for identifying the optimal set of support vectors. The computational process of identifying support vectors, however, can become very intensive, especially when dealing with large scale data.

In what follows, the core objectives and contributions of the present work are outlined: (1) we propose a fast non-iterative kernel machine, referred to as the fast Reduced Kernel Extreme Learning Machine (RKELM), based on kernel-based learning and the extreme learning machine. A key characteristic of the present work is that support vectors are randomly chosen from the training set, as opposed to through some sophisticated process which is often compute-intensive. (2) We prove that RKELM can approximate any nonlinear function with zero error provided that the kernel is strictly positive definite and all training data are chosen as support vectors. (3) We analyze the relation of the present work to other related works, such as KELM, ELM and RSVM, and reveal the effects of the hidden node size and the regularization parameter on generalization performance.

The rest of the paper is organized as follows: Section 2 gives a brief overview of the classic ELM (Huang, Zhu, & Siew, 2006) and the KELM (Huang et al., 2006). Section 3 presents the mathematical derivation of RKELM. The relation of RKELM to other relevant state-of-the-art algorithms is then discussed in Section 4. The performance evaluation and validation of RKELM is subsequently presented in Section 5, using well-established datasets from the UCI repository and other public sources. Section 6 reports a sensitivity and stability analysis of the RKELM parameters. Last but not least, brief conclusions and future work are given in Section 7.

2. A brief review of Extreme Learning Machine and its kernel extension

In this section, an overview of the ELM and its kernel extension (Huang et al., 2006) is presented. This serves to provide the necessary background for the development of the RKELM in Section 3.

2.1. ELM

The Extreme Learning Machine (ELM) was proposed as a fast learning method for the Single-hidden-Layer Feedforward Neural network (SLFN) in Feng, Ong, and Lim (2013), Feng, Ong, Lim, and Tsang (2015), Huang, Chen, and Siew (2006), Huang, Zhu, and Siew (2004) and Rong, Ong, Tan, and Zhu (2008), where the hidden layer can be any form of piecewise continuous computational function, including Sigmoid, Radial Basis, trigonometric, threshold, ridge polynomial, fully complex, fuzzy inference, high-order, wavelet, etc. (Huang & Chen, 2007). In ELM, the number of hidden nodes serves as a structural parameter that needs to be predefined, while

parametric settings of the hidden nodes, for example the impact factors of the RBF nodes, or the biases and/or input weights of the additive nodes, are randomly assigned. Consider training samples {(x_i, t_i) | x_i ∈ R^d, t_i ∈ R^m}_{i=1}^N, where N is the number of instances, d is the input dimension and m is the number of output nodes. For regression problems m = 1, while for classification problems m is the number of categories, classes or labels. The output function of ELM for SLFNs is given by

$$ f(\mathbf{x}) = \sum_{i=1}^{L} \beta_i\, h(\mathbf{x}, \mathbf{a}_i, b_i) = h(\mathbf{x})\,\boldsymbol{\beta} \qquad (1) $$

where L is the number of hidden nodes, β = [β_1, ..., β_L]^T is the vector of output weights, a_i is the center of the RBF nodes or the input weights of the additive nodes, b_i is the impact factor of the RBF nodes or the bias of the additive nodes, and h(·) is the activation function, which can be (but is not limited to) the Sigmoid, Sine or hardlim function. The ELM can be solved as a constrained optimization problem (Huang, 2014; Huang, Zhou, Ding, & Zhang, 2012; Huang et al., 2006):

$$ \text{Minimize}_{\boldsymbol{\beta}}: \; \|H\boldsymbol{\beta} - T\|_p^{\alpha_1} + \frac{C}{2}\,\|\boldsymbol{\beta}\|_q^{\alpha_2} \qquad (2) $$

where α_1 > 0, α_2 > 0, p, q = 0, 1/2, 1, 2, ..., F, +∞, and C is a control parameter for the tradeoff between structural risk and empirical risk. H is the hidden-layer output matrix

$$
H = \begin{bmatrix}
h(\mathbf{w}_1, \mathbf{x}_1, b_1) & \cdots & h(\mathbf{w}_L, \mathbf{x}_1, b_L) \\
\vdots & \ddots & \vdots \\
h(\mathbf{w}_1, \mathbf{x}_N, b_1) & \cdots & h(\mathbf{w}_L, \mathbf{x}_N, b_L)
\end{bmatrix}
\quad \text{and} \quad
T = \begin{bmatrix} \mathbf{t}_1^{\top} \\ \vdots \\ \mathbf{t}_N^{\top} \end{bmatrix}. \qquad (3)
$$

A number of efficient methods may be used to determine the output weights β, such as the orthogonal projection method, iterative methods (Luo, Vong, & Wong, 2014), the eigenvalue decomposition method (Golub & Van Loan, 1996) and others. When p = q = F and α_1 = α_2 = 2, a popular and efficient closed-form solution (Huang et al., 2012) is:

$$
\boldsymbol{\beta} =
\begin{cases}
\left(C\,\mathbf{I} + H^{\top}H\right)^{-1} H^{\top}\, T, & N \ge L\\
H^{\top}\left(C\,\mathbf{I} + H H^{\top}\right)^{-1} T, & N \le L.
\end{cases} \qquad (4)
$$
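For concreteness, the following is a minimal NumPy sketch of ELM training with sigmoid additive nodes, using the N ≥ L branch of Eq. (4); the function names, the uniform weight initialization and the sigmoid choice are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def elm_train(X, T, L, C, rng=None):
    """Basic ELM: random hidden nodes, then the closed-form solution of Eq. (4)."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(d, L))        # random input weights a_i
    b = rng.uniform(-1.0, 1.0, size=L)             # random biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))         # hidden-layer output matrix H
    beta = np.linalg.solve(C * np.eye(L) + H.T @ H, H.T @ T)   # N >= L branch
    return W, b, beta

def elm_predict(Xnew, W, b, beta):
    """Output function of Eq. (1): f(x) = h(x) beta."""
    H = 1.0 / (1.0 + np.exp(-(Xnew @ W + b)))
    return H @ beta
```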

2.2. Kernel Extreme Learning Machine

As proposed in Huang et al. (2012), if h(·) is unknown, i.e., an implicit function, one can apply Mercer's conditions on ELM and define a kernel matrix for ELM that takes the form:

$$ K_{\mathrm{ELM}} = H H^{\top}: \quad K_{\mathrm{ELM}\,i,j} = h(\mathbf{x}_i)\cdot h(\mathbf{x}_j) = \kappa(\mathbf{x}_i, \mathbf{x}_j). \qquad (5) $$

Then, substituting (5) and (4) into (1), we can obtain the kernel form of the output function as follows:

$$ f(\mathbf{x}) = \begin{bmatrix} \kappa(\mathbf{x}, \mathbf{x}_1) \\ \vdots \\ \kappa(\mathbf{x}, \mathbf{x}_N) \end{bmatrix}^{\top} \left(C\,\mathbf{I} + K_{\mathrm{ELM}}\right)^{-1} T. \qquad (6) $$

Similar to the SVM (Vapnik, 1995) and LS-SVM (Suykens & Vandewalle, 1999), h(x) need not be known; instead, its kernel κ(u, v) (e.g., the Gaussian kernel κ(u, v) = exp(−‖u − v‖²/σ)) can be provided, and L need not be specified beforehand either. The experimental and theoretical analysis of Huang et al. (2012) showed that KELM produces improved generalization performance over the SVM/LS-SVM. The work, however, was established only on small datasets. When dealing with Big data, the training time of O(N³) and the kernel matrix size of O(N²) become a significant concern (Zhai, Ong, & Tsang, 2014).
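The kernel variant can be sketched in the same style; the code below implements Eqs. (5)–(6) with a Gaussian kernel and makes the O(N²) kernel-matrix storage and O(N³) solve explicit. The helper names are illustrative.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """kappa(u, v) = exp(-||u - v||^2 / sigma)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / sigma)

def kelm_train(X, T, C, sigma):
    """KELM: K_ELM(i, j) = kappa(x_i, x_j), then solve (C*I + K_ELM) alpha = T."""
    K = gaussian_kernel(X, X, sigma)                        # N x N kernel matrix
    alpha = np.linalg.solve(C * np.eye(X.shape[0]) + K, T)  # O(N^3) solve
    return alpha

def kelm_predict(Xnew, X, alpha, sigma):
    """Eq. (6): f(x) = [kappa(x, x_1), ..., kappa(x, x_N)] (C*I + K_ELM)^{-1} T."""
    return gaussian_kernel(Xnew, X, sigma) @ alpha
```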



3. Fast reduced kernel extreme learning machine

For N arbitrary distinct samples {(x_i, t_i) | x_i ∈ R^d, t_i ∈ R^m}_{i=1}^N, the inputs are denoted as X = {x_i | x_i ∈ R^d}_{i=1}^N and the outputs are denoted as T = {t_i | t_i ∈ R^m}_{i=1}^N. For regression problems, t_i ∈ R^m is a continuous real vector. For m-category multi-class problems, the output t_i ∈ R^m is an m-dimensional Boolean vector {0, 1}^m. If the class label is p, the expected output vector of size m is t_i = [0, ..., 0, 1, 0, ..., 0]_{1×m}; in this case, only the pth element of t_i = [t_1, t_2, ..., t_m] has value '1', while the rest of the elements are '0'. Binary classification is considered a special case of the multi-class problem where m = 2.

Differing from conventional kernel-based algorithms, for instance the conventional SVM (LibSVM) (Chang & Lin, 2011) and BSGD-SVM (Wang et al., 2012), which involve some form of iterative optimization scheme to identify the support vectors, or which eventually use all the training samples as support vectors (e.g., KELM and LS-SVM), we assert in this work that the support vectors can be randomly selected from the instances of the training dataset. Since the current work is derived from KELM but uses a reduced kernel matrix instead of the full kernel matrix to build the model, the proposed algorithm is labeled here as the Reduced Kernel Extreme Learning Machine (RKELM). In what follows, we describe the proposed algorithm in detail.

The SLFN with kernel function κ(·, ·) and L support vectors X_L = {x_i | x_i ∈ R^d}_{i=1}^L can be mathematically modeled as

$$ \sum_{s=1}^{L} \beta_s\, \kappa(\mathbf{x}_i, \mathbf{x}_s) = \mathbf{t}_i, \qquad i = 1, 2, \ldots, N \qquad (7) $$

or, compressed in matrix form,

$$ K_{N\times L}\,\boldsymbol{\beta} = T \qquad (8) $$

where K_{N×L} = κ(X, X_L) is the reduced kernel matrix and β = [β_1, β_2, ..., β_L] is the output weight vector or matrix.
Theorem 1. For N arbitrary distinct samples {(x_i, t_i) | x_i ∈ R^d, t_i ∈ R^m}_{i=1}^N, if a SLFN with an SPD kernel and L = N random support vectors X_L = R(X) is considered (i.e., all training samples are support vectors), then the kernel matrix K_{N×N} ∈ R^{N×N} obtained from the N samples and N support vectors is invertible and ‖K_{N×N}β − T‖² = 0. (The proof is provided in Appendix A.)

Theorem 1 implies that, if κ(·, ·) is a strictly positive definite (SPD) kernel (Vapnik, 1995), a SLFN with L = N random support vectors can learn the N distinct samples with zero error, i.e., ‖K_{N×N}β − T‖ = 0. Furthermore, given any small positive value ϵ and an SPD kernel, there should exist L ≤ N random support vectors such that, for N arbitrary distinct samples {(x_i, t_i)}_{i=1}^N, ‖K_{N×L}β − T‖ < ϵ. In many real-world applications of the Big data era, one can safely assume that the number of support vectors L is always less than N; hence, the training error cannot be made exactly zero but approaches a non-zero training error ϵ. Similar to Eq. (2), the RKELM with optimization constraints such as ℓ2-norm minimization can be formulated as

$$ \min_{\boldsymbol{\beta}} \; \frac{C}{2}\|\boldsymbol{\beta}\|^2 + \frac{1}{2}\|\boldsymbol{\xi}\|^2 \quad \text{subject to:} \quad \boldsymbol{\xi} = K_{N\times L}\,\boldsymbol{\beta} - T. \qquad (9) $$

Based on the KKT theorem, the solution is derived as

$$ \boldsymbol{\beta} = \left(C\,\mathbf{I} + K_{N\times L}^{\top} K_{N\times L}\right)^{-1} K_{N\times L}^{\top}\, T. \qquad (10) $$

The proposed fast RKELM is summarized in Algorithm 1.

Algorithm 1: Proposed Fast RKELM Algorithm

Input: N training samples ℵ = {(x_i, t_i) | x_i ∈ R^d, t_i ∈ R^m}_{i=1}^N, the number of support vectors L, and the kernel function κ(·, ·) (e.g., Gaussian).
Output: The output weights β.
1: Randomly select L input samples from the training set as support vectors: X_L ← R(X).
2: Generate the reduced kernel matrix from the N samples and L support vectors: K_{N×L} = κ(X, X_L).
3: Estimate the output weights: β = (C I + K_{N×L}^T K_{N×L})^{-1} K_{N×L}^T T.

Remark 1. Although the KELM does not involve an iterative learning procedure, its time complexity of O(N³) can still be very high, and often higher than that of many iterative algorithms; the core benefit of ELM, namely fast learning, is then lost. In contrast, the training cost of RKELM at O(L³) is far lower than that of KELM at O(N³), since L ≪ N in most cases. The space complexity (memory requirement) of RKELM at O(NL) is also lower than that of KELM at O(N²). Thus, the low computational cost of RKELM makes it suitable for large scale data processing in the context of Big data.
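A minimal NumPy sketch of Algorithm 1 together with the prediction rule of Eq. (7) is given below; the function names, the toy data and the parameter values are illustrative assumptions and do not reproduce the authors' MATLAB release.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian kernel kappa(a, b) = exp(-||a - b||^2 / sigma)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / sigma)

def rkelm_train(X, T, L, C, sigma=1.0, rng=None):
    """Algorithm 1: random support vectors + regularized least squares, Eq. (10)."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(X.shape[0], size=L, replace=False)       # step 1
    XL = X[idx]                                                # support vectors X_L
    K = rbf_kernel(X, XL, sigma)                               # step 2: K_{N x L}
    beta = np.linalg.solve(C * np.eye(L) + K.T @ K, K.T @ T)   # step 3
    return XL, beta

def rkelm_predict(Xnew, XL, beta, sigma=1.0):
    """Output function f(x) = sum_s beta_s kappa(x, x_s), Eq. (7)."""
    return rbf_kernel(Xnew, XL, sigma) @ beta

# toy usage: binary classification with {0,1}^m targets, m = 2
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    T = np.eye(2)[y]                                           # one-hot targets
    XL, beta = rkelm_train(X, T, L=50, C=1e-3, sigma=2.0, rng=0)
    pred = rkelm_predict(X, XL, beta, sigma=2.0).argmax(1)
    print("training accuracy:", (pred == y).mean())
```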

Remark 2. It was proven in Huang et al. (2006) that, with sufficient hidden nodes and for widespread forms of h(·), ELM can approximate any continuous target function, that is,

$$ \lim_{L\to\infty} \|f_L(\mathbf{x}) - f(\mathbf{x})\| = \lim_{L\to\infty} \left\| \sum_{i=1}^{L} \beta_i\, h_i(\mathbf{x}) - f(\mathbf{x}) \right\| = 0. \qquad (11) $$

However, when h(·) is a kernel, not all forms of kernel satisfy this condition, and to date it remains unestablished which forms of kernel possess universal learning properties. As one contribution of this paper, in Theorem 1 we prove that SPD kernels satisfy this condition.

Remark 3. The classification theorems in Huang et al. (2006, 2012) have shown that, for m given disjoint regions K_1, K_2, ..., K_m in R^d and their corresponding labels c_1, c_2, ..., c_m, there exists a continuous function f(x) ∈ C(R^d), or on one compact set of R^d, such that f(x) = c_i if x ∈ K_i. Theorem 1 indicates that RKELM with an SPD kernel can approximate any such nonlinear function f(x) when L = N. Combining the two theorems, RKELM may be shown to separate the decision regions regardless of the shapes of the regions.

4. Related work

In this section, we discuss the relation of RKELM to other state-of-the-art algorithms, including RSVM, KELM, the RBF network and ELM-RBF.

4.1. RSVM

In RSVM (Lee & Huang, 2007), a small random sample of the dataset is used to represent the original full dataset so as to accelerate optimization of the smooth support vector machine (SSVM) (Lee & Huang, 2007). This is similar to RKELM, which uses a reduced kernel matrix to represent the full kernel matrix. Nevertheless, RSVM employs a globally quadratically convergent Newton algorithm in the training process, which involves iteratively solving a succession of linear problems involving the gradient of the objective function. RKELM, on the other hand, randomly selects the support vectors and computes the output weights without any iterative steps. Further,


although RSVM converges linearly to a solution, its universal approximation ability remains unproven. In contrast, RKELM with an SPD kernel is proven to possess universal approximation properties when L = N.

4.2. KELM

From equations (28a), (28b) and (28c) in Huang et al. (2012) we have

$$ \boldsymbol{\beta} = H^{\top}\boldsymbol{\xi}/C. \qquad (12) $$

Further, considering that ‖H‖² ∝ NL and ‖ξ‖² ∝ N‖ξ̄‖², where ‖ξ̄‖² is the average of ‖ξ_i‖², we have

$$ \|\boldsymbol{\beta}\|^2 \propto N^2 L\, \|\bar{\boldsymbol{\xi}}\|^2 / C^2. \qquad (13) $$

Focusing only on C and L, we have

$$ \|\boldsymbol{\beta}\|^2 \propto L/C^2. \qquad (14) $$

It can be seen that the minimization of ‖β‖² can be pursued from two aspects: the number of hidden nodes L and the control parameter C. In KELM (Huang et al., 2006), L is fixed at L = N, so the minimization of ‖β‖² can only be achieved by regulating the parameter C. This limits the flexibility and controllability of the model. In contrast, since both L and C are flexible in RKELM, a more robust RKELM model can be attained; in most cases RKELM is also more efficient than KELM since L ≪ N.

4.3. RBF network and ELM-RBF

The conventional RBF network (Lowe, 1989) is a specific RBF network in which the same impact factor σ is assigned to all RBF hidden nodes: f(x) = Σ_{i=1}^{L} β_i h(‖x − a_i‖/σ), where the centers a_i and the impact factor σ of the RBF hidden nodes are typically determined from the training data via some model selection scheme. In ELM-RBF, on the other hand, where f(x) = Σ_{i=1}^{L} β_i h(‖x − a_i‖/σ_i), the RBF hidden nodes have different impact factors σ_i, and both a_i and σ_i are random real values unrelated to the training data. In our proposed RKELM, where f(x) = Σ_{i=1}^{L} β_i h(‖x − x_i‖/σ) or f(x) = Σ_{i=1}^{L} β_i h(‖x − x_i‖/σ_i), the RBF hidden nodes can use the same or different impact factors; the centers x_i are randomly selected from the training data, and σ or σ_i are random real values. For RKELM, we have the following theorem.

Theorem 2. Given κ(x, x_i) = exp(−‖x − x_i‖²/σ) or κ(x, x_i) = exp(−‖x − x_i‖²/σ_i), i.e., regardless of whether the impact factor σ is the same for all kernels or σ_i differs for each kernel, the generated kernel matrix has full rank. (The proof is provided in Appendix B.)

Theorem 2 implies that RKELM possesses universal learning capability regardless of whether the impact factors are the same or different. Park and Sandberg (1991) showed that, under certain mild conditions, the conventional RBF network (Lowe, 1989) with the same impact factor possesses universal approximation capability. Huang et al. (2006), on the other hand, showed that ELM-RBF possesses universal learning ability when the impact factors σ_i are different and assigned random real values. Moreover, the nodes of RKELM can exist in the form of the RBF kernel, Sigmoid kernel, wave kernel, etc. In contrast, those in the conventional RBF network (Lowe, 1989) only hold for a specific type of RBF kernel.
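As a quick numerical illustration of Theorem 2 (not a substitute for the proof in Appendix B), the sketch below builds Gaussian kernel matrices with a shared impact factor and with per-centre impact factors, using all samples as support vectors (L = N), and checks their ranks; the sample sizes and σ values are arbitrary assumptions.

```python
import numpy as np

def gaussian_kernel_matrix(X, centers, sigma):
    """kappa(x, c) = exp(-||x - c||^2 / sigma); sigma is a scalar (shared impact
    factor) or an array of length L (one impact factor per centre)."""
    sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / np.asarray(sigma))

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))          # N = 40 distinct samples
centers = X                            # L = N: all samples used as support vectors

K_shared = gaussian_kernel_matrix(X, centers, sigma=2.0)
K_distinct = gaussian_kernel_matrix(X, centers, sigma=rng.uniform(0.5, 5.0, size=40))

# both ranks are expected to equal N = 40, i.e., the kernel matrices have full rank
print(np.linalg.matrix_rank(K_shared), np.linalg.matrix_rank(K_distinct))
```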

5. Experimental study

In this section, we verify the learning performance of the proposed RKELM by assessing it against several state-of-the-art machine learning algorithms, including the conventional ELM, KELM, LS-SVM, the conventional SVM and RSVM, using well-established real-world benchmark datasets for regression, binary classification and multi-class problems. Note that the source codes of the state-of-the-art machine learning algorithms used in the present experimental study are based on those made available online at the websites of the respective researchers.1 To maintain consistency and a fair comparison, our proposed RKELM2 is also implemented in MATLAB. All simulation runs made in the present study are conducted on a desktop PC with a 2.6 GHz CPU and 4 GB of memory, except for REAL-SIM, RCV1.BIN, RCV1.MUL and WEBSPAM, which are conducted on a machine with 256 GB of RAM.

5.1. Datasets

To provide a comprehensive investigation of the performances of the different algorithms considered, a wide range of datasets have been used in our experimental study, comprising datasets of both small and large instance size and dimensionality. The small-scale datasets include three multi-class datasets, four binary classification datasets and eight regression datasets, as described in Table 1. The large scale datasets include three binary classification datasets and three multi-class datasets, as described in Table 8. Most of these datasets have been taken from the UCI Machine Learning Repository (Bache & Lichman, 2013), Statlib3 and the LIBSVM website.4 For each problem, the results reported are the average of 20 independent trials. All input attributes are normalized within the range of [−1, 1]. The training and testing sets are defined as described in Tables 1 and 8, but the order of the dataset is randomly shuffled for each independent trial.

5.2. User-specified parameters

Unless explicitly stated, the Gaussian kernel κ(x, x_i) = exp(−‖x − x_i‖²/σ) has been used in the experimental study of RKELM, KELM, LS-SVM, SVM and RSVM, while the sigmoid activation function h(x, a, b) = 1/(1 + e^(−λ(a·x+b))) has been considered in ELM. For RKELM and RSVM, the average results of 20 independent simulation trials on each combination of (C, L) are obtained and the best average performance is reported; a small sketch of this selection loop is given below. For KELM, LS-SVM and SVM, the best average performance of 20 trials over different values of C ∈ {2^-30, ..., 2^5} is reported. Last but not least, for ELM with the Sigmoid function, the best average performance of 20 trials on each combination of (λ, L) is obtained and reported.

5.3. Results on small-scale regression problems

The performances of RKELM, ELM, KELM, LS-SVM and SVM are computed on eight real-world regression benchmark datasets that cover various fields of application. The user-specified parameters considered in the simulation trials are given in Table 2, while Table 3 tabulates the algorithmic performances obtained, including the training time and testing RMSE.
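A small sketch of the (C, L) selection loop described in Section 5.2 is given below; it assumes the rkelm_train and rkelm_predict helpers from the sketch following Algorithm 1, and the grid values shown are illustrative rather than the exact grids used in the paper.

```python
import numpy as np
# assumes rkelm_train / rkelm_predict from the sketch following Algorithm 1

def select_rkelm_params(X_tr, T_tr, X_te, y_te, Cs, Ls, sigma, n_trials=20):
    """Return the (C, L) pair with the best mean test accuracy over n_trials."""
    best_params, best_score = None, -np.inf
    for C in Cs:
        for L in Ls:
            accs = []
            for trial in range(n_trials):        # independent random support vectors
                XL, beta = rkelm_train(X_tr, T_tr, L=L, C=C, sigma=sigma, rng=trial)
                pred = rkelm_predict(X_te, XL, beta, sigma=sigma).argmax(1)
                accs.append((pred == y_te).mean())
            if np.mean(accs) > best_score:
                best_params, best_score = (C, L), float(np.mean(accs))
    return best_params, best_score

# illustrative grids
Cs = [2.0 ** k for k in range(-30, 6, 5)]
Ls = [25, 50, 100, 200, 400]
```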

1 The Matlab codes of the conventional ELM and KELM are downloaded online from http://www.ntu.edu.sg/home/egbhuang. The conventional SVM is based on the compiled C-coded SVM packages available in LIBSVM, i.e., http://www.csie.ntu.edu.tw/~cjlin/libsvm/#download. The LS-SVM Matlab code is obtained from http://www.esat.kuleuven.be/stadius/lssvmlab/. The RSVM code is downloaded from http://dmlab8.csie.ntust.edu.tw/downloads/Download/SSVMtoolbox.zip. The LLSVM and BSGD code is obtained from http://www.dabi.temple.edu/budgetedsvm/download.html. The LaSVM code is obtained from http://leon.bottou.org/projects/lasvm.
2 The source code of RKELM is made available online at http://yunpan.cn/QDqESbIYiKhid.
3 http://lib.stat.cmu.edu/datasets/.
4 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.


Table 1
Specification of small datasets.

Type | Dataset | #Attributes | #Classes | #Train set | #Test set
Regression | Auto-Mpg (a) | 7 | – | 320 | 72
Regression | Abalone (a) | 8 | – | 3,000 | 1,177
Regression | California Housing (b) | 8 | – | 8,000 | 12,640
Regression | YearPredictionMSD (d) | 90 | – | 463,715 | 51,630
Regression | Breast cancer (c) | 32 | – | 100 | 94
Regression | Computer Activity (c) | 21 | – | 4,000 | 4,192
Regression | Triazines (c) | 61 | – | 100 | 86
Regression | Bank (c) | 9 | – | 4,500 | 3,692
Multi-class | Image Segment (a) | 19 | 7 | 1,500 | 810
Multi-class | Satellite Image (a) | 36 | 6 | 4,435 | 2,000
Multi-class | Shuttle (a) | 9 | 7 | 43,500 | 14,500
Binary-class | Wave form (e)(f) | 21 | 2 | 4,000 | 1,000
Binary-class | USPS (e)(g) | 256 | 2 | 7,329 | 2,000
Binary-class | Adult (e) | 122 | 2 | 32,562 | 16,282
Binary-class | Skin Segment (a) | 3 | 2 | 145,057 | 100,000

(a) http://archive.ics.uci.edu/ml/datasets.
(b) http://www.dcc.fc.up.pt/ltorgo/Regression/cal_housing.html.
(c) http://www.cs.waikato.ac.nz/ml/weka/datasets.html.
(d) http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html.
(e) http://leon.bottou.org/papers/bordes-ertekin-weston-bottou-2005.
(f) Class 1 vs. rest.
(g) Class 0 vs. rest.

Table 2
Parameters on regression applications.

Dataset | RKELM (L, C, σ) | ELM (L, λ) | KELM (C, σ) | LS-SVM (C, σ) | SVM (#SV, C, σ) | RSVM (#SV, C, σ)
Auto-Mpg | (25, 2^-20, 2^2) | (25, 1) | (2^-10, 2^2) | (2^10, 2^2) | (168, 2^10, 2^-2) | (25, 2^20, 2^-2)
Abalone | (25, 2^-20, 2^2) | (25, 1) | (2^-10, 2^2) | (2^10, 2^2) | (1514, 2^10, 2^2) | (25, 2^20, 2^-2)
California Housing | (50, 2^-20, 2^0) | (50, 1) | – | (2^10, 2^0) | (4122, 2^7, 2^0) | (50, 2^10, 2^0)
YearPredictionMSD | (30, 2^-20, 2^7) | (30, 0.05) | – | – | (4023, 2^5, 2^-5) | (30, 2^10, 2^-5)
Breast cancer | (20, 2^-10, 2^2) | (20, 1) | (2^1, 2^2) | (2^10, 2^-2) | (100, 2^10, 2^2) | (20, 2^10, 2^2)
Computer activity | (250, 2^-20, 2^0) | (250, 1) | (2^10, 2^0) | (2^10, 2^0) | (2083, 2^7, 2^0) | (250, 2^10, 2^0)
Triazines | (10, 2^-10, 2^5) | (10, 1) | (2^3, 2^5) | (2^4, 2^5) | (65, 2^2, 2^5) | (10, 2^10, 2^-5)
Bank | (200, 2^-20, 2^1) | (200, 1) | (2^3, 2^1) | (2^4, 2^1) | (2305, 2^2, 2^1) | (200, 2^20, 2^-1)

"–": out of memory (cf. Table 3).

Table 3
Training time (in seconds) and testing error (in RMSE) on regression applications. Each cell reports Tr.time / Ts.RMSE.

Dataset | RKELM | ELM | KELM | LS-SVM | SVM | RSVM
Auto-Mpg | 0.0006 / 0.0737 | 0.0078 / 0.0778 | 0.0102 / 0.0718 | 0.0064 / 0.0720 | 0.0142 / 0.0763 | 0.0059 / 0.0828
Abalone | 0.0028 / 0.0765 | 0.0156 / 0.0782 | 4.6388 / 0.0750 | 1.2679 / 0.0754 | 0.668 / 0.0779 | 0.0111 / 0.0843
California | 0.0104 / 0.1290 | 0.0827 / 0.1320 | – / – | 15.898 / 0.1237 | 44.95 / 0.1196 | 0.1754 / 0.1373
MSD | 8.8039 / 0.1105 | 3.0228 / 0.1118 | – / – | – / – | 5.9372 / 0.1326 | 21.684 / 0.1170
Breast | 0.0006 / 0.2794 | 0.0076 / 0.2893 | 0.0013 / 0.2713 | 0.0016 / 0.3047 | 0.0035 / 0.2812 | 0.0056 / 0.2745
Computer | 0.0355 / 0.0295 | 0.0938 / 0.0308 | 13.946 / 0.0241 | 2.6813 / 0.0240 | 7.1277 / 0.0286 | 0.1323 / 0.0458
Triazines | 0.0005 / 0.1534 | 0.0259 / 0.1730 | 0.0024 / 0.1406 | 0.0037 / 0.1414 | 0.0031 / 0.1616 | 0.0141 / 0.1694
Bank | 0.0423 / 0.0435 | 0.1178 / 0.0430 | 19.321 / 0.0423 | 3.629 / 0.0426 | 2.0247 / 0.0433 | 0.1311 / 0.0475

"–": out of memory; Tr.time: training time; Ts.RMSE: testing RMSE.

RKELM can be observed to incur the lowest training time among the conventional ELM, KELM, LS-SVM, SVM and RSVM in most cases, while attaining testing RMSE competitive with the conventional ELM, LS-SVM, SVM and RSVM. Although the KELM is observed to report the lowest RMSE among all the algorithms considered, i.e., RKELM, ELM, SVM, LS-SVM and RSVM, it is shown to have incurred the highest training time. Taking the Computer activity dataset as an example, RKELM took 0.0355 s to train while the KELM incurred a CPU wall clock time of 13.946 s; notably, RKELM is over 300 times faster than the KELM counterpart.

5.4. Results on small-scale classification problems

The performance efficacy of the RKELM has also been considered on seven small-scale classification problems. Table 4 summarizes the parameter settings, while Tables 5–7 report the results of the respective algorithms. On the multi-class problems, training

time and testing accuracy are employed to measure the performance of the algorithms. On binary classification problems, besides training time and testing accuracy, three additional performance metrics, Precision, Recall and F1-measure, have been adopted. As observed from Tables 5–7, the training time taken by RKELM is much lower than that of ELM, KELM, LS-SVM, SVM and RSVM, and the testing accuracy is superior to ELM while competitive with LS-SVM/SVM/RSVM. Although RKELM did not perform better than KELM in terms of testing accuracy, it takes a significantly smaller amount of time to train a model than the KELM. Taking Satimage as an example, RKELM incurred a training time of 0.42 s, while ELM took 1.53 s, KELM 18.19 s, LS-SVM 0.98 s, SVM 1.317 s and RSVM 0.83 s. From these results, it is worth noting that RKELM is more than 40 times more efficient than the KELM and about 3 times faster than the SVM. KELM and LS-SVM have been noted to reach the "run out of memory" state on a desktop computer equipped with 4 GB of memory.


Table 4
Parameters on classification applications.

Dataset | RKELM (L, C, σ) | ELM (L, λ) | KELM (C, σ) | LS-SVM (C, σ) | SVM (#SV, C, σ) | RSVM (#SV, C, σ)
Segment | (400, 2^25, 2^0) | (400, 2^-4) | (2^5, 2^0) | (2^10, 2^0) | (186, 2^5, 2^0) | (400, 2^10, 2^0)
Satimage | (400, 2^25, 2^1) | (500, 2^4) | (2^5, 2^0) | (2^10, 2^0) | (2012, 2^5, 2^1) | (400, 2^5, 2^-1)
Shuttle | (300, 2^20, 2^-3) | (300, 1) | – | – | (279, 2^10, 2^3) | (300, 2^10, 2^3)
Skin | (300, 2^20, 2^-3) | (300, 1) | – | – | (186, 2^10, 2^-3) | (500, 2^10, 2^-3)
Waveform | (250, 2^20, 2^1) | (250, 1) | (2^3, 2^1) | – | (968, 2^1, 2^1) | (250, 2^10, 2^-1)
USPS | (500, 2^20, 2^8) | (500, 1) | – | – | (208, 2^5, 2^-8) | (500, 2^10, 2^-8)
Adult | (600, 2^20, 2^8) | (600, 1) | – | – | (11483, 2^5, 2^-8) | (600, 2^10, 2^-8)

"–": out of memory (cf. Tables 5 and 6).

Table 5
Performance comparison of RKELM, ELM, KELM, LS-SVM, SVM and RSVM on multi-class classification applications. Each cell reports Tr.time / Ts.rate.

Dataset | RKELM | ELM | KELM | LS-SVM | SVM | RSVM
Segment | 0.03 / 96.99% | 0.07 / 95.43% | 0.47 / 97.68% | 159.38 / 97.37% | 0.150 / 96.72% | 0.76 / 96.30%
Satimage | 0.42 / 91.31% | 1.53 / 89.21% | 18.19 / 91.83% | 0.98 / 91.79% | 1.317 / 91.67% | 0.83 / 90.44%
Shuttle | 0.67 / 99.66% | 0.76 / 99.36% | – / – | – / – | 1.55 / 99.92% | 31.62 / 99.78%

"–": out of memory; Tr.time: training time (in seconds); Ts.rate: testing accuracy.

Table 6
Performance comparison of RKELM, conventional ELM and KELM on binary classification applications.

Dataset | Method | Tr.time | Ts.rate | Precision | Recall | F1
Waveform | RKELM | 0.04 | 91.48% | 88.66% | 87.14% | 87.90%
Waveform | ELM | 0.09 | 91.24% | 88.64% | 83.38% | 85.93%
Waveform | KELM | 15.36 | 91.14% | 89.64% | 82.20% | 85.76%
USPS | RKELM | 0.24 | 99.55% | 99.70% | 97.08% | 98.37%
USPS | ELM | 0.47 | 99.44% | 99.10% | 96.78% | 97.93%
USPS | KELM | 62.3 | 99.49% | 98.26% | 98.83% | 98.54%
Adult | RKELM | 1.04 | 85.25% | 72.98% | 59.28% | 65.42%
Adult | ELM | 2.01 | 84.99% | 73.20% | 57.18% | 64.20%
Adult | KELM | – | – | – | – | –
Skin | RKELM | 2.85 | 99.84% | 96.89% | 99.99% | 98.35%
Skin | ELM | 3.28 | 99.3% | 96.77% | 99.88% | 98.24%
Skin | KELM | – | – | – | – | –

"–": out of memory; Tr.time: training time (in seconds); Ts.rate: testing accuracy.

Table 7
Performance comparison of RKELM, SVM and RSVM on binary classification applications.

Dataset | Method | Tr.time | Ts.rate | Precision | Recall | F1
Waveform | RKELM | 0.04 | 91.48% | 88.66% | 87.14% | 87.90%
Waveform | SVM | 0.337 | 90.92% | 85.71% | 88.12% | 86.90%
Waveform | RSVM | 0.43 | 90.32% | 83.38% | 84.69% | 84.03%
USPS | RKELM | 0.24 | 99.55% | 99.70% | 97.08% | 98.37%
USPS | SVM | 1.56 | 99.61% | 98.52% | 98.52% | 98.52%
USPS | RSVM | 1.29 | 99.34% | 97.35% | 97.93% | 97.64%
Adult | RKELM | 1.04 | 85.25% | 72.98% | 59.28% | 65.42%
Adult | SVM | 88.91 | 84.94% | 72.83% | 57.85% | 64.48%
Adult | RSVM | 6.93 | 85.11% | 72.80% | 59.02% | 65.19%
Skin | RKELM | 2.85 | 99.84% | 99.82% | 99.99% | 99.90%
Skin | SVM | 5.49 | 99.95% | 99.85% | 99.99% | 99.92%
Skin | RSVM | 79.50 | 99.74% | 96.89% | 99.19% | 98.02%

"–": out of memory; Tr.time: training time (in seconds); Ts.rate: testing accuracy.

Table 8
Specification of big data sets.

Dataset | #Train set | #Test set | #Attributes | #Classes
REAL-SIM | 32,309 | 40,000 | 20,958 | 2
RCV1.BIN | 677,399 | 20,242 | 47,236 | 2
WEBSPAM | 300,000 | 50,000 | 16,609,143 | 2
RCV1.MUL | 518,571 | 15,564 | 47,236 | 53
SECTOR | 6,412 | 3,207 | 55,197 | 105
NEWS20.MUL | 15,935 | 3,993 | 62,061 | 20

The results also indicate that although KELM can produce the best generalization performance, the high time and memory complexity requirements of KELM make it inappropriate for dealing with the many challenges of large scale data, which are prevalent in the era of Big data. On the binary classification datasets, we find that RKELM obtains competitive or improved Precision, Recall and F1-measure results. Taking the Waveform dataset as an example, in which the proportion of positive to negative samples is Np/Nn = 0.6738, RKELM obtains Precision = 88.66%, Recall = 87.14% and F1 = 87.90%, which is an improvement over the conventional ELM (88.64%, 83.38% and 85.93%) on all three metrics, while competitive with KELM in terms of Precision (89.64%) and an improvement over KELM in Recall (82.20%) and F1 (85.76%).

5.5. Results on large scale problems

Table 8 summarizes the large scale datasets used in the experiments. The datasets are part of the LibSVM data collection.5 Besides ELM, we pit our method against other state-of-the-art large scale algorithms, including LLSVM (Zhang et al., 2012), BSGD (Wang et al., 2012) and DTSVM (Chang et al., 2010). Tables 9 and 10 tabulate the experimental results on multi-class and binary classification applications, respectively. From the results in Tables 9 and 10, it can be observed that, with respect to the ELM, RKELM exhibited improved generalization performance and incurred a lower training time on all six datasets considered. For example, on REAL-SIM, RKELM took 29 s to achieve a 95.89% test accuracy while ELM took 41 s to achieve an accuracy of 92.93%; on the bigger WEBSPAM dataset, RKELM took 2583 s to achieve a 98.23% test accuracy, while ELM took 2698 s to arrive at 94.22%. The reason for the higher efficiency of RKELM over ELM lies in the sparsity of the datasets: the vector products in RKELM involve only the few non-zero elements, thus leading

5 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.


Table 9
Performance comparison of RKELM to other state-of-the-art machine learning algorithms on large scale multi-class classification applications.

Dataset | Method | Training time | Testing rate | Speedup
NEWS20.MUL | RKELM | 33 s | 85.19% | 1×
NEWS20.MUL | ELM | 84 s | 82.46% | 1.4×
NEWS20.MUL | LASVM (Sindhwani & Keerthi, 2006) | 23,339 s | 83.10% | 12.8×
NEWS20.MUL | DTSVM (Sindhwani & Keerthi, 2006) | 3053 s | 83.22% | 38.9×
NEWS20.MUL | CBD | 39,590 s | 75.23% | 344×
RCV1.MUL | RKELM | 145 s | 85.81% | 1×
RCV1.MUL | ELM | 521 s | 82.74% | 3.5×
RCV1.MUL | K-Pegasos (Chen et al., 2011) | N.A. | 84.50% | N.A.
RCV1.MUL | K-binaryLR (Chen et al., 2011) | N.A. | 83.00% | N.A.
RCV1.MUL | Multi-class LR (Chen et al., 2011) | N.A. | 88.50% | N.A.
SECTOR | RKELM | 18 s | 93.86% | 1×
SECTOR | ELM | 20 s | 91.22% | 1.04×
SECTOR | LIBSVM(OVA) (Chang et al., 2010) | N.A. | 92.80% | 11.3×
SECTOR | NB(OVA) (Chang et al., 2010) | N.A. | 64.30% | 24.3×
SECTOR | LibSVM(BCH63) (Chang et al., 2010) | N.A. | 93.3% | 24.3×
SECTOR | NB(BCH63) (Chang et al., 2010) | N.A. | 87.2% | 24.3×

Table 10
Performance comparison of RKELM to other state-of-the-art machine learning algorithms on large scale binary classification applications.

Dataset | Method | Training time (s) | Testing rate | Precision | Recall | F1 | Speedup
REAL-SIM | RKELM | 29 | 95.89% | 94.49% | 88.66% | 91.48% | 1×
REAL-SIM | ELM | 41 | 92.93% | 89.14% | 83.37% | 86.16% | 1.4×
REAL-SIM | TSVM (Sindhwani & Keerthi, 2006) | 373 | 93.10% | N.A. | N.A. | N.A. | 12.8×
REAL-SIM | DA (Sindhwani & Keerthi, 2006) | 1,129 | 92.80% | N.A. | N.A. | N.A. | 38.9×
REAL-SIM | RSVM | 10,002 | 96.46% | 94.54% | 88.72% | 91.53% | 344×
RCV1.BIN | RKELM | 406 | 96.85% | 96.40% | 97.89% | 97.14% | 1×
RCV1.BIN | ELM | 762 | 94.70% | 94.34% | 95.19% | 94.76% | 1.8×
RCV1.BIN | LLSVM | 1,800 | 95.77% | 95.21% | 96.35% | 95.77% | 4.4×
RCV1.BIN | BSGD | 5,400 | 97.08% | 96.63% | 97.91% | 97.26% | 13.3×
RCV1.BIN | SVM (RBF) | 68,687 | 97.57% | 96.84% | 98.11% | 97.47% | 169×
WEBSPAM | RKELM | 2,583 | 98.23% | 95.40% | 96.89% | 96.14% | 1×
WEBSPAM | ELM | 2,698 | 94.22% | 91.11% | 92.81% | 91.95% | 1.04×
WEBSPAM | CART (Chang et al., 2010) | 29,332 | 98.44% | N.A. | N.A. | N.A. | 11.3×
WEBSPAM | DTSVM (Chang et al., 2010) | 63,015 | 99.03% | N.A. | N.A. | N.A. | 24.3×

to the significant cost saving. In contrast, for ELM, despite the sparse data, the random input weights of ELM are not sparse, and thus the products need to be carried out over every element. With respect to the other algorithms, although RKELM did not emerge as superior in terms of prediction accuracy on the RCV1.BIN, REAL-SIM and WEBSPAM datasets, its testing accuracies of 96.85% on RCV1.BIN, 95.89% on REAL-SIM and 98.23% on WEBSPAM are highly competitive with the existing state-of-the-art algorithms. Notably, in terms of training time, it is significantly more efficient than all the algorithms considered in all cases. For example, on the WEBSPAM dataset, which has more than 16 million features and 300,000 training samples, RKELM only took 2583 s to attain the reported 98.23% testing accuracy. Note that this represents a 24.3 times speed-up over the DTSVM. To summarize, on big data problems where memory space and time complexity are a constant concern, RKELM serves as an excellent choice, providing a good balance between training complexity and generalization accuracy.

Fig. 1. Testing accuracy change of RKELM on Satimage w.r.t C and L.

6. Parameters sensitivity and stability analysis

In this section, we present a sensitivity study on the parameters of RKELM.

6.1. Performance of RKELM w.r.t. parameters C and L

Fig. 1 summarizes the testing accuracy of RKELM on Satimage w.r.t. the parameters C and L. The detailed result is provided in Fig. 2, which showcases the testing accuracies for different C values varying from 2^-30 to 2^5, under differing L of 200, 400, 800, 1200, 1800, 2500, 4000 and 4435. From the figure, it can be observed that when L is small (i.e., L = 200, 400, 800, 1200, 1800), no form of over-fitting is observed, indicating that fine tuning of C is ineffective for improving the testing accuracy. On the other hand, when L becomes large (i.e., L = 2500, 4000, 4435), the phenomenon of over-fitting can be observed. This indicates that a fine tuning of C can be useful in mitigating the effects of over-fitting and improving generalization performance.

6.2. Performance of RKELM for a common σ vs. different σi

For RKELM with a Gaussian kernel, the impact factor can be a common value (σ) or a different value (σi) for each kernel.


Fig. 2. Performance accuracies of RKELM for different L and C values on the Satimage dataset. C is noted to pose a high influence on the performance of RKELM when the number of hidden nodes L is large.

Table 11
Training time and testing RMSE of RKELM-σ, RKELM-σi and ELM-RBF on regression problems. Each cell reports Train time / Test RMSE.

Dataset | RKELM-σ | RKELM-σi | ELM-RBF
Auto | 0.0006 / 0.0737 | 0.0005 / 0.0757 | 0.0009 / 0.0756
Abalone | 0.0028 / 0.0760 | 0.0029 / 0.0765 | 0.0046 / 0.0764
California | 0.0104 / 0.1290 | 0.0123 / 0.1293 | 0.0155 / 0.1344
YearPred | 8.8039 / 0.1105 | 8.7329 / 0.1104 | 4.4786 / 0.1138
Breast | 0.0006 / 0.2794 | 0.0006 / 0.2812 | 0.0015 / 0.2853
Computer | 0.0355 / 0.0295 | 0.0424 / 0.0248 | 0.0801 / 0.0315
Triazines | 0.0005 / 0.1534 | 0.0004 / 0.1511 | 0.0004 / 0.1529
Bank | 0.0423 / 0.0435 | 0.0337 / 0.0431 | 0.0604 / 0.0421

RKELM-σ: RKELM with the same impact factor; RKELM-σi: RKELM with different impact factors.

Table 12
Training time and testing classification accuracy of RKELM-σ, RKELM-σi and ELM-RBF on classification problems. Each cell reports Train time / Test rate.

Dataset | RKELM-σ | RKELM-σi | ELM-RBF
Segment | 0.0282 / 96.99% | 0.0328 / 96.70% | 0.0982 / 95.17%
Satimage | 0.4210 / 91.31% | 0.4419 / 91.64% | 1.9198 / 88.79%
Shuttle | 0.6699 / 99.66% | 0.6380 / 99.73% | 0.9996 / 99.48%
Skin | 2.8506 / 99.85% | 2.1250 / 99.79% | 3.1019 / 99.41%
Waveform | 0.0405 / 91.48% | 0.0465 / 91.28% | 0.0931 / 91.16%
USPS | 0.2401 / 99.55% | 0.2681 / 99.58% | 0.5393 / 99.21%
Adult | 1.0411 / 85.25% | 1.1315 / 85.26% | 2.0844 / 85.09%

Here, we study and analyze the effects of the impact factor on the performance of the RKELM, as reported in Tables 11 and 12. The results obtained indicate that, regardless of whether the impact factors of the kernels are the same or unique, RKELM remains capable of generating improved or competitive generalization performance relative to the ELM-RBF. For example, on the Satimage dataset, RKELM-σ exhibited a testing accuracy of 91.31% and RKELM-σi showcased a competitive testing accuracy of 91.64%. Note that both RKELM-σi and RKELM-σ displayed improved generalization performance over the ELM-RBF (88.79%).

6.3. Stability analysis of the RKELM

In this subsection, we further analyze the stability of the proposed RKELM experimentally. In particular, the average results of 20 independent simulation trials with different data subsets in each trial are plotted in Fig. 3. Due to space constraints, the results of two classification datasets (Satimage and Image Segment) and two regression datasets (Bank and Auto-Mpg) are presented as representatives. Widely used kernel types including the Gaussian kernel, Sigmoid kernel, Polynomial kernel, Inverse Multiquadric kernel and Rational Quadratic kernel are considered here. From the results obtained, it can be observed that the testing accuracy does not vary much, regardless of the kernel function used. For example, on the Satimage dataset, the standard deviation of RKELM-σi is about 0.01; on Auto-Mpg, the standard deviation of RKELM-σi is around 0.02. This indicates that there is no significant change in the testing accuracy across the 20 independent trials and that stable generalization performance can be achieved with the use of random support vectors.


Fig. 3. Testing accuracy of RKELM over 20 trials with different random support vectors.

Fig. 4. Training time comparison of RKELM and ELM with respect to data density.

6.4. The influence of data density on the computational efficiency of RKELM

In the conventional ELM, the random weights exist in the form of a dense matrix. In contrast, since the support vectors of RKELM are selected from the data itself, the sparse characteristics of RKELM are naturally inherited from the sparsity of the dataset. In particular, since the input data and the input weights (support vectors) are both in the form of sparse matrices, the computational efficiency of RKELM is naturally higher than that of the ELM. To illustrate the impact of data density on the computational efficiency of RKELM, three datasets of differing dimensionality, i.e., d = [10^4, 10^5, 10^6], have been generated here for experimental study. For each dataset, the degree of data density is decreased according to the levels [0.6, 0.4, 0.2, 0.1, 0.08, 0.05, 0.02, 0.01, 0.005, 0.002, 0.001], and the training time of RKELM (L = 500) and ELM (L = 500) w.r.t. the different density levels is summarized in Fig. 4. It can be observed that the computational efficiency of RKELM is always higher than that of the conventional ELM. For data density in the range of [0.2, 0.01], RKELM exhibits the largest improvement in computational efficiency over the ELM counterpart.
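A rough sketch of the kind of timing comparison described above is given below, using synthetic sparse data; the dimensionality is scaled down to 10^4 to keep the sketch quick, and the timing harness, the density levels shown and the helper names are our own assumptions rather than the authors' experimental code.

```python
import time
import numpy as np
import scipy.sparse as sp

def time_hidden_layer(X, L, rng):
    """Compare the cost of ELM's dense random projection against RKELM's
    reduced Gaussian kernel matrix on the same sparse input X (CSR format)."""
    d = X.shape[1]
    # ELM: dense random input weights, so X @ W produces a fully dense result
    W = rng.uniform(-1, 1, size=(d, L))
    t0 = time.time(); H_elm = X @ W; t_elm = time.time() - t0
    # RKELM: support vectors are rows of X itself, hence sparse
    XL = X[rng.choice(X.shape[0], L, replace=False)]
    t0 = time.time()
    G = (X @ XL.T).toarray()                                   # sparse-sparse product
    sq = (np.asarray(X.multiply(X).sum(1))
          + np.asarray(XL.multiply(XL).sum(1)).T - 2 * G)
    K = np.exp(-np.maximum(sq, 0) / 2.0)                        # reduced kernel matrix
    t_rkelm = time.time() - t0
    return t_elm, t_rkelm

rng = np.random.default_rng(0)
for density in [0.2, 0.1, 0.05, 0.01, 0.005, 0.001]:
    X = sp.random(2000, 10_000, density=density, format="csr", random_state=0)
    print(density, time_hidden_layer(X, L=500, rng=rng))
```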

7. Conclusion and future work

Kernel-based algorithms including SVM, LS-SVM and KELM have been extremely popular in many real world applications involving classification and regression problems due to the excellent generalization performance reported. However, on Big data or large scale problems, for conventional kernel-based approaches such as SVM, LS-SVM and KELM, the computational process of identifying support vectors can become very intensive due to the large kernel matrix size, leading to poor computational efficiency. In this paper we proposed a fast kernel-based algorithm, RKELM, which randomly selects a subset of the available data as support vectors. Based on a universal learning condition involving the reduced kernel-based SLFN, we prove that RKELM can approximate any nonlinear function with zero error provided that the kernel is strictly positive definite. By avoiding iterative steps and the curse of kernel matrix size, significant cost savings in the RKELM training process can be readily attained. An extensive experimental study on a variety of benchmarks shows that RKELM can produce competitive and stable performance at fast learning speed. Last but not least, it is worth noting that the sparseness of RKELM remains an open issue and would warrant greater investigation in the near future.

Acknowledgments

This work was conducted within the Rolls-Royce@NTU Corporate Lab with support from the National Research Foundation (NRF) Singapore under the Corp Lab@University Scheme. The first author is also grateful for the support from the Innovation fund research group 61221063; National Science Foundation of China 61572399, 61532015; Shaanxi New Star of Science & Technology 2013 KJXX-29; New Star Team of Xi'an University of Posts & Telecommunications; Provincial Key Disciplines Construction Fund of General Institutions of Higher Education in Shaanxi.

Appendix A. Proof of Theorem 1

From Theorem 3 we know that, if the kernel κ(·, ·) is an SPD kernel, then for every finite set {x_1, ..., x_L} ⊂ Ω, where Ω is a domain (bounded or not) in R^d, the kernel matrix K_{N×N} has full rank; hence K_{N×N} is invertible and ‖K_{N×N}β − T‖ = 0.

Appendix B. Proof of Theorem 2

We first show that the functions κ(·, x_i) = exp(−‖x − x_i‖²/σ), i = 1, 2, ..., L, are linearly independent. The linear relation

$$ \sum_{i=1}^{L} \alpha_i \exp\!\left(-\|\mathbf{x} - \mathbf{x}_i\|^2/\sigma\right) = 0, \qquad \alpha_i \in \mathbb{R} \qquad (15) $$

can be rewritten in the form

$$ 0 = \sum_{i=1}^{L} \left(\alpha_i e^{-\|\mathbf{x}_i\|^2/\sigma}\right) e^{-\|\mathbf{x}\|^2/\sigma}\, e^{2\langle \mathbf{x}, \mathbf{x}_i\rangle/\sigma} = e^{-\|\mathbf{x}\|^2/\sigma} \sum_{i=1}^{L} b_i\, e^{2\langle \mathbf{x}, \mathbf{x}_i\rangle/\sigma} \qquad (16) $$

where b_i = α_i e^{−‖x_i‖²/σ}; then we have

$$ \sum_{i=1}^{L} b_i\, e^{2\langle \mathbf{x}, \mathbf{x}_i\rangle/\sigma} = 0. \qquad (17) $$

Since this equation holds for any sample x ∈ R^d, one obtains the following homogeneous system:

$$ \sum_{i=1}^{L} b_i\, e^{2\langle \mathbf{x}_j, \mathbf{x}_i\rangle/\sigma} = 0, \qquad j = 1, 2, \ldots, L. \qquad (18) $$

For a fixed sample x ∈ R^d, this is a Vandermonde system of equations (Demeure & Scharf, 1989) for the b_i. Consequently, b_i = 0 and thus α_i = 0 for all i, as desired, so the functions κ(·, x_i) are linearly independent. Further considering Theorem 3, we conclude that the Gaussian kernel matrix has full rank no matter whether the impact factor σ is the same for each kernel or not.

Appendix C. Theorem 3

Theorem 3. If H is a reproducing kernel Hilbert function space with reproducing kernel κ : Ω × Ω → R, then κ is positive definite. Moreover, κ is strictly positive definite if and only if κ(·, x_i), i = 1, 2, ..., N_∞, are linearly independent, where N_∞ denotes an integer that may tend to infinity.

Proof. For N_∞ distinct samples x_1, ..., x_{N_∞} and a nonzero vector c ∈ R^{N_∞} we have

$$
\sum_{j=1}^{N_\infty}\sum_{k=1}^{N_\infty} c_j c_k\, \kappa(\mathbf{x}_j, \mathbf{x}_k)
= \sum_{j=1}^{N_\infty}\sum_{k=1}^{N_\infty} c_j c_k \left\langle \kappa(\cdot, \mathbf{x}_j), \kappa(\cdot, \mathbf{x}_k)\right\rangle_{\mathcal{H}}
= \left\langle \sum_{j=1}^{N_\infty} c_j \kappa(\cdot, \mathbf{x}_j),\; \sum_{j=1}^{N_\infty} c_j \kappa(\cdot, \mathbf{x}_j)\right\rangle_{\mathcal{H}}
= \left\| \sum_{j=1}^{N_\infty} c_j \kappa(\cdot, \mathbf{x}_j)\right\|_{\mathcal{H}}^{2} \ge 0, \qquad (19)
$$

so κ is positive definite. Moreover, if κ is strictly positive definite, then ‖Σ_{j=1}^{N_∞} c_j κ(·, x_j)‖²_H > 0. This means that for every nonzero vector c ∈ R^{N_∞}, Σ_{j=1}^{N_∞} c_j κ(·, x_j) ≠ 0, i.e., the κ(·, x_j) are linearly independent. Conversely, if the κ(·, x_j) are linearly independent, then for every nonzero vector c ∈ R^{N_∞}, Σ_{j=1}^{N_∞} c_j κ(·, x_j) ≠ 0 and thus Σ_{j=1}^{N_∞} Σ_{k=1}^{N_∞} c_j c_k κ(x_j, x_k) > 0, i.e., κ is strictly positive definite. This means that for every finite set {x_1, ..., x_N} ⊂ Ω, if κ is strictly positive definite, then the kernel matrix [κ(x_i, x_j)] has full rank.

References

Bache, K., & Lichman, M. (2013). UCI machine learning repository. URL: http://archive.ics.uci.edu/ml.
Bordes, A., Ertekin, S., Weston, J., & Bottou, L. (2005). Fast kernel classifiers with online and active learning. Journal of Machine Learning Research, 6, 1579–1619.

Chang, F., Guo, C.-Y., Lin, X.-R., & Lu, C.-J. (2010). Tree decomposition for large-scale SVM problems. Journal of Machine Learning Research, 11, 2935–2972.
Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3), 27:1–27:27.
Chen, D., Chen, W., & Yang, Q. (2011). Characterizing inverse time dependency in multi-class learning. In 2011 IEEE 11th international conference on data mining, ICDM (pp. 1020–1025).
Demeure, C., & Scharf, L. (1989). Fast least squares solution of Vandermonde systems of equations. In Acoustics, speech, and signal processing, ICASSP-89, 1989 international conference on, Vol. 4 (pp. 2198–2210). http://dx.doi.org/10.1109/ICASSP.1989.266900.
Feng, L., Ong, Y.-S., & Lim, M.-H. (2013). ELM-guided memetic computation for vehicle routing. IEEE Intelligent Systems, 28(6), 38–41.
Feng, L., Ong, Y.-S., Lim, M.-H., & Tsang, I. (2015). Memetic search with interdomain learning: A realization between CVRP and CARP. IEEE Transactions on Evolutionary Computation, 19(5), 644–658.
Golub, G. H., & Van Loan, C. F. (1996). Matrix computations (3rd ed.). Baltimore, MD, USA: Johns Hopkins University Press.
Huang, G.-B. (2014). An insight into extreme learning machines: Random neurons, random features and kernels. Cognitive Computation, 6(3), 376–390.
Huang, G.-B., & Chen, L. (2007). Convex incremental extreme learning machine. Neurocomputing, 70(16–18), 3056–3062.
Huang, G.-B., Chen, L., & Siew, C.-K. (2006). Universal approximation using incremental constructive feedforward networks with random hidden nodes. IEEE Transactions on Neural Networks, 17(4), 879–892.
Huang, G.-B., Zhou, H., Ding, X., & Zhang, R. (2012). Extreme learning machine for regression and multiclass classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 42(2), 513–529.
Huang, G.-B., Zhu, Q.-Y., & Siew, C.-K. (2004). Extreme learning machine: a new learning scheme of feedforward neural networks. In 2004 IEEE international joint conference on neural networks, Proceedings, Vol. 2 (pp. 985–990).
Huang, G.-B., Zhu, Q.-Y., & Siew, C.-K. (2006). Extreme learning machine: Theory and applications. Neurocomputing, 70(1–3), 489–501.
Lee, Y.-J., & Huang, S.-Y. (2007). Reduced support vector machines: A statistical theory. IEEE Transactions on Neural Networks, 18(1), 1–13.
Lowe, D. (1989). Adaptive radial basis function nonlinearities, and the problem of generalisation. In Artificial neural networks, First IEE international conference on (Conf. Publ. No. 313) (pp. 171–175).
Luo, J., Vong, C.-M., & Wong, P.-K. (2014). Sparse Bayesian extreme learning machine for multi-classification. IEEE Transactions on Neural Networks and Learning Systems, 25(4), 836–843.
Neuvial, P. (2013). Asymptotic results on adaptive false discovery rate controlling procedures based on kernel estimators. The Journal of Machine Learning Research, 14(1), 1423–1459.
Park, J., & Sandberg, I. W. (1991). Universal approximation using radial-basis-function networks. Neural Computation, 3(2), 246–257.
Rong, H.-J., Ong, Y.-S., Tan, A.-H., & Zhu, Z. (2008). A fast pruned-extreme learning machine for classification problem. Neurocomputing, 72(1–3), 359–366.
Sindhwani, V., & Keerthi, S. S. (2006). Large scale semi-supervised linear SVMs. In Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval, SIGIR '06 (pp. 477–484). New York, NY, USA: ACM.
Suykens, J. A. K., & Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Processing Letters, 9(3), 293–300.
Tsang, I. W., Kwok, J. T., & Cheung, P.-M. (2005). Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research, 6, 363–392.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York, NY, USA: Springer-Verlag New York, Inc.
Wang, Z., Crammer, K., & Vucetic, S. (2012). Breaking the curse of kernelization: Budgeted stochastic gradient descent for large-scale SVM training. Journal of Machine Learning Research, 13(1), 3103–3131.
Zhai, Y., Ong, Y.-S., & Tsang, I. (2014). The emerging "big dimensionality". IEEE Computational Intelligence Magazine, 9(3), 14–26.
Zhang, K., Lan, L., Wang, Z., & Moerchen, F. (2012). Scaling up kernel SVM on limited resources: A low-rank linearization approach (pp. 1425–1434).
