
Robust Support Vector Regression for Uncertain Input and Output Data

Gao Huang, Student Member, IEEE, Shiji Song, Cheng Wu, and Keyou You

Abstract— In this paper, a robust support vector regression (RSVR) method with uncertain input and output data is studied. First, the data uncertainties are investigated under a stochastic framework and two linear robust formulations are derived. Linear formulations robust to ellipsoidal uncertainties are also considered from a geometric perspective. Second, kernelized RSVR formulations are established for nonlinear regression problems. Both the linear and nonlinear formulations are converted to second-order cone programming problems, which can be solved efficiently by the interior point method. Simulations demonstrate that the proposed method outperforms existing RSVRs in the presence of both input and output data uncertainties.

Index Terms— Robust, second-order cone programming, support vector regression, uncertain data.

I. INTRODUCTION

SUPPORT VECTOR REGRESSION (SVR) is a universal function regression approach based on the Vapnik–Chervonenkis theory [1]–[3]. It aims to minimize a combination of the empirical risk and a regularization term, and it has achieved excellent performance in a wide variety of problems [4]–[8]. The traditional SVR assumes that the training data are known exactly [9]–[11]. However, data uncertainty is inevitable in practical regression problems, where data cannot be observed precisely because of sampling errors, modeling errors, and/or measurement errors. Recent studies have shown that robust support vector regressions (RSVRs), which explicitly handle the data uncertainties, perform better than the traditional SVRs [12]–[18]. According to the type of uncertainty studied, most of the existing RSVRs fall into the following categories.

1) RSVRs for stochastic uncertainty [14], [15], [19]: Similar to traditional regression methods, several RSVRs treat data uncertainties as random noise. By replacing the constraints in the standard ε-SVR [9] with probability constraints, chance-constrained robust regression formulations can be obtained.

Manuscript received January 12, 2011; revised July 4, 2012; accepted July 29, 2012. Date of publication August 27, 2012; date of current version October 15, 2012. This work was supported in part by the National Natural Science Foundation of China under Grant 61273233, the Research Foundation for the Doctoral Program of Higher Education under Grant 20090002110035, the Project of China Ocean Association under Grant DY125-25-02, and Tsinghua University Initiative Scientific Research Program under Grant 2010THZ07002 and Grant 2011THZ07132. The authors are with the Department of Automation, Tsinghua University, Beijing 100084, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]). Digital Object Identifier 10.1109/TNNLS.2012.2212456

For example, an RSVR robust to Gaussian noise has been proposed by Trafalis and Alwazzi [14]. In [15], Shivaswamy et al. have introduced an RSVR without any noise distribution assumption, provided that the mean and covariance of the noise are known.

2) RSVRs for ellipsoidal uncertainty [13]–[15]: From a geometric perspective, some researchers have studied RSVRs by assuming that each data point takes values from an ellipsoid determined by its center, metric, and radius [13]–[15]. Note that spherical uncertainty can be viewed as a special case of ellipsoidal uncertainty.

3) RSVRs for box-type uncertainty (interval data) [16], [20]–[22]: In situations where data cannot be expressed as exact points, intervals may be a better representation. The RSVR proposed in [16] can handle such data uncertainties by introducing two different distances (the maximum distance and the Hausdorff distance) for measuring the prediction error. Later, a kernelized RSVR with precise input and interval output data was studied in [20].

4) RSVRs for fuzzy uncertainty [12], [23], [24]: The support vector fuzzy regression method proposed in [12] has been applied to data with fuzzy uncertainty. These works discuss two models under different situations: a) both input and output data are symmetric triangular fuzzy numbers, and b) the input data are crisp and the output data are triangular fuzzy numbers. They apply the standard ε-SVR method to the fuzzy SVR and require that the mode and the extremes of the intervals satisfy the common constraints. In the crisp input-fuzzy output case, nonlinear regressors are introduced via kernel methods as well [12].

However, some of the aforementioned RSVRs consider only input uncertainty, while others consider only output uncertainty. Few works have considered both input and output data uncertainty simultaneously. As summarized in Table I, the first two types of RSVRs, which deal with stochastic and ellipsoidal uncertainties, assume that the uncertainties are limited to the input data [14], [15], [19], [25]. For box-type and fuzzy uncertainty, several authors have studied the case in which both input and output data are perturbed, but only for linear regression [12], [16]. As is well known, a major strength of SVR lies in nonlinear regression via kernel functions, so robust formulations that do not use kernel methods are significantly limited in applicability. In addition, most of the existing RSVRs assume that the output data are precise, although these data may also be disturbed by noise.


TABLE I
EXISTING RSVRs AND THE ASSUMPTION ON DATA UNCERTAINTY

Type of uncertainty         | Uncertainty assumption (linear case) | Uncertainty assumption (nonlinear case)
----------------------------|--------------------------------------|----------------------------------------
Stochastic [14], [15], [19] | Input                                | Input
Geometric [13]–[15]         | Input                                | Input
Interval [16], [20]–[22]    | Input & Output                       | Output
Fuzzy [12], [23], [24]      | Input & Output                       | Output

For example, in the field of system identification, both input and output data may be perturbed by measurement errors and/or transmission noise. Moreover, in many applications one receives a list of variables from which a response variable (the output) and explanatory variables (the input) must be chosen. Hence, the assumption that uncertainty affects only the input data is often insufficient; if anything, output uncertainty is at least as common.

The classical linear least squares (LS) method assumes that only the output data are subject to independently and identically distributed Gaussian noise. Golub and Van Loan [26], [27] extended LS to total LS, which takes both input and output uncertainties into consideration. Artificial neural networks that are robust to output noise or outliers have been widely studied over the past several decades [28], [29]. Other traditional regression algorithms, e.g., ridge regression [30] and fuzzy regression [31], [32], also consider output data uncertainties. It is therefore necessary to consider both input and output data uncertainty simultaneously in RSVRs and to extend the linear robust formulations to kernelized forms.

In this paper, RSVR formulations with uncertain input and output data are proposed, together with their kernelized extensions. We consider only stochastic uncertainty and ellipsoidal uncertainty, since these two types allow correlated noise across data entries; in contrast, fuzzy and box-type uncertainty treat the noise on different data entries as independent disturbances, which limits their applicability.

We first investigate the RSVR with uncertain input and output data in a stochastic framework. The observations are treated as random vectors whose mean and covariance are assumed to be known. To formulate a linear regression problem, we adopt two probability constraints, namely, close to mean (CTM) and small residual (SR) [15]. The formulations in [15] consider only input data uncertainties; here we extend them to the case with both input and output data uncertainties, so our models generalize those proposed in [15]. Using Markov's and Chebyshev's inequalities [33], we convert the chance-constrained problem into a second-order cone programming (SOCP) problem [34], which can be solved efficiently by the interior point method (IPM) [35].

Then, data uncertainty is studied from a geometric perspective, and a linear RSVR formulation with input and output data uncertainties is established.


This model generalizes the formulations in [15] and [36], where only input data uncertainties are considered.

Finally, we extend the RSVR to a kernelized formulation for nonlinear regression. Two methods are proposed to estimate the spectral norms (the largest singular values) of the covariance matrices of the observations in the high-dimensional feature space. The first is a statistical method based on kernel principal component analysis; the second is a geometric approach based on a spherical assumption on the uncertainties in the feature space. With the estimated norms, the kernelized formulation again becomes an SOCP. Generally, solving an SOCP is more difficult than solving a quadratic program (QP), and the computational complexity analysis in this paper indicates that the proposed method is more time-demanding than the traditional ε-SVR, so there is a tradeoff between prediction accuracy and computational cost. Besides the time complexity, the proposed method is also more space-demanding; this issue is addressed in the simulation section.

The rest of this paper is organized as follows. In Section II, probability constraints for RSVR are introduced and more tractable sufficient conditions are derived. Section III studies ellipsoidal uncertainty from a geometric perspective. Section IV formulates the optimization problem as an SOCP. We then extend our formulations to nonlinear regression using kernel methods in Section V. The computational complexity of the proposed method is discussed in Section VI. Simulation results are given in Section VII, and conclusions are drawn in Section VIII.

II. ROBUST CONSTRAINTS FOR SVR

A. SVR Without Uncertainties

In the standard ε-SVR, a training data set \mathcal{D} \subseteq \mathbb{R}^n \times \mathbb{R} is given, with elements z_i = (x_i, y_i) \in \mathbb{R}^n \times \mathbb{R}, where x_i is the input vector and y_i is the output variable. The aim of ε-SVR is to find w \in \mathbb{R}^n and b \in \mathbb{R} such that, for each (x_i, y_i), the affine function defined by

f(x_i) = w^T x_i + b - y_i    (1)

is less than a given positive constant ε. The parameters can be obtained by solving the quadratic optimization problem

\min_{w,b} \; \tfrac{1}{2}\|w\|^2
\text{s.t.} \;\; w^T x_i + b - y_i \le \varepsilon, \quad \forall i \in \{1,\dots,p\}
\qquad\; y_i - w^T x_i - b \le \varepsilon, \quad \forall i \in \{1,\dots,p\}    (2)

where \|w\| denotes the Euclidean norm of w, and p is the number of training patterns. In many cases, the above optimization might be infeasible. For this reason, positive slack variables \xi = [\xi_1, \xi_2, \dots, \xi_p]^T and \hat{\xi} = [\hat{\xi}_1, \hat{\xi}_2, \dots, \hat{\xi}_p]^T are introduced for the constraints in (2), and a penalty term is added to the objective function. This leads to the soft margin formulation with L_1 regularization [2]


\min_{w,b,\xi,\hat{\xi}} \; \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{p} (\xi_i + \hat{\xi}_i)
\text{s.t.} \;\; w^T x_i + b - y_i \le \varepsilon + \xi_i, \quad \forall i \in \{1,\dots,p\}
\qquad\; y_i - w^T x_i - b \le \varepsilon + \hat{\xi}_i, \quad \forall i \in \{1,\dots,p\}
\qquad\; \xi_i \ge 0, \; \hat{\xi}_i \ge 0, \quad \forall i \in \{1,\dots,p\}    (3)

where C is a penalty parameter on the training error. One can reformulate (3) as an SOCP by introducing a new positive variable t:

\min_{w,b,\xi,\hat{\xi},t} \; t + C \sum_{i=1}^{p} (\xi_i + \hat{\xi}_i)
\text{s.t.} \;\; w^T x_i + b - y_i \le \varepsilon + \xi_i, \quad \forall i \in \{1,\dots,p\}
\qquad\; y_i - w^T x_i - b \le \varepsilon + \hat{\xi}_i, \quad \forall i \in \{1,\dots,p\}
\qquad\; \|w\| \le t
\qquad\; \xi_i \ge 0, \; \hat{\xi}_i \ge 0, \quad \forall i \in \{1,\dots,p\}.    (4)

B. SVR With Input and Output Data Uncertainties

For the case with data uncertainties, we investigate the RSVR in a stochastic framework. Suppose both the input and output data are perturbed by noise; the observations Z_i = (X_i^T, Y_i)^T \in \mathbb{R}^{n+1} (X_i \in \mathbb{R}^n, Y_i \in \mathbb{R}) are then random vectors. We assume that the mean and covariance of the observations

\bar{X}_i = E(X_i), \quad \bar{Y}_i = E(Y_i)
\Sigma_{xx}^i = \mathrm{Cov}(X_i, X_i), \quad \Sigma_{yy}^i = \mathrm{Cov}(Y_i, Y_i)
\Sigma_{xy}^i = \mathrm{Cov}(X_i, Y_i), \quad \Sigma_{yx}^i = \mathrm{Cov}(Y_i, X_i)
\bar{Z}_i = E((X_i^T, Y_i)^T) = (\bar{X}_i^T, \bar{Y}_i)^T
\Sigma_{zz}^i = \mathrm{Cov}(Z_i, Z_i) = \begin{pmatrix} \Sigma_{xx}^i & \Sigma_{xy}^i \\ \Sigma_{yx}^i & \Sigma_{yy}^i \end{pmatrix}

are known. Here E(\cdot) and \mathrm{Cov}(\cdot,\cdot) denote the expectation and the covariance of random variables, respectively. Denote by e_i the deviation between the i-th observed output and its prediction, and by \bar{e}_i the expectation of e_i:

e_i = w^T X_i + b - Y_i    (5)
\bar{e}_i = w^T \bar{X}_i + b - \bar{Y}_i.    (6)

In order to achieve robustness, the constraints in (3) can be replaced with probability constraints in two ways [15]. The first one is called CTM, which requires the prediction error to be robust to the distribution of Z_i:

\Pr\{|e_i - \bar{e}_i| \ge \alpha_i\} \le \beta, \quad \forall i \in \{1,\dots,p\}    (7)

where \alpha_i \ge 0 is a threshold that controls the deviation between \bar{e}_i and e_i, and 0 < \beta \le 1 is the maximum probability allowing the absolute deviation |e_i - \bar{e}_i| to be greater than \alpha_i. Clearly, if we require the prediction error to be close to its expected value, then small values of \alpha_i and \beta should be used. In this algorithm, \beta is a user-defined parameter, while the \alpha_i are positive decision variables that are penalized for taking large values.

The second one is called SR:

\Pr\{|e_i| \ge \varepsilon + \xi_i\} \le \beta, \quad \forall i \in \{1,\dots,p\}    (8)

where \xi_i \ge 0 and 0 < \beta \le 1. In this probability inequality, \varepsilon is a given constant and \xi_i is a slack variable. Small values of \xi_i and \beta force the observed prediction errors to be close to zero. In analogy with the CTM constraint, \beta is a user-defined parameter, while the \xi_i are positive decision variables that are penalized for taking large values.

Since optimization problems with the probability constraints (7) and (8) are difficult to solve, we now derive more tractable sufficient conditions for these constraints and convert the optimization problem into a solvable SOCP.

C. Sufficient Condition for CTM

Theorem 1: Let w \in \mathbb{R}^n and b \in \mathbb{R} be as in (1), and let \Sigma_{zz}^i be the covariance of Z_i; then a sufficient condition for (7) is

\|(\Sigma_{zz}^i)^{1/2} a\| \le \alpha_i \sqrt{\beta}    (9)

where (\Sigma_{zz}^i)^{1/2} is the square root of \Sigma_{zz}^i and a = (w^T, -1)^T.

Proof: Using the definitions of e_i and \bar{e}_i in (5) and (6), and noticing that

Z_i = \begin{pmatrix} X_i \\ Y_i \end{pmatrix}, \quad \bar{Z}_i = \begin{pmatrix} \bar{X}_i \\ \bar{Y}_i \end{pmatrix}, \quad a = \begin{pmatrix} w \\ -1 \end{pmatrix}

we obtain

e_i - \bar{e}_i = a^T (Z_i - \bar{Z}_i).    (10)

Thus the CTM constraints (7) can be written as

\Pr\{|a^T (Z_i - \bar{Z}_i)| \ge \alpha_i\} \le \beta, \quad \forall i \in \{1,\dots,p\}.    (11)

Since a^T (Z_i - \bar{Z}_i) is a zero-mean random variable with variance a^T \Sigma_{zz}^i a, we use Chebyshev's inequality to bound the left-hand side of (11):

\Pr\{|a^T (Z_i - \bar{Z}_i)| \ge \alpha_i\} \le \frac{a^T \Sigma_{zz}^i a}{\alpha_i^2}, \quad \forall i \in \{1,\dots,p\}.    (12)

Hence, (a^T \Sigma_{zz}^i a)/\alpha_i^2 \le \beta is a sufficient condition for (7). Taking the square root on both sides of (12) yields (9).

D. Sufficient Condition for SR

Theorem 2: Let w \in \mathbb{R}^n and b \in \mathbb{R} be as in (1), and let \Sigma_{zz}^i be the covariance of Z_i; then a sufficient condition for (8) is

\sqrt{a^T \Sigma_{zz}^i a + (w^T \bar{X}_i + b - \bar{Y}_i)^2} \le (\xi_i + \varepsilon)\sqrt{\beta}    (13)

where a = (w^T, -1)^T.

Proof: Using Markov's inequality, we have

\Pr\{|e_i| \ge \varepsilon + \xi_i\} \le \frac{E[e_i^2]}{(\varepsilon + \xi_i)^2}.    (14)


By the above inequality,

\frac{E[e_i^2]}{(\varepsilon + \xi_i)^2} \le \beta    (15)

is a sufficient condition for (8). In addition, we have

E[e_i^2] = E[e_i^2 - \bar{e}_i^2] + \bar{e}_i^2 = a^T \Sigma_{zz}^i a + (w^T \bar{X}_i + b - \bar{Y}_i)^2.    (16)

Combining (15) and (16), we obtain a sufficient condition for (8):

a^T \Sigma_{zz}^i a + (w^T \bar{X}_i + b - \bar{Y}_i)^2 \le (\varepsilon + \xi_i)^2 \beta.    (17)

Taking its square root yields (13).

E. Normal Distribution Case

In the previous subsections, the formulations were derived without any distribution assumption on Z_i. However, if Z_i follows a certain distribution, a less conservative sufficient condition can be derived. Specifically, if we assume that Z_i follows a normal distribution with mean \bar{Z}_i and covariance \Sigma_{zz}^i, less conservative bounds for the CTM constraints can be obtained.

It is easy to verify that (e_i - \bar{e}_i) is a normally distributed random variable with zero mean and variance \sigma_{e_i}^2 = a^T \Sigma_{zz}^i a. Based on the symmetry of the normal distribution function, the CTM constraints (7) are equivalent to

1 - \Phi\!\left(\frac{\alpha_i}{\sigma_{e_i}}\right) \le \frac{\beta}{2}, \quad \forall i \in \{1,\dots,p\}    (18)

where

\Phi(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-\tau^2/2}\, d\tau.    (19)

Thus, (18) can be expressed as

\sigma_{e_i} \le \frac{\alpha_i}{\Phi^{-1}(1 - \beta/2)}.    (20)

Now we obtain the CTM constraints as

\|(\Sigma_{zz}^i)^{1/2} a\| \le \frac{\alpha_i}{\Phi^{-1}(1 - \beta/2)}.    (21)

Comparing the constraints (9) with (21), the CTM constraints can be written in the following unified form:

\|(\Sigma_{zz}^i)^{1/2} a\| \le \gamma(\beta)\,\alpha_i    (22)

where

\gamma(\beta) = \begin{cases} \gamma_N(\beta) = 1/\Phi^{-1}(1 - \beta/2), & \text{normal case} \\ \gamma_G(\beta) = \sqrt{\beta}, & \text{general case.} \end{cases}

One can verify that \gamma_N(\beta) \ge \gamma_G(\beta) for all \beta \in (0, 1). This is because, in the normal case, more information about the distribution of Z_i is available, so a less conservative result can be obtained.

III. GEOMETRICAL INTERPRETATION

In this section, we show that the CTM and SR formulations can also be handled from a geometric viewpoint. Assume that the observation Z_i = (X_i^T, Y_i)^T takes values from an ellipsoid B_i with center \bar{Z}_i = (\bar{X}_i^T, \bar{Y}_i)^T, metric \Sigma_{zz}^i, and radius \beta^{-1/2}. Mathematically,

Z_i \in B_i(\bar{Z}_i, \Sigma_{zz}^i, \beta^{-1/2}) \triangleq \{ Z_i \mid (Z_i - \bar{Z}_i)^T (\Sigma_{zz}^i)^{-1} (Z_i - \bar{Z}_i) \le \beta^{-1} \}.    (23)

Theorem 3: Assume that Z_i is uniformly distributed in B_i(\bar{Z}_i, \Sigma_{zz}^i, \beta^{-1/2}); then (9) is a sufficient condition for the following inequality:

|e_i - \bar{e}_i| \le \alpha_i, \quad \forall Z_i \in B_i(\bar{Z}_i, \Sigma_{zz}^i, \beta^{-1/2}).    (24)

Proof: From (10), and noticing that (Z_i - \bar{Z}_i)^T (\Sigma_{zz}^i)^{-1} (Z_i - \bar{Z}_i) \le \beta^{-1} is equivalent to \|(\Sigma_{zz}^i)^{-1/2}(Z_i - \bar{Z}_i)\| \le \beta^{-1/2}, the left-hand side of (24) can be bounded as follows:

|e_i - \bar{e}_i| = |a^T (\Sigma_{zz}^i)^{1/2} (\Sigma_{zz}^i)^{-1/2} (Z_i - \bar{Z}_i)| \le \|(\Sigma_{zz}^i)^{-1/2}(Z_i - \bar{Z}_i)\| \, \|(\Sigma_{zz}^i)^{1/2} a\| \le \beta^{-1/2} \|(\Sigma_{zz}^i)^{1/2} a\|.    (25)

From the above inequality,

\beta^{-1/2} \|(\Sigma_{zz}^i)^{1/2} a\| \le \alpha_i    (26)

is a sufficient condition for (24). This yields our claim.

For the SR case, a geometrical formulation can be derived as well.

Theorem 4: Assume that Z_i is uniformly distributed in the ellipsoid B_i(\bar{Z}_i, \Sigma_{zz}^i, \beta^{-1/2}). Then, a sufficient condition for the inequality

|e_i| \le \varepsilon + \xi_i, \quad \forall Z_i \in B_i(\bar{Z}_i, \Sigma_{zz}^i, \beta^{-1/2})    (27)

is that

\beta^{-1/2} \|(\Sigma_{zz}^i)^{1/2} a\| + |w^T \bar{X}_i + b - \bar{Y}_i| \le \varepsilon + \xi_i.    (28)

Proof: Using the result in the proof of Theorem 3 and the definition (6), we obtain

|e_i| \le |e_i - \bar{e}_i| + |\bar{e}_i| \le \beta^{-1/2} \|(\Sigma_{zz}^i)^{1/2} a\| + |w^T \bar{X}_i + b - \bar{Y}_i|.    (29)

If the input data are perturbed by spherical noise and the output data are precise, then we have

a^T \Sigma_{zz}^i a = (w^T, -1) \begin{pmatrix} I_n & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} w \\ -1 \end{pmatrix} = \|w\|^2    (30)

where I_n \in \mathbb{R}^{n \times n} is an identity matrix. Thus the SR constraints (28) reduce to

\beta^{-1/2} \|w\| + |w^T \bar{X}_i + b - \bar{Y}_i| \le \varepsilon + \xi_i    (31)

which is the RSVR formulation given in [13] and [14].
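As a numerical aside (not part of the original text), the gap between the distribution-free bound (9) and the Gaussian bound (21) can be checked by simulation: draw Gaussian deviations Z_i - \bar{Z}_i, set α_i from each sufficient condition, and compare the empirical violation probability Pr{|e_i - ē_i| ≥ α_i} with the target β. A minimal sketch with NumPy and SciPy, in which all names and values are illustrative:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, beta = 5, 0.2
w = rng.standard_normal(n)
a = np.concatenate([w, [-1.0]])                 # a = (w^T, -1)^T

A = rng.standard_normal((n + 1, n + 1))
Sigma_zz = A @ A.T                              # covariance of Z_i = (X_i, Y_i)
sigma_e = np.sqrt(a @ Sigma_zz @ a)             # std of e_i - e_bar_i

alpha_cheb = sigma_e / np.sqrt(beta)            # from the Chebyshev condition (9)
alpha_norm = norm.ppf(1 - beta / 2) * sigma_e   # from the Gaussian condition (21)

# empirical violation probabilities under Gaussian perturbations
dev = np.abs(rng.multivariate_normal(np.zeros(n + 1), Sigma_zz, 200000) @ a)
print("Chebyshev alpha:", alpha_cheb, "violation:", np.mean(dev >= alpha_cheb))
print("Gaussian  alpha:", alpha_norm, "violation:", np.mean(dev >= alpha_norm))
print("target beta    :", beta)
```

In this sketch the Gaussian condition attains the target β almost exactly, while the Chebyshev condition is conservative, consistent with γ_N(β) ≥ γ_G(β) in (22).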


IV. OPTIMIZATION PROBLEMS FOR RSVR

Using the robust constraints derived in the previous sections, we now establish RSVR formulations with uncertain input and output data.

A. Formulation for CTM

Under the CTM constraints (7), the RSVR model can be formulated as the following SOCP problem:

\min_{w,b,\alpha,\xi,\hat{\xi},t} \; t + C \sum_{i=1}^{p} (\xi_i + \hat{\xi}_i) + D \sum_{i=1}^{p} \alpha_i
\text{s.t.} \;\; w^T \bar{X}_i + b - \bar{Y}_i \le \varepsilon + \xi_i, \quad \forall i \in \{1,\dots,p\}
\qquad\; \bar{Y}_i - w^T \bar{X}_i - b \le \varepsilon + \hat{\xi}_i, \quad \forall i \in \{1,\dots,p\}
\qquad\; \|(\Sigma_{zz}^i)^{1/2} a\| \le \alpha_i \sqrt{\beta}, \quad \forall i \in \{1,\dots,p\}
\qquad\; \|w\| \le t
\qquad\; \xi_i \ge 0, \; \hat{\xi}_i \ge 0, \quad \forall i \in \{1,\dots,p\}    (32)

where D is a positive user-defined parameter; a larger D corresponds to assigning a higher penalty to the deviation between the observed errors and their expected values.

B. Formulation for SR

Under the SR constraints (8), the RSVR model is formulated as follows:

\min_{w,b,\xi,t} \; t + C \sum_{i=1}^{p} \xi_i
\text{s.t.} \;\; \sqrt{a^T \Sigma_{zz}^i a + (w^T \bar{X}_i + b - \bar{Y}_i)^2} \le (\xi_i + \varepsilon)\sqrt{\beta}, \quad \forall i \in \{1,\dots,p\}
\qquad\; \|w\| \le t
\qquad\; \xi_i \ge 0, \quad \forall i \in \{1,\dots,p\}.    (33)

Notice that the above formulation is also an SOCP problem, since the constraint

\sqrt{a^T \Sigma_{zz}^i a + (w^T \bar{X}_i + b - \bar{Y}_i)^2} \le (\xi_i + \varepsilon)\sqrt{\beta}

is equivalent to the second-order cone constraint

\left\| \begin{pmatrix} (\Sigma_{zz}^i)^{1/2} a \\ w^T \bar{X}_i + b - \bar{Y}_i \end{pmatrix} \right\| \le (\xi_i + \varepsilon)\sqrt{\beta}.

C. Discussion

We now compare the above formulations with the traditional ε-SVR and with RSVRs that only consider input data uncertainties. It can be shown that the proposed formulations are generalizations of the existing SVR models.

1) Without input and output uncertainties, i.e., \Sigma_{zz}^i = 0, the CTM constraints (9) always hold and the robust constraints can be eliminated from (32). This reduces to the standard ε-SVR formulation (4).

2) If we consider uncertain input data X_i and precise output data Y_i, it follows that

a^T \Sigma_{zz}^i a = (w^T, -1) \begin{pmatrix} \Sigma_{xx}^i & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} w \\ -1 \end{pmatrix} = w^T \Sigma_{xx}^i w.

In this case, the CTM constraints (9) become \|(\Sigma_{xx}^i)^{1/2} w\| \le \alpha_i \sqrt{\beta}, which reduce to the constraints given in [15].

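The linear CTM problem (32) is a standard SOCP and can be assembled with any conic modeling tool. The sketch below uses the CVXPY package purely for illustration (the paper itself relies on SeDuMi); the data arrays and default hyperparameters are placeholders rather than the authors' implementation.

```python
import numpy as np
import cvxpy as cp

def ctm_rsvr(Z_bar, S_half, C=1.0, D=1.0, eps=0.1, beta=0.5):
    """Linear CTM formulation (32).
    Z_bar[i] = (X_bar_i, Y_bar_i); S_half[i] = square root of Sigma_zz^i."""
    p, d = Z_bar.shape                  # d = n + 1
    n = d - 1
    a = cp.Variable(d)                  # a = (w^T, -1)^T
    b, t = cp.Variable(), cp.Variable()
    xi = cp.Variable(p, nonneg=True)
    xih = cp.Variable(p, nonneg=True)
    alpha = cp.Variable(p, nonneg=True)

    e_bar = Z_bar @ a + b               # e_bar_i = w^T X_bar_i + b - Y_bar_i
    cons = [a[n] == -1,
            e_bar <= eps + xi,          # upper epsilon-tube constraints
            -e_bar <= eps + xih,        # lower epsilon-tube constraints
            cp.norm(a[:n]) <= t]        # ||w|| <= t
    cons += [cp.norm(S_half[i] @ a) <= alpha[i] * np.sqrt(beta)
             for i in range(p)]         # robust CTM constraints (9)

    obj = cp.Minimize(t + C * cp.sum(xi + xih) + D * cp.sum(alpha))
    cp.Problem(obj, cons).solve()
    return a.value[:n], b.value         # recovered (w, b)
```

The SR formulation (33) can be modeled analogously, with one stacked second-order cone constraint per training sample.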

V. KERNELIZED FORMULATIONS

A. Kernelized Formulations for CTM and SR

In this section, we propose the kernelized RSVR for nonlinear regression. To this purpose, we first map the input data into a high-dimensional feature space using a mapping ϕ. Linear regression can then be performed in the feature space. Following (4), the standard SVR formulation in the feature space is

\min_{\tilde{w},\tilde{b},\xi,\hat{\xi},t} \; t + C \sum_{i=1}^{p} (\xi_i + \hat{\xi}_i)
\text{s.t.} \;\; \tilde{w}^T \varphi(\bar{X}_i) + \tilde{b} - \bar{Y}_i \le \varepsilon + \xi_i, \quad \forall i \in \{1,\dots,p\}
\qquad\; \bar{Y}_i - \tilde{w}^T \varphi(\bar{X}_i) - \tilde{b} \le \varepsilon + \hat{\xi}_i, \quad \forall i \in \{1,\dots,p\}
\qquad\; \|\tilde{w}\| \le t
\qquad\; \xi_i \ge 0, \; \hat{\xi}_i \ge 0, \quad \forall i \in \{1,\dots,p\}    (34)

where \tilde{w} and \tilde{b} are the parameters of the regression hyperplane in the feature space. By repeating the derivation of Section II-C in the feature space, it is easy to incorporate the CTM constraints into the above formulation:

\min_{\tilde{w},\tilde{b},\alpha,\xi,\hat{\xi},t} \; t + C \sum_{i=1}^{p} (\xi_i + \hat{\xi}_i) + D \sum_{i=1}^{p} \alpha_i
\text{s.t.} \;\; \tilde{w}^T \varphi(\bar{X}_i) + \tilde{b} - \bar{Y}_i \le \varepsilon + \xi_i, \quad \forall i \in \{1,\dots,p\}
\qquad\; \bar{Y}_i - \tilde{w}^T \varphi(\bar{X}_i) - \tilde{b} \le \varepsilon + \hat{\xi}_i, \quad \forall i \in \{1,\dots,p\}
\qquad\; \|(\tilde{\Sigma}_{zz}^i)^{1/2} \tilde{a}\| \le \alpha_i \sqrt{\beta}, \quad \forall i \in \{1,\dots,p\}
\qquad\; \|\tilde{w}\| \le t
\qquad\; \xi_i \ge 0, \; \hat{\xi}_i \ge 0, \quad \forall i \in \{1,\dots,p\}    (35)

where \tilde{a} = (\tilde{w}^T, -1)^T and \tilde{\Sigma}_{zz}^i is the covariance in the feature space. Note that \tilde{w} lies in the span of the training data points in the feature space, so that

\tilde{w} = \sum_{j=1}^{p} \theta_j \varphi(\bar{X}_j)    (36)

where \theta = (\theta_1, \theta_2, \dots, \theta_p) is a real vector. Using a kernel function k(x, y) = \varphi(x)^T \varphi(y), we have

\tilde{w}^T \varphi(\bar{X}_i) = \sum_{j=1}^{p} \theta_j k(\bar{X}_j, \bar{X}_i) = \theta^T K_i    (37)

\tilde{w}^T \tilde{w} = \sum_{i,j} \theta_i \theta_j k(\bar{X}_j, \bar{X}_i) = \theta^T K \theta    (38)

where K is the kernel matrix with (i, j)-th element K_{i,j} = k(\bar{X}_i, \bar{X}_j), and K_i = (K_{1,i}, K_{2,i}, \dots, K_{p,i}) is the i-th column of K. Let Q = K^{1/2} (note that K is real, symmetric, and positive semidefinite); then \theta^T K \theta = (Q\theta)^T Q\theta and \|\tilde{w}\| = \|Q\theta\|. Thus, the optimization problem (35) can be rewritten as

\min_{\theta,\tilde{b},\alpha,\xi,\hat{\xi},t} \; t + C \sum_{i=1}^{p} (\xi_i + \hat{\xi}_i) + D \sum_{i=1}^{p} \alpha_i
\text{s.t.} \;\; \theta^T K_i + \tilde{b} - \bar{Y}_i \le \varepsilon + \xi_i, \quad \forall i \in \{1,\dots,p\}
\qquad\; \bar{Y}_i - \theta^T K_i - \tilde{b} \le \varepsilon + \hat{\xi}_i, \quad \forall i \in \{1,\dots,p\}
\qquad\; \|\tilde{\Sigma}_{zz}^i\|^{1/2} \sqrt{\theta^T K \theta + 1} \le \alpha_i \sqrt{\beta}, \quad \forall i \in \{1,\dots,p\}
\qquad\; \|Q\theta\| \le t
\qquad\; \xi_i \ge 0, \; \hat{\xi}_i \ge 0, \quad \forall i \in \{1,\dots,p\}.    (39)

Notice that the constraint \|\tilde{\Sigma}_{zz}^i\|^{1/2} \sqrt{\theta^T K \theta + 1} \le \alpha_i \sqrt{\beta} in (39) is a sufficient condition for \|(\tilde{\Sigma}_{zz}^i)^{1/2} \tilde{a}\| \le \alpha_i \sqrt{\beta} in (35), since

\|(\tilde{\Sigma}_{zz}^i)^{1/2} \tilde{a}\| \le \|(\tilde{\Sigma}_{zz}^i)^{1/2}\| \, \|\tilde{a}\| = \|\tilde{\Sigma}_{zz}^i\|^{1/2} \sqrt{\theta^T K \theta + 1}.

Similarly, the kernelized version of the SR formulation is given by

\min_{\theta,\tilde{b},\xi,t} \; t + C \sum_{i=1}^{p} \xi_i
\text{s.t.} \;\; \sqrt{\|\tilde{\Sigma}_{zz}^i\|(\theta^T K \theta + 1) + (\theta^T K_i + \tilde{b} - \bar{Y}_i)^2} \le (\xi_i + \varepsilon)\sqrt{\beta}, \quad \forall i \in \{1,\dots,p\}
\qquad\; \|Q\theta\| \le t
\qquad\; \xi_i \ge 0, \quad \forall i \in \{1,\dots,p\}.    (40)

We mention that, for an optimal solution \theta^* and \tilde{b}^* of (39) or (40), the approximate function is given by

h(\theta^*, \tilde{b}^*, x) = \sum_{i=1}^{p} \theta_i^* k(\bar{X}_i, x) + \tilde{b}^*.    (41)

B. Covariance in the Feature Space

In the kernelized formulations (39) and (40), the covariance in the feature space (\tilde{\Sigma}_{zz}^i) is required. In the original space the covariances are accessible; in the feature space, however, they cannot be obtained directly, since the nonlinear mapping ϕ is usually unknown. Notice that the kernelized formulations only involve the spectral norms of the covariance matrices, so it is sufficient to estimate these norms. We assume that the perturbed input and output data are uncorrelated, i.e., \tilde{\Sigma}_{xy}^i = 0. Then

\|\tilde{\Sigma}_{zz}^i\| = \left\| \begin{pmatrix} \tilde{\Sigma}_{xx}^i & 0 \\ 0 & \Sigma_{yy}^i \end{pmatrix} \right\| = \max\{\|\tilde{\Sigma}_{xx}^i\|, \|\Sigma_{yy}^i\|\}.    (42)

Since \Sigma_{yy}^i is a scalar, its norm is easily obtained. We now provide two methods to estimate \|\tilde{\Sigma}_{xx}^i\|.

1) Statistical approach: For each input data point X_i, we randomly generate M samples \{X_i^1, X_i^2, \dots, X_i^M\} according to the distribution of X_i with mean \bar{X}_i and covariance \Sigma_{xx}^i. Then, the covariance matrix of ϕ(X_i) in the feature space can be estimated by

\tilde{\Sigma}_{xx}^i = \frac{1}{M} \sum_{m=1}^{M} \bar{\varphi}(X_i^m) \bar{\varphi}(X_i^m)^T    (43)

where

\bar{\varphi}(X_i^m) = \varphi(X_i^m) - \frac{1}{M} \sum_{j=1}^{M} \varphi(X_i^j)    (44)

and ϕ(X_i^m) are the nonlinear mappings of the input variables X_i^m. Let λ be an eigenvalue of \tilde{\Sigma}_{xx}^i and v the corresponding eigenvector:

\tilde{\Sigma}_{xx}^i v = \lambda v.    (45)

According to Schölkopf et al. [37], (45) can be transformed into the eigenvalue problem

\bar{K} d = M \lambda d    (46)

where d = [d_1, \dots, d_M]^T is a column vector satisfying

v = \sum_{m=1}^{M} d_m \bar{\varphi}(X_i^m)    (47)

and \bar{K} = K - 1_M K - K 1_M + 1_M K 1_M. Here, K is a symmetric M × M matrix with elements K_{mj} = \varphi(X_i^m)^T \varphi(X_i^j) = k(X_i^m, X_i^j), and 1_M is an M × M matrix with elements (1_M)_{mj} = 1/M. Let \bar{\lambda}_1 \le \dots \le \bar{\lambda}_M be the eigenvalues of \bar{K}. From (45) and (46), it follows that \lambda_1 = \bar{\lambda}_1/M, \lambda_2 = \bar{\lambda}_2/M, \dots, \lambda_M = \bar{\lambda}_M/M are the eigenvalues of \tilde{\Sigma}_{xx}^i. Thus the norm of \tilde{\Sigma}_{xx}^i is estimated by

\|\tilde{\Sigma}_{xx}^i\| = \max\{\lambda_1, \dots, \lambda_M\} = \lambda_M = \bar{\lambda}_M / M.    (48)
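In practice, the estimate (48) amounts to centering the kernel matrix of the M samples drawn around X̄_i and dividing its largest eigenvalue by M. A minimal NumPy sketch assuming a Gaussian kernel (the sampling setup and names are illustrative):

```python
import numpy as np

def feature_cov_norm(X_samples, sigma=1.0):
    """Estimate ||Sigma_xx^i|| in the feature space from M samples of X_i,
    following the kernel-PCA argument leading to (48)."""
    M = X_samples.shape[0]
    sq = np.sum((X_samples[:, None, :] - X_samples[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2.0 * sigma ** 2))          # Gaussian kernel matrix
    one_M = np.full((M, M), 1.0 / M)              # (1_M)_{ij} = 1/M
    K_bar = K - one_M @ K - K @ one_M + one_M @ K @ one_M
    return np.linalg.eigvalsh(K_bar)[-1] / M      # largest eigenvalue / M, cf. (48)

# usage: M = 50 samples drawn around one training input X_bar_i
rng = np.random.default_rng(0)
X_bar_i, Sigma_xx_i = np.zeros(3), 0.05 * np.eye(3)
samples = rng.multivariate_normal(X_bar_i, Sigma_xx_i, size=50)
print(feature_cov_norm(samples, sigma=1.0))
```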


2) Geometrical approach: From a geometric perspective, Trafalis [36] has proved that the perturbations in the feature space are bounded if the perturbations are bounded in the original space. We assume spherical perturbations in the original space and in the feature space, such that

\Sigma_{xx}^i = \mathrm{diag}(u), \quad \tilde{\Sigma}_{xx}^i = \mathrm{diag}(\tilde{u})    (49)

where u \in \mathbb{R}^n and \tilde{u} \in \mathbb{R}^p are the perturbations in the original space and in the feature space, respectively, and diag(u), diag(\tilde{u}) denote the diagonal matrices whose diagonal elements are the elements of u and \tilde{u}. The magnitude of the perturbation in the feature space is bounded by

\|\tilde{u}_i\| = \max_{u_i} \|\varphi(X_i + u_i) - \varphi(X_i)\| = \max_{u_i} \big( k(X_i + u_i, X_i + u_i) + k(X_i, X_i) - 2 k(X_i, X_i + u_i) \big)^{1/2}.    (50)

For the Gauss kernel function

k(x, y) = \exp\!\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)    (51)

we have that

k(X_i + u_i, X_i + u_i) = k(X_i, X_i) = 1, \quad k(X_i, X_i + u_i) = \exp\!\left(-\frac{\|u_i\|^2}{2\sigma^2}\right) = \exp\!\left(-\frac{n\|\Sigma_{xx}^i\|^2}{2\sigma^2}\right).

Thus (50) becomes

\|\tilde{u}_i\| = \sqrt{2\left(1 - \exp\!\left(-\frac{n\|\Sigma_{xx}^i\|^2}{2\sigma^2}\right)\right)}.    (52)

Now the norm of the covariance matrix in the feature space is estimated by

\|\tilde{\Sigma}_{xx}^i\| = \frac{\|\tilde{u}_i\|}{\sqrt{p}} = \sqrt{\frac{2}{p}\left(1 - \exp\!\left(-\frac{n\|\Sigma_{xx}^i\|^2}{2\sigma^2}\right)\right)}.    (53)
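The geometric estimate (53), by contrast, requires only the spectral norm of the input covariance, the kernel width σ, and the number of training patterns p. A one-function NumPy sketch under the spherical assumption above (illustrative only):

```python
import numpy as np

def geometric_cov_norm(Sigma_xx_i, sigma, p):
    """Geometric estimate (53) of ||Sigma_xx^i|| in the feature space
    for the Gaussian kernel with width sigma and p training patterns."""
    n = Sigma_xx_i.shape[0]
    s = np.linalg.norm(Sigma_xx_i, 2)            # spectral norm in the original space
    return np.sqrt(2.0 * (1.0 - np.exp(-n * s ** 2 / (2.0 * sigma ** 2))) / p)
```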

VI. COMPUTATIONAL COMPLEXITY

An SOCP problem is expressed as

\min_{x} \; g^T x
\text{s.t.} \;\; \|A_i x + b_i\| \le c_i^T x + d_i, \quad \forall i \in \{1,\dots,N\}    (54)

where x \in \mathbb{R}^l is the optimization variable, the problem parameters are g \in \mathbb{R}^l, A_i \in \mathbb{R}^{(n_i-1)\times l}, c_i \in \mathbb{R}^l, and d_i \in \mathbb{R}, and the constraint

\|A_i x + b_i\| \le c_i^T x + d_i    (55)

is called a second-order cone constraint of dimension n_i. Note that when n_i = 1, the above constraint reduces to a linear inequality constraint.

SOCPs are usually solved by the primal-dual IPM, in which the number of iterations required to solve a problem is almost constant [38]. Thus the computational complexity of solving an SOCP is proportional to the complexity of each iteration, which is given by O(l^2 \sum_i n_i) [39]. For the linear CTM formulation (32), we have

l = 3p + n + 2, \quad \sum_i n_i = 4p + (n + 1) + (n + 2)p.    (56)

Hence, the computational complexity of the linear CTM formulation (32) is O(np^3) (for n \ll p). Similarly, the computational complexities of the linear SR formulation (33), the kernelized formulation (39), and the kernelized formulation (40) are O(np^3), O(p^4), and O(p^4), respectively.

VII. EXPERIMENTAL RESULTS

In this section, we empirically test the proposed robust formulations on both synthetic and real-world regression problems with input and output data uncertainties. The SOCP problems are solved with the SeDuMi MATLAB toolbox, version 1.3 [40], [41]. In order to compare our method with the standard ε-SVR and other RSVRs, we follow [15] and introduce three error measures.

1) Robustness error (RE): From a geometric perspective, the CTM aims to bound the difference between \bar{e}_i = a^T \bar{Z}_i and e_i = a^T Z_i for any Z_i that takes values from the ellipsoid B_i(\bar{Z}_i, \Sigma_{zz}^i, \beta^{-1/2}). Thus we define the RE as

e_{\mathrm{robust}}^i(\bar{Z}_i, \Sigma_{zz}^i, \beta^{-1/2}) = \beta^{-1/2} \|(\Sigma_{zz}^i)^{1/2} a\|.    (57)

2) Expected residual (ER): From (16) and (17), the SR constraints bound the expectation of the squared residual. Thus we define the ER measure as

e_{\mathrm{exp}}^i(\bar{Z}_i, \Sigma_{zz}^i) = \sqrt{a^T \Sigma_{zz}^i a + \bar{e}_i^2}.    (58)

3) Worst case error (WCE): Both the CTM and SR constraints bound a^T \Sigma_{zz}^i a and \bar{e}_i^2 by minimizing a combination of the two. From (29), the maximum of |e_i| over B_i(\bar{Z}_i, \Sigma_{zz}^i, \beta^{-1/2}) is \beta^{-1/2}\|(\Sigma_{zz}^i)^{1/2} a\| + |w^T \bar{X}_i + b - \bar{Y}_i|. Thus we define the worst case residual measure as

e_{\mathrm{worst}}^i(\bar{Z}_i, \Sigma_{zz}^i, \beta^{-1/2}) = \beta^{-1/2} \|(\Sigma_{zz}^i)^{1/2} a\| + |\bar{e}_i|.    (59)
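For reference, the three measures (57)-(59) can be evaluated directly from a fitted (w, b), the observation statistics (Z̄_i, Σ_zz^i), and β. A short NumPy sketch (function and variable names are illustrative):

```python
import numpy as np

def error_measures(w, b, Z_bar_i, Sigma_zz_i, beta):
    """Return RE (57), ER (58), and WCE (59) for a single observation."""
    a = np.concatenate([w, [-1.0]])               # a = (w^T, -1)^T
    e_bar = w @ Z_bar_i[:-1] + b - Z_bar_i[-1]    # expected residual e_bar_i
    robust = np.sqrt(a @ Sigma_zz_i @ a)          # ||(Sigma_zz^i)^{1/2} a||
    re = robust / np.sqrt(beta)                   # robustness error (57)
    er = np.sqrt(a @ Sigma_zz_i @ a + e_bar ** 2) # expected residual (58)
    wce = robust / np.sqrt(beta) + abs(e_bar)     # worst case error (59)
    return re, er, wce
```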

A. Performances on Synthetic Datasets

1) Linear Regression: We start by testing the proposed linear RSVR on a synthetic problem. The regression problem for testing is

y = w^T x + b, \quad w^T = [1, 2, 3, 4, 5], \quad b = -7.    (60)

We randomly generate 100 samples of x according to a standard normal distribution and compute the corresponding observations y by (60). Gaussian noise with zero mean and a randomly chosen covariance is then added to both the input and output data. We randomly choose 90% of the data for training and the rest for testing.

In the CTM formulation, there are four hyperparameters, i.e., C, D, β, and ε, to be tuned. In this experiment, ε is empirically set to 0.1, and β increases from 0.1 to 1.0 with a step size of 0.1. The hyperparameters C and D are both selected by a grid search over the exponential sequence [2^{-5}, 2^{-4}, ..., 2^{10}] using 10-fold cross-validation on the whole dataset. In the SR formulation, there are three hyperparameters, i.e., C, β, and ε. Similarly, ε is set to 0.1, β increases from 0.1 to 1.0 with a step size of 0.1, and C is selected by cross-validation as described above.

The standard ε-SVR as well as the RSVRs of [15], which assume noise only in the input data, are used for comparison. Note that the RSVRs proposed in [15] have the same hyperparameters as our algorithm. The standard ε-SVR is solved using LIBSVM [42] whose hyperparameters are set as suggested in [43]. For all algorithms, the testing errors over 100 random realizations are reported.

The results of the CTM formulations are shown in Fig. 1. In this figure, CTM-Prop., ε-SVR, and CTM-Shivas. denote, respectively, our proposed robust CTM method, the traditional ε-SVR method, and the CTM method introduced by Shivaswamy et al. [15], which assumes noise only in the input data. Fig. 1(a) shows the RE for the three training methods, and Fig. 1(b) shows the WCE. The results demonstrate that, at any value of β, the proposed CTM formulation yields a smaller error than the standard ε-SVR and the CTM formulation in [15].
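A minimal sketch of the synthetic data generation described above, assuming NumPy; the covariance construction below is only one plausible choice, since the paper states only that the noise covariance is randomly chosen:

```python
import numpy as np

rng = np.random.default_rng(0)
w_true, b_true, p = np.array([1.0, 2.0, 3.0, 4.0, 5.0]), -7.0, 100

# clean samples from the linear model (60)
X = rng.standard_normal((p, 5))
y = X @ w_true + b_true

# zero-mean Gaussian noise with a randomly chosen covariance on (input, output)
A = 0.3 * rng.standard_normal((6, 6))
Sigma_zz = A @ A.T
noise = rng.multivariate_normal(np.zeros(6), Sigma_zz, size=p)
X_obs, y_obs = X + noise[:, :5], y + noise[:, 5]

# 90% of the perturbed samples for training, the rest for testing
idx = rng.permutation(p)
train, test = idx[:90], idx[90:]
```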


[Fig. 1. Results of CTM formulation. (a) RE. (b) WCE. Curves for CTM-Prop., CTM-Shivas., and ε-SVR plotted against β.]

[Fig. 2. Results of SR formulation. (a) ER. (b) WCE. Curves for SR-Prop., SR-Shivas., and ε-SVR plotted against β.]

TABLE II
RSVR PERFORMANCE (WCE) ON A DATASET WITH DIFFERENTLY DISTRIBUTED NOISE

Noise distribution | CTM Mean | CTM Std | SR Mean | SR Std
-------------------|----------|---------|---------|-------
Gaussian           | 4.8800   | 0.2106  | 4.4815  | 0.0983
Student's          | 4.8849   | 0.2182  | 4.4862  | 0.1007
Uniform            | 4.8632   | 0.2263  | 4.4966  | 0.1012
Exponential        | 4.8538   | 0.1991  | 4.4740  | 0.0942

Fig. 2 shows the testing errors of the proposed SR formulation, where SR-Prop., SR-Shivas., and ε-SVR denote, respectively, our proposed robust SR method, the SR method introduced in [15], and the traditional ε-SVR method. It is clear that our formulation outperforms the other two in terms of the WCE.

The above experiments are conducted on an uncertain dataset with Gaussian noise. In practice, the noise may follow other types of distributions. As discussed in Section II, the proposed RSVR does not need any probability distribution information about the noise, except that the mean and covariance are known. To test the performance of our method under different situations, we train the proposed RSVR on the above dataset with four types of noise, namely, Gaussian, Student's, uniform, and exponential. These four types of noise all have zero mean and the same covariance. The hyperparameter β is fixed to 0.5, and all other parameters are selected as mentioned above. The mean and standard deviation of the WCE over 100 realizations are summarized in Table II. From the simulation results, we conclude that the performance of both the CTM and SR methods is resilient to the noise distribution, provided that the mean and covariance are known.
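One simple way to generate the four zero-mean noise types with a common covariance, as used for Table II, is to draw i.i.d. unit-variance samples from each distribution and multiply by a square root of the covariance. This construction is an assumption made for illustration and is not necessarily the procedure used by the authors:

```python
import numpy as np

def correlated_noise(kind, Sigma, size, rng, df=5):
    """Zero-mean noise with covariance Sigma from several marginal families."""
    d = Sigma.shape[0]
    if kind == "gaussian":
        z = rng.standard_normal((size, d))
    elif kind == "student":      # Student's t rescaled to unit variance
        z = rng.standard_t(df, (size, d)) / np.sqrt(df / (df - 2))
    elif kind == "uniform":      # U(-sqrt(3), sqrt(3)) has unit variance
        z = rng.uniform(-np.sqrt(3), np.sqrt(3), (size, d))
    elif kind == "exponential":  # Exp(1) shifted to zero mean has unit variance
        z = rng.exponential(1.0, (size, d)) - 1.0
    L = np.linalg.cholesky(Sigma)
    return z @ L.T               # covariance becomes L L^T = Sigma

rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
for kind in ("gaussian", "student", "uniform", "exponential"):
    e = correlated_noise(kind, Sigma, 100000, rng)
    print(kind, np.round(np.cov(e.T), 2))        # all close to Sigma
```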


2) Nonlinear Regression: To validate our kernelized formulation, we train our RSVR on a toy dataset, i.e., the sinc function problem f(x) = \sin(x)/x.

TABLE III
TESTING RMSE AND WCE ON THE SINC DATASET

Data sets | Measure | Traditional ε-SVR | CTM-Shivas.    | CTM-Proposed   | SR-Shivas.     | SR-Proposed
----------|---------|-------------------|----------------|----------------|----------------|---------------
r = 0.01  | RMSE    | 0.1459 (0.007)    | 0.1211 (0.006) | 0.1205 (0.009) | 0.1202 (0.013) | 0.1196 (0.008)
          | WCE     | 0.5776 (0.037)    | 0.3346 (0.041) | 0.2711 (0.048) | 0.2651 (0.021) | 0.2540 (0.015)
r = 0.05  | RMSE    | 0.2502 (0.059)    | 0.2018 (0.030) | 0.1648 (0.021) | 0.1934 (0.014) | 0.1909 (0.018)
          | WCE     | 0.9500 (0.042)    | 0.4473 (0.045) | 0.4389 (0.020) | 0.4215 (0.028) | 0.4020 (0.021)
r = 0.1   | RMSE    | 0.3955 (0.093)    | 0.3007 (0.034) | 0.2830 (0.027) | 0.2766 (0.020) | 0.2601 (0.025)
          | WCE     | 1.3243 (0.103)    | 0.7547 (0.069) | 0.7036 (0.043) | 0.7138 (0.031) | 0.6773 (0.048)
r = 0.2   | RMSE    | 0.5957 (0.039)    | 0.4907 (0.021) | 0.4830 (0.028) | 0.4929 (0.014) | 0.4821 (0.026)
          | WCE     | 2.5243 (0.198)    | 2.0985 (0.145) | 0.9036 (0.099) | 2.1186 (0.104) | 0.9364 (0.097)

In this experiment, 50 observations are generated from the interval [−3, 3]. Note that the SeDuMi MATLAB toolbox used for solving (39) and (40) needs to store a matrix whose size is proportional to p^2 × p; owing to the memory limitation of the MATLAB platform, only small datasets (fewer than 200 patterns) are used in our simulation. Normally distributed noise with zero mean and covariance

\Sigma_{zz} = r \cdot \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}

is added to the observations. The parameter r is used to scale the level of uncertainty of the dataset. In our experiment, 80% of the observations are used for training and the other 20% for testing. The Gaussian kernel K(x, y) = \exp(-\gamma \|x - y\|^2) is adopted, and the hyperparameter γ is selected from the exponential sequence [2^{-5}, 2^{-4}, ..., 2^{10}] using 10-fold cross-validation on the whole dataset. The hyperparameter β is selected from the sequence [0.1, 0.2, ..., 1.0] using 10-fold cross-validation as well. The other hyperparameters, i.e., C, D, and ε, are determined by the same method as in the linear regression case. The norms of the covariance matrices in (39) and (40) are estimated by the statistical approach presented in Section V, with the number of samples set to M = 50.

The testing root mean squared error (RMSE) and testing WCE averaged over 20 independent realizations are summarized in Table III. One can observe that the proposed CTM and SR formulations give a significantly smaller WCE than the traditional ε-SVR and the RSVRs which only consider input data uncertainties, while the RMSE given by our RSVR is comparable to that of the other methods. As the uncertainty-level parameter r increases, the advantage of our method becomes more evident. The main advantage of the proposed method is thus that it usually leads to much smaller prediction errors in the worst case.

We also conduct an experiment to verify the proposed statistical approach for estimating the norms of the covariance matrices in the feature space. Various numbers of samples are generated; the mean and standard deviation of the WCE over 100 realizations are used to evaluate the performance. Additionally, the running time spent on estimating the norms is given in Table IV.

TABLE IV
PERFORMANCE WITH DIFFERENT NUMBERS OF SAMPLES USED FOR ESTIMATING THE COVARIANCE NORMS IN THE FEATURE SPACE

Samples | WCE mean | WCE std. | Time
--------|----------|----------|-------
M = 10  | 0.8186   | 0.0555   | 0.0046
M = 20  | 0.8171   | 0.0366   | 0.0121
M = 50  | 0.8102   | 0.0249   | 0.0747
M = 100 | 0.8114   | 0.0193   | 0.3262
M = 200 | 0.8122   | 0.0223   | 1.5981

TABLE V
SUMMARY OF THE UCI DATASETS USED

Datasets | No. of training data | No. of testing data | No. of features
---------|----------------------|---------------------|----------------
mpg      | 353                  | 39                  | 8
housing  | 455                  | 51                  | 14
mg       | 1246                 | 139                 | 6
abalone  | 3759                 | 418                 | 8

As shown in Table IV, the mean of the WCE does not vary much with the number of samples, while the standard deviation decreases as M increases. However, when M is larger than 100, the standard deviation stops decreasing. From Table IV, the running time is approximately proportional to M^2; this is because most of the running time is spent on computing the kernel matrix \bar{K} in (46), whose computational complexity is proportional to M^2. Taking both the standard deviation and the running time into consideration, M = 50 is a good choice since it gives a relatively steady result within an acceptable time.

B. Performances on Real Datasets

We further validate our method on four real datasets from the UCI database [44] and the StatLib repository [45]. We randomly pick 90% of the patterns for training and the rest for testing. Table V lists the characteristics of these datasets. Similar to the preceding experiment, the input and output data are perturbed by Gaussian noise with zero mean and a randomly chosen covariance matrix. For all the datasets, the linear formulations are adopted and the hyperparameters are selected by the same method as described in Section VII-A1. The average testing error and corresponding standard deviation over 20 random partitions of the training and testing sets are given in Table VI.


TABLE VI
TESTING ERRORS (RE, ER, AND WCE) ON THE UCI DATASETS

Data sets | Measure | Traditional ε-SVR | CTM-Shivas.   | CTM-Proposed  | SR-Shivas.    | SR-Proposed
----------|---------|-------------------|---------------|---------------|---------------|--------------
mpg       | RE      | 1.884 (0.307)     | 1.502 (0.375) | 1.323 (0.290) | --            | --
          | ER      | 4.153 (0.302)     | --            | --            | 4.018 (0.345) | 3.980 (0.340)
          | WCE     | 4.631 (0.398)     | 4.426 (0.342) | 4.293 (0.392) | 4.291 (0.298) | 4.021 (0.277)
housing   | RE      | 2.852 (0.263)     | 2.096 (0.496) | 1.174 (0.622) | --            | --
          | ER      | 5.842 (0.795)     | --            | --            | 5.592 (0.795) | 5.442 (0.795)
          | WCE     | 7.771 (0.608)     | 6.844 (0.645) | 5.589 (0.787) | 5.821 (0.643) | 5.305 (0.629)
mg        | RE      | 1.068 (0.002)     | 1.066 (0.002) | 1.059 (0.002) | --            | --
          | ER      | 1.034 (0.021)     | --            | --            | 1.042 (0.019) | 1.029 (0.021)
          | WCE     | 1.711 (0.032)     | 1.702 (0.034) | 1.670 (0.034) | 1.738 (0.033) | 1.672 (0.034)
abalone   | RE      | 1.714 (0.311)     | 1.353 (0.023) | 1.266 (0.072) | --            | --
          | ER      | 2.816 (0.034)     | --            | --            | 2.735 (0.028) | 2.713 (0.027)
          | WCE     | 4.073 (0.032)     | 3.834 (0.041) | 3.732 (0.043) | 3.890 (0.033) | 3.754 (0.029)

TABLE VII
RUNNING TIME (IN SECONDS)

Data sets | Traditional ε-SVR | CTM-Shivas. | CTM-Proposed | SR-Shivas. | SR-Proposed
----------|-------------------|-------------|--------------|------------|------------
mpg       | 0.09              | 5.09        | 5.10         | 1.12       | 1.19
housing   | 0.18              | 10.2        | 10.2         | 3.20       | 3.61
mg        | 1.15              | 50.9        | 52.2         | 11.8       | 10.2
abalone   | 5.85              | 175.2       | 178.1        | 36.7       | 34.2

It is obvious that the proposed CTM and SR approaches outperform the traditional ε-SVR and the RSVRs proposed in [15] on all four datasets. The running time spent on training is shown in Table VII. When both input and output uncertainties are taken into consideration, the running time is significantly longer than that of the traditional ε-SVR. This is because the RSVR formulation converts the linear constraints of the traditional ε-SVR into SOCP constraints, which are more difficult to solve. However, the running time of our algorithm is comparable with that of the RSVR proposed in [15], while our algorithm yields lower testing errors.

VIII. CONCLUSION

For both linear and nonlinear regression problems with uncertain input and output data, we obtained new robust support vector formulations that deal with stochastic uncertainty, provided that the mean and covariance of the noise are known. In particular, for normally distributed perturbations we developed a robust formulation that gives a less conservative result. For the case of ellipsoidal uncertainty, similar robust formulations were obtained from a geometrical perspective. Both the linear and nonlinear RSVRs were formulated as SOCPs, which can be solved using off-the-shelf solvers.

In the simulations, we compared our RSVR with the standard ε-SVR and with existing RSVRs which only consider input data uncertainties. The simulation results showed that, in the presence of input and output data uncertainty, our method leads to better performance. Though an SOCP can be solved efficiently by existing tools, it is computationally more expensive than solving the QP of a traditional ε-SVR. The major limitation of the proposed

algorithm is its relatively high memory demand. The proposed linear and kernelized formulations have space complexities of O(np^2) and O(p^3), respectively, which makes it impractical to solve many large real-world problems. A topic for future research is the development of sparse matrix techniques to make the proposed RSVR feasible for large datasets.

REFERENCES

[1] V. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.
[2] C. Cortes and V. Vapnik, “Support vector networks,” Mach. Learn., vol. 20, no. 3, pp. 273–297, 1995.
[3] V. Vapnik, “An overview of statistical learning theory,” IEEE Trans. Neural Netw., vol. 10, no. 5, pp. 988–999, Sep. 1999.
[4] Y. Miche, A. Sorjamaa, P. Bas, O. Simula, C. Jutten, and A. Lendasse, “OP-ELM: Optimally pruned extreme learning machine,” IEEE Trans. Neural Netw., vol. 21, no. 1, pp. 158–162, Jan. 2010.
[5] E. J. Bayro-Corrochano and N. Arana-Daniel, “Clifford support vector machines for classification, regression, and recurrence,” IEEE Trans. Neural Netw., vol. 21, no. 11, pp. 1731–1746, Nov. 2010.
[6] L. Duan, D. Xu, and I. W. H. Tsang, “Domain adaptation from multiple sources: A domain-dependent regularization approach,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 3, pp. 504–518, Mar. 2012.
[7] J. B. Yang and C. J. Ong, “Feature selection using probabilistic prediction of support vector regression,” IEEE Trans. Neural Netw., vol. 22, no. 6, pp. 954–962, Jun. 2011.
[8] J. Lopez and J. R. Dorronsoro, “Simple proof of convergence of the SMO algorithm for different SVM variants,” IEEE Trans. Neural Netw., vol. 23, no. 7, pp. 1142–1147, Jul. 2012.
[9] V. Vapnik, S. E. Golowich, and A. Smola, “Support vector method for function approximation, regression estimation, and signal processing,” in Advances in Neural Information Processing Systems 9. Cambridge, MA: MIT Press, 1996, pp. 281–287.
[10] H. Drucker, C. J. C. Burges, L. Kaufman, A. J. Smola, and V. Vapnik, “Support vector regression machines,” in Advances in Neural Information Processing Systems, vol. 9. Cambridge, MA: MIT Press, 1997, pp. 155–161.
[11] A. J. Smola and B. Scholkopf, “A tutorial on support vector regression,” Stat. Comput., vol. 14, no. 3, pp. 199–222, 2004.
[12] D. H. Hong and C. Hwang, “Support vector fuzzy regression machines,” Fuzzy Sets Syst., vol. 138, no. 2, pp. 271–281, 2003.
[13] T. B. Trafalis and R. C. Gilbert, “Robust classification and regression using support vector machines,” Eur. J. Oper. Res., vol. 173, no. 3, pp. 893–909, Jul. 2006.
[14] T. B. Trafalis and S. A. Alwazzi, “Support vector regression with noisy data: A second order cone programming approach,” Int. J. General Syst., vol. 36, no. 2, pp. 237–250, Apr. 2007.
[15] P. K. Shivaswamy, C. Bhattacharyya, and A. J. Smola, “Second order cone programming approaches for handling missing and uncertain data,” J. Mach. Learn. Res., vol. 7, pp. 1283–1314, Dec. 2006.
[16] E. Carrizosa, J. E. Gordillo, and F. Plastria, “Support vector regression for imprecise data,” Dept. MOSI, Vrije Univ. Brussel, Brussels, Belgium, Tech. Rep., 2007.


[17] C. Chuang, S. Su, J. Jeng, and C. Hsiao, “Robust support vector regression networks for function approximation with outliers,” IEEE Trans. Neural Netw., vol. 13, no. 6, pp. 1322–1330, Nov. 2002.
[18] G. Camps-Valls, L. Bruzzone, J. Rojo-Álvarez, and F. Melgani, “Robust support vector regression for biophysical variable estimation from remotely sensed images,” IEEE Geosci. Remote Sensing Lett., vol. 3, no. 3, pp. 339–343, Jul. 2006.
[19] C. Bhattacharyya, K. Pannagadatta, and A. Smola, “A second order cone programming formulation for classifying missing data,” in Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2004.
[20] E. Carrizosa, J. E. Gordillo, and F. Plastria, “Kernel support vector regression with imprecise output,” Dept. MOSI, Vrije Univ. Brussel, Brussels, Belgium, Tech. Rep., 2008.
[21] J. Jeng, C. Chuang, and S. Su, “Support vector interval regression networks for interval regression analysis,” Fuzzy Sets Syst., vol. 138, no. 2, pp. 283–300, 2003.
[22] C. Hwang, D. Hong, and K. H. Seok, “Support vector interval regression machine for crisp input and output data,” Fuzzy Sets Syst., vol. 157, no. 8, pp. 1114–1125, 2006.
[23] D. H. Hong and C. Hwang, “Extended fuzzy regression models using regularization method,” Inf. Sci., vol. 164, nos. 1–4, pp. 31–36, Aug. 2004.
[24] C.-F. Lin and S.-D. Wang, “Training algorithms for fuzzy support vector machines with noisy data,” Pattern Recognit. Lett., vol. 25, no. 14, pp. 1647–1656, Oct. 2004.
[25] A. Ben-Tal, S. Bhadra, C. Bhattacharyya, and J. S. Nath, “Chance constrained uncertain classification via robust optimization,” Math. Program., vol. 127, no. 1, pp. 145–173, Mar. 2011.
[26] G. Golub, “Some modified matrix eigenvalue problems,” SIAM Rev., vol. 15, no. 2, pp. 318–344, 1973.
[27] G. Golub and C. V. Loan, “An analysis of the total least squares problem,” SIAM J. Numer. Anal., vol. 17, pp. 883–893, Dec. 1973.
[28] J. Connor, R. Martin, and L. Atlas, “Recurrent neural networks and robust time series prediction,” IEEE Trans. Neural Netw., vol. 5, no. 2, pp. 240–254, Mar. 1994.
[29] D. S. Chen and R. C. Jain, “A robust backpropagation learning algorithm for function approximation,” IEEE Trans. Neural Netw., vol. 5, no. 3, pp. 467–479, May 1994.
[30] E. Hoerl and R. W. Kennard, “Ridge regression: Biased estimation for nonorthogonal problems,” Technometrics, vol. 12, no. 1, pp. 55–67, 1970.
[31] H. Tanaka, S. Uejima, and K. Asai, “Linear regression analysis with fuzzy model,” IEEE Trans. Syst. Man Cybern., vol. 12, no. 6, pp. 903–907, Nov. 1982.
[32] P. Diamond, “Fuzzy least squares,” Inf. Sci., vol. 46, pp. 141–157, Dec. 1988.
[33] A. W. Marshall and I. Olkin, “Multivariate Chebyshev inequality,” Ann. Math. Stat., vol. 31, no. 4, pp. 1001–1014, Dec. 1960.
[34] M. Lobo, L. Vandenberghe, S. Boyd, and H. Lebret, “Applications of second-order cone programming,” Linear Algebra Its Appl., vol. 284, pp. 193–228, Nov. 1998.
[35] S. Mehrotra, “On the implementation of a primal-dual interior point method,” SIAM J. Opt., vol. 2, no. 4, pp. 575–601, 1992.
[36] T. B. Trafalis and R. C. Gilbert, “Robust support vector machines for classification and computational issues,” Optim. Methods Softw., vol. 22, no. 1, pp. 187–198, Feb. 2007.
[37] B. Schölkopf, A. Smola, and K. R. Muller, “Nonlinear component analysis as a kernel eigenvalue problem,” Neural Comput., vol. 10, no. 5, pp. 1299–1319, 1998.
[38] M. S. Lobo, L. Vandenberghe, S. Boyd, and H. Lebret, “Applications of second-order cone programming,” Linear Algebra Its Appl., vol. 284, nos. 1–3, pp. 193–228, 1998.
[39] Y. Nesterov and A. Nemirovskii, Interior-Point Polynomial Algorithms in Convex Programming, vol. 13. Philadelphia, PA: SIAM, 1987.
[40] I. Polik. (2010). SeDuMi [Online]. Available: http://sedumi.ie.lehigh.edu
[41] J. F. Sturm, “Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones,” Optim. Methods Softw., vol. 11, nos. 1–4, pp. 625–653, 1999.
[42] C. Chang and C. Lin, “LIBSVM: A library for support vector machines,” ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, pp. 1–27, 2011.
[43] C. Hsu, C. Chang, and C. Lin. (2003). A Practical Guide to Support Vector Classification [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
[44] A. Frank and A. Asuncion. (2011). UCI Machine Learning Repository [Online]. Available: http://archive.ics.uci.edu/ml
[45] D. Michie, D. J. Spiegelhalter, and C. Taylor, Machine Learning, Neural and Statistical Classification. New York: Ellis Horwood, 1994.

Gao Huang (S’12) was born in 1988. He received the B.S. degree from the School of Automation Science and Electrical Engineering, Beihang University, Beijing, China, in 2009. He is currently pursuing the Ph.D. degree with the Department of Automation, Tsinghua University, Beijing. His current research interests include machine learning and system identification.

Shiji Song was born in 1965. He received the Ph.D. degree from the Department of Mathematics, Harbin Institute of Technology, Harbin, China, in 1996. He is currently a Professor with the Department of Automation, Tsinghua University, Beijing, China. His current research interests include optimization, system identification, machine learning, and supply chains.

Cheng Wu was born in 1940. He received the B.S. and M.S. degrees in electrical engineering from Tsinghua University, Beijing, China. He has been with Tsinghua University since 1967, where he is currently a Professor with the Department of Automation. His current research interests include complex system modeling and optimization, and modeling and scheduling in supply chains. Mr. Wu is a member of the Chinese Academy of Engineering.

Keyou You was born in Jiangxi Province, China, in 1985. He received the B.S. degree in statistical science from Sun Yat-sen University, Guangzhou, China, and the Ph.D. degree in electrical and electronic engineering from Nanyang Technological University, Singapore, in 2007 and 2012, respectively. He was with the ARC Center for Complex Dynamic Systems and Control, University of Newcastle, Newcastle, Australia, as a Visiting Scholar in 2010, and was with the Sensor Network Laboratory, Nanyang Technological University, as a Research Fellow from 2011 to 2012. Since 2012, he has been with the Department of Automation, Tsinghua University, Beijing, China, as an Assistant Professor. His current research interests include control and estimation of networked systems, distributed control and estimation over complex networks, and sensor networks. Dr. You was a recipient of the Guan Zhaozhi Best Paper Award from the 29th Chinese Control Conference, Beijing, in 2010.
