
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 23, NO. 7, JULY 2012

RBF Networks Under the Concurrent Fault Situation

Chi-Sing Leung, Member, IEEE, and John Pui-Fai Sum, Senior Member, IEEE

Abstract— Fault tolerance is an interesting topic in neural networks. However, many existing results on this topic focus only on the situation of a single fault source. In fact, a trained network may be affected by multiple fault sources. This brief studies the performance of faulty radial basis function (RBF) networks that suffer from multiplicative weight noise and open weight fault concurrently. We derive a mean prediction error (MPE) formula to estimate the generalization ability of faulty networks. The MPE formula provides us with a way to understand the generalization ability of faulty networks without using a test set or generating a number of potential faulty networks. Based on the MPE result, we propose methods to optimize the regularization parameter, as well as the RBF width.

Index Terms— Fault tolerance, prediction error, RBF, weight decay.

I. INTRODUCTION

In the implementation of neural networks, network faults take place unavoidably [1]–[8]. For example, the finite precision of the trained weights in the implementation introduces multiplicative weight noise [3]–[5]. Besides, physical faults in the implementation introduce open weight fault [2], [6]–[8]. One of the classical fault-tolerant approaches is to generate a number of faulty networks during training. Injecting a random weight fault [9] during training is a typical example. In this approach, if the number of training epochs is not large enough, the training algorithm cannot capture the statistical behavior of weight faults. Simon and El-Sherief [10] and Zhou et al. [2] formulated the learning problem as an unconstrained optimization problem in which the objective function consists of two terms. The first term is the mean square error (MSE) of the faultless network, and the second term is the sum of MSEs of some potential faulty networks. This approach is computationally complicated when a multi-weight fault situation is considered. In [11] and [12], the effect of input noise on the output of a trained radial basis function (RBF) network was studied. Furthermore, Bernier et al. showed that injecting input noise during training could improve the generalization error of a faultless network. However, the selection guideline on the regularization parameter was not addressed. Apart from the computational issue, most existing results on fault tolerance focus on the training error of faulty networks. Besides, many training methods focus on one kind of weight fault only. For example, in [7] and [9], the algorithms were used to handle the open weight fault only. In [3], the algorithm was used to handle the multiplicative weight noise. Clearly, in real situations, different kinds of weight faults could coexist unavoidably.

There are some results related to the generalization ability of faulty networks. In [1], two individual mean prediction error (MPE) formulae are derived to handle two fault models, namely, multiplicative weight noise and open weight fault. However, the two formulae are individually designed for the two fault models. In a real situation, it is unreasonable to assume that trained networks have either multiplicative weight noise or open weight fault only. Hence, it is interesting to develop a single MPE formula that can concurrently handle the two fault models.

This brief first introduces a concurrent fault model that combines the effects of multiplicative weight noise and open weight fault. Afterwards, we derive an MPE formula to describe the generalization error of a faulty RBF network that is affected by multiplicative weight noise and open weight fault concurrently. With this formula, we are able to understand the generalization ability of faulty RBF networks without using a test set or generating a large number of faulty networks. Lastly, we present some applications of the formula. We discuss the way to use the formula to optimize the weight decay parameter and the RBF width.

Section II presents the data model and the learning model. In Section III, the concurrent fault model is presented. Afterwards, the training error of faulty networks with concurrent fault is investigated. The MPE formula, which estimates the generalization error of faulty networks, is then derived in Section IV. Section V discusses the ways to optimize the regularization parameter and RBF width. Section VI presents our simulation results. Section VII concludes this brief.

Manuscript received February 10, 2012; accepted April 5, 2012. Date of publication May 21, 2012; date of current version June 8, 2012. This work was supported by the City University of Hong Kong under Research Grant 7002760. C.-S. Leung is with the Department of Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong (e-mail: [email protected]). J. P.-F. Sum is with the Institute of Technology Management, National Chung Hsing University, Taichung 402, Taiwan (e-mail: [email protected]). Digital Object Identifier 10.1109/TNNLS.2012.2196054

II. BACKGROUND

To learn a function using the RBF approach, we have a training dataset Dt = {(xi, yi) : xi ∈ ℝ^K, yi ∈ ℝ, i = 1, ..., N}, where xi and yi are the input and output of the ith sample, respectively, and K is the input dimension. The output is generated by an unknown stochastic system, given by

    yi = f(xi) + εi    (1)

where f(·) is a nonlinear function, and the εi's are independent zero-mean Gaussian random variables with variance σε². The unknown system f(·) is approximated by a number of RBFs, given by

    f(x) ≈ f̂(x, w) = Σ_{j=1}^{M} wj φj(x)    (2)

where w = [w1, ..., wM]^T is the weight vector, and φj(x) = exp(−‖x − cj‖²/s) is the jth basis function. The vectors cj are the RBF centers. The parameter s controls the width of the basis functions. In vector-matrix notation, (2) can be rewritten as


    f̂(x, w) = φ^T(x) w    (3)


where φ(x) = [φ1(x), ..., φM(x)]^T. Given a weight vector, the training set error E(Dt) is given by

    E(Dt) = (1/N) Σ_{i=1}^{N} (yi − φ^T(xi) w)².    (4)

Since minimizing the training set error alone cannot attain a network with good generalization ability, we usually use regularization techniques to improve the generalization ability. The objective function of a regularized network is given by

    J = (1/N) Σ_{i=1}^{N} (yi − φ^T(xi) w)² + λ w^T w    (5)

where λ is the regularization parameter. Given a fixed λ, the weight vector w can thus be obtained by the following equation:

    w = (H + λI)^{−1} (1/N) Σ_{i=1}^{N} φ(xi) yi    (6)

where I is the identity matrix, and H = (1/N) Σ_{i=1}^{N} φ(xi) φ^T(xi).

Remark 1: This brief assumes that the standard weight decay is used. Following our approach in Sections III–V, we can extend our results to some other regularization algorithms.

III. CONCURRENT FAULT MODEL

Among different forms of network faults, weight noise and weight fault are the most common fault models [2], [3], [6]–[8], [13]. The multiplicative weight noise [3], [14] results from the finite precision representation in the implementation of trained weights. For the open weight fault [2], [7], some RBF nodes are disconnected from the output layer. When we implement neural networks in VLSI, open weight fault may appear as a result of defects in the silicon, open circuits in metal, and holes in oxides used in transistors. In a real situation, network weights may suffer from multiplicative weight noise and open weight fault concurrently. In this situation, an implemented weight is given by

    w̃_{j,b,β} = (wj + bj wj) βj,  ∀ j = 1, ..., M    (7)

where the multiplicative fault factors bj are used to model the behavior of the multiplicative weight noise. They are identical independent zero-mean random variables with variance σb². The open fault factors βj are used to model the behavior of open weight fault. They are identical independent binary random variables. The jth weight is out of work when βj = 0; otherwise, it operates properly. Let Prob(βj = 0) = p and Prob(βj = 1) = 1 − p. Then we have

    ⟨βj⟩ = ⟨βj²⟩ = 1 − p  and  ⟨βj βj′⟩ = (1 − p)²,  ∀ j ≠ j′.    (8)

In this model, a weight is first affected by multiplicative weight noise. However, when the implemented weights are affected by an open fault, the faulty weights are clamped at zero no matter what the multiplicative noise is. In vector notation, the faulty weight vector is given by

    w̃_{b,β} = (w + b ⊗ w) ⊗ β = w ⊗ β + b ⊗ w ⊗ β    (9)

where w̃_{b,β} = [w̃_{1,b,β}, ..., w̃_{M,b,β}]^T, b = [b1, ..., bM]^T, β = [β1, ..., βM]^T, and ⊗ is the element-wise multiplication operator. Hence the output of a faulty network is given by

    f̂(x, w̃_{b,β}) = φ^T(x) w̃_{b,β}.    (10)

From (7)–(10), the training error of a faulty network is given by

    E(Dt)_{b,β} = (1/N) Σ_{i=1}^{N} (yi − φ^T(xi) w̃_{b,β})²
    = (1/N) Σ_{i=1}^{N} [ yi² − 2 yi Σ_{j=1}^{M} βj wj φj(xi) − 2 yi Σ_{j=1}^{M} bj βj wj φj(xi)
      + Σ_{j=1}^{M} Σ_{j′=1}^{M} βj βj′ wj wj′ φj(xi) φj′(xi)
      + Σ_{j=1}^{M} Σ_{j′=1}^{M} bj bj′ βj βj′ wj wj′ φj(xi) φj′(xi)
      + Σ_{j=1}^{M} Σ_{j′=1}^{M} (bj + bj′) βj βj′ wj wj′ φj(xi) φj′(xi) ].    (11)

Taking the expectation over the b's and β's, we have the training error of faulty networks, namely, the mean training error (MTE), given by

    Ē(Dt)_{b,β} = (1 − p) E(Dt) + (p/N) Σ_{i=1}^{N} yi² + (p² − p) w^T (H − G) w + (1 − p) σb² w^T G w    (12)

where H = (1/N) Σ_{i=1}^{N} φ(xi) φ^T(xi) and G = diag(H). In (12), E(Dt) = (1/N) Σ_{i=1}^{N} (yi − φ^T(xi) w)² is the training error of a faultless network [see (4)].

IV. GENERALIZATION ABILITY OF FAULTY NETWORKS

Equation (12) tells us the performance of faulty networks on the training set. In the neural network community, we are more interested in how well the network performs on unseen samples. This section derives a formula to estimate the generalization ability of faulty networks under the concurrent fault model. The estimate is called the MPE.

From (11), the MTE Ē(Dt)_{b,β} of faulty networks can also be expressed as

    Ē(Dt)_{b,β} = ⟨y²⟩_{Dt} − 2(1 − p) ⟨y φ^T(x) w⟩_{Dt} + (1 − p)² w^T H w + (1 − p)(p + σb²) w^T G w    (13)

where ⟨·⟩ is the expectation operator. Let Df = {(xi′, yi′)}_{i=1}^{N′} be the testing dataset. The MPE of faulty networks is expressed as follows:

    Ē(Df)_{b,β} = ⟨y′²⟩_{Df} − 2(1 − p) ⟨y′ φ^T(x′) w⟩_{Df} + (1 − p)² w^T H′ w + (1 − p)(p + σb²) w^T G′ w    (14)
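Before turning to the MPE, the concurrent fault model can be checked numerically. The following sketch is our own illustrative code, not part of the brief: the synthetic data, the Gaussian choice for the bj's (the brief only requires zero mean and variance σb²), and all names are assumptions. It trains a small Gaussian RBF network with weight decay as in (6), draws faulty weight vectors according to (9), and compares the empirical mean training error with the MTE formula (12).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D training data (illustrative only).
N, M, s, lam = 200, 10, 0.1, 1e-3
x = rng.uniform(-1.0, 1.0, size=(N, 1))
y = np.sin(2.0 * np.pi * x[:, 0]) + 0.1 * rng.normal(size=N)
c = x[rng.choice(N, M, replace=False)]          # RBF centers taken from the inputs

# Design matrix: Phi[i, j] = phi_j(x_i) = exp(-||x_i - c_j||^2 / s).
Phi = np.exp(-((x[:, None, :] - c[None, :, :]) ** 2).sum(-1) / s)

# Weight-decay solution (6): w = (H + lam*I)^{-1} (1/N) sum_i phi(x_i) y_i.
H = Phi.T @ Phi / N
G = np.diag(np.diag(H))
w = np.linalg.solve(H + lam * np.eye(M), Phi.T @ y / N)

# MTE formula (12).
p, sb2 = 0.05, 0.01                              # open-fault rate, weight-noise variance
E_Dt = np.mean((y - Phi @ w) ** 2)
mte = ((1 - p) * E_Dt + p * np.mean(y ** 2)
       + (p ** 2 - p) * w @ (H - G) @ w
       + (1 - p) * sb2 * w @ G @ w)

# Monte Carlo over faulty weights (9): w_tilde = (w + b*w) * beta.
T = 20000
b = rng.normal(scale=np.sqrt(sb2), size=(T, M))  # multiplicative noise (Gaussian assumed)
beta = (rng.random((T, M)) > p).astype(float)    # open-fault mask, Prob(beta_j = 0) = p
w_tilde = (w + b * w) * beta
emp = np.mean((y[None, :] - w_tilde @ Phi.T) ** 2)
print(f"MTE formula: {mte:.5f}, Monte Carlo: {emp:.5f}")
```

With p = 0 and σb² = 0, the expression for `mte` reduces to `E_Dt`, the faultless training error of (4), as expected.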


where H′ = (1/N′) Σ_{i=1}^{N′} φ(xi′) φ^T(xi′) and G′ = diag(H′). Denote the true weight vector as wo. Hence

    yi = φ^T(xi) wo + εi  and  yi′ = φ^T(xi′) wo + εi′    (15)

where the εi's and εi′'s are independent zero-mean Gaussian random variables with variance σε². Since w is obtained entirely from Dt, the second term in (14) can be expressed as

    −2(1 − p) ⟨y′ φ^T(x′) w⟩_{Df} = −2(1 − p) ( (1/N′) Σ_{i=1}^{N′} yi′ φ^T(xi′) ) w.    (16)

From (6) and (15), (16) is then given by

    −2(1 − p) ⟨y′ φ^T(x′) w⟩_{Df} = −2(1 − p) ( (1/N′) Σ_{i=1}^{N′} yi′ φ^T(xi′) ) (H + λI)^{−1} (1/N) Σ_{i=1}^{N} yi φ(xi).    (17)

As the εi's and εi′'s are independent, from (15), the terms (1/N′) Σ_{i=1}^{N′} yi′ φ^T(xi′) and (1/N) Σ_{i=1}^{N} yi φ(xi) in (17) can be simplified to

    (1/N′) Σ_{i=1}^{N′} yi′ φ^T(xi′) = wo^T H′,  (1/N) Σ_{i=1}^{N} yi φ(xi) = H wo.    (18)

From (17)–(18), the second term in (14) becomes

    −2(1 − p) wo^T H′ (H + λI)^{−1} H wo.    (19)

Using a similar method, the second term in (13) becomes

    −2(1 − p) [ wo^T H (H + λI)^{−1} H wo + (σε²/N) Tr{(H + λI)^{−1} H} ]    (20)

where Tr{·} denotes the trace operation. Following the common practice, for large N and N′, we can assume that H′ ≈ H, G′ ≈ G, and ⟨y′²⟩_{Df} ≈ ⟨y²⟩_{Dt}. The difference between the MTE and the MPE is then given by

    Ē(Df)_{b,β} − Ē(Dt)_{b,β} = 2(1 − p) (σε²/N) Tr{(H + λI)^{−1} H}.    (21)

From (12) and (21), the MPE is given by

    Ē(Df)_{b,β} = (1 − p) E(Dt) + (p/N) Σ_{i=1}^{N} yi² + (p² − p) w^T H w + (1 − p)(p + σb²) w^T G w + 2(1 − p) (σε²/N) Tr{(H + λI)^{−1} H}.    (22)

In (22), the term E(Dt) is the training error of the trained faultless network, which can be obtained after training. The matrices H and G can be obtained from the training set. The fault statistics parameters p and σb² are assumed to be known. The only unknown in (22) is the variance σε² of the measurement noise. The variance can be estimated by Fedorov's method [15], [16], given by

    σε² ≈ (1/(N − M)) Σ_{i=1}^{N} ( yi − φ^T(xi) H^{−1} (1/N) Σ_{i′=1}^{N} φ(xi′) yi′ )².    (23)

With (22), we can estimate the test error of faulty networks based on the training error of a faultless network, the trained weights, and the training set.

V. APPLICATIONS OF THE MPE FORMULA

So far, we have an MPE formula to estimate the generalization ability of a faulty network. The formula helps us to understand the performance of RBF networks under the concurrent fault situation. Another application of the MPE formula is to select an appropriate setting for RBF networks. Here we demonstrate the way to use the MPE formula to select a suitable value of λ such that the generalization error is minimized. To minimize the MPE value, we need to estimate the gradient of the MPE with respect to λ.

A. Gradient of MPE With Respect to the Regularization Parameter λ

In (22), the weight vector w is obtained by a regularization method [see (6)]. Thus, it is not feasible to directly get ∂MPE/∂λ = ∂Ē(Df)_{b,β}/∂λ based on (22). Recall from (14) that the MPE is expressed as

    Ē(Df)_{b,β} = ⟨y′²⟩_{Df} − 2(1 − p) ⟨y′ φ^T(x′) w⟩_{Df} + (1 − p)² w^T H′ w + (1 − p)(p + σb²) w^T G′ w.    (24)

Define

    J1 = ⟨y′²⟩_{Df},  J2 = −2(1 − p) ⟨y′ φ^T(x′) w⟩_{Df},  J3 = (1 − p)² w^T H′ w,  J4 = (1 − p)(p + σb²) w^T G′ w.    (25)

Now, (24) is given by

    Ē(Df)_{b,β} = J1 + J2 + J3 + J4.    (26)

Since J1 is not a function of λ, we have

    ∂J1/∂λ = 0.    (27)

From (19)

    J2 = −2(1 − p) wo^T H′ (H + λI)^{−1} H wo.    (28)

Similar to Section IV, we assume that H′ ≈ H and G′ ≈ G. Hence, we have

    J2 = −2(1 − p) wo^T H (H + λI)^{−1} H wo.    (29)
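The MPE value (22), which the search procedure of this section repeatedly evaluates, can be computed directly from the training set. The sketch below is our own illustration (the function name and the synthetic data are assumptions); the arithmetic follows (22), with σε² obtained from Fedorov's estimate (23). Note that (23) is the residual variance of the unregularized least-squares solution.

```python
import numpy as np

def mpe_estimate(Phi, y, lam, p, sb2):
    """Sketch of the MPE formula (22); sigma_eps^2 is estimated via (23)."""
    N, M = Phi.shape
    H = Phi.T @ Phi / N
    G = np.diag(np.diag(H))
    # Weight-decay solution (6).
    w = np.linalg.solve(H + lam * np.eye(M), Phi.T @ y / N)
    E_Dt = np.mean((y - Phi @ w) ** 2)           # faultless training error (4)
    # Fedorov's estimate (23): residuals of the unregularized solution.
    w_ls = np.linalg.solve(H, Phi.T @ y / N)
    sig2 = np.sum((y - Phi @ w_ls) ** 2) / (N - M)
    tr = np.trace(np.linalg.solve(H + lam * np.eye(M), H))
    return ((1 - p) * E_Dt + p * np.mean(y ** 2)
            + (p ** 2 - p) * w @ H @ w
            + (1 - p) * (p + sb2) * w @ G @ w
            + 2 * (1 - p) * sig2 / N * tr)

# Illustrative usage on synthetic data.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 10))
y = Phi @ rng.normal(size=10) + 0.1 * rng.normal(size=100)
print(mpe_estimate(Phi, y, lam=1e-3, p=0.05, sb2=0.01))
```

As (22) promises, `mpe_estimate` touches only the training set: no test set is needed and no faulty networks are generated.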

As H is positive definite or positive semidefinite, it can be diagonalized as

    H = B D B^T    (30)

where B B^T = I and D is a diagonal matrix whose elements are denoted as {d1, ..., dM}. Define

    w̌o = B^T wo.    (31)


With the diagonalization, we have

    ∂J2/∂λ = 2(1 − p) Σ_{j=1}^{M} w̌oj² dj² / (dj + λ)².    (32)

For J3, we have

    ∂J3/∂λ = −2(1 − p)² Σ_{j=1}^{M} ( w̌oj² dj³ + (σε²/N) dj² ) / (dj + λ)³.    (33)

For J4 , from (6) and (15), we have

& J4 = −(1 − p) p + σb2 w oT H (H + λI)−1 G (H + λI)−1 Hw o

' σ2 +  Tr H (H + λI)−1 G (H + λI)−1 . (34) N

In the above, when we diagonalize H, we will introduce a nondiagonal matrix B T GB. Hence, we need an approximation. If the RBF centers are distributed according to the input patterns, ¯ where  we can approximate G by G ≈ gI, g¯ = (1/M) M j =1 g j j . With this approximation σ2

 wˇ d g¯ + d j g¯ ∂J4 oj j N . = −2(1 − p) p + σb2 ∂λ (d j + λ)3 M

2

2

(35)

j =1

To sum up, the gradient of the MPE is given by

    ∂Ē(Df)_{b,β}/∂λ = 2(1 − p) Σ_{j=1}^{M} [ w̌oj² dj² / (dj + λ)² − (1 − p)( w̌oj² dj³ + (σε²/N) dj² ) / (dj + λ)³ − (p + σb²)( w̌oj² dj² ḡ + (σε²/N) dj ḡ ) / (dj + λ)³ ].    (36)
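As a sketch (our own code; the names, the synthetic H, and the numerical values are assumptions), the gradient (36) can be evaluated through the eigendecomposition (30)–(31), with G approximated by ḡI as above, followed by one gradient step on λ as in the search of Section V-B:

```python
import numpy as np

def mpe_grad_lambda(H, wo, lam, p, sb2, sig2, N):
    """Gradient (36) of the MPE w.r.t. lambda via H = B D B^T, (30)-(31)."""
    d, B = np.linalg.eigh(H)                  # eigenvalues d_j, eigenvectors B
    wc = B.T @ wo                             # w_check_o = B^T w_o, as in (31)
    gbar = np.mean(np.diag(H))                # G ~ gbar * I approximation
    t1 = wc ** 2 * d ** 2 / (d + lam) ** 2
    t2 = (1 - p) * (wc ** 2 * d ** 3 + sig2 / N * d ** 2) / (d + lam) ** 3
    t3 = (p + sb2) * (wc ** 2 * d ** 2 * gbar + sig2 / N * d * gbar) / (d + lam) ** 3
    return 2 * (1 - p) * np.sum(t1 - t2 - t3)

# One gradient step on lambda, as in (37); in the full bold-driver search the
# step is accepted or rejected by re-evaluating the MPE (22), with eta scaled
# up on acceptance and down on rejection.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 8))
H = A.T @ A / 50                              # PSD stand-in for (1/N) sum phi phi^T
wo = rng.normal(size=8)
lam, eta = 0.1, 0.01
lam_new = lam - eta * mpe_grad_lambda(H, wo, lam, p=0.05, sb2=0.01, sig2=0.01, N=50)
print(lam, "->", lam_new)
```

In the actual search, wo is unknown, so the current estimated weight vector w is substituted for it, as described in the next subsection. As a sanity check, with p = σb² = σε² = 0 and λ > 0, the summand of (36) reduces to 2λ w̌oj² dj²/(dj + λ)³, so the gradient is positive.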

B. Searching the Regularization Parameter and RBF Width

Based on (36), we can develop a search method for λ based on the bold-driver technique [17]. The update of λ is given by

    λ_new = λ − η ∂Ē(Df)_{b,β}/∂λ    (37)

where η is an adjustable learning rate. Since we do not know wo, as well as w̌o, in advance, during the search we set the current estimated weight vector w as wo. If an update on λ, based on (37), produces a network whose MPE, based on (22), is greater than the previous MPE, the update on λ is rejected and η is multiplied by a factor ε less than 1. If an update on λ decreases the MPE value, the update on λ is accepted and η is multiplied by a factor γ greater than 1. In our experiments, the initial value of λ is 0.1, the initial value of η is 0.01, γ is 1.04, and ε is 0.7.

We can also optimize the RBF width s. In this situation, the updating rules for the regularization parameter λ and the RBF width s are given by

    λ_new = λ_old − η ∂MPE/∂λ,  s_new = s_old − η ∂MPE/∂s.    (38)

If an update on λ and s, based on (38), produces a new MPE value greater than the previous MPE value, the changes on λ and s are rejected and η is multiplied by ε. If an update on λ and s decreases the MPE value, the changes are accepted and η is multiplied by γ. In (38), there is no simple analytical way to estimate the gradient ∂MPE/∂s. So, we use the following approximation:

    ∂MPE/∂s ≈ ( MPE(s + Δs) − MPE(s) ) / Δs.    (39)

In the above approximation, given the current RBF width s, we use the MPE formula to calculate the current MPE value MPE(s). Afterwards, we perturb the RBF width to s + Δs and then get a new RBF network. We use the MPE formula to calculate the new MPE value MPE(s + Δs).

VI. SIMULATION

A. Datasets and Network Setting

To verify our theoretical results, we consider two datasets: 1) a nonlinear autoregressive (NAR) time series [18] and 2) the Abalone dataset [19]. The NAR time series used is generated by

    y(t) = [0.8 − 0.5 exp(−y²(t − 1))] y(t − 1) − [0.3 + 0.9 exp(−y²(t − 1))] y(t − 2) + 0.1 sin(π y(t − 1)) + ε(t)    (40)

where ε(t) is a zero-mean Gaussian random variable with variance σε² = 0.01 that drives the series. One thousand samples were generated given y(0) = y(−1) = 0. The first 500 data points were used as the training set, and the other 500 were used as the test set. Our RBF model is used to predict y(t) based on the past observations y(t − 1) and y(t − 2). Thus, we have x(t) = [y(t − 1), y(t − 2)]^T. Fifty RBF centers are randomly selected from the training inputs, and the RBF width s is set to 0.1.

The Abalone dataset consists of the values of nine physical measurements, namely, sex, length, diameter, height, whole weight, shucked weight, viscera weight, shell weight, and age. In our simulation, the second to eighth measurements are taken as the input, and the ninth measurement is taken as the output. The dataset consists of 4177 samples. The first 2000 samples were used as the training set, and the rest were used as the test set. Fifty RBF centers are randomly selected from the training inputs, and the RBF width s is set to 0.1.

B. Effectiveness of the MPE Formula

We use the NAR time-series data to examine the effectiveness of our MPE formula under nine fault levels: σb² ∈ {0.005, 0.01, 0.05} and p ∈ {0.005, 0.01, 0.05}. For each fault level, we train a number of networks with different values of λ. Afterwards, for each trained network, we randomly generate 10 000 faulty networks according to the fault level. We then measure the test error of the faulty networks. To demonstrate the effectiveness of our proposed search algorithm for the regularization parameter, we use our search method mentioned

Fig. 1. MPE and test error versus the regularization parameter λ (mean square error against λ) for the NAR example, at fault levels (σb², p) = (0.005, 0.005), (0.01, 0.01), and (0.05, 0.05). The vertical solid lines indicate the optimized λ based on our method.

TABLE I
EFFICIENCY OF OUR MPE-BASED METHOD FOR THE NAR EXAMPLE. AS A COMPARISON, WE ALSO PRESENT THE RESULT OF THE TEST SET METHOD
(σb² = weight noise variance; p = open fault rate)

Fault level      | Our concurrent fault method          | Test set method with faulty networks
(σb², p)         | λ          MPE value   Test error    | λ          Test error
(0.005, 0.005)   | 0.0004534  0.02711     0.02655       | 0.0003467  0.02648
(0.005, 0.01)    | 0.0006388  0.02869     0.02827       | 0.0005011  0.02820
(0.005, 0.05)    | 0.0017543  0.03899     0.03894       | 0.0015135  0.03891
(0.01, 0.005)    | 0.0006520  0.02864     0.02819       | 0.0005011  0.02813
(0.01, 0.01)     | 0.0008282  0.03004     0.02970       | 0.0006606  0.02964
(0.01, 0.05)     | 0.0018976  0.03993     0.03988       | 0.0016595  0.03985
(0.05, 0.005)    | 0.0021494  0.03746     0.03730       | 0.0018197  0.03726
(0.05, 0.01)     | 0.0022703  0.03847     0.03832       | 0.0019952  0.03829
(0.05, 0.05)     | 0.0030162  0.04682     0.04676       | 0.0027542  0.04674

in Section V to optimize the MPE value. The results are summarized in Fig. 1 and Table I. Fig. 1 shows the plots of the test error versus the regularization parameter for three fault levels. We observe that, for each fault level, the shape of the MPE curve is quite similar to that of the test error curve. The two curves have similar minimum points in the same range of regularization parameters. That means that: 1) our MPE formula is a good approximation of the test error under the concurrent fault situation and 2) minimizing the MPE value with respect to λ is an effective way of minimizing the test error. In Fig. 1, the vertical lines indicate our searched results. Table I summarizes the comparison between our MPE search method and the test set method. We observe that our MPE search method can accurately locate the appropriate value of the regularization parameter. For instance, for the fault level {σb² = 0.05, p = 0.05}, based on the test set method, the optimal regularization parameter is equal to 0.002754 and the corresponding test error is 0.04674. With our MPE search method, the regularization parameter is equal to 0.003016 and the corresponding test error is around 0.04676. Clearly, both methods achieve nearly the same test error. For the other fault levels, we observe the same phenomenon.

C. Applications of the MPE Formula: Optimizing the Tuning Parameters

This section studies the effectiveness of using the MPE formula to optimize the tuning parameters for the concurrent fault

situation. We also show the performances of three other methods: the two methods in [1] and Zhou's method [2]. In [1], one of the two methods is designed for handling weight noise and is called the weight noise method. The other is designed for handling weight fault and is called the weight fault method. Tables II and III summarize the performances of the different methods. From the tables, in general, our concurrent fault methods show better performance than the other three methods. This is because our MPE technique can help the weight decay method to select a suitable weight decay parameter for the concurrent fault situation. The weight noise method [1] is designed for handling weight noise. When the weight noise is large, such as σb² = 0.05, its performance is comparable to that of the concurrent fault method with optimized λ. However, the weight noise method has a poor performance when the open fault rate is large. For instance, in the Abalone dataset (Table III), when (σb², p) = (0.01, 0.05), the test set error of the weight noise method is 9.861. When our concurrent fault method with optimized λ is used, the test set error is reduced to 8.817. By optimizing λ and s simultaneously, we can further reduce the test set error to 8.171. The weight fault method [1] and Zhou's method are designed for handling weight fault. For a high fault rate, the performance of the weight fault method is better than that of Zhou's method. When the open fault rate is large, such as p = 0.05, the performance of the weight fault method is comparable to that of our concurrent fault method.


TABLE II
TEST SET ERRORS OF FAULTY NETWORKS FOR THE NAR EXAMPLE
(σb² = weight noise variance; p = open fault rate)

Fault level      | Zhou's      | Methods in [1]              | Concurrent fault method
(σb², p)         | method [2]  | Weight noise   Weight fault | Opt. λ (s = 0.1)   Opt. λ and s
(0.005, 0.005)   | 0.0271      | 0.0266         0.0266       | 0.0265             0.0168
(0.005, 0.01)    | 0.0286      | 0.0289         0.0282       | 0.0282             0.0173
(0.005, 0.05)    | 0.0404      | 0.0471         0.0390       | 0.0389             0.0264
(0.01, 0.005)    | 0.0297      | 0.0281         0.0289       | 0.0281             0.0174
(0.01, 0.01)     | 0.0307      | 0.0298         0.0299       | 0.0297             0.0185
(0.01, 0.05)     | 0.0418      | 0.0438         0.0398       | 0.0398             0.0275
(0.05, 0.005)    | 0.0508      | 0.0372         0.0467       | 0.0373             0.0245
(0.05, 0.01)     | 0.0470      | 0.0383         0.0432       | 0.0383             0.0252
(0.05, 0.05)     | 0.0531      | 0.0470         0.0475       | 0.0467             0.0320

TABLE III
TEST SET ERRORS OF FAULTY NETWORKS FOR THE ABALONE DATASET
(σb² = weight noise variance; p = open fault rate)

Fault level      | Zhou's      | Methods in [1]              | Concurrent fault method
(σb², p)         | method [2]  | Weight noise   Weight fault | Opt. λ (s = 0.1)   Opt. λ and s
(0.005, 0.005)   | 6.918       | 7.056          7.026        | 6.881              6.294
(0.005, 0.01)    | 7.210       | 7.491          7.244        | 7.192              6.829
(0.005, 0.05)    | 8.885       | 10.99          8.762        | 8.714              8.061
(0.01, 0.005)    | 7.439       | 7.239          7.433        | 7.190              6.682
(0.01, 0.01)     | 7.528       | 7.521          7.524        | 7.439              6.941
(0.01, 0.05)     | 9.110       | 9.861          8.885        | 8.817              8.171
(0.05, 0.005)    | 11.60       | 8.512          10.68        | 8.470              7.657
(0.05, 0.01)     | 10.55       | 8.645          9.757        | 8.583              7.778
(0.05, 0.05)     | 10.90       | 9.839          9.866        | 9.529              8.988

However, the performances of these two methods are very poor when the weight noise is large. In the Abalone dataset (Table III), when (σb², p) = (0.05, 0.01), the test set error of Zhou's method is 10.55; when the weight fault method is used, the error is reduced to 9.757. When our concurrent fault method with optimized λ is used, the test set error is reduced to 8.583. By optimizing λ and s simultaneously, we can further reduce the test set error to 7.778.

D. Comparison to Support Vector Regression (SVR)

As shown previously, the weight decay method with our MPE technique can effectively handle the concurrent fault situation. Apart from the weight decay method, the SVR method [20], [21] is another popular training algorithm for RBF networks. The trained support vectors (SVs) are the RBF centers. The trained Lagrange multipliers (the α's) are used for constructing the RBF weights. In the experiments, the RBF width s is set to 0.1. Here, we use the Abalone dataset to study the performance difference between the weight decay method with our MPE technique and the ε-SVR method. In the ε-SVR method, there are two tuning parameters that could affect the performance of the constructed SVR networks. One of them is the C-parameter, which limits the magnitudes of the trained

Lagrange multipliers; the other is the ε-parameter, which defines the approximation tolerance level. To the best of our knowledge, there are no systematic ways to set these two parameters for fault tolerance. We therefore set them by trial and error. After many trial runs, we found the meaningful ranges for these two parameters and then tried several values within those ranges. Unlike the weight decay method, training an SVR is quite time-consuming and memory-consuming.¹ Hence we cannot try too many values within the meaningful ranges. The meaningful ranges of the ε-parameter and the C-parameter are 3–6 and 0.5–5, respectively. For the ε-parameter, we tried four values: {3, 4, 5, 6}. For the C-parameter, we tried four values: {0.5, 1, 2, 5}. As a result, there are 16 SVR networks. The results are summarized in Table IV. The table shows the performances of the SVR networks under different criteria for selecting these two parameters.

¹For the Abalone dataset, the training set size is 2000. Hence we need to solve a constrained optimization problem with 4000 variables. The matrix size is 4000 × 4000. From our experience, the training time is very long for large C and small ε. On the other hand, for the weight decay method, to construct an RBF network, we only need to perform a matrix inverse with dimension 50 × 50.

In the table, the second column shows the performances of SVR networks when the number of SVs is the selection


TABLE IV
TEST SET ERRORS OF THE SVR METHOD FOR THE ABALONE DATASET
(σb² = weight noise variance; p = open fault rate; columns correspond to different rules for selecting the SVR parameters)

Fault level      | Min. number of SVs     | Min. error of faultless network | Min. error of faulty networks at different fault levels
(σb², p)         | (C = 5, ε = 6, 54 SVs) | (C = 5, ε = 3, 292 SVs)         |
(0, 0)           | 8.4515                 | 4.7814                          | 4.7814 (C = 5.0, ε = 3, 292 SVs)
(0.005, 0.005)   | 9.2465                 | 18.916                          | 6.1135 (C = 1.0, ε = 3, 314 SVs)
(0.005, 0.01)    | 9.5684                 | 25.972                          | 6.3164 (C = 1.0, ε = 3, 314 SVs)
(0.005, 0.05)    | 12.1306                | 80.108                          | 7.3635 (C = 0.5, ε = 3, 327 SVs)
(0.01, 0.005)    | 9.7492                 | 26.086                          | 6.3216 (C = 1.0, ε = 3, 314 SVs)
(0.01, 0.01)     | 10.0585                | 33.106                          | 6.5234 (C = 1.0, ε = 3, 314 SVs)
(0.01, 0.05)     | 12.6010                | 86.863                          | 7.4135 (C = 0.5, ε = 3, 327 SVs)
(0.05, 0.005)    | 13.6903                | 83.444                          | 7.0905 (C = 0.5, ε = 3, 327 SVs)
(0.05, 0.01)     | 13.9798                | 90.176                          | 7.1527 (C = 0.5, ε = 3, 327 SVs)
(0.05, 0.05)     | 16.3638                | 141.63                          | 7.8129 (C = 0.5, ε = 3, 327 SVs)

criterion. Although using this criterion can reduce the network size, the performances of the SVR method are poorer than those of our methods at all fault levels (see the fifth and sixth columns of Table III and the second column of Table IV). In Table IV, the third column shows the performances of SVR networks when the test set error of the faultless case is the selection criterion. Although using this criterion can reduce the test set error in the faultless case, the network size is large and the performances of faulty networks are extremely poor (see the fifth and sixth columns of Table III and the third column of Table IV). In Table IV, the fourth column shows the performances of SVR networks when the test set error of faulty networks is the selection criterion. When we use this criterion, for each trained SVR network we need to feed the test set to the generated faulty networks. Although using this criterion can reduce the test set error at all fault levels (see the fifth and sixth columns of Table III, and the fourth column of Table IV), the SVR network sizes are extremely large. The number of RBF nodes is more than 300. (Note that in the weight decay method with our MPE approach, we use 50 RBF centers only.) From our experiments, it is possible to use the SVR method to produce RBF networks with good fault tolerance. However, there are two main drawbacks in the SVR method. First, the SVR network size is much larger. Second, there are no theoretical ways to select the two parameters in the SVR method for fault tolerance. It should be noticed that the trial-and-error method is not efficient because training an SVR network is quite resource-demanding for a large dataset. Also, for some values of the two parameters the training time is very long.

VII. CONCLUSION

We have presented the error analysis of faulty RBF networks under the concurrent fault situation, where multiplicative weight noise and open weight fault appear concurrently.
An MPE formula and its gradient expression were derived. Our formula is a function of the training error of the faultless network and the trained weights. We also derived a search method for the regularization parameter from

the gradient expression. With our MPE formula, we could predict the generalization ability of faulty RBF networks without generating a number of faulty networks or using a test set. Simulation results showed that our formula could precisely predict the test error of faulty networks. Besides, our search method could help us to select an appropriate value of the regularization parameter for minimizing the test error of faulty networks. We also demonstrated the way to optimize the regularization parameter and RBF width simultaneously. This brief assumed that standard weight decay training is used. One of the future directions is to extend our MPE analysis to other learning algorithms and other neural network models. For example, it is interesting to investigate the performance of SVR networks under the concurrent fault situation.

REFERENCES

[1] C. S. Leung, H. J. Wang, and J. Sum, "On the selection of weight decay parameter for faulty networks," IEEE Trans. Neural Netw., vol. 21, no. 8, pp. 1232–1244, Aug. 2010.
[2] Z. H. Zhou and S. F. Chen, "Evolving fault-tolerant neural networks," Neural Comput. Appl., vol. 11, nos. 3–4, pp. 156–160, Jun. 2003.
[3] J. Burr, "Digital neural network implementations," in Neural Networks, Concepts, Applications, and Implementations. Englewood Cliffs, NJ: Prentice Hall, 1995, pp. 237–285.
[4] J. L. Bernier, J. Ortega, I. Rojas, E. Ros, and A. Prieto, "Obtaining fault tolerant multilayer perceptrons using an explicit regularization," Neural Process. Lett., vol. 12, no. 2, pp. 107–113, Oct. 2000.
[5] M. Stevenson, R. Winter, and B. Widrow, "Sensitivity of feedforward neural networks to weight errors," IEEE Trans. Neural Netw., vol. 1, no. 1, pp. 71–80, Jan. 1990.
[6] G. Bolt, "Fault tolerant multilayer perceptron networks," Ph.D. dissertation, Dept. Comput. Sci., Univ. York, York, U.K., Jul. 1992.
[7] C.-S. Leung and J. Sum, "A fault-tolerant regularizer for RBF networks," IEEE Trans. Neural Netw., vol. 19, no. 3, pp. 493–507, May 2008.
[8] D. S. Phatak and I. Koren, “Complete and partial fault tolerance of feedforward neural nets,” IEEE Trans. Neural Netw., vol. 6, no. 2, pp. 446–456, Mar. 1995.
[9] C. T. Chiu, K. Mehrotra, C. K. Mohan, and S. Ranka, “Modifying training algorithms for improved fault tolerance,” in Proc. Int. Conf. Neural Netw., Orlando, FL, Jun. 1994, pp. 333–338.
[10] D. Simon and H. El-Sherief, “Fault-tolerant training for optimal interpolative nets,” IEEE Trans. Neural Netw., vol. 6, no. 6, pp. 1531–1535, Nov. 1995.


[11] J. L. Bernier, J. González, A. Cañas, and J. Ortega, “Assessing the noise immunity of radial basis function neural networks,” in Bio-Inspired Applications of Connectionism II. New York: Springer-Verlag, 2001, pp. 136–143.
[12] J. L. Bernier, A. Diaz, F. Fernandez, A. Cañas, J. González, P. Martin-Smith, and J. Ortega, “Assessing the noise immunity and generalization of radial basis function networks,” Neural Process. Lett., vol. 18, no. 1, pp. 35–48, Aug. 2003.
[13] P. Chandra and Y. Singh, “Fault tolerance of feedforward artificial neural networks–A framework of study,” in Proc. Int. Joint Conf. Neural Netw., Portland, OR, Jul. 2003, pp. 489–494.
[14] J. L. Bernier, J. Ortega, I. Rojas, and A. Prieto, “Improving the tolerance of multilayer perceptrons by minimizing the statistical sensitivity to weight deviations,” Neurocomputing, vol. 31, pp. 87–103, Jan. 2000.
[15] V. Fedorov, Theory of Optimal Experiments. New York: Academic, 1972.
[16] J. E. Moody, “The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems,” in Proc. Adv. Neural Inf. Process. Syst., 1992, pp. 847–854.
[17] R. Battiti, “Accelerated backpropagation learning: Two optimization methods,” Complex Syst., vol. 3, pp. 331–342, Aug. 1989.
[18] S. Chen, “Local regularization assisted orthogonal least squares regression,” Neurocomputing, vol. 69, nos. 4–6, pp. 559–585, 2006.
[19] M. Sugiyama and H. Ogawa, “Optimal design of regularization term and regularization parameter by subspace information criterion,” Neural Netw., vol. 15, no. 3, pp. 349–361, Apr. 2002.
[20] V. N. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.
[21] A. J. Smola and B. Schölkopf, “A tutorial on support vector regression,” Statist. Comput., vol. 14, pp. 199–222, Aug. 2004.
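The conclusion above contrasts the MPE formula with the brute-force alternative of generating a number of potential faulty networks. As a hypothetical illustration of that sampling baseline (not the authors' method; the toy problem, function names, and parameter values are all assumptions), a weight-decay-trained RBF network can be evaluated under concurrent multiplicative weight noise and open weight fault as follows:

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_design(X, centers, width):
    # Gaussian RBF hidden-layer outputs for inputs X.
    d2 = (X[:, None, :] - centers[None, :, :]) ** 2
    return np.exp(-d2.sum(axis=2) / (2.0 * width ** 2))

def mc_fault_mse(w, H, y, sigma_b2, p_open, n_trials=2000):
    """Monte Carlo estimate of the mean error of faulty networks under
    concurrent multiplicative weight noise (variance sigma_b2) and
    open weight fault (fault rate p_open)."""
    errs = []
    for _ in range(n_trials):
        noise = 1.0 + np.sqrt(sigma_b2) * rng.standard_normal(w.shape)
        mask = rng.random(w.shape) >= p_open  # surviving weights
        w_f = w * noise * mask
        errs.append(np.mean((H @ w_f - y) ** 2))
    return float(np.mean(errs))

# Toy problem: fit a sine curve with weight-decay (ridge) training.
X = np.linspace(-1, 1, 40)[:, None]
y = np.sin(np.pi * X[:, 0])
centers = np.linspace(-1, 1, 10)[:, None]
H = rbf_design(X, centers, width=0.3)
lam = 1e-3  # regularization parameter of the weight-decay objective
w = np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y)

mse_faulty = mc_fault_mse(w, H, y, sigma_b2=0.01, p_open=0.05)
```

Tuning the regularization parameter this way requires rerunning the sampling loop for every candidate value of `lam`, which is exactly the cost that a closed-form MPE expression avoids.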

Neural Network-Based Distributed Attitude Coordination Control for Spacecraft Formation Flying With Input Saturation

An-Min Zou and Krishna Dev Kumar, Senior Member, IEEE

Abstract— This brief considers the attitude coordination control problem for spacecraft formation flying when only a subset of the group members has access to the common reference attitude. A quaternion-based distributed attitude coordination control scheme is proposed with consideration of the input saturation and with the aid of the sliding-mode observer, separation principle theorem, Chebyshev neural networks, smooth projection algorithm, and robust control technique. Using graph theory and a Lyapunov-based approach, it is shown that the distributed controller can guarantee that the attitudes of all spacecraft converge to a common time-varying reference attitude even when the reference attitude is available only to a portion of the group of spacecraft. Numerical simulations are presented to demonstrate the performance of the proposed distributed controller.

Index Terms— Attitude coordination control, Chebyshev neural networks, control input saturation, quaternion, spacecraft formation flying.

Manuscript received August 2, 2011; revised April 12, 2012; accepted April 20, 2012. Date of publication May 22, 2012; date of current version June 8, 2012. The authors are with the Department of Aerospace Engineering, Ryerson University, Toronto, ON M5B 2K3, Canada (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2012.2196710


I. INTRODUCTION

Attitude coordination control for spacecraft formation flying (SFF) has received significant attention in recent years. In general, several parameterizations exist to represent the orientation angles: the three-parameter representations (e.g., the Euler angles, Gibbs vector, Cayley–Rodrigues parameters, and modified Rodrigues parameters) and the four-parameter representations (e.g., unit quaternion). Using the modified Rodrigues parameters for attitude representation, the problem of attitude coordination has been studied in [1]–[7]. However, every three-parameter representation exhibits a singularity, i.e., the Jacobian matrix in the spacecraft kinematics becomes singular for some orientations, whereas the four-parameter representations (e.g., unit quaternion) are well known to represent the orientation angles globally, without singularities. Using the unit quaternion for attitude representation, two distributed formation control strategies for maintaining attitude alignment among a group of spacecraft were proposed in [8]. In [9], a decentralized variable structure controller was presented for attitude coordination control of multiple spacecraft in the presence of model uncertainties, external disturbances, and intercommunication time delays. Based on the state-dependent Riccati equation technique, a decentralized attitude coordination control algorithm for SFF was proposed in [10]. In these works [8]–[10], the common reference attitude was assumed to be constant. The problem of quaternion-based attitude synchronization of a group of spacecraft to a common time-varying reference attitude was studied in [11]–[14]. However, the common time-varying reference attitude was assumed to be available to every agent in the group.
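The singularity-free property of the unit quaternion noted above can be seen directly in its kinematic equation, which is defined for every attitude. A minimal propagation sketch (illustrative only; the Euler integration scheme, function names, and spin rate are assumptions, not taken from this brief):

```python
import numpy as np

def quat_kinematics(q, omega):
    """Time derivative of a unit quaternion q = [q0, q1, q2, q3]
    (scalar first) under body angular velocity omega (rad/s):
    q_dot = 0.5 * q (x) [0, omega]."""
    q0, qv = q[0], q[1:]
    dq0 = -0.5 * np.dot(qv, omega)
    dqv = 0.5 * (q0 * omega + np.cross(qv, omega))
    return np.concatenate(([dq0], dqv))

def propagate(q, omega, dt, steps):
    # Simple Euler integration with re-normalization; the map is
    # globally defined -- no orientation makes the kinematics singular.
    for _ in range(steps):
        q = q + dt * quat_kinematics(q, omega)
        q = q / np.linalg.norm(q)
    return q

q0 = np.array([1.0, 0.0, 0.0, 0.0])      # identity attitude
omega = np.array([0.0, 0.0, np.pi / 2])  # spin about the body z-axis
q_final = propagate(q0, omega, dt=1e-3, steps=1000)  # 1 s of motion
```

A three-parameter representation would need the inverse of a kinematics Jacobian here, and that inverse fails to exist at certain orientations; the quaternion update above has no such breakdown.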
In practice, it is more realistic that a common time-varying reference attitude is available only to a subset of the group members. It is therefore highly desirable to develop a distributed control law that can drive a group of spacecraft to a common time-varying reference attitude even when only a subset of the team members has access to the common reference attitude. In the current literature, attitude coordination control that is robust against both structured and unstructured uncertainties has not received much attention. In [14], a quaternion-based decentralized adaptive sliding-mode control law was proposed for attitude coordination control of SFF in the presence of model uncertainties and external disturbances. However, the common time-varying reference attitude was assumed to be available to each spacecraft in the formation, and the problem of control input saturation was not considered. Universal function approximators such as neural networks (NNs) have been used in the robust control of nonlinear uncertain systems [15]–[19], owing to the learning and adaptive abilities of NNs. It is worth mentioning that the NN-based approaches for multiagent systems developed in [17]–[19] cannot be directly applied to spacecraft attitude dynamics because of the inherent nonlinearity in the quaternion kinematics. Recently, a pinning impulsive control strategy was proposed in [20] for the synchronization of stochastic dynamical networks with nonlinear coupling.
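Chebyshev neural networks, named in the abstract as part of the proposed scheme, are functional-link networks: the hidden layer is a fixed Chebyshev polynomial expansion of the input, and only the output weights are adapted. A hedged sketch of that expansion (the target function, sizes, and least-squares fit are arbitrary assumptions used only to show the recurrence, not the adaptive law of this brief):

```python
import numpy as np

def chebyshev_basis(x, n):
    """First n Chebyshev polynomials T_0..T_{n-1} evaluated at x,
    via the recurrence T_{k+1}(x) = 2 x T_k(x) - T_{k-1}(x)."""
    T = [np.ones_like(x), x]
    for _ in range(2, n):
        T.append(2.0 * x * T[-1] - T[-2])
    return np.stack(T[:n], axis=-1)

# Least-squares fit of a stand-in "unknown" nonlinearity on [-1, 1];
# in the control setting, the output weights would instead be updated
# online by an adaptive law.
x = np.linspace(-1.0, 1.0, 200)
f = np.exp(x) * np.sin(2.0 * x)   # stand-in uncertain dynamics term
Phi = chebyshev_basis(x, 8)       # fixed functional-expansion layer
w, *_ = np.linalg.lstsq(Phi, f, rcond=None)
err = float(np.max(np.abs(Phi @ w - f)))
```

Because the basis is fixed and the unknowns enter linearly, the approximation error is governed by the expansion order alone, which is what makes this structure convenient for Lyapunov-based adaptive designs.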

