IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 23, NO. 11, NOVEMBER 2012


Convergence Analyses on On-Line Weight Noise Injection-Based Training Algorithms for MLPs

John Sum, Senior Member, IEEE, Chi-Sing Leung, Member, IEEE, and Kevin Ho

Abstract— Injecting weight noise during training is a simple technique that has been proposed for almost two decades. However, little is known about its convergence behavior. This paper studies the convergence of two weight noise injection-based training algorithms: multiplicative weight noise injection with weight decay and additive weight noise injection with weight decay. We consider that they are applied to multilayer perceptrons with either linear or sigmoid output nodes. Let w(t) be the weight vector, let V(w) be the corresponding objective function of the training algorithm, let α > 0 be the weight decay constant, and let μ(t) be the step size. We show that if μ(t) → 0, then with probability one E[||w(t)||_2^2] is bounded and lim_{t→∞} ||w(t)||_2 exists. Based on these two properties, we show that if μ(t) → 0, Σ_t μ(t) = ∞, and Σ_t μ(t)^2 < ∞, then with probability one these algorithms converge. Moreover, w(t) converges with probability one to a point where ∇_w V(w) = 0.

Index Terms— Additive noise, convergence, multilayer perceptron, multiplicative noise, weight noise injection.

I. INTRODUCTION

IN [1]–[3], Murray and Edwards proposed a modified backpropagation training algorithm in which multiplicative weight noise is injected during each step of training. By simulations on character encoder and eye-classifier problems, they found that the resultant multilayer perceptrons (MLPs) have better tolerance against random weight faults and weight perturbations. Similar results were reported in [4] for recurrent neural networks (RNNs). While on-line weight noise injection training algorithms have been demonstrated to effectively enhance the fault tolerance and generalization ability of some neural networks [1], [2], [5], there is little analytical work explaining why they succeed. Regarding the effect of weight noise, almost all the analytical works focus entirely on the performance of a neural network. Little has been done on the properties of weight noise injection learning algorithms, such as their convergence behaviors and their corresponding objective functions.

Manuscript received March 29, 2011; revised July 11, 2012; accepted July 15, 2012. Date of publication October 5, 2012; date of current version October 15, 2012. This work was supported in part by City University of Hong Kong Research Grant 116511 from the General Research Fund of Hong Kong, and Research Grants 98-2221-E-005-048 and 99-2221-E-005-090 from the National Science Council, Taiwan.
J. Sum is with the Institute of Technology Management, National Chung Hsing University, Taichung 43301, Taiwan (e-mail: [email protected]).
C.-S. Leung is with the Department of Electronic Engineering, City University of Hong Kong, Hong Kong (e-mail: [email protected]).
K. Ho is with the Department of Computer Science and Communication Engineering, Providence University, Taichung City 43301, Taiwan (e-mail: [email protected]).
Digital Object Identifier 10.1109/TNNLS.2012.2210243

A. Previous Works

Murray and Edwards [2], [4], [6] and Jim et al. [4] have presented some analytical results regarding the effect of weight noise on MLPs and RNNs. They derived the prediction error of an MLP (or RNN) if its weights are corrupted by noise (refer to Section II-C in [2] and [6] for the analysis of MLPs, and [4, Sec. III] for the analysis of RNNs), but they did not proceed to the objective functions. In [5], An attempted to reveal the objective functions of these weight noise injection-based training algorithms for MLPs. However, the objective function derived for the weight noise injection-based training algorithm is essentially the prediction error of an MLP when it is corrupted by weight noise.¹

¹We will show in this paper that his approach is only valid for additive weight noise injection. His approach cannot be extended to multiplicative weight noise.

In the last decade, further analyses have been performed on the effect of weight noise on a neural network [7]–[12]. Again, all of them focused on analyzing the prediction error of a neural network when the network is corrupted by weight noise. None of them worked on the objective functions or the convergence of the weight noise injection-based algorithms.

In regard to the missing information on the objective functions and the convergence analysis for the weight noise injection-based algorithms, we conducted a series of research studies in the past few years, attempting to reveal the effect of weight noise on learning [13]–[16]. In [13] and [14], we showed that injecting weight noise during the training of a radial basis function (RBF) network converges with probability one. Besides, for RBF networks, the objective function of injecting weight noise (either multiplicative or additive) during training is essentially the mean square error (MSE), and thus we concluded that injecting weight noise during training cannot improve the fault tolerance or the generalization ability. In [16], we experimentally demonstrated that injecting weight noise during the training of MLPs might not converge, whereas weight noise injection with weight decay during training leads to convergent behavior. Because of these findings, we further derived in [15], by analyzing the mean update equations of the weight noise injection-based algorithms for MLPs, the objective function of multiplicative weight noise injection with weight decay (MWNI-WD) and the objective function of additive weight noise injection with weight decay (AWNI-WD).

B. Contributions

This paper analyzes the convergence behaviors of MWNI-WD and AWNI-WD.


We consider that the two algorithms are used for training MLPs with either linear or sigmoid output nodes. Three new analytical results will be elucidated in this paper. Let w(t) be the weight vector, let V(w) be the corresponding objective function, let α > 0 be the weight decay constant, and let μ(t) be the step size.
1) If μ(t) fulfils certain mild conditions, the MWNI-WD and AWNI-WD algorithms for an MLP with linear output nodes² converge with probability one. Besides, for both algorithms, w(t) converges to a local minimum of V(w).
2) The objective functions V(w) for the MWNI-WD and AWNI-WD algorithms for an MLP with sigmoid output nodes are derived.
3) If μ(t) fulfils certain mild conditions, the MWNI-WD and AWNI-WD algorithms for an MLP with sigmoid output nodes converge with probability one. For both algorithms, w(t) converges to a local minimum of the corresponding objective function.

²It should be noted that back-propagation algorithms for linear output nodes and sigmoid output nodes are defined in two different settings. The former is based on the gradient descent of the MSE, while the latter is defined as the gradient descent of the cross entropy error.

C. Organization of This Paper

The rest of this paper presents the main convergence theorems and the corresponding proofs for these weight noise injection-based algorithms for MLPs. In the next section, the models of MLPs with linear and sigmoid output nodes are introduced, and their corresponding weight decay training algorithms are described. In Section III, the MWNI-WD and AWNI-WD algorithms for MLPs with linear or sigmoid output nodes are delineated and their corresponding objective functions are derived. In Section IV, the boundedness conditions for E[||w(t)||_2^2] in the MWNI-WD algorithm and in the AWNI-WD algorithm are derived. In Section V, the existence of lim_{t→∞} ||w(t)||_2^2 is proved. Moreover, we show in the same section that if μ(t) → 0, Σ_t μ(t) = ∞, and Σ_t μ(t)^2 < ∞, then lim_{t→∞} ∇_w V(w(t)) = 0. To illustrate the convergence behaviors of the algorithms, we present in Section VI a few simulations. Finally, the conclusion is presented in the last section.

II. BACKGROUND

We assume that the training set D = {(x_k, y_k)}_{k=1}^N is generated by an unknown system, where x_k ∈ R^n is the kth sample input vector and y_k ∈ R is the corresponding output.

A. MLP With Linear Output Node

We use an MLP with n input nodes, m hidden nodes, and one linear output node³ to approximate the unknown system, given by

  f(x_k, d, A, c) = d^T z(x_k, A, c)   (1)

where A = [a_1, ..., a_m] ∈ R^{n×m} is the input-to-hidden weight matrix, a_i ∈ R^n is the input weight vector associated with the ith hidden node, c = (c_1, ..., c_m)^T ∈ R^m is the input bias vector, d ∈ R^m is the output weight vector, and z = (z_1, ..., z_m)^T ∈ R^m is the output vector of the hidden layer. Moreover, z is a vector-valued function. The ith element of z is defined as

  z_i(x_k, a_i, c_i) = 1 / (1 + exp(−(a_i^T x_k + c_i)))   (2)

for i = 1, 2, ..., m.

³For an MLP with output bias, the steps of the analysis are essentially the same.

Let w_i ∈ R^{n+2} be the parametric vector associated with the ith hidden node, i.e., w_i = (d_i, a_i^T, c_i)^T, and let w ∈ R^{m(n+2)} be a parametric vector augmenting all the parametric vectors, i.e., w = [w_1^T, ..., w_m^T]^T. The output is then denoted as f(x_k, w). We call the w_i's and w the weight vectors. Let g(x_k, w) be ∇_w f(x_k, w), and let g_i(x_k, w_i) be ∇_{w_i} f(x_k, w). Then we have g(x_k, w) = (g_1(x_k, w_1)^T, ..., g_m(x_k, w_m)^T)^T, in which

  g_i(x_k, w_i) = [ z_i(x_k, a_i, c_i),  d_i z_i(x_k, a_i, c_i)(1 − z_i(x_k, a_i, c_i)) x_k^T,  d_i z_i(x_k, a_i, c_i)(1 − z_i(x_k, a_i, c_i)) ]^T.   (3)

Furthermore, let ∇_w g(x, w) be ∇∇_w f(x, w), and let ∇_{w_i} g_i(x, w) be ∇∇_{w_i} f(x, w) for all i. Clearly, we have

  ∇_w g(x, w) = diag( ∇_{w_1} g_1(x, w_1), ..., ∇_{w_m} g_m(x, w_m) )   (4)

i.e., a block diagonal matrix whose off-diagonal blocks are 0_{(n+2)×(n+2)}.

In online weight decay training, a sample {x_t, y_t} is randomly drawn from the dataset D at the tth step. From (1) and (2), the output is given by

  f(x_t, w(t)) = d(t)^T z(t)   (5)
  z(t) = z(x_t, A(t), c(t)).   (6)

Replacing w_i and x_k in (3) by w_i(t) and x_t, respectively,

  g_i(x_t, w_i(t)) = [ z_i(t),  d_i(t) z_i(t)(1 − z_i(t)) x_t^T,  d_i(t) z_i(t)(1 − z_i(t)) ]^T.   (7)

To simplify the notation, we denote z_i(x_t, a_i(t), c_i(t)) as z_i(t). The update equations for the w_i's are given by

  w_i(t+1) = w_i(t) + μ(t) { (y_t − f(x_t, w(t))) g_i(x_t, w_i(t)) − α w_i(t) }   (8)

where μ(t) > 0 is the step size at the tth step, and α > 0 is the decay constant. The objective function of (8) is given by

  V_MSE(w) = (1/N) Σ_{k=1}^N (y_k − f(x_k, w))^2 + α ||w||_2^2.   (9)

The last term −α w_i(t) in (8) is called the forgetting term [17]. It has been proved [18], [19] that (8) converges with probability one if the step size μ(t) fulfills the conditions that Σ_t μ(t) = ∞, lim_{t→∞} sup |μ(t)^{-1} − μ(t−1)^{-1}| < ∞, and Σ_t μ(t)^ν < ∞ for some ν > 1.
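The update (8) is straightforward to implement. The following sketch (our illustration, not code from the paper) realizes the online weight-decay update for the MLP of (1) and (2) in NumPy; the toy data, network size, and step size are assumptions made only for illustration.

```python
# Minimal sketch of the online weight-decay update (8) for the MLP (1)-(2)
# with a linear output node. Variable names follow the text; the toy data
# below is illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n, m = 2, 10                            # input nodes, hidden nodes
A = 0.1 * rng.standard_normal((n, m))   # input-to-hidden weights a_i (columns of A)
c = np.zeros(m)                         # input biases c_i
d = 0.1 * rng.standard_normal(m)        # output weights d_i
alpha = 1e-4                            # weight decay constant

def hidden(x):
    """z_i(x, a_i, c_i) = 1 / (1 + exp(-(a_i^T x + c_i))), eq. (2)."""
    return 1.0 / (1.0 + np.exp(-(A.T @ x + c)))

def output(x):
    """f(x, d, A, c) = d^T z(x, A, c), eq. (1)."""
    return d @ hidden(x)

def update(x, y, mu):
    """One online step of (8): error times g_i, minus the forgetting term alpha*w_i."""
    global A, c, d
    z = hidden(x)
    e = y - d @ z                       # prediction error y_t - f(x_t, w(t))
    dz = d * z * (1.0 - z)              # d_i z_i (1 - z_i), entries of g_i in (7)
    d += mu * (e * z - alpha * d)
    A += mu * (e * np.outer(x, dz) - alpha * A)
    c += mu * (e * dz - alpha * c)

# toy usage on random data
X = rng.uniform(-1.0, 1.0, size=(500, n))
Y = np.sin(X[:, 0]) - 0.5 * X[:, 1] + 0.05 * rng.standard_normal(500)
for t in range(1, 20001):
    k = rng.integers(len(X))
    update(X[k], Y[k], mu=0.01)
print("training MSE:", np.mean([(Y[k] - output(X[k])) ** 2 for k in range(len(X))]))
```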


B. MLP With Sigmoid Output Node

For classification problems, a sigmoid output node is usually applied. The output of the network is given by

  φ(x_t, w(t)) = 1 / (1 + exp(−f(x_t, w(t))))   (10)
  f(x_t, w(t)) = d(t)^T z(t)   (11)

where the elements of z(t) are defined in (2). The update equations of the w_i's are given by

  w_i(t+1) = w_i(t) + μ(t) { (y_t − φ(x_t, w(t))) g_i(x_t, w_i(t)) − α w_i(t) }.   (12)

The objective function of (12) is the cross entropy error function, given by

  V_CEE(w) = −(1/N) Σ_{k=1}^N { y_k ln φ(x_k, w) + (1 − y_k) ln(1 − φ(x_k, w)) } + (α/2) ||w||_2^2.   (13)

III. MWNI-WD DURING TRAINING

Let b_1(t), ..., b_m(t) ∈ R^{n+2} be random vectors associated with the weight vectors w_1(t), ..., w_m(t) at step t.⁴ Elements in each b_i(t) are independent mean zero Gaussian-distributed random variables with variance S_b

  P(b_i(t)) ∼ N(0, S_b I_{(n+2)×(n+2)})

for all t ≥ 0. Furthermore, b_i(t_1) and b_i(t_2) are independent for all t_1 ≠ t_2. Finally, let w̃_i(t) = (d̃_i(t), ã_i(t)^T, c̃_i(t))^T be the perturbed weight vector associated with the ith hidden node. The perturbed output of the ith hidden node is denoted by z̃_i(t) = z_i(x_t, ã_i(t), c̃_i(t)).

It should be noted that the objective functions derived in this section are based on the expansion of the functions f(·) and g(·) up to the second-order term, while the objective functions derived in [15] are based on the expansion up to the first-order term.

⁴Note that the number of input nodes is denoted as n, the number of hidden nodes is denoted as m, and the total number of weights is denoted as M_w.

A. MLP With Linear Output Node [2], [6], [15]

The update of w_i based on weight noise injection with weight decay during training can be written as follows:

  w_i(t+1) = w_i(t) + μ(t) { (y_t − f(x_t, w̃(t))) g_i(x_t, w̃_i(t)) − α w_i(t) }   (14)

where

  g_i(x_t, w̃_i(t)) = [ z̃_i(t),  d̃_i(t) z̃_i(t)(1 − z̃_i(t)) x_t^T,  d̃_i(t) z̃_i(t)(1 − z̃_i(t)) ]^T.   (15)

Let w_{πi} and g_{πi} be the πith elements of the vectors w and g, respectively, and let w̃ = w + Δw. Then

  f(x_t, w̃) ≈ f(x_t, w) + ∇_w f(x_t, w)^T Δw + (1/2) Δw^T ∇∇_w f(x_t, w) Δw   (16)
  g_{πi}(x_t, w̃) ≈ g_{πi}(x_t, w) + ∇_w g_{πi}(x_t, w)^T Δw + (1/2) Δw^T ∇∇_w g_{πi}(x_t, w) Δw.   (17)

From (16) and letting e(t) = y_t − f(x_t, w), we can get that

  (y_t − f(x_t, w̃))^2 = e(t)^2 + (∇_w f(x_t, w)^T Δw)^2
   + (∇_w f(x_t, w)^T Δw)(Δw^T ∇∇_w f(x_t, w) Δw)
   + (1/4)(Δw^T ∇∇_w f(x_t, w) Δw)^2
   − 2 e(t) ∇_w f(x_t, w)^T Δw − e(t) Δw^T ∇∇_w f(x_t, w) Δw.   (18)

Taking expectation of (18), the odd-order terms of Δw will be zero. For small Δw, we can thus ignore the fourth-order term of Δw. The expected prediction error is given by

  E[(y_t − f(x_t, w̃))^2 | w] = e(t)^2 + E[(∇_w f(x_t, w)^T Δw)^2 | w] − e(t) E[Δw^T ∇∇_w f(x_t, w) Δw | w].   (19)

Next, we let E_M(x_t, y_t, w) be the expected prediction error if the noise is multiplicative, and let E_A(x_t, y_t, w) be the expected prediction error if the noise is additive.

1) MWNI-WD Algorithm: For multiplicative weight noise, w̃_i(t) in (14) is given by

  w̃_i(t) = w_i(t) + b_i(t) ⊗ w_i(t)   (20)

where

  b_i(t) ⊗ w_i(t) = (b_{i1}(t) w_{i1}(t), ..., b_{i(n+2)}(t) w_{i(n+2)}(t))^T.

As in [1], [2], [8], and [9], the noise variance S_b is assumed to be a small positive value. If Δw = b ⊗ w, the expectation E_M(x_t, y_t, w) of the prediction error (19) is given by

  E_M(x_t, y_t, w) = (y_t − f(x_t, w))^2 + S_b Σ_{πi=1}^{M_w} w_{πi}^2 (∂f(x_t, w)/∂w_{πi})^2
   − S_b (y_t − f(x_t, w)) Σ_{πi=1}^{M_w} w_{πi}^2 ∂^2 f(x_t, w)/∂w_{πi}^2.   (21)
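For concreteness, the following sketch (our illustration, not code from the paper) shows one MWNI-WD step, i.e., (14) with the multiplicative noise model (20): every weight is perturbed by independent Gaussian noise of variance S_b scaled by its own magnitude, the error and gradient are evaluated at the perturbed weights, and the unperturbed weights receive the update together with the weight-decay term. The network layout carries over from the earlier weight-decay sketch and is an assumption.

```python
# Sketch of one MWNI-WD step, (14) with multiplicative weight noise (20).
import numpy as np

rng = np.random.default_rng(1)
n, m = 2, 10
A = 0.1 * rng.standard_normal((n, m))
c = np.zeros(m)
d = 0.1 * rng.standard_normal(m)
alpha, S_b = 1e-4, 1e-2

def mwni_wd_step(x, y, mu):
    global A, c, d
    # multiplicative weight noise: w~ = w + b (x) w, with b ~ N(0, S_b I)
    A_t = A * (1.0 + np.sqrt(S_b) * rng.standard_normal(A.shape))
    c_t = c * (1.0 + np.sqrt(S_b) * rng.standard_normal(c.shape))
    d_t = d * (1.0 + np.sqrt(S_b) * rng.standard_normal(d.shape))
    z_t = 1.0 / (1.0 + np.exp(-(A_t.T @ x + c_t)))   # perturbed hidden outputs z~_i(t)
    e_t = y - d_t @ z_t                              # y_t - f(x_t, w~(t))
    dz_t = d_t * z_t * (1.0 - z_t)                   # entries of g_i(x_t, w~_i(t)) in (15)
    # gradient term evaluated at the perturbed weights, decay term at the clean weights
    d += mu * (e_t * z_t - alpha * d)
    A += mu * (e_t * np.outer(x, dz_t) - alpha * A)
    c += mu * (e_t * dz_t - alpha * c)
```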


Taking derivative of (21) with respect to w_{πi}, we have

  −(1/2) ∂E_M(x_t, y_t, w)/∂w_{πi}
   = (y_t − f(x_t, w)) ∂f/∂w_{πi}
   − S_b Σ_{πj=1}^{M_w} (∂f/∂w_{πj}) (∂^2 f/(∂w_{πi} ∂w_{πj})) w_{πj}^2
   − S_b w_{πi} (∂f/∂w_{πi})^2
   + S_b (y_t − f(x_t, w)) (∂^2 f/∂w_{πi}^2) w_{πi}
   + (S_b/2)(y_t − f(x_t, w)) Σ_{πj=1}^{M_w} (∂^3 f/(∂w_{πi} ∂w_{πj}^2)) w_{πj}^2
   − (S_b/2)(∂f/∂w_{πi}) Σ_{πj=1}^{M_w} (∂^2 f/∂w_{πj}^2) w_{πj}^2.   (22)

On the other hand, we can take the expectation of (y_t − f(x_t, w̃)) g_{πi}(x_t, w̃) over the random vector b and get that

  E[(y_t − f(x_t, w̃)) g_{πi}(x_t, w̃) | w]
   = (y_t − f(x_t, w)) ∂f/∂w_{πi}
   − S_b Σ_{πj=1}^{M_w} (∂f/∂w_{πj}) (∂^2 f/(∂w_{πj} ∂w_{πi})) w_{πj}^2
   + (S_b/2)(y_t − f(x_t, w)) Σ_{πj=1}^{M_w} (∂^3 f/(∂w_{πj}^2 ∂w_{πi})) w_{πj}^2
   − (S_b/2)(∂f/∂w_{πi}) Σ_{πj=1}^{M_w} (∂^2 f/∂w_{πj}^2) w_{πj}^2.   (23)

By the property that f(x_t, w) is differentiable to infinite order

  ∂^2 f/(∂w_{πi} ∂w_{πj}) = ∂^2 f/(∂w_{πj} ∂w_{πi}),   ∂^3 f/(∂w_{πi} ∂w_{πj}^2) = ∂^3 f/(∂w_{πj}^2 ∂w_{πi}).   (24)

Thus, comparing (22) and (23), we have

  E[(y_t − f(x_t, w̃)) g_{πi}(x_t, w̃) | w]
   = −(1/2) ∂E_M(x_t, y_t, w)/∂w_{πi} + S_b (∂f/∂w_{πi})^2 w_{πi} − S_b (y_t − f(x_t, w)) (∂^2 f/∂w_{πi}^2) w_{πi}
   = −(1/2) ∂E_M(x_t, y_t, w)/∂w_{πi} + (S_b w_{πi}/2) ∂^2 (y_t − f(x_t, w))^2 / ∂w_{πi}^2.   (25)

In vector form

  E[(y_t − f(x_t, w̃)) g_i(x_t, w̃) | w] = −(1/2) ∇_{w_i} E_M(x_t, y_t, w) + (S_b/2) w_i ⊗ diag{ ∇∇_{w_i} (y_t − f(x_t, w))^2 }.   (26)

By (14) and (26)

  E[w_i(t+1) | w(t)] = w_i(t) − μ(t) ∇_{w_i} V_⊗(w(t))   (27)

for all i = 1, ..., m, where⁵

  V_⊗(w) = (1/2) { (1/N) Σ_{k=1}^N E_M(x_k, y_k, w) + α ||w||_2^2 } − (S_b/N) Σ_{k=1}^N ∫ u(x_k, w) dw.   (28)

In (28)

  u(x_k, w) = w ⊗ diag{ ∇∇_w (y_k − f(x_k, w))^2 }.   (29)

⁵In this paper, we denote the objective functions for multiplicative weight noise-based algorithms as V_⊗(w) and V̄_⊗(w). For additive weight noise-based algorithms, we denote their objective functions as V_⊕(w) and V̄_⊕(w).

The last term in (28) is a line integral. It is clear that V_⊗(w) is differentiable up to infinite order. While we have this new objective function, it is important to check whether the discussion in [15] about the weight magnitude still holds (see [15, Sec. III-C]). Let

  L(w) = (1/N) Σ_{k=1}^N E_M(x_k, y_k, w) + α ||w||_2^2   (30)

and let ŵ* and w* be the local minima of L(w) and V_⊗(w), respectively. That is, ∇_w L(ŵ*) = 0 and ∇_w V_⊗(w*) = 0. Considering the situation that S_b is small, we can assume that ŵ* and w* are close to each other. Therefore, we can get that

  ∇_w L(ŵ*) ≈ ∇_w L(w*) + ∇∇_w L(w*)(ŵ* − w*).   (31)

By (28) and (30), we get that

  ∇_w L(w*) ≈ 2 ∇_w V_⊗(w*) + (S_b/N) Σ_{k=1}^N u(x_k, w*).   (32)

As ∇_w L(ŵ*) = 0 and ∇_w V_⊗(w*) = 0, we thus get that

  (S_b/N) Σ_{k=1}^N u(x_k, w*) + ∇∇_w L(w*)(ŵ* − w*) ≈ 0.   (33)

In other words

  ∇∇_w L(w*) ŵ* ≈ ( ∇∇_w L(w*) − S_b H(w*) ) w*

where

  H(w*) = diag{ ∇∇_w (1/N) Σ_{k=1}^N (y_k − f(x_k, w*))^2 }.

By rewriting (33), we can get that

  ŵ* ≈ ( I_{M_w × M_w} − S_b ∇∇_w L(w*)^{-1} H(w*) ) w*.   (34)

For small S_b, we can assume that ∇∇_w L(w*) and ∇∇_w (1/N) Σ_{k=1}^N (y_k − f(x_k, w*))^2 are positive definite. The diagonal elements of ∇∇_w (1/N) Σ_{k=1}^N (y_k − f(x_k, w*))^2 are all positive. Hence, from (34), ||ŵ*||_2 < ||w*||_2. Therefore, the effect of the integral term in V_⊗(w) is to enlarge the magnitude of the weight vector. The discussion in [15] still holds if f(x_t, w̃) and g_i(x_t, w̃_i) are expanded up to the second-order term.
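The second-order expression (21) can also be checked numerically. The sketch below (our illustration under toy assumptions, not from the paper) compares E_M(x, y, w) computed with finite-difference derivatives against a Monte Carlo average of (y − f(x, w̃))² over samples of multiplicative weight noise; the two should agree up to higher-order terms in S_b and sampling error. The packing of the weight vector and the toy values of x and y are assumptions.

```python
# Rough numerical check of the expected prediction error E_M in (21).
import numpy as np

rng = np.random.default_rng(2)
n, m = 2, 3
w = 0.5 * rng.standard_normal(m * (n + 2))   # packed weights (d_i, a_i, c_i) per hidden node
x, y = rng.uniform(-1, 1, n), 0.7
S_b = 1e-2

def f(wv):
    W = wv.reshape(m, n + 2)
    d, A, c = W[:, 0], W[:, 1:n + 1], W[:, n + 1]
    z = 1.0 / (1.0 + np.exp(-(A @ x + c)))
    return d @ z

def d1(i, h=1e-4):      # first derivative  df/dw_i (central difference)
    e = np.zeros_like(w); e[i] = h
    return (f(w + e) - f(w - e)) / (2 * h)

def d2(i, h=1e-4):      # diagonal second derivative  d^2 f/dw_i^2
    e = np.zeros_like(w); e[i] = h
    return (f(w + e) - 2 * f(w) + f(w - e)) / h ** 2

# second-order analytic value, eq. (21)
E_M = (y - f(w)) ** 2 \
      + S_b * sum(w[i] ** 2 * d1(i) ** 2 for i in range(len(w))) \
      - S_b * (y - f(w)) * sum(w[i] ** 2 * d2(i) for i in range(len(w)))

# Monte Carlo estimate of E[(y - f(w + b (x) w))^2]
samples = [(y - f(w * (1 + np.sqrt(S_b) * rng.standard_normal(len(w))))) ** 2
           for _ in range(100000)]
print("E_M (eq. 21):", E_M, "  Monte Carlo:", np.mean(samples))
```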


2) AWNI-WD Algorithm: For additive weight noise, w̃_i(t) in (14) is given by

  w̃_i(t) = w_i(t) + b_i(t).   (35)

Then Δw = b, and the expectation of the prediction error (19), which is denoted by E_A(x_t, y_t, w), is given by

  E_A(x_t, y_t, w) = (y_t − f(x_t, w))^2 + S_b Σ_{πi=1}^{M_w} (∂f(x_t, w)/∂w_{πi})^2
   − S_b (y_t − f(x_t, w)) Σ_{πi=1}^{M_w} ∂^2 f(x_t, w)/∂w_{πi}^2.   (36)

Taking derivative of (36) with respect to w_{πi}, we have

  −(1/2) ∂E_A(x_t, y_t, w)/∂w_{πi}
   = (y_t − f(x_t, w)) ∂f/∂w_{πi}
   − S_b Σ_{πj=1}^{M_w} (∂f/∂w_{πj}) (∂^2 f/(∂w_{πi} ∂w_{πj}))
   + (S_b/2)(y_t − f(x_t, w)) Σ_{πj=1}^{M_w} ∂^3 f/(∂w_{πi} ∂w_{πj}^2)
   − (S_b/2)(∂f/∂w_{πi}) Σ_{πj=1}^{M_w} ∂^2 f/∂w_{πj}^2.   (37)

On the other hand, the conditional expectation of (y_t − f(x_t, w̃)) g_{πi}(x_t, w̃) given w is given by

  E[(y_t − f(x_t, w̃)) g_{πi}(x_t, w̃) | w]
   = (y_t − f(x_t, w)) ∂f/∂w_{πi}
   − S_b Σ_{πj=1}^{M_w} (∂f/∂w_{πj}) (∂^2 f/(∂w_{πj} ∂w_{πi}))
   + (S_b/2)(y_t − f(x_t, w)) Σ_{πj=1}^{M_w} ∂^3 f/(∂w_{πj}^2 ∂w_{πi})
   − (S_b/2)(∂f/∂w_{πi}) Σ_{πj=1}^{M_w} ∂^2 f/∂w_{πj}^2.   (38)

By (24), then comparing (37) and (38), we get that

  E[(y_t − f(x_t, w̃)) g_{πi}(x_t, w̃) | w] = −(1/2) ∂E_A(x_t, y_t, w)/∂w_{πi}.   (39)

In vector form

  E[(y_t − f(x_t, w̃)) g_i(x_t, w̃) | w] = −(1/2) ∇_{w_i} E_A(x_t, y_t, w).   (40)

By (14) and (40)

  E[w_i(t+1) | w(t)] = w_i(t) − μ(t) ∇_{w_i} V_⊕(w(t))   (41)

for i = 1, ..., m, where

  V_⊕(w) = (1/2) { (1/N) Σ_{k=1}^N E_A(x_k, y_k, w) + α ||w||_2^2 }.   (42)

It is clear that V_⊕(w) is differentiable up to infinite order.

B. MLP With Sigmoid Output Node

The update of w_i based on weight noise injection with weight decay during training can be written as follows:

  w_i(t+1) = w_i(t) + μ(t) { (y_t − φ(x_t, w̃(t))) g_i(x_t, w̃_i(t)) − α w_i(t) }.   (43)

For the objective function, we let C(x_t, y_t, w(t)) be the cross entropy error given the sample x_t, y_t, and w(t)

  C(x_t, y_t, w(t)) = −y_t ln φ(x_t, w(t)) − (1 − y_t) ln(1 − φ(x_t, w(t))).   (44)

Besides, we denote⁶ EC(x_t, y_t, w(t)) as E_b[C(x_t, y_t, w̃(t)) | w]. Expanding the approximations of φ(x_t, w̃(t)) and g_{πi}(x_t, w̃(t)) up to second order, we have

  EC(x_t, y_t, w(t)) = −y_t ln φ − (1 − y_t) ln(1 − φ)
   − (1/2)(y_t − φ) Σ_{πi=1}^{M_w} (∂^2 f/∂w_{πi}^2) E[Δw_{πi}^2]
   + (1/2) φ(1 − φ) Σ_{πi=1}^{M_w} (∂f/∂w_{πi})^2 E[Δw_{πi}^2].   (45)

To save space, we simplify the notation of φ(x_t, w(t)) by φ and f(x_t, w(t)) by f in the above equation.

⁶Note that we use the notation E_b[·|w] to denote the expectation that is taken over the weight noise vector b(t), while E[·] denotes the expectation that is taken over the weight noise vector and the random sample (x_k, y_k).

Taking derivative of EC(x_t, y_t, w) with respect to w_{πi}

  ∂EC(x_t, y_t, w(t))/∂w_{πi} = −(y_t − φ) ∂f/∂w_{πi} + Λ_1 + Λ_2   (46)

where

  Λ_1 = (1/2) φ(1 − φ) Σ_{πj=1}^{M_w} (∂f/∂w_{πi})(∂^2 f/∂w_{πj}^2) E[Δw_{πj}^2]
       − (1/2)(y_t − φ) Σ_{πj=1}^{M_w} (∂^3 f/(∂w_{πi} ∂w_{πj}^2)) E[Δw_{πj}^2]
       − (1/2)(y_t − φ)(∂^2 f/∂w_{πi}^2)(∂/∂w_{πi}) E[Δw_{πi}^2]

  Λ_2 = (1/2) φ(1 − φ)(1 − 2φ) Σ_{πj=1}^{M_w} (∂f/∂w_{πi})(∂f/∂w_{πj})^2 E[Δw_{πj}^2]
       + φ(1 − φ) Σ_{πj=1}^{M_w} (∂f/∂w_{πj})(∂^2 f/(∂w_{πi} ∂w_{πj})) E[Δw_{πj}^2]
       + (1/2) φ(1 − φ)(∂f/∂w_{πi})^2 (∂/∂w_{πi}) E[Δw_{πi}^2].


On the other hand, if we take expectation of the second term at the right-hand side of (43) and expand φ(x_t, w̃(t)) and g_{πi}(x_t, w̃(t)) up to second order, we get that

  E_b[(y_t − φ(x_t, w̃(t))) g_{πi}(x_t, w̃(t)) | w(t)]
   = (y_t − φ) ∂f/∂w_{πi}
   − (1/2) φ(1 − φ) Σ_{πj=1}^{M_w} (∂f/∂w_{πi})(∂^2 f/∂w_{πj}^2) E[Δw_{πj}^2]
   + (1/2)(y_t − φ) Σ_{πj=1}^{M_w} (∂^3 f/(∂w_{πi} ∂w_{πj}^2)) E[Δw_{πj}^2]
   − (1/2) φ(1 − φ)(1 − 2φ) Σ_{πj=1}^{M_w} (∂f/∂w_{πi})(∂f/∂w_{πj})^2 E[Δw_{πj}^2]
   − φ(1 − φ) Σ_{πj=1}^{M_w} (∂f/∂w_{πj})(∂^2 f/(∂w_{πi} ∂w_{πj})) E[Δw_{πj}^2].   (47)

Comparing (46) and (47), we can get that

  E_b[(y_t − φ(x_t, w̃(t))) g_{πi}(x_t, w̃(t)) | w(t)]
   = −∂EC(x_t, y_t, w(t))/∂w_{πi}
   − (1/2)(y_t − φ)(∂^2 f/∂w_{πi}^2)(∂/∂w_{πi}) E[Δw_{πi}^2]
   + (1/2) φ(1 − φ)(∂f/∂w_{πi})^2 (∂/∂w_{πi}) E[Δw_{πi}^2].   (48)

With (48), we are able to get the mean update equations and the corresponding objective functions of MWNI-WD and AWNI-WD for MLPs with sigmoid output nodes. Recall that E[Δw_{πi} Δw_{πj} | w] = 0 if πi ≠ πj. For multiplicative weight noise

  E[Δw_{πi} Δw_{πj} | w] = S_b w_{πi}^2  if πi = πj,  and 0  if πi ≠ πj.   (49)

For additive weight noise

  E[Δw_{πi} Δw_{πj} | w] = S_b  if πi = πj,  and 0  if πi ≠ πj.   (50)

Similar to the case of the linear output node, we denote EC_M(x_t, y_t, w) as the expected cross entropy error if the noise is multiplicative and EC_A(x_t, y_t, w) as the expected cross entropy error if the noise is additive.

1) MWNI-WD Algorithm: From (48) and (49), the mean update equation for MWNI-WD is given by

  E[w_{πi}(t+1) | w(t)] − w_{πi}(t)
   = −μ(t) { (∂/∂w_{πi}) [ (1/N) Σ_{k=1}^N EC_M(x_k, y_k, w(t)) ] + α w_{πi}(t)
   + (S_b/N) Σ_{k=1}^N (y_k − φ(x_k, w(t))) (∂^2 f/∂w_{πi}^2) w_{πi}(t)
   − (S_b/N) Σ_{k=1}^N φ'(x_k, w(t)) (∂f/∂w_{πi})^2 w_{πi}(t) }   (51)

where φ'(x_k, w(t)) = φ(x_k, w(t))(1 − φ(x_k, w(t))). As

  ∂^2 C(x_k, y_k, w(t))/∂w_{πi}^2 = −(y_k − φ(x_k, w(t))) ∂^2 f/∂w_{πi}^2 + φ'(x_k, w(t)) (∂f/∂w_{πi})^2

the mean update equation can be rewritten as

  E[w_{πi}(t+1) | w(t)] − w_{πi}(t)
   = −μ(t) { (∂/∂w_{πi}) [ (1/N) Σ_{k=1}^N EC_M(x_k, y_k, w(t)) ] + α w_{πi}(t)
   − S_b w_{πi}(t) (∂^2/∂w_{πi}^2) [ (1/N) Σ_{k=1}^N C(x_k, y_k, w(t)) ] }.   (52)

Hence, the objective function is given by

  V̄_⊗(w) = (1/N) Σ_{k=1}^N EC_M(x_k, y_k, w) + (α/2) ||w||_2^2 − (S_b/N) Σ_{k=1}^N ∫ u_1(x_k, w) dw   (53)

where u_1(x_k, w) = w ⊗ diag{ ∇∇_w C(x_k, y_k, w) }. It is clear that V̄_⊗(w) is differentiable up to infinite order.

2) AWNI-WD Algorithm: From (48) and (50), the mean update equation for additive noise is given by

  E[w_{πi}(t+1) | w(t)] − w_{πi}(t) = −μ(t) { (∂/∂w_{πi}) [ (1/N) Σ_{k=1}^N EC_A(x_k, y_k, w) ] + α w_{πi}(t) }.   (54)

Hence, the objective function is given by

  V̄_⊕(w) = (1/N) Σ_{k=1}^N EC_A(x_k, y_k, w) + (α/2) ||w||_2^2.   (55)

Note that V̄_⊕(w) is differentiable up to infinite order.
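The updates (43) with either noise model are easy to realize in code. The following sketch is our illustration, not code from the paper; the network layout (A, c, d) is an assumption carried over from the earlier sketches. It performs one noise-injected step for the sigmoid-output MLP under either the multiplicative model (20) or the additive model (35), and evaluates the cross entropy (44) on the clean weights.

```python
# Sketch of one noise-injection step (43) for the sigmoid-output MLP.
import numpy as np

rng = np.random.default_rng(3)
n, m = 2, 6
A = 0.1 * rng.standard_normal((n, m))
c = np.zeros(m)
d = 0.1 * rng.standard_normal(m)
alpha, S_b = 1e-3, 1e-2

def noisy(w, multiplicative):
    b = np.sqrt(S_b) * rng.standard_normal(np.shape(w))
    return w * (1.0 + b) if multiplicative else w + b      # (20) or (35)

def step(x, y, mu, multiplicative=True):
    global A, c, d
    A_t, c_t, d_t = noisy(A, multiplicative), noisy(c, multiplicative), noisy(d, multiplicative)
    z_t = 1.0 / (1.0 + np.exp(-(A_t.T @ x + c_t)))
    phi_t = 1.0 / (1.0 + np.exp(-(d_t @ z_t)))             # phi(x_t, w~(t)), eq. (10)
    e_t = y - phi_t                                        # y_t - phi(x_t, w~(t))
    dz_t = d_t * z_t * (1.0 - z_t)
    d += mu * (e_t * z_t - alpha * d)
    A += mu * (e_t * np.outer(x, dz_t) - alpha * A)
    c += mu * (e_t * dz_t - alpha * c)

def cross_entropy(X, Y):
    """Average of C(x_k, y_k, w) in (44) over a dataset, using the clean weights."""
    Z = 1.0 / (1.0 + np.exp(-(X @ A + c)))
    P = np.clip(1.0 / (1.0 + np.exp(-(Z @ d))), 1e-12, 1 - 1e-12)
    return -np.mean(Y * np.log(P) + (1 - Y) * np.log(1 - P))
```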

IV. BOUNDEDNESS OF E[||w(t)||_2^2]

Before proceeding to the convergence analysis, we need to show that E[||w(t)||_2^2] < ∞ for all t. This boundedness condition is the key to proving that w(t) converges to a local minimum of the corresponding objective function. Both cases, linear and sigmoid output nodes, will be analyzed in this section. We accomplish the proof in the following steps.

S1: We show that E[||d(t)||_2^2] and E[||d(t)||_2^4] are bounded.
S2: We use the result in S1 to show that E[||a_i(t)||_2^2] for all i = 1, ..., m and E[||c(t)||_2^2] are bounded.
S3: By the results in S1 and S2, we conclude that E[||w(t)||_2^2] is bounded, which implies that E[||w̃(t)||_2^2] is bounded.

We assume that μ(t) → 0 for all t ≥ 0, and let b_d(t) be the random vector associated with the output vector d. That is

  b_d(t) = (b_{11}(t), b_{21}(t), ..., b_{m1}(t))^T   (56)

where b_{i1}(t) is the first element in b_i(t). Besides, we use the notation E_d[·|w(t)] to denote the conditional expectation that is taken over the random vector b_d(t) only.
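Before the formal argument, the boundedness claim can also be illustrated numerically: averaging ||w(t)||_2^2 over independent noise realizations of an MWNI-WD run with α > 0 and μ(t) → 0 gives a sample estimate of E[||w(t)||_2^2] that stays bounded along training. The sketch below is only such an illustration; the data, step-size schedule, and network size are toy assumptions and not from the paper.

```python
# Monte Carlo illustration of the boundedness of E[||w(t)||_2^2] under MWNI-WD.
import numpy as np

def run(seed, T=50000, n=2, m=5, alpha=1e-3, S_b=1e-2):
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1, 1, (200, n))
    Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
    A = 0.1 * rng.standard_normal((n, m)); c = np.zeros(m); d = 0.1 * rng.standard_normal(m)
    norms = []
    for t in range(1, T + 1):
        mu = 0.01 / (1 + t / 5000.0)                       # mu(t) -> 0
        k = rng.integers(len(X)); x, y = X[k], Y[k]
        A_t = A * (1 + np.sqrt(S_b) * rng.standard_normal(A.shape))
        c_t = c * (1 + np.sqrt(S_b) * rng.standard_normal(m))
        d_t = d * (1 + np.sqrt(S_b) * rng.standard_normal(m))
        z = 1 / (1 + np.exp(-(A_t.T @ x + c_t)))
        e = y - d_t @ z
        dz = d_t * z * (1 - z)
        d += mu * (e * z - alpha * d)
        A += mu * (e * np.outer(x, dz) - alpha * A)
        c += mu * (e * dz - alpha * c)
        if t % 1000 == 0:
            norms.append(d @ d + np.sum(A * A) + c @ c)    # ||w(t)||_2^2
    return np.array(norms)

avg = np.mean([run(s) for s in range(10)], axis=0)         # sample mean approximates E[||w(t)||_2^2]
print("estimated E[||w||^2] every 1000 steps:", np.round(avg[:5], 3), "...", round(float(avg[-1]), 3))
```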


A. MWNI-WD Algorithms for MLP With Linear Output Node

Here, we analyze the case of the MLP with linear output nodes, as it is the most difficult one among all cases. From (1), (14), and (15), the update of d_i(t) can be expressed as follows:

  d_i(t+1) = d_i(t) + μ(t) { (y_t − d̃^T(t) z̃(t)) z̃_i(t) − α d_i(t) }.   (57)

In vector-matrix form

  d(t+1) = d(t) + μ(t) { (y_t − d̃^T(t) z̃(t)) z̃(t) − α d(t) }.   (58)

Based on (58), the boundedness condition for ||d(t)||_2 can be stated in the following lemmas.

Lemma 1: For the algorithm based on (14) and (20) with α > 0, if 0 < μ(t)α < 1, then with probability one E[||d(t)||_2^2] < ∞ and E[||d(t)||_2^4] < ∞ for all t.

Proof of E[||d(t)||_2^2] < ∞: We rewrite the update of d(t) given by (58) as follows:

  d(t+1) = B(t) d(t) + μ(t) y_t z̃(t) − μ(t) z̃(t) z̃^T(t) (b_d(t) ⊗ d(t))   (59)

where

  B(t) = (1 − μ(t)α) I_{m×m} − μ(t) z̃(t) z̃^T(t).   (60)

Since the elements in b_d(t) are identically and independently distributed mean zero Gaussian random variables with variance S_b

  E_d[ (b_d(t) ⊗ d(t)) (b_d(t) ⊗ d(t))^T | w(t) ] = S_b diag{ d_1(t)^2, ..., d_m(t)^2 }.

Hence

  E_d[ ||z̃(t) z̃^T(t) (b_d(t) ⊗ d(t))||_2^2 | w(t) ] = S_b ||z̃(t)||_2^2 ||diag{z̃(t)} d(t)||_2^2.   (61)

Given w(t), the expectation of ||d(t+1)||_2^2 over the random vector b_d(t) is given by

  E_d[||d(t+1)||_2^2 | w(t)] = ||B(t) d(t) + μ(t) y_t z̃(t)||_2^2 + μ(t)^2 S_b ||z̃(t)||_2^2 ||diag{z̃(t)} d(t)||_2^2.   (62)

Expanding the first term in the RHS of (62) and by (60), the coefficient of ||d(t)||_2^2 in the RHS is bounded by (1 − μ(t)α)^2 + μ(t)^2 S_b m, since z̃_i(t) ≤ 1 for all i = 1, ..., m. For μ(t) → 0, this factor is dominated by (1 − μ(t)α)^2. As a result, (62) can be rewritten as follows:

  E_d[||d(t+1)||_2^2 | w(t)] = ||(1 − μ(t)α) d(t) + μ(t) y_t z̃(t)||_2^2.   (63)

By Jensen's inequality [20], we can get that

  E_d[||d(t+1)||_2 | w(t)] ≤ ( E_d[||d(t+1)||_2^2 | w(t)] )^{1/2}.   (64)

Therefore, from (63), (64), and the triangle inequality, we get that

  E_d[||d(t+1)||_2 | w(t)] ≤ (1 − μ(t)α) ||d(t)||_2 + μ(t) |y_t| ||z̃(t)||_2.   (65)

As the second term in the RHS of (65) is bounded and its value is independent of b(t), we get that

  E[||d(t+1)||_2 | w(t)] ≤ (1 − μ(t)α) ||d(t)||_2 + μ(t) κ_1   (66)

where κ_1 is a constant. Thus, we can get that

  E[||d(t+1)||_2] ≤ (1 − μ(t)α) E[||d(t)||_2] + μ(t) κ_1.   (67)

We can thus prove by induction that

  E[||d(t)||_2] ≤ κ_1 / α.   (68)

Expanding the RHS of (63) and ignoring the terms which consist of μ(t)^2, we can get by the triangle inequality that

  E_d[||d(t+1)||_2^2 | w(t)] ≤ (1 − 2μ(t)α) ||d(t)||_2^2 + 2μ(t) κ_1 ||d(t)||_2.   (69)

Again, since the RHS of (69) is independent of b(t), we have

  E[||d(t+1)||_2^2] ≤ (1 − 2μ(t)α) E[||d(t)||_2^2] + 2μ(t) κ_1^2 / α.   (70)

Similarly, we can thus prove by induction that

  E[||d(t)||_2^2] ≤ κ_1^2 / α^2.   (71)

Proof of E[||d(t)||_2^4] < ∞: The idea is similar to the proof for the norm square. First of all, let ẽ(t) = y_t − z̃(t)^T d̃(t). Expanding ||d(t+1)||_2^4 and taking expectation over b_d(t)

  E_d[||d(t+1)||_2^4 | w(t)] = (1 − μ(t)α)^4 ||d(t)||_2^4 + 4μ(t)(1 − μ(t)α)^3 ||d(t)||_2^2 E_d[ẽ(t) z̃^T(t) d(t) | w(t)] + Ξ_1 + Ξ_2 + Ξ_3 + Ξ_4   (72)

where

  Ξ_1 = 4μ(t)^2 (1 − μ(t)α)^2 (z̃^T(t) d(t))^2 E_d[ẽ^2(t) | w(t)]
  Ξ_2 = 2μ(t)^2 (1 − μ(t)α)^2 ||d(t)||_2^2 ||z̃(t)||_2^2 E_d[ẽ^2(t) | w(t)]
  Ξ_3 = 4μ(t)^3 (1 − μ(t)α) ||z̃(t)||_2^2 z̃^T(t) d(t) E_d[ẽ^3(t) | w(t)]
  Ξ_4 = μ(t)^4 ||z̃(t)||_2^4 E_d[ẽ^4(t) | w(t)].

As S_b, y_t, and ||z̃(t)||_2 are all bounded, it can be shown that

  Ξ_1 ≤ μ(t)^2 ( κ_{11} + κ_{12} ||d(t)||_2^2 + κ_{13} ||d(t)||_2^4 )
  Ξ_2 ≤ μ(t)^2 ( κ_{21} + κ_{22} ||d(t)||_2^2 + κ_{23} ||d(t)||_2^4 )
  Ξ_3 ≤ μ(t)^3 ( κ_{31} + κ_{32} ||d(t)||_2^2 + κ_{33} ||d(t)||_2^4 )
  Ξ_4 ≤ μ(t)^4 ( κ_{41} + κ_{42} ||d(t)||_2^2 + κ_{43} ||d(t)||_2^4 )

where the κ_{ij}'s are constants. For μ(t) → 0 for all t, the coefficient of ||d(t)||_2^4 is dominated by (1 − μ(t)α)^4. We can thus ignore Ξ_1–Ξ_4 in (72). Finally, we get by expanding ẽ(t) in (72) that

  E_d[||d(t+1)||_2^4 | w(t)] = (1 − μ(t)α)^4 ||d(t)||_2^4 + 4μ(t)(1 − μ(t)α)^3 ||d(t)||_2^2 y_t E_d[z̃^T(t) d(t) | w(t)] − 4μ(t)(1 − μ(t)α)^3 ||d(t)||_2^2 E_d[(z̃^T(t) d(t))^2 | w(t)].   (73)


By the fact that 2 y_t z̃^T(t) d(t) ≤ y_t^2 + (z̃^T(t) d(t))^2 and ||d(t)||_2^2 (z̃^T(t) d(t))^2 ≥ 0, we get that⁷

  E_d[||d(t+1)||_2^4 | w(t)] ≤ (1 − μ(t)α) ||d(t)||_2^4 + 2μ(t) κ_2 ||d(t)||_2^2   (74)

where κ_2 = max y_t^2. Thus, we get that

  E[||d(t+1)||_2^4] ≤ (1 − μ(t)α) E[||d(t)||_2^4] + 2μ(t) κ_2 E[||d(t)||_2^2].   (75)

From (68), the second term in the RHS of (75) is bounded by 2μ(t) κ_2 κ_1^2 / α^2. Therefore, we can prove by induction that

  E[||d(t)||_2^4] ≤ 2 κ_2 κ_1^2 / α^3   (76)

for all t ≥ 0. Then, the proof is completed.

⁷Note that (1 − μ(t)α)^4 < (1 − μ(t)α).

One should note that the boundedness of E[||d(t)||_2^2] and E[||d(t)||_2^4] implies the boundedness of E[d_{π1}(t) d_{π2}(t)], E[d_{π1}(t) d_{π2}(t) d_{π3}(t)], and E[d_{π1}(t) d_{π2}(t) d_{π3}(t) d_{π4}(t)].

From (1), (14), and (15), the update of a_i(t) can be expressed as follows:

  [ a_i(t+1) ; c_i(t+1) ] = (1 − μ(t)α) [ a_i(t) ; c_i(t) ] + μ(t) ṽ_i(t) d̃_i(t) ẽ(t) [ x_t ; 1 ]   (77)

where ṽ_i(t) = z̃_i(t)(1 − z̃_i(t)). Note from (20) and (56) that d̃_i(t) = d_i(t) + b_{i1}(t) d_i(t) and

  d̃_i(t) d̃(t) = (d_i(t) + b_{i1}(t) d_i(t)) (d(t) + b_d(t) ⊗ d(t)).   (78)

Lemma 2: For the algorithm based on (14) and (20) with α > 0, if 0 < μ(t)α < 1, with probability one E[||(a_i(t), c_i(t))||_2^2] < ∞ for all t and i = 1, 2, ..., m.

Proof: Let a_{ij}(t) be the jth element in a_i(t)

  a_{ij}(t+1) = (1 − μ(t)α) a_{ij}(t) + μ(t) ṽ_i(t) x_{tj} ẽ(t) d̃_i(t).   (79)

Squaring both sides and taking expectation over b_d(t)

  E_d[a_{ij}(t+1)^2 | w(t)] = (1 − μ(t)α)^2 a_{ij}(t)^2 + 2μ(t)(1 − μ(t)α) a_{ij}(t) x_{tj} E_d[ṽ_i(t) ẽ(t) d̃_i(t) | w(t)] + μ(t)^2 x_{tj}^2 E_d[ (ṽ_i(t) ẽ(t) d̃_i(t))^2 | w(t) ].   (80)

By the fact that ṽ_i(t) < 1/4, we can get that

  E_d[a_{ij}(t+1)^2 | w(t)] ≤ (1 − μ(t)α)^2 a_{ij}(t)^2 + 2μ(t)(1 − μ(t)α) κ_3 |a_{ij}(t)| E_d[ |ẽ(t) d̃_i(t)| | w(t) ] + μ(t)^2 κ_3^2 E_d[ (ẽ(t) d̃_i(t))^2 | w(t) ]   (81)

where κ_3 = max{ |x_{tj}| : t ≥ 0, j = 1, ..., n }. Now, we let

  λ_1(d(t)) = E_d[ |ẽ(t) d̃_i(t)| | w(t) ],   λ_2(d(t)) = E_d[ (ẽ(t) d̃_i(t))^2 | w(t) ].

Clearly, λ_1(d(t)) is a function of d_1(t), ..., d_m(t) up to second order and λ_2(d(t)) is a function of d_1(t), ..., d_m(t) up to fourth order. Therefore, we can take expectation of (81) over b(t) and get that

  E[a_{ij}(t+1)^2 | w(t)] ≤ (1 − μ(t)α)^2 a_{ij}(t)^2 + 2μ(t)(1 − μ(t)α) κ_3 |a_{ij}(t)| λ_1(d(t)) + μ(t)^2 κ_3^2 λ_2(d(t)).   (82)

Then by taking expectation of (82) over all b(τ) for 0 ≤ τ ≤ t, we get that

  E[a_{ij}(t+1)^2] ≤ (1 − μ(t)α)^2 E[a_{ij}(t)^2] + 2μ(t)(1 − μ(t)α) κ_3 E[|a_{ij}(t)| λ_1(d(t))] + μ(t)^2 κ_3^2 E[λ_2(d(t))].   (83)

By the Cauchy–Schwarz inequality [20]

  E[|a_{ij}(t)| λ_1(d(t))] ≤ ( E[a_{ij}(t)^2] E[λ_1^2(d(t))] )^{1/2}.   (84)

As λ_1^2(d(t)) and λ_2(d(t)) are functions of d_1(t), ..., d_m(t) up to fourth order, E[λ_1^2(d(t))] and E[λ_2(d(t))] must be bounded for all t ≥ 0. We let κ_4 be the upper bound of these two values. Thus, taking the square root on both sides of (83), we get that

  √(E[a_{ij}(t+1)^2]) ≤ (1 − μ(t)α) √(E[a_{ij}(t)^2]) + μ(t) κ_3 √κ_4.   (85)

Clearly, we can prove by induction that

  E[a_{ij}(t)^2] ≤ κ_3^2 κ_4 / α^2.   (86)

Following the same steps as for a_{ij}(t) and replacing κ_3 by one, we can show that

  E[c_i(t)^2] ≤ κ_4 / α^2.   (87)

So, we can conclude that E[||(a_i(t), c_i(t))||_2^2] < ∞ for all t and i = 1, 2, ..., m. The proof is completed.

As a direct implication of Lemmas 1 and 2, we state without proof the following theorem for the weight vector w(t).

Theorem 1 (MWNI-WD for MLP With Linear Output): For the algorithm based on (14) and (20) with α > 0, if μ(t) → 0, with probability one E[||w(t)||_2^2] < ∞ for all t ≥ 0.

B. MWNI-WD Algorithms for MLP With Sigmoid Output Node

In this section, we analyze the case of the MLP with sigmoid output.

Theorem 2 (MWNI-WD for MLP With Sigmoid Output): For the algorithm based on (43) and (20) with α > 0, if 0 < μ(t)α < 1, with probability one E[||w(t)||_2^2] < ∞ for all t ≥ 0.

Proof: Similar to the case of the MLP with linear output, we first consider the change of the output weights

  d(t+1) = d(t) + μ(t) { (y_t − φ(x_t, w̃(t))) z̃(t) − α d(t) }.   (88)


Since φ(x_t, w̃(t)), z̃_i(t), and y_t are all bounded by zero and one,⁸ it is clear that

  ||d(t+1)||_2 ≤ (1 − μ(t)α) ||d(t)||_2 + μ(t).   (89)

Again, we can prove by induction that

  ||d(t)||_2 ≤ α^{-1}.   (90)

As the boundedness of ||d(t)||_2 is independent of the random vector b, we can infer that E[||d(t)||_2] and E[||d(t)||_2^2] are bounded for all t ≥ 0.

⁸For classification problems, y_t ∈ {0, 1} only.

For the input weight a_{ij}(t), the update is given by

  a_{ij}(t+1) = (1 − μ(t)α) a_{ij}(t) + μ(t) γ(t) d̃_i(t)   (91)

where γ(t) = φ̃(t)(1 − φ̃(t)) ṽ_i(t) x_{tj} (y_t − φ̃(t)). Here, we denote φ(x_t, w̃(t)) as φ̃(t) to save space. From (91), we have

  E[a_{ij}^2(t+1)] ≤ ( (1 − μ(t)α) a_{ij}(t) + μ(t) γ(t) d_i(t) )^2 + μ(t)^2 S_b γ^2(t) d_i^2(t).   (92)

As we can see that φ̃(t), ṽ_i(t), x_{tj}, and y_t are all bounded, γ(t) is bounded for all t ≥ 0

  E[a_{ij}^2(t+1)] ≤ (1 − μ(t)α)^2 E[a_{ij}^2(t)] + 2μ(t)(1 − μ(t)α) κ_5 E[|a_{ij}(t) d_i(t)|] + μ(t)^2 κ_6 E[d_i^2(t)]   (93)
   ≤ (1 − μ(t)α)^2 E[a_{ij}^2(t)] + μ(t)(1 − μ(t)α) κ_7 √(E[a_{ij}^2(t)]) + μ(t)^2 κ_8.   (94)

The last inequality is based on (90). Then

  √(E[a_{ij}^2(t+1)]) ≤ (1 − μ(t)α) √(E[a_{ij}^2(t)]) + μ(t) κ_9.   (95)

By induction, we can show that E[a_{ij}^2(t)] ≤ κ_9^2/α^2 for all t ≥ 0. Repeating the same steps as for E[a_{ij}^2(t)], we can show that E[c_i^2(t)] < ∞ for all t ≥ 0. Finally, we can conclude that with probability one E[||w(t)||_2^2] < ∞ for all t ≥ 0 and the proof is completed.

C. AWNI-WD Algorithms

Similarly, both cases, linear and sigmoid output nodes, will be analyzed. As the proofs can be accomplished by the same steps, replacing w(t) ⊗ b(t) by b(t), we simply state the theorems without proof.

Theorem 3 (AWNI-WD for MLP With Linear Output): For the algorithm based on (14) and (35) with α > 0, if 0 < μ(t)α < 1, then with probability one E[||w(t)||_2^2] < ∞ for all t ≥ 0.

Theorem 4 (AWNI-WD for MLP With Sigmoid Output): For the algorithm based on (43) and (35) with α > 0, if 0 < μ(t)α < 1, then with probability one E[||w(t)||_2^2] < ∞ for all t ≥ 0.

V. CONVERGENCE ANALYSIS

Now, we can proceed to the convergence analysis. The proof is conducted in the following steps. First, we show that lim_{t→∞} ||w(t)||_2 exists. This implies that, for sufficiently large t, the elements in w(t) must be bounded. As the objective functions derived earlier in this paper are infinitely differentiable, the elements in the gradient vector ∇_w V(w(t)) and in the matrix ∇∇_w V(w(t)) must be bounded when t is large enough. In this regard, we show in the second step that, under mild conditions on μ(t), lim_{t→∞} ∇_w V(w(t)) = 0.

As the steps of the proofs for all four cases, namely MWNI-WD for MLPs with linear output nodes, MWNI-WD for MLPs with sigmoid output nodes, AWNI-WD for MLPs with linear output nodes, and AWNI-WD for MLPs with sigmoid output nodes, are almost identical, we only show the steps of the proof for the first case. The convergence theorems for the other cases are then stated without proof.

Lemma 3 (MWNI-WD for MLP With Linear Output): For the algorithm (14) and (20) with α > 0, if μ(t) → 0, then with probability one lim_{t→∞} ||w(t)||_2 exists.

Proof: From (66), define a random variable β(t) as follows:⁹

  β(t) = ||d(t)||_2 Π_{τ=t}^{∞} (1 − μ(τ)α) + κ_1 Σ_{τ1=t}^{∞} μ(τ1) Π_{τ2=τ1+1}^{∞} (1 − μ(τ2)α).   (96)

⁹Note that the definition of β(t) is inspired by the proof of Gladyshev's theorem ([21, p. 276]).

It is clear that β(t) ≥ 0. Now, we show that β(t) is a supermartingale. The expectation of β(t+1) is given by

  E[β(t+1) | β(t)] = E[||d(t+1)||_2 | β(t)] Π_{τ=t+1}^{∞} (1 − μ(τ)α) + κ_1 Σ_{τ1=t+1}^{∞} μ(τ1) Π_{τ2=τ1+1}^{∞} (1 − μ(τ2)α).   (97)

By (66), we can thus get that

  E[β(t+1) | β(t)] ≤ (1 − μ(t)α) ||d(t)||_2 Π_{τ=t+1}^{∞} (1 − μ(τ)α) + μ(t) κ_1 Π_{τ=t+1}^{∞} (1 − μ(τ)α) + κ_1 Σ_{τ1=t+1}^{∞} μ(τ1) Π_{τ2=τ1+1}^{∞} (1 − μ(τ2)α)
   = ||d(t)||_2 Π_{τ=t}^{∞} (1 − μ(τ)α) + μ(t) κ_1 Π_{τ=t+1}^{∞} (1 − μ(τ)α) + κ_1 Σ_{τ1=t+1}^{∞} μ(τ1) Π_{τ2=τ1+1}^{∞} (1 − μ(τ2)α).   (98)


Clearly, the RHS of (98) is equal to β(t). Thus, E[β(t+1)|β(t)] ≤ β(t) and

  E[β(t+1)] ≤ E[β(t)] ≤ ··· ≤ E[β(0)].   (99)

Next, we are going to show that E[β(0)] is finite. As ||d(0)||_2 is finite, what we will show is that the second term in the RHS of (98) is finite. Let

  ξ_T = exp{ −α Σ_{τ2=0}^{T} μ(τ2) }.

By the inequality ln(1 − μ(t)α) ≤ −μ(t)α

  Σ_{τ1=0}^{T−1} μ(τ1) Π_{τ2=τ1+1}^{T} (1 − μ(τ2)α)
   ≤ Σ_{τ1=0}^{T−1} μ(τ1) exp{ −α Σ_{τ2=τ1+1}^{T} μ(τ2) }
   = ξ_T Σ_{τ1=0}^{T−1} μ(τ1) exp{ α Σ_{τ2=0}^{τ1} μ(τ2) }
   = ξ_T μ(0) exp{αμ(0)} + ξ_T Σ_{τ1=1}^{T−1} μ(τ1) exp{αμ(τ1)} exp{ α Σ_{τ2=0}^{τ1−1} μ(τ2) }
   ≤ ξ_T μ(0) exp{αμ(0)} + ξ_T exp{α max_t{μ(t)}} Σ_{τ1=1}^{T−1} μ(τ1) exp{ α Σ_{τ2=0}^{τ1−1} μ(τ2) }.   (100)

Fig. 1. Schematic diagram showing (101).

From Fig. 1, we can see that

  Σ_{τ1=1}^{T−1} μ(τ1) exp{ α Σ_{τ2=0}^{τ1−1} μ(τ2) } ≤ ∫_{μ(0)}^{T̄−μ(T)} exp(αx) dx < ∫_{0}^{T̄} exp(αx) dx   (101)

where T̄ = Σ_{τ=0}^{T−1} μ(τ). From (100) and (101)

  Σ_{τ1=0}^{T−1} μ(τ1) Π_{τ2=τ1+1}^{T} (1 − μ(τ2)α) < ξ_T μ(0) exp{αμ(0)} + (exp{α max_t{μ(t)}}/α) (1 − exp(−α T̄)).   (102)

Since lim_{T→∞} ξ_T = 0

  lim_{T→∞} Σ_{τ1=0}^{T−1} μ(τ1) Π_{τ2=τ1+1}^{T} (1 − μ(τ2)α) ≤ exp{α max_t{μ(t)}} / α.   (103)

Finally, from (96) and the fact that ||d(0)||_2 is finite, we get

  E[β(0)] ≤ ||d(0)||_2 + κ_1 exp{α max_t{μ(t)}} / α   (104)

and thus

  E[β(t+1)] ≤ E[β(t)] ≤ ··· ≤ E[β(0)] < ∞.

According to the Martingale Convergence Theorem ([22, p. 109]), lim_{t→∞} β(t) exists with probability one. From (103), it is clear that the value of the second term in (96) is bounded and positive. On the other hand, the factor Π_{τ=t}^{∞} (1 − μ(τ)α) associated with ||d(t)||_2 in (96) is increasing with respect to t. Its value is positive and bounded by one. Therefore, we can conclude that lim_{t→∞} ||d(t)||_2 exists with probability one.

As lim_{t→∞} ||d(t)||_2 exists, there exists a bounded region Ω and t* such that d(t) ∈ Ω for all t ≥ t*. From (77), we get that

  E_d[ ||[a_i(t+1); c_i(t+1)]||_2^2 | w(t) ] = (1 − μ(t)α)^2 ||[a_i(t); c_i(t)]||_2^2
   + μ(t)^2 E_d[ ṽ_i^2(t) d̃_i^2(t) ẽ^2(t) ||[x_t; 1]||_2^2 | w(t) ]
   + 2(1 − μ(t)α) μ(t) E_d[ ṽ_i(t) d̃_i(t) ẽ(t) | w(t) ] (a_i^T(t) x_t + c_i(t)).   (105)

Thus, we can follow similar arguments as in the proof of Lemma 2 and get that

  E_d[ ||[a_i(t+1); c_i(t+1)]||_2 | w(t) ] ≤ (1 − μ(t)α) ||[a_i(t); c_i(t)]||_2 + μ(t) κ_{10}   (106)

for all t ≥ t*. The constant κ_{10} in (106) depends on the bound of ||d(t)||_2 and on S_b.


Now, we can define β'(t) in the same way as β(t). For t ≥ t*

  β'(t) = ||[a_i(t); c_i(t)]||_2 Π_{τ=t}^{∞} (1 − μ(τ)α) + κ_{10} Σ_{τ1=t}^{∞} μ(τ1) Π_{τ2=τ1+1}^{∞} (1 − μ(τ2)α).   (107)

Following the same steps as for ||d(t)||_2, we can then show that with probability one lim_{t→∞} ||[a_i(t); c_i(t)]||_2 exists for all i = 1, ..., m. In conclusion, lim_{t→∞} ||w(t)||_2 exists with probability one. The proof is completed.

By Lemma 3, we can have the following lemma.

Lemma 4: For all t ≥ t*, there exists a bounded region Ω and t* such that w(t) ∈ Ω, V_⊗(w(t)) < ∞, and the eigenvalues of the Hessian matrix ∇∇_w V_⊗(w(t)) are all finite.

Now, we can state in the following theorem the convergence of the algorithm based on (14) and (20).

Theorem 5 (MWNI-WD for MLP With Linear Output Nodes): For the algorithm based on (14) and (20) with α > 0, if μ(t) → 0, Σ_t μ(t) = ∞, and Σ_t μ(t)^2 < ∞, then with probability one lim_{t→∞} ∇_w V_⊗(w(t)) = 0, where V_⊗(w(t)) is given by (28).

Proof: Now, we can expand V_⊗(w(t+1)) around w(t) and get that

  V_⊗(w(t+1)) = V_⊗(w(t)) + ∇_w V_⊗(w(t)) δw(t) + (1/2) δw(t)^T ∇∇_w V_⊗(w(t)) δw(t)   (108)

where δw(t) = w(t+1) − w(t). By Lemma 4, we can let κ_{11} be the maximum eigenvalue of ∇∇_w V_⊗(w(t)) for t ≥ t*. Then we get

  V_⊗(w(t+1)) ≤ V_⊗(w(t)) + ∇_w V_⊗(w(t)) δw(t) + (κ_{11}/2) δw(t)^T δw(t).   (109)

Taking the conditional expectation,

  E[V_⊗(w(t+1)) | w(t)] ≤ V_⊗(w(t)) − μ(t) ||∇_w V_⊗(w(t))||_2^2 + (κ_{11}/2) μ(t)^2 E[||h(t)||_2^2]   (110)

where h(t) = (h_1(t)^T, ..., h_m(t)^T)^T and

  h_i(t) = (y_t − f(x_t, w̃(t))) g_i(x_t, w̃_i(t)) − α w_i(t).

The boundedness of E[||h(t)||_2^2] can then be proved by Lemma 1 and Theorem 1. We let this bound be κ_{12}. As a result, we can get that

  lim_{t→∞} E[V_⊗(w(t)) | w(t*)] ≤ V_⊗(w(t*)) − Σ_{t≥t*} μ(t) E[||∇_w V_⊗(w(t))||_2^2 | w(t*)] + (κ_{11} κ_{12}/2) Σ_{t≥t*} μ(t)^2.   (111)

It is clear from Lemma 4 that lim_{t→∞} E[V_⊗(w(t)) | w(t*)] and V_⊗(w(t*)) are both finite. Further, from the condition that Σ_{t≥t*} μ(t)^2 < ∞, Σ_{t≥t*} μ(t) E[||∇_w V_⊗(w(t))||_2^2 | w(t*)] is finite. By the condition that Σ_{t≥t*} μ(t) = ∞, we can prove by contradiction that lim_{t→∞} E[||∇_w V_⊗(w(t))||_2^2 | w(t*)] = 0. In other words, lim_{t→∞} ∇_w V_⊗(w(t)) = 0. The proof is completed.

As the steps of proof for the other cases are almost the same, we state without proof the following theorems.

Theorem 6 (MWNI-WD for MLP With Sigmoid Output): For the algorithm based on (43) and (20) with α > 0, if 0 < μ(t)α < 1, Σ_t μ(t) = ∞, and Σ_t μ(t)^2 < ∞, then with probability one lim_{t→∞} ∇_w V̄_⊗(w(t)) = 0, where V̄_⊗(w(t)) is given by (53).

Theorem 7 (AWNI-WD for MLP With Linear Output): For the algorithm based on (14) and (35) with α > 0, if 0 < μ(t)α < 1, Σ_t μ(t) = ∞, and Σ_t μ(t)^2 < ∞, then with probability one lim_{t→∞} ∇_w V_⊕(w(t)) = 0, where V_⊕(w(t)) is given by (42).

Theorem 8 (AWNI-WD for MLP With Sigmoid Output): For the algorithm based on (43) and (35) with α > 0, if 0 < μ(t)α < 1, Σ_t μ(t) = ∞, and Σ_t μ(t)^2 < ∞, then with probability one lim_{t→∞} ∇_w V̄_⊕(w(t)) = 0, where V̄_⊕(w(t)) is given by (55).

VI. ILLUSTRATIVE EXAMPLES

To examine the convergence properties of the algorithms, we illustrate in this section two examples: 1) a nonlinear autoregressive (NAR) time-series prediction problem and 2) the XOR problem.

A. NAR Problem

We consider the following NAR time series [23], given by

  y(k) = (0.8 − 0.5 exp(−y^2(k−1))) y(k−1) − (0.3 + 0.9 exp(−y^2(k−1))) y(k−2) + 0.1 sin(π y(k−1)) + noise(k)   (112)

where noise(k) is a mean zero Gaussian random variable with variance equal to 0.09. One thousand samples were generated under the initial condition y(0) = y(1) = 0.1. The first 500 data points were used for training and the other 500 points were used for testing. The neural network is used to predict y(k) based on the past observations y(k−1) and y(k−2).

Four structures of the MLP are examined. All of them have two input nodes, ten hidden nodes, and one linear output node. Two are obtained based on the MWNI-WD algorithm, i.e., (14) and (20), with parameters (α, S_b) equal to (0, 10^-2) and (10^-4, 10^-2), respectively. The other two are obtained based on the AWNI-WD algorithm, i.e., (14) and (35), with parameters (0, 10^-2) and (10^-4, 10^-2), respectively. The step size μ(t) is defined as follows:

  μ(t) = 0.001,                       for 1 ≤ t ≤ 20 × T_C^N
  μ(t) = 0.001 / (t/T_C^N − 20),      for t > 20 × T_C^N   (113)

where T_C^N = 500 000 and t = 1, 2, ..., 5 × 10^8. In accordance with this definition, we can show that Σ_{t=1}^∞ μ(t) = ∞ and Σ_{t=1}^∞ μ(t)^2 < ∞. The training MSE is defined as follows:

  Train MSE = (1/500) Σ_{k=1}^{500} (y_k − f(x_k, w(t)))^2

where w(t) is the weight vector obtained after the tth training step and (x_k, y_k) is the kth training sample.
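For reference, a sketch of the NAR data generation (112), the step-size schedule (113), and the training MSE is given below. This is our illustration; T_C^N keeps the value used in the paper, but the snippet does not attempt to reproduce the full 5 × 10^8-step experiment.

```python
# NAR series (112), step-size schedule (113), and training MSE.
import numpy as np

def nar_series(length=1002, seed=0):
    rng = np.random.default_rng(seed)
    y = np.zeros(length)
    y[0] = y[1] = 0.1                                   # initial condition
    for k in range(2, length):
        y[k] = ((0.8 - 0.5 * np.exp(-y[k - 1] ** 2)) * y[k - 1]
                - (0.3 + 0.9 * np.exp(-y[k - 1] ** 2)) * y[k - 2]
                + 0.1 * np.sin(np.pi * y[k - 1])
                + np.sqrt(0.09) * rng.standard_normal())  # noise(k), variance 0.09
    X = np.column_stack([y[1:-1], y[:-2]])              # inputs (y(k-1), y(k-2))
    T = y[2:]                                           # target y(k)
    return (X[:500], T[:500]), (X[500:], T[500:])       # 500 for training, 500 for testing

def mu_nar(t, TCN=500_000):
    """Step size of (113), implemented as written."""
    return 0.001 if t <= 20 * TCN else 0.001 / (t / TCN - 20)

def train_mse(model_f, X, T):
    """Train MSE = (1/500) sum_k (y_k - f(x_k, w(t)))^2 over the training set."""
    return np.mean((T - np.array([model_f(x) for x in X])) ** 2)
```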


Fig. 2. Change of the MSE and input weights for the NAR problem with MWNI. (a) (α, S_b) is (0, 10^-2) and (b) (α, S_b) is (10^-4, 10^-2). Each line on the right panels corresponds to the value of an input weight. The data is captured after every T_C^N steps. The horizontal axis corresponds to the number of T_C^N steps.

Fig. 3. Change of the MSE and input weights for the NAR problem with additive weight noise injection. (a) (α, S_b) is (0, 10^-2) and (b) (α, S_b) is (10^-4, 10^-2). Each line on the right panels corresponds to the value of an input weight. The data is captured after every T_C^N steps. The horizontal axis corresponds to the number of T_C^N steps.

The changes of the MSE and the input weights¹⁰ are shown in Figs. 2 and 3. Regarding the convergence behavior, it is clear from Figs. 2 and 3 that both the MSE value and the weights converge in all cases of (α, S_b). For the case of MWNI with α > 0, the result shown in Fig. 2(b) confirms our theoretical finding. For the case of additive weight noise, the weights converge even when α = 0, as shown in Fig. 3(a). For α > 0, the result shown in Fig. 3(b) confirms our theoretical finding.

¹⁰Because the output weights and bias terms show similar behavior, their results are not shown here.

B. XOR Problem

For the XOR problem, we generated 100 pairs of (x_1(k), x_2(k)), and the corresponding y(k) is generated based on the following equation. For k = 1, 2, ..., 100

  y(k) = 1,  if x_1(k) x_2(k) ≥ 0
  y(k) = 0,  otherwise.   (114)

Four MLP structures are examined. All of them have two input nodes, six hidden nodes, and one sigmoid output node. Two are obtained based on the MWNI-WD algorithm, i.e., (43) and (20), with parameters (α, S_b) equal to (0, 10^-2) and (10^-3, 10^-2), respectively. The other two are obtained based on the AWNI-WD algorithm, i.e., (43) and (35), with parameters (0, 10^-2) and (10^-3, 10^-2), respectively. The step size μ(t) is defined as follows:

  μ(t) = 0.01,                      for 1 ≤ t ≤ 20 × T_C^X
  μ(t) = 0.01 / (t/T_C^X − 20),     for t > 20 × T_C^X   (115)

where T_C^X = 50 000 and t = 1, 2, ..., 5 × 10^7. Figs. 4 and 5 show the convergence behaviors of the training CE and the input weights against the training cycles. Here, the training CE is defined as follows:

  Train CE = −(1/100) Σ_{k=1}^{100} { y_k ln φ_k(t) + (1 − y_k) ln(1 − φ_k(t)) }   (116)

where φ_k(t) = φ(x_k, w(t)). It is clear from the figures that the training CE converges in all cases. However, the convergence behaviors of the input weights are not the same. For the multiplicative noise case, some input weights do not converge when α = 0, as shown in Fig. 4(a). The weights converge when α > 0, as shown in Fig. 4(b). This confirms our theoretical finding. For the case of additive weight noise, the weights converge even when α = 0, as shown in Fig. 5(a). For α > 0, as shown in Fig. 5(b), the weights converge, which confirms our theoretical finding.

Fig. 4. Change of the CE and input weights for the XOR problem with MWNI. (a) (α, S_b) is (0, 10^-2) and (b) (α, S_b) is (10^-3, 10^-2). Each line on the right panels corresponds to the value of an input weight. The data is captured after every T_C^X steps. The horizontal axis corresponds to the number of T_C^X steps.
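A sketch of the XOR data generation (114), the step-size schedule (115), and the training CE (116) is given below. The distribution of (x_1(k), x_2(k)) is not specified in the text, so the uniform sampling on [−1, 1]^2 is an assumption made only for illustration.

```python
# XOR data (114), step-size schedule (115), and training CE (116).
import numpy as np

def xor_data(K=100, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(K, 2))             # assumed input distribution
    Y = (X[:, 0] * X[:, 1] >= 0).astype(float)          # y(k) = 1 iff x1(k) x2(k) >= 0
    return X, Y

def mu_xor(t, TCX=50_000):
    """Step size of (115), implemented as written."""
    return 0.01 if t <= 20 * TCX else 0.01 / (t / TCX - 20)

def train_ce(phi, X, Y):
    """Train CE = -(1/100) sum_k [ y_k ln phi_k + (1 - y_k) ln(1 - phi_k) ], eq. (116)."""
    P = np.clip(np.array([phi(x) for x in X]), 1e-12, 1 - 1e-12)
    return -np.mean(Y * np.log(P) + (1.0 - Y) * np.log(1.0 - P))
```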


Fig. 5. Change of the CE and input weights for the XOR problem with additive weight noise injection. (a) (α, S_b) is (0, 10^-2) and (b) (α, S_b) is (10^-3, 10^-2). Each line on the right panels corresponds to the value of an input weight. The data is captured after every T_C^X steps. The horizontal axis corresponds to the number of T_C^X steps.

TABLE I
SUMMARY OF THE KEY RESULTS

  O.N., Noise     LN, MWN        SN, MWN         LN, AWN         SN, AWN
  E[||w||_2^2]    Thm 1          Thm 2           Thm 3           Thm 4
  O.F. (*)        V_⊗(w)         V̄_⊗(w)         V_⊕(w)          V̄_⊕(w)
  Local min.      Thm 5          Thm 6           Thm 7           Thm 8
  μ(t)            μ(t) → 0       0 < μ(t)α < 1   0 < μ(t)α < 1   0 < μ(t)α < 1
                  (all columns: Σ_t μ(t) = ∞ and Σ_t μ(t)^2 < ∞)
  Example         Fig. 2(b)      Fig. 4(b)       Fig. 3(b)       Fig. 5(b)

O.F.: objective function, O.N.: output node, LN: linear node, SN: sigmoid node, MWN: multiplicative weight noise, AWN: additive weight noise. (*) See (28), (42), (53), (55).

VII. CONCLUSION

In this paper, we followed our previous work in [15] to present convergence analyses of two on-line algorithms that combine the idea of weight noise injection with weight decay. For each of these algorithms, we showed that with probability one the weight vector converges to a local minimum of the corresponding objective function. Table I summarizes the key results presented in this paper. While most of the previous works analyze the effect of weight noise on the performance of a neural network [1], [2], [4], [5], [7], [11], our works presented in this paper and in other recent papers [14], [15] are the first attempts to study the effect of weight noise on learning algorithms. In recent years, there has been increasing interest in the study of the effect of noise on human brains and behaviors [24]–[29]. Extending our work to study the effect of noise (either brain noise or mental noise) on lifespan learning should be a valuable future direction.

ACKNOWLEDGMENT

The authors would like to express their gratitude to the reviewers who gave valuable comments on the earlier version of this paper. Their comments were crucial and indispensable.

REFERENCES

[1] A. F. Murray and P. J. Edwards, “Synaptic weight noise during multilayer perceptron training: Fault tolerance and training improvements,” IEEE Trans. Neural Netw., vol. 4, no. 4, pp. 722–725, Jul. 1993. [2] A. F. Murray and P. J. Edwards, “Enhanced MLP performance and fault tolerance resulting from synaptic weight noise during training,” IEEE Trans. Neural Netw., vol. 5, no. 5, pp. 792–802, Sep. 1994. [3] P. J. Edwards and A. F. Murray, “Fault tolerant via weight noise in analog VLSI implementations of MLP’s: A case study with EPSILON,” IEEE Trans. Circuits Syst. II, Analog Digital Signal Process., vol. 45, no. 9, pp. 1255–1262, Sep. 1998. [4] K. C. Jim, C. L. Giles, and B. G. Horne, “An analysis of noise in recurrent neural networks: Convergence and generalization,” IEEE Trans. Neural Netw., vol. 7, no. 6, pp. 1424–1438, Nov. 1996. [5] G. An, “The effects of adding noise during backpropagation training on a generalization performance,” Neural Comput., vol. 8, no. 3, pp. 643–674, Apr. 1996. [6] P. J. Edwards and A. F. Murray, “Can deterministic penalty terms model the effects of synaptic weight noise on network fault-tolerance?” Int. J. Neural Syst., vol. 6, no. 4, pp. 401–16, Dec. 1995. [7] G. Basalyga and E. Salinas, “When response variability increases neural network robustness to synaptic noise,” Neural Comput., vol. 18, no. 6, pp. 1349–1379, Jun. 2006. [8] J. L. Bernier, J. Ortega, I. Rojas, E. Ros, and A. Prieto, “Obtaining fault tolerance multilayer perceptrons using an explicit regularization,” J. Neural Process. Lett., vol. 12, no. 2, pp. 107–113, Oct. 2000. [9] J. L. Bernier, J. Ortega, E. Ros, I. Rojas, and A. Prieto, “A quantitative study of fault tolerance, noise immunity and generalization ability of MLPs,” Neural Comput., vol. 12, no. 12, pp. 2941–2964, Dec. 2000. [10] J. L. Bernier, J. Ortega, I. Rojas, and A. Prieto, “Improving the tolerance of multilayer perceptrons by minimizing the statistical sensitivity to weight deviations,” Neurocomputing, vol. 31, nos. 1–4, pp. 87–103, Mar. 2000. [11] J. L. Bernier, A. F. Díaz, F. J. Fernández, A. Cañas, J. González, P. Martín-Smith, and J. Ortega, “Assessing the noise immunity and generalization of radial basis function networks,” Neural Process. Lett., vol. 18, no. 1, pp. 35–48, Aug. 2003. [12] J. Sum, C. S. Leung, and K. Ho, “On objective function, regularizer and prediction error of a learning algorithm for dealing with multiplicative weight noise,” IEEE Trans. Neural Netw., vol. 20, no. 1, pp. 124–138, Jan. 2009. [13] K. Ho, C. S. Leung, and J. Sum, “On weight-noise-injection training,” in Advances in Neuro-Information Processing, N. Kasabov, M. Koeppen, and G. Coghill, Eds. New York: Springer-Verlag, 2009, pp. 919–926. [14] K. Ho, C. S. Leung, and J. Sum, “Convergence and objective functions of some fault/noise injection-based on-line learning algorithms for RBF networks,” IEEE Trans. Neural Netw., vol. 21, no. 6, pp. 938–947, Jun. 2010. [15] K. Ho, C. S. Leung, and J. Sum, “Objective functions of on-line weight noise injection training algorithms for MLP,” IEEE Trans. Neural Netw., vol. 22, no. 2, pp. 317–323, Feb. 2011. [16] J. Sum and K. Ho, “SNIWD: Simultaneous weight noise injection with weight decay for MLP training,” in Neural Information Processing, C. S. Leung, M. Lee, and J. H. Chan, Eds. New York: Springer-Verlag, 2010, pp. 494–501. [17] C. S. Leung, G. H. Young, J. Sum, and W. K. Kan, “On the regularization of forgetting recursive least square,” IEEE Trans. Neural Netw., vol. 10, no. 6, pp. 1842–1846, Nov. 
1999. [18] H. White, “Some asymptotic results for learning in single hidden-layer feedforward network models,” J. Amer. Stat. Assoc., vol. 84, no. 408, pp. 1003–1013, Dec. 1989. [19] H. Zhang, W. Wu, F. Liu, and M. Yao, “Boundedness and convergence of online gradient method with penalty for feedforward neural networks,” IEEE Trans. Neural Netw., vol. 20, no. 6, pp. 1050–1054, Jun. 2009. [20] S. M. Ross, Stochastic Processes. New York: Wiley, 1996. [21] E. Gladyshev, “On stochastic approximation,” Theory Probabil. Appl., vol. 10, no. 2, pp. 275–278, Apr. 1965. [22] D. Williams, Probability with Martingales. Cambridge, MA: Cambridge Univ. Press, 1991. [23] S. Chen, “Local regularization assisted orthogonal least squares regression,” Neurocomputing, vol. 69, nos. 4–6, pp. 559–585, Jan. 2006. [24] B. S. Chen and C. W. Li, “On the noise-enhancing ability of stochastic Hodgkin–Huxley neuron systems,” Neural Comput., vol. 22, no. 7, pp. 1737–1763, Jul. 2010. [25] A. A. Faisal, L. P. Selen, and D. M. Wolpert, “Noise in the nervous system,” Nature Rev. Neurosci., vol. 9, no. 4, pp. 292–303, Apr. 2008.


[26] H. C. Flehmig, M. Steinborn, R. Langner, and K. Westhoff, “Neuroticism and the mental noise hypothesis: Relationships to lapses of attention and slips of action in everyday life,” Psychol. Sci., vol. 49, no. 4, pp. 343– 360, Jul. 2007. [27] S. C. Li, T. Oertzenb, and U. Lindenberger, “A neurocomputational model of stochastic resonance and aging,” Neurocomputing, vol. 69, nos. 13–15, pp. 1553–1560, Aug. 2006. [28] S. W. MacDonald, L. Nyberg, and L. Backman, “Intra-individual variability in behavior: Links to brain structure, neurotransmission and neuronal activity,” Trends Neurosci., vol. 29, no. 8, pp. 474–480, Jul. 2006. [29] M. D. Robinson and M. Tamir, “Neuroticism as mental noise: A relation between neuroticism and reaction time standard deviations,” J. Personal. Soc. Psychol., vol. 89, no. 1, pp. 107–114, Jul. 2005.

John Sum (SM’05) received the B.Eng. degree in electronic engineering from the Hong Kong Polytechnic University, Hong Kong, in 1992, and the M.Phil. and Ph.D. degrees in computer science engineering from the Chinese University of Hong Kong, Hong Kong, in 1995 and 1998, respectively. He spent six years teaching in several universities in Hong Kong, including the Hong Kong Baptist University, the Open University of Hong Kong, and the Hong Kong Polytechnic University. He moved to Taiwan in 2005 and began teaching at Chung Shan Medical University, Taichung, Taiwan and then at the National Chung Hsing University, Taichung, where he is currently an Associate Professor in the Institute of Technology Management. His current research interests include neural computation, service science and engineering, mobile sensor networks, and scale-free networks. Dr. Sum is a GB Member of APPNA and a Program Committee Member for various international conferences, including ICONIP, INNS, and WIC. Moreover, he was an Associate Editor of the International Journal of Computers and Applications and a Guest Editor of Neural Computing and Applications from 2005 to 2009.

Chi-Sing Leung (M’05) received the B.Sci. degree in electronics, the M.Phil. degree in information engineering, and the Ph.D. degree in computer science from the Chinese University of Hong Kong, Hong Kong, in 1989, 1991, and 1995, respectively. He is currently an Associate Professor with the Department of Electronic Engineering, City University of Hong Kong. His current research interests include neural computing, data mining, and computer graphics. Dr. Leung received the IEEE T RANSACTIONS ON M ULTIMEDIA Prize Paper Award in 2005 for his paper “The Plenoptic Illumination Function” published in 2002. He was a member of the Organizing Committee of ICONIP’06. He was a Guest Editor of Neurocomputing and Neural Computing and Applications. He is the Program Chair of ICONIP’09. He is also the Vice-President of the Asian Pacific Neural Network Assembly.

Kevin Ho received the B.S. degree in computer engineering from the National Chiao-Tung University, Hsinchu, Taiwan, in 1983, and the M.S. and Ph.D. degrees in computer science from the University of Texas at Dallas, Dallas, in 1990 and 1992, respectively. He was an Assistant Engineer with the Institute of Information Industry, Taipei, Taiwan, from 1985 to 1987. He is currently a Professor with the Department of Computer Science and Communication Engineering, Providence University, Taichung, Taiwan. His current research interests include neural computation, algorithm design and analysis, and security and scheduling theory.
