334

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 27, NO. 2, FEBRUARY 2016

The Kernel Adaptive AutoregressiveMoving-Average Algorithm Kan Li, Student Member, IEEE, and José C. Príncipe, Fellow, IEEE

Abstract— In this paper, we present a novel kernel adaptive recurrent filtering algorithm based on the autoregressivemoving-average (ARMA) model, which is trained with recurrent stochastic gradient descent in the reproducing kernel Hilbert spaces. This kernelized recurrent system, the kernel adaptive ARMA (KAARMA) algorithm, brings together the theories of adaptive signal processing and recurrent neural networks (RNNs), extending the current theory of kernel adaptive filtering (KAF) using the representer theorem to include feedback. Compared with classical feedforward KAF methods, the KAARMA algorithm provides general nonlinear solutions for complex dynamical systems in a state-space representation, with a deferred teacher signal, by propagating forward the hidden states. We demonstrate its capabilities to provide exact solutions with compact structures by solving a set of benchmark nondeterministic polynomial-complete problems involving grammatical inference. Simulation results show that the KAARMA algorithm outperforms equivalent input-space recurrent architectures using first- and second-order RNNs, demonstrating its potential as an effective learning solution for the identification and synthesis of deterministic finite automata. Index Terms— Deterministic finite automaton (DFA), kernel adaptive filtering (KAF), recurrent neural network (RNN), reproducing kernel Hilbert space (RKHS).

I. I NTRODUCTION

K

ERNEL methods [1] create a powerful unifying framework for classification, clustering, and regression of both numeric and symbolic data, with countless applications in machine learning, signal processing, and biomedical engineering. The theory of adaptive signal processing is greatly enhanced through the integration of the theory of reproducing kernel Hilbert space (RKHS). Performing classical linear methods in an enriched feature space, kernel adaptive filtering (KAF) [2] moves beyond the limitations of the linear model to provide general nonlinear solutions in the original input space that is not restricted to numeric data (e.g., spike trains [3]). KAF brings together adaptive signal processing and feedforward artificial neural networks, by combining the best of both worlds: 1) the universal approximation property of neural networks and 2) the simple convex optimization of linear adaptive filters. Manuscript received August 15, 2014; revised March 27, 2015; accepted March 28, 2015. Date of publication April 28, 2015; date of current version January 18, 2016. This work was supported by the Defense Advanced Research Projects Agency under Contract N66001-10-C-2008. The authors are with the Computational NeuroEngineering Laboratory, University of Florida, Gainesville, FL 32611 USA (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2015.2418323

Since its introduction, KAF has rapidly gained traction, because of its simplicity and usefulness. Despite its youth, KAF is already well-established in [4]–[6] for solving online nonlinear system identification. However, there still exist two principal bottlenecks for current KAF implementations. First, the filter’s ability to cope with general nonlinear dynamical system (DS) characteristics is inadequate [7]. Most research has focused on time-delay feedforward implementations of kernel method. Feedforward systems have limited generalization capability and are best suited for learning a priori defined and fixed memory mappings of input–output pairs. The use of lengthy time-embedded inputs to provide memory for modeling dynamics results in increased computational overhead and large network size. The second bottleneck is the lack of a computationally efficient and compact solution. For feedforward kernel adaptive filters, the exact solution is almost never obtained, and its approximation may require an excessively high number of coefficients due to the growing nature of the radial basis function (RBF) network during training. To combat the computational and memory cost imposed by the linear growth, many sparsification techniques have been proposed to restrict the network size or budget [5], [8]–[12]. The central theme is to accept only those centers that conform to a given criterion and is most applicable for stationary learning. For nonstationary environment, various pruning strategies have been proposed to maintain the budget [13], [14]. The main shortcoming for this type of formulation is that the network learns and retains only the most recent dynamics. Whenever it updates, a portion of the past is forgotten or erased. Without any long-term memory, after enough time has elapsed, data have to be relearned from scratch when revisited. Both bottlenecks can be effectively addressed by a kernelized version of the recurrent neural network (RNN). RNNs are proven to be universal approximators of DSs [15] and have been successfully used for the identification and control of nonlinear DS [16]. A recurrent network operating with just the current sample achieves memory via feedback of internal states or the outputs through time-delay units. The input and output are no longer independent stationary vectors, but correlated temporal sequences. It is a much more powerful framework for describing the underlying dynamics of a system than a feedforward network. Despite their promising nature, RNNs’ popularity has not matched up to their potential, due to the difficulty of training.

2162-237X © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

LI AND PRÍNCIPE: KAARMA ALGORITHM

Standard gradient-based training techniques for RNNs include backpropagation through time (BPTT) [17], extended Kalman filtering (EKF) [18], and real-time recurrent learning (RTRL) [19]. A comprehensive review of RNNs is outside the scope of this paper (for recent work see [20]). A second approach to training RNNs utilizes a large network of randomly initialized recurrent connections or a dynamic reservoir. These weight values remain fixed, and only a readout layer is trained to optimally project the network states onto the desired output. This reservoir computing (RC) framework is further identified as echo state networks [21] for continuous valued inputs and liquid state machines (LSMs) [22] for spike train inputs. RC is generally thought to be free from the problems associated with gradient-based RNN training, such as slow convergence, local optima, and computational complexity. However, performance depends upon random parameters that need to be appropriately cross validated to find optimal solutions without which RC is a less reliable convex universal learning machines than KAF methods [23]. A kernelized RNN also opens the door to symbolic representations and simpler solutions. A discrete-time DS (DTDS) with a discrete state space can be modeled by a finite-state machine (FSM). Deterministic finite automata (DFA) are FSMs with all state transitions uniquely determined by input symbols. For nondeterministic finite automata, equivalent DFA can be derived using the subset construction algorithm [24]. From a systems theory perspective, finite automata are solutions to general DTDS. By bridging the theory of kernel adaptive signal processing with RNNs, we fill a void in the current theory of KAF. In this paper, we propose a novel exact gradient-following kernel spatio-temporal filter, which we call the kernel adaptive autoregressive-moving-average (KAARMA) algorithm. By mapping the input symbols into a potentially infinitedimensional feature space, an adaptive filter with feedback can be trained to approximate any dynamical or nonlinear time-dependent relationships in the original input space while preserving the conveniences associated with a linear structure in the RKHS. Moreover, by mimicking an FSM via feedback and fixed-point behavior, an exact compact solution can be derived from trained KAARMA network for a certain DTDS. We demonstrate the computational power of the KAARMA algorithm by solving a set of benchmark grammatical inference problems and comparing its performance with RNNs operating on equivalent recurrent architectures in the input space. Furthermore, we show that KAARMA-based DFA can outperform LSMs on spike data, which opens the door for many novel neuroscience applications. The rest of this paper is organized as follows. In Section II, we provide an overview of current developments in KAF. We present our novel KAARMA algorithm in Section III. The performance of the proposed algorithm is evaluated in Section IV. Section V concludes this paper. A brief note on automata theory and regular grammar for the identification and synthesis of DFA using the KAARMA approach is given in the Appendix.

335

II. S URVEY OF K ERNEL A DAPTIVE F ILTERS We begin with a survey of the recent history of KAF. This section serves to establish the notations and helps to distinguish our contributions. In the family of kernel adaptive filters, the kernel least mean square (KLMS) algorithm [2] is the simplest. A finite-impulse response (FIR) filter trained in the RKHS using the LMSs algorithm, it can be viewed as a single-layer feedforward neural network or perceptron. For a set of n input–output pairs {(u1 , y1 ), (u2 , y2 ), . . . , (un , yn )}, the input vector ui ∈ U ⊆ Rm (where U is a compact input domain in Rm ) is mapped into a potentially infinite-dimensional feature space F. Define a U → F mapping ϕ(u), the feature-space parametric model becomes yˆ = fˆ(u) = T ϕ(u)

(1)

where  is the weight vector in the RKHS. Using the representer theorem [25] and the kernel trick, (1) can be written as n  αi K(ui , u) (2) fˆ(u) = i=1

K(u, u )

is a Mercer kernel, corresponding to the inner where   product ϕ(u), ϕ(u ) , and αi are the coefficients. The most commonly used kernel is the Gaussian kernel Ka (u, u ) = exp(−au − u 2 )

(3)

where a > 0 is the kernel parameter. Despite their universal approximation property, conventional KAF algorithms share the same filter structure as their linear input-space analogs, thus performing poorly when modeling DSs. For a nonstationary environment, the adaptive filter is tasked with the additional job of tracking statistical variations. This steady-state phenomenon is distinguished from the convergence or transient behavior. The LMS filter assumes the following state-space model (SSM): xi+1 = xi yi = uiT xi + v i

(4) (5)

where xi is the state at time i and v i is the measurement noise. For the LMS and KLMS algorithms, the state or steady-state weight is fixed. The recursive least squares (RLSs) and kernel RLS (KRLS) [5] algorithms share a similar state-transition model [26] xi+1 = λ−1/2 xi

(6)

where 0 < λ ≤ 1 is a constant forgetting factor. In both formulations, the conditions are equivalent to a stationary environment. The extended KRLS (Ex-KRLS) algorithm [6] is the kernelized extended RLS algorithm [27] and can only model a random walk xi+1 = xi + wi

(7)

where wi is the state or process noise. For linear DS, the Kalman filter [28] is the optimal estimator on the merit of second-order statistics, with the

336

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 27, NO. 2, FEBRUARY 2016

RLS algorithm being a special case [26]. Several Kalman-based partial solutions have been proposed under the KAF framework for nonlinear SSMs. The Ex-KRLS with Kalman filtering [29] is a hybrid approach where the known transition function is constructed in the original state space (estimated using the EKF [30]), and the measurement model is learned using KRLS. For the simple additive noise measurement model, the kernel Kalman filtering (KKF) [31] is a batch method relying on subspace approximation by kernel principal component analysis. The KKF with conditional embedding operator algorithm [32] treats the estimated measurement embeddings as hidden states in the RKHS. The conditional distribution PY |x (y) in the RKHS can be estimated in the original input space through the conditional embedding operator [33] and KKF. To properly model a DS, both the state-transition and measurement equations have to be general solutions. The bottleneck lies in the recursive state-transition model, since the states are hidden. For multistep prediction problems, where the desired signal is not accessible at all times, an incremental learning approach can be taken, using the difference between successive predictions. This concept is generalized by the TD(λ) family of learning procedures, where 0 ≤ λ ≤ 1 is the exponential weighting factor [34]. In [35], the kernelized TD(λ) algorithm was proposed as a nonlinear solution. However, TD learning suffers from the same inadequacy as the LMS algorithm from a state-space view. The linear TD(1) procedure produces the same per-sequence weight changes as the LMS algorithm is proved in [34, Th. 1]. The notion of states is closely related to the concept of memory. Unlike an FIR system, the memory depth of an infinite-impulse response (IIR) filter is independent of the filter order and the number of adaptive parameters, making it ideal for modeling dynamics characterized by a deep memory structure yet with a small number of free parameters [36]. However, ensuring stability during IIR adaptation is difficult, and the error surface is nonconvex [37]. In [36], an FIR–IIR hybrid filter or generalized feedforward filter was proposed. In particular, the gamma (γ ) filter was analyzed in detail. It is an IIR filter with a restricted or adjustable memory depth, controlled by the adaptive parameter μ (when μ = 1, the γ -filter becomes an FIR filter). This paper is extended into the RKHS in [38], resulting in a γ -structured recursive kernel formulation that generalizes the KLMS algorithm. The kernel RTRL [39] is based on an RNN. However, because its derivation assumes a constant weight over the entire state trajectory, instantaneously altering it does not lead to true changes along the negative gradient of the cost function. For input-space RTRL, this problem can usually be avoided by lowering the learning rate, such that it is much slower than the time scale of the DS. Unfortunately, this bit of pragmatism does not translate directly into the RKHS. Unlike a conventional RNN that has the activation functions fixed in advance and only the weights are adjusted, the weights in the RKHS are functions in the input space. There is no guarantee that the functions are modified properly in real time, regardless of the initial condition or learning rate [39].

Fig. 1.

General SSM for DS.

In practice, it degrades to a teacher-forced multiple-input and multiple-output (MIMO) KLMS algorithm. Another class of kernel-induced feature space ARMA method involves support vector machines (SVMs). In [40], the SVM-ARMA2k method was formulated using uncoupled input and output feature spaces, and the SVM-ARMA4k was formulated using a composite sum kernel. However, both formulations rely on sufficient delay embedding for memory. Similarly, the support vector regression-based MIMO kernel ARMA method [41] assumes a fully observable state trajectory and uses the desired states for updates. Our approach differs from the previous research because KAARMA is developed as a true IIR system in the RKHS with a full SSM. All the previous KAF methods operate on subspaces or build the operators directly in the RKHS using statistical embedding, which are extremely time consuming. To make it practical as in KAF, we exploit the representer theorem. Unlike adaptive IIR filters [37], which consist of a feedforward element using delayed samples of the input ui− j +1 (where j ≥ 1) and a feedback element using past output samples yi− j , here, we explicitly construct and learn a hidden third input signal: states variables x. Furthermore, unlike the recursive kernel methods in [38] and [42], at each time step, we use the representer theorem and the kernel trick to realize the state vector in the input space, where system dynamics are completely encoded. Given the state, the current input is completely decoupled from the past inputs. Therefore, the KAARMA filter can operate using only the current input sample and achieves memory via feedback of internal states in the input space. Moreover, as a recurrent KAF algorithm, filter centers can be placed in a part of the space where data are defined and stability is ensured. The proposed gradient-descent adaptive procedure learns unknown general nonlinear continuous-time state-transition and measurement equations, using only an input sequence and the observed outputs. This viewpoint is inspired by RNNs for learning DS. III. K ERNEL A DAPTIVE ARMA A LGORITHM Let a DS (Fig. 1) be defined in terms of a general continuous nonlinear state transition and observation functions, f(·, ·) and h(·), respectively xi = f(xi−1 , ui )

(8)

yi = h(xi )

(9)

LI AND PRÍNCIPE: KAARMA ALGORITHM

337

where

T   f(xi−1 , ui ) = f (1) (xi−1 , ui ), . . . , f (n x ) (xi−1 , ui )  (n x ) T = x(1) (10) i , . . . , xi T   (1) (n y ) h(xi ) = h (xi ), . . . , h (xi )  (1) (n ) T = yi , . . . , yi y (11)

with input ui ∈ Rnu , state xi ∈ Rn x , output yi ∈ Rn y , and the parenthesized superscript (k) indicating the kth component of a vector or the kth column of a matrix. Note that the input, state, and output vectors have independent degrees of freedom or dimensionality. For simplicity, we rewrite (8) and (9) in terms of a new hidden state vector     f(xi−1 , ui )  xi = (12) si = yi h ◦ f(xi−1 , ui )    xi  (n s −n y +1:n s ) (13) yi = si = 0 In y yi I

where In y is an n y × n y identity matrix, 0 is an n y × n x zero matrix, and ◦ is the function composition operator. This augmented state vector si ∈ Rns is formed by concatenating the output yi with the original state vector xi . With this rewriting, measurement equation simplifies to a fixed selector  matrix I = [0 In y ]. Next, we define an equivalent transition function g(si−1 , ui ) = f(xi−1 , ui ) taking as argument the new state variable s. Using this notation, (8) and (9) become xi = g(si−1 , ui ) yi = h(xi ) = h ◦ g(si−1 , ui ).

(14) (15)

To learn the general continuous nonlinear transition and observation functions, g(·, ·) and h ◦ g(·, ·), respectively, we apply the theory of RKHS. First, we map the augmented state vector si and the input vector ui into two separate RKHSs as ϕ(si ) ∈ Hs and φ(ui ) ∈ Hu , respectively. By the representer theorem, the SSM defined by (14) and (15) can be expressed as the following set of weights (functions in the  input space) in the joint RKHS Hsu = Hs ⊗ Hu :   g(·, ·)   (16)  = Hsu = h ◦ g(·, ·) where ⊗ is the tensor-product operator. We define the new features in the tensor-product RKHS as 

ψ(si−1 , ui ) = ϕ(si−1 ) ⊗ φ(ui ) ∈ Hsu .

(17)

It follows that the tensor-product kernel is defined by: ψ(s, u), ψ(s , u ) Hsu = Ksu (s, u, s , u )

= (Ks ⊗ Ku )(s, u, s , u ) = Ks (s, s ) · Ku (u, u ).

(18)

This construction has several advantages over the simple concatenation of the input u and the state s. First, the tensor product kernel of two positive definite kernels is also

Fig. 2.

KAARMA network.

a positive definite kernel [1]. Second, since the adaptive filtering is performed in an RKHS using features, there is no constraint on the original input signals or the number of signals, as long as we use the appropriate reproducing kernel for each signal. Last but not least, this formulation imposes no restriction on the relationship between the signals in the original input space. This is important for input signals having different representations and spatiotemporal scales. For example, under this framework, we can model a neurobiological system, taking spike trains, continuous amplitude local field potentials, and vectorized state variables as inputs. Finally, the kernel SSM becomes si = T ψ(si−1 , ui ) yi = Isi .

(19) (20)

Fig. 2 shows a simple KAARMA network. In general, the states si are assumed hidden, and the desired does not need to be available at every time step, e.g., a deferred desired output value for yi may only be observed at the final indexed step i = f , i.e., d f . A. Kernel Adaptive Recurrent Filtering The learning procedure presented here computes the exact error gradient in the RKHS. For simplicity, we consider only the Gaussian kernel in the derivation. For the state and input vectors, the joint inner products are computed using Kas (s, s ) and Kau (u, u ), respectively. The cost function at time i is defined as 1 (21) εi = eiT ei 2 where ei = di − yi ∈ Rn y ×1 is the error vector, with di as the desired signal. The error gradient with respect to the RKHS weights i at time i is ∂eT ei ∂εi ∂y = i = −eiT i (22) ∂i 2∂i ∂i where the partial derivative ∂yi /∂i consists of n s (1) (2) (n ) terms, ∂yi /∂i , ∂yi /∂i , . . . , ∂yi /∂i s , corresponding to the state dimension. Note that the feature-space weights i are functions with potentially infinite dimension.

338

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 27, NO. 2, FEBRUARY 2016

Fortunately, functional derivative is well-posed in the RKHS, since Hilbert spaces are complete normed vector spaces, we can use the Fréchet derivative [43] to compute (22). For the kth component of i , the gradient can be expanded using the chain rule as ∂εi ∂(k) i

= −eiT

∂yi ∂(k) i

= −eiT

∂(k) i

=

(23)

∂iT ψ(si−1 , ui ) ∂(k) i ∂ψ(s i−1 , ui ) T

= i

(k)

∂i

T + I(k) n s ψ(si−1 , ui )

(24)

(k)

where Ins ∈ Rns is the kth column of the n s × n s identity matrix. The distinction between a recurrent formulation and the normal feedforward KAF lies in the gradient term on the right-hand side of (24). In a recurrent network, past states are coupled with the current state input through feedback. Consequently, the partial derivatives of the previous states with respect to the current filter weights are nonzero. Using the representer theorem, weights i at time i can be written as a linear combination of prior features  i =  i Ai

(25)



where  i = [ψ(s−1 , u0 ), . . . , ψ(sm−2 , um−1 )] ∈ Rnψ ×m is a collection of the m past tensor-product features with  potentially infinite dimension n ψ , and Ai = [α i,1 , . . . , α i,ns ] ∈ m×n s is the set of corresponding coefficients. For feedforward R KAF, such as the KLMS algorithm, the number of basis functions grows linearly with each new sample, i.e., m = i . Here, we use m to denote a dictionary  i of arbitrary size, with ψ(s−1 , u0 ) initialization. Thus, the kth component (1 ≤ k ≤ n s ) of the filter weights at time i becomes (k) (k) i =  i Ai =  i α i,k .

(26)

Substituting the expression for weights i in (25) into the feedback gradient on the right-hand side of (24) and applying the chain rule gives T

∂ψ(si−1 , ui ) ∂(k)

= AiT =

∂ iT ψ(si−1 , ui ) ∂si−1 (k) ∂si−1 ∂

2as AiT Ki DiT



Λi

∂si−1

∂(k)

i

.

(27)

i

Here, the partial derivation is evaluated using Gaussian tensor product kernel (18), where Ki = diag( iT ψ(si−1 , ui )) is a ( j, j ) diagonal matrix with eigenvalues Ki = Kas (s j , si−1 ) · 

∂si ∂(k) i

= Λi

∂si−1 ∂(k) i

T + I(k) n s ψ(si−1 , ui ) .

(28)

(k)

∂yi ∂si ∂si ∂(k) i

where (∂yi /∂si ) = I. Applying the product rule to the gradient yields ∂si

we call the state-transition gradient. Substituting (27) in (24) gives the following recursion:

Kau (u j , ui ) and Di = [(s−1 − si−1 ), . . . , (sm−2 − si−1 )] are the difference matrix between state centers of the filter and the current input state si−1 . We collect the gradient coefficients  in (27) into a matrix Λi = (∂si /∂si−1 ) = 2as AiT Ki DiT , which

The gradient ∂si /∂i measures the sensitivity of si , the state output at time i , to a small change in the kth weight component, considering the effect of such variation in the weight values over the entire state trajectory s0 , . . . , si−1 . In this evaluation, the initial state s0 , the input sequence ui1 , ( j) and the remaining weights (i , where j = k) are fixed. The (k) state gradient (∂si /∂i ) is backpropagated with respect to a constant weight via feedback. Clearly, (28) is independent of any teacher signal or error that the system may incur in the future and can be computed entirely from the observed data. Therefore, we can forward propagate the state gradients. Since the initial state is userdefined and functionally independent of the filter weights, (k) by setting (∂s0 /∂i ) = 0, the ensuing recursions in (28) become ∂s1 T = I(k) (29) n s ψ(s0 , u1 ) . ∂(k) i By induction, we can factor out the basis functions and express the recursion as ∂si T (k) T = Λi V(k) i−1  i−1 + In s ψ(si−1 , ui ) (k) ∂i    T (k)  i−1 , ψ(si−1 , ui ) = Λi V (k) i−1 , In s  = V(k) i  i

T

(30)



where   i = [  i−1 , ψ(si−1 , ui )] ∈ Rnψ ×i are centers generated by the input sequence and forward-propagated states from  (k) n s ×i a fixed filter weight i , and V(k) = [Λi V(k) i i−1 , In s ] ∈ R is the updated state-transition gradient, with initializations (k) (k)   1 = [ψ(s0 , u1 )] and V1 = Ins . Combining (23) with (30) gives the error gradient ∂εi T = −eiT IV(k) i  i. ∂(k) Updating the weights in the negative direction yields

(k) T (k) (k) i+1 = i + η  i IVi ei

T (k) (k) =  i Ai + η  i IVi ei   (k) Ai = [ i ,   i ] (k) T η IVi ei 

(k)

=  i+1 Ai+1

(31)

(32) (33)

where η is the learning rate. Since learning is provided in an LMS fashion in the RKHS, the conventional tradeoff of speed and accuracy exists, and we recommend the same rules as for the KLMS algorithm. The convex optimization behavior of KLMS is very likely lost (as in the IIR case) and stability of this class of algorithms needs to be fully studied. The proposed kernel adaptive recurrent filter is summarized in Algorithm 1.

LI AND PRÍNCIPE: KAARMA ALGORITHM

Algorithm 1 Kernel Adaptive ARMA Algorithm Initialization: n u : input dimension n s : state dimension n y : output dimension as : state kernel parameter au : input kernel parameter η: learning rate Randomly initialize input u0 ∈ R1×nu Randomly initialize states s−1 and s0 ∈ R1×ns Randomly initialize coefficient matrix A ∈ R1×ns  = [ψ(s−1 , u0 )]: feature matrix S = [s−1 ]: state dictionary m =1: dictionary size  n y I = 0 In y ∈ R ×ns : measurement matrix Computation: for time t = 1, . . . , n do Initialization   = [ ]: feature matrix update S = [ ]: state matrix update (k) (k) V1 = Ins ∈ Rns ×1 , for k = 1, . . . , n s . Update State-Transition Gradient Matrix for time i = 1, . . . , t do Generate Next State (19) si = T ψ(si−1 , ui ) Update  State Gradient (30)  Di = (S(1) − si−1 ), . . . , (S(m) − si−1 ) Ki = diag( T ψ(si−1 , ui )) Λi = 2as AT Ki DiT  (k) (k) (k) Vi = Λi Vi−1 , Ins Update Feature and State Matrices   = [  , ψ(si−1 , ui )] S = [S , si−1 ] Prediction yt = Ist Update Weights in the RKHS (33) et = d t ⎡ − yt ⎤ (k) Ai A(k) = ⎣  (k) T ⎦ η IVi ei     = ,  S = [S, S ]  = A m =m+t

For this forward-propagated online learning model, after each adaptation, the entire state trajectory, with respect to this updated fixed network i+1 , is reconstructed in order to compute the next state gradient ∂si+1 /∂i+1 . Unlike a feedforward KAF algorithm, where the network grows linearly with the number of processed samples or updates, the KAARMA filter grows quadratically, i.e., the updating term   i in (32) consists of i new centers (not including the constant initial state), which forms the revised state trajectory. For the case where the filter is updated after each new input,

339

Algorithm 2 Quantization Algorithm Initialization: q: quantization threshold as : state kernel parameter au : input kernel parameter : feature matrix A: coefficient matrix   : feature matrix update A : coefficient update m: dictionary size m  : new centers size Computation: for i = 1, . . . , m s do (i) dis(  , )  = min as si − s j  + au ui − u j  ∗ 

1≤ j ≤m

j = arg min as si − s j  + au ui − u j  1≤ j ≤m

if dis(  (i) , ) < q then A( j ∗ ) = A( j ∗ ) + A (i ) else    = [, ψ(s i , ui )] A A= A (i ) m =m+1

the unconstrained RBF  structure of the KAARMA algorithm at time i has size ij =1 j = (i (i + 1)/2). For a sequence of length n and an update frequency of once per input symbol, the KAARMA memory and computational complexities are O(n 2 ) and O(n 3 ), respectively, the same as KRLS. For classification tasks, where the update frequency is only once per sequence, the memory and computational complexities are reduced to O(n) and O(n 2 ), respectively, the same as KLMS. The complexity of the KAARMA algorithm would be greatly reduced if the weights are allowed to change instantaneously as in RTRL. However, this effectively introduces an undesirable secondary feedback (since the observed state trajectory depends on any weight change made by the learning algorithm), one that is intractable in the functional spaces, regardless of the learning rate. A far more sensible way to combat the computational complexity is to use the quantization technique in [10], since KAARMA naturally forms feature clusters during training. We modify it slightly for Gaussian tensor-product reproducing kernels. This implementation is the quantized KAARMA (QKAARMA). Algorithm 2 outlines the quantization procedure of QKAARMA, which constrains the network growth in the last five lines of Algorithm 1. Each new centers from the feature update   is compared with the existing ones. If the minimum joint distance (input and state) is below a quantization thresh old q, the new center   (i) = ψ(si , ui ) is simply discarded, and its corresponding coefficients A (i ), where (i ) indicates

340

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 27, NO. 2, FEBRUARY 2016

the i th row, are added to the nearest existing neighbor’s (indexed by j ∗ ), thus updating the weights without growing the network structure. B. Kernel Backpropagation Through Time Alternatively, we can compute the exact error gradient in the recurrent topology using BPTT in the RKHS. During training, we can treat the recurrent network as a single multistage feedforward network. By unfolding the filter dynamics in time, through a series of feedforward networks of identical weights, we effectively remove the feedback. To adapt the weights, the output error at the present time is transported back in time or filtered through the dual networks of the cascade FIR filters. Here, we show the relationship between kernel BPTT (KBPTT) and Algorithm 1. We decouple the feedback for a single iteration by connecting the state input of the feedforward network t at time t to the state output of the same feedforward network replicated at time t − 1, i.e., t −1 = t . For this two-stage network, the error gradient consists of the contribution from the external error taken at time t − 1 ∂εt −1 T = −etT−1 I I(k) (34) n s ψ(st −2 , ut −1 ) ∂(k) t −1 added with the filtered error gradient from time t ∂εt (k) ∂t −1

= −etT =

∂yt ∂st ∂st −1 ∂st ∂st −1 ∂(k)

−etT IΛt

∂st −1 (k)

∂t −1

t −1

(35)

which has the same form as (28) and equals (31) if we let et = et −1 , i.e., the average behavior of the multilayered feedforward network is equivalent to the recurrent system’s. The filtered error gradient L time steps ahead is then ∂εt +L ∂(k) t

∂yt +L ∂st +L ∂st ··· ∂st +L ∂st +L−1 ∂(k) t   L−1  T = −etT I Λt +L−i I(k) n s ψ(st −1 , ut ) . (36) = −etT+L

Algorithm 3 Kernel Backpropagation Through Time Initialization: L: window size Computation: for time t = 1, . . . , n do Unfold the Network and Forward Pass t0 = max(t − L + 1, 1): start time for time i = t0 , . . . , t do si = T ψ(si−1 , ui ) yi = Isi ei = di − yi Backpropagation Through Time (36) eTL I = −etT0 I for i = 1, . . . , L − 1do  i−1 eTL I = eTL I − etT I j =0 Λt0 +i− j Update Weights in the RKHS (k) ∂ε L = eTL I Ins ψ(st0 −1 , ut0 )T ∂(k) L T (k) = (k) − η( ∂ε(k) ) ∂ Update Next Initial State Estimate st0 = st0 + IT e L

Aside from the training complexity, there are two notable issues plaguing RNNs trained using direct gradient descent methods, namely, the vanishing and exploding gradient problems [44]. For sigmoid and tanh activation functions, the error flow tends to vanish for learning long-term dependence. A full analysis for the KAARMA algorithm is outside the scope of this paper and will be pursued in the future. Nonetheless, we note that the KAARMA algorithm is less prone to these problems, since the filter operates directly in the functional space, using smoothing functions as reproducing kernels. Furthermore, the quantization technique in Algorithm 2 provides additional stability and robustness, ensuring that only a compact dictionary is formed, in a part of the space where the system is stable. Nonetheless, we deal with the rare case of exploding gradients by restricting the dynamic range of the hidden states using simple amplitude clipping.

i=0

For certain applications, instead of always backpropagating the error to the initial state, which contributes significantly to the computational cost, we can open up a window of a fixed interval L into the past, relative to the current time, and adapt the weights of this L-stage multilayered FIR filter. This truncated or fixed-window KBPTT is summarized in Algorithm 3. In this formulation, the window size L is userdefined and independent of the filter order. The per iteration storage and computational complexity is of a constant factor L. Furthermore, we can treat the initial state of each window as an adaptive parameter and update its values using the backpropagated or filtered error. Similary, we can modify Algorithm 1 to limit the recursion to L steps by setting (∂st0 −1 /∂(k) i ) = 0, then propagating and updating the next initial states st0 . There are many different types of BPTT and here we simply work with the simplest that is an approximation because the window is finite.

IV. S IMULATION R ESULTS We demonstrate the computational power of the KAARMA algorithm by modeling and providing exact compact solutions to DSs, both primary bottlenecks of conventional KAF algorithms. In particular, we evaluate the performance of the QKAARMA algorithm for the task of syntactic pattern recognition. Identification and synthesis of DFA are performed using the set of Tomita grammars (Table I) [45], which has served as the benchmark for RNNs. A brief review on DFA and formal language is given in the Appendix. Note that the field of grammatical inference is very broad [46]–[48]. Here, we will focus on pure data driven learning approaches as the ones started with [49] and continued in [50] and [51]. The KAARMA algorithm is well suited for grammatical inference or DFA identification. Each training string is presented to KAARMA one symbol u t at a time (Fig. 2).

LI AND PRÍNCIPE: KAARMA ALGORITHM

341

TABLE I T OMITA G RAMMARS

Fig. 4.

Fig. 3.

QKAARMA generated DFA for Tomita grammar #4.

QKAARMA generated DFA for Tomita grammar #1.

The last component of the state vector indicates the response,  (n ) i.e., yt = st s . Since error is evaluated at the end of a sequence, KAARMA only takes linear time to process each string. Because KAARMA forces the state vector space into well-separated partitions or subspaces, corresponding to state nodes in an FSM [with accepting states indicated by positive response values s(ns ) > 0], we can extract DFA from trained KAARMA networks, similar to [49]. Furthermore, the quantization technique outlined in Algorithm 2 or QKAARMA can be effectively employed with no performance loss in the final solution. Training data consist of stimulus-response pairs labeled by source grammar. Greater emphasis is given to grammar #4, which has been the subject of significant research interest in [49], [52], and [53]. Similar to [49] for second-order RNNs, the training set for grammar #1 consists of 1000 randomly generated binary strings of length 1–15 (mean of 7.758). Unlike epoch learning used in [49], each string is processed only once by QKAARMA, with state dimension n s = 4. The states and coefficient matrix A are randomly initialized, uniformly from (−0.5, 0.5) to (0, 1), respectively. Gaussian kernel parameters as = au = 2 are used, with learning rate η = 0.1. Using a quantization factor q = 0.5, out of the 7759 possible centers, the trained QKAARMA network consists of only 20 centers. The DFA extraction and minimization procedures follow those in [49]. The quantization factor used during DFA extraction is q = 0.8. Fig. 3 shows the extracted DFA for grammar #1 on the left and the minimized DFA on the right, which QKAARMA correctly identified. Initial states are labeled 0 for s0 and shown as black dots. The value above each state label indicates its response in the state vector space. Final states are shaded green. The empty string is not learned, i.e., initial states are

Fig. 5. Performance of QKAARMA on Tomita grammar #4 averaged over 100 different initializations. TABLE II QKAARMA DFA S IZE FOR T OMITA G RAMMARS

assumed to be nonfinal. Edges are color coded with blue (dashed line) for 0 input and red (solid line) for 1. Self transitions appear as large circles around state nodes. For Tomita grammar #4, the same training set and QKAARMA parameters are used, with the solution shown in Fig. 4. Again, the correct minimal DFA is inferred (not accepting empty strings). To show stability, we run QKAARMA 100 times using random initialization. A separate test set (200 sequences of length of 20) is used to generate the learning curve shown in Fig. 5. The top plot shows the average response of the network during training, with the learning curve plotted in the middle. We see that after 400 updates, the average filter has learned enough to start discriminating the strings, and learning begins to accelerate.

342

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 27, NO. 2, FEBRUARY 2016

TABLE III C OMPARISON OF QKAARMA TO G RADIENT-BASED RNNs AND BY RG

By the time training reached 700 sequences, test error reduced by half. For most initialization, QKAARMA has learned the complete dynamics by 900 strings, and the DFA extracted after this point is an exact solution. This demonstrates that QKAARMA is not only a fast algorithm, but also robust to initial conditions. In contrast, the second-order RNN in [49] takes at least 49 epochs to converge on a 1024-string training set, with several of them failing to converge after 5000 epochs. Fig. 5 (bottom) shows the average network size as a function of the processed strings. This inverse relationship between the growth rate and the learning curve is expected. After an update, centers are added to describe newfound dynamics, which in turn, enhances the generalization capability. During periods of growth plateau, learning does not stop, since QKAARMA fine-tunes the filter coefficients based on the error gradient, even when network size is unchanged. We tabulate the results for all seven Tomita grammars in Table II. The QKAARMA algorithm is able to identify correctly each DFA. Furthermore, we see that despite the network growth issue inherent in kernel recurrent learning, the KAARMA network size can be effectively constrained without performance loss, using the appropriate quantization. This compact filter representation greatly facilitates the DFA extraction process. The state diagrams synthesized from trained QKAARMA for the remaining grammars are shown in Fig. 6. In the case of Tomita grammar #5, the extracted automaton is the minimized form. This is partly due to training parameters and the size of the minimum DFA solution. In addition, grammar constraints as manifested in the symmetry of the minimum DFA makes correct nondistinguishable states difficult to form after training. A. Comparison With RNN The KAARMA algorithm operates on the same recurrent topology as a second-order RNN, however, the filtering is performed in the RKHS. Here, we compare the performance of QKAARMA with first- and second-order RNNs in [54] and the random weight guessing technique [55]. Tomita grammars #3 and #5 are omitted in the comparison, since their definitions in [54] differ from the originals (Table I). Random guessing (RG) results are only available for

parity-free Tomita grammars (1, 2, and 4). The average performance over ten random initializations are presented in Table III. For each inference algorithm, the average training size, number of test set errors made by the trained network, test set accuracy, network size (type), successful extraction rate (convergence rate × the rate DFA matches correct grammar), and the unminimized extracted DFA size are enumerated, whenever available. For RNNs, first- and second-order networks of 3–9 neurons, with and without bias, were tested. Since QKAARMA is a much faster algorithm, for RNNs and RG, the configuration with the fastest convergence in training samples is selected for comparison. Unlike the RNNs, which are epoch trained on all binary strings of length 0–9, in alphabetical order with an elaborate scheme involving cycles of working and recycle sets, the QKAARMA networks are trained on random strings of length 1–9 in the target languages. This is a more difficult problem, since it has been shown that learning is enhanced using lexicographic order in string presentation [56]. In practice, one cannot expect the training samples to be a complete ordered set, especially for unknown grammars. The QKAARMA state dimension is fixed at n s = 4. Test set consists of all strings of length 10–15 (64 512 total), same as the RNNs’. Training is stopped when the test error is

The Kernel Adaptive Autoregressive-Moving-Average Algorithm.

In this paper, we present a novel kernel adaptive recurrent filtering algorithm based on the autoregressive-moving-average (ARMA) model, which is trai...
3MB Sizes 4 Downloads 12 Views