
Neural Network-Based Finite-Horizon Optimal Control of Uncertain Affine Nonlinear Discrete-Time Systems

Qiming Zhao, Hao Xu, Member, IEEE, and Sarangapani Jagannathan, Senior Member, IEEE

Abstract— In this paper, the finite-horizon optimal control design for nonlinear discrete-time systems in affine form is presented. In contrast with the traditional approximate dynamic programming methodology, which requires at least partial knowledge of the system dynamics, in this paper the need for the complete system dynamics is relaxed by utilizing a neural network (NN)-based identifier to learn the control coefficient matrix. The identifier is then used together with an actor–critic-based scheme to learn the time-varying solution, referred to as the value function, of the Hamilton–Jacobi–Bellman (HJB) equation in an online and forward-in-time manner. Since the solution of the HJB equation is time-varying, NNs with constant weights and time-varying activation functions are considered. To properly satisfy the terminal constraint, an additional error term is incorporated in the novel update law such that the terminal constraint error is also minimized over time. Policy and/or value iterations are not needed, and the NN weights are updated once per sampling instant. The uniform ultimate boundedness of the closed-loop system is verified by standard Lyapunov stability theory under nonautonomous analysis. Numerical examples are provided to illustrate the effectiveness of the proposed method.

Index Terms— Finite-horizon, Hamilton–Jacobi–Bellman (HJB) equation, neural network (NN), optimal control.

I. INTRODUCTION

CONVENTIONALLY, for linear systems with quadratic cost, the optimal regulation problem (LQR) can be tackled by solving the well-known Riccati equation (RE) [1], given the system dynamics (e.g., A and B). In addition, the solution is obtained offline and backward-in-time from the terminal constraint. In the infinite-horizon case, the solution of the RE becomes a constant and the RE reduces to the algebraic RE. However, the optimal control of nonlinear systems in affine form is more challenging, since it requires the solution to the Hamilton–Jacobi–Bellman (HJB) equation. In the infinite-horizon case, the HJB equation reduces to a time-invariant partial differential or difference equation. Therefore, in recent years, adaptive or neural network (NN)-based optimal control over the infinite horizon has been studied for both linear and nonlinear systems; see [2]–[4]. However, the finite-horizon optimal control of such nonlinear systems still remains unresolved.

Manuscript received May 17, 2013; revised February 15, 2014; accepted March 23, 2014. This work was supported in part by the EECS under Grant 1128281 and in part by the Intelligent Systems Center. The authors are with the Department of Electrical and Computer Engineering, Missouri University of Science and Technology, Rolla, MO 65401 USA (e-mail: [email protected]; [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2014.2315646

First, for general affine nonlinear systems, the solution to the HJB equation in the finite-horizon case is inherently time-varying [1], which complicates the analysis. Second, a terminal constraint is imposed on the cost function, whereas this constraint is taken as zero in the infinite-horizon case. The traditional approximate dynamic programming (ADP) techniques [4], [8], [9] address the optimal control problem by solving the HJB equation iteratively. Though iteration-based methods are mature, they are not preferred for real-time implementation, since an inadequate number of iterations within a sampling interval can cause instability [2].

Beard [5] considered the finite-horizon optimal control of continuous-time nonlinear systems by iteratively solving the generalized HJB equation via the Galerkin method from the terminal time. Cheng et al. [6] proposed a fixed final-time optimal control of general affine nonlinear continuous-time systems using an NN with time-dependent weights and state-dependent activation functions to solve the HJB equation through backward integration. On the other hand, Heydari and Balakrishnan [8] considered the finite-horizon optimal control of nonlinear discrete-time systems with input constraints using an offline-trained direct heuristic dynamic programming (DHDP)-based scheme utilizing an NN with constant weights and time-varying activation functions. Similarly, Wang et al. [9] considered the finite-horizon optimal control of discrete-time systems using an iteration-based ADP technique. However, in [9], the terminal time is not specified.

The past literature [5], [6], [8], [9] on solving the finite-horizon optimal control of nonlinear systems utilizes either backward-in-time integration or iteration-based offline training, which requires a significant number of iterations within each sampling interval to guarantee system stability. On the other hand, other ADP schemes [17] normally relax the drift dynamics, while the control coefficient matrix is still needed [3]. Therefore, a real-time finite-horizon optimal control scheme, which can be implemented in an online and forward-in-time manner with completely unknown system dynamics and without using value and policy iterations, is yet to be developed.

Therefore, in this paper, a novel approach is proposed to solve the finite-horizon optimal control of uncertain affine nonlinear discrete-time systems in an online and forward-in-time manner. First, the control coefficient matrix is generated using a novel NN-based identifier, which functions in an online manner. Next, an error term corresponding to



the terminal constraint is defined and minimized over time such that the terminal constraint can be properly satisfied. To handle the time-varying nature of the solution to the HJB equation, or value function, NNs with constant weights and time-varying activation functions are utilized. In addition, in contrast with [8] and [9], the control policy is updated once per sampling instant, and value/policy iterations are avoided by incorporating history information. Finally, due to the time-dependency of the optimal control policy, the closed-loop system becomes essentially nonautonomous, and the stability of the proposed design scheme is demonstrated using Lyapunov stability analysis.

In this paper, the proposed methodology is developed under mild assumptions which are standard in the literature, such as the polynomial activation function [7] for the NN. In addition, an actor NN is needed to approximate the control input in order to relax the need for x_{k+1} [3], whereas in the case of continuous-time systems, only a critic NN is required [7].

The main contribution of the paper includes the development of an optimal adaptive NN control scheme over a finite horizon for nonlinear discrete-time systems without using value and/or policy iterations. An online identifier to generate the system dynamics is introduced, tuning laws for all the NNs are derived, and a Lyapunov stability analysis is provided.

The rest of the paper is organized as follows. In Section II, the background and formulation of finite-horizon optimal control for affine nonlinear discrete-time systems are introduced. In Section III, the main control design scheme along with the stability analysis is addressed. In Section IV, simulation results are given to verify the feasibility of the approach. Concluding remarks are provided in Section V.

II. BACKGROUND AND PROBLEM FORMULATION

In this paper, the finite-horizon optimal regulation of affine nonlinear discrete-time systems is investigated. The system is described as

x_{k+1} = f(x_k) + g(x_k) u_k        (1)

where x_k ∈ ℝ^n represents the state vector, f(x_k) ∈ ℝ^n and g(x_k) ∈ ℝ^{n×m} are smooth nonlinear functions considered to be unknown, and u_k ∈ ℝ^m is the control input vector. It is assumed that 0 < ‖g(x_k)‖ < g_M, with g_M a positive constant. The objective of the optimal control design is to determine a state feedback control policy that minimizes the time-varying value or cost function given by

V(x_k, k) = ψ(x_N) + Σ_{k=i}^{N−1} r(x_k, u_k, k)        (2)

where [i, N] is the time span of interest, ψ(x_N) is the terminal constraint that penalizes the terminal state x_N, and r(x_k, u_k, k) = Q(x_k, k) + u_k^T R u_k is an in-general time-varying function of the state and control input at each intermediate time k in [i, N], where Q(x_k, k) ∈ ℝ and R ∈ ℝ^{m×m} are a positive semidefinite function and a positive definite symmetric weighting matrix, respectively. It should be noted that in the finite-horizon scenario, the control inputs can be time-varying, i.e., u_k = μ(x_k, k).
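For concreteness, a minimal sketch of evaluating the cost (2) along a recorded trajectory is given below (Python with NumPy; the quadratic choices of Q(x_k, k) and ψ(x_N) are illustrative assumptions mirroring the later simulation setups, not part of the formulation).

```python
import numpy as np

def finite_horizon_cost(xs, us, R, Q_fn, psi_fn):
    """V(x_i, i) = psi(x_N) + sum_{k=i}^{N-1} [Q(x_k, k) + u_k^T R u_k]
    for a trajectory xs = [x_i, ..., x_N] and inputs us = [u_i, ..., u_{N-1}]."""
    cost = psi_fn(xs[-1])                          # terminal penalty psi(x_N)
    for k, (x, u) in enumerate(zip(xs[:-1], us)):
        cost += Q_fn(x, k) + u @ R @ u             # running cost r(x_k, u_k, k)
    return cost

# Illustrative (assumed) quadratic choices:
Q_fn = lambda x, k: x @ x                          # Q(x_k, k) = x_k^T x_k
psi_fn = lambda x: x @ x                           # psi(x_N) = x_N^T x_N
```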

Setting k = N, the terminal constraint for the value function is given as

V(x_N, N) = ψ(x_N).        (3)

Remark 1: In general, the terminal penalty ψ(x_N) is a function of the state at the terminal stage N and is not necessarily in quadratic form. In the case of the standard LQR, Q(x_k, k) and ψ(x_N) take the quadratic forms Q(x_k, k) = x_k^T S_k x_k and ψ(x_N) = x_N^T S_N x_N, and the optimal control policy can be obtained by solving the RE in a backward-in-time fashion from the terminal value S_N. It is also important to note that in the finite-horizon case, the value function (2) becomes essentially time-dependent, in contrast with the infinite-horizon case, where the problem is developed in a forward-in-time manner [2], [3].

By Bellman's principle of optimality [1], the optimal cost from k onward is given by

V*(x_k, k) = min_{u_k} { r(x_k, u_k, k) + V*(x_{k+1}, k+1) }.        (4)

The optimal control policy u*_k that minimizes the value function V*(x_k, k) is obtained using the stationarity condition ∂V*(x_k, k)/∂u_k = 0 and is revealed to be

u*_k = −(1/2) R^{−1} g^T(x_k) ∂V*(x_{k+1}, k+1)/∂x_{k+1}.        (5)
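The stationarity step leading to (5) can be written out explicitly; the following LaTeX fragment records the standard calculation implied by (1) and (4).

```latex
% Minimize the bracket in (4) over u_k, with r(x_k,u_k,k) = Q(x_k,k) + u_k^T R u_k
% and x_{k+1} = f(x_k) + g(x_k) u_k, so that \partial x_{k+1} / \partial u_k = g(x_k):
\frac{\partial}{\partial u_k}\left[ Q(x_k,k) + u_k^T R u_k + V^*(x_{k+1},k+1) \right]
  = 2 R u_k + g^T(x_k)\,\frac{\partial V^*(x_{k+1},k+1)}{\partial x_{k+1}} = 0,
\qquad
u_k^* = -\frac{1}{2} R^{-1} g^T(x_k)\,
        \frac{\partial V^*(x_{k+1},k+1)}{\partial x_{k+1}}.
```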

From (5), it is clear that even if the full system dynamics were available, the optimal control could not be obtained directly for nonlinear discrete-time systems due to its dependency on the future state x_{k+1}. To avoid this drawback and relax the requirement on the system dynamics, iteration-based schemes with offline-trained NNs are normally utilized [4]. However, iteration-based schemes are not preferred for hardware implementation, since the number of iterations needed to ensure stability cannot be easily determined. In addition, iterative approaches cannot be implemented when the dynamics of the system are completely unknown, since at least the control coefficient matrix g(x_k) is required to generate the control policy [3]. In contrast, in this paper, a solution is found with completely unknown dynamics and without utilizing an iterative approach, as given in Section III.

III. FINITE-HORIZON OPTIMAL REGULATION WITH COMPLETELY UNKNOWN DYNAMICS

In this section, the finite-horizon optimal regulation of nonlinear discrete-time systems in affine form with completely unknown system dynamics is addressed. First, to relax the requirement of known system dynamics, a novel NN-based identifier is designed to learn the true system dynamics in an online manner. Next, the actor–critic methodology is proposed: the time-varying value function is approximated by a critic network, while the control inputs are generated by an actor network, with both NNs having the structure of constant weights and time-varying activation functions. To satisfy the terminal constraint, an additional error term is defined and incorporated in the novel NN update law such that this error is also minimized over time. The stability of the closed-loop system is demonstrated under nonautonomous analysis


by Lyapunov theory, showing that the parameter estimation remains bounded as the system evolves.

A. NN-Based Identifier Design

Due to their online learning capability, NNs are commonly used for estimation and control. According to the universal approximation property [19], the system dynamics (1) can be rewritten on a compact set Ω using an NN representation as

x_{k+1} = f(x_k) + g(x_k) u_k
        = W_f^T σ_f(x_k) + ε_{fk} + W_g^T σ_g(x_k) u_k + ε_{gk} u_k
        = [W_f^T  W_g^T] [σ_f(x_k)  0; 0  σ_g(x_k)] [1; u_k] + [ε_f  ε_g] [1; u_k]
        = W^T σ(x_k) ū_k + ε̄_k        (6)

where

W = [W_f; W_g] ∈ ℝ^{L×n},  σ(x_k) = [σ_f(x_k)  0; 0  σ_g(x_k)] ∈ ℝ^{L×(1+m)},  ū_k = [1; u_k] ∈ ℝ^{1+m}

and ε̄_k = [ε_f  ε_g] ū_k ∈ ℝ^n, with L being the number of hidden neurons. In addition, the target NN weights are assumed to be upper bounded as ‖W‖ ≤ W_M, where W_M is a positive constant, while the NN activation function and reconstruction error are assumed to be bounded above as ‖σ(x_k)‖ ≤ σ_M and ‖ε̄_k‖ ≤ ε̄_M, with σ_M and ε̄_M positive constants. Note that, to match the dimensions, W can be constructed by stacking zeros in W_f or W_g, which does not change the universal approximation property of the NN. Therefore, the system dynamics x_{k+1} = f(x_k) + g(x_k)u_k can be identified by estimating the NN weight matrix W.

Using the NN identifier, the system state at time k can be estimated as

x̂_k = Ŵ_k^T σ(x_{k−1}) ū_{k−1}        (7)

where Ŵ_k is the estimate of W. Define the identification error as

e_k = x_k − x̂_k = x_k − Ŵ_k^T σ(x_{k−1}) ū_{k−1}.        (8)

Then, the identification error dynamics of (8) can be expressed as

e_{k+1} = x_{k+1} − x̂_{k+1} = x_{k+1} − Ŵ_{k+1}^T σ(x_k) ū_k.        (9)

Next, by incorporating the history information, define an augmented error vector as

Ξ_k = X_k − Ŵ_k^T Θ_{k−1} U_{k−1}        (10)

where X_k = [x_k  x_{k−1}  · · ·  x_{k−l}] ∈ ℝ^{n×(l+1)}, Θ_{k−1} = [σ(x_{k−1})  σ(x_{k−2})  · · ·  σ(x_{k−l−1})] ∈ ℝ^{L×(1+m)(l+1)}, and U_{k−1} = blockdiag(ū_{k−1}, ū_{k−2}, . . . , ū_{k−l−1}) ∈ ℝ^{(1+m)(l+1)×(l+1)}. It can be seen that (10) includes a time history of the previous l + 1 identification errors recalculated using the most recent weights Ŵ_k.

Similarly to (10), the dynamics of the augmented identification error vector become

Ξ_{k+1} = X_{k+1} − Ŵ_{k+1}^T Θ_k U_k.        (11)

Next, the update law for the NN identifier weights Ŵ_k can be defined as

Ŵ_{k+1} = Θ_k U_k (U_k^T Θ_k^T Θ_k U_k)^{−1} (X_{k+1} − αΞ_k)^T        (12)

where 0 < α < 1 is a design parameter. Substituting (12) into (11) yields

Ξ_{k+1} = X_{k+1} − Ŵ_{k+1}^T Θ_k U_k
        = X_{k+1} − (X_{k+1} − αΞ_k)(U_k^T Θ_k^T Θ_k U_k)^{−1} U_k^T Θ_k^T Θ_k U_k
        = αΞ_k.        (13)

Remark 2: For the above identification scheme, Θ_k U_k needs to be persistently exciting (PE) long enough for the NN identifier to learn the true system dynamics. The PE condition is well known in adaptive control theory [21] and can be satisfied by adding probing noise [20]. This update law is a nonstandard one, similar to that derived in [3], which includes the history information; it ensures that the augmented error vector converges exponentially.

Next, to find the NN weights estimation error dynamics, define W̃_k = W − Ŵ_k. Recalling (6) and (7), the identification error dynamics can be expressed as

e_{k+1} = x_{k+1} − x̂_{k+1}
        = W^T σ(x_k) ū_k + ε̄_k − Ŵ_{k+1}^T σ(x_k) ū_k
        = W̃_{k+1}^T σ(x_k) ū_k + ε̄_k.        (14)

Using e_{k+1} = αe_k from (13), we have

W̃_{k+1}^T σ(x_k) ū_k + ε̄_k = α(W̃_k^T σ(x_{k−1}) ū_{k−1} + ε̄_{k−1})        (15)

or equivalently

W̃_{k+1}^T σ(x_k) ū_k = αW̃_k^T σ(x_{k−1}) ū_{k−1} + αε̄_{k−1} − ε̄_k.        (16)
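A compact numerical sketch of the update (12) is given below (Python/NumPy; the array shapes and random test data are illustrative assumptions). Note how the resulting augmented error contracts exactly as in (13).

```python
import numpy as np

def identifier_update(Theta_U, X_next, Xi_k, alpha=0.4):
    """Weight update (12): W_{k+1} = Theta_k U_k (U^T Th^T Th U)^{-1} (X_{k+1} - alpha Xi_k)^T.

    Theta_U : (L, l+1)  regressor stack Theta_k U_k (columns sigma(x_j) ubar_j)
    X_next  : (n, l+1)  state stack X_{k+1}
    Xi_k    : (n, l+1)  augmented identification error (10)
    """
    G = Theta_U.T @ Theta_U                  # Gram matrix; PE keeps it invertible
    W_next = Theta_U @ np.linalg.solve(G, (X_next - alpha * Xi_k).T)
    Xi_next = X_next - W_next.T @ Theta_U    # new augmented error via (11)
    return W_next, Xi_next

# Quick check of the contraction (13) on random data (assumed sizes L=8, n=3, l=4):
rng = np.random.default_rng(0)
Theta_U = rng.standard_normal((8, 5))
X_next = rng.standard_normal((3, 5))
Xi_k = rng.standard_normal((3, 5))
_, Xi_next = identifier_update(Theta_U, X_next, Xi_k, alpha=0.4)
assert np.allclose(Xi_next, 0.4 * Xi_k)      # Xi_{k+1} = alpha * Xi_k
```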

Next, the boundedness of the NN weights estimation error W̃_k will be demonstrated in Theorem 1. The following definition is needed before proceeding.

Definition 1 [19]: An equilibrium point x_e is said to be uniformly ultimately bounded (UUB) if there exists a compact set S ⊂ ℝ^n such that, for every initial state x_0 ∈ S, there exist a bound B and a time T(B, x_0) such that ‖x_k − x_e‖ ≤ B for all k ≥ k_0 + T.

Theorem 1 (Boundedness of the NN Identifier): Let the proposed NN identifier be defined as in (7) and the update law for tuning the NN weights be given by (12). Then, there exists a positive constant α satisfying 0 < α < 1/2 such that


the identification error e_k as well as the NN weights estimation error W̃_k are UUB, with the bounds given in (A.5) and (A.6).

Proof: See the Appendix.

Remark 3: In the proof, the inequality 0 < θ_m² ≤ ‖Θ_k‖² ≤ ‖Θ_k U_k‖² holds since Θ_k U_k satisfies the PE condition [2], so that the NN identifier is able to learn the system dynamics. It should also be noted that, for the identification scheme, the control input is assumed to be bounded, which is consistent with the literature, since the main purpose of this section is to show the effectiveness of the identifier design. This assumption will be relaxed in the final theorem, where the convergence of the overall closed-loop system is shown under the proposed control design.

B. Optimal NN Controller Design

In this section, the finite-horizon optimal regulation design is proposed. To handle the time-dependency of the value function, two NNs with constant weights and time-varying activation functions are utilized to approximate the time-varying value function and the control input, respectively. An additional error term corresponding to the terminal constraint is also defined and minimized over time such that the terminal constraint can be properly satisfied. Due to the time-dependent nature of the finite-horizon problem, the closed-loop stability of the system will be shown by Lyapunov theory.

By the universal approximation property of NNs [19] and the actor–critic methodology, the value function and control input can be represented by a critic NN and an actor NN, respectively, as

V(x_k, k) = W_V^T σ_V(x_k, N − k) + ε_V(x_k, k)        (17)

and

u(x_k, k) = W_u^T σ_u(x_k, N − k) + ε_u(x_k, k)        (18)

where W_V and W_u are the constant target NN weights, σ_V(x_k, N − k) and σ_u(x_k, N − k) are the time-varying activation functions incorporating the time-to-go, and ε_V(x_k, k) and ε_u(x_k, k) are the NN reconstruction errors for the critic and actor networks, respectively. The target NN weights are assumed to be upper bounded as ‖W_V‖ ≤ W_{VM} and ‖W_u‖ ≤ W_{uM}, where both W_{VM} and W_{uM} are positive constants [19]. The NN activation functions and reconstruction errors are also assumed to be upper bounded as ‖σ_V(x_k, N − k)‖ ≤ σ_{VM}, ‖σ_u(x_k, N − k)‖ ≤ σ_{uM}, |ε_V(x_k, k)| ≤ ε_{VM}, and ‖ε_u(x_k, k)‖ ≤ ε_{uM}, with σ_{VM}, σ_{uM}, ε_{VM}, and ε_{uM} all positive constants [19]. In addition, in this paper, the gradient of the reconstruction error is also assumed to be upper bounded as ‖∂ε_V(x_k, k)/∂x_{k+1}‖ ≤ ε'_{VM}, with ε'_{VM} a positive constant [3], [14].

Similarly to (17), the terminal constraint of the value function can also be written in NN representation as

V(x_N, N) = W_V^T σ_V(x_N, 0) + ε_V(x_N, N)        (19)

with σ_V(x_N, 0) and ε_V(x_N, N) having the same meaning as σ_V(x_k, N − k) and ε_V(x_k, k) but corresponding to the terminal state. Note that the activation function takes the form σ_V(x_N, 0) at the terminal stage since, by definition, the time-varying activation function incorporates the time-to-go.

Remark 4: The fundamental difference between this paper and [8] is that the proposed algorithm yields a completely forward-in-time and online solution without using either value/policy iterations or offline training, whereas the algorithm proposed in [8] is essentially an iteration-based DHDP scheme performed offline.

1) Value Function Approximation: The time-varying value function V(x_k, k) can be approximated by the critic NN as

V̂(x_k, k) = Ŵ_{V,k}^T σ_V(x_k, N − k)        (20)

where V̂(x_k, k) represents the estimated value function (2) and Ŵ_{V,k} is the estimate of the target NN weights W_V. The basis function should satisfy σ_V(0) = 0 for x = 0 to guarantee that V̂(0) = 0 [1]. The terminal constraint can be represented by

V̂(x_N, N) = Ŵ_{V,k}^T σ_V(x̂_N, 0)        (21)

where x̂_N is an estimate of the terminal state. It should be noted that, since the true value of x_N is not known, x̂_N can be considered a guess of x_N and can be chosen at random as long as x̂_N lies within the stability region for a stabilizing control policy [4], [8].

To ensure optimality, the Bellman equation should hold along the system trajectory. According to the principle of optimality, the true Bellman equation is given by

r(x_k, u_k, k) + V(x_{k+1}, k+1) − V(x_k, k) = 0.        (22)

However, (22) no longer holds when the NN approximation is considered. Therefore, with estimated values, the Bellman equation (22) becomes

e_k^B = r(x_k, u_k, k) + V̂(x_{k+1}, k+1) − V̂(x_k, k)
      = r(x_k, u_k, k) + Ŵ_{V,k}^T σ_V(x_{k+1}, N − k − 1) − Ŵ_{V,k}^T σ_V(x_k, N − k)
      = r(x_k, u_k, k) + Ŵ_{V,k}^T Δσ_V(x_k, N − k)        (23)

where e_k^B is the Bellman error along the system trajectory and Δσ_V(x_k, N − k) = σ_V(x_{k+1}, N − k − 1) − σ_V(x_k, N − k).

Next, recalling (21), define an additional error term corresponding to the terminal constraint as

e_k^N = ψ(x_N) − Ŵ_{V,k}^T σ_V(x̂_N, 0).        (24)

The objective of the optimal control design is thus to minimize the Bellman error e_k^B as well as the terminal constraint error e_k^N as the system evolves. Hence, define the total error as

e_k^total = e_k^B + e_k^N.        (25)

Based on gradient descent, the update law for the critic NN can be defined as

Ŵ_{V,k+1} = Ŵ_{V,k} − α_V σ_1(x_k, N − k) e_k^total / (1 + σ_1^T(x_k, N − k) σ_1(x_k, N − k))        (26)

where σ_1(x_k, N − k) = Δσ_V(x_k, N − k) − σ_V(x̂_N, 0), σ_1(x_k, N − k) is bounded as σ_{1m} ≤ ‖σ_1(x_k, N − k)‖ ≤ σ_{1M}, and α_V is a design parameter whose range is given in Theorem 2.
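The pieces (23)–(26) combine into one per-step critic update, sketched below (Python/NumPy; the basis function sigma_V, the terminal-state guess x_N_hat, and the cost functions are assumed to be supplied by the designer and are not taken from the paper).

```python
import numpy as np

def critic_update(WV, x, u, x_next, k, N, psi_xN, x_N_hat,
                  sigma_V, R, Q_fn, alpha_V=0.1):
    """Critic update (26) driven by the total error (25) = Bellman (23) + terminal (24)."""
    r = Q_fn(x, k) + u @ R @ u                                  # r(x_k, u_k, k)
    d_sigma = sigma_V(x_next, N - k - 1) - sigma_V(x, N - k)    # Delta sigma_V(x_k, N-k)
    e_B = r + WV @ d_sigma                                      # Bellman error (23)
    e_N = psi_xN - WV @ sigma_V(x_N_hat, 0)                     # terminal error (24)
    e_total = e_B + e_N                                         # total error (25)
    sigma1 = d_sigma - sigma_V(x_N_hat, 0)                      # combined regressor in (26)
    WV_next = WV - alpha_V * sigma1 * e_total / (1.0 + sigma1 @ sigma1)
    return WV_next, e_B, e_N
```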


Remark 5: Two points need to be clarified in the update law (26). First, the total error is minimized so that optimality can be achieved while the terminal constraint is also properly satisfied. Second, the activation function σ_1(x_k, N − k) is a combination of the activation function along the system trajectory and the activation function at the terminal stage. For the infinite-horizon case, the update law reduces to a standard gradient-descent algorithm with a time-invariant activation function, and the terms corresponding to the terminal constraint all become zero, i.e., e_k^total = e_k^B and σ_1(x_k) = Δσ_V(x_k).

Next, to find the error dynamics, define W̃_{V,k} = W_V − Ŵ_{V,k}. Recalling the Bellman equation (22) and the definition of the value function (17), we have

r(x_k, u_k, k) + V(x_{k+1}, k+1) − V(x_k, k)
  = r(x_k, u_k, k) + W_V^T σ_V(x_{k+1}, N − k − 1) + ε_V(x_{k+1}, k+1) − W_V^T σ_V(x_k, N − k) − ε_V(x_k, k)
  = r(x_k, u_k, k) + W_V^T Δσ_V(x_k, N − k) + Δε_V(x_k, k)
  = 0        (27)

where Δε_V(x_k, k) = ε_V(x_{k+1}, k+1) − ε_V(x_k, k). Hence, we have

r(x_k, u_k, k) = −W_V^T Δσ_V(x_k, N − k) − Δε_V(x_k, k).        (28)

Substituting (28) into (23) yields

e_k^B = r(x_k, u_k, k) + Ŵ_{V,k}^T Δσ_V(x_k, N − k)
      = −W_V^T Δσ_V(x_k, N − k) − Δε_V(x_k, k) + Ŵ_{V,k}^T Δσ_V(x_k, N − k)
      = −W̃_{V,k}^T Δσ_V(x_k, N − k) − Δε_V(x_k, k).        (29)

Next, recalling (19), the terminal constraint error e_k^N can be written as

e_k^N = ψ(x_N) − Ŵ_{V,k}^T σ_V(x̂_N, 0)
      = W_V^T σ_V(x_N, 0) + ε_V(x_N, 0) − Ŵ_{V,k}^T σ_V(x̂_N, 0)
      = W_V^T σ_V(x_N, 0) − W_V^T σ_V(x̂_N, 0) + W_V^T σ_V(x̂_N, 0) + ε_V(x_N, 0) − Ŵ_{V,k}^T σ_V(x̂_N, 0)
      = W̃_{V,k}^T σ_V(x̂_N, 0) + W_V^T σ̃_V(x_N, 0) + ε_V(x_N, 0)        (30)

where σ̃_V(x_N, 0) = σ_V(x_N, 0) − σ_V(x̂_N, 0). Hence, the total error (25) becomes

e_k^total = e_k^B + e_k^N
          = −W̃_{V,k}^T Δσ_V(x_k, N − k) − Δε_V(x_k, k) + W̃_{V,k}^T σ_V(x̂_N, 0) + W_V^T σ̃_V(x_N, 0) + ε_V(x_N, 0)
          = −W̃_{V,k}^T σ_1(x_k, N − k) + W_V^T σ̃_V(x_N, 0) + ε(x_k, k)        (31)

where ε(x_k, k) = −Δε_V(x_k, k) + ε_V(x_N, 0).

Finally, by substituting (31) into the update law (26), the error dynamics for W̃_{V,k} are revealed to be

W̃_{V,k+1} = W̃_{V,k} − α_V σ_1(x_k, N − k) σ_1^T(x_k, N − k) W̃_{V,k} / (1 + σ_1^T(x_k, N − k) σ_1(x_k, N − k))
            + α_V σ_1(x_k, N − k) (σ̃_V^T(x_N, 0) W_V + ε(x_k, k)) / (1 + σ_1^T(x_k, N − k) σ_1(x_k, N − k)).        (32)

Next, the boundedness of the estimation error of the critic NN weights is presented in the following theorem.

Theorem 2 (Boundedness of the Critic NN Weights): Let u_0(x_k) be an admissible control for the system (1), and let the update law for the critic NN be given by (26). Then, there exists a positive constant α_V with 0 < α_V < 2σ_{1m}²/3(1 + σ_{1M}²) such that the critic NN weights estimation error W̃_{V,k} is UUB with a computable bound B_V given in (A.12).

Proof: See the Appendix.

2) Approximation of the Optimal Feedback Control Signal: In this section, the optimal control policy is obtained such that the estimated value function (20) is minimized. The actor NN approximation of (18) is defined as

û(x_k, k) = Ŵ_{u,k}^T σ_u(x_k, N − k)        (33)

where Ŵ_{u,k} is the estimate of the target actor NN weights. Next, define the actor error as the difference between the control policy applied to (1) and the control policy that minimizes the estimated value function (20), denoted as

ũ(x_k, k) = û_1(x_k, k) − û_2(x_k, k)        (34)

where û_1(x_k, k) = Ŵ_{u,k}^T σ_u(x_k, N − k) and

û_2(x_k, k) = −(1/2) R^{−1} ĝ^T(x_k) ∂V̂(x_{k+1}, k+1)/∂x_{k+1}
            = −(1/2) R^{−1} ĝ^T(x_k) ∇σ_V^T(x_{k+1}, N − k − 1) Ŵ_{V,k}

where ∇ denotes the gradient, ĝ(x_k) is the estimated control coefficient matrix from the NN-based identifier, and V̂(x_{k+1}, k+1) is the approximated value function from the critic network. Hence, (34) becomes

ũ(x_k, k) = Ŵ_{u,k}^T σ_u(x_k, N − k) + (1/2) R^{−1} ĝ^T(x_k) ∇σ_V^T(x_{k+1}, N − k − 1) Ŵ_{V,k}.        (35)

The update law for tuning the actor NN weights can then be defined as

Ŵ_{u,k+1} = Ŵ_{u,k} − α_u σ_u(x_k, N − k) ũ^T(x_k, k) / (1 + σ_u^T(x_k, N − k) σ_u(x_k, N − k))        (36)

where α_u > 0 is a design parameter.

Recall that the control policy (18) minimizes the value function (17); then we have

u(x_k, k) = W_u^T σ_u(x_k, N − k) + ε_u(x_k, k)
          = −(1/2) R^{−1} g^T(x_k) [∇σ_V^T(x_{k+1}, N − k − 1) W_V + ∇ε_V(x_{k+1}, k+1)]


or equivalently

0 = W_u^T σ_u(x_k, N − k) + ε_u(x_k, k) + (1/2) R^{−1} g^T(x_k) ∇σ_V^T(x_{k+1}, N − k − 1) W_V
    + (1/2) R^{−1} g^T(x_k) ∇ε_V(x_{k+1}, k+1).        (37)

To find the error dynamics for the actor NN weights Ŵ_{u,k}, define W̃_{u,k} = W_u − Ŵ_{u,k}. Subtracting (37) from (35) yields

−ũ(x_k, k) = W̃_{u,k}^T σ_u(x_k, N − k) + (1/2) R^{−1} g^T(x_k) ∇σ_V^T(x_{k+1}, N − k − 1) W_V
             + (1/2) R^{−1} g^T(x_k) ∇ε_V^T(x_{k+1}, k+1)
             − (1/2) R^{−1} ĝ^T(x_k) ∇σ_V^T(x_{k+1}, N − k − 1) Ŵ_{V,k} + ε_u(x_k, k).        (38)

Next, for simplicity, write σ_{u,k} = σ_u(x_k, N − k), ∇σ_{V,k+1} = ∇σ_V(x_{k+1}, N − k − 1), ∇ε_{V,k+1} = ∇ε_V(x_{k+1}, N − k − 1), g_k = g(x_k), ĝ_k = ĝ(x_k), g̃_k = g̃(x_k), and ε_{u,k} = ε_u(x_k, k). Then, adding and subtracting (1/2) R^{−1} ĝ^T(x_k) ∇σ_{V,k+1}^T W_V and rearranging terms yields

ũ(x_k, k) = −W̃_{u,k}^T σ_{u,k} − (1/2) R^{−1} g̃_k^T ∇σ_{V,k+1}^T W_V − (1/2) R^{−1} ĝ_k^T ∇σ_{V,k+1}^T W̃_{V,k} − ε̃_{u,k}        (39)

where g̃_k = g_k − ĝ_k and ε̃_{u,k} = (1/2) R^{−1} g_k^T ∇ε_{V,k+1}^T + ε_{u,k}. Furthermore, it can be easily concluded that ε̃_{u,k} satisfies ‖ε̃_{u,k}‖ ≤ ε̃_{uM}, where ε̃_{uM} is a positive constant. Finally, the error dynamics for the actor NN weights are revealed to be

W̃_{u,k+1} = W̃_{u,k} + α_u σ_{u,k} ũ^T(x_k, k) / (1 + σ_{u,k}^T σ_{u,k})
          = W̃_{u,k} − α_u (σ_{u,k} / (1 + σ_{u,k}^T σ_{u,k}))
            × [W̃_{u,k}^T σ_{u,k} + (1/2) R^{−1} g̃_k^T ∇σ_{V,k+1}^T W_V + (1/2) R^{−1} ĝ_k^T ∇σ_{V,k+1}^T W̃_{V,k} + ε̃_{u,k}]^T.        (40)

It should be noted that, from the above analysis, the true control coefficient matrix g(x_k) is not needed for updating the actor NN, in contrast with [3]. Instead, the approximated control coefficient matrix ĝ(x_k) from the NN identifier is utilized to find the control input; hence the requirement of partial knowledge of the system dynamics is relaxed.

To complete this section, the flowchart of the proposed scheme is shown in Fig. 1. The information for the steps k = 1, 2, . . . , l + 1 is first collected with the initial admissible control for the first identifier NN weight update. Then, the critic, actor, and identifier NNs are updated based on the proposed weight tuning laws at each sampling interval in an online and forward-in-time fashion; a compact sketch of one such time step is given after Fig. 1.

Fig. 1. Flowchart of the finite-horizon optimal control design.
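Matching Fig. 1, a sketch of the actor update (36) is given below (Python/NumPy; sigma_u, grad_sigma_V, g_hat, and R_inv refer to components sketched earlier and are assumptions about how a designer might organize the pieces, not the authors' implementation).

```python
import numpy as np

def actor_update(Wu, WV, x, x_next, k, N, g_hat, R_inv,
                 sigma_u, grad_sigma_V, alpha_u=0.01):
    """Actor update (36): drive the applied policy toward the value-minimizing one."""
    s_u = sigma_u(x, N - k)
    u1 = Wu.T @ s_u                                             # hat{u}_1 in (34)
    # hat{u}_2 in (34), using the identifier estimate g_hat in place of g(x_k):
    u2 = -0.5 * R_inv @ g_hat.T @ grad_sigma_V(x_next, N - k - 1).T @ WV
    u_tilde = u1 - u2                                           # actor error (35)
    return Wu - alpha_u * np.outer(s_u, u_tilde) / (1.0 + s_u @ s_u)

# One online time step (sketch): apply the actor policy, observe x_{k+1},
# then call identifier_update, critic_update, and actor_update once each,
# with no inner value/policy iterations.
```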

C. Convergence Analysis

In this section, it will be shown that the closed-loop system remains bounded. Before proceeding, the following lemma is needed.

Lemma 1 (Bounds on the Optimal Closed-Loop Dynamics): Consider the affine nonlinear discrete-time system defined in (1). Then there exists an optimal control policy u*_k for (1) such that the closed-loop system dynamics f(x_k) + g(x_k)u*_k satisfy

‖f(x_k) + g(x_k) u*_k‖² ≤ k* ‖x_k‖²        (41)

where 0 < k* < 1/2 is a constant.

Theorem 3 (Convergence of the Finite-Horizon Optimal Control Signal): Let u_0(x_k) be an initial stabilizing control policy for (1), and let the NN weight update laws for the identifier, critic network, and actor network be given by (12), (26), and (36), respectively. Then, there exist positive constants α, α_V, and α_u satisfying

0 < α < 1/2        (42)

0 < α_u < 3/7        (43)

0 < α_V < σ_{1m}² / 4(1 + σ_{1M}²)        (44)

respectively, where 0 < σ_{1m} ≤ ‖σ_{1k}‖ ≤ σ_{1M}, with σ_{1k} the time-dependent activation function of the value function, such that the system state x_k, the NN identification error e_k, the identifier weight estimation error W̃_k, and the critic and actor network weight estimation errors W̃_{V,k} and W̃_{u,k} are all UUB at the terminal stage N, with the bounds b_x, b_Ξ, b_e, b_W̃, and b_W̃V given in (A.22)–(A.26).

Proof: See the Appendix.

IV. SIMULATIONS

In this section, a linear system is first utilized, followed by a practical two-link robot nonlinear system. For the linear system, one can compare the RE-based solution with the proposed scheme.

A. Linear Case

The proposed finite-horizon optimal control design scheme is first evaluated on a linear example. Consider the system

x_{k+1} = [0.8  1; 0  0.6] x_k + [1; 0.5] u_k.        (45)

The weighting matrices for the performance index (2) are selected as Q(x_k, k) = 0.5 x_k^T x_k and R = 1. For comparison purposes, the terminal constraint is selected as ψ(x_N) = 0; a nonzero terminal constraint is considered in the nonlinear case. For the NN setup, the input to the identifier NN is chosen as z_k = [x_k, u_k], and the time-varying activation functions for both the critic and actor networks are chosen as σ_V(x_k, N − k) = σ_u(x_k, N − k) = [x_1, x_1 exp(−τ), x_2, x_2 exp(−τ), x_1², x_2², x_1 x_2 τ]^T, which results in seven neurons, where τ = (N − k)/N is the normalized time-to-go. The design parameters are chosen as α = 0.4, α_V = 0.1, and α_u = 0.01. The initial admissible control gain is selected as K(0) = [0.3, −0.3], and the initial system state is x_0 = [0.5, −0.5]^T. The critic and actor NN weights are both initialized as zeros. The simulation results are shown below.

First, the system response is shown in Fig. 2. It can be clearly observed from Fig. 2 that the system states converge close to the origin within finite time, which confirms that the system remains stable under the proposed design scheme.

Fig. 2. System response.

Next, to show the feasibility of the proposed optimal control design scheme, the Bellman error as well as the terminal constraint error is plotted in Fig. 3. The figure shows that the Bellman equation error converges close to zero, which illustrates that the proposed controller design indeed achieves optimality. More importantly, the convergence of the terminal constraint error indicates that the terminal constraint is also properly satisfied.

Fig. 3. Error history.

Next, the convergence of the critic and actor NN weights is shown in Fig. 4(a) and (b), respectively.

Fig. 4. (a) Convergence of critic NN weights. (b) Convergence of actor NN weights.
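For the RE-based benchmark used in the comparison below, the finite-horizon LQR gains for (45) follow from the standard backward Riccati recursion; a minimal sketch is given here (Python/NumPy; the horizon N = 50 is an illustrative assumption, and S_N = 0 matches the zero terminal constraint chosen above).

```python
import numpy as np

A = np.array([[0.8, 1.0], [0.0, 0.6]])
B = np.array([[1.0], [0.5]])
Q = 0.5 * np.eye(2)           # Q(x_k, k) = 0.5 x_k^T x_k
R = np.array([[1.0]])
N = 50                        # horizon length (assumption)

# Backward-in-time Riccati recursion from the terminal value S_N = 0 (psi = 0).
S = np.zeros((2, 2))
gains = []
for k in reversed(range(N)):
    K = np.linalg.solve(R + B.T @ S @ B, B.T @ S @ A)   # time-varying LQR gain
    S = Q + A.T @ S @ A - A.T @ S @ B @ K
    gains.append(K)
gains.reverse()               # gains[k] applies at time k: u_k = -gains[k] @ x_k
```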

From the results, it can be clearly observed that both weights converge to constants and remain bounded as desired.

Finally, to compare the proposed design with the traditional RE-based design, the cost is shown in Fig. 5. It can be observed from the figure that the difference between the cost computed from the traditional RE-based approach and that of the proposed approach converges more quickly than the system states, which illustrates the validity of the proposed method.

Fig. 5. Cost error between two methods.

B. Nonlinear Case

Now, consider the two-link planar robot arm depicted in Fig. 6. The continuous-time dynamics of the two-link robot arm are given by [22]

[α + β + 2η cos q_2   β + η cos q_2;  β + η cos q_2   β] [q̈_1; q̈_2]
  + [−η(2q̇_1 q̇_2 + q̇_2²) sin q_2;  η q̇_1² sin q_2]
  + [α e_1 cos q_1 + η e_1 cos(q_1 + q_2);  η e_1 cos(q_1 + q_2)] = [τ_1; τ_2]        (46)

where α = (m_1 + m_2) a_1², β = m_2 a_2², η = m_2 a_1 a_2, and e_1 = g/a_1.

Fig. 6. Two-link planar robot arm.

In the simulation, the parameters are chosen to be m_1 = m_2 = 1 kg, a_1 = a_2 = 1 m, and g = 10 m/s². Hence, α = 2, β = 1, η = 1, and e_1 = 10. Define the system states as x = [x_1, x_2, x_3, x_4]^T = [q_1, q_2, q̇_1, q̇_2]^T and the control inputs as u = [u_1, u_2]^T = [τ_1, τ_2]^T. Then, the system dynamics can be written in the affine form ẋ = f(x) + g(x)u, where

f(x) = [ x_3;
         x_4;
         (−(2x_3 x_4 + x_4² − x_3² − x_3² cos x_2) sin x_2 + 20 cos x_1 − 10 cos(x_1 + x_2) cos x_2) / (cos² x_2 − 2);
         ((2x_3 x_4 + x_4² + 2x_3 x_4 cos x_2 + x_4² cos x_2 + 3x_3² + 2x_3² cos x_2 + 20(cos(x_1 + x_2) − cos x_1)(1 + cos x_2) − 10 cos x_2 cos(x_1 + x_2)) / (cos² x_2 − 2) ]

g(x) = [ 0   0;
         0   0;
         1/(2 − cos² x_2)   (−1 − cos x_2)/(2 − cos² x_2);
         (−1 − cos x_2)/(2 − cos² x_2)   (3 + 2 cos x_2)/(2 − cos² x_2) ].

Discretizing the continuous-time system with a sufficiently small sampling interval T_s, the discrete-time version of the system can be written as x_{k+1} = f(x_k) + g(x_k)u_k, where

f(x_k) = [ T_s x_{3k} + x_{1k};
           T_s x_{4k} + x_{2k};
           (−(2x_{3k} x_{4k} + x_{4k}² − x_{3k}² − x_{3k}² cos x_{2k}) sin x_{2k} + 20 cos x_{1k} − 10 cos(x_{1k} + x_{2k}) cos x_{2k}) / ((cos² x_{2k} − 2)/T_s) + x_{3k};
           ((2x_{3k} x_{4k} + x_{4k}² + 2x_{3k} x_{4k} cos x_{2k} + x_{4k}² cos x_{2k} + 3x_{3k}² + 2x_{3k}² cos x_{2k} + 20(cos(x_{1k} + x_{2k}) − cos x_{1k})(1 + cos x_{2k}) − 10 cos x_{2k} cos(x_{1k} + x_{2k})) / ((cos² x_{2k} − 2)/T_s) + x_{4k} ]

g(x_k) = [ 0   0;
           0   0;
           1/((cos² x_{2k} − 2)/T_s)   (−1 − cos x_{2k})/((cos² x_{2k} − 2)/T_s);
           (−1 − cos x_{2k})/((cos² x_{2k} − 2)/T_s)   (3 + 2 cos x_{2k})/((cos² x_{2k} − 2)/T_s) ].

In the simulation, we choose T_s = 0.001, and the value function is given in the form of (2), with Q(x_k, k) = x_k^T x_k and R = 0.005I. The initial states and admissible control gain are selected as x(0) = [π/3, π/6, 0, 0]^T and K(0) = [−50, 0, −50, 0; 20, 0, 20, −20], and the terminal constraint is given as ψ(x_N) = 8. For the NN setup, the activation function for the identifier is constructed from the expansion of the even polynomial (x_1 + · · · + x_n)^{2β}, β = 1, . . . , M/2, where M is the order of approximation and n is the dimension of the system. In our case, n = 4 and we choose M = 4, which results in 45 neurons. For the critic and actor networks, the state-dependent part of the time-varying activation functions is also chosen as the expansion of the even polynomial, with M = 4 and M = 2, which results in 45 and 10 neurons, respectively, while the time-dependent parts are selected as polynomials of the time-to-go with saturation, i.e., {0, (N − k)/L_i, (N − k)/(L_i − 1), . . . , N − k}, where N is the terminal time and L_i is the number of neurons. In our case, L_1 = 45 and L_2 = 10. The saturation of the time-dependent part of the activation function ensures that its magnitude stays within a reasonable range so that the parameter estimation remains computable. The tuning parameters are chosen as α = 0.3, α_V = 0.01, and α_u = 0.1. All the initial NN weights are randomly selected in [0, 1]. A sketch of this activation-function construction is given below.
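The even-polynomial state basis and the saturated time-to-go terms described above can be assembled as follows (Python sketch; the exact ordering of monomials and the pairing of state- and time-dependent parts are illustrative assumptions, not the authors' exact construction).

```python
import numpy as np
from itertools import combinations_with_replacement

def even_poly_basis(x, M):
    """Monomials of even total degree 2, 4, ..., M in the entries of x,
    i.e., the expansion terms of (x_1 + ... + x_n)^(2*beta)."""
    feats = []
    for deg in range(2, M + 1, 2):
        for idx in combinations_with_replacement(range(len(x)), deg):
            feats.append(np.prod(x[list(idx)]))
    return np.array(feats)        # n = 4, M = 4 gives 10 + 35 = 45 terms

def time_varying_activation(x, k, N, M):
    """sigma(x, N - k): state basis scaled by saturated time-to-go polynomials."""
    phi = even_poly_basis(x, M)
    L = len(phi)
    # Saturated time-to-go factors {(N-k)/L_i, (N-k)/(L_i - 1), ..., N-k}, clipped.
    ttg = np.clip((N - k) / np.arange(L, 0, -1.0), 0.0, N)
    return phi * ttg
```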

The simulation results are shown below. First, the system response, control inputs, and identification errors are given in Figs. 7 and 8, respectively. It can be clearly observed from the figures that the system states, control inputs, and identification errors converge close to the origin in finite time, which shows the stability of the system and the effectiveness of the NN identifier design.

Fig. 7. System response and control inputs.

Fig. 8. Identification errors.

Next, to show the feasibility of the proposed optimal control design scheme, the error histories are shown in Fig. 9. Similar trends as in the linear case are observed: both the Bellman equation error and the terminal constraint error converge close to zero as the system evolves, which illustrates that the proposed algorithm not only achieves optimality but also satisfies the terminal constraint.

Fig. 9. Error histories.

Finally, due to the large number of neurons for the critic and actor NNs, the Frobenius norms of the NN weights are shown in Fig. 10. It can be clearly observed from the figure that the NN weights converge to constants as desired.

Fig. 10. Convergence of critic and actor NN weights.

V. CONCLUSION

In this paper, the finite-horizon optimal control of affine nonlinear discrete-time systems is addressed with completely unknown system dynamics. First, an NN-based identifier generates a suitable control coefficient matrix such that the control input can be computed. Next, the actor–critic structure is utilized to approximately find the optimal control policy. The time-varying nature of the finite-horizon optimal control problem is handled using NNs with constant weights and time-varying activation functions. An additional error term corresponding to the terminal constraint is minimized to guarantee that the terminal constraint is properly satisfied. In addition, the proposed algorithm is implemented utilizing a history of cost-to-go errors instead of a traditional iteration-based scheme. The proposed algorithm yields an online and forward-in-time design scheme which enjoys great practical advantages.


The convergence of the parameter estimation and of the closed-loop system is demonstrated using Lyapunov stability theory under nonautonomous analysis. Simulation results are provided to verify the theoretical claims.

APPENDIX

PROOF OF THEOREM 1

First, observe that 0 < θ_m² ≤ ‖Θ_k‖² ≤ ‖Θ_k U_k‖², where θ_m is a positive constant; this is ensured by the PE condition. Define the Lyapunov candidate function as

L_ID(k) = e_k^T e_k + θ_m² tr(W̃_k^T W̃_k) + ‖W̃_k^T σ(x_{k−1}) ū_{k−1}‖²        (A.1)

where tr(•) denotes the trace operator. The first difference of L_ID(k) becomes

ΔL_ID(k) = L_ID(k+1) − L_ID(k)
         ≤ e_{k+1}^T e_{k+1} − e_k^T e_k + θ_m² [tr(W̃_{k+1}^T W̃_{k+1}) − tr(W̃_k^T W̃_k)]
           + ‖W̃_{k+1}^T σ(x_k) ū_k‖² − ‖W̃_k^T σ(x_{k−1}) ū_{k−1}‖².        (A.2)

Recalling e_{k+1} = αe_k and (16), (A.2) can be further written as

ΔL_ID(k) ≤ α²‖e_k‖² − ‖e_k‖² + ‖αW̃_k^T σ(x_{k−1}) ū_{k−1} + αε̄_{k−1} − ε̄_k‖²
           − θ_m² ‖W̃_k‖² − ‖W̃_k^T σ(x_{k−1}) ū_{k−1}‖².        (A.3)

Using the Cauchy–Schwarz inequality, (A.3) can be written as

ΔL_ID(k) ≤ −(1 − α²)‖e_k‖² + 4α²‖W̃_k^T σ(x_{k−1}) ū_{k−1}‖² + 4‖αε̄_{k−1} − ε̄_k‖²
           − θ_m² ‖W̃_k‖² − ‖W̃_k^T σ(x_{k−1}) ū_{k−1}‖²
         ≤ −(1 − α²)‖e_k‖² − θ_m² ‖W̃_k‖² − (1 − 4α²)‖W̃_k^T σ(x_{k−1}) ū_{k−1}‖² + ε̄'_M
         ≤ −(1 − α²)‖e_k‖² − θ_m² ‖W̃_k‖² + ε̄'_M        (A.4)

where 4‖αε̄_{k−1} − ε̄_k‖² ≤ ε̄'_M due to the boundedness of the NN reconstruction error, with ε̄'_M a positive constant. Therefore, the first difference of L_ID(k) is less than zero outside of a compact set as long as one of the following conditions holds:

‖e_k‖ > sqrt( ε̄'_M / (1 − α²) ) ≡ b_e        (A.5)

or

‖W̃_k‖ > sqrt( ε̄'_M / θ_m² ) ≡ b_W̃.        (A.6)

PROOF OF THEOREM 2

Define the Lyapunov candidate function as

L_V(W̃_{V,k}) = W̃_{V,k}^T W̃_{V,k}.        (A.7)

The first difference of L_V(W̃_{V,k}) is given by

ΔL_V(W̃_{V,k}) = W̃_{V,k+1}^T W̃_{V,k+1} − W̃_{V,k}^T W̃_{V,k}.        (A.8)

Denote σ_{1k} = σ_1(x_k, N − k) for simplicity. Substituting (32) into (A.8) yields

ΔL_V(W̃_{V,k}) = −2α_V W̃_{V,k}^T σ_{1k} σ_{1k}^T W̃_{V,k} / (1 + σ_{1k}^T σ_{1k})
               + 2α_V W̃_{V,k}^T σ_{1k} (σ̃_V^T(x_N, 0) W_V + ε(x_k, k)) / (1 + σ_{1k}^T σ_{1k})
               + α_V² ‖σ_{1k} σ_{1k}^T W̃_{V,k} − σ_{1k} (σ̃_V^T(x_N, 0) W_V + ε(x_k, k))‖² / (1 + σ_{1k}^T σ_{1k})².        (A.9)

Noticing that σ_{1k}^T σ_{1k} / (1 + σ_{1k}^T σ_{1k}) < 1 and applying the Cauchy–Schwarz inequality yields

ΔL_V(W̃_{V,k}) ≤ −2α_V (σ_{1k}^T σ_{1k} / (1 + σ_{1k}^T σ_{1k})) ‖W̃_{V,k}‖²
               + 2α_V ‖W̃_{V,k}^T σ_{1k}‖ ‖σ̃_V^T(x_N, 0) W_V + ε(x_k, k)‖ / (1 + σ_{1k}^T σ_{1k})
               + 2α_V² ‖W̃_{V,k}‖² + 2α_V² ‖σ_{1k}‖² ‖σ̃_V^T(x_N, 0) W_V + ε(x_k, k)‖² / (1 + σ_{1k}^T σ_{1k})².        (A.10)

Note that σ_{1k} is a time-dependent activation function, and hence the Lyapunov candidate function becomes nonautonomous. Recall that the time span of interest is finite and σ_{1k} is a smooth function, so that σ_{1k} is bounded as 0 < σ_{1m} ≤ ‖σ_{1k}‖ ≤ σ_{1M}. Then, separating the cross term and recalling the bounds ‖σ̃_V(x_N, 0)‖ ≤ 2σ_{VM}, ‖W_V‖ ≤ W_M, and ‖ε(x_k, k)‖ ≤ 3ε_{VM}, (A.10) becomes

ΔL_V(W̃_{V,k}) ≤ −2α_V (σ_{1m}² / (1 + σ_{1M}²)) ‖W̃_{V,k}‖² + 3α_V² ‖W̃_{V,k}‖²
               + 8α_V(1 + 2α_V) σ_{VM}² W_M² + 36α_V ε_{VM}² + 18ε_{VM}²
             = −α_V (2σ_{1m}² / (1 + σ_{1M}²) − 3α_V) ‖W̃_{V,k}‖² + ε_{1VM}²        (A.11)

where ε_{1VM}² = 8α_V(1 + 2α_V) σ_{VM}² W_M² + 36α_V ε_{VM}² + 18ε_{VM}². From (A.11), it can be observed that the nonautonomous Lyapunov candidate is upper bounded by a time-invariant function. Therefore, ΔL_V(W̃_{V,k}) is less than zero outside of a compact set as long as the following condition holds:

‖W̃_{V,k}‖ > sqrt( ε_{1VM}² / (α_V (2σ_{1m}² / (1 + σ_{1M}²) − 3α_V)) ) ≡ B_V.        (A.12)

PROOF OF THEOREM 3

First, denote û_k = û(x_k, k), f_k = f(x_k), g_k = g(x_k), ĝ_k = ĝ(x_k), and g̃_k = g_k − ĝ_k for simplicity, and let R̄ = ‖R^{−1}‖. Define the Lyapunov candidate function as

L = (α_u² / (2g_M²(1 + σ_{uM}²))) (L_x + L_ID) + L_V + L_u + L_A + L_B        (A.13)

where L_x = x_k^T x_k, L_ID and L_V are defined in (A.1) and (A.7), respectively, L_u = tr(W̃_{u,k}^T W̃_{u,k}), L_A = ‖W̃_{V,k}^T W̃_{V,k}‖², and L_B = θ_m² [tr(W̃_k^T W̃_k)]² + ‖W̃_k^T σ(x_{k−1}) ū_{k−1}‖⁴. In addition, define the constants

Π_1 = max{ (g_M ∇σ_{VM} R̄)²(8 + 9α_u)/2, 3α_u²(g_M ∇σ_{VM} R̄)²/(2(3α_u + 8)) }
Π_2 = max{ (W_{VM} ∇σ_{VM} R̄)²(α_u + 8)/2, 3α_u²(W_{VM} ∇σ_{VM} R̄)²/(2(3α_u + 8)) }
Π_3 = 3(∇σ_{VM} R̄)²(3α_u + 8)/8

which collect repeated factors below. The terms in (A.13) are now considered individually. First,

ΔL_x = x_{k+1}^T x_{k+1} − x_k^T x_k = ‖f_k + g_k û_k‖² − ‖x_k‖²
     = ‖f_k + g_k u*_k − g_k u*_k + g_k û_k‖² − ‖x_k‖².        (A.14)

Using Lemma 1, applying the Cauchy–Schwarz inequality twice, and recalling the bounds, (A.14) becomes

ΔL_x ≤ 2‖f_k + g_k u*_k‖² + 2‖g_k u*_k − g_k û_k‖² − ‖x_k‖²
     ≤ −(1 − 2k*)‖x_k‖² + 2g_M² ‖W_u^T σ_{u,k} + ε_{u,k} − Ŵ_{u,k}^T σ_{u,k}‖²
     = −(1 − 2k*)‖x_k‖² + 2g_M² ‖W̃_{u,k}^T σ_{u,k} + ε_{u,k}‖²
     ≤ −(1 − 2k*)‖x_k‖² + 4g_M² ‖Ξ_{u,k}‖² + 4g_M² ε_{uM}²        (A.15)

where Ξ_{u,k} = W̃_{u,k}^T σ_{u,k}. Next, recalling (40), the first difference ΔL_u can be represented as

ΔL_u = tr(W̃_{u,k+1}^T W̃_{u,k+1}) − tr(W̃_{u,k}^T W̃_{u,k})
     = (2α_u / (1 + σ_{u,k}^T σ_{u,k})) tr(ũ_k W̃_{u,k}^T σ_{u,k}) + (α_u² σ_{u,k}^T σ_{u,k} / (1 + σ_{u,k}^T σ_{u,k})²) tr(ũ_k ũ_k^T).        (A.16)

Observe that −α_u(2 − α_u)‖Ξ_{u,k}‖² = −(α_u(3 − 3α_u)/2)‖Ξ_{u,k}‖² − (α_u(1 + α_u)/2)‖Ξ_{u,k}‖², and recall that, similarly to the critic case, σ_{u,k} is a time-dependent activation function which, due to its smoothness and the finite time span, is bounded as 0 < σ_{um} ≤ ‖σ_{u,k}‖ ≤ σ_{uM}. Noticing that σ_{u,k}^T σ_{u,k}/(1 + σ_{u,k}^T σ_{u,k}) ≤ 1, substituting (39) into (A.16), using the cyclic property of the trace operator, applying the norm bounds, and completing the squares with respect to Ξ_{u,k}, ‖W̃_{V,k}‖, ‖g̃_k‖, and ‖W̃_{V,k}‖‖g̃_k‖, (A.16) becomes

ΔL_u ≤ −(α_u(3 − 3α_u)/(2(1 + σ_{uM}²))) ‖Ξ_{u,k}‖²
       + (α_u(g_M ∇σ_{VM} R̄)²(8 + 9α_u)/(2(1 + σ_{uM}²))) ‖W̃_{V,k}‖²
       + (α_u(W_{VM} ∇σ_{VM} R̄)²(α_u + 8)/(2(1 + σ_{uM}²))) ‖g̃_k‖²
       + (3α_u²(g_M ∇σ_{VM} R̄)²/(2(3α_u + 8))) ‖W̃_{V,k}‖²
       + (3α_u(∇σ_{VM} R̄)²(3α_u + 8)/(4(1 + σ_{uM}²))) ‖W̃_{V,k}‖² ‖g̃_k‖²
       + (3α_u²(W_{VM} ∇σ_{VM} R̄)²/(2(3α_u + 8))) ‖g̃_k‖²
       + ε̃_{1M}² + ε̃_{2M}² + ε̃_{3M}² + ε̃_{4M}²        (A.17)

where

ε̃_{1M}² = 8α_u(1 + α_u)³ ε̃_{uM}² / (1 + σ_{um}²)
ε̃_{2M}² = (2α_u(g_M ∇σ_{VM} R̄)²(8 + 9α_u)/(1 + σ_{um}²)) [α_u ε̃_{uM} / (g_M ∇σ_{VM} R̄(8 + 9α_u))]²
ε̃_{3M}² = (2α_u(W_{VM} ∇σ_{VM} R̄)²(α_u + 8)/(1 + σ_{um}²)) [α_u ε̃_{uM} / (W_{VM} ∇σ_{VM} R̄(α_u + 8))]²
ε̃_{4M}² = (3α_u(∇σ_{VM} R̄)²(3α_u + 8)/(4(1 + σ_{um}²))) [α_u(W_{VM} g_M ∇σ_{VM} R̄ + 2ε̃_{uM}) / ((3α_u + 8)∇σ_{VM} R̄)]².

Finally, using Young's inequality and recalling the definitions of Π_1–Π_3, we have

ΔL_u ≤ −(α_u(3 − 3α_u)/(2(1 + σ_{uM}²))) ‖Ξ_{u,k}‖² + 2α_u Π_1 ‖W̃_{V,k}‖² + 2α_u Π_2 ‖g̃_k‖²
       + α_u Π_3 ‖W̃_{V,k}‖⁴ + α_u Π_3 ‖g̃_k‖⁴ + ε̃_{1M}² + ε̃_{2M}² + ε̃_{3M}² + ε̃_{4M}².        (A.18)

Next, consider L_A = ‖W̃_{V,k}^T W̃_{V,k}‖². Using (A.11), we have

ΔL_A = ‖W̃_{V,k+1}^T W̃_{V,k+1}‖² − ‖W̃_{V,k}^T W̃_{V,k}‖²
     = (W̃_{V,k+1}^T W̃_{V,k+1} + W̃_{V,k}^T W̃_{V,k})(W̃_{V,k+1}^T W̃_{V,k+1} − W̃_{V,k}^T W̃_{V,k})
     ≤ −α_V (2σ_{1m}²/(1 + σ_{1M}²) − 3α_V) [2 − α_V (2σ_{1m}²/(1 + σ_{1M}²) − 3α_V)] ‖W̃_{V,k}‖⁴
       + 2[1 − α_V (2σ_{1m}²/(1 + σ_{1M}²) − 3α_V)] ε_{1VM}² ‖W̃_{V,k}‖² + ε_{1VM}⁴.        (A.19)

Next, consider L_B = θ_m² [tr(W̃_k^T W̃_k)]² + ‖W̃_k^T σ(x_{k−1}) ū_{k−1}‖⁴. Recalling the weight estimation error dynamics (16) and applying the Cauchy–Schwarz inequality, we have

ΔL_B = θ_m² [tr(W̃_{k+1}^T W̃_{k+1})]² − θ_m² [tr(W̃_k^T W̃_k)]² + ‖W̃_{k+1}^T σ(x_k) ū_k‖⁴ − ‖W̃_k^T σ(x_{k−1}) ū_{k−1}‖⁴
     ≤ −θ_m⁴ ‖W̃_k‖⁴ + 16α⁴ ‖W̃_k^T σ(x_{k−1}) ū_{k−1}‖⁴ − ‖W̃_k^T σ(x_{k−1}) ū_{k−1}‖⁴ + ε̄_M²
     ≤ −θ_m⁴ ‖W̃_k‖⁴ − (1 − 16α⁴) ‖W̃_k^T σ(x_{k−1}) ū_{k−1}‖⁴ + ε̄_M²
     ≤ −θ_m⁴ ‖W̃_k‖⁴ + ε̄_M².        (A.20)

Finally, combining all of the above terms yields the first difference of the Lyapunov candidate function as

ΔL = (α_u² / (2g_M²(1 + σ_{uM}²))) (ΔL_x + ΔL_ID) + ΔL_V + ΔL_u + ΔL_A + ΔL_B
   ≤ −(α_u²(1 − 2k*)/(2g_M²(1 + σ_{uM}²))) ‖x_k‖² − (α_u(3 − 7α_u)/(2(1 + σ_{uM}²))) ‖Ξ_{u,k}‖²
     − (1 − α²) ‖e_k‖² − α_V (σ_{1m}²/(1 + σ_{1M}²) − 4α_V) ‖W̃_{V,k}‖² − (1/2) θ_m² ‖W̃_k‖²
     − α_V (σ_{1m}²/(1 + σ_{1M}²) − 3α_V) [1 − α_V (2σ_{1m}²/(1 + σ_{1M}²) − 3α_V)] ‖W̃_{V,k}‖⁴
     − θ_m⁴ ‖W̃_k‖⁴ + ε_total        (A.21)

where

ε_total = (1 + 2/θ_m²) ε_{1VM}⁴ + (1 + 2/θ_m²) ε̄_M² + ε̃_{1M}² + ε̃_{2M}² + ε̃_{3M}² + ε̃_{4M}²
          + (ε̄_M² + ε_{1VM}² + 2α_u ε_{uM}²) / (1 + σ_{uM}²).

Hence, the nonautonomous Lyapunov candidate is upper bounded by a time-invariant function. Therefore, ΔL is less than zero outside of a compact set as long as one of the following conditions holds:

‖x_k‖ > sqrt( 2g_M²(1 + σ_{uM}²) ε_total / (α_u²(1 − 2k*)) ) ≡ b_x        (A.22)

or

‖Ξ_{u,k}‖ > sqrt( 2(1 + σ_{uM}²) ε_total / (α_u(3 − 7α_u)) ) ≡ b_Ξ        (A.23)

or

‖e_k‖ > sqrt( ε_total / (1 − α²) ) ≡ b_e        (A.24)

or

‖W̃_k‖ > min{ sqrt( 2ε_total / θ_m² ), ( ε_total / θ_m⁴ )^{1/4} } ≡ b_W̃        (A.25)

or

‖W̃_{V,k}‖ > min{ sqrt( ε_total / (α_V(σ_{1m}²/(1 + σ_{1M}²)) − 4α_V²) ),
                 ( ε_total / (α_V(σ_{1m}²/(1 + σ_{1M}²) − 3α_V)[1 − α_V(2σ_{1m}²/(1 + σ_{1M}²) − 3α_V)]) )^{1/4} } ≡ b_W̃V.        (A.26)

Note that the ranges for α_u and k* always guarantee b_x > 0 and b_Ξ > 0, and the range for α guarantees b_e > 0 and b_W̃ > 0. The range for α_V guarantees b_W̃V > 0, since 0 < α_V < σ_{1m}²/4(1 + σ_{1M}²) < σ_{1m}²/3(1 + σ_{1M}²), which ensures that the second term in (A.26) is positive.

REFERENCES

[1] F. L. Lewis and V. L. Syrmos, Optimal Control, 2nd ed. New York, NY, USA: Wiley, 1995.
[2] H. Xu and S. Jagannathan, "Stochastic optimal controller design for uncertain nonlinear networked control system via neuro dynamic programming," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 3, pp. 471–484, Mar. 2013.
[3] T. Dierks and S. Jagannathan, "Online optimal control of affine nonlinear discrete-time systems with unknown internal dynamics by using time-based policy update," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 7, pp. 1118–1129, Jul. 2012.
[4] Z. Chen and S. Jagannathan, "Generalized Hamilton–Jacobi–Bellman formulation based neural network control of affine nonlinear discrete-time systems," IEEE Trans. Neural Netw., vol. 19, no. 1, pp. 90–106, Jan. 2008.
[5] R. Beard, "Improving the closed-loop performance of nonlinear systems," Ph.D. dissertation, Dept. Electr. Eng., Rensselaer Polytech. Inst., Troy, NY, USA, 1995.
[6] T. Cheng, F. L. Lewis, and M. Abu-Khalaf, "A neural network solution for fixed-final-time optimal control of nonlinear systems," Automatica, vol. 43, no. 3, pp. 482–490, 2007.
[7] M. Abu-Khalaf and F. L. Lewis, "Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach," Automatica, vol. 41, no. 5, pp. 779–791, 2005.
[8] A. Heydari and S. N. Balakrishnan, "Finite-horizon control-constrained nonlinear optimal control using single network adaptive critics," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 1, pp. 145–157, Jan. 2013.
[9] F. Y. Wang, N. Jin, D. Liu, and Q. Wei, "Adaptive dynamic programming for finite-horizon optimal control of discrete-time nonlinear systems with ε-error bound," IEEE Trans. Neural Netw., vol. 22, no. 1, pp. 24–36, Jan. 2011.
[10] Q. Zhao, H. Xu, and S. Jagannathan, "Finite-horizon optimal adaptive neural network control of uncertain nonlinear discrete-time systems," in Proc. IEEE Multi-Conf. Syst. Control, Hyderabad, India, Aug. 2013, pp. 41–46.
[11] C. Watkins, "Learning from delayed rewards," Ph.D. dissertation, Dept. Comput. Sci., Cambridge Univ., Cambridge, U.K., 1989.
[12] H. K. Khalil, Nonlinear Systems, 3rd ed. Englewood Cliffs, NJ, USA: Prentice-Hall, 2002.
[13] P. J. Werbos, "A menu of designs for reinforcement learning over time," J. Neural Netw. Control, vol. 3, no. 1, pp. 835–846, 1983.
[14] T. Dierks and S. Jagannathan, "Optimal control of affine nonlinear discrete-time systems with unknown internal dynamics," in Proc. Conf. Decision Control, Shanghai, China, 2009, pp. 6750–6755.
[15] D. P. Bertsekas, Dynamic Programming and Optimal Control, 2nd ed. Belmont, MA, USA: Athena Scientific, 2000.
[16] T. Dierks, B. T. Thumati, and S. Jagannathan, "Optimal control of unknown affine nonlinear discrete-time systems using offline-trained neural networks with proof of convergence," Neural Netw., vol. 22, nos. 5–6, pp. 851–860, 2009.
[17] J. Si, A. G. Barto, W. B. Powell, and D. Wunsch, Handbook of Learning and Approximate Dynamic Programming. New York, NY, USA: Wiley, 2004.
[18] D. V. Prokhorov and D. Wunsch, "Adaptive critic designs," IEEE Trans. Neural Netw., vol. 8, no. 5, pp. 997–1007, Sep. 1997.
[19] S. Jagannathan, Neural Network Control of Nonlinear Discrete-Time Systems. Boca Raton, FL, USA: CRC Press, 2006.
[20] M. Green and J. B. Moore, "Persistency of excitation in linear systems," Syst. Control Lett., vol. 7, no. 5, pp. 351–360, 1986.
[21] K. S. Narendra and A. M. Annaswamy, Stable Adaptive Systems. Englewood Cliffs, NJ, USA: Prentice-Hall, 1989.
[22] F. L. Lewis, S. Jagannathan, and A. Yesildirek, Neural Network Control of Robot Manipulators and Nonlinear Systems. London, U.K.: Taylor & Francis, 1999.


Qiming Zhao received the master’s degree in automation from Northwestern Polytechnical University, Xi’an, China, and the Ph.D. degree in electrical engineering from the Missouri University of Science and Technology, Rolla, MO, USA, in 2010 and 2013, respectively. He is currently with DENSO International America, Inc., Southfield, MI, USA. His current research interests include intelligent systems, approximate/adaptive dynamic programming, neural network-based control, nonlinear control, optimal control, and adaptive control.

Hao Xu (M'12) was born in Nanjing, China, in 1984. He received the master's degree in electrical engineering from Southeast University, Nanjing, China, and the Ph.D. degree from the Missouri University of Science and Technology, Rolla, MO, USA, in 2012. He is currently an Assistant Professor with the Department of Electrical Engineering, University of Tennessee at Martin, Martin, TN, USA. His current research interests include wireless passive sensor networks, localization, detection, networked control systems, cyber-physical systems, distributed network protocol design, optimal control, and adaptive control.


Sarangapani Jagannathan (S'89–M'89–SM'99) is with the Missouri University of Science and Technology, where he is the Rutledge-Emerson Endowed Chair Professor of Electrical and Computer Engineering and the Site Director of the NSF Industry/University Cooperative Research Center on Intelligent Maintenance Systems. He has coauthored 120 peer-reviewed journal articles in IEEE Transactions, 220 refereed IEEE conference articles, several book chapters, and three books, and he holds 20 U.S. patents. He has supervised to graduation around 18 doctoral and 29 M.S.-level students, and his funding is in excess of $14 million from various U.S. federal and industrial members. He was a co-editor for the IET book series on control from 2010 to 2013, and he is currently serving as the Editor-in-Chief of Discrete Dynamics in Nature and Society and on many editorial boards. His current research interests include neural network control, adaptive event-triggered control, secure networked control systems, prognostics, and autonomous systems/robotics. Dr. Jagannathan is the IEEE CSS Technical Committee Chair on Intelligent Control. He has been a senior IEEE member since 1999. He has received many awards and has served on the organizing committees of several IEEE conferences.
