This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS


Neural Network-Based Finite Horizon Stochastic Optimal Control Design for Nonlinear Networked Control Systems

Hao Xu, Member, IEEE, and Sarangapani Jagannathan, Senior Member, IEEE

Abstract—The stochastic optimal control of nonlinear networked control systems (NNCSs) using neuro-dynamic programming (NDP) over a finite time horizon is a challenging problem due to terminal constraints, system uncertainties, and unknown network imperfections, such as network-induced delays and packet losses. Since traditional iteration- or time-based infinite horizon NDP schemes are unsuitable for NNCSs with terminal constraints, a novel time-based NDP scheme is developed to solve the finite horizon optimal control of NNCSs by mitigating the above-mentioned challenges. First, an online neural network (NN) identifier is introduced to approximate the control coefficient matrix, which is subsequently utilized in conjunction with the critic and actor NNs to determine a time-based stochastic optimal control input over the finite horizon in a forward-in-time and online manner. Lyapunov theory is then used to show that all closed-loop signals and NN weights are uniformly ultimately bounded, with the ultimate bounds being a function of the initial conditions and final time. Moreover, the approximated control input converges close to the optimal value within finite time. Simulation results are included to show the effectiveness of the proposed scheme.

Index Terms—Neuro-dynamic programming (NDP), nonlinear networked control system (NNCS), stochastic optimal control.

I. INTRODUCTION

NONLINEAR networked control systems (NNCSs) [1], which utilize a communication network to connect the nonlinear plant with a controller, have been considered the next-generation control system. Despite benefits such as high flexibility, efficiency, and low installation cost, a communication network within the feedback loop causes several challenging issues due to network imperfections while exchanging data among devices. These network imperfections, such as random delays and packet losses, can degrade the performance of the control system and can even cause instability. Therefore, the stability analysis for linear networked control systems (LNCSs) is developed in [2]. Subsequently,

Manuscript received February 27, 2013; revised January 29, 2014; accepted March 28, 2014. This work was supported by the National Science Foundation under Grant ECCS 1128281. H. Xu is with the Department of Engineering, University of Tennessee at Martin, Martin, TN 38237 USA (e-mail: [email protected]). S. Jagannathan is with the Department of Electrical and Computer Engineering, Missouri University of Science and Technology, Rolla, MO 65409 USA (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2014.2315622

Walsh et al. [3] analyzed the asymptotic behavior of NNCSs in the presence of network-induced delays. In [4], a discrete-time framework is introduced to analyze NNCS stability with both delays and packet losses. These schemes [2]–[4] maintain the stability of the NNCS while assuming that the system dynamics and network imperfections are known beforehand. However, because the network imperfections are time varying and normally unknown, the NCS dynamics become uncertain. In addition, optimality, which includes stability, is generally preferred [5]–[8] over stability alone. The neuro-dynamic programming (NDP) technique, on the other hand, proposed in [10] and [26], obtains the optimal control of uncertain nonlinear systems in a forward-in-time manner, in contrast to traditional backward-in-time optimal control techniques, which normally require complete knowledge of the system dynamics. Moreover, using value or policy iterations, reinforcement learning can be incorporated with NDP to obtain optimal control [11]. Wang et al. [12] and Zhang et al. [13] proposed policy or value iteration-based NDP schemes to obtain finite horizon optimal control inputs for nonlinear systems with unknown internal dynamics. However, to compute the optimal solution, these iteration-based NDP methods require a significant number of iterations within a fixed sampling interval [14], which can be a bottleneck for implementation. Therefore, Dierks and Jagannathan [14] developed a time-based NDP scheme to solve the infinite horizon optimal control problem in the presence of unknown internal dynamics using the previous history of system states and cost function values.
Nevertheless, the existing NDP schemes [10], [11], [13], [14] are unsuitable for finite horizon optimal control of NNCSs since: 1) these techniques [10], [11], [13], [14] are only developed to solve infinite horizon optimal control [15]; 2) partial system dynamics in the form of the control coefficient matrix are needed; and 3) the network imperfections resulting from the communication network are ignored. In [9], a model-free infinite horizon optimal control of LNCSs is derived in the presence of uncertain dynamics and network imperfections using an approximate dynamic programming-based technique. However, finite horizon optimal control is more difficult to solve for NNCSs than infinite horizon-based schemes [5]–[9] due to terminal constraints. To the best of the authors' knowledge, this paper for the first time considers a finite horizon optimal design for NNCSs by incorporating terminal constraints.

2162-237X © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


Fig. 1. General NNCS block diagram.

In this paper, an optimal adaptive control scheme using time-based NDP is undertaken to obtain finite horizon stochastic optimal regulation of an NNCS in the presence of uncertain system dynamics due to network imperfections. First, to relax the need for the NNCS dynamics, a novel neural network (NN) identifier is proposed to learn the control coefficient matrix online. Then, using an initial admissible control, a critic NN [16] is introduced and tuned in a forward-in-time manner to approximate the stochastic value function using the Hamilton–Jacobi–Bellman (HJB) equation [15], given the terminal constraint. Finally, an actor NN is introduced to generate the optimal control input by minimizing the estimated stochastic value function. Compared with the traditional stochastic optimal controller design [25], which requires full knowledge of the system dynamics, the proposed stochastic optimal controller design for NNCSs relaxes the requirements on the system dynamics and network imperfections and avoids value or policy iterations by using a novel time-based NDP technique. Moreover, the available control techniques for time-delay systems with known deterministic delays [17], [18] are unsuitable here, since the network imperfections of an NNCS result in random delays and packet losses that cannot be handled by time-delay control techniques. The main contributions of this paper include: 1) a time-based stochastic optimal control NDP approach over the finite horizon for uncertain NNCSs incorporating terminal constraints; 2) a novel online identifier to obtain the NNCS dynamics; and 3) demonstration of closed-loop stability via a combination of Lyapunov and geometric sequence analysis. This paper is organized as follows. Section II presents the background of NNCSs and traditional finite horizon optimal control. The novel finite horizon stochastic optimal control scheme with an online identifier is developed in Section III, where the stability is also analyzed.
Section IV illustrates the effectiveness of the proposed approach, while Section V provides concluding remarks.

II. BACKGROUND

the communication network is shared [19], [20], the NNCS in this paper considers the following network imperfections: 1) $\tau_{sc}(t)$: sensor-to-controller delay; 2) $\tau_{ca}(t)$: controller-to-actuator delay; and 3) $\gamma(t)$: indicator of network-induced packet losses. Based on the recent NCS [2]–[9] and communication network protocol development literature [19], [20], the following assumptions [5], [9] are utilized for the controller design.

Assumption 1: 1) Due to the wide area network, the two types of network-induced delays are considered independent, ergodic, and unknown, whereas their probability distribution functions are considered known. The sensor-to-controller delay is assumed to be less than one sampling interval. 2) The sum of the two delays is considered bounded, while the initial state of the system is assumed to be deterministic [5].

Incorporating network-induced delays and packet losses, the original affine nonlinear system can be represented as

$$\dot{x}(t) = f(x(t)) + \gamma(t)\, g(x(t))\, u(t - \tau(t)) \tag{1}$$

with

$$\gamma(t) = \begin{cases} I_{n \times n} & \text{if the control input is received by the actuator at time } t \\ 0_{n \times n} & \text{if the control input is lost at time } t \end{cases}$$

where $I_{n \times n}$ is the $n \times n$ identity matrix, and $x(t) \in \mathbb{R}^{n}$, $u(t) \in \mathbb{R}^{m}$, $f(x) \in \mathbb{R}^{n}$, and $g(x) \in \mathbb{R}^{n \times m}$ represent the system state, control input, nonlinear internal dynamics, and control coefficient matrix, respectively. Similar to [9] and [21], integrating (1) over a sampling interval $[kT_s, (k+1)T_s)$ with network-induced delays and packet losses, the NNCS can be represented as

$$x_{k+1} = X_{\tau,\gamma}\big(x_k, u_{k-1}, \ldots, u_{k-\bar{d}}\big) + P_{\tau,\gamma}\big(x_k, u_{k-1}, \ldots, u_{k-\bar{d}}\big)\, u_k \tag{2}$$

where $\bar{d}T_s$ is the maximum network-induced delay, with $T_s$ the sampling interval and $\bar{d}$ the number of sampling intervals, $x(kT_s) = x_k$ and $u((k-i)T_s) = u_{k-i}$, $\forall i = 0, 1, \ldots, \bar{d}$, are the discretized system state and previous control inputs, and $X_{\tau,\gamma}(\cdot)$ and $P_{\tau,\gamma}(\cdot)$ are defined similarly to [21]. Next, define a new augmented state variable $z_k = [x_k^T\ u_{k-1}^T \cdots u_{k-\bar{d}}^T]^T \in \mathbb{R}^{n + \bar{d}m}$. Equation (2) can be expressed equivalently as

$$z_{k+1} = F(z_k) + G(z_k)\, u_k \tag{3}$$

with the NNCS internal dynamics and control coefficient matrix $F(\cdot)$ and $G(\cdot)$ derived similarly to [21], and $\|G(z_k)\|_F \le G_M$, where $\|\cdot\|_F$ denotes the Frobenius norm [16] and $G_M$ is a positive constant. Due to the presence of network-induced delays and packet losses, (3) becomes uncertain and stochastic, thus requiring adaptive control methods. Next, the traditional stochastic optimal control scheme is briefly discussed for comparison.
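To make the representation (2)–(3) concrete, the following toy simulation builds the augmented state $z_k = [x_k^T\ u_{k-1}^T \cdots u_{k-\bar{d}}^T]^T$ while applying a randomly delayed or dropped control input. All numerical values, the plant dynamics, and the placeholder controller are illustrative assumptions, not the paper's design:

```python
import numpy as np

rng = np.random.default_rng(0)
d_bar = 2                  # maximum delay in sampling intervals (illustrative)
n, m = 2, 1                # state and input dimensions (illustrative)

def step(x, u_applied):
    """One discretized step of a toy affine plant x_{k+1} = f(x) + g(x) u."""
    f = np.array([0.9 * x[0] + 0.1 * x[1], -0.1 * x[0] + 0.95 * x[1]])
    g = np.array([[0.0], [0.1]])
    return f + (g @ u_applied).ravel()

x = np.array([1.0, -1.0])
u_hist = [np.zeros(m) for _ in range(d_bar)]   # u_{k-1}, ..., u_{k-d_bar}

for k in range(5):
    u_k = -0.5 * x[:1]                         # placeholder controller
    # network imperfections: random delay and Bernoulli packet loss (gamma)
    delay = rng.integers(0, d_bar)
    gamma = rng.random() < 0.9                 # True = packet received
    u_applied = u_hist[delay] if gamma else np.zeros(m)
    # augmented state z_k = [x_k; u_{k-1}; ...; u_{k-d_bar}] seen by the controller
    z_k = np.concatenate([x] + u_hist)
    x = step(x, u_applied)
    u_hist = [u_k] + u_hist[:-1]

print(z_k.shape)  # (n + d_bar * m,) = (4,)
```

The point of the sketch is only the bookkeeping: the controller never sees the current applied input (it is randomly delayed or lost), so the augmented state stacks the past inputs that may still arrive at the actuator, exactly as in (2)–(3).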

A. NNCS Representation

The block diagram of a general NNCS is shown in Fig. 1, where the control loop is closed over a communication network. Fig. 1 depicts a typical NNCS that is practical and utilized in many real-time applications. In addition, since

B. Traditional Stochastic Optimal Control

Consider the affine nonlinear discrete-time system

$$x_{d,k+1} = f_d(x_{d,k}) + g_d(x_{d,k})\, u_{d,k} \tag{4}$$


where $x_{d,k}$ and $u_{d,k}$ represent the system state and control input, while $f_d(x_{d,k})$ and $g_d(x_{d,k})$ denote the internal dynamics and control coefficient matrix, respectively. According to [15], the optimal control input is derived by minimizing the value function

$$V_k(x_{d,k}, k) = E\bigg[\varphi_N(x_{d,N}) + \sum_{l=k}^{N-1} r(x_{d,l}, u_{d,l})\bigg] = E\bigg[\varphi_N(x_{d,N}) + \sum_{l=k}^{N-1}\big(Q_d(x_{d,l}) + u_{d,l}^T R_d u_{d,l}\big)\bigg] \quad \forall k = 0, 1, \ldots, N-1$$
$$V_N(x_{d,N}, N) = E[\varphi_N(x_{d,N})] \tag{5}$$

with the cost-to-go denoted as $r(x_{d,l}, u_{d,l}) = Q_d(x_{d,l}) + u_{d,l}^T R_d u_{d,l}$, $NT_s$ being the final time instant, $Q_d(x) \ge 0$, $\varphi_N(x) \ge 0$, $R_d$ being a symmetric positive definite matrix, and $E(\cdot)$ being the expectation operator (the mean value). Here, the terminal constraint $\varphi_N(x)$ needs to be satisfied in the finite horizon optimal control design. Equation (5) can also be rewritten as

$$V_k(x_{d,k}, k) = E\bigg[r(x_{d,k}, u_{d,k}) + \varphi_N(x_{d,N}) + \sum_{l=k+1}^{N-1} r(x_{d,l}, u_{d,l})\bigg] = E\big[Q_d(x_{d,k}) + u_{d,k}^T R_d u_{d,k} + V_{k+1}(x_{d,k+1})\big], \quad k = 0, \ldots, N-1. \tag{6}$$

According to the observability condition [22], when $x = 0$ and $V_k(x) = 0$, the value function $V_k(x)$ serves as a Lyapunov function [15], [16]. Based on the Bellman principle of optimality [15], [16], the optimal value function also satisfies the discrete-time HJB equation

$$V^*(x_{d,k}, k) = \min_{u_{d,k}} V_k(x_{d,k}, k) = \min_{u_{d,k}} E\big[Q_d(x_{d,k}) + u_{d,k}^T R_d u_{d,k} + V^*(x_{d,k+1})\big] \quad \forall k = 0, \ldots, N-1$$
$$V^*(x_{d,N}, N) = E[\varphi_N(x_{d,N})]. \tag{7}$$

Differentiating (7), the optimal control $u_{d,k}^*$ is obtained from

$$E\bigg[\frac{\partial\big(Q_d(x_{d,k}) + u_{d,k}^T R_d u_{d,k}\big)}{\partial u_{d,k}} + \frac{\partial x_{d,k+1}}{\partial u_{d,k}}\frac{\partial V^*(x_{d,k+1})}{\partial x_{d,k+1}}\bigg] = 0. \tag{8}$$

In other words

$$u^*(x_{d,k}) = -\frac{1}{2} E\bigg[R_d^{-1} g_d^T(x_{d,k})\, \frac{\partial V^*(x_{d,k+1}, k+1)}{\partial x_{d,k+1}}\bigg], \quad k = 0, \ldots, N-1$$
$$u^*(x_{d,N-1}) = -\frac{1}{2} E\bigg[R_d^{-1} g_d^T(x_{d,N-1})\, \frac{\partial \varphi_N(x_{d,N})}{\partial x_{d,N}}\bigg]. \tag{9}$$

Substituting (9) into (7), the discrete-time HJB equation becomes

$$V^*(x_{d,k}, k) = E\bigg[Q_d(x_{d,k}) + \frac{1}{4}\frac{\partial V^{*T}(x_{d,k+1})}{\partial x_{d,k+1}}\, g_d(x_{d,k})\, R_d^{-1}\, g_d^T(x_{d,k})\, \frac{\partial V^*(x_{d,k+1})}{\partial x_{d,k+1}} + V^*(x_{d,k+1})\bigg] \quad \forall k = 0, \ldots, N-1$$
$$V^*(x_{d,N-1}) = E\bigg[Q_d(x_{d,N-1}) + \frac{1}{4}\frac{\partial \varphi_N^T(x_{d,N})}{\partial x_{d,N}}\, g_d(x_{d,N-1})\, R_d^{-1}\, g_d^T(x_{d,N-1})\, \frac{\partial \varphi_N(x_{d,N})}{\partial x_{d,N}} + \varphi_N(x_{d,N})\bigg]. \tag{10}$$

It is worthwhile to observe that obtaining a closed-form solution to the discrete-time HJB equation is difficult since the future system state $x_{k+1}$ and the system dynamics are needed at time $kT_s$. To circumvent this issue, value and policy iteration-based schemes are normally utilized [12], [13]. However, iteration-based methods are unsuitable for real-time control due to the large number of iterations needed within a sampling interval; an inadequate number of iterations can lead to instability [14]. Therefore, a time-based NDP finite horizon optimal controller design is presented next for the NNCS.

III. STOCHASTIC OPTIMAL CONTROLLER DESIGN FOR NNCS

In this section, a novel time-based NDP technique is derived to obtain the stochastic optimal regulation of an NNCS over a finite time horizon with uncertain system dynamics due to network imperfections, such as random delays and packet losses. First, an online NN identifier is introduced to obtain the control coefficient matrix. Then, a critic NN is proposed to approximate the stochastic value function within a fixed final time. Finally, using an actor NN, the identified NNCS dynamics, and the estimated stochastic value function, the finite horizon stochastic optimal control of the NNCS is derived. The details are given in the following.

A. Online NN Identifier Design
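The backward-in-time nature of the traditional design (5)–(10) can be made concrete on a scalar linear special case, where the recursion requires complete model knowledge at every step; this is precisely the requirement the forward-in-time design removes. In the sketch below (all values illustrative), the plant is $x_{k+1} = a x_k + b u_k$, the stage cost is $q x^2 + r u^2$, and the terminal cost is $p_N x^2$, so $V_k(x) = p_k x^2$ and (7)–(10) collapse to a backward Riccati recursion with time-varying gains:

```python
# Finite horizon LQ special case of (5)-(10): the value function weights p_k and
# gains K_k are computed backward from the terminal weight p_N, which is only
# possible when a, b, q, r are all known in advance.
a, b, q, r = 1.05, 0.5, 1.0, 0.1   # illustrative model and cost parameters
N, p_N = 20, 5.0                    # horizon length and terminal weight

p = [0.0] * (N + 1)
K = [0.0] * N
p[N] = p_N
for k in range(N - 1, -1, -1):      # backward in time
    K[k] = a * b * p[k + 1] / (r + b * b * p[k + 1])
    p[k] = q + a * a * p[k + 1] - a * b * p[k + 1] * K[k]

# optimal control is u_k = -K[k] x_k; near the horizon end the gain is
# dominated by the terminal weight p_N, mirroring the second line of (9)
print(round(K[N - 1], 3))
```

The recursion is exact here only because the model is linear and fully known; for the uncertain NNCS (3) this backward sweep is unavailable, which motivates the online identifier, critic, and actor introduced next.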

The control coefficient matrix is required for the optimal control of an affine nonlinear system [12], [13]. However, the control coefficient matrix is not normally known in advance. To overcome this deficiency, a novel NN identifier is proposed to estimate the control coefficient matrix, denoted as $G(z)$. Since the network imperfections are represented by random variables, a stochastic mathematical treatment is used throughout this paper. Based on [23] and the universal function approximation property, the NNCS internal dynamics and control coefficient matrix can be represented on a compact set $\Omega$ as

$$F(z_k) = E_{\tau,\gamma}\big[W_F^T \upsilon_F(z_k) + \varepsilon_{F,k}\big], \qquad G(z_k) = E_{\tau,\gamma}\big[W_G^T \upsilon_G(z_k) + \varepsilon_{G,k}\big] \quad \forall k = 0, 1, \ldots, N \tag{11}$$

with $W_F$ and $W_G$ denoting the target NN weights, $\upsilon_F(\cdot)$ and $\upsilon_G(\cdot)$ denoting the activation functions, and $\varepsilon_{F,k}$ and $\varepsilon_{G,k}$ representing the reconstruction errors, respectively. Substituting (11) into (3) yields

$$z_k = F(z_{k-1}) + G(z_{k-1})u_{k-1} = E_{\tau,\gamma}\bigg[\big[W_F^T\ \ W_G^T\big]\begin{bmatrix}\upsilon_F(z_{k-1}) & 0\\ 0 & \upsilon_G(z_{k-1})\end{bmatrix}\begin{bmatrix}I_n\\ u_{k-1}\end{bmatrix} + \varepsilon_{I,k-1}\bigg] = E_{\tau,\gamma}\big[W_I^T \upsilon_I(z_{k-1})U_{k-1} + \varepsilon_{I,k-1}\big] \quad \forall k = 0, 1, \ldots, N \tag{12}$$

where $W_I = [W_F^T\ W_G^T]^T$ and $\upsilon_I(z_{k-1}) = \mathrm{diag}[\upsilon_F(z_{k-1}), \upsilon_G(z_{k-1})]$ are the NN identifier target weight matrix and activation function, respectively, the augmented control input $U_{k-1} = [I_n^T\ u_{k-1}^T]^T$ includes the historical input value $u_{k-1}$ and $I_n = [1\ 1 \cdots 1]^T \in \mathbb{R}^{n \times 1}$, $\varepsilon_{I,k-1} = \varepsilon_{F,k-1} + \varepsilon_{G,k-1}u_{k-1}$ represents the NN identifier reconstruction error, and $E_{\tau,\gamma}(\cdot)$ is the expectation operator.

Since the NN activation function and the augmented control input from previous time instants are considered bounded with an initial bounded input, the term $\|E_{\tau,\gamma}[\upsilon_I(z_{k-1})U_{k-1}]\| \le \zeta_M$, where $\zeta_M$ is a positive constant. Moreover, the NN identifier reconstruction error is considered bounded, i.e., $\|E_{\tau,\gamma}[\varepsilon_{I,k-1}]\| \le \varepsilon_{I,M}$, where $\varepsilon_{I,M}$ denotes a positive constant. Therefore, given the bounded NN activation functions $\upsilon_F(\cdot)$, $\upsilon_G(\cdot)$, and $\upsilon_I(\cdot)$, the NNCS control coefficient matrix $G(z)$ can be identified once the NN identifier weight matrix $W_I$ is obtained. Next, the update law for the NN identifier is introduced.

The NNCS system state $z_k$ can be approximated using an NN identifier as

$$\hat{z}_k = E_{\tau,\gamma}\big[\hat{W}_{I,k}^T \upsilon_I(z_{k-1})U_{k-1}\big] \quad \forall k = 0, 1, \ldots, N \tag{13}$$

with $\hat{W}_{I,k}$ the actual identifier NN weight matrix at time instant $kT_s$ and $E_{\tau,\gamma}[\upsilon_I(z_{k-1})U_{k-1}]$ the basis function of the NN identifier. Based on (12) and (13), the identification error can be expressed as

$$E_{\tau,\gamma}(e_{I,k}) = E_{\tau,\gamma}(z_k - \hat{z}_k) = E_{\tau,\gamma}(z_k) - E_{\tau,\gamma}\big[\hat{W}_{I,k}^T \upsilon_I(z_{k-1})U_{k-1}\big]. \tag{14}$$

Moreover, the identification error dynamics can be derived as

$$E_{\tau,\gamma}(e_{I,k+1}) = E_{\tau,\gamma}(z_{k+1}) - E_{\tau,\gamma}\big[\hat{W}_{I,k+1}^T \upsilon_I(z_k)U_k\big] \quad \forall k = 0, 1, \ldots, N-1. \tag{15}$$

According to [21] and using the history from the NNCS, an auxiliary identification error vector can be defined as

$$E_{\tau,\gamma}(\Xi_{I,k}) = E_{\tau,\gamma}(Z_k - \hat{Z}_k) = E_{\tau,\gamma}(Z_k) - E_{\tau,\gamma}\big[\hat{W}_{I,k}^T \bar{\upsilon}_I(z_{k-1})\bar{U}_{k-1}\big] \tag{16}$$

with $Z_k = [z_k\ z_{k-1} \cdots z_{k+1-l}]$, $\bar{\upsilon}_I(z_{k-1}) = [\upsilon_I(z_{k-1})\ \upsilon_I(z_{k-2}) \cdots \upsilon_I(z_{k-l})]$, $\bar{U}_{k-1} = \mathrm{diag}[U_{k-1} \cdots U_{k-l}]$, and $\bar{\varepsilon}_{I,k-1} = [\varepsilon_{I,k-1} \cdots \varepsilon_{I,k-l}]$, where $0 < l < k-1$. The $l$ previous identification errors are recalculated using the most recent actual NN identifier weight matrix, i.e., $E_{\tau,\gamma}(\hat{W}_{I,k})$.

Then, the dynamics of the auxiliary identification error can be represented as

$$E_{\tau,\gamma}(\Xi_{I,k+1}) = E_{\tau,\gamma}(Z_{k+1}) - E_{\tau,\gamma}\big[\hat{W}_{I,k+1}^T \bar{\upsilon}_I(z_k)\bar{U}_k\big] \quad \forall k = 0, 1, \ldots, N-1. \tag{17}$$

To force the actual NN identifier weight matrix close to its target within the fixed final time, the stochastic update law for $E_{\tau,\gamma}(\hat{W}_{I,k})$ can be expressed as

$$E_{\tau,\gamma}(\hat{W}_{I,k+1}) = E_{\tau,\gamma}\Big[\bar{U}_k\bar{\upsilon}_I(z_k)\big(\bar{\upsilon}_I^T(z_k)\bar{U}_k^T\bar{U}_k\bar{\upsilon}_I(z_k)\big)^{-1}\big(Z_k - \alpha_I \Xi_{I,k}\big)^T\Big] \quad \forall k = 0, 1, \ldots, N-1 \tag{18}$$

where $\alpha_I$ is the tuning parameter satisfying $0 < \alpha_I < 1$. Substituting the update law (18) into the auxiliary error dynamics (17), the error dynamics $E_{\tau,\gamma}(\Xi_{I,k+1})$ can be represented as

$$E_{\tau,\gamma}(\Xi_{I,k+1}) = \alpha_I E_{\tau,\gamma}(\Xi_{I,k}) \quad \forall k = 0, 1, \ldots, N-1. \tag{19}$$

To learn the NNCS control coefficient matrix $G(z)$, $E_{\tau,\gamma}[\bar{\upsilon}_I(z_k)U_k]$ has to be persistently exciting (PE) [14], [16] long enough; namely, there exists a positive constant $\zeta_{\min}$ such that $0 < \zeta_{\min} \le \|E_{\tau,\gamma}[\bar{\upsilon}_I(z_k)U_k]\|$ is satisfied for $k = 0, 1, \ldots, N$.

Recalling (12), the identification error dynamics (15) can be represented as

$$E_{\tau,\gamma}(e_{I,k+1}) = E_{\tau,\gamma}\big[\tilde{W}_{I,k+1}^T \upsilon_I(z_k)U_k + \varepsilon_{I,k}\big] \quad \forall k = 0, 1, \ldots, N-1 \tag{20}$$

where $\tilde{W}_{I,k} = W_I - \hat{W}_{I,k}$ is the NN identifier weight estimation error at time $kT_s$. Using the NN identifier update law and (20), the NN weight estimation error dynamics of the NN identifier can be derived as

$$E_{\tau,\gamma}\big[\tilde{W}_{I,k+1}^T \upsilon_I(z_k)U_k\big] = E_{\tau,\gamma}\big[\alpha_I \tilde{W}_{I,k}^T \upsilon_I(z_{k-1})U_{k-1} + \alpha_I \varepsilon_{I,k-1} - \varepsilon_{I,k}\big]. \tag{21}$$

Next, the stability of the NN identification error (14) and the weight estimation error $E_{\tau,\gamma}(\tilde{W}_{I,k})$ will be analyzed.
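Before the stability result, the contraction (19) can be checked numerically. The sketch below is a deliberately simplified, noise-free stand-in: a random regressor matrix plays the role of $\bar{\upsilon}_I(z)\bar{U}$ in (16), and the least-squares-style step mirrors the structure of (18). All dimensions and values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, l = 6, 2, 4              # regressor dim, state dim, history length (illustrative)
alpha_I = 0.5                   # tuning parameter, 0 < alpha_I < 1
W_true = rng.standard_normal((p, n))

# Phi stacks l past regressors (the role of bar_upsilon_I(z) bar_U in (16));
# Z stacks the corresponding states, noise-free for clarity
Phi = rng.standard_normal((p, l))
Z = W_true.T @ Phi              # n x l history matrix

W_hat = np.zeros((p, n))
norms = []
for k in range(10):
    Xi = Z - W_hat.T @ Phi                                   # auxiliary error (16)
    # update mirroring (18): least-squares step on the target (Z - alpha_I * Xi)
    W_hat = Phi @ np.linalg.inv(Phi.T @ Phi) @ (Z - alpha_I * Xi).T
    norms.append(np.linalg.norm(Z - W_hat.T @ Phi))

# consistent with (19): the auxiliary error contracts by alpha_I at every step
print(norms[1] / norms[0])  # ~0.5
```

One can verify symbolically that this update gives $\hat{W}_{k+1}^T\Phi = Z - \alpha_I\Xi_k$, hence $\Xi_{k+1} = \alpha_I\Xi_k$ exactly when the same history is reused, which is the discrete contraction behind the boundedness claim in the lemma below (in the paper, fresh data and reconstruction errors make the error ultimately bounded rather than exactly geometric).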

Lemma 1 (Boundedness in the Mean [21] for the NN Identifier): Given an initial NN identifier weight matrix $W_{I,0}$ residing in a compact set $\Omega$, let the proposed NN identifier be defined as in (13) and its update law be given by (18). Assuming that $E_{\tau,\gamma}[\bar{\upsilon}_I(z_k)U_k]$ satisfies the PE condition, there exists a positive tuning parameter $\alpha_I$ with $0 < \alpha_I < \zeta_{\min}/(\sqrt{2}\,\zeta_M)$ such that the identification error (14) and the NN identifier weight estimation error $E_{\tau,\gamma}(\tilde{W}_{I,k})$ are all uniformly ultimately bounded (UUB) within the fixed final time $t \in [0, NT_s]$.

Proof: The proof follows similarly to [21] and is omitted here.

B. Stochastic Value Function Setup and Critic NN Design

According to the value function defined in [12] and [13] and given the NNCS dynamics (3), the stochastic value function

can be expressed in terms of the augmented state $z_k$ as

$$V(z_k, k) = E_{\tau,\gamma}\bigg[\varphi_N(z_N) + \sum_{l=k}^{N-1}\big(Q_z(z_l) + u_l^T R_z u_l\big)\bigg], \quad k = 0, \ldots, N-1$$
$$V(z_N, N) = E_{\tau,\gamma}[\varphi_N(z_N)] \tag{22}$$

with $Q_z(z_k) \ge 0$ and $R_z = (1/\bar{d})R_d$. Compared with the stochastic value function in the infinite horizon case, a terminal constraint (i.e., $V(z_N, N) = E_{\tau,\gamma}[\varphi_N(z_N)]$) needs to be considered while developing the stochastic optimal controller. Next, according to the universal approximation property of NNs [16], the stochastic value function (22) can be represented using a critic NN as

$$V(z_k, k) = E_{\tau,\gamma}\big[W_V^T \psi(z_k, N-k) + \varepsilon_{V,k}\big] \quad \forall k = 0, 1, \ldots, N \tag{23}$$

where $W_V$ and $\varepsilon_{V,k}$ denote the critic NN target weight matrix and NN reconstruction error, respectively, and $\psi(z_k, N-k)$ represents the time-dependent critic NN activation function. Since the activation function explicitly depends upon time, the finite horizon design differs from, and is more difficult than, the infinite horizon case [14]. The target weight matrix of the critic NN is considered bounded in the mean as $\|E_{\tau,\gamma}(W_V)\| \le W_{VM}$, with $W_{VM}$ being a positive constant, and the reconstruction error is also considered bounded in the mean such that $\|E_{\tau,\gamma}(\varepsilon_{V,k})\| \le \varepsilon_{VM}$, with $\varepsilon_{VM}$ being a positive constant. In addition, the gradient of the NN reconstruction error is assumed bounded in the mean as $\|E_{\tau,\gamma}(\partial\varepsilon_{V,k}/\partial z_k)\| \le \varepsilon_{VM}'$, with $\varepsilon_{VM}'$ being a positive constant [14]. Next, the approximated stochastic value function (23) can be represented as

$$\hat{V}(z_k, k) = E_{\tau,\gamma}\big[\hat{W}_{V,k}^T \psi(z_k, N-k)\big] \quad \forall k = 0, 1, \ldots, N \tag{24}$$

with $E_{\tau,\gamma}(\hat{W}_{V,k})$ being the estimated critic NN weight matrix and $\psi(z_k, N-k)$ the time-dependent activation function selected from a basis function set whose elements are linearly independent [14]. In addition, since the activation function is continuous and smooth, time-independent functions $\psi_{\min}(z_k)$ and $\psi_{\max}(z_k)$ can be found such that $\psi_{\min}(z_k) \le \psi(z_k, N-k) \le \psi_{\max}(z_k)$, $k = 0, \ldots, N$. Then, the target stochastic value function is bounded as $W_V^T\psi_{\min}(z_k) \le V(z_k, k) \le W_V^T\psi_{\max}(z_k)$, with the stochastic value function estimation error satisfying $\tilde{W}_{V,k}^T\psi_{\min}(z_k) \le \tilde{V}(z_k, k) \le \tilde{W}_{V,k}^T\psi_{\max}(z_k)$.

Recalling (6) and substituting (22) into (6) yields

$$E_{\tau,\gamma}(\varepsilon_{V,k} - \varepsilon_{V,k+1}) = E_{\tau,\gamma}\big[W_V^T\big(\psi(z_{k+1}, N-k-1) - \psi(z_k, N-k)\big)\big] + E_{\tau,\gamma}\big(z_k^T Q_z z_k + u_k^T R_z u_k\big) \quad \forall k = 0, 1, \ldots, N-1 \tag{25}$$

namely

$$E_{\tau,\gamma}\big[W_V^T \Delta\psi(z_k, N-k)\big] + r(z_k, u_k) = \Delta\varepsilon_{V,k}, \quad k = 0, 1, \ldots, N-1 \tag{26}$$

where $\Delta\psi(z_k, N-k) = \psi(z_{k+1}, N-k-1) - \psi(z_k, N-k)$, $r(z_k, u_k) = E_{\tau,\gamma}(z_k^T Q_z z_k + u_k^T R_z u_k)$, and $\Delta\varepsilon_{V,k} = E_{\tau,\gamma}(\varepsilon_{V,k} - \varepsilon_{V,k+1})$, with $\|\Delta\varepsilon_{V,k}\| \le 2\varepsilon_{VM}$, $\forall k = 0, \ldots, N-1$. However, (26) cannot hold while utilizing the approximated critic NN $\hat{V}(z_k, k)$ instead of $V(z_k, k)$. Similar to [14] and [21], using delayed values for convenience, the residual error dynamics associated with (26) are derived as

$$E_{\tau,\gamma}(e_{\mathrm{HJB},k}) = E_{\tau,\gamma}\big(z_k^T Q_z z_k + u_{k-1}^T R_z u_{k-1}\big) + \hat{V}(z_{k+1}, k+1) - \hat{V}(z_k, k) = E_{\tau,\gamma}\big[\hat{W}_{V,k}^T \Delta\psi(z_k, N-k)\big] + r(z_k, u_k), \quad k = 0, \ldots, N-1 \tag{27}$$

with $E_{\tau,\gamma}(e_{\mathrm{HJB},k})$ being the residual error of the HJB equation for the finite horizon scenario. Moreover, since $r(z_k, u_k) = \Delta\varepsilon_{V,k} - E_{\tau,\gamma}[W_V^T\Delta\psi(z_k, N-k)]$, the residual error dynamics are represented as

$$E_{\tau,\gamma}(e_{\mathrm{HJB},k}) = E_{\tau,\gamma}\big[\hat{W}_{V,k}^T\Delta\psi(z_k, N-k)\big] - E_{\tau,\gamma}\big[W_V^T\Delta\psi(z_k, N-k)\big] + \Delta\varepsilon_{V,k} = -E_{\tau,\gamma}\big[\tilde{W}_{V,k}^T\Delta\psi(z_k, N-k)\big] + \Delta\varepsilon_{V,k}, \quad k = 0, \ldots, N-1 \tag{28}$$

where $E_{\tau,\gamma}(\tilde{W}_{V,k}) = E_{\tau,\gamma}(W_V) - E_{\tau,\gamma}(\hat{W}_{V,k})$ denotes the critic NN weight estimation error. Next, to incorporate the terminal constraint, the estimation error $E_{\tau,\gamma}(e_{FC,k})$ is defined as

$$E_{\tau,\gamma}(e_{FC,k}) = E_{\tau,\gamma}[\varphi_N(z_N)] - E_{\tau,\gamma}\big[\hat{W}_{V,k}^T\psi(\hat{z}_{N,k}, 0)\big] \quad \forall k = 0, 1, \ldots, N \tag{29}$$

where $\hat{z}_{N,k}$ is the estimated final NNCS system state vector at time $kT_s$ obtained using the NN identifier [i.e., $\hat{F}(\cdot)$, $\hat{G}(\cdot)$]. Recalling (23), (29) can be represented in terms of the critic NN weight estimation error as

$$E_{\tau,\gamma}(e_{FC,k}) = E_{\tau,\gamma}\big[W_V^T\psi(z_N, 0) + \varepsilon_{V,0}\big] - E_{\tau,\gamma}\big[\hat{W}_{V,k}^T\psi(\hat{z}_{N,k}, 0)\big] = E_{\tau,\gamma}\big[\tilde{W}_{V,k}^T\psi(\hat{z}_{N,k}, 0)\big] + E_{\tau,\gamma}\big[W_V^T\tilde{\psi}(z_N, \hat{z}_{N,k}, 0)\big] + E_{\tau,\gamma}(\varepsilon_{V,0}) \quad \forall k = 0, 1, \ldots, N \tag{30}$$

with $\tilde{\psi}(z_N, \hat{z}_{N,k}, 0) = \psi(z_N, 0) - \psi(\hat{z}_{N,k}, 0)$. Since the critic NN activation function $\psi(\cdot)$ is bounded, i.e., $\|\psi(\cdot)\| \le \psi_M$ [14], [21], we have $\|\tilde{\psi}(z_N, \hat{z}_{N,k}, 0)\| \le 2\psi_M$, with $\psi_M$ being a positive constant.

Combining both the HJB residual and the terminal constraint estimation errors and using a gradient-descent scheme, the stochastic update law for the critic NN weights is given by

$$E_{\tau,\gamma}(\hat{W}_{V,k+1}) = E_{\tau,\gamma}(\hat{W}_{V,k}) + \alpha_V E_{\tau,\gamma}\bigg[\frac{\psi(\hat{z}_{N,k}, 0)\, e_{FC,k}^T}{\psi^T(\hat{z}_{N,k}, 0)\psi(\hat{z}_{N,k}, 0) + 1}\bigg] - \alpha_V E_{\tau,\gamma}\bigg[\frac{\Delta\psi(z_k, N-k)\, e_{\mathrm{HJB},k}^T}{\Delta\psi^T(z_k, N-k)\Delta\psi(z_k, N-k) + 1}\bigg], \quad k = 0, \ldots, N-1. \tag{31}$$
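A single sampling instant of the critic update (31) can be sketched as follows. The basis, the state values, and the terminal cost are hypothetical stand-ins chosen only to make the two error signals and the combined gradient step concrete; a real design would use the identified dynamics to predict the terminal state $\hat{z}_{N,k}$:

```python
import numpy as np

pV, N = 5, 10                  # critic basis size and horizon (illustrative)
alpha_V = 0.1                   # small critic tuning parameter

def psi(z, steps_to_go):
    # hypothetical time-dependent basis: state features scaled by time-to-go,
    # vanishing at z = 0 as required for V(0) = 0
    feats = np.array([z, z ** 2, z ** 3, np.tanh(z), np.sin(z)])
    return feats * (1.0 + steps_to_go / N)

W_hat = np.zeros(pV)
k, z_k, z_next, z_N_hat = 0, 0.8, 0.6, 0.1   # current, next, predicted terminal state
r_k = 1.0 * z_k ** 2 + 0.1 * 0.2 ** 2        # stage cost z^T Q_z z + u^T R_z u, u = 0.2
phi_N = 5.0 * z_N_hat ** 2                   # terminal constraint value phi_N(z_N)

dpsi = psi(z_next, N - k - 1) - psi(z_k, N - k)   # Delta psi, as in (26)
e_hjb = W_hat @ dpsi + r_k                        # HJB residual, as in (27)
psi_T = psi(z_N_hat, 0)                           # terminal basis psi(z_hat_N, 0)
e_fc = phi_N - W_hat @ psi_T                      # terminal-constraint error, as in (29)
# one combined gradient-descent step, mirroring the structure of (31)
W_hat = (W_hat + alpha_V * psi_T * e_fc / (psi_T @ psi_T + 1)
               - alpha_V * dpsi * e_hjb / (dpsi @ dpsi + 1))
```

The normalization terms $\psi^T\psi + 1$ and $\Delta\psi^T\Delta\psi + 1$ keep the step size bounded regardless of the basis magnitude, which is what allows the fixed tuning parameter bound of Theorem 1 below.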


Remark 1: When the NNCS system state becomes zero, $z_k = 0$, $k = 0, 1, \ldots, N$, both the stochastic value function (23) and the critic NN approximation (24) become zero. Therefore, the critic NN stops updating once the system state vector converges to zero. According to [14] and [21], this can be considered a PE requirement for the input to the critic NN. In other words, the system state has to be PE long enough for the critic NN to learn the stochastic value function within the finite time $t \in [0, NT_s]$. Similar to [14], [16], and [21], the PE requirement can be satisfied by introducing exploration noise such that $0 < \psi_{\min} \le \|\psi_{\min}(z)\| \le \|\psi(z, k)\|$ and $0 < \Delta\psi_{\min} \le \|\Delta\psi_{\min}(z_k)\| \le \|\Delta\psi(z_k, N-k)\|$, with $\psi_{\min}$ and $\Delta\psi_{\min}$ being positive constants.

Recalling the definition of the critic NN weight estimation error $E_{\tau,\gamma}(\tilde{W}_{V,k})$, its stochastic dynamics can be expressed as

$$E_{\tau,\gamma}(\tilde{W}_{V,k+1}) = E_{\tau,\gamma}(\tilde{W}_{V,k}) - \alpha_V E_{\tau,\gamma}\bigg[\frac{\psi(\hat{z}_{N,k}, 0)\psi^T(\hat{z}_{N,k}, 0)\tilde{W}_{V,k}}{\psi^T(\hat{z}_{N,k}, 0)\psi(\hat{z}_{N,k}, 0) + 1}\bigg] - \alpha_V E_{\tau,\gamma}\bigg[\frac{\psi(\hat{z}_{N,k}, 0)\tilde{\psi}^T(z_N, \hat{z}_{N,k}, 0)W_V}{\psi^T(\hat{z}_{N,k}, 0)\psi(\hat{z}_{N,k}, 0) + 1}\bigg] - \alpha_V E_{\tau,\gamma}\bigg[\frac{\psi(\hat{z}_{N,k}, 0)\varepsilon_{V,0}}{\psi^T(\hat{z}_{N,k}, 0)\psi(\hat{z}_{N,k}, 0) + 1}\bigg] - \alpha_V E_{\tau,\gamma}\bigg[\frac{\Delta\psi(z_k, N-k)\Delta\psi^T(z_k, N-k)\tilde{W}_{V,k}}{\Delta\psi^T(z_k, N-k)\Delta\psi(z_k, N-k) + 1}\bigg] + \alpha_V E_{\tau,\gamma}\bigg[\frac{\Delta\psi(z_k, N-k)\Delta\varepsilon_{V,k}}{\Delta\psi^T(z_k, N-k)\Delta\psi(z_k, N-k) + 1}\bigg], \quad k = 0, \ldots, N-1. \tag{32}$$

Next, the boundedness of the critic NN weight estimation error $E_{\tau,\gamma}(\tilde{W}_{V,k})$ governed by (32) is demonstrated.

Theorem 1 (Boundedness in the Mean of the Critic NN Weight Estimation Error): Given an initial admissible control policy $u_0(z_k)$ for the NNCS, let the critic NN weight update law be designed as (31). Then, there exists a positive tuning parameter $\alpha_V$ satisfying $0 < \alpha_V < (2-\chi)/(\chi+5)$, with $0 < \chi = (\psi_{\min}^2 + \Delta\psi_{\min}^2 + 2)/\big((\psi_{\min}^2+1)(\Delta\psi_{\min}^2+1)\big) < 2$, such that the critic NN weight estimation error (32) is UUB in the mean within the fixed final time. Furthermore, the ultimate bound depends upon the final time $NT_s$ and the bounded initial critic NN weight estimation error $B_{WV,0}$.

Proof: Refer to the Appendix.

C. Actor NN Estimation of Optimal Control Input

According to the universal approximation property of NNs, the ideal finite horizon stochastic optimal control input can be expressed using an actor NN as

$$u^*(z_k) = E_{\tau,\gamma}\big[W_u^T \vartheta(z_k, k) + \varepsilon_{u,k}\big] \quad \forall k = 0, 1, \ldots, N \tag{33}$$

with $E_{\tau,\gamma}(W_u)$ and $E_{\tau,\gamma}(\varepsilon_{u,k})$ denoting the target weight matrix and reconstruction error of the actor NN, respectively, and $\vartheta(z_k, k)$ representing the smooth time-varying actor NN activation function. Moreover, two time-independent functions $\vartheta_{\min}(z_k)$ and $\vartheta_{\max}(z_k)$ can be found such that $\vartheta_{\min}(z_k) \le \vartheta(z_k, k) \le \vartheta_{\max}(z_k)$, $k = 0, \ldots, N$. In addition, the ideal actor NN weight matrix, activation function, and reconstruction error are all considered bounded such that $\|E_{\tau,\gamma}(W_u)\| \le W_{uM}$, $\|E_{\tau,\gamma}(\vartheta(z_k, k))\| \le \vartheta_M$, and $\|E_{\tau,\gamma}(\varepsilon_{u,k})\| \le \varepsilon_{uM}$, with $W_{uM}$, $\vartheta_M$, and $\varepsilon_{uM}$ being positive constants. Next, similar to [14] and [21], the actor NN estimate of (33) can be represented as

$$\hat{u}(z_k) = E_{\tau,\gamma}\big[\hat{W}_{u,k}^T \vartheta(z_k, k)\big] \quad \forall k = 0, 1, \ldots, N \tag{34}$$

where $E_{\tau,\gamma}(\hat{W}_{u,k})$ represents the estimated actor NN weight matrix. Moreover, the estimation error of the actor NN can be defined as the difference between the actual control input (34) applied to the NNCS and the control policy that minimizes the estimated stochastic value function (24) with the identified control coefficient matrix $\hat{G}(z_k)$ during the interval $t \in [0, kT_s]$, as

$$E_{\tau,\gamma}(e_{u,k}) = E_{\tau,\gamma}\bigg[\hat{W}_{u,k}^T \vartheta(z_k, k) + \frac{1}{2} R_z^{-1} \hat{G}^T(z_k)\, \frac{\partial \psi^T(z_{k+1}, N-k-1)}{\partial z_{k+1}}\, \hat{W}_{V,k}\bigg]. \tag{35}$$

Select the stochastic update law for the actor NN actual weight matrix as

$$E_{\tau,\gamma}(\hat{W}_{u,k+1}) = E_{\tau,\gamma}(\hat{W}_{u,k}) - \alpha_u E_{\tau,\gamma}\bigg[\frac{\vartheta(z_k, k)}{\vartheta^T(z_k, k)\vartheta(z_k, k) + 1}\, e_{u,k}^T\bigg], \quad k = 0, \ldots, N-1 \tag{36}$$

with the tuning parameter $\alpha_u$ satisfying $0 < \alpha_u < 1$. Furthermore, the ideal actor NN output (33) should be equal to the control policy that minimizes the ideal stochastic value function (23), which is given by

$$E_{\tau,\gamma}\big[W_u^T \vartheta(z_k, k) + \varepsilon_{u,k}\big] = -\frac{1}{2} E_{\tau,\gamma}\bigg[R_z^{-1} G^T(z_k)\, \frac{\partial\big(\psi^T(z_{k+1}, N-k-1)W_V + \varepsilon_{V,k}\big)}{\partial z_{k+1}}\bigg]. \tag{37a}$$

In other words

$$E_{\tau,\gamma}\bigg[W_u^T \vartheta(z_k) + \varepsilon_{u,k} + \frac{1}{2} R_z^{-1} G^T(z_k)\bigg(\frac{\partial \psi^T(z_{k+1})W_V}{\partial z_{k+1}} + \frac{\partial \varepsilon_{V,k}}{\partial z_{k+1}}\bigg)\bigg] = 0. \tag{37b}$$

Substituting (37b) into (35), the actor NN estimation error can be represented equivalently as

$$E_{\tau,\gamma}(e_{u,k}) = -E_{\tau,\gamma}\big[\tilde{W}_{u,k}^T \vartheta(z_k, k)\big] - \frac{1}{2} E_{\tau,\gamma}\bigg[R_z^{-1}\hat{G}^T(z_k)\frac{\partial\psi^T(z_{k+1}, N-k-1)}{\partial z_{k+1}}\tilde{W}_{V,k}\bigg] - \frac{1}{2} E_{\tau,\gamma}\bigg[R_z^{-1}\tilde{G}^T(z_k)\frac{\partial\psi^T(z_{k+1}, N-k-1)}{\partial z_{k+1}}W_V\bigg] - \varepsilon_{u,k} - \frac{1}{2} E_{\tau,\gamma}\bigg[R_z^{-1}G^T(z_k)\frac{\partial\varepsilon_{V,k}}{\partial z_{k+1}}\bigg] \quad \forall k = 0, 1, \ldots, N \tag{38}$$


where $E_{\tau,\gamma}(\tilde{W}_{u,k}) = E_{\tau,\gamma}(W_u) - E_{\tau,\gamma}(\hat{W}_{u,k})$ is the actor NN weight estimation error and $\tilde{G}(z_k) = G(z_k) - \hat{G}(z_k)$. Next, substituting (38) into (36), the actor NN stochastic weight estimation error dynamics can be expressed as

$$E_{\tau,\gamma}(\tilde{W}_{u,k+1}) = E_{\tau,\gamma}(\tilde{W}_{u,k}) + \alpha_u E_{\tau,\gamma}\bigg[\frac{\vartheta(z_k, k)}{\vartheta^T(z_k, k)\vartheta(z_k, k) + 1}\, e_{u,k}^T\bigg] = E_{\tau,\gamma}(\tilde{W}_{u,k}) - \alpha_u E_{\tau,\gamma}\bigg[\frac{\vartheta(z_k, k)\vartheta^T(z_k, k)\tilde{W}_{u,k}}{\vartheta^T(z_k, k)\vartheta(z_k, k) + 1}\bigg] - \frac{\alpha_u}{2} E_{\tau,\gamma}\bigg[\frac{\vartheta(z_k, k)}{\vartheta^T(z_k, k)\vartheta(z_k, k) + 1}\bigg(R_z^{-1}\hat{G}^T(z_k)\frac{\partial\psi^T(z_{k+1}, N-k-1)}{\partial z_{k+1}}\tilde{W}_{V,k}\bigg)^T\bigg] - \frac{\alpha_u}{2} E_{\tau,\gamma}\bigg[\frac{\vartheta(z_k, k)}{\vartheta^T(z_k, k)\vartheta(z_k, k) + 1}\bigg(R_z^{-1}\tilde{G}^T(z_k)\frac{\partial\psi^T(z_{k+1}, N-k-1)}{\partial z_{k+1}}W_V\bigg)^T\bigg] - \alpha_u E_{\tau,\gamma}\bigg[\frac{\vartheta(z_k, k)}{\vartheta^T(z_k, k)\vartheta(z_k, k) + 1}\bigg(\varepsilon_{u,k} + \frac{1}{2}R_z^{-1}G^T(z_k)\frac{\partial\varepsilon_{V,k}}{\partial z_{k+1}}\bigg)^T\bigg], \quad k = 0, \ldots, N-1. \tag{39}$$
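The three estimators operate together, forward in time, at every sampling instant. The following single-trajectory sketch ties them into one loop under heavy simplifications: a scalar state, known stand-ins for the identified dynamics $\hat{F}$ and $\hat{G}$, tiny hand-picked bases, and the terminal-constraint term of (31) omitted. Every name and value here is an illustrative assumption, not the paper's simulation:

```python
import numpy as np

N = 20                                         # horizon length (illustrative)
alpha_V, alpha_u = 0.05, 0.1                   # critic/actor tuning parameters
Qz, Rz = 1.0, 0.5                              # cost weights

def F(z): return 0.9 * z                       # stand-in identified internal dynamics
def G_hat(z): return 0.5                       # stand-in identified control coefficient

def psi(z, tgo):                               # time-dependent critic basis
    return np.array([z ** 2, z ** 4]) * (1.0 + tgo / N)

def dpsi_dz(z, tgo):                           # its gradient with respect to z
    return np.array([2 * z, 4 * z ** 3]) * (1.0 + tgo / N)

def theta(z):                                  # actor basis
    return np.array([z, z ** 3])

W_V = np.zeros(2)
W_u = np.array([-0.3, 0.0])                    # initial admissible (stabilizing) policy
z = 1.0
for k in range(N - 1):
    u = W_u @ theta(z)                         # actor control, as in (34)
    z_next = F(z) + G_hat(z) * u
    # critic: one forward-in-time HJB-residual step (terminal term of (31) omitted)
    dp = psi(z_next, N - k - 1) - psi(z, N - k)
    e_hjb = W_V @ dp + Qz * z ** 2 + Rz * u ** 2
    W_V = W_V - alpha_V * dp * e_hjb / (dp @ dp + 1)
    # actor: pull the control toward the value-function minimizer, as in (35)-(36)
    u_star = -0.5 / Rz * G_hat(z) * (W_V @ dpsi_dz(z_next, N - k - 1))
    th = theta(z)
    W_u = W_u - alpha_u * th * (u - u_star) / (th @ th + 1)
    z = z_next
```

The essential property on display is that nothing runs backward from the terminal time and no inner iteration loop exists: each sampling instant performs exactly one identifier-informed critic step and one actor step, which is what makes the scheme implementable within a fixed sampling interval.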

In this approach, due to the novel NN identifier, the need for the NNCS control coefficient matrix $G(z_k)$ is relaxed, which is itself a contribution when compared with [12]–[14]. Next, the closed-loop stability of the NNCS with the proposed novel time-based NDP algorithm will be demonstrated.

D. Closed-Loop Stability

In this section, we prove that the closed-loop NNCS is UUB in the mean within a fixed final time, with the ultimate bounds dependent upon initial conditions and final time. Moreover, as the final time instant goes to infinity, $k \to \infty$, the estimated control input approaches the infinite horizon optimal control input. Before demonstrating the main theorem on closed-loop stability, the flowchart of the proposed novel time-based NDP finite horizon optimal control design is shown in Fig. 2. Similar to [14] and [21], the initial NNCS system state is considered to reside in a compact set $\Omega$ due to the initial admissible control input $u_0(z_k)$. Furthermore, the actor NN activation function, the critic NN activation function, and its gradient are all considered bounded in $\Omega$ as $\left\|E_{\tau,\gamma}(\psi(z_k,k))\right\| \le \psi_M$, $\left\|E_{\tau,\gamma}\left[\partial\psi(z_k)/\partial z_k\right]\right\| \le \psi'_M$, and $\left\|E_{\tau,\gamma}(\vartheta(z_k,k))\right\| \le \vartheta_M$.

In addition, the PE condition will be satisfied by introducing exploration noise [14], [21]. The three NN tuning parameters $\alpha_I$, $\alpha_V$, and $\alpha_u$ will be derived to guarantee that all future system states remain in $\Omega$. To proceed, the following lemma is needed before introducing the theorem.

Lemma 1: Let an optimal control policy be utilized for the controllable NNCS (3) such that (3) is asymptotically stable in the mean [21]. Then, the closed-loop NNCS dynamics $E_{\tau,\gamma}[F(z_k) + G(z_k)u^*(z_k)]$ satisfy

$$\left\| E_{\tau,\gamma}\left[F(z_k) + G(z_k)u^*(z_k)\right] \right\| \le l_o \left\| E_{\tau,\gamma}(z_k) \right\|, \quad \forall k = 0, 1, \ldots, N \qquad (40)$$

Fig. 2. Flowchart of proposed finite horizon stochastic optimal control.

where $u^*(z_k)$ is the optimal control input and $0 < l_o < 1$ is a positive constant.

Proof: The proof follows similar to [14] and [21] and is omitted here.

Theorem 2 (Convergence of Stochastic Optimal Control Input): Let $u_0(z_k)$ be any initial admissible control policy for the NNCS (3) such that (40) holds with $0 < l_o < 1/2$. Given the NN weight update laws for the identifier, critic, and actor NN as (18), (31), and (36), respectively, there exist three positive tuning parameters $\alpha_I$, $\alpha_V$, and $\alpha_u$ satisfying $0 < \alpha_I < \min\{1/(2\zeta_M),\ \zeta_{\min}/(\sqrt{2}\,\zeta_M)\}$, $0 < \alpha_V < (2-\chi)/(\chi+5)$, and $0 < \alpha_u < 1$, with $0 < \chi = (\psi_{\min}^2 + \bar\psi_{\min}^2 + 2)/((\psi_{\min}^2+1)(\bar\psi_{\min}^2+1)) < 2$ defined in Theorem 1, such that the NNCS system state $E_{\tau,\gamma}(z_k)$, identification error $E_{\tau,\gamma}(e_{I,k})$, NN identifier weight estimation error $E_{\tau,\gamma}(\tilde W_{I,k})$, and critic and actor NN weight estimation errors $E_{\tau,\gamma}(\tilde W_{V,k})$ and $E_{\tau,\gamma}(\tilde W_{u,k})$ are all UUB (A.12) in the mean within the fixed final time $t \in [0, NT_s]$. In addition, the ultimate bounds depend upon the final time instant $NT_s$, the bounded initial state $B_{z,0}$, the initial identification error bound $B_{eI,0}$, and the initial weight estimation error bounds of the NN identifier, critic, and actor NN, $B_{WI,0}$, $B_{WV,0}$, and $B_{Wu,0}$. Moreover, $\|u_k^* - \hat u_k\| \le B_u$, where $B_u$ is a small positive bound.

Proof: Refer to the Appendix.

IV. SIMULATION RESULTS

The performance of the proposed finite horizon stochastic optimal regulation control of NNCS in the presence of unknown

Fig. 3. Distribution of network-induced delay in NNCS.

Fig. 4. Distribution of network-induced packet losses in NNCS (1 denotes that the packet is received and 0 denotes that the packet is lost).

Fig. 5. State regulation errors with the proposed controller.

system dynamics and network imperfections has been evaluated. The continuous-time version of the original two-link robot system is given as [30]

$$\dot x = f(x) + g(x)u \qquad (41)$$

with internal dynamics $f(x)$ and control coefficient matrix $g(x)$ given as

$$
f(x) = \begin{bmatrix}
x_3 \\[2pt]
x_4 \\[2pt]
\dfrac{-(2x_3x_4 + x_4^2 - x_3^2 - x_3^2\cos x_2)\sin x_2 + 20\cos x_1 - 10\cos(x_1+x_2)\cos x_2}{\cos^2 x_2 - 2} \\[8pt]
\dfrac{\begin{pmatrix} 2x_3x_4 + x_4^2 + 2x_3x_4\cos x_2 + x_4^2\cos x_2 + 3x_3^2 + 2x_3^2\cos x_2 \\ +\,20(\cos(x_1+x_2) - \cos x_1)(1+\cos x_2) - 10\cos x_2\cos(x_1+x_2)\end{pmatrix}}{\cos^2 x_2 - 2}
\end{bmatrix}
$$

$$
g(x) = \begin{bmatrix}
0 & 0 \\[2pt]
0 & 0 \\[2pt]
\dfrac{1}{2-\cos^2 x_2} & \dfrac{-1-\cos x_2}{2-\cos^2 x_2} \\[8pt]
\dfrac{-1-\cos x_2}{2-\cos^2 x_2} & \dfrac{3+2\cos x_2}{2-\cos^2 x_2}
\end{bmatrix}. \qquad (42)
$$

The network parameters are selected as follows [21]: 1) the sampling time is $T_s = 10$ ms; 2) the upper bound on the network-induced delay is $\bar d T_s = 20$ ms; 3) the mean network-induced delays are $E(\tau_{sc}) = 8$ ms and $E(\tau) = 15$ ms; 4) the network-induced packet losses follow a Bernoulli distribution with $\bar\gamma = 0.3$; and 5) the final time is set as $t_f = NT_s = 20$ s with $N = 2000$ simulation time steps. The distribution of network-induced delays and packet losses is shown in Figs. 3 and 4.

A. State Regulation Errors and Controller Performance
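Network imperfection traces like those in Figs. 3 and 4 can be mimicked with a short script. The uniform delay distribution below is an illustrative assumption (the paper specifies only the 20-ms bound and the mean values), while the packet-loss indicators follow the stated Bernoulli model with mean loss probability 0.3.

```python
import numpy as np

rng = np.random.default_rng(1)

Ts = 0.01          # sampling time: 10 ms
d_bar = 2          # delay bound: 2 sampling intervals (20 ms)
N = 2000           # number of simulation steps (t_f = N*Ts = 20 s)
gamma_bar = 0.3    # mean packet-loss probability

# Network-induced delays: uniform on [0, d_bar*Ts] is an assumption for illustration;
# the paper reports only the upper bound and the mean delay values.
tau = rng.uniform(0.0, d_bar * Ts, size=N)

# Packet losses: i.i.d. Bernoulli indicators, 1 = received, 0 = lost.
received = (rng.random(N) >= gamma_bar).astype(int)

print(tau.max() <= d_bar * Ts, received.mean())
```

Such a trace can then drive a simulated NNCS loop, with each control packet either delayed by `tau[k]` or dropped when `received[k]` is 0.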

Note that the problem attempted in this paper is optimal regulation, which implies that the NNCS states should converge to the origin in an optimal manner. After incorporating the network imperfections into the NNCS, the augmented state vector becomes $z_k = [x_k^T\ u_{k-1}^T\ u_{k-2}^T]^T \in \mathbb{R}^{8\times 1}$, and the initial admissible control and initial state are selected as

$$
u_o(z_k) = \begin{bmatrix} -100 & 0 & -100 & 0 & 0 & 0 & 0 & 0 \\ 0 & -100 & 0 & -100 & 0 & 0 & 0 & 0 \end{bmatrix} z_k
$$

and

$$
x_0 = \begin{bmatrix} -\dfrac{\pi}{6} & \dfrac{\pi}{6} & 0 & 0 \end{bmatrix}^T
$$

respectively. Moreover, the NN identifier activation function is given as $\tanh\{(z_{k,1})^2, z_{k,1}z_{k,2}, \ldots, (z_{k,8})^2, \ldots, (z_{k,1})^6, (z_{k,1})^5 z_{k,2}, \ldots, (z_{k,8})^6\}$; the state-dependent part of the critic NN activation function is defined as the sigmoid of a sixth-order polynomial [i.e., sigmoid$\{(z_{k,1})^2, z_{k,1}z_{k,2}, \ldots, (z_{k,8})^2, \ldots, (z_{k,1})^6, (z_{k,1})^5 z_{k,2}, \ldots, (z_{k,8})^6\}$]; and the time-dependent part of the critic NN activation function is selected as a saturated polynomial time function [i.e., sat$\{(N-k)^{31}, (N-k)^{30}, \ldots, 1; \cdots; 1, (N-k)^{31}, \ldots, N-k\}$]. The activation function of the actor NN is selected as the gradient of the critic NN activation function. The saturation operator on the time function ensures that the magnitude of the time function stays within a reasonable range so that the NN weights remain computable. Moreover, all three NNs have two layers, where the first layer is the input layer and the second is the hidden layer. The NN identifier has 39 hidden neurons, whereas the critic and actor NNs have 32 hidden neurons each. A feedforward structure is selected for all the NNs. The tuning parameters of the NN identifier, critic NN, and actor NN are set to $\alpha_I = 0.03$, $\alpha_V = 0.01$, and $\alpha_u = 0.5$. The initial hidden-layer weights of the NN identifier and critic NN are selected as zero, whereas the actor NN hidden-layer weight matrix is set to reflect the initial admissible control at the beginning of the simulation, and the input-layer weights of all three NNs are chosen as all ones. The results are shown in Figs. 5–9.

First, the state regulation error and the stochastic optimal control inputs are studied. As shown in Figs. 5 and 6, the proposed stochastic optimal controller can force the NNCS

Fig. 6. Stochastic optimal control inputs.

Fig. 7. State regulation errors using NN feedback linearizing controller with network imperfections.

Fig. 8. Estimated NN weights of (a) critic NN and (b) actor NN.

Fig. 9. Identification errors.
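A sketch of the saturated time-dependent critic activation described in the simulation setup may clarify why the saturation operator is needed: near $k = 0$, the raw powers $(N-k)^{31}$ reach astronomically large values. The saturation level `sat_limit` is an assumed illustrative value; the paper does not report the one actually used.

```python
import numpy as np

N = 2000          # final time step from the simulation setup
sat_limit = 1e6   # illustrative saturation level (assumption, not from the paper)

def time_activation(k, powers=range(31, -1, -1)):
    """Time-dependent part of the critic activation: powers of (N - k),
    saturated so the regressor magnitudes stay numerically tractable."""
    raw = np.array([float(N - k) ** p for p in powers])
    return np.clip(raw, -sat_limit, sat_limit)

phi0 = time_activation(0)   # far from terminal time: large entries are clipped
phiN = time_activation(N)   # at terminal time, (N - k) = 0: only the constant term survives
print(phi0.max(), phiN[-1])
```

Without the clip, the unsaturated entry $(N-0)^{31} = 2000^{31} \approx 2\times 10^{102}$ would dwarf every other regressor component and make the weight updates numerically useless.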

state regulation errors to converge close to zero within a fixed final time, even in the presence of uncertain NNCS dynamics and network imperfections. Moreover, the stochastic control signal is also bounded in the mean. The initial admissible control selection affects the transient performance, as do the NN tuning parameters, and these have to be carefully selected for suitable transient performance.

Next, the effect of network imperfections is evaluated. According to [16] and [30], an NN-based feedback linearization control input can maintain the stability of the two-link robot system (41). However, after introducing the network-induced delays and packet losses shown in Figs. 3 and 4, this feedback linearization controller cannot retain the stability of the NNCS, as shown in Fig. 7. This in turn confirms that a controller should be carefully designed after incorporating the effects of network imperfections.

Now, the evolution of the NN weights is studied. In Fig. 8, the actual weights of the critic and actor NNs are shown. Within the fixed final time (i.e., $t \in [0, 20\ \mathrm{s}]$), the actual weights of the critic and actor NNs converge and remain UUB in the mean, consistent with Theorem 2. Furthermore, as shown in Fig. 9, the identification error converges close to zero, which indicates that the NN identifier learns the system dynamics properly.

B. HJB Equation and Terminal Constraint Estimation Errors

In this section, the HJB equation and terminal constraint estimation errors are analyzed. It is well known that

the proposed control input approaches the finite horizon optimal control input [15], [25] only when both the HJB equation error and the terminal constraint error converge to zero. If the HJB equation error is near zero, then the solution of the HJB equation is optimal, and the control input derived from the value function becomes optimal. In Fig. 10, within the fixed final time $t \in [0, 20\ \mathrm{s}]$, both the HJB equation error and the terminal constraint estimation error converge close to zero. This indicates that the proposed stochastic optimal control inputs approach the finite horizon optimal control inputs.

C. Cost Function Comparison

Subsequently, the cost function of the proposed finite horizon stochastic optimal controller is studied. For comparison, with known system dynamics and network imperfections, a conventional NN-based feedback linearization control [16] and an ideal offline finite horizon optimal control [25] of the NNCS have been included. In Fig. 11, the cost function comparison for the three controllers is shown. Compared with the conventional NN-based NNCS feedback linearization control, the proposed optimal control design can deliver a much better

performance, since optimality is neglected in feedback linearization control. Moreover, in contrast to the traditional offline NNCS finite horizon optimal control [25], the cost function of the proposed scheme is slightly higher due to system uncertainties and NN approximation, whereas the proposed design is more practical since prior knowledge of the network imperfections and system dynamics is not needed, unlike in the case of the traditional offline optimal controller design. However, the computational complexity of an optimal controller is higher than that of a traditional controller. Despite the increase in computational cost, these advanced controllers can still be realized cheaply in practice due to the drastic increase in processor speed. Therefore, advanced controllers, such as the one proposed, can be utilized on NNCS to generate an improvement in performance over traditional controllers. In the end, the simulation results shown in Figs. 3–11 confirm that the proposed time-based NDP scheme renders acceptable performance in the presence of uncertain NNCS dynamics and network imperfections.

Fig. 10. Estimated HJB equation and terminal constraint errors.

Fig. 11. Comparison of cost functions.

V. CONCLUSION

In this paper, a novel time-based finite horizon NDP scheme was proposed for NNCS using an NN identifier and critic and actor NNs to obtain the stochastic optimal control policy in the presence of uncertain system dynamics due to network imperfections. Using historical inputs and the NN identifier, the requirement of knowing both the internal dynamics and the control coefficient matrix was relaxed. Furthermore, the critic NN was derived to estimate the stochastic solution of the HJB equation online while satisfying the terminal constraint. An initial admissible control ensures that the NNCS is stable while the NN identifier and the critic and actor NNs are being tuned. Using Lyapunov and geometric sequence theories, the NNCS system state, identification error, and weight estimation errors of the NN identifier and the critic and actor NNs have been proven to be UUB in the mean, with ultimate bounds dependent upon the initial condition $B_{CL,0}$ and the final time instant $NT_s$. As the final time instant $NT_s$ increases, all the ultimate bounds decrease and converge to the bounds derived for the infinite horizon case [21].

APPENDIX

PROOF OF THEOREM 1

Consider the Lyapunov function candidate

$$L_{V,k} = \mathrm{tr}\left\{E_{\tau,\gamma}\left(\tilde W_{V,k}^T \tilde W_{V,k}\right)\right\}, \quad \forall k = 0, 1, \ldots, N. \qquad (\text{A.1})$$

Then, using (32), the first difference of (A.1) can be derived for $k = 0, 1, \ldots, N$ as

$$
\begin{aligned}
\Delta L_{V,k} ={}& \mathrm{tr}\left\{E_{\tau,\gamma}\left(\tilde W_{V,k+1}^T \tilde W_{V,k+1}\right)\right\} - \mathrm{tr}\left\{E_{\tau,\gamma}\left(\tilde W_{V,k}^T \tilde W_{V,k}\right)\right\} \\
\le{}& -4\alpha_V\,\mathrm{tr}\left\{E_{\tau,\gamma}\left(\tilde W_{V,k}^T\tilde W_{V,k}\right)\right\}
 + 2\alpha_V\,\mathrm{tr}\left\{E_{\tau,\gamma}\left[\frac{\tilde W_{V,k}^T\tilde W_{V,k}}{\psi^T(\hat z_{N,k},0)\psi(\hat z_{N,k},0)+1}\right]\right\} \\
& + \alpha_V^2\,\mathrm{tr}\left\{E_{\tau,\gamma}\left[\frac{\tilde W_{V,k}^T\psi(\hat z_{N,k},0)\psi^T(\hat z_{N,k},0)\tilde W_{V,k}}{(\psi^T(\hat z_{N,k},0)\psi(\hat z_{N,k},0)+1)^2}\right]\right\}
 + \mathrm{tr}\left\{E_{\tau,\gamma}\left[W_V^T\tilde\psi(z_N,\hat z_{N,k},0)\tilde\psi^T(z_N,\hat z_{N,k},0)W_V\right]\right\} \\
& + \alpha_V^2\,\mathrm{tr}\left\{E_{\tau,\gamma}\left[\frac{\tilde W_{V,k}^T\psi(\hat z_{N,k},0)\psi^T(\hat z_{N,k},0)\tilde W_{V,k}}{(\psi^T(\hat z_{N,k},0)\psi(\hat z_{N,k},0)+1)^2}\right]\right\}
 + \mathrm{tr}\left\{\varepsilon_{V,0}^T\varepsilon_{V,0}\right\} \\
& + 2\alpha_V\,\mathrm{tr}\left\{E_{\tau,\gamma}\left[\frac{\tilde W_{V,k}^T\tilde W_{V,k}}{\bar\psi^T(z_k,N-k)\bar\psi(z_k,N-k)+1}\right]\right\}
 + \alpha_V^2\,\mathrm{tr}\left\{E_{\tau,\gamma}\left[\frac{\tilde W_{V,k}^T\bar\psi(z_k,N-k)\bar\psi^T(z_k,N-k)\tilde W_{V,k}}{(\bar\psi^T(z_k,N-k)\bar\psi(z_k,N-k)+1)^2}\right]\right\} \\
& + \mathrm{tr}\left\{\varepsilon_{V,k}^T\varepsilon_{V,k}\right\} + 10\alpha_V^2\,\mathrm{tr}\left\{E_{\tau,\gamma}\left(\tilde W_{V,k}^T\tilde W_{V,k}\right)\right\} + 5\alpha_V^2\,\mathrm{tr}\left\{E_{\tau,\gamma}\left(W_V^TW_V\right)\right\} \\
& + 5\alpha_V^2\,\mathrm{tr}\left\{E_{\tau,\gamma}\left[\frac{\varepsilon_{V,k}^T\varepsilon_{V,k}}{\bar\psi^T(z_k,N-k)\bar\psi(z_k,N-k)+1}\right]\right\}
 + 5\alpha_V^2\,\mathrm{tr}\left\{E_{\tau,\gamma}\left[\frac{\varepsilon_{V,0}^T\varepsilon_{V,0}}{\psi^T(\hat z_{N,k},0)\psi(\hat z_{N,k},0)+1}\right]\right\} \\
\le{}& \left[-2\alpha_V\left(2 - \frac{\psi_{\min}^2+\bar\psi_{\min}^2+2}{(\psi_{\min}^2+1)(\bar\psi_{\min}^2+1)}\right)
 + \alpha_V^2\left(\frac{2\psi_{\min}^2+\bar\psi_{\min}^2+3}{(\psi_{\min}^2+1)(\bar\psi_{\min}^2+1)}+5\right)\right]\left\|E_{\tau,\gamma}(\tilde W_{V,k})\right\|^2 \\
& + \left(4\psi_M^2+5\alpha_V^2\right)W_{VM}^2 + \left(1+\frac{5\alpha_V^2}{\psi_{\min}^2+1}\right)\varepsilon_{VM}^2 + \left(1+\frac{5\alpha_V^2}{\bar\psi_{\min}^2+1}\right)\varepsilon_{VM}^2 \\
\le{}& -2\alpha_V\left(2-\chi-(\chi+5)\alpha_V\right)\left\|E_{\tau,\gamma}(\tilde W_{V,k})\right\|^2 + \left(4\psi_M^2+5\alpha_V^2\right)W_{VM}^2 \\
& + \left(1+\frac{5\alpha_V^2}{\psi_{\min}^2+1}\right)\varepsilon_{VM}^2 + \left(1+\frac{5\alpha_V^2}{\bar\psi_{\min}^2+1}\right)\varepsilon_{VM}^2 \\
\le{}& -2\alpha_V\left(2-\chi-(\chi+5)\alpha_V\right)\left\|E_{\tau,\gamma}(\tilde W_{V,k})\right\|^2 + \varepsilon_{TV}^2, \quad k = 0, 1, \ldots, N \qquad (\text{A.2})
\end{aligned}
$$

where

$$\chi = \frac{\psi_{\min}^2+\bar\psi_{\min}^2+2}{(\psi_{\min}^2+1)(\bar\psi_{\min}^2+1)} < 2, \quad 0 < \psi_{\min} \le \left\|\psi(z_k)\right\|, \quad 0 < \bar\psi_{\min} \le \left\|\bar\psi(z_k,N-k)\right\|$$

can be guaranteed by the PE condition given in Remark 1, and $\varepsilon_{TV}^2 = (4\psi_M^2+5\alpha_V^2)W_{VM}^2 + (1+5\alpha_V^2/(\psi_{\min}^2+1))\varepsilon_{VM}^2 + (1+5\alpha_V^2/(\bar\psi_{\min}^2+1))\varepsilon_{VM}^2$ for $k = 0, 1, \ldots, N$.

Note that when the tuning parameter is selected as $0 < \alpha_V < (2-\chi)/(\chi+5)$, the term $-2\alpha_V(2-\chi-(\chi+5)\alpha_V)\|E_{\tau,\gamma}(\tilde W_{V,k})\|^2$ in (A.2) is less than zero. Therefore, under a fixed final time, the critic NN weight estimation error is proven to be UUB in the mean, with the ultimate bound being a function of the initial conditions and the final time $NT_s$, as in (A.5). Assume that the critic NN weight estimation error is initialized with a bounded positive constant $B_{WV,0}$, that is, $\|E(\tilde W_{V,0})\| \le B_{WV,0}$. According to standard Lyapunov analysis [16], geometric sequence theory [24], and (A.2), the Lyapunov function can be expressed during the interval $t \in [0, NT_s]$ as

$$L_{V,k} = \Delta L_{V,k-1} + \Delta L_{V,k-2} + \cdots + \Delta L_{V,0} + L_{V,0} = \sum_{j=0}^{k-1}\Delta L_{V,j} + L_{V,0}, \quad \forall k = 0, 1, \ldots, N. \qquad (\text{A.3})$$

Using (A.2), (A.3) can be expressed as

$$
\begin{aligned}
L_{V,k} ={}& \sum_{j=0}^{k-1}\left[-2\alpha_V\left(2-\chi-(\chi+5)\alpha_V\right)\left\|E_{\tau,\gamma}(\tilde W_{V,j})\right\|^2 + \varepsilon_{TV}^2\right] + L_{V,0} \\
\le{}& \left[1-2\alpha_V\left(2-\chi-(\chi+5)\alpha_V\right)\right]^k\left\|E_{\tau,\gamma}(\tilde W_{V,0})\right\|^2 + \sum_{j=0}^{k-1}\left[1-2\alpha_V\left(2-\chi-(\chi+5)\alpha_V\right)\right]^j\varepsilon_{TV}^2 \\
\le{}& \eta^k\left\|E_{\tau,\gamma}(\tilde W_{V,0})\right\|^2 + \frac{1-\eta^k}{1-\eta}\,\varepsilon_{TV}^2, \quad \forall k = 0, 1, \ldots, N \qquad (\text{A.4})
\end{aligned}
$$

with $\eta = 1-2\alpha_V(2-\chi-(\chi+5)\alpha_V)$. Since $L_{V,k} = \|E_{\tau,\gamma}(\tilde W_{V,k})\|^2$, we have

$$\left\|E_{\tau,\gamma}(\tilde W_{V,k})\right\| \le \sqrt{\eta^k\left\|E_{\tau,\gamma}(\tilde W_{V,0})\right\|^2 + \frac{1-\eta^k}{1-\eta}\,\varepsilon_{TV}^2} \le \sqrt{\eta^k B_{V,0}^2 + \frac{1-\eta^k}{1-\eta}\,\varepsilon_{TV}^2} \equiv B_{V,k}, \quad \forall k = 0, 1, \ldots, N. \qquad (\text{A.5})$$

Since $0 < \alpha_V < (2-\chi)/(\chi+5)$, we have $0 < \eta < 1$. Therefore, the term $\eta^k$ decreases as the final time instant $kT_s$ increases, and the ultimate bound $B_{V,k}$ also decreases with time $kT_s$; that is, the bounded critic NN weight estimation error decreases with the time horizon. In addition, as time goes to infinity, $k \to \infty$, the ultimate bound on the critic NN weight estimation error equals that of the infinite horizon case.

PROOF OF THEOREM 2

Consider the Lyapunov function candidate for the closed-loop NNCS

$$L_{CL,k} = L_{z,k} + L_{I,k} + L_{V,k} + L_{u,k} + L_{A,k} + L_{B,k}, \quad \forall k = 0, \ldots, N \qquad (\text{A.6})$$

where

$$L_{z,k} = \kappa\left\|E_{\tau,\gamma}(z_k)\right\|, \quad L_{V,k} = \Pi\,\mathrm{tr}\left\{E_{\tau,\gamma}\left(\tilde W_{V,k}^T\tilde W_{V,k}\right)\right\}, \quad L_{I,k} = \mathrm{tr}\left\{E_{\tau,\gamma}\left(e_{I,k}^Te_{I,k}\right)\right\} + \Gamma\,\mathrm{tr}\left\{E_{\tau,\gamma}\left(\tilde W_{I,k}^T\tilde W_{I,k}\right)\right\},$$
$$L_{u,k} = \left\|E_{\tau,\gamma}(\tilde W_{u,k})\right\|, \quad L_{A,k} = \omega\left\|E_{\tau,\gamma}(\tilde W_{I,k})\right\|, \quad L_{B,k} = \rho\left\|E_{\tau,\gamma}(\tilde W_{V,k})\right\|, \quad \forall k = 0, \ldots, N$$

with

$$\kappa = \frac{\alpha_u}{2G_M\vartheta_M}\left(1-\frac{1}{\vartheta_{\min}^2+1}\right), \quad \omega = \alpha_u\lambda_{\max}(R_z^{-1})\,\psi'_M W_{VM}\upsilon_{IM}, \quad \rho = \frac{\alpha_u\lambda_{\max}(R_z^{-1})G_M\psi'_M}{\alpha_V(2-\chi)},$$
$$\Pi = \frac{\alpha_u\psi_M'^2}{\alpha_V\left(2-\chi-(\chi+5)\alpha_V\right)}\,I, \quad \text{and} \quad \Gamma = \frac{\alpha_u\lambda^2_{\max}(R_z^{-1})\upsilon_{IM}^2}{\zeta_{\min}^2-2\alpha_I^2\zeta_M^2}\,I$$

where $I$ is the identity matrix, $\lambda_{\max}(R_z^{-1})$ is the maximum eigenvalue of $R_z^{-1}$, and $\vartheta_M$, $\vartheta_{\min}$, $G_M$, $\psi_M$, $\psi'_M$, $W_{VM}$, $\upsilon_{IM}$, $\zeta_{\min}$, and $\zeta_M$ are defined in Lemma 1 and Theorem 1. Then, the first difference of (A.6) can be expressed as

$$\Delta L_{CL,k} = \Delta L_{z,k} + \Delta L_{I,k} + \Delta L_{V,k} + \Delta L_{u,k} + \Delta L_{A,k} + \Delta L_{B,k}. \qquad (\text{A.7})$$
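The tuning-parameter window for the critic NN can be checked numerically. In the sketch below, the activation lower bounds `psi_min` and `psi_bar_min` are assumed illustrative values (the actual bounds depend on the chosen activation functions and the PE excitation); with them, $\chi$ falls below 2 and the resulting window contains the simulation's choice $\alpha_V = 0.01$.

```python
# Illustrative lower bounds on the activation norms (assumed values, not from the paper).
psi_min = 0.5        # lower bound on ||psi(z_k)||
psi_bar_min = 0.4    # lower bound on ||psi_bar(z_k, N - k)||

# chi as defined in Theorem 1; for positive bounds, 2ab + a + b >= 0 guarantees chi < 2.
chi = (psi_min**2 + psi_bar_min**2 + 2) / ((psi_min**2 + 1) * (psi_bar_min**2 + 1))

# Admissible critic learning-rate window from Theorem 2: 0 < alpha_V < (2 - chi)/(chi + 5).
alpha_V_max = (2 - chi) / (chi + 5)
print(chi, alpha_V_max)
```

With these assumed bounds, `alpha_V_max` evaluates to roughly 0.05, so the simulation value $\alpha_V = 0.01$ sits comfortably inside the admissible window.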

Recalling (3) and using the Cauchy–Schwarz inequality and Lemma 1, $\Delta L_{z,k}$ can be expressed as

$$
\begin{aligned}
\Delta L_{z,k} ={}& \kappa\left\|E_{\tau,\gamma}(z_{k+1})\right\| - \kappa\left\|E_{\tau,\gamma}(z_k)\right\| \\
\le{}& \kappa\left\|E_{\tau,\gamma}\left[F(z_k)+G(z_k)u^*(z_k)\right]\right\| + \kappa\left\|E_{\tau,\gamma}\left(G(z_k)e_{u,k}\right)\right\| - \kappa\left\|E_{\tau,\gamma}(z_k)\right\| \\
\le{}& l_o\kappa\left\|E_{\tau,\gamma}(z_k)\right\| + \kappa\left\|E_{\tau,\gamma}\left(G(z_k)e_{u,k}\right)\right\| - \kappa\left\|E_{\tau,\gamma}(z_k)\right\| \\
\le{}& -(1-l_o)\kappa\left\|E_{\tau,\gamma}(z_k)\right\| + G_M\kappa\left\|E_{\tau,\gamma}\left(\tilde W_{u,k}^T\vartheta(z_k)\right)\right\| + G_M\kappa\left\|E_{\tau,\gamma}(\varepsilon_{u,k})\right\| \\
\le{}& -(1-l_o)\kappa\left\|E_{\tau,\gamma}(z_k)\right\| + G_M\vartheta_M\kappa\left\|E_{\tau,\gamma}(\tilde W_{u,k})\right\| + G_M\kappa\varepsilon_{uM}, \quad k = 0, \ldots, N. \qquad (\text{A.8})
\end{aligned}
$$


Next, recalling (A.6), the first difference $\Delta L_{u,k}$ can be represented as

$$
\begin{aligned}
\Delta L_{u,k} ={}& \left\|E_{\tau,\gamma}(\tilde W_{u,k+1})\right\| - \left\|E_{\tau,\gamma}(\tilde W_{u,k})\right\| \\
\le{}& \left\|\left(1-\alpha_u E_{\tau,\gamma}\!\left[\frac{\vartheta(z_k)\vartheta^T(z_k)}{\vartheta^T(z_k)\vartheta(z_k)+1}\right]\right)E_{\tau,\gamma}(\tilde W_{u,k})\right\| - \left\|E_{\tau,\gamma}(\tilde W_{u,k})\right\| \\
& + \frac{\alpha_u}{2}\left\|E_{\tau,\gamma}\!\left[\frac{\vartheta(z_k)}{\vartheta^T(z_k)\vartheta(z_k)+1}\,R_z^{-1}\hat G^T(z_k)\frac{\partial\psi^T(z_{k+1},N-k-1)}{\partial z_{k+1}}\tilde W_{V,k}\right]\right\| \\
& + \frac{\alpha_u}{2}\left\|E_{\tau,\gamma}\!\left[\frac{\vartheta(z_k)}{\vartheta^T(z_k)\vartheta(z_k)+1}\,R_z^{-1}\tilde G^T(z_k)\frac{\partial\psi^T(z_{k+1},N-k-1)}{\partial z_{k+1}}\tilde W_{V,k}\right]\right\| \\
& + \frac{\alpha_u}{2}\left\|E_{\tau,\gamma}\!\left[\frac{\vartheta(z_k)}{\vartheta^T(z_k)\vartheta(z_k)+1}\,R_z^{-1}\tilde G^T(z_k)\frac{\partial\psi^T(z_{k+1},N-k-1)}{\partial z_{k+1}}W_V\right]\right\| \\
& + \frac{\alpha_u}{2}\left\|E_{\tau,\gamma}\!\left[\frac{\vartheta(z_k)}{\vartheta^T(z_k)\vartheta(z_k)+1}\left(R_z^{-1}G^T(z_k)\frac{\partial\varepsilon_{V,k}^T}{\partial z_{k+1}}-\varepsilon_{u,k}\right)\right]\right\| \\
\le{}& -\alpha_u\left(1-\frac{1}{\vartheta_{\min}^2+1}\right)\left\|E_{\tau,\gamma}(\tilde W_{u,k})\right\| + \frac{\alpha_u}{2}\lambda_{\max}(R_z^{-1})G_M\psi'_M\left\|E_{\tau,\gamma}(\tilde W_{V,k})\right\| \\
& + \frac{\alpha_u}{2}\lambda^2_{\max}(R_z^{-1})\upsilon_{IM}^2\left\|E_{\tau,\gamma}(\tilde W_{I,k})\right\|^2 + \frac{\alpha_u}{4}\psi_M'^2\left\|E_{\tau,\gamma}(\tilde W_{V,k})\right\|^2 \\
& + \frac{\alpha_u}{2}\lambda_{\max}(R_z^{-1})\psi'_M W_{VM}\upsilon_{IM}\left\|E_{\tau,\gamma}(\tilde W_{I,k})\right\| + \frac{\alpha_u}{2}\lambda^2_{\max}(R_z^{-1})\varepsilon_{IM}^2 \\
& + \frac{\alpha_u}{2}\lambda_{\max}(R_z^{-1})G_M\varepsilon_{VM} + \frac{\alpha_u}{2}\varepsilon_{uM}, \quad k = 0, 1, \ldots, N. \qquad (\text{A.9})
\end{aligned}
$$

Next, according to (21) and Lemma 1, the first difference of $L_{A,k}$ can be derived as

$$
\begin{aligned}
\Delta L_{A,k} ={}& \omega\left\|E_{\tau,\gamma}(\tilde W_{I,k+1})\right\| - \omega\left\|E_{\tau,\gamma}(\tilde W_{I,k})\right\| \\
\le{}& \alpha_I\zeta_M\omega\left\|E_{\tau,\gamma}(\tilde W_{I,k})\right\| + \omega\left\|E_{\tau,\gamma}\left(\alpha_I\varepsilon_{I,k-1}-\varepsilon_{I,k}\right)\right\| - \omega\left\|E_{\tau,\gamma}(\tilde W_{I,k})\right\| \\
\le{}& -(1-\alpha_I\zeta_M)\,\omega\left\|E_{\tau,\gamma}(\tilde W_{I,k})\right\| + \omega\bar\varepsilon_{IM}, \quad k = 0, 1, \ldots, N. \qquad (\text{A.10})
\end{aligned}
$$

Meanwhile, using (32) and Theorem 1, $\Delta L_{B,k}$ can be expressed as

$$
\begin{aligned}
\Delta L_{B,k} ={}& \rho\left\|E_{\tau,\gamma}(\tilde W_{V,k+1})\right\| - \rho\left\|E_{\tau,\gamma}(\tilde W_{V,k})\right\| \\
\le{}& -\rho\alpha_V\left(2-\frac{1}{\psi_{\min}^2+1}-\frac{1}{\bar\psi_{\min}^2+1}\right)\left\|E_{\tau,\gamma}(\tilde W_{V,k})\right\| + 2\rho\alpha_V\psi_M W_{VM} + \rho\alpha_V\varepsilon_{V,0} + \rho\alpha_V\varepsilon_{VM} \\
\le{}& -\rho\alpha_V(2-\chi)\left\|E_{\tau,\gamma}(\tilde W_{V,k})\right\| + \rho\varepsilon_{TM}, \quad k = 0, 1, \ldots, N \qquad (\text{A.11})
\end{aligned}
$$

with $0 < \chi = (\psi_{\min}^2+\bar\psi_{\min}^2+2)/((\psi_{\min}^2+1)(\bar\psi_{\min}^2+1)) < 2$ defined in Theorem 1. Eventually, combining (23), (A.2), and (A.7)–(A.11), the overall $\Delta L_{CL,k}$ can be represented as

$$
\begin{aligned}
\Delta L_{CL,k} ={}& \Delta L_{z,k}+\Delta L_{I,k}+\Delta L_{V,k}+\Delta L_{u,k}+\Delta L_{A,k}+\Delta L_{B,k} \\
\le{}& -(1-l_o)\kappa\left\|E_{\tau,\gamma}(z_k)\right\| - \frac{\alpha_u}{2}\left(1-\frac{1}{\vartheta_{\min}^2+1}\right)\left\|E_{\tau,\gamma}(\tilde W_{u,k})\right\| - \left(1-\alpha_I^2\right)\left\|E_{\tau,\gamma}(e_{I,k})\right\|^2 \\
& - \frac{7\alpha_u}{4}\psi_M'^2\left\|E_{\tau,\gamma}(\tilde W_{V,k})\right\|^2 - \frac{\alpha_u}{2}\lambda_{\max}^2(R_z^{-1})\upsilon_{IM}^2\left\|E_{\tau,\gamma}(\tilde W_{I,k})\right\|^2 \\
& - \left(\frac{1}{2}-\alpha_I\zeta_M\right)\omega\left\|E_{\tau,\gamma}(\tilde W_{I,k})\right\| - \frac{\alpha_u}{2}\lambda_{\max}(R_z^{-1})\psi'_M W_{VM}\upsilon_{IM}\left\|E_{\tau,\gamma}(\tilde W_{I,k})\right\| \\
& + \varepsilon_{CLM}, \quad k = 0, 1, \ldots, N \qquad (\text{A.12})
\end{aligned}
$$

where $0 < \psi_{\min} \le \|\psi(z)\|$, $0 < \bar\psi_{\min} \le \|\bar\psi(z_k,N-k)\|$, and $0 < \vartheta_{\min} \le \|\vartheta(z)\|$ hold due to the PE condition given in Remark 1. In addition, $\varepsilon_{CLM}$ collects the bounded residual terms and is defined as $\varepsilon_{CLM} = G_M\kappa\varepsilon_{uM} + \varepsilon_{TV}^2 + (\alpha_u/2)\lambda_{\max}^2(R_z^{-1})\varepsilon_{IM}^2 + (\alpha_u/2)\lambda_{\max}(R_z^{-1})G_M\varepsilon_{VM} + (\alpha_u/2)\lambda_{\max}(R_z^{-1})G_M\varepsilon_{VM} + \rho\varepsilon_{TM} + (\alpha_u/2)\varepsilon_{uM}$ for $k = 0, 1, \ldots, N$. Similar to Lemma 1 and Theorem 1, when the tuning parameters of the three NNs, $\alpha_I$, $\alpha_V$, and $\alpha_u$, are selected as $0 < \alpha_I < \min\{1/(2\zeta_M),\ \zeta_{\min}/(\sqrt{2}\,\zeta_M)\}$, $0 < \alpha_V < (2-\chi)/(\chi+5)$, and $0 < \alpha_u < 1$, all the negative-definite terms on the right-hand side of (A.12) remain negative.

Assuming that the initial NNCS system state is bounded by $B_{z,0}$, the initial identification error is bounded by $B_{eI,0}$, and the initial NN weight estimation errors of the three NNs are bounded by $B_{WI,0}$, $B_{WV,0}$, and $B_{Wu,0}$, then using standard Lyapunov [16] and geometric sequence theories [24], the closed-loop Lyapunov function $L_{CL,k}$, $k = 0, 1, \ldots, N$, in (A.6) can be


written as

$$L_{CL,k} = \Delta L_{CL,k-1} + \Delta L_{CL,k-2} + \cdots + \Delta L_{CL,0} + L_{CL,0} = \sum_{j=0}^{k-1}\left(\Delta L_{z,j}+\Delta L_{I,j}+\Delta L_{V,j}+\Delta L_{u,j}+\Delta L_{A,j}+\Delta L_{B,j}\right) + L_{CL,0}. \qquad (\text{A.13})$$
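The geometric-sequence step that converts the per-step decrease into a closed-form bound, used in (A.4) and again in (A.14), can be verified numerically: any sequence satisfying $L_{k+1} \le \Theta L_k + \varepsilon$ with $0 < \Theta < 1$ stays below $\Theta^k L_0 + \frac{1-\Theta^k}{1-\Theta}\varepsilon$, which decays toward the ultimate bound $\varepsilon/(1-\Theta)$. The constants below are illustrative, not values from the paper.

```python
# Numerical check of the geometric-sequence bound:
# if L_{k+1} <= Theta * L_k + eps with 0 < Theta < 1, then
# L_k <= Theta**k * L_0 + (1 - Theta**k) / (1 - Theta) * eps.
Theta, eps, L0 = 0.9, 0.05, 10.0   # illustrative contraction rate, residual, initial value

L = L0
ok = True
for k in range(200):
    bound = Theta**k * L0 + (1 - Theta**k) / (1 - Theta) * eps
    ok = ok and (L <= bound + 1e-12)
    L = Theta * L + eps            # worst-case recursion (equality case)

limit = eps / (1 - Theta)          # asymptotic ultimate bound
print(ok, limit)
```

After 200 steps the trajectory has essentially settled at `limit`, mirroring how the closed-loop ultimate bounds shrink toward the $\varepsilon_{CLM}$-dependent residual as $\Theta^k$ vanishes.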

Using (A.12), (A.13) can be expressed as

$$
\begin{aligned}
L_{CL,k} ={}& \kappa\left\|E_{\tau,\gamma}(z_k)\right\| + \left\|E_{\tau,\gamma}(e_{I,k})\right\|^2 + \Gamma\left\|E_{\tau,\gamma}(\tilde W_{I,k})\right\|^2 + \Pi\left\|E_{\tau,\gamma}(\tilde W_{V,k})\right\|^2 + \left\|E_{\tau,\gamma}(\tilde W_{u,k})\right\| \\
& + \omega\left\|E_{\tau,\gamma}(\tilde W_{I,k})\right\| + \rho\left\|E_{\tau,\gamma}(\tilde W_{V,k})\right\| \\
\le{}& -\sum_{j=0}^{k-1}(1-l_o)\kappa\left\|E_{\tau,\gamma}(z_j)\right\| - \sum_{j=0}^{k-1}\frac{\alpha_u}{2}\left(1-\frac{1}{\vartheta_{\min}^2+1}\right)\left\|E_{\tau,\gamma}(\tilde W_{u,j})\right\| \\
& - \sum_{j=0}^{k-1}\left(1-\alpha_I^2\right)\left\|E_{\tau,\gamma}(e_{I,j})\right\|^2 - \sum_{j=0}^{k-1}\frac{7\alpha_u}{4}\psi_M'^2\left\|E_{\tau,\gamma}(\tilde W_{V,j})\right\|^2 \\
& - \sum_{j=0}^{k-1}\frac{\alpha_u}{2}\lambda^2_{\max}(R_z^{-1})\upsilon_{IM}^2\left\|E_{\tau,\gamma}(\tilde W_{I,j})\right\|^2 - \sum_{j=0}^{k-1}\left(\frac{1}{2}-\alpha_I\zeta_M\right)\omega\left\|E_{\tau,\gamma}(\tilde W_{I,j})\right\| \\
& - \sum_{j=0}^{k-1}\frac{\alpha_u}{2}\lambda_{\max}(R_z^{-1})\psi'_M W_{VM}\upsilon_{IM}\left\|E_{\tau,\gamma}(\tilde W_{I,j})\right\| + \sum_{j=0}^{k-1}\varepsilon_{CLM} + L_{CL,0} \\
\le{}& \Theta^k B_{CL,0} + \frac{1-\Theta^k}{1-\Theta}\,\varepsilon_{CLM}, \quad k = 0, 1, \ldots, N \qquad (\text{A.14})
\end{aligned}
$$

with $\beta_I = \max\{\alpha_I^2,\ \alpha_I^2\zeta_M^2/\zeta_{\min}^2\}$ defined in Lemma 1, the closed-loop initial condition $B_{CL,0} = \kappa B_{z,0} + B_{eI,0}^2 + \Gamma B_{WI,0}^2 + \Pi B_{WV,0}^2 + B_{Wu,0} + \omega B_{WI,0} + \rho B_{WV,0}$, and $\Theta = \max\{l_o,\ [1-\frac{\alpha_u}{2}(1-\frac{1}{\vartheta_{\min}^2+1})],\ \beta_I,\ [1-2\alpha_V(2-\chi-(\chi+5)\alpha_V)],\ [1-\alpha_V(2-\chi)]\}$. Therefore, the ultimate bounds for the system state, identification error, NN identifier weight estimation error, and critic and actor NN weight estimation errors can be represented as

$$
\begin{aligned}
\left\|E_{\tau,\gamma}(z_k)\right\| &\le \frac{\Theta^k}{\kappa}B_{CL,0} + \frac{1-\Theta^k}{\kappa(1-\Theta)}\,\varepsilon_{CLM} \equiv B_{z,k} \\
\left\|E_{\tau,\gamma}(e_{I,k})\right\| &\le \sqrt{\Theta^k B_{CL,0} + \frac{1-\Theta^k}{1-\Theta}\,\varepsilon_{CLM}} \equiv B_{eI,k} \\
\left\|E_{\tau,\gamma}(\tilde W_{I,k})\right\| &\le \max\left\{\sqrt{\frac{\Theta^k}{\Gamma}B_{CL,0} + \frac{1-\Theta^k}{(1-\Theta)\Gamma}\,\varepsilon_{CLM}},\ \frac{\Theta^k}{\omega}B_{CL,0} + \frac{1-\Theta^k}{\omega(1-\Theta)}\,\varepsilon_{CLM}\right\} \equiv B_{WI,k} \\
\left\|E_{\tau,\gamma}(\tilde W_{V,k})\right\| &\le \max\left\{\sqrt{\frac{\Theta^k}{\Pi}B_{CL,0} + \frac{1-\Theta^k}{\Pi(1-\Theta)}\,\varepsilon_{CLM}},\ \frac{\Theta^k}{\rho}B_{CL,0} + \frac{1-\Theta^k}{\rho(1-\Theta)}\,\varepsilon_{CLM}\right\} \equiv B_{WV,k} \\
\left\|E_{\tau,\gamma}(\tilde W_{u,k})\right\| &\le \Theta^k B_{CL,0} + \frac{1-\Theta^k}{1-\Theta}\,\varepsilon_{CLM} \equiv B_{Wu,k}, \quad k = 0, 1, \ldots, N. \qquad (\text{A.15})
\end{aligned}
$$

Since $0 < l_o < 1$, $0 < \alpha_I < \min\{1/(2\zeta_M),\ \zeta_{\min}/(\sqrt{2}\,\zeta_M)\}$, $0 < \alpha_V < (2-\chi)/(\chi+5)$, and $0 < \alpha_u < 1$, we have $0 < 1-\frac{\alpha_u}{2}(1-\frac{1}{\vartheta_{\min}^2+1}) < 1$, $0 < \beta_I < 1$, $0 < 1-2\alpha_V(2-\chi-(\chi+5)\alpha_V) < 1$, and $0 < 1-\alpha_V(2-\chi) < 1$. Hence, $0 < \Theta < 1$, and the term $\Theta^k$ decreases as the time $kT_s$ increases. In addition, since the initial bounds $B_{z,0}$, $B_{eI,0}$, $B_{WI,0}$, $B_{WV,0}$, and $B_{Wu,0}$ are positive constants, the closed-loop initial condition $B_{CL,0}$ is also a positive constant. Therefore, the ultimate bounds $B_{z,k}$, $B_{eI,k}$, $B_{WI,k}$, $B_{WV,k}$, and $B_{Wu,k}$ decrease over time. Moreover, as the final time instant $NT_s$ increases, not only are all the signals UUB in the mean, but all the ultimate bounds also decrease with time. Similar to Lemma 1 and Theorem 1, as time goes to infinity, the NNCS system state, identification error, and weight estimation errors of the three NNs decrease and approach their bounds. Moreover, we have

$$\left\|u_k^* - \hat u_k\right\| = \left\|E_{\tau,\gamma}\left(\tilde W_{u,k}^T\vartheta(z_k,k)\right) + \varepsilon_{u,k}\right\| \le B_{Wu,k}\vartheta_M + \varepsilon_{uM} \equiv B_u.$$

REFERENCES

[1] Y. Tipsuwan and M. Y. Chow, “Control methodologies in networked control systems,” Control Eng. Practice, vol. 11, no. 10, pp. 1099–1111, 2003.
[2] W. Zhang, M. S. Branicky, and S. Phillips, “Stability of networked control systems,” IEEE Control Syst. Mag., vol. 21, no. 1, pp. 84–99, Feb. 2001.
[3] G. C. Walsh, O. Beldiman, and L. G. Bushnell, “Asymptotic behavior of nonlinear networked control systems,” IEEE Trans. Autom. Control, vol. 46, no. 7, pp. 1093–1097, Jul. 2001.
[4] N. V. D. Wouw, D. Nesic, and W. P. H. Heemels, “A discrete-time framework for stability analysis of nonlinear networked control systems,” Automatica, vol. 48, no. 12, pp. 1144–1153, 2012.
[5] H. Shousong and Z. Qixin, “Stochastic optimal control and analysis of stability of networked control systems with long delay,” Automatica, vol. 39, no. 11, pp. 1877–1884, 2003.
[6] L. L. Feng, J. Moyne, and D. Tilbury, “Optimal control design and evaluation for a class of networked control systems with distributed constant delays,” in Proc. IEEE Amer. Control Conf., Anchorage, AK, USA, Jun. 2002, pp. 3009–3014.
[7] M. Tabbara, “A linear quadratic Gaussian framework for optimal networked control system design,” in Proc. IEEE Amer. Control Conf., Seattle, WA, USA, Jun. 2008, pp. 3804–3809.
[8] A. K. Dehghani, “Optimal networked control system design: A dual-rate approach,” in Proc. Can. Conf. ECE, Saskatoon, SK, Canada, May 2005, pp. 790–793.

[9] H. Xu, S. Jagannathan, and F. L. Lewis, “Stochastic optimal control of unknown linear networked control system in presence of random delays and packet losses,” Automatica, vol. 48, no. 6, pp. 1017–1030, 2012.
[10] D. P. Bertsekas and J. Tsitsiklis, Neuro-Dynamic Programming. Cambridge, MA, USA: MIT Press, 1996.
[11] F. L. Lewis and D. Vrabie, “Reinforcement learning and adaptive dynamic programming for feedback control,” IEEE Circuits Syst. Mag., vol. 9, no. 3, pp. 32–50, Aug. 2009.
[12] F. Y. Wang, N. Jin, D. Liu, and Q. L. Wei, “Adaptive dynamic programming for finite horizon optimal control of discrete-time nonlinear systems with ε-error bound,” IEEE Trans. Neural Netw., vol. 22, no. 1, pp. 24–36, Jan. 2011.
[13] H. Zhang, Y. Luo, and D. Liu, “Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints,” IEEE Trans. Neural Netw., vol. 20, no. 9, pp. 1490–1503, Sep. 2009.
[14] T. Dierks and S. Jagannathan, “Online optimal control of affine nonlinear discrete-time systems with unknown internal dynamics by using time-based policy update,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 7, pp. 1118–1129, Jul. 2012.
[15] F. L. Lewis and V. L. Syrmos, Optimal Control, 2nd ed. New York, NY, USA: Wiley, 1995.
[16] S. Jagannathan, Neural Network Control of Nonlinear Discrete-Time Systems. Boca Raton, FL, USA: CRC Press, 2006.
[17] M. S. Mahmoud and A. Ismail, “New results on delay-dependent control of time-delay systems,” IEEE Trans. Autom. Control, vol. 50, no. 1, pp. 95–100, Jan. 2005.
[18] R. Luck and A. Ray, “An observer-based compensator for distributed delays,” Automatica, vol. 26, no. 6, pp. 903–908, 1990.
[19] W. Stallings, Wireless Communications and Networks, 1st ed. Upper Saddle River, NJ, USA: Prentice-Hall, 2002.
[20] A. Goldsmith, Wireless Communication. Cambridge, U.K.: Cambridge Univ. Press, 2003.
[21] H. Xu and S. Jagannathan, “Stochastic optimal controller design for uncertain nonlinear networked control system via neuro dynamic programming,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 3, pp. 471–484, Mar. 2013.
[22] H. F. Chen and L. Guo, Identification and Stochastic Adaptive Control. Cambridge, U.K.: Cambridge Univ. Press, 1991.
[23] J. Dankert, L. Yang, and J. Si, “A performance gradient perspective on approximate dynamic programming and its application to partially observable Markov decision process,” in Proc. Int. Symp. Intell. Control, Munich, Germany, Oct. 2006, pp. 458–463.
[24] R. W. Brockett, R. S. Millman, and H. J. Sussmann, Differential Geometric Control Theory. Cambridge, MA, USA: Birkhäuser, 1983.
[25] T. Chen and F. L. Lewis, “Fixed-final-time-constrained optimal control of nonlinear systems using neural network HJB approach,” IEEE Trans. Neural Netw., vol. 18, no. 6, pp. 1725–1737, Nov. 2007.
[26] P. J. Werbos, “A menu of designs for reinforcement learning over time,” J. Neural Netw. Control, vol. 3, no. 1, pp. 835–846, 1983.
[27] H. Zhang, D. Yang, and T. Chai, “Guaranteed cost networked control for T–S fuzzy systems with time delay,” IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 37, no. 2, pp. 160–172, Feb. 2007.
[28] X. Lin, Y. Huang, N. Cao, and Y. Lin, “Optimal control scheme for nonlinear systems with saturating actuator using ε-iterative adaptive dynamic programming,” in Proc. UKACC Int. Conf. Control, Cardiff, U.K., Sep. 2012, pp. 58–63.


[29] Y. Jiang and Z. P. Jiang, “Robust adaptive dynamic programming for large-scale systems with an application to multimachine power systems,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 59, no. 10, pp. 693–697, Oct. 2012.
[30] F. L. Lewis, S. Jagannathan, and A. Yesildirek, Neural Network Control of Robot Manipulators and Nonlinear Systems. Florence, KY, USA: Taylor & Francis, 1999.
[31] H. Zhang, Y. H. Luo, and D. Liu, “Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints,” IEEE Trans. Neural Netw., vol. 20, no. 9, pp. 1490–1503, Sep. 2009.

Hao Xu (M’12) was born in Nanjing, China, in 1984. He received the master’s degree in electrical engineering from Southeast University, Nanjing, China, in 2009, and the Ph.D. degree from the Missouri University of Science and Technology (formerly the University of Missouri-Rolla), Rolla, MO, USA, in 2012. He is currently an Assistant Professor with the University of Tennessee at Martin, Martin, TN, USA. His current research interests include networked control systems, cyber-physical systems, distributed network protocol design, approximate/adaptive dynamic programming, optimal control, and adaptive control.

Sarangapani Jagannathan (SM’99) is with the Missouri University of Science and Technology, Rolla, MO, USA, where he is a Rutledge-Emerson Endowed Chair Professor of Electrical and Computer Engineering and the Site Director of the NSF Industry/University Cooperative Research Center on Intelligent Maintenance Systems. He has co-authored 120 peer-reviewed journal articles, most of them in IEEE Transactions, 220 refereed IEEE conference articles, several book chapters, and three books, and he holds 20 U.S. patents. He has supervised to graduation around 18 doctoral and 29 M.S.-level students, and his funding is in excess of $14 million from various U.S. federal agencies and industrial members. He was the Co-Editor of the IET book series on control from 2010 to 2013 and is currently serving as the Editor-in-Chief of Discrete Dynamics in Nature and Society and on many editorial boards. His current research interests include neural network control, adaptive event-triggered control, secure networked control systems, prognostics, and autonomous systems/robotics. Dr. Jagannathan is the IEEE CSS Technical Committee Chair on Intelligent Control. He has received many awards and has been on the organizing committees of several IEEE conferences.
