
Stochastic Optimal Controller Design for Uncertain Nonlinear Networked Control System via Neuro Dynamic Programming

Hao Xu, Member, IEEE, and Sarangapani Jagannathan, Senior Member, IEEE

Abstract— The stochastic optimal controller design for the nonlinear networked control system (NNCS) with uncertain system dynamics is a challenging problem due to the presence of both system nonlinearities and communication network imperfections, such as random delays and packet losses, which are not known a priori. In the recent literature, neuro dynamic programming (NDP) techniques, based on value and policy iterations, have been widely reported to solve the optimal control of general affine nonlinear systems. However, for real-time control, value and policy iteration-based methodologies are not suitable and time-based NDP techniques are preferred. In addition, output feedback-based controller designs are preferred for implementation. Therefore, in this paper, a novel NNCS representation incorporating the system uncertainties and network imperfections is first introduced by using input and output measurements to facilitate output feedback. Then, an online neural network (NN) identifier is introduced to estimate the control coefficient matrix, which is subsequently utilized for the controller design. Subsequently, the critic and action NNs are employed along with the NN identifier to determine the forward-in-time, time-based stochastic optimal control of NNCS without using value and policy iterations. Here, the value function and control inputs are updated once per sampling instant. By using novel NN weight update laws, Lyapunov theory is used to show that all the closed-loop signals and NN weights are uniformly ultimately bounded in the mean while the approximated control input converges close to its target value with time. Simulation results are included to show the effectiveness of the proposed scheme.

Index Terms— Neuro dynamic programming, nonlinear networked control system, stochastic optimal control.

NOMENCLATURE

$\tau_{sc}$    Sensor-to-controller delay.
$\tau_{ca}$    Controller-to-actuator delay.
$\gamma$    Indicator of packet losses.
$T_s$    Sampling time.
$\bar{d}T_s$    Upper bound on delay.
$z_k$    Augmented states of NCS at time $k$.
$y_k^o$    Modified state vector with current output and previous inputs.
$V_k$    Stochastic value function at time $k$.
$W_C$    Target weight matrix of NN-identifier.
$\hat{W}_{Ck}$    Estimated weight matrix of NN-identifier at time $k$.
$e_{yk}$    Identification errors at time $k$.
$W_V$    Target weight matrix of Critic NN.
$\hat{W}_{Vk}$    Estimated weight matrix of Critic NN at time $k$.
$e_{Vk}$    Residual error.
$W_u$    Target weight matrix of Action NN.
$\hat{W}_{uk}$    Estimated weight matrix of Action NN at time $k$.
$e_{uk}$    Action NN estimation error.
$\alpha_C, \alpha_V, \alpha_u$    Tuning parameters for NN-identifier, Critic NN, and Action NN, respectively.
$\varepsilon_{Ck}, \varepsilon_{Vk}, \varepsilon_{uk}$    Reconstruction errors for NN-identifier, Critic NN, and Action NN, respectively.

Manuscript received April 2, 2011; revised December 4, 2012; accepted December 5, 2012. Date of publication January 14, 2013; date of current version January 30, 2013. This work was supported in part by NSF under Grant ECCS 1128281. The authors are with the Department of Electrical and Computer Engineering, Missouri University of Science and Technology, Rolla, MO 65401 USA (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2012.2234133

I. INTRODUCTION

Feedback control systems with control loops closed through a real-time communication network are called networked control systems (NCS) [1]. In NCS, a communication packet carries the reference input, plant output, and control input, which are exchanged among control system components, such as the sensor, controller, and actuators, by using the communication network. The benefits of NCS include reduced system wiring with ease of system diagnosis and maintenance, and increased system agility. Adding a communication network in the feedback control loop, however, brings challenging issues. The first is the network-induced delay in the control loop that occurs when exchanging data among devices connected to the shared medium. The delay, either constant or random, can degrade the performance of the control system and even destabilize the system when the delay is not explicitly considered in the design process. In addition, the packet losses in the communication network due to unreliable transmission can cause a loss in the control input, resulting in instability. Recently, Walsh et al. [2] proposed a scheduling protocol and analyzed the asymptotic behavior of nonlinear NCS (or NNCS). Polushin et al. [3] proposed a model-based stabilizing control for NNCS. However, the only objective of these controller designs [2], [3] is to make the NNCS stable when the dynamics are considered known. In general, optimality is

2162–237X/$31.00 © 2013 IEEE


IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 24, NO. 3, MARCH 2013

preferred for NCS [4]–[6] and especially for NNCS, for which it is very difficult to attain. Recently, the authors proposed an approximate dynamic programming (ADP)-based model-free optimal design [7] for linear NCS by using Q-learning and state feedback. However, the controller design for NNCS is more involved due to the presence of system nonlinearities and neural network (NN) reconstruction errors. In addition, handling network imperfections in the case of NNCS further complicates the optimal controller design. Finally, the optimal controller design for NNCS is developed here by using input–output measurements, in contrast with the state availability assumed in [7].
Neuro dynamic programming (NDP) and adaptive or approximate dynamic programming (ADP) techniques, proposed by Bertsekas and Tsitsiklis [8] and Werbos [9], respectively, aim to solve the optimal control problem in a forward-in-time manner, in contrast to the standard Riccati equation-based backward-in-time solution for linear systems. In NDP and ADP, one combines adaptive critics, a reinforcement learning technique, with dynamic programming [10]–[14], [31]. Zhang et al. [15] introduced near-optimal control of affine nonlinear discrete-time systems with control constraints by using an iterative ADP algorithm. A greedy ADP iteration scheme is derived in [16] to obtain optimal tracking control of nonlinear discrete-time systems. Recently, Lewis and Vrabie [17] introduced the NDP scheme for both linear time-invariant and nonlinear systems with partially unknown dynamics by using value and policy iterations. In contrast, in [18], NNs are utilized to solve the optimal regulation of a nonlinear discrete-time system in an offline manner by assuming that the approximator has no reconstruction errors. Besides ignoring the reconstruction errors, complete dynamics are needed to implement the offline NN training.
To improve the iterative yet offline training methodology, Dierks and Jagannathan [19] used two NNs to solve the Hamilton–Jacobi–Bellman (HJB) equation forward-in-time for time-based optimal control of a class of affine nonlinear discrete-time systems. However, the schemes in [15]–[19] are not applicable to NNCS since the effects of delays and packet losses are not considered and state measurement is assumed. Moreover, value and policy iteration-based schemes are not suitable for hardware implementation. When the network imperfections, such as delays and packet losses, are not incorporated carefully, they can cause instability [1], which in turn makes the optimal controller design for NNCS more involved and different from [17]–[19]. Although NDP is an effective technique to solve the optimal control of NNCS, traditional NDP techniques [17] require partial knowledge of the system dynamics, which makes them unsuitable for NNCS due to the presence of unknown random delays and packet losses. In addition, the NDP technique using value and policy iterations [17], [20] is not preferred for real-time control since the number of iterations needed for convergence within a sampling interval is unknown. Also, in some cases [17], [20], a model may be needed to iterate the value and policies. Therefore, the standard heuristic dynamic programming (HDP)-based value and policy iteration methods [17], [20] cannot be utilized for NNCS.
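The concern about an unknown number of iterations can be made concrete with a small sketch. It is not the method of this paper or of [17], [20]; it simply runs textbook value iteration on a hypothetical scalar linear-quadratic problem $x_{k+1} = a x_k + b u_k$ with stage cost $q x^2 + r u^2$, where `a`, `b`, `q`, `r`, and `tol` are illustrative assumptions, and reports how many sweeps are needed before the value function $V_j(x) = p_j x^2$ stops changing.

```python
def value_iteration_lqr(a, b, q, r, tol=1e-9, max_iter=10_000):
    """Value iteration for a scalar discrete-time LQ problem.

    The value function is V_j(x) = p_j * x**2 with p_0 = 0; each sweep applies
    the Riccati difference map once. Returns (p, number_of_sweeps).
    """
    p = 0.0
    for j in range(1, max_iter + 1):
        p_next = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
        if abs(p_next - p) < tol:
            return p_next, j
        p = p_next
    raise RuntimeError("value iteration did not converge")


# Hypothetical plants: the sweep count is only known after running the loop,
# which is the difficulty when it must finish inside one sampling interval.
for a in (0.5, 0.9, 1.1):
    p, sweeps = value_iteration_lqr(a, b=1.0, q=1.0, r=1.0)
    print(f"a={a}: p={p:.6f} after {sweeps} sweeps")
```

A time-based scheme, by contrast, performs a fixed amount of work per sample and lets convergence happen along the trajectory.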

Besides relaxing the need for value and policy iterations, it is desirable to express the system dynamics in input/output form since the system states are normally not measurable. Such techniques belong to the field of data-based control [21], where the control input depends on input/output data measured directly from the system. To the best knowledge of the authors, no NDP methods have been developed in the literature for such NNCS. Thus, in this paper, a nonlinear system in affine form with network imperfections is first expressed as a stochastic nonlinear discrete-time system in state-space form. Next, this nonlinear discrete-time system is converted into an input/output form with uncertain dynamics, referred to as NNCS, for the purpose of output feedback controller design. Subsequently, a novel time-based NDP algorithm is derived for the uncertain NNCS in the presence of network imperfections, such as random delays and packet losses, which are normally unknown. To learn the partial dynamics of NNCS, an online NN identifier is introduced first. Then, by using an initial stabilizing control, a critic NN is tuned online to learn the value function of NNCS by solving the discrete-time HJB equation in an approximate manner. Subsequently, an action NN is utilized to minimize the value function based on the information provided by the critic NN and NN identifier. Therefore, the proposed novel input–output feedback-based NDP algorithm relaxes the need for system dynamics and information on random delays and packet losses. Value and policy iterations are not utilized, and the value function and control inputs are updated once per sampling instant, making the proposed NDP scheme a time-based model-free optimal controller for NNCS. The main contribution of this paper is a time-based NDP optimal control scheme for uncertain NNCS that uses output feedback and avoids value and policy iterations.
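As a rough illustration of what one update per sampling instant looks like for an online NN, the sketch below applies a single normalized-gradient step per sample to a linear-in-the-parameters approximator. It uses the generic form $\hat{W} \leftarrow \hat{W} + \alpha\,\phi\,e^T/(\phi^T\phi + 1)$ and is not the identifier, critic, or action update law derived in this paper; the basis `phi`, the target weights `W_true`, and the gain `alpha` are hypothetical choices made only for the demonstration.

```python
import numpy as np


def normalized_gradient_step(W_hat, phi, target, alpha):
    """One weight update per sample: W <- W + alpha * phi * e^T / (phi^T phi + 1)."""
    e = target - W_hat.T @ phi                      # estimation error for this sample
    return W_hat + alpha * np.outer(phi, e) / (phi @ phi + 1.0)


rng = np.random.default_rng(0)
W_true = np.array([[1.0], [-2.0], [0.5]])          # hypothetical target weights
W_hat = np.zeros_like(W_true)                      # estimate, initialized at zero

for k in range(2000):                              # one pass = one sampling instant
    y = rng.uniform(-1.0, 1.0, size=2)             # stand-in for measured data
    phi = np.array([y[0], y[1], y[0] * y[1]])      # hypothetical basis functions
    target = W_true.T @ phi                        # noise-free target for clarity
    W_hat = normalized_gradient_step(W_hat, phi, target, alpha=0.5)

err = np.linalg.norm(W_true - W_hat)
print("weight estimation error:", err)
```

The random inputs keep the regressor persistently exciting, which is why the estimate settles; with a reconstruction error added to `target`, the error would instead be expected to remain within a bound rather than vanish.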
Closed-loop stability in the mean is demonstrated by selecting novel NN update laws. This paper is organized as follows. Section II presents the NNCS background and its input/output system representation. A novel online optimal adaptive NN control scheme with an online identifier is proposed in Section III, while the stability in the mean is verified by using Lyapunov theory. Section IV illustrates the effectiveness of the proposed schemes and Section V provides concluding remarks.

II. NNCS BACKGROUND

A. NNCS Structure
The NNCS structure considered in this paper is shown in Fig. 1, where the feedback control loop is closed over the communication network. Due to an unreliable communication network, network-induced delays and packet losses are included in this structure, such as: 1) $\tau_{sc}(t)$: sensor-to-controller delay; 2) $\tau_{ca}(t)$: controller-to-actuator delay; and 3) $\gamma(t)$: indicator of packet losses at the actuator. Next, the following assumption is needed, consistent with the literature in NCS [22], [23].
Assumption 1:
1) Sensor is time-driven while the controller and actuator are event-driven [24].

XU AND JAGANNATHAN: STOCHASTIC OPTIMAL CONTROLLER DESIGN FOR UNCERTAIN NNCS

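Fig. 1. NNCS. (Block diagram: the nonlinear plant, with its actuator and sensor, is sampled with period $T_s$ and connected to the controller through a communication network; the controller-to-actuator link introduces delay $\tau_{ca}(t)$ and packet losses $\gamma(t)$, and the sensor-to-controller link introduces delay $\tau_{sc}(t)$ and packet losses $\gamma(t)$.)

Fig. 2.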

2) The communication network is a wide area network so that the two network-induced delays are considered independent, ergodic, and unknown, whereas their probability distribution functions are considered known [22], [23].
3) The total delay (sum of both types) is bounded [22] while the initial state of the nonlinear system is deterministic [23].

B. NNCS System Dynamics Representation
In this paper, a continuous-time affine nonlinear system of the form $\dot{x} = f(x) + g(x)u$ and $y = Cx$ is considered, where $x$, $y$, and $u$ denote the system state, output, and input vectors, while $f(\cdot)$ and $g(\cdot)$ are smooth nonlinear functions of the state and $C$ is the output matrix. When the random delays and packet losses of the communication network are considered, the control input $u(t)$ is delayed and can be lost at times. Therefore, the nonlinear system after considering the effect of delay and packet losses can be expressed as
$$\dot{x}(t) = f(x(t)) + \gamma(t)\, g(x(t))\, u(t - \tau(t)), \qquad y(t) = C x(t).$$
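The delayed, lossy model above can be explored numerically with a short simulation. The following is a minimal sketch under stated assumptions, not the discretization used in this paper: the state is scalar, the two delays are lumped into one total delay drawn uniformly from $[0, \bar{d}T_s)$, packet loss is Bernoulli with probability `p_loss`, a lost packet contributes a zero input (matching the multiplicative $\gamma(t)$), the actuator holds the most recently arrived value, and integration is forward Euler. The function name `simulate_nncs` and all parameter names are hypothetical.

```python
import numpy as np


def simulate_nncs(f, g, policy, x0, T_s=0.1, d_bar=2, p_loss=0.1,
                  n_sub=20, n_steps=100, seed=0):
    """Forward-Euler simulation of x' = f(x) + gamma(t) g(x) u(t - tau(t)), scalar x."""
    rng = np.random.default_rng(seed)
    dt = T_s / n_sub
    x = float(x0)
    in_flight = []        # packets (arrival_time, gamma_k * u_k) still in the network
    u_applied = 0.0       # value currently held by the actuator
    traj = [x]
    for k in range(n_steps):
        t_k = k * T_s
        tau = rng.uniform(0.0, d_bar * T_s)               # total delay, bounded by d_bar * T_s
        gamma = 0.0 if rng.random() < p_loss else 1.0     # 0 means the packet is lost
        in_flight.append((t_k + tau, gamma * policy(x)))  # control computed from the sample at t_k
        for i in range(n_sub):
            t = t_k + i * dt
            arrived = [p for p in in_flight if p[0] <= t]
            if arrived:
                # Simplification: the packet with the latest arrival time wins.
                u_applied = max(arrived, key=lambda p: p[0])[1]
                in_flight = [p for p in in_flight if p[0] > t]
            x = x + dt * (f(x) + g(x) * u_applied)
        traj.append(x)
    return np.array(traj)


# Hypothetical example: stable plant x' = -x + u with feedback u = -x.
traj = simulate_nncs(lambda x: -x, lambda x: 1.0, lambda x: -x, x0=1.0)
print("final state:", traj[-1])
```

Setting `p_loss=1.0` removes the input entirely, so the trajectory reduces to the Euler solution of $\dot{x} = f(x)$, which gives a quick sanity check of the loop.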

can be obtained as
$$u(t) = u_k\left[\delta(t - kT_s) - \delta(t - (k+1)T_s)\right]$$
…
or
$$\left\| E_{\tau,\gamma}(e_{yk}) \right\| > \sqrt{\frac{\cdots}{1-\alpha_C^{2}}} \qquad \text{(A.3)}$$

Using the standard Lyapunov extension [11], the identification errors and NN weight estimation errors are UUB in the mean.

PROOF OF THEOREM 2
Consider the Lyapunov function candidate
$$L_V = \operatorname{tr}\left\{ E_{\tau,\gamma}\left( \tilde{W}_{Vk}^T \tilde{W}_{Vk} \right) \right\}. \qquad \text{(A.4)}$$

T T yo W V ˜ Vk + E ϑ T y o2α ϑ o +1 tr ε V k ϑ k y ( k) ( k) τ,γ 2 αV tr W˜ VT k ϑ ( yko ) ϑ T ( yko ) ϑ ( yko ) ϑ T ( yko ) W˜ V k + E τ,γ ( ϑ T ( yko ) ϑ ( yko )+1)2 ˜ Vk 2αV2 tr εVT k ϑ T ( yko ) ϑ ( yko ) ϑ T ( yko ) W + E τ,γ ( ϑ T ( yko ) ϑ ( yko )+1)2 ! " αV2 tr εVT k ϑ T ( yko ) ϑ ( yko ) εV k + E ( ϑ T ( yko ) ϑ ( yko )+1)2 τ,γ −tr ≤

E (W˜ VT k W˜ V k )

τ,γ

2 2 (W αV (1−2aV ) ϑmin E ˜ V k ) τ,γ 2 − (2−aV ) ϑ M +1

+

2 αV (2aV +1) εVM 2 (2−aV ) ϑ M +1

(A.5)
where $0 < \vartheta_{\min} < \left\| E_{\tau,\gamma}(\vartheta(y_k^o)) \right\|$ is ensured by the PE condition described in Remark 2, and $\|\varepsilon_{Vk}\| \le \varepsilon_{VM}$ for a constant $\varepsilon_{VM}$ is ensured by the boundedness of $\varepsilon_{Vk}$. Therefore, $\Delta L_V < 0$ if
$$\left\| E_{\tau,\gamma}(\tilde{W}_{Vk}) \right\| > \sqrt{\frac{2\alpha_V + 1}{(1 - 2\alpha_V)\,\vartheta_{\min}^{2}}}\;\varepsilon_{VM} \equiv B_{Wv}. \qquad \text{(A.6)}$$
Using standard Lyapunov theory [27], it can be concluded that $\Delta L_V$ is less than zero outside of a compact set, rendering the critic NN weight estimation errors UUB in the mean.

PROOF OF THEOREM 3
Consider the Lyapunov function candidate
$$L = L_{DN} + L_{uN} + L_{VN} + L_{CN} + L_{AN} + L_{BN}$$

where L D N = E τ,γ [(yko )T yko ], L u N , L V N , L C N , L AN , and L B N are defined T T ˜ ˜ ˜ ˜ L u N = tr E Wuk Wuk , L V N = tr E WV k WV k τ,γ τ,γ T L C N = tr E e Tyk e yk + tr E W˜ Ck OW˜ Ck τ,γ τ,γ 2 L AN = tr E W˜ VT k W˜ V k τ,γ

LBN

2 T ˜ ˜ = tr E WCk WCk

(A.8)

τ,γ

with

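The first difference of (A.4) is given by $\Delta L_V = \operatorname{tr}\{E_{\tau,\gamma}(\tilde{W}_{Vk+1}^T \tilde{W}_{Vk+1})\} - \operatorname{tr}\{E_{\tau,\gamma}(\tilde{W}_{Vk}^T \tilde{W}_{Vk})\}$, and using (30) yields
$$\Delta L_V = \operatorname{tr}\left\{E_{\tau,\gamma}\left(\tilde{W}_{Vk}^T \tilde{W}_{Vk}\right)\right\} - E_{\tau,\gamma}\left[\frac{2\alpha_V}{\vartheta^T(y_k^o)\,\vartheta(y_k^o) + 1} \times \operatorname{tr}\left(\tilde{W}_{Vk}^T\, \vartheta(y_k^o)\, \vartheta^T(y_k^o)\, \tilde{W}_{Vk}\right)\right]$$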


(A.7)

= =

24G 2M φ 2M φ 2M + 1

I 2 φmin 2 +1 288φ 2M 2 ϑ M #

2 ϑ 2 φmin min

× G 2M +12

2 ε 2 ψM VM 2

ϑmin

I

24 (ψ M φ M )2 = I 2 φmin 2 2 2 2 ¯ 2 ψ φ 9 ε φ ε M VM M CM 2 O = 2 M + +6 (ψ M )2 φ 2M M I 2 2 2 2φmin φmin min and ⎛

⎞ ' ( ( (ψ M φ M )2 ϑ 2 + 12 ⎜ ⎟ M = ⎝85) ⎠I 2 ϑ 4 φmin min are positive definite matrices, I is identity matrix, is defined

G , and λ −1 as λmax (R −1 )ϑ M M max (R ) is the maximum singular value of R. The first difference of (A.7) is given by L =

L D N + L u N + L V N + L C N + L AN + L B N . o o Considering first difference L D N = E τ,γ [(yk+1 )T yk+1 ]− o o T E τ,γ [(yk ) yk ], using the NNCS dynamics (4), and applying the Cauchy–Schwartz inequality reveals that the first difference becomes 2 o F y +G y o u ∗ y o −G y o u ∗ y o k o o k ok o k o k o

L DN ≤ E ˆ τ,γ +G yk uˆ yk −G yk uˆ yk + G yk uˆ yk T − E yko yko τ,γ 2 ≤ 2 E F yo + G yo u∗ yo τ,γ

k

k

k

2 +4 E G˜ yko uˆ yko τ,γ 2 T +8 E G yko W˜ uk φ yko τ,γ 2 T +8 E G yko εuk − E yko yko τ,γ τ,γ 2 2 ∗ o 2 T ˜ ≤ − 1 − 2k E (yk ) + 4 M E (WCk ) τ,γ τ,γ 2 2 2 ˜ +8G 2M φ 2M (A.9) E (Wuk ) + 8G M εu M . τ,γ


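Next, the first difference $\Delta L_{uN}$ can be expressed as
$$\Delta L_{uN} = \operatorname{tr}\left\{E_{\tau,\gamma}\left(\tilde{W}_{uk+1}^T \tilde{W}_{uk+1}\right)\right\} - \operatorname{tr}\left\{E_{\tau,\gamma}\left(\tilde{W}_{uk}^T \tilde{W}_{uk}\right)\right\}$$
$$= E_{\tau,\gamma}\left[\frac{\alpha_u}{\phi^T(y_k^o)\,\phi(y_k^o) + 1} \times \operatorname{tr}\left(e_{uk}\,\phi^T(y_k^o)\,\tilde{W}_{uk} + \tilde{W}_{uk}^T\,\phi(y_k^o)\,e_{uk}^T\right)\right] + E_{\tau,\gamma}\left[\frac{\alpha_u^{2}\,\phi^T(y_k^o)\,\phi(y_k^o)}{\left(\phi^T(y_k^o)\,\phi(y_k^o) + 1\right)^{2}}\right] \times \operatorname{tr}\left\{E_{\tau,\gamma}\left(e_{uk}\,e_{uk}^T\right)\right\}.$$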

where 0 < φmin < E τ,γ (φ(yko )) is ensured by the PE condition described in Remarks 1 and 2, E τ,γ ( εeuk )2 = 2 , which is a (αu E τ,γ (εek )2 /(φ T (yko )φ(yko ) + 1)) ≤ εeM bounded positive constant. Next, first difference L AN can be expressed as

L AN

2 T ˜ ˜ ≤ tr E WV k+1 WV k+1 τ,γ

2 − tr E W˜ VT k W˜ V k τ,γ 1 2 2 4 αV − 2a V2 − 12

ϑmin E W˜ V k 2 ≤− τ,γ

ϑ M + 1 2 4 ε2 ˜Vk W + 2 2VM E 3 ϑ + 1 τ,γ

k

(A.10)

M

Substituting (34) into (A.10), we get 2αu T o o

L u = E φ yko tr (−W˜ uk T τ,γ φ yk φ yk + 1 o 1 −1 T o ∂ϑ T yk+1 W˜ V k − R y G yk o 2 ∂yk+1 o 1 −1 ˜ T o ∂ϑ T yk+1 + R y G yk W˜ V k o 2 ∂yk+1 o ∂εV k o 1 T T ˜ ˜ y y W G + R −1 − ε )φ ek uk k k o 2 y ∂yk+1 α2 φ T y o φ y o T + E u k k 2 tr (−W˜ uk φ yko τ,γ φ T yko φ yko + 1 o 1 −1 T o ∂ϑ T yk+1 − R y G yk W˜ V k o 2 ∂yk+1 o 1 −1 ˜ T o ∂ϑ T yk+1 W˜ V k + R y G yk o 2 ∂yk+1 1 ˜ T y o ∂εV k − εek )T + R −1 y G k o 2 ∂yk+1 o o 1 −1 T o ∂ϑ T yk+1 T ˜ × − Wuk φ yk − R y G yk W˜ V k o 2 ∂yk+1 o 1 −1 ˜ T o ∂ϑ T yk+1 W˜ V k + R y G yk o 2 ∂yk+1

o ∂εV k 1 T ˜ + R −1 − ε y G ek k o 2 y ∂yk+1 2 2 3αu − 6αu2 φmin ˜ ≤− ( W E uk τ,γ φ 2M + 1 2 2 2 αu + αu E (W˜ V k ) + T o o 2 φ yk φ yk + 1 τ,γ 2 4 2αu + αu (ψ M )2 E (W˜ Ck ) 2 2 + τ,γ 4 φM + 1 G M 2 4 2αu + αu (ψ M )2 ˜ 2 2 + E ( W V k ) τ,γ 4 φM + 1 G M 2 2 2 2αu + αu εVM ψ M ˜ 2 2 + E (WCk ) τ,γ 2 φM + 1 G M 2 + εeM (A.11)

4 4 2 εVM + . 2 +1 2 9 ϑ M

(A.12)

Next, the first difference of L B N can be expressed as

L B N =

2 T tr E W˜ Ck+1 W˜ Ck+1 τ,γ

2 T − tr E W˜ Ck W˜ Ck τ,γ 4 4 2 4 M ˜ ≤ − 1 − 4αC 4 W E Ck min τ,γ 2 8α 2 2 2 E W˜ ¯ε2 + C M2 CM τ,γ Ck min 4 2 + 4 ¯εC4 M . (A.13) min Next, using (A.2), (A.5), (A.9), (A.11)–(A.13) to form L as o 2 ∗

L ≤ − 1 − 2k E yk τ,γ

2 1 E W˜ V k −288A αV − 2a V2 − 12 τ,γ 2 − 1 − αC2 E e yk τ,γ

2 1 E W˜ uk −24G 2M φ 2M 3αu − 6αu2 − 3 τ,γ 2 2 −4M [1 − 4αC2 2M E W˜ Ck min τ,γ 4 216 αV − 2a V2 − 19 (ψ M φ M )2 ˜ W − E V k 2 τ,γ φmin 4 4 6 (ψ M φ M )2 4 M +εTM ˜ W − 1 − 8αC 4 E Ck τ,γ φ2 min

min

(A.14)


where

= λmax

R −1 y

9 2 + M

ϑM GM

2 εVM ψ M φ 2M 2

6 (ψ M )2 φ 2M ¯εC2 M

+ 2 2 2 2φmin φmin min 2 2 2 2 2 2 2 ) are and ρ = (φ M /φmin ) G M + 12(ψ M εVM / ϑmin positive constant and εT M is 2 24 φ 2M +1 G 2M φ 2M M 2 2 2 2

εeM εT M = 8G M εu M +16 2 ¯εC M + 2 min φmin 2 ψ M φ 2M 2 2 72 εVM (23G M φ M )2 2 + 2

εVM +

¯εC M 2 2 φmin ϑmin φmin η=

+

4 96 (ψ M φ M )2 εVM 48 (ψ M φ M )2 4

¯ ε + C M 2 4 2 ϑ 2 2 φmin φmin min min ϑ M + 1

+

4 96 (ψ M )2 φ 2M ¯εC4 M (68ψ M φ M )2 εVM + . 2 ϑ 4 2 4 φmin φmin min min

Therefore, L is less than zero when the following inequalities hold: # E e yk > εT M ≡ bey or τ,γ 1 − α2 C

ε 2 E W˜ Ck > max 2 T M min 2 M τ,γ 4 min −4αC2 M 2 ε 4 φmin × 4 4 T M4 min ≡ bW C 4 2 6 min −8αC M (ψ M φ M )

E W˜ V k > max εT M τ,γ 1 288A αV −2aV2 − 12 # ×4

2 φmin 216 αV −2aV2 − 19 (ψ M φ M )2

≡ bW V

(A.15) or

# εT M E W˜ uk > ≡ bW u τ,γ 8G 2M φ 2M 9αu − 18αu2 − 1

or

o εT M E y > ≡ by τ,γ k (1 − 2k ∗ )

provided the tuning gains are selected according to (19), (29), and (32) for the NNCS (4). Using the standard Lyapunov extension [27], the system outputs, NN identifier and weight estimation errors, and critic and action NN estimation errors are UUB in the mean, while the system outputs never leave the compact set. Next, using (24) and (26), we have $E_{\tau,\gamma}[\hat{u}(y_k^o) - u^*(y_k^o)] = -E_{\tau,\gamma}[\tilde{W}_u^T \phi(y_k^o) + \varepsilon_{uk}]$. When $k \to \infty$, the upper bound of $\|E_{\tau,\gamma}[\hat{u}(y_k^o) - u^*(y_k^o)]\|$ can be represented as
$$\left\| E_{\tau,\gamma}[\hat{u}(y_k^o)] - E_{\tau,\gamma}[u^*(y_k^o)] \right\| \le \left\| E_{\tau,\gamma}[\tilde{W}_u^T \phi(y_k^o)] \right\| + \left\| E_{\tau,\gamma}(\varepsilon_{uk}) \right\| \le \left\| E_{\tau,\gamma}[\tilde{W}_u^T \phi(y_k^o)] \right\| + \varepsilon_{uM} \le b_{Wu} + \varepsilon_{uM} \equiv \varepsilon_{bu}. \qquad \text{(A.16)}$$


Now, if the NN identifier, critic, and action NN approximation errors $\varepsilon_e$, $\varepsilon_u$, and $\varepsilon_V$ are neglected as in [7] and [12], and when $k \to \infty$, $\varepsilon_{TM}$ in (A.15) and $\varepsilon_{bu}$ in (A.16) will become zero in the mean. In this case, it can be shown that the NN-based identification, action NN, and critic NN estimation errors converge to zero asymptotically in the mean, i.e., $E_{\tau,\gamma}[\hat{u}(y_k^o)] \to E_{\tau,\gamma}[u^*(y_k^o)]$.

REFERENCES
[1] J. Nilsson, “Real-time control systems with delays,” Ph.D. dissertation, Dept. Automatic Control, Lund Inst. Technology, Lund, Sweden, 1998.
[2] G. C. Walsh, O. Beldiman, and L. Bushnell, “Error encoding algorithm for networked control system,” Automatica, vol. 38, no. 2, pp. 261–267, 2002.
[3] I. G. Polushin, P. X. Liu, and C. H. Lung, “On the model-based approach to nonlinear networked control systems,” Automatica, vol. 44, no. 9, pp. 2409–2414, 2008.
[4] L. L. Feng, J. Moyne, and D. Tilbury, “Optimal control design and evaluation for a class of networked control systems with distributed constant delays,” in Proc. Amer. Control Conf., 2002, pp. 3009–3014.
[5] M. Tabbara, “A linear quadratic Gaussian framework for optimal networked control system design,” in Proc. Amer. Control Conf., Jun. 2008, pp. 3804–3809.
[6] A. K. Dehghani, “Optimal networked control system design: A dual-rate approach,” in Proc. Can. Conf. ECE, May 2005, pp. 790–793.
[7] H. Xu, S. Jagannathan, and F. L. Lewis, “Stochastic optimal control of unknown linear networked control system in presence of random delays and packet losses,” Automatica, vol. 48, no. 6, pp. 1017–1030, Jun. 2012.
[8] D. P. Bertsekas and J. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA: Athena Scientific, 1996.
[9] P. J. Werbos, “A menu of designs for reinforcement learning over time,” Journal Neural Network Control, MA: MIT Press, 1991.
[10] D. V. Prokhorov and D. C. Wunsch II, “Adaptive critic designs,” IEEE Trans. Neural Netw., vol. 8, no. 5, pp. 997–1007, Sep. 1997.
[11] J. Dankert, L. Yang, and J. Si, “A performance gradient perspective on approximate dynamic programming and its application to partially observable Markov decision process,” in Proc. Int. Symp. Intell. Control, 2006, pp. 458–463.
[12] R. Enns and J. Si, “Helicopter trimming and tracking control using direct neural dynamic programming,” IEEE Trans. Neural Netw., vol. 14, no. 4, pp. 929–939, Jul. 2003.
[13] M. Fairbank, E. Alonso, and D. Prokhorov, “Simple and fast calculation of the second-order gradients for globalized dual heuristic dynamic programming in neural networks,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 10, pp. 1671–1676, Oct. 2012.
[14] Q. Yang, Z. Yang, and Y. Sun, “Universal neural network control of MIMO uncertain nonlinear system,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 7, pp. 1163–1169, Jul. 2012.
[15] H. Zhang, Y. Luo, and D. Liu, “Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints,” IEEE Trans. Neural Netw., vol. 20, no. 9, pp. 1490–1503, Sep. 2009.
[16] H. Zhang, Q. Wei, and Y. Luo, “A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm,” IEEE Trans. Syst., Man, Cybern. B, vol. 38, no. 4, pp. 937–942, Aug. 2008.
[17] F. L. Lewis and D. Vrabie, “Reinforcement learning and adaptive dynamic programming for feedback control,” IEEE Circuits Syst. Mag., vol. 9, no. 3, pp. 32–50, Jul.–Sep. 2009.
[18] C. Zheng and S. Jagannathan, “Generalized Hamilton–Jacobi–Bellman formulation-based neural network control of affine nonlinear discrete-time systems,” IEEE Trans. Neural Netw., vol. 19, no. 1, pp. 90–106, Jan. 2008.
[19] T. Dierks and S. Jagannathan, “Online optimal control of affine nonlinear discrete-time systems with unknown internal dynamics by using time-based policy update,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 7, pp. 1118–1129, Jul. 2012.
[20] A. Al-Tamimi and F. L. Lewis, “Discrete-time nonlinear HJB solution using approximate dynamic programming: Convergence proof,” IEEE Trans. Syst., Man, Cybern. B, vol. 38, no. 4, pp. 943–949, Aug. 2008.
[21] M. G. Safonov and T. C. Tsao, “The unfalsified control concept and learning,” IEEE Trans. Automat. Control, vol. 42, no. 6, pp. 843–847, Jun. 1997.
[22] L. W. Liou and A. Ray, “A stochastic regulator for integrated communication and control systems: Part I—formulation of control law,” ASME J. Dynamic Syst., Meas. Control, vol. 113, no. 4, pp. 604–611, 1991.
[23] H. Shousong and Z. Qixin, “Stochastic optimal control and analysis of stability of networked control systems with long delay,” Automatica, vol. 39, no. 11, pp. 1877–1884, Nov. 2003.
[24] X. Liu and A. Goldsmith, “Wireless medium access control in networked control systems,” in Proc. IEEE Amer. Control Conf., 2004, pp. 688–694.
[25] W. Stallings, Wireless Communications and Networks, 1st ed. Englewood Cliffs, NJ: Prentice-Hall, 2002.
[26] D. S. Bernstein and W. M. Haddad, “LQG control with an H∞ performance bound: A Riccati equation approach,” IEEE Trans. Automat. Control, vol. 34, no. 3, pp. 293–305, Mar. 1989.
[27] S. Jagannathan, Neural Network Control of Nonlinear Discrete-Time Systems. Boca Raton, FL: CRC Press, 2006.
[28] H. F. Chen and L. Guo, Identification and Stochastic Adaptive Control. Cambridge, MA: MIT Press, 1991.
[29] P. R. Kumar, “A survey of some results in stochastic adaptive control,” SIAM J. Control Optim., vol. 23, no. 3, pp. 329–380, 1985.
[30] B. A. Finlayson, The Method of Weighted Residuals and Variational Principles. New York: Academic, 1972.
[31] D. P. Bertsekas, Dynamic Programming and Optimal Control, 3rd ed. Belmont, MA: Athena Scientific, 2007.

Hao Xu (M’12) was born in Nanjing, China, in 1984. He received the Master's degree in electrical engineering from Southeast University, Nanjing, and the Ph.D. degree from the Missouri University of Science and Technology (formerly the University of Missouri-Rolla), Rolla, in 2009 and 2012, respectively. He is currently a Post-Doctoral Fellow with the Embedded Control Systems and Networking Lab, Missouri University of Science and Technology. His current research interests include networked control systems, cyber-physical systems, distributed network protocol design, approximate/adaptive dynamic programming, optimal control, and adaptive control.

Jagannathan Sarangapani (SM’99) received the Ph.D. degree in electrical engineering from the University of Texas, Arlington, in 1994. He is currently with the Missouri University of Science and Technology (formerly the University of Missouri-Rolla), Rolla, where he is a Rutledge-Emerson Distinguished Professor and Site Director for the NSF Industry/University Cooperative Research Center on Intelligent Maintenance Systems. He has co-authored around 95 peer-reviewed journal articles, 180 refereed IEEE conference articles, several book chapters, and three books entitled Neural network control of robot manipulators and nonlinear systems (Taylor & Francis, London, 1999), Discrete-time neural network control of nonlinear discrete-time systems (CRC Press, April 2006), and Wireless Ad Hoc and Sensor Networks: Performance, Protocols and Control (CRC Press, April 2007). He holds 18 patents with several pending. He has guided 15 doctoral students and 26 M.S. students. His research funding is in excess of $13 million from the NSF, NASA, AFRL, Sandia, and from companies such as Boeing, Caterpillar, Chevron, and Honeywell. His current research interests include adaptive and neural network control, networked control systems and sensor networks, prognostics, and autonomous systems/robotics. Dr. Sarangapani has served on a number of editorial boards and currently serves as the co-editor for the IET Book series on Control and on a number of IEEE Conference Committees.