
Zero-Sum Two-Player Game Theoretic Formulation of Affine Nonlinear Discrete-Time Systems Using Neural Networks

Shahab Mehraeen, Member, IEEE, Travis Dierks, Member, IEEE, S. Jagannathan, Senior Member, IEEE, and Mariesa L. Crow, Fellow, IEEE

Abstract—In this paper, the nearly optimal solution for discrete-time (DT) affine nonlinear control systems in the presence of partially unknown internal system dynamics and disturbances is considered. The approach is based on the successive approximate solution of the Hamilton–Jacobi–Isaacs (HJI) equation, which appears in optimal control. A successive approximation approach for updating the control and disturbance inputs of DT nonlinear affine systems is proposed. Moreover, sufficient conditions for the convergence of the approximate HJI solution to the saddle point are derived, and an iterative approach to approximate the HJI equation using a neural network (NN) is presented. Then, the requirement of full knowledge of the internal dynamics of the nonlinear DT system is relaxed by using a second NN online approximator. The result is a closed-loop optimal NN controller obtained via offline learning. A numerical example is provided illustrating the effectiveness of the approach.

Index Terms—Hamilton–Jacobi–Isaacs (HJI), neural networks (NNs), nonlinear discrete-time (DT) systems, optimal control.

I. INTRODUCTION

CLOSED-LOOP stability is often the sole purpose of many controller designs [1]. However, other objectives, such as optimality, require a control policy that stabilizes the system in an optimal manner according to an overall performance index. In the robust optimal control formulation, the objective of the controller is to minimize a performance index that penalizes the states and the control input [2] while exposing the system to the disturbances that the system can tolerate. The $H_\infty$ optimal control problem is a branch of robust optimal control which seeks not only to minimize a cost function but also to attenuate a worst-case disturbance [2]. State-space techniques for optimal control are derived for linear systems [3] by solving the Riccati equations in both the continuous-time and discrete-time (DT) domains. On the other hand, the work of Basar and Bernard [4] introduces a zero-sum two-player differential game and extends $H_\infty$ optimal control to nonlinear dynamic systems. In contrast, the concept of dissipativity [5] was employed to convert the $H_\infty$ optimal control problem into an $L_2$-gain optimal control problem in [6]. However, the $L_2$-gain optimal problem requires solving the nonlinear differential or difference Hamilton–Jacobi–Isaacs (HJI) equations, which do not have closed-form solutions. In order to overcome this problem, the continuous-time and DT problems are addressed in [7]–[9], where a solution is found by using the Taylor series expansion and solving for the coefficients using Riccati equations. Thus, the HJI problem is reduced to solving the Riccati equation and a sequence of linear algebraic equations. In contrast, Beard and McLain [10] propose an iterative policy to successively solve the HJI equation in continuous time by transforming the (nonlinear in the value function) partial differential equation into a sequence of linear differential equations using Galerkin techniques. An associated drawback of this approach is the requirement of a large number of integral calculations. A similar approach was adopted in [11], where the value function is approximated by a neural network (NN) that is trained offline using least squares techniques. In addition, in [12], a sequence of Hamilton–Jacobi–Bellman (HJB) equations has been obtained to solve a class of continuous-time HJI equations, where each HJB equation can be solved using a sequence of linear partial differential equations. In [13], an actor–critic–identifier structure is used to implement policy iteration, where adaptive NNs are used to identify the uncertain system and a critic NN is used to approximate the value function for continuous-time nonlinear systems. By contrast, an online solution to the HJI problem is proposed in [14], where policy iteration in an actor/critic structure with adaptive NNs is used to solve the continuous-time two-player zero-sum game for nonlinear systems with known dynamics.

While the continuous-time HJI problem has been under consideration [7], [10], [11], the DT robust optimal control problem for nonlinear systems has been addressed only in a limited manner. The work in [2] presents important fundamental principles concerning the HJI optimization problem via the $L_2$-gain optimal problem; however, no approximation strategy has been used to find a solution of the value function (available storage function). Similarly, [8] and [9] require knowledge of the system dynamics and use the Taylor series expansion of the system dynamics. In addition, dynamic programming approaches have been introduced to solve the DT HJB problem using policy [15] and value iteration methods [16]–[18], while the DT HJI problem has been addressed using value iteration [19] only. In [15], the HJB solution is found by using single-player (i.e., control policy only) policy iterations for the value function through its Taylor series expansion. By contrast, the work in this paper solves the DT HJI problem by considering two-player policy iterations (i.e., control policy and disturbance) using dynamic programming. In other words, this paper seeks to solve a nonlinear DT HJI problem and extends the effort in [15] by adding the effects of the disturbance, proposing a practical method for obtaining the $L_2$-gain near optimal control while keeping a tradeoff between accuracy and computational complexity. By using the Taylor series expansion of the available storage function only, an iterative approach to solve the approximate HJI (AHJI) equation is presented. Successive solutions for the available storage function ensure that the available storage function (when it exists) reaches its saddle point in a zero-sum two-player differential game where the players are the system disturbance and the control input. The successive approximations of the available storage function are accomplished using the approximation properties of NNs [1] and least squares. Subsequently, an NN identifier is introduced in this work to learn the nonlinear internal dynamics of the system. Using Lyapunov theory, it is shown that the identification errors converge to a small bounded region around the origin. Then, using the learned NN representation of the internal dynamics, offline training is undertaken, resulting in a novel solution to the AHJI optimal control problem.

The application of the Taylor series expansion to solve the nonlinear HJI equation is more involved than the use of the Taylor series expansion to solve the HJB equation in [15] due to the added disturbance term in the system representation and the treatment of this optimal control problem as a two-player game. For instance, additional considerations are required when solving the HJI equation to ensure the existence of a saddle point in the zero-sum two-player game, whereas solving the HJB equation [15] does not have such a requirement. Although, in [8] and [9], a solution to the HJI equation was found based on the Riccati equation (without using NNs), that approach uses the Taylor series expansion of the system dynamics as well as of the value function, optimal policy, and worst-case disturbance, which are obtained in a fundamentally different manner than in the present paper, since we only require the Taylor series expansion of the available storage function under a small-perturbation assumption and other expansions are not needed. In addition, the sufficient conditions for satisfaction of Isaacs's condition (saddle point) in the two-player game, which can lead to the minimum disturbance gain (the nonlinear $H_\infty$ optimal control problem), are shown in this work, as opposed to [8] and [9]. Moreover, the available storage function in this paper is approximated by an NN in a successive approach in an inner and outer loop fashion, in contrast with [8] and [9].


In addition, it is observed that [8] (as well as [9] and [15]) requires knowledge of the internal system dynamics $f_k(x)$, whereas this requirement is relaxed in our work using the NN identifier. Finally, convergence of the successive approximations is demonstrated while explicitly considering the NN identifier reconstruction errors. Consequently, the proposed work in this paper makes significant contributions beyond [8], [9], and [15]–[17].

This paper is organized as follows. First, background information for the DT nonlinear Hamilton–Jacobi (HJ) formulation is presented in Section II, and policy iteration is presented to find the worst-case disturbance. In Sections III and IV, the HJI equation in the zero-sum two-player game framework and under the small-signal assumption is derived, and an iterative approach is proposed to solve the HJI equation. In addition, convergence of the successive approximations is demonstrated. Section V presents the NN implementation of the successive approximation of the obtained HJI equation as well as the NN identification scheme, while numerical simulations and concluding remarks are provided in Sections VI and VII, respectively.

II. ZERO-SUM GAME FOR DT SYSTEMS AND POLICY ITERATION

Consider the DT affine nonlinear system with disturbance described by
$$ x_{k+1} = f(x_k) + g(x_k)u_k + h(x_k)w_k \qquad (1) $$

where $x_k = x(k) \in \mathbb{R}^n$ is the state vector at time step $k$; $f(x_k) \in \mathbb{R}^n$, $g(x_k) \in \mathbb{R}^{n\times m}$, and $h(x_k) \in \mathbb{R}^{n\times M}$ are smooth functions defined in a neighborhood of the origin; $u_k = u(x_k) \in \mathbb{R}^m$ is the control input; and $w_k = w(x_k) \in \mathbb{R}^M$ is the disturbance input. Also, for convenience, define $f_k = f(x_k)$, $g_k = g(x_k)$, $b_k = f(x_k) + g(x_k)u_k$, and $h_k = h(x_k)$. Assume that there exists a control input that is able to stabilize system (1) asymptotically. Then, our goal is to find a control input $u_k$ which can minimize the infinite horizon cost function
$$ J(x_k) = \sum_{j=k}^{\infty} r(x_j, u_j, w_j) = \|z(x_k)\|^2 - \gamma^2 w_k^T P w_k + J(x_{k+1}) \qquad (2) $$
in the presence of the worst-case disturbance $w_k$, where $r(x_j, u_j, w_j) = Q(x_j) + u_j^T R u_j - \gamma^2 w_j^T P w_j$ and $\|z(x_k)\|^2 = Q(x_k) + u_k^T R u_k$, with $Q(\cdot) \ge 0$, $R$ and $P$ positive definite matrices, and $\gamma > 0$ a constant. Thus, in addition to stabilizing the nonlinear system (1), the control input $u_k$ must make the cost function (2) finite; that is, $u_k$ must be admissible.

Definition 1 (Admissible Control): The control input $u_k = u(x_k)$ is called admissible with respect to the penalty function $Q(x_k) \ge 0$ and the control energy penalty function $u_k^T R u_k$ if the following conditions hold: 1) $u(x_k)$ is continuous with respect to $x_k$; 2) $u(x_k)|_{x_k = 0} = 0$; 3) $u_k$ stabilizes system (1); and 4) $J = \sum_{j=0}^{\infty}\left(Q(x_j) + u_j^T R u_j - \gamma^2 w_j^T P w_j\right) < \infty$.

In (2), the control input and disturbance terms affect the cost function such that an increase in the magnitude of the disturbance term requires an increase in the magnitude of the control input.


This formulation is often referred to as a zero-sum two-player game. Before proceeding, the following definitions are required.

Definition 2 [20] ($L_2$-Gain): Nonlinear system (1) with feedback control $u_k$ and disturbance $w_k \in l_2$ is said to have an $L_2$-gain less than or equal to $\gamma$ if
$$ \sum_{k=0}^{N} \|z(x_k)\|^2 \le \sum_{k=0}^{N} \gamma^2 w_k^T P w_k \qquad (3) $$
with $z(x_k)$ being excited by the disturbance from an initial state $x_0 = 0$. When $N$ approaches infinity, (3) becomes
$$ \sum_{k=0}^{\infty} \|z(x_k)\|^2 \le \sum_{k=0}^{\infty} \gamma^2 w_k^T P w_k. \qquad (4) $$

The problem of disturbance attenuation can be addressed by using the $L_2$-gain of a nonlinear system [11]. The disturbance $w_k$ is locally attenuated by a real value $\gamma > 0$ if there exists a neighborhood of the origin such that, for all $w_k \in l_2$ for which the trajectories of the closed-loop system (1) starting from the origin remain in the same neighborhood, the response $z(x_k) \in l_2$ satisfies $\sum_{k=0}^{\infty}\left(\gamma^2 w_k^T P w_k - \|z(x_k)\|^2\right) \ge 0$. Under these conditions, local disturbance attenuation with internal stability lends an admissible control input providing a closed-loop system with an $L_2$-gain less than or equal to $\gamma$ [11].

Definition 3 [2], [5] (Finite-Gain Dissipative System): The DT nonlinear system (1) is said to be finite gain dissipative with supply rate
$$ W(x_k, w_k) = \gamma^2 w_k^T P w_k - \|z(x_k)\|^2 \qquad (5) $$
if there exists a nonnegative function $V : \mathbb{R}^n \to \mathbb{R}$, called the available storage, with $V(0) = 0$ such that, for all $k \in Z^+$ and $w_k \in l_2$, it holds that $V(x_{k+1}) - V(x_k) \le W(x_k, w_k)$. This, in turn, yields the HJ inequality [21] as
$$ V(x_k) \ge \sup_{w_k}\left\{V(x_{k+1}) + \|z(x_k)\|^2 - \gamma^2 w_k^T P w_k\right\}. \qquad (6) $$

In general, the function $V(\cdot)$ need not be smooth and time invariant, i.e., $V(\cdot)$ can be an explicit function of time ($V_k(x_k)$) defined as $V : [0, \infty] \times \mathbb{R}^n \to \mathbb{R}$. However, in this paper, we restrict our analysis to systems that possess smooth time-invariant available storage functions $V(\cdot)$, where $V : \mathbb{R}^n \to \mathbb{R}$. The relationship between system (1) having an $L_2$-gain and being dissipative can be expressed as follows [2]. 1) The nonlinear system (1) has $L_2$-gain less than or equal to $\gamma$ if it is finite gain dissipative with the supply rate (5) and $V(0) = 0$. 2) The nonlinear system (1) is finite gain dissipative with the supply rate (5) if it has $L_2$-gain less than or equal to $\gamma$ and is reachable from $x = 0$.

A. Approximate HJ Equation

In order to find the worst-case disturbance $w_k^*$ that satisfies $w_k^* = \arg\max_{w_k}\{V(x_{k+1}) + \|z(x_k)\|^2 - \gamma^2 w_k^T P w_k\}$, the Hamiltonian function is defined as
$$ H(x_k, w_k) = V(x_{k+1}) - V(x_k) + \|z(x_k)\|^2 - \gamma^2 w_k^T P w_k \qquad (7) $$


where $H(x_k, w_k) = 0$ represents the HJ equation. The worst-case disturbance $w_k^*$ can be found by using the stationarity condition $\partial H(x_k, w_k)/\partial w_k = 0$, which yields
$$ w_k^* = \frac{1}{2\gamma^2} P^{-1} h_k^T \frac{\partial V^*(x_{k+1})}{\partial x_{k+1}}. \qquad (8) $$
Next, by substituting (8) into $H(x_k, w_k^*) = 0$ in (7), the DT HJ equation becomes
$$ V^*(x_{k+1}) - V^*(x_k) + \|z(x_k)\|^2 + \frac{1}{4\gamma^2}\left(\frac{\partial V^*(x_{k+1})}{\partial x_{k+1}}\right)^T h(x_k) P^{-1} h(x_k)^T \frac{\partial V^*(x_{k+1})}{\partial x_{k+1}} = 0. \qquad (9) $$

Note that the differential equation (9) is nonlinear with respect to $V^*(x_{k+1})$ and, in general, is difficult to solve as it does not have a closed-form solution. Thus, in this paper, the Taylor series expansion approach is employed to solve the approximate DT HJ optimization. By assuming small perturbations about the operating point $x_k$, we expand $\Delta V(x_k)$ by keeping the first two terms of the Taylor series and considering the higher order terms to be negligible. It was shown in [15] that this assumption is not stringent and can be applied to quadratic cost functions without making the small-perturbation assumption. Thus, we obtain
$$ \Delta V(x_k) = V(x_{k+1}) - V(x_k) \approx \nabla V^T (x_{k+1} - x_k) + \frac{1}{2}(x_{k+1} - x_k)^T \nabla^2 V (x_{k+1} - x_k) \qquad (10) $$
where $\nabla V$ and $\nabla^2 V$ are the gradient vector and Hessian matrix, respectively, as shown in
$$ \nabla V = \left[\frac{\partial V(x)}{\partial x_1} \cdots \frac{\partial V(x)}{\partial x_n}\right]^T\bigg|_{x = x_k} \qquad (11) $$
$$ \nabla^2 V = \begin{bmatrix} \frac{\partial^2 V(x)}{\partial x_1^2} & \frac{\partial^2 V(x)}{\partial x_1\partial x_2} & \cdots & \frac{\partial^2 V(x)}{\partial x_1\partial x_n} \\ \frac{\partial^2 V(x)}{\partial x_2\partial x_1} & \frac{\partial^2 V(x)}{\partial x_2^2} & \cdots & \frac{\partial^2 V(x)}{\partial x_2\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 V(x)}{\partial x_n\partial x_1} & \frac{\partial^2 V(x)}{\partial x_n\partial x_2} & \cdots & \frac{\partial^2 V(x)}{\partial x_n^2} \end{bmatrix}\Bigg|_{x = x_k}. \qquad (12) $$

Lemma 1: Let $u_k$ be an initial admissible control policy applied to the nonlinear DT system (1) with the associated cost function (2) and available storage function $V(\cdot)$ satisfying the approximate HJ equation (10), as
$$ \nabla V^T(b_k + h_k w_k - x_k) + \frac{1}{2}(b_k + h_k w_k - x_k)^T \nabla^2 V (b_k + h_k w_k - x_k) + \|z(x_k)\|^2 - \gamma^2 w_k^T P w_k = 0 \qquad (13) $$
and $V(0) = 0$. Then, $J(x_j) = V(x_j)$ for all $j \in Z^+$.
Proof: The proof can be completed by expanding (10) to obtain $V(x_\infty) - V(x_j)$, considering $V(x_\infty) = 0$ (due to the admissibility of $u_k$), and constructing $J(x_j) - V(x_j) = 0$ by using the definition of $J(x_k)$. ∎

Definition 4 (DOV [22]): The domain of validity (DOV) of $V$ is the set $\Omega$ of all $x$ satisfying (13).
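As a quick numerical illustration of (10), the following sketch compares the exact difference $V(x_{k+1}) - V(x_k)$ with its second-order expansion; the storage function, operating point, and perturbation below are illustrative choices, not quantities from the paper.

import numpy as np

# Sample storage function V(x) = x1^4 + x2^2 with hand-coded gradient/Hessian.
def V(x):
    return x[0]**4 + x[1]**2

def grad_V(x):
    return np.array([4.0 * x[0]**3, 2.0 * x[1]])

def hess_V(x):
    return np.array([[12.0 * x[0]**2, 0.0],
                     [0.0, 2.0]])

x_k = np.array([0.2, -0.1])
x_k1 = np.array([0.21, -0.08])   # small perturbation of x_k, as assumed in (10)
dx = x_k1 - x_k

exact = V(x_k1) - V(x_k)
approx = grad_V(x_k) @ dx + 0.5 * dx @ hess_V(x_k) @ dx   # right side of (10)
print(exact, approx)   # the two values agree up to third-order terms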


B. Policy Iteration

Assumption 1: The Hessian matrix $\nabla^2 V$ is positive definite in the set $\Omega$ around the origin.

Assumption 1 can be realized in many systems, including linear systems, due to the nonnegative continuous available storage function in the set $\Omega$ around the origin. For example, consider the available storage $V(x_1, x_2, x_3) = x_1^4 + x_2^4 + x_3^4$, whose Hessian matrix is positive semidefinite in the entire $\mathbb{R}^3$ and positive definite whenever all state components are nonzero.

The worst-case disturbance $w_k^*$ can be found by setting the first partial derivative of (13) with respect to $w_k$ equal to zero, i.e., $\partial H(x_k, w_k)/\partial w_k = 0$, which yields
$$ w_k^* = \left(2\gamma^2 P - h_k^T \nabla^2 V h_k\right)^{-1} h_k^T\left[\nabla V + \nabla^2 V (f_k + g_k u_k^* - x_k)\right]. \qquad (14) $$

Theorem 1: Let $u_k$ be an admissible control input for system (1) with initial disturbance $w_k^{(0)} = 0$. Also, let system (1) have an $L_2$-gain less than or equal to $\gamma$ and be reachable from $x = 0$. In addition, assume that the available storage is a smooth and time-invariant function $V^* \ge 0$ and $V^* \in C^2$ with a DOV $\Omega^*$. Then, iterating between the pair (13) and (14) as
$$ \nabla V^{(j)T}\left(b_k + h_k w_k^{(j)} - x_k\right) + \|z(x_k)\|^2 - \gamma^2 w_k^{(j)T} P w_k^{(j)} + \frac{1}{2}\left(b_k + h_k w_k^{(j)} - x_k\right)^T \nabla^2 V^{(j)}\left(b_k + h_k w_k^{(j)} - x_k\right) = 0 \qquad (15) $$
and the disturbance
$$ w_k^{(j+1)} = -\left(h_k^T \nabla^2 V^{(j)} h_k - 2\gamma^2 P\right)^{-1} h_k^T\left[\nabla V^{(j)} + \nabla^2 V^{(j)}(b_k - x_k)\right] \qquad (16) $$
ensures that the sequence $V^{(j)}$ with a DOV $\Omega^{(j)}$ is monotonically increasing provided $2\gamma^2 P - h_k^T \nabla^2 V^{(j)} h_k$ is positive definite. Also, the sequence converges to the worst-case disturbance for system (1), controlled by policy $u_k$, provided the HJI problem (15) has a solution (i.e., $w^{(j+1)} \in l_2$). That is, $V^{(j)} \le V^{(j+1)} \le V^*$ with $\Omega^* \subseteq \Omega^{(j+1)} \subseteq \Omega^{(j)}$ and $\lim_{j\to\infty} V^{(j)} = V^{(\infty)} = V^*$. Moreover, the available storage function $V^*$ satisfies
$$ \nabla V^{*T}(b_k + h_k w_k^* - x_k) + \|z(x_k)\|^2 - \gamma^2 w_k^{*T} P w_k^* + \frac{1}{2}(b_k + h_k w_k^* - x_k)^T \nabla^2 V^* (b_k + h_k w_k^* - x_k) = 0 \qquad (17) $$
where $w_k^* = w_k^{(\infty)}$.

Proof: First, we evaluate $\Delta V^{(j)}$ along the trajectories of $x_{k+1} = b_k + h_k w_k^{(j+1)}$. From (10), we have
$$ \Delta V^{(j)} \approx \nabla V^{(j)T}(x_{k+1} - x_k) + \frac{1}{2}(x_{k+1} - x_k)^T \nabla^2 V^{(j)} (x_{k+1} - x_k). \qquad (18) $$
Now, substituting $x_{k+1} = b_k + h_k w_k^{(j+1)}$ into (18) yields
$$ \Delta V^{(j)} = \nabla V^{(j)T}\left(b_k + h_k w_k^{(j+1)} - x_k\right) + \frac{1}{2}\left(b_k + h_k w_k^{(j+1)} - x_k\right)^T \nabla^2 V^{(j)}\left(b_k + h_k w_k^{(j+1)} - x_k\right). \qquad (19) $$
Next, we evaluate $\Delta V^{(j)}$ along the trajectories of $x_{k+1} = b_k + h_k w_k^{(j)}$ as
$$ \nabla V^{(j)T}\left(b_k + h_k w_k^{(j)} - x_k\right) + \|z(x_k)\|^2 - \gamma^2 w_k^{(j)T} P w_k^{(j)} + \frac{1}{2}\left(b_k + h_k w_k^{(j)} - x_k\right)^T \nabla^2 V^{(j)}\left(b_k + h_k w_k^{(j)} - x_k\right) = 0. \qquad (20) $$
Also, from the update law (16), we have the relationship
$$ w_k^{(j+1)T}\left(h_k^T \nabla^2 V^{(j)} h_k - 2\gamma^2 P\right) = -\left[\nabla V^{(j)} + \nabla^2 V^{(j)}(b_k - x_k)\right]^T h_k. \qquad (21) $$
Subtracting (20) from the right-hand side of (19) and using (21) yields
$$ \Delta V^{(j)} = -\|z(x_k)\|^2 + \gamma^2 w_k^{(j+1)T} P w_k^{(j+1)} + \frac{1}{2}\left(w_k^{(j+1)} - w_k^{(j)}\right)^T\left(2\gamma^2 P - h_k^T \nabla^2 V^{(j)} h_k\right)\left(w_k^{(j+1)} - w_k^{(j)}\right). \qquad (22) $$
Next, taking the infinite sum of (22) implies that
$$ V(x_\infty)^{(j)} - V(x_l)^{(j)} = \sum_{k=l}^{\infty}\Delta V(x_k)^{(j)} = \sum_{k=l}^{\infty}\left[-\|z(x_k)\|^2 + \gamma^2 w_k^{(j+1)T} P w_k^{(j+1)} + \frac{1}{2}\left(w_k^{(j+1)} - w_k^{(j)}\right)^T\left(2\gamma^2 P - h_k^T \nabla^2 V^{(j)} h_k\right)\left(w_k^{(j+1)} - w_k^{(j)}\right)\right] $$
$$ = -J(x_l)^{(j+1)} + \sum_{k=l}^{\infty}\frac{1}{2}\left(w_k^{(j+1)} - w_k^{(j)}\right)^T\left(2\gamma^2 P - h_k^T \nabla^2 V^{(j)} h_k\right)\left(w_k^{(j+1)} - w_k^{(j)}\right) \qquad (23) $$
where $J(x_l)^{(j+1)}$ is the cost function (2) computed when system (1) is excited by $w_k^{(j+1)}$, $l \le k < \infty$. Note that $\nabla^2 V^{(j)}$ is evaluated at each $x_k$ for all $l \le k < \infty$ in the summation. Since $V(x_\infty)^{(j)} = 0$ and, in addition, from Lemma 1, it is known that $J(x_l)^{(j+1)} = V(x_l)^{(j+1)}$, (23) leads to
$$ V^{(j+1)} = V^{(j)} + \sum_{k=l}^{\infty}\frac{1}{2}\left(w_k^{(j+1)} - w_k^{(j)}\right)^T\left(2\gamma^2 P - h_k^T \nabla^2 V^{(j)} h_k\right)\left(w_k^{(j+1)} - w_k^{(j)}\right). \qquad (24) $$


2 (j) When 2γ 2 P − hT )hk is positive definite for all k (∇ V j x ∈ Ω , (24) shows that V (j+1) ≥ V (j) , which proves the first claim of the theorem. The second part of the theorem is easily shown by noting that, when convergence occurs, the sequence leads to solving (13) and (14) for available storage function V ∗ and disturbance wk∗ . Moreover, since system (1) has an L2 -gain less than or equal to γ and is reachable from x = 0, it is finite gain dissipative, and thus, the available storage function has a maximum according to (6). Thus, V (j) continues to increase by the sequential updates  until V (j) = V (j+1) = V ∗ . Remark 1: The success in approximation of the available 2 (j) hk be storage V ∗ requires that the function 2γ 2 P − hT k∇ V positive definite. This condition is also needed to update the disturbance as mentioned before. The inequality 2γ 2 λmin (P ) > 2 (j) hk ) imposes a requirement on the value λmax (hT k∇ V x∈S (j) ⊆Ω(j)

is a function of γ, and thus, the inequality is of γ since ∇ V to be solved for γ. Solving for V (j) can be achieved by using approximate solutions such as NNs, which is presented in this paper. This way, different values of γ can be tried to find a range of proper values for γ. Similar approach can be utilized to solve for the minimum value of γ in H∞ optimal problem [23]. The policy iteration introduced in Theorem 1 starts with admissible control input uk for system (1) and initial disturbance (0) wk = 0. Then, the available storage function is obtained by solving (15), and the disturbance is updated using (16). The procedure proceeds by iterating between (15) and (16) until V (j) = V (j+1) is reached. Remark 2: Although the policy iteration for the disturbance leads to the worst case disturbance, it may take a large number (if not infinity) of iterations. In practice, this problem can be resolved by setting an approximation error εw > 0. Thus, the iterations proceed until the condition V (j+1) − V (j) < εw is satisfied. 2

(j)
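A minimal sketch of the resulting inner loop is given below, assuming the storage function $V^{(j)}$ is carried as a parameter vector (e.g., the NN weights of Section V). The callables solve_storage and update_disturbance are hypothetical stand-ins for solving (15) and applying (16); they are not part of the paper.

import numpy as np

def disturbance_iteration(solve_storage, update_disturbance, w0,
                          eps_w=1e-6, max_iters=100):
    # w^(0) = 0 per Theorem 1; solve_storage(w) returns the parameters of
    # the storage function V^(j) that solves (15) for the fixed policy u_k.
    w = w0
    V = solve_storage(w)
    for j in range(max_iters):
        w = update_disturbance(V)      # w^(j+1) from (16)
        V_next = solve_storage(w)      # V^(j+1) from (15)
        if np.linalg.norm(V_next - V) < eps_w:
            return V_next, w           # stopping rule of Remark 2
        V = V_next
    return V, w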

III. TWO-PLAYER GAME FORMULATION AND AHJI EQUATION

In the previous section, the $L_2$-gain problem was introduced, where we seek the worst-case disturbance $w_k^* \in l_2$ for which the trajectories of the closed-loop system (1) starting from the origin remain in the same neighborhood and the response $z(x_k) \in l_2$ satisfies $\sum_{k=0}^{\infty}\left(\gamma^2 w_k^T P w_k - \|z(x_k)\|^2\right) \ge 0$, where $\|z(x_k)\|^2 = Q(x_k) + u_k^T R u_k$. However, in this problem, finding the control policy $u_k^*$ that minimizes the cost function (2) subject to the disturbance $w_k^*$ is also of interest. This control problem, referred to as the HJI problem, can be represented as a zero-sum two-player game [4], where the two policies $u_k^* = u^*(x_k)$ and $w_k^* = w^*(x_k)$ represent the optimal policy and worst-case disturbance of (2), respectively, such that $J(u_k^*, w_k) \le J(u_k^*, w_k^*) \le J(u_k, w_k^*)$ for all $u_k$ and $w_k$. The pair $(u_k^*, w_k^*)$ then becomes the feedback saddle-point solution of the optimization problem [4]. The necessary and sufficient condition for the existence of a saddle point for the two-player game, where the two players are the control policy and disturbance in nonlinear system (1), is given next.


Theorem 2 [2]: Consider the zero-sum two-player differential game associated with system (1) and cost function (2), where the players are the control policy and disturbance. The feedback saddle-point solution $(u_k^*, w_k^*)$ is achievable if and only if there exists a smooth function $V : \mathbb{R}^n \to \mathbb{R}$ (available storage) such that the HJI equation
$$ V^*(x_k) = \min_{u_k}\max_{w_k}\left\{Q(x_k) + u_k^T R u_k - \gamma^2 w_k^T P w_k + V^*(x_{k+1})\right\} = \max_{w_k}\min_{u_k}\left\{Q(x_k) + u_k^T R u_k - \gamma^2 w_k^T P w_k + V^*(x_{k+1})\right\} $$
$$ = Q(x_k) + u_k^{*T} R u_k^* - \gamma^2 w_k^{*T} P w_k^* + V^*(f + g u_k^* + h w_k^*) \qquad (25) $$
has a solution with $V(0) = 0$. ∎

AHJI Equation

According to the optimization problem (25), the DT HJI equation becomes [11] $V^*(f_k + g_k u_k^* + h_k w_k^*) - V^*(x_k) + Q(x_k) + u_k^{*T} R u_k^* - \gamma^2 w_k^{*T} P w_k^* = 0$, where the optimal policy $u_k^*$ and worst-case disturbance $w_k^*$ are the solutions of (25). Thus, the Hamiltonian function can be defined as
$$ H(x_k, u_k, w_k) = V(x_{k+1}) - V(x_k) + Q(x_k) + u_k^T R u_k - \gamma^2 w_k^T P w_k. \qquad (26) $$
Note that, when $H(x_k, u_k^*, w_k^*) = 0$, we have the DT HJI equation. According to the definition of $V(x_k)$ in (25), the optimal control input $u_k^*$ and worst-case disturbance $w_k^*$ can be obtained by setting the first partial derivatives of the right-hand side of (26) with respect to $u_k$ and $w_k$ equal to zero (the stationarity conditions [24]) as $\partial H(x_k, u_k, w_k)/\partial u_k = 0$ and $\partial H(x_k, u_k, w_k)/\partial w_k = 0$. These equations, along with $H(x_k, u_k^*, w_k^*) = 0$, in turn yield the optimal control policy, worst-case disturbance, and DT HJI equation as
$$ u_k^* = -\frac{1}{2} R^{-1} g_k^T\,\partial V^*(x_{k+1})/\partial x_{k+1} $$
$$ w_k^* = \frac{1}{2\gamma^2} P^{-1} h_k^T\,\partial V^*(x_{k+1})/\partial x_{k+1} $$
$$ V^*(x_{k+1}) - V^*(x_k) + Q(x_k) + \frac{1}{4}\left(\frac{\partial V^*(x_{k+1})}{\partial x_{k+1}}\right)^T\left(\frac{h(x_k) P^{-1} h(x_k)^T}{\gamma^2} + g(x_k) R^{-1} g(x_k)^T\right)\frac{\partial V^*(x_{k+1})}{\partial x_{k+1}} = 0. \qquad (27) $$
Similar to (9), we face a mixed algebraic and nonlinear partial differential equation with respect to $V^*(x_{k+1})$, which is, in general, difficult to solve. In [15], a Taylor series expansion approach was undertaken to overcome the difficulties in finding $V^*(\cdot)$ in the single-player HJB optimization problem, where the optimal control policy $u_k^*$ is sought to minimize the cost function in the absence of disturbances. However, no work has been done on the policy iteration formulation of the nonlinear DT HJI equation. What makes the DT HJI problem distinct from the HJB case is that the DT HJI equation contains an additional player $w_k$ that makes the successive approximations more complicated.


For instance, added complexity is introduced during the successive approximation of the HJI equation, which requires an inner loop and an outer loop (discussed next), whereas successively approximating the HJB equation requires only a single training loop. More importantly, due to the existence of two players, additional considerations are taken here when solving the HJI equation to ensure satisfaction of Isaacs's condition (the existence of a saddle point) in the zero-sum two-player game, whereas solving the HJB equation does not have such a requirement.

The HJ problem formulation introduced in the previous section is now exploited to obtain an AHJI equation as
$$ H(x_k, u_k, w_k) = \nabla V^T (f_k + g_k u_k + h_k w_k - x_k) + \frac{1}{2}(f_k + g_k u_k + h_k w_k - x_k)^T \nabla^2 V (f_k + g_k u_k + h_k w_k - x_k) + Q(x_k) + u_k^T R u_k - \gamma^2 w_k^T P w_k. \qquad (28) $$

Note that, when $H(x_k, u_k^*, w_k^*) = 0$, the AHJI equation results, where $u_k^*$ and $w_k^*$ are the new optimal policy and worst-case disturbance inputs to be obtained. These can be found by setting the first partial derivatives of (28) with respect to $u_k$ and $w_k$, respectively, equal to zero as $\partial H(x_k, V_k, u_k, w_k)/\partial w_k = 0$ and $\partial H(x_k, V_k, u_k, w_k)/\partial u_k = 0$, which, in turn, yields
$$ w_k^* = \left(2\gamma^2 P - h_k^T \nabla^2 V h_k\right)^{-1} h_k^T\left[\nabla V + \nabla^2 V (f_k + g_k u_k^* - x_k)\right] \qquad (29) $$
and
$$ u_k^* = -\left(g_k^T \nabla^2 V g_k + 2R\right)^{-1} g_k^T\left[\nabla V + \nabla^2 V (f_k + h_k w_k^* - x_k)\right]. \qquad (30) $$

(30)

Remark 3: Equations (29) and (30) may also be solved together to obtain solutions in terms of the available storage function and system states as −1 −1 T  2 −1 T 2 wk∗ = − I − Yw−1 hT Y w hk k ∇ V g k Y u g k ∇ V hk  T 2 × ∇V + ∇ V (fk − xk )   −∇2 V gk Yu−1 gkT ∇V T + ∇2 V (fk − xk ) and −1 −1 T  2 u∗k = − I − Yu−1 gkT ∇2 V hk Yw−1 hT Yu gk k ∇ V gk  × ∇V T + ∇2 V (fk − xk )    T + ∇2 V (fk − xk ) −∇2 V hk Yw−1 hT k ∇V 2 where Yu = gkT ∇2 V · gk + 2R and Yw = hT k ∇ V · hk − 2 2γ P . In contrast to the policies (27), the optimal control and worst case disturbance in (29) and (30) can be calculated independent of xk+1 . Thus, once the function V (xk ) is obtained, the optimal control input and worst case disturbance can be implemented. However, despite its linear behavior, finding a solution for the

(i)

(i,j)T

(i,j)

+ Q(xk ) + uk Ruk − γ 2 wk P wk   T 1 (i) (i,j) + fk + gk uk + hk wk − xk ∇2 V (i,j) 2   (i) (i,j) (31) × fk + gk uk + hwk − xk = 0. The updated disturbance can now be found by using    −1 (i,j+1) 2 (i,j) wk hk − 2γ 2 P = − hT hT k ∇ V k    T (i) × ∇V (i,j) + ∇2 V (i,j) fk + gk uk − xk

(32)

(i,j+1)

which is used to find Vk using (31) at step j + 1. The inner (i,j) (i,j+1) loop j proceeds until it converges such that Vk = Vk = (i,∞) , as shown in Theorem 1. Vk (i) Next, uk is updated using   −1  (i+1) uk = − gkT ∇2 V (i,∞) gk + 2R gkT    T (i,∞) × ∇V (i,∞) + ∇2 V (i,∞) fk + hk wk − xk . (33) Then, the available storage function is found by solving (31) for (i+1,0) . Similar to the inner loop, the outer loop i proceeds Vk (i,∞) (i+1,∞) (∞,∞) until it converges such that Vk = Vk = Vk , as shown in Fig. 1, which is proven next.
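The two-loop procedure of Fig. 1 can be summarized in code as follows. As in the earlier sketch, solve_ahji, update_w, and update_u are hypothetical callables standing in for solving (31) and applying (32) and (33), with $V^{(i,j)}$ carried as a parameter vector; the tolerances play the roles of $\varepsilon_w$ and $\varepsilon_u$ from Remarks 2 and 4.

import numpy as np

def successive_approximation(solve_ahji, update_w, update_u, u0, w0,
                             eps_w=1e-6, eps_u=1e-6, max_iters=100):
    u, w = u0, w0                      # u^(0) admissible, w^(0,0) = 0
    V = solve_ahji(u, w)               # V^(0,0) from (31)
    V_outer = None
    for i in range(max_iters):         # outer loop over u^(i)
        for j in range(max_iters):     # inner loop over w^(i,j)
            w = update_w(V, u)         # w^(i,j+1) from (32)
            V_next = solve_ahji(u, w)  # V^(i,j+1) from (31)
            if np.linalg.norm(V_next - V) < eps_w:
                break
            V = V_next
        if V_outer is not None and np.linalg.norm(V_next - V_outer) < eps_u:
            break                      # V^(i,inf) = V^(i+1,inf) = V*
        V_outer = V_next
        u = update_u(V_next, w)        # u^(i+1) from (33)
        V = solve_ahji(u, w)           # V^(i+1,0) from (31)
    return V_next, u, w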

Fig. 1. Successive approximation procedure.

Theorem 3: Let $u_k^{(i)}$ be an admissible control for pair $(i, j)$ for system (1) on the set $\Omega^{(i,j)}$ that has $L_2$-gain less than or equal to $\gamma$ and is reachable from $x = 0$. In addition, assume that the available storage function is a smooth and time-invariant function $V^* > 0$ and $V^* \in C^2$ with a DOV $\Omega^*$. Then, $V^{(i,\infty)} \ge V^{(i+1,\infty)} \ge V^*$ and $\lim_{i\to\infty} V^{(i,\infty)} = V^*$, where $V^*$ solves the AHJI equation (28), provided a saddle point exists in the two-player game and (31) has a solution (i.e., $w^{(i,j+1)} \in l_2$). Also, if $V^{(i,\infty)} = V^{(i+1,\infty)}$, then $V^{(i,\infty)} = V^*$.
Proof: See the Appendix. ∎

Remark 4: Similar to Remark 2, an approximation error $\varepsilon_u > 0$ can be defined to avoid a large number of iterations and to stop the loop in practice. Moreover, in order to guarantee the loop convergence in practice, the set of valid initial conditions can be determined with prior knowledge or some offline effort.

The following theorem will now demonstrate that the optimal control and worst-case disturbance of (28) ensure Isaacs's condition $\min_u\max_w H = \max_w\min_u H$ in the zero-sum two-player game (28).

Theorem 4 (Isaacs's Condition): Let the pair $(u_k, w_k^*)$ be an arbitrary admissible control and the worst-case disturbance input as provided by (29) for system (1). In addition, let the pair $(u_k^*, w_k)$ be the optimal controller provided by (30) and an arbitrary disturbance input, respectively, for system (1). Then, the Hamiltonian function (28) satisfies
$$ H(x_k, u_k^*, w_k) \le H(x_k, u_k^*, w_k^*) \le H(x_k, u_k, w_k^*). $$
Proof: See the Appendix. ∎

Since it is assumed that the available storage $V^*(x_k)$ in (25) exists and that the regularity condition applies, Theorem 4 implies that $V(x_k, u_k^*, w_k) \le V(x_k, u_k^*, w_k^*) \le V(x_k, u_k, w_k^*)$. Also, in the two-player game, where the two players are the control policy and disturbance in nonlinear system (1), $u_k^{(i)}$ and $w_k^{(i,j)}$ approach the saddle-point solutions $u_k^*$ and $w_k^*$, respectively, as $i$ and $j$ increase. Next, the admissibility of the controller is presented.

Lemma 2 (Admissibility of the Controller): Let $u_k^{(0)}$ be an initial admissible control input for system (1), which has an $L_2$-gain less than or equal to $\gamma$ and is reachable from $x = 0$ in the compact set $\Omega$. Let the proposed successive approximation procedure of updates for the disturbance $w_k^{(i,j)}$ in the inner loop $j$ and updates for the control input $u_k^{(i)}$ in the outer loop $i$ be performed, and let $V^{(i,j)}$ exist at each iteration. Then, the control input $u_k^{(i)}$ remains admissible at each step of the outer loop $i$.
Proof: The proof follows from $J(x_l)^{(0,\infty)} = V(x_l)^{(0,\infty)}$ (from the admissibility of $u_k^{(0)}$ and Theorem 1) and from the fact that the positive function $V^{(1,\infty)}$ can be obtained with $V^{(1,\infty)} \le V^{(0,\infty)}$, followed by employing induction. ∎

Remark 5: In [8], using the Taylor series to approximate the system dynamics as well as the value function, a solution for the DT HJI equation (27) was found. Consequently, the DT HJI equation is reduced to solving a Riccati equation along with a sequence of linear algebraic equations. In contrast, this work takes a fundamentally different approach in forming the Taylor series expansion since we do not require the Taylor series expansion of the system dynamics. In addition, the work of Huang [8] is a power series solution of the HJI problem using their approximation techniques, while this work proposes a two-player game formulation of the HJI problem and shows that Isaacs's condition is satisfied, leading to the saddle-point solution. Finally, it is observed that [8] (and [15]) requires knowledge of the internal system dynamics $f_k(x)$, whereas this requirement is relaxed in our work using the NN identifier presented in Section V.

Remark 6: It can be shown that the AHJI equation (13) leads to the DT algebraic Riccati equation (DARE) for linear systems (see the Appendix).

V. NN APPROXIMATION OF THE AVAILABLE STORAGE FUNCTION

So far, we have demonstrated the recursive solution to the AHJI equation by successively updating the disturbance $w^{(i,j)}$ and control $u^{(i)}$, as shown in Fig. 1. However, a general closed-form solution for the AHJI equation is still difficult to obtain even though the AHJI equation is a linear differential equation and, in general, easier to solve than the original nonlinear HJI equation.


Moreover, the solution of the HJI equation requires the internal dynamics (i.e., $f_k(x)$) to be known. In this section, by using an NN approximator for the internal dynamics $f_k(x)$, we show that the available storage function can be obtained with a small error.

A. Successive Approximation of the Available Storage Function Using NN

First, we show how to approximate the solution of the HJI equation for the DT nonlinear system by using the approximation properties of NNs and by assuming that the disturbance term $w^{(i,j)}$ and the control input $u^{(i)}$ are in feedback form. It is known that NNs can approximate smooth functions on a compact set $\Omega$ [1], [28]. Then, we can approximate $V^{(i,j)}$ with an NN as
$$ V^{(i,j)} \approx V_L^{(i,j)}(x) = \sum_{l=1}^{L}\omega_l\sigma_l(x) = W_L^T\sigma_L(x) \qquad (34) $$
where the activation function $\sigma_l(x)$ is continuous and zero at the origin, $W_L = [\omega_1 \cdots \omega_L]^T$ is the vector of NN weights, $\sigma_L = [\sigma_1 \cdots \sigma_L]^T$ is the vector of activation functions, and $L$ is the number of hidden layer neurons. The NN weights are tuned to minimize the residual error (defined next) in the least squares sense over a set of points within the stability region of the initial admissible control. In the AHJI equation $\mathrm{AHJI}(V_L^{(i,j)}, u^{(i)}, w^{(i,j)}) = 0$, the available storage $V^{(i,j)}$ is replaced by $V_L^{(i,j)}$ to obtain the residual error as
$$ \mathrm{AHJI}\left(V_L^{(i,j)} = \sum_{l=1}^{L}\omega_l\sigma_l(x),\ u^{(i)},\ w^{(i,j)}\right) = e_L(x) \qquad (35) $$
where the residual error is defined as
$$ e_L(x) = \nabla V^T(f_k + g_k u_k + h_k w_k - x_k) + \frac{1}{2}(f_k + g_k u_k + h_k w_k - x_k)^T \nabla^2 V (f_k + g_k u_k + h_k w_k - x_k) + Q(x_k) + u_k^T R u_k - \gamma^2 w_k^T P w_k. $$
Weighted residuals [25] are used to find the least squares solution of (35). The weights are determined by using
$$ \left\langle e_L(x), \frac{\partial e_L(x)}{\partial W_L}\right\rangle = 0. \qquad (36) $$
By expanding (36), taking the derivative with respect to the NN weights, and using the distributive property of the inner product, we obtain
$$ \left\langle \nabla\sigma_L^T\Delta x + \frac{1}{2}\Delta x^T\nabla^2\sigma_L\Delta x,\ \nabla\sigma_L^T\Delta x + \frac{1}{2}\Delta x^T\nabla^2\sigma_L\Delta x\right\rangle\cdot W_L + \left\langle Q(x) + u^{(i)T} R u^{(i)} - \gamma^2 w^{(i,j)T} P w^{(i,j)},\ \nabla\sigma_L^T\Delta x + \frac{1}{2}\Delta x^T\nabla^2\sigma_L\Delta x\right\rangle = 0 \qquad (37) $$
where the terms $\nabla\sigma_L$ and $\nabla^2\sigma_L$ are the gradient vector and Hessian matrix of $\sigma_L(x)$ with respect to $x$, respectively, and $\Delta x = f(x) + g(x)u(x)^{(i)} + h(x)w(x)^{(i,j)} - x$. The following lemma is needed to proceed.

Lemma 3 [15]: If the set $\{\sigma_j(x)\}_1^L = \{\sigma_1(x), \ldots, \sigma_L(x)\}$ is linearly independent, then so is the set $\{\nabla\sigma_j^T\Delta x + (1/2)\Delta x^T\nabla^2\sigma_j\Delta x\}_1^L$.

From (37), we have
$$ W_L = -\langle\theta, \theta\rangle^{-1}\left\langle Q(x) + u^{(i)T} R u^{(i)} - \gamma^2 w^{(i,j)T} P w^{(i,j)},\ \theta\right\rangle \qquad (38) $$
where $\theta = \{\nabla\sigma_j^T\Delta x + \Delta x^T\nabla^2\sigma_j\Delta x/2\}_1^L$. As a result of Lemma 3, $\langle\theta, \theta\rangle$ is full rank and invertible. Therefore, a unique solution for the weights can be obtained. In addition, the inner products in (38) can be approximated as [25] $\langle a(x), b(x)\rangle = \int_\Omega a(x)b(x)\,dx \approx \sum_{i=1}^{p} a(x_i)b(x_i)\delta_x$, where $\delta_x = x_i - x_{i-1}$ is the mesh size, chosen small in $\Omega$, and $p$, the number of mesh sectors, is large. By employing the mesh size $\delta_x$ in the set $\Omega$, the NN weights can be found as
$$ W_L = -(X^T X)^{-1} X^T Y \qquad (39) $$
where
$$ X = \left[\theta|_{x=x_1} \cdots \theta|_{x=x_p}\right]^T, \qquad Y = \left[\left.\left(Q(x) + u^{(i)T} R u^{(i)} - \gamma^2 w^{(i,j)T} P w^{(i,j)}\right)\right|_{x=x_1} \cdots \left.\left(Q(x) + u^{(i)T} R u^{(i)} - \gamma^2 w^{(i,j)T} P w^{(i,j)}\right)\right|_{x=x_p}\right]^T $$
and $p$ is the number of points in the mesh. In Fig. 1, the NN weight calculations are performed at the steps where the available storage function is to be obtained, i.e., steps 3, 4, and 5 in Fig. 1. Thus, the available storage function is obtained using the NN in the algorithm.

Remark 8: Similar to [23], in order to solve the $H_\infty$ problem that seeks the smallest value of $\gamma$, the NN approximation of the storage function can be utilized along with different values of $\gamma$ to obtain the minimum value of $\gamma$ for which a storage function exists.
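The batch least squares step (39) can be sketched concretely as below, with the small quadratic basis $[x_1^2, x_1x_2, x_2^2]$ standing in for the higher-order polynomial basis used in Section VI; only the scalar-input case ($m = M = 1$) is handled, and the dynamics, policies, and penalty functions are passed in as callables. This is an assumed minimal implementation, not the authors' code.

import numpy as np

def basis_grad(x):
    # Rows are the gradients of sigma = [x1^2, x1*x2, x2^2].
    return np.array([[2.0 * x[0], 0.0],
                     [x[1], x[0]],
                     [0.0, 2.0 * x[1]]])

def basis_hess():
    # Constant Hessians of the three basis functions.
    return [np.array([[2.0, 0.0], [0.0, 0.0]]),
            np.array([[0.0, 1.0], [1.0, 0.0]]),
            np.array([[0.0, 0.0], [0.0, 2.0]])]

def nn_weights(mesh, f, g, h, u, w, Q, R, P, gamma):
    X, Y = [], []
    H = basis_hess()
    for x in mesh:
        dx = f(x) + g(x) * u(x) + h(x) * w(x) - x        # Delta x in (37)
        theta = np.array([gp @ dx + 0.5 * dx @ Hl @ dx   # rows of X
                          for gp, Hl in zip(basis_grad(x), H)])
        X.append(theta)
        Y.append(Q(x) + R * u(x)**2 - gamma**2 * P * w(x)**2)
    X, Y = np.asarray(X), np.asarray(Y)
    return -np.linalg.solve(X.T @ X, X.T @ Y)            # W_L from (39)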


From (38), the NN weight update depends on $f_k(x)$ for implementation of the iterative scheme; however, in many practical applications, it is not always possible to obtain an explicit expression for the internal dynamics $f_k(x)$ a priori.

B. Identification of Unknown Nonlinear Internal Dynamics

Consider the affine nonlinear DT system with known input disturbance
$$ x_{k+1} = f_k + g_k u_k + h_k w_k. \qquad (40) $$
By using the universal approximation properties of NNs [1], the smooth function $f(x)$ can be represented using an NN as
$$ f(x) = W_f\psi_f(x) + \varepsilon \qquad (41) $$
where $W_f$ represents the bounded target weight matrix and $\psi_f(x)$ is a basis satisfying $\|\psi_f(x)\| \le \psi_{fM}$ for all $x \in \Omega$. It is observed that this condition is easily met with a proper selection of the basis function. In addition, $\varepsilon = [\varepsilon_1, \ldots, \varepsilon_n]^T$ and $\|\varepsilon\| \le \varepsilon_M$, where $\varepsilon_M$ is a positive constant. The NN identification scheme is now defined as
$$ \hat{x}_{k+1} = \hat{f}_k + g_k u_k + h_k w_k + K\tilde{x}_k \qquad (42) $$
where $\tilde{x}_k = x_k - \hat{x}_k$, $K$ is a design constant, $\hat{f}(x) = \hat{W}_f^T\psi_f(x)$, and $\hat{W}_f$ is the NN approximation of $W_f$. Subtracting (42) from (40) reveals the identification error dynamics to be
$$ \tilde{x}_{k+1} = \tilde{f}_k + \varepsilon - K\tilde{x}_k \qquad (43) $$
where $\tilde{f}_k = \tilde{W}_f^T\psi_f$ with $\tilde{W}_f = W_f - \hat{W}_f$. Let the NN tuning law be given by
$$ \hat{W}_f(k+1) = \hat{W}_f(k) + \alpha\psi_f\left(\tilde{x}_{k+1} + K\tilde{x}_k\right)^T. \qquad (44) $$
Then, the NN weight estimation error dynamics $\tilde{W}_f(k+1) = W_f - \hat{W}_f(k+1)$ are given by
$$ \tilde{W}_f(k+1) = \tilde{W}_f(k) - \alpha\left(\psi_f\psi_f^T\tilde{W}_f(k) + \psi_f\varepsilon^T\right) = \tilde{W}_f(k) - \alpha\Delta W \qquad (45) $$
where $\Delta W = \psi_f\psi_f^T\tilde{W}_f(k) + \psi_f\varepsilon^T$. The key feature of the update law (44) is that, when $\tilde{x}_k$ is very small, i.e., $x_k \approx \hat{x}_k$, we can conclude that identifier (42) has learned the internal dynamics $f(x)$, i.e., $\hat{f}(x) \approx f(x)$. Before proceeding, the following definition is required.

Definition 5 [1]: An equilibrium point $x_e$ is said to be uniformly ultimately bounded (UUB) if there exists a compact set $S \subset \mathbb{R}^n$ so that, for all initial states $x_0 \in S$, there exist a bound $B$ and a time step $T(B, x_0)$ such that $\|x(k) - x_e\| \le B$ for all $k \ge k_0 + T$.

Theorem 5: Let the proposed identification scheme in (42) be used to identify $f(x)$, and let the NN update law be given by (44). Also, let a stabilizing control input $u_k$ be utilized for system (40) and identifier (42). Then, the state estimation errors $\tilde{x}(k)$ and NN function approximation errors $\tilde{W}_f^T\psi_f$ are UUB.
Proof: Define the Lyapunov function $L = (\alpha/3)\tilde{x}_k^T\tilde{x}_k + (1/\alpha)\mathrm{tr}\{\tilde{W}_f^T\tilde{W}_f\}$. Calculating the first difference and using (43) and (45), we have
$$ \Delta L = \frac{\alpha}{3}\tilde{x}_{k+1}^T\tilde{x}_{k+1} + \frac{1}{\alpha}\mathrm{tr}\left\{\tilde{W}_f(k+1)^T\tilde{W}_f(k+1)\right\} - \frac{\alpha}{3}\tilde{x}_k^T\tilde{x}_k - \frac{1}{\alpha}\mathrm{tr}\left\{\tilde{W}_f^T\tilde{W}_f\right\} \le \alpha\|\tilde{f}_k\|^2 + \alpha\varepsilon^2 + \alpha K^2\|\tilde{x}_k\|^2 - \frac{\alpha}{3}\|\tilde{x}_k\|^2 + \frac{1}{\alpha}\mathrm{tr}\left\{\left(\tilde{W}_f(k) - \alpha\Delta W\right)^T\left(\tilde{W}_f(k) - \alpha\Delta W\right)\right\} - \frac{1}{\alpha}\mathrm{tr}\left\{\tilde{W}_f^T\tilde{W}_f\right\}. $$
After expanding terms and using the Cauchy–Schwarz inequality, we obtain
$$ \Delta L \le -\frac{\alpha}{3}\left(1 - 3K^2\right)\|\tilde{x}_k\|^2 + \alpha\left\|\tilde{W}_f^T\psi_f\right\|_F^2 + \alpha\varepsilon_M^2 + 2\alpha\psi_{fM}^2\left\|\tilde{W}_f^T\psi_f\right\|_F^2 + 2\alpha\psi_{fM}^2\varepsilon_M^2 - 2\left\|\tilde{W}_f^T\psi_f\right\|_F^2 + 2\left\|\tilde{W}_f^T\psi_f\right\|_F\varepsilon_M $$
which, in turn, leads to
$$ \Delta L \le -\frac{\alpha}{3}\left(1 - 3K^2\right)\|\tilde{x}_k\|^2 - \left(1 - \alpha\left(2\psi_{fM}^2 + 1\right)\right)\left\|\tilde{W}_f^T\psi_f\right\|_F^2 + \varepsilon_M^2\left(1 + \alpha\left(1 + 2\psi_{fM}^2\right)\right). $$
It can be concluded that $\Delta L \le 0$ if the design parameters are selected according to $K \le 1/\sqrt{3}$ and $\alpha \le 1/(2\psi_{fM}^2 + 1)$ and if either of the following inequalities holds:
$$ \|\tilde{x}_k\| \ge \sqrt{\frac{\varepsilon_M^2\left(1 + \alpha\left(1 + 2\psi_{fM}^2\right)\right)}{(\alpha/3)\left(1 - 3K^2\right)}} = b_x \quad \text{or} \quad \left\|\tilde{W}_f^T\psi_f\right\|_F \ge \sqrt{\frac{\varepsilon_M^2\left(1 + \alpha\left(1 + 2\psi_{fM}^2\right)\right)}{1 - \alpha\left(2\psi_{fM}^2 + 1\right)}} = b_W. \qquad (46) $$
Consequently, the state identification error, as well as the function approximation error, converges uniformly to the bounds $b_x$ and $b_W$. Note that $b_x$ and $b_W$ can be made small by choosing proper design gains and by decreasing $\varepsilon_M$ by means of increasing the number of hidden layer neurons [1]. ∎

From Theorem 5, it is clear that $\varepsilon_f = f_k(x) - \hat{f}_k(x) = \tilde{W}_f^T\psi_f + \varepsilon$ is bounded such that
$$ \|\varepsilon_f\| \le b_W\sigma_{fM} + \varepsilon_M \equiv \varepsilon_{fM}. \qquad (47) $$
Next, we investigate the effect of using $\hat{f}_k(x)$ in the NN least squares training method of Section V-A.

Theorem 6: Let the internal dynamics $f_k(x)$ be provided by the NN identifier (42) so that the bounds (46) hold. If the NN least squares algorithm is utilized for tuning the NN weights in order to obtain $\hat{f}_k(x)$ so that the available storage function $\hat{V}^*(x)$ can be constructed, then $|\hat{V}^*(x) - V^*(x)| \le T(|\varepsilon_f|) \le \bar{\varepsilon}_M$, where $T(\cdot)$ is a function of $\varepsilon_f$ and $\bar{\varepsilon}_M$ is a positive constant.
Proof: See the Appendix. ∎
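A one-step sketch of the identifier (42) and the tuning law (44) follows. The sigmoid hidden layer with randomly chosen fixed weights $V_f$ mirrors the simulation setup of Section VI; the hidden-layer size (20) and the gain $K = 0.01$ are taken from that section, whereas the learning rate alpha and the zero initialization of the output weights are illustrative assumptions.

import numpy as np

class NNIdentifier:
    def __init__(self, n, hidden=20, K=0.01, alpha=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.Vf = rng.standard_normal((n, hidden))  # fixed hidden weights
        self.Wf = np.zeros((hidden, n))             # tunable output weights
        self.K, self.alpha = K, alpha

    def psi(self, x):
        return 1.0 / (1.0 + np.exp(-self.Vf.T @ x))  # sigmoid basis psi_f

    def predict(self, x, gu_plus_hw, x_tilde):
        # x_hat_{k+1} = f_hat_k + g_k u_k + h_k w_k + K x_tilde_k, as in (42)
        return self.Wf.T @ self.psi(x) + gu_plus_hw + self.K * x_tilde

    def update(self, x, x_tilde_next, x_tilde):
        # W_hat_f(k+1) = W_hat_f(k) + alpha psi_f (x_tilde_{k+1} + K x_tilde_k)^T, as in (44)
        self.Wf += self.alpha * np.outer(self.psi(x),
                                         x_tilde_next + self.K * x_tilde)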

VI. SIMULATION RESULTS

Example 1: Consider the nonlinear system described by
$$ \begin{bmatrix} x_1((k+1)T) \\ x_2((k+1)T) \end{bmatrix} = \begin{bmatrix} -0.8\,x_2(kT) \\ \sin(0.8\,x_1(kT) - x_2(kT)) + 1.8\,x_2(kT) \end{bmatrix} + \begin{bmatrix} 0 \\ -1 - x_2(kT) \end{bmatrix}u + \begin{bmatrix} 0 \\ 1 \end{bmatrix}w. $$


The initial stabilizing controller is defined by $u = [1\ \ 1.5]\,x(kT)$, where $T = 1$ s. The initial conditions are taken as $x(0) = [0.1\ \ 0.1]^T$ with $Q(x) = x^T x$, $R = P = 1$, and $\gamma = 20$. The region $(-0.5 \le x_1 \le 0.5,\ -0.5 \le x_2 \le 0.5)$ is used to generate the mesh and train the NN, while the mesh size in the $(x_1, x_2)$ plane is chosen to be $(0.05, 0.05)$. The activation functions of the NN are even polynomial functions up to the tenth order of the form $[x_1^2, x_1x_2, x_2^2, x_1^4, x_1^3x_2, \ldots, x_2^4, \ldots, x_1^{10}, \ldots, x_2^{10}]$, and the disturbance and control input are updated according to (32) and (33), respectively, while the available storage function is approximated by the NN whose weights are obtained from (38). Upon completion of the offline training (ten outer loops with different numbers of inner loops ranging from 4 to 12), the final NN weights are $W_L = [2.246, 2.113, 3.319, 28.41, 38.956, 35.115, -6.221, -74.443, -165.62, -202.56, -160.54, 9.629, 6.883, -60.43, -4.526, 143.20, -5.184, -132.63, 4.151, -39.21, 10.04, 117.56, 309.78, 521.20, 515.72, 372.54, 35.32, 24.96, 11.18, -79.09, -263.67, -430.76, -607.42, -532.26, -332.39]^T$.

The performance of the final optimal control policy is then compared to that of the initial stabilizing control. In the comparison, a disturbance $w = 0.05e^{-0.1kT}$ is introduced into the system at $k = 0$. In order to evaluate the overall performance of the system, the following performance metric [7] is utilized:
$$ \mathrm{Attenuation}(k) = \frac{\sum_{j=0}^{k}\left(Q(x_j) + u_j^T R u_j\right)}{\sum_{j=0}^{k}\gamma^2 w_j^T P w_j}. \qquad (48) $$

Fig. 2. Nonlinear system states and control inputs with the optimal control.

Fig. 3. Nonlinear system control cost.

Fig. 2 shows the control effort of the optimal control law as well as the initial control law, while Fig. 3 shows the control attenuation associated with each policy evaluated using the metric (48). By examining Fig. 2, it is observed that, with the improved control law, the states converge to the origin faster and more smoothly, as opposed to the case with the nonoptimal input. In addition, examining the control signals shown in Fig. 2, the final optimal control policy exhibits significant improvements over the initial policy in terms of magnitude and smoothness. In Fig. 3, a significant decrease in the control effort (48) is observed when the optimal control law is applied to the system. Thus, the improved control policy behaves as expected.

Next, the identifier (42) is used in the system of Example 1 to approximate the internal dynamics $f(x)$ when $f(x)$ is assumed unknown and is approximated by $\hat{f}(x) = \hat{W}_f^T\psi_f(x)$, where the identification scheme (42) and the NN weight update law (44) are used. A sigmoid is used for the NN activation function, $\psi_f(x) = \mathrm{sigmoid}(V_f^T x)$, where the number of hidden layer neurons is 20 and the hidden layer weights $V_f$ are chosen initially at random and kept constant during the simulation, whereas the weights $\hat{W}_f$ are set initially to $-10$. Also, biases are included in both the hidden and output layers. The identifier's initial actual and estimated states are $x_1(0) = x_2(0) = \hat{x}_1(0) = \hat{x}_2(0) = 0.1$. In order to perform the identification, the control input $u(k)$ is designed through backstepping and is obtained as $u(k) = (1/(-1 - x_2))\left[-(\sin(0.8x_1 - x_2) + 1.8x_2) + (1/(-0.8))(r(k+2) + K_1 e_1(k+1)) + K_2 z_2\right]$ to let the system state $x_1(k)$ be excited through tracking the function $r(k) = \sin(k)$, where $e_1 = x_1 - r$, $z_2 = x_2 - x_{2d}$, and $x_{2d} = -(1/0.8)(r(k+1) + K_1 e_1(k))$. In addition, the design gains are $K_1 = 0.1$ and $K_2 = 0.01$, and the identifier design gain in (42) is $K = 0.01$. Then, the proposed nonlinear optimal controller design is considered for the system with the approximation $\hat{f}(x) = \hat{W}_f^T\psi_f(x)$. The original stabilizing controller $u = [1\ \ 1.5][x_1(k)\ \ x_2(k)]^T$ is applied. Also, $Q(x) = x^T Q x$ with $Q = 1$, $R = P = 1$, and $\gamma = 20$ are used. In order to implement the NN approximator, the mesh size in the $(x_1, x_2)$ plane is chosen to be 0.03. The region $(-0.5 \le x_1 \le 0.5,\ -0.5 \le x_2 \le 0.5)$ is used to train the NN. As in the known-dynamics case, the NN is defined with activation functions containing even polynomial functions up to the tenth order of the form $[x_1^2, x_1x_2, x_2^2, x_1^4, x_1x_2^3, \ldots, x_2^4, \ldots, x_1^{10}, \ldots, x_2^{10}]$, and the disturbance and control input are updated according to the proposed procedure (shown in Fig. 1). Upon completion of the offline training, the performance of the final optimal control policy is compared to that of the initial stabilizing control. As before, a disturbance $w = 0.05e^{-0.1k}$ is introduced to the system at $k = 0$. Fig. 4 shows the performance of the optimal controller using $\hat{f}(x)$ (the approximated $f(x)$) through the NN identifier, where it is shown that the convergence of the states, as well as of the input, is close to that of the optimal case with known internal dynamics. This shows the significance of the NN identifier in relaxing the need for the internal dynamics.

Fig. 4. Nonlinear system control cost.
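The metric (48) translates directly into code; the sketch below assumes scalar control and disturbance signals stored as arrays, with a nonzero disturbance so that the denominator does not vanish.

import numpy as np

def attenuation(x_traj, u_traj, w_traj, Q, R, P, gamma):
    num = np.cumsum([Q(x) + R * u**2 for x, u in zip(x_traj, u_traj)])
    den = np.cumsum([gamma**2 * P * w**2 for w in w_traj])
    return num / den   # Attenuation(k) of (48) for each time step k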

Next, another example is employed to verify that the proposed AHJI approach recovers the DARE optimal policies.

Example 2: In this example, the performance of the proposed method is compared against the DARE solution for linear systems. Consider the linear system described by
$$ \begin{bmatrix} x_1(k+1) \\ x_2(k+1) \end{bmatrix} = \begin{bmatrix} 0 & -0.8 \\ 0.8 & 1.8 \end{bmatrix}\begin{bmatrix} x_1(k) \\ x_2(k) \end{bmatrix} + \begin{bmatrix} 0 \\ -1 \end{bmatrix}u + \begin{bmatrix} 0 \\ 1 \end{bmatrix}w. $$
The initial stabilizing controller is chosen to be
$$ u = [1.5\ \ 1.5]\begin{bmatrix} x_1(k) \\ x_2(k) \end{bmatrix}. $$
The initial conditions are taken as $x_1(0) = 1$ and $x_2(0) = 1$. Moreover, $Q(x) = x^T Q x$ is used, $R = Q = P = 1$, and $\gamma = 20$. In order to implement the NN approximator, the mesh size in the $(x_1, x_2)$ plane is chosen to be 0.05. The region $(-0.5 \le x_1 \le 0.5,\ -0.5 \le x_2 \le 0.5)$ is used to train the NN. The NN is defined with activation functions containing even polynomial functions up to the sixth order of the form $[x_1^2, x_1x_2, x_2^2, x_1^4, x_1x_2^3, \ldots, x_2^4, \ldots, x_1^6, \ldots, x_2^6]$, and the disturbance and control input are updated according to (32) and (33). Upon completion of the offline training, the performance of the final optimal control policy is compared to the linear DARE optimal policies $w_k^* = Kx_k$ and $u_k^* = Lx_k$ [see (B5)] as well as to the initial stabilizing controller. The DARE is solved using MATLAB, where the matrix $\Gamma$ is obtained as
$$ \Gamma = \begin{bmatrix} 1.5074 & 1.0070 \\ 1.0070 & 3.7919 \end{bmatrix}. $$
In the comparison, a disturbance input $w = 10e^{-0.1k}$ is introduced into the system at $k = 0$. The system states and the control efforts when the proposed optimal controller and the original stabilizing controller are applied are shown in Fig. 5, which shows that the DARE optimal controller provides the same results as the proposed nonlinear optimal controller. This verifies that the proposed optimal control design reduces to the DARE optimal controller when the system is linear. Fig. 6 shows the attenuation associated with each policy, where the attenuation is defined as in (48). Examining Fig. 7, which illustrates the system trajectories under the nonlinear optimal control strategy, the DARE optimal controller, and the original stabilizing controller, it can be observed that the proposed nonlinear optimal controller coincides with the DARE optimal policy with significant improvement over the initial stabilizing controller. These examples clearly indicate that the proposed optimal control policy renders the desired performance as expected.

Fig. 5. Linear system states and control inputs with nonlinear and DARE optimal controllers as well as original stabilizing controller.

Fig. 6. Linear system control attenuation with nonlinear optimal controller and original stabilizing controller.

Fig. 7. Linear system trajectories with nonlinear and DARE optimal controllers as well as original stabilizing controller.

VII. CONCLUSION

In this paper, nearly optimal solutions for DT nonlinear control systems have been considered. A successive approximation approach was utilized to solve the AHJI equation that appears in optimal control. Under a small-perturbation assumption, the definition of the AHJI function as well as methods for updating the control input and disturbance for DT nonlinear affine systems have been proposed, and the associated available storage function has been obtained. Moreover, sufficient conditions for algorithm convergence to the saddle point have been derived. Then, an NN was employed to approximate the HJI equation using a least squares approach so that a closed-loop optimal NN controller via offline learning could be derived. Finally, an NN-based identifier relaxes the need for the system internal dynamics. Simulation results verify the theoretical conjectures.

APPENDIX A

Proof of Theorem 3: By evaluating $\Delta V$ from (10) along the trajectories of $x_{k+1} = f(x_k) + g(x_k)u_k^{(i+1)} + h(x_k)w_k^{(i,\infty)}$ and $x_{k+1} = f(x_k) + g(x_k)u_k^{(i)} + h(x_k)w_k^{(i,\infty)}$ and following a similar procedure as in Theorem 1, it can be shown that
$$ \Delta V^{(i,\infty)} = -\frac{1}{2}\left(u_k^{(i+1)} - u_k^{(i)}\right)^T\left(2R + g_k^T\nabla^2 V^{(i,\infty)} g_k\right)\left(u_k^{(i+1)} - u_k^{(i)}\right) - Q(x_k) - u_k^{(i+1)T} R u_k^{(i+1)} + \gamma^2 w_k^{(i,\infty)T} P w_k^{(i,\infty)}. \qquad (A1) $$
Next, take the infinite sum of (A1) to get
$$ V(x_\infty)^{(i,\infty)} - V(x_l)^{(i,\infty)} = \sum_{k=l}^{\infty}\Delta V(x_k)^{(i,\infty)} = \sum_{k=l}^{\infty}\left[-\frac{1}{2}\left(u_k^{(i+1)} - u_k^{(i)}\right)^T\left(2R + g_k^T\nabla^2 V^{(i,\infty)} g_k\right)\left(u_k^{(i+1)} - u_k^{(i)}\right) - Q(x_k) - u_k^{(i+1)T} R u_k^{(i+1)} + \gamma^2 w_k^{(i,\infty)T} P w_k^{(i,\infty)}\right] $$
$$ = -J(x_l)^{(i+1,\infty)} - \sum_{k=l}^{\infty}\frac{1}{2}\left(u_k^{(i+1)} - u_k^{(i)}\right)^T\left(2R + g_k^T\nabla^2 V^{(i,\infty)} g_k\right)\left(u_k^{(i+1)} - u_k^{(i)}\right). $$
Since $V(x_\infty)^{(i,j)} = 0$ and $J(x_l)^{(i,\infty)} = V(x_l)^{(i,\infty)}$ (from Lemma 1), applying reasoning similar to that used in Theorem 1 shows that
$$ V(x_l)^{(i+1,\infty)} = V(x_l)^{(i,\infty)} - \sum_{k=l}^{\infty}\frac{1}{2}\left(u_k^{(i+1)} - u_k^{(i)}\right)^T\left(2R + g_k^T\nabla^2 V^{(i,\infty)} g_k\right)\left(u_k^{(i+1)} - u_k^{(i)}\right). \qquad (A2) $$
Observing $2R + g_k^T\nabla^2 V_k g_k$ and Assumption 1, (A2) shows that $V(x_l)^{(i+1,\infty)} \le V(x_l)^{(i,\infty)}$. Moreover, since a saddle point is assumed to exist in the considered two-player game in the neighborhood $\Omega^*$ around the origin, $V^{(i,\infty)}$ has a minimum $V^*$ as $u_l^{(i)}$ varies. Thus, $V^{(i,\infty)}$ continues to decrease under the sequential updates (33) in the outer loop $i$ until $V^{(i,\infty)} = V^{(i+1,\infty)} = V^*$. Since $V^{(i+1,\infty)} \le V^{(0,0)}$, it yields $V(x_\infty)^{(i+1,\infty)} = 0$, and from Lemma 1, the new available storage function is equal to the cost function (2), i.e., $J(x_k) = V(x_k)$. Next, from the cost function (2), we obtain $V(x_k)^{(i+1,\infty)} = \|z(x_k)\|^2 - \gamma^2 w_k^{(\infty)T} P w_k^{(\infty)} + V(x_{k+1})^{(i+1,\infty)}$, which satisfies the HJ equation (6); thus, the system is finite gain dissipative and has $L_2$-gain less than or equal to $\gamma$ according to Definition 3. ∎

Proof of Theorem 4: The proof proceeds in two steps. First, it is shown that $H(x_k, u_k, w_k^*) - H(x_k, u_k^*, w_k^*) \ge 0$. By noting that $H(x_k, u_k^*, w_k^*)$ and $H(x_k, u_k, w_k^*)$ are nothing but (28) rewritten in terms of $w_k^*$ and/or $u_k^*$, and using (30) with $Y_u = g_k^T\nabla^2 V_k g_k + 2R$, it yields
$$ H(x_k, u_k, w_k^*) - H(x_k, u_k^*, w_k^*) = -u_k^{*T} Y_u^T(u_k - u_k^*) + \frac{1}{2}u_k^T Y_u u_k - \frac{1}{2}u_k^{*T} Y_u u_k^* = \frac{1}{2}(u_k - u_k^*)^T Y_u (u_k - u_k^*). $$
Since the matrix $Y_u$ is positive definite, it can be concluded that $H(x_k, u_k, w_k^*) - H(x_k, u_k^*, w_k^*) \ge 0$. Next, it is shown that $H(x_k, u_k^*, w_k) - H(x_k, u_k^*, w_k^*) \le 0$. Similar to the previous case, we employ (29) to obtain
$$ H(x_k, u_k^*, w_k) - H(x_k, u_k^*, w_k^*) = -w_k^{*T} Y_w^T(w_k - w_k^*) + \frac{1}{2}w_k^T Y_w w_k - \frac{1}{2}w_k^{*T} Y_w w_k^* = \frac{1}{2}(w_k - w_k^*)^T Y_w (w_k - w_k^*) $$
where $Y_w = h_k^T\nabla^2 V_k h_k - 2\gamma^2 P$. Thus, $H(x_k, u_k^*, w_k) - H(x_k, u_k^*, w_k^*) \le 0$ can be achieved provided $-Y_w = 2\gamma^2 P - h_k^T\nabla^2 V_k h_k$ is positive definite (from Remark 1), which, in turn, results in $H(x_k, u_k^*, w_k) \le H(x_k, u_k^*, w_k^*) \le H(x_k, u_k, w_k^*)$. ∎

Proof of Theorem 6 Similar to (34) which renders V ∗ (x) when NN successive approximation algorithm converges, we approximate the function Vˆ ∗ (x) with an NN as Vˆ ∗ (x) ≈

L  l=l

ˆ T σ L (x). ω ˆ l σl (x) = W L

(A3)

MEHRAEEN et al.: GAME THEORETIC FORMULATION OF NONLINEAR DT SYSTEMS USING NNs

From (38), it is easy to verify that Ŵ_L is a function of x and f̂, whereas W_L is the same function of x and f, i.e.,

Ŵ_L = T_1(x, f̂),   W_L = T_1(x, f)   (A4)

where T_1(x, f) = −⟨θ(x, f), θ(x, f)⟩^{-1} ⟨Q(x) + u(x)^{(i)T} R u(x)^{(i)} − γ^2 w(x)^{(i,j)T} P w(x)^{(i,j)}, θ(x, f)⟩ at step (i, j). Thus, by using (34) and (A6), we can rewrite V̂^*(x) and V^*(x) as V̂^*(x) ≈ Ŵ_L^T σ_L(x) = T_2(x, f̂) and V^*(x) ≈ W_L^T σ_L(x) = T_2(x, f), respectively, where T_2(x, f) = σ_L(x)^T T_1(x, f). Consequently, by using the Taylor series around the point (x, f) and ε_f = f(x) − f̂(x) from (47), we obtain

V^*(x) − V̂^*(x) = T_2(x, f) − T_2(x, f̂)
≈ − Σ_{s=1}^{S} Σ_{k_f1,…,k_fn} [1/(k_f1! ⋯ k_fn!)] [∂^s T_2(x, f̂)/((∂^{k_f1} f̂_1) ⋯ (∂^{k_fn} f̂_n))]|_{f̂=f} × ε_1^{k_f1} ⋯ ε_n^{k_fn}

where S is the number of Taylor series powers required for the truncation error to be negligible, −ε_f = [ε_1, …, ε_n]^T, f̂ = [f̂_1, …, f̂_n]^T, and the k_fi are nonnegative integers such that Σ_{i=1}^{n} k_fi = s, with n being the size of the vector f̂. Since V^*(x) and V̂^*(x) are bounded and continuous (due to the selection of the activation functions), there exists a positive constant δ_M such that |∂^s T_2/((∂^{k_f1} f_1) ⋯ (∂^{k_fn} f_n))| ≤ δ_M. Consequently, |V^*(x) − V̂^*(x)| ≤ Σ_{s=1}^{S} Σ_{k_f1,…,k_fn} [1/(k_f1! ⋯ k_fn!)] δ_M ε_fM^s, where ε_fM is defined in (47). Note that, if ε_fM < 1, then V^*(x) − V̂^*(x) can be made very small. ∎
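The weight map T_1 is a batch least-squares (Galerkin-type) projection: the normal-equation structure ⟨θ, θ⟩^{-1}⟨·, θ⟩ is an ordinary least-squares fit over sampled states. Purely to illustrate that structure, the Python sketch below fits value-function NN weights over a sampled region; the quadratic basis, the sample set, and the target ρ are invented placeholders, and the sign convention of T_1 is omitted for brevity.

```python
import numpy as np

# Hypothetical quadratic basis sigma(x) for a 2-D state (illustrative only).
def sigma(x):
    x1, x2 = x
    return np.array([x1**2, x1 * x2, x2**2])

# Sample states on two rings around the origin (a stand-in for the
# training region used in the offline least-squares step).
X = [r * np.array([np.cos(t), np.sin(t)])
     for t in np.linspace(0.0, 2.0 * np.pi, 20) for r in (0.5, 1.0)]
Theta = np.stack([sigma(x) for x in X])       # regressor, one row per sample

# rho(x): placeholder target playing the role of the Hamiltonian data
# Q(x) + u'Ru - gamma^2 w'Pw evaluated at the samples.
rho = np.array([x @ x for x in X])

# Batch least squares: W = <Theta,Theta>^{-1} <rho,Theta>; lstsq solves the
# same normal equations without forming the inverse explicitly.
W, *_ = np.linalg.lstsq(Theta, rho, rcond=None)

V_hat = lambda x: W @ sigma(x)                # NN value-function estimate
print(V_hat(np.array([0.3, -0.2])))           # ~0.13 = 0.3^2 + 0.2^2
```

Because the placeholder target lies exactly in the span of the basis, the fit is exact here; in the proof above, the same construction is what makes Ŵ_L and W_L the identical function T_1 of (x, f̂) and (x, f), respectively.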

APPENDIX B
Linear DT Systems

It will be shown that the DARE for linear systems can be obtained from the proposed AHJI equation. Consider the DT linear system (B1) and the associated available storage function (B2)

x_{k+1} = A x_k + B u_k + C w_k   (B1)
V(∗) = x_k^T Γ x_k   (B2)

where x_k ∈ R^n is the state vector evaluated at step k; A ∈ R^{n×n}, B ∈ R^{n×m}, and C ∈ R^{n×M} are constant matrices; u_k ∈ R^m is the control input; w_k ∈ R^M is the disturbance; and Γ is a positive definite constant matrix.

Lemma 5: If V(∗) = x_k^T Γ x_k is the available storage function for system (B1), then Γ^T = Γ.

Proof: This can be shown by taking the transpose of both sides of the linear HJI DARE given hereinafter [26], [27], knowing that Q, R, and P are symmetric and that, for any invertible matrix X, (X^T)^{-1} = (X^{-1})^T. Then

Γ = A^T Γ A + Q − [A^T Γ B  A^T Γ C] [R + B^T Γ B, B^T Γ C; C^T Γ B, C^T Γ C − γ^2 P]^{-1} [B^T Γ A; C^T Γ A].   (B3)

Assuming that the Riccati equation (B3) has a unique solution yields Γ^T = Γ. ∎

Equation (B2) reveals that ∇V^* = 2Γ x_k and ∇^2 V^* = 2Γ. Thus, by using Lemma 5, the AHJI equation (13) becomes

(A x_k + B u_k^* + C w_k^*)^T Γ (A x_k + B u_k^* + C w_k^*) + Q(x_k) + u_k^{*T} R u_k^* − γ^2 w_k^{*T} P w_k^* = 0.   (B4)

The optimal policies (29) and (30) are now rewritten as

w_k^* = [C^T Γ C − γ^2 P − C^T Γ B (B^T Γ B + R)^{-1} B^T Γ C]^{-1} [−C^T Γ A + C^T Γ B (B^T Γ B + R)^{-1} B^T Γ A] x_k

and

u_k^* = [B^T Γ B + R − B^T Γ C (C^T Γ C − γ^2 P)^{-1} C^T Γ B]^{-1} [−B^T Γ A + B^T Γ C (C^T Γ C − γ^2 P)^{-1} C^T Γ A] x_k.

Next, we observe that the AHJI equation (B4) is equivalent to the HJI DARE (B3). This can be shown by following an approach similar to [26], where the matrices R and P are not restricted to be identity matrices. Define

K = [C^T Γ C − γ^2 P − C^T Γ B (B^T Γ B + R)^{-1} B^T Γ C]^{-1} [−C^T Γ A + C^T Γ B (B^T Γ B + R)^{-1} B^T Γ A]

L = [B^T Γ B + R − B^T Γ C (C^T Γ C − γ^2 P)^{-1} C^T Γ B]^{-1} [−B^T Γ A + B^T Γ C (C^T Γ C − γ^2 P)^{-1} C^T Γ A]

so that w_k^* = K x_k and u_k^* = L x_k.
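The linear specialization is easy to exercise numerically. The Python sketch below uses illustrative placeholder data (not the paper's example), iterates (B3) as a fixed point Γ ← RHS(Γ), confirms the symmetry asserted in Lemma 5, and forms the policy gains above; convergence of this plain iteration is assumed for the chosen γ^2 and is confirmed only by the fixed-point and stability checks at the end.

```python
import numpy as np

# Illustrative linear DT game data (placeholders, not the paper's example).
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
B = np.array([[0.0],
              [1.0]])
C = np.array([[0.1],
              [0.0]])
Q, R, P = np.eye(2), np.eye(1), np.eye(1)
g2 = 9.0                   # gamma^2, assumed well above the attenuation level

def dare_rhs(G):
    """Right-hand side of the HJI DARE (B3) at the current iterate Gamma = G."""
    S = np.block([[R + B.T @ G @ B, B.T @ G @ C],
                  [C.T @ G @ B,     C.T @ G @ C - g2 * P]])
    T = np.vstack([B.T @ G @ A, C.T @ G @ A])
    U = np.hstack([A.T @ G @ B, A.T @ G @ C])
    return A.T @ G @ A + Q - U @ np.linalg.solve(S, T)

G = np.eye(2)                            # initial guess for Gamma
for _ in range(1000):
    G = dare_rhs(G)
assert np.allclose(G, dare_rhs(G))       # a fixed point of (B3) was reached
assert np.allclose(G, G.T)               # Gamma' = Gamma, as in Lemma 5

# Gains of the rewritten policies: u*_k = L x_k and w*_k = K x_k.
Auu, Aww = B.T @ G @ B + R, C.T @ G @ C - g2 * P
L = -np.linalg.solve(
    Auu - B.T @ G @ C @ np.linalg.solve(Aww, C.T @ G @ B),
    B.T @ G @ A - B.T @ G @ C @ np.linalg.solve(Aww, C.T @ G @ A))
K = -np.linalg.solve(
    Aww - C.T @ G @ B @ np.linalg.solve(Auu, B.T @ G @ C),
    C.T @ G @ A - C.T @ G @ B @ np.linalg.solve(Auu, B.T @ G @ A))

Acl = A + B @ L + C @ K                  # closed loop under both policies
assert np.max(np.abs(np.linalg.eigvals(Acl))) < 1.0
```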

Next, define

D_11 = B^T Γ B + R − B^T Γ C (C^T Γ C − γ^2 P)^{-1} C^T Γ B
D_22 = C^T Γ C − γ^2 P − C^T Γ B (B^T Γ B + R)^{-1} B^T Γ C

and A_11 = B^T Γ B + R, A_22 = C^T Γ C − γ^2 P, A_12 = B^T Γ C, and A_21 = C^T Γ B. Note that D_11 = A_11 − A_12 A_22^{-1} A_21 and D_22 = A_22 − A_21 A_11^{-1} A_12 are the Schur complements associated with A_11 and A_22, respectively. Thus [24]

[A_11, A_12; A_21, A_22]^{-1} = [D_11^{-1}, −A_11^{-1} A_12 D_22^{-1}; −A_22^{-1} A_21 D_11^{-1}, D_22^{-1}].

Also, note that D_11^T = D_11, D_22^T = D_22, A_12^T = A_21, A_22^T = A_22, and A_11^T = A_11. Consequently, K and L can be calculated as

[L; K] = −[D_11^{-1}, −D_11^{-1} A_12 A_22^{-1}; −D_22^{-1} A_21 A_11^{-1}, D_22^{-1}] [B^T Γ A; C^T Γ A] = −[A_11, A_12; A_21, A_22]^{-1} [B^T Γ A; C^T Γ A].   (B5)
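The block-inversion identity quoted from [24] can be verified numerically for generic invertible blocks. The sketch below uses random placeholder blocks, shifted along the diagonal so that every required inverse exists in this illustration, and checks the Schur-complement form against a direct inverse:

```python
import numpy as np

rng = np.random.default_rng(1)

# Random placeholder blocks; the +/-5I shifts keep A11, A22, and both
# Schur complements comfortably invertible for this illustration.
A11 = rng.standard_normal((2, 2)) + 5 * np.eye(2)
A22 = rng.standard_normal((3, 3)) - 5 * np.eye(3)
A12 = rng.standard_normal((2, 3))
A21 = rng.standard_normal((3, 2))

M = np.block([[A11, A12],
              [A21, A22]])
D11 = A11 - A12 @ np.linalg.solve(A22, A21)   # Schur complement for A11
D22 = A22 - A21 @ np.linalg.solve(A11, A12)   # Schur complement for A22

# Block inverse in the Schur-complement form used in (B5).
Minv = np.block([
    [np.linalg.inv(D11),
     -np.linalg.inv(A11) @ A12 @ np.linalg.inv(D22)],
    [-np.linalg.inv(A22) @ A21 @ np.linalg.inv(D11),
     np.linalg.inv(D22)]])

assert np.allclose(Minv, np.linalg.inv(M))
```

In practice, (B5) would be implemented as a single block solve, −solve(M, [B^T Γ A; C^T Γ A]), rather than by forming any of these inverses explicitly.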


By using (B5), (B4) may be rewritten as

x_k^T [−Γ + Q + A^T Γ A + A^T Γ B L + A^T Γ C K + L^T B^T Γ A + L^T B^T Γ B L + L^T B^T Γ C K + K^T C^T Γ A + K^T C^T Γ B L + K^T C^T Γ C K + L^T R L − γ^2 K^T P K] x_k = 0

or, equivalently, as

Γ = Q + A^T Γ A + A^T Γ B L + A^T Γ C K + L^T B^T Γ A + K^T C^T Γ A − [L^T  K^T] [R + B^T Γ B, B^T Γ C; C^T Γ B, C^T Γ C − γ^2 P] [R + B^T Γ B, B^T Γ C; C^T Γ B, C^T Γ C − γ^2 P]^{-1} [B^T Γ A; C^T Γ A]
= Q + A^T Γ A + A^T Γ B L + A^T Γ C K
= Q + A^T Γ A + [A^T Γ B  A^T Γ C] [L; K].

Thus, by using (B5), one can obtain

Γ = Q + A^T Γ A − [A^T Γ B  A^T Γ C] [R + B^T Γ B, B^T Γ C; C^T Γ B, C^T Γ C − γ^2 P]^{-1} [B^T Γ A; C^T Γ A]

which is exactly the HJI DARE (B3). Hence, in the linear case, the proposed AHJI equation reduces to the standard HJI DARE.
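As a final numerical cross-check (again with placeholder data of our own, under the assumption that the fixed-point iteration converges), one can confirm that a solution of (B3) indeed satisfies the closed-loop identity derived above, Γ = Q + A^T Γ A + [A^T Γ B  A^T Γ C][L; K]:

```python
import numpy as np

# Placeholder data; any setup with gamma^2 above the attenuation level works.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
C = np.array([[0.1], [0.0]])
Q, R, P, g2 = np.eye(2), np.eye(1), np.eye(1), 9.0

def blocks(G):
    """The block matrix S, stacked vector T, and row block U of (B3)."""
    S = np.block([[R + B.T @ G @ B, B.T @ G @ C],
                  [C.T @ G @ B,     C.T @ G @ C - g2 * P]])
    T = np.vstack([B.T @ G @ A, C.T @ G @ A])
    U = np.hstack([A.T @ G @ B, A.T @ G @ C])
    return S, T, U

G = np.eye(2)
for _ in range(1000):                    # fixed-point iteration on (B3)
    S, T, U = blocks(G)
    G = A.T @ G @ A + Q - U @ np.linalg.solve(S, T)

S, T, U = blocks(G)
LK = -np.linalg.solve(S, T)              # stacked gains [L; K], as in (B5)

# Closed-loop identity: Gamma = Q + A'GA + [A'GB  A'GC][L; K].
assert np.allclose(G, Q + A.T @ G @ A + U @ LK)
```

Since [L; K] = −S^{-1} T, the assert is precisely the statement that (B4), evaluated along the optimal policies, collapses back to the DARE (B3).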

REFERENCES

[1] S. Jagannathan, Neural Network Control of Nonlinear Discrete-Time Systems. Boca Raton, FL: CRC Press, 2006.
[2] W. Lin and C. I. Byrnes, "H∞ control of discrete-time nonlinear systems," IEEE Trans. Autom. Control, vol. 41, no. 4, pp. 494–510, Apr. 1996.
[3] J. C. Doyle, K. Glover, P. Khargonekar, and B. Francis, "State-space solutions to standard H2 and H∞ control problems," IEEE Trans. Autom. Control, vol. 34, no. 8, pp. 831–847, Aug. 1989.
[4] T. Basar and P. Bernard, H∞ Optimal Control and Related Minimax Design Problems. Cambridge, MA: Birkhäuser, 1995.
[5] J. C. Willems, "Dissipative dynamical systems, Part I: General theory," Arch. Rational Mech. Anal., vol. 45, no. 1, pp. 321–351, Jan. 1972.
[6] A. J. van der Schaft, "L2-gain analysis of nonlinear systems and nonlinear state feedback H∞ control," IEEE Trans. Autom. Control, vol. 37, no. 6, pp. 770–784, Jun. 1992.
[7] J. Huang and C. F. Lin, "Numerical approach to computing nonlinear H∞ control laws," J. Guid. Control Dyn., vol. 18, no. 5, pp. 989–994, Sep./Oct. 1995.
[8] J. Huang, "An algorithm to solve the discrete HJI equation arising in the L2 gain optimization problem," Int. J. Control, vol. 72, no. 1, pp. 49–57, 1999.
[9] M. Abu-Khalaf, J. Huang, and F. Lewis, Nonlinear H2/H∞ Constrained Feedback Control. London, U.K.: Springer-Verlag, 2006.
[10] R. W. Beard and T. W. McLain, "Successive Galerkin approximation algorithms for nonlinear optimal and robust control," Int. J. Control, vol. 71, no. 5, pp. 717–743, 1998.
[11] M. Abu-Khalaf, F. L. Lewis, and J. Huang, "Neural network H∞ state feedback control with actuator saturation: The nonlinear benchmark problem," in Proc. ICC, 2005, pp. 1–9.
[12] B. Anderson, Y. Feng, and W. Chen, "A game theoretic algorithm to solve Riccati and Hamilton–Jacobi–Bellman–Isaacs (HJBI) equations in H infinity control," in Optimization and Optimal Control: Theory and Applications. New York: Springer-Verlag, pp. 277–308.
[13] M. Johnson, S. Bhasin, and W. E. Dixon, "Nonlinear two-player zero-sum game approximate solution using a policy iteration algorithm," in Proc. IEEE Conf. Decision Control, Orlando, FL, 2011, pp. 142–147.
[14] K. G. Vamvoudakis and F. L. Lewis, "Online solution of nonlinear two-player zero-sum games using synchronous policy iteration," in Proc. 49th IEEE Conf. Decision Control, 2010, pp. 3040–3047.
[15] Z. Chen and S. Jagannathan, "Generalized Hamilton–Jacobi–Bellman formulation-based neural network control of affine nonlinear discrete-time systems," IEEE Trans. Neural Netw., vol. 19, no. 1, pp. 90–106, Jan. 2008.
[16] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Discrete-time nonlinear HJB solution using approximate dynamic programming," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4, pp. 943–949, Aug. 2008.
[17] H. Zhang, Y. Luo, and D. Liu, "Neural network based near optimal control for a class of discrete-time affine nonlinear systems with control constraints," IEEE Trans. Neural Netw., vol. 20, no. 9, pp. 1490–1503, Sep. 2009.
[18] H. G. Zhang, Q. L. Wei, and Y. H. Luo, "A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4, pp. 937–942, Aug. 2008.
[19] A. Al-Tamimi, M. Abu-Khalaf, and F. L. Lewis, "Model-free Q-learning designs for discrete-time zero-sum games with application to H∞ control," Automatica, vol. 43, no. 3, pp. 473–481, Mar. 2007.
[20] C. A. Desoer and M. Vidyasagar, Feedback Systems: Input–Output Properties. New York: Academic, 1975.
[21] J. A. Ball and J. W. Helton, "Nonlinear H∞ control theory for stable plants," Math. Control Signals Syst., vol. 5, pp. 233–261, 1992.
[22] G. Bianchini, R. Genesio, A. Parenti, and A. Tesi, "Global H∞ controllers for a class of nonlinear systems," IEEE Trans. Autom. Control, vol. 49, no. 2, pp. 244–249, Feb. 2004.
[23] M. Abu-Khalaf, F. L. Lewis, and J. Huang, "Policy iterations on the Hamilton–Jacobi–Isaacs equation for H∞ state feedback control with input saturation," IEEE Trans. Autom. Control, vol. 51, no. 12, pp. 1989–1995, Dec. 2006.
[24] F. L. Lewis and V. L. Syrmos, Optimal Control. New York: Wiley, 1995.
[25] B. A. Finlayson, The Method of Weighted Residuals and Variational Principles. New York: Academic, 1972.
[26] A. A. Al-Tamimi, "Discrete-time control algorithms and adaptive intelligent system designs," Ph.D. dissertation, Univ. Texas, Arlington, TX, 2007.
[27] A. Al-Tamimi, D. Vrabie, M. Abu-Khalaf, and F. L. Lewis, "Model-free approximate dynamic programming schemes for linear systems," in Proc. IJCNN, Aug. 2007, pp. 371–378.
[28] F. L. Lewis, S. Jagannathan, and A. Yesildirek, Neural Network Control of Robot Manipulators and Nonlinear Systems. New York: Taylor & Francis, 1999.

Shahab Mehraeen (S’08–M’10) received the B.S. degree in electrical engineering from Iran University of Science and Technology, Tehran, Iran, in 1995, the M.S. degree in electrical engineering from Esfahan University of Technology, Esfahan, Iran, in 2001, and the Ph.D. degree in electrical engineering from Missouri University of Science and Technology, Rolla, in 2009. Since 2010, he has been an Assistant Professor with Louisiana State University, Baton Rouge. His research interests include renewable energies, power system control, decentralized control of large-scale interconnected systems, and nonlinear, adaptive, and optimal controllers for dynamic systems.

Travis Dierks (S’07–M’10) received the B.S. and M.S. degrees in electrical engineering and the Ph.D. degree from the Missouri University of Science and Technology (formerly the University of Missouri, Rolla), Rolla, in 2005, 2007, and 2009, respectively. While completing his doctoral degree, he was a GAANN Fellow and a Chancellor’s Fellow. He is currently with DRS Sustainment Systems, Inc., St. Louis, MO, and an Adjunct Assistant Professor with the University of Missouri. His research interests include nonlinear control, optimal control, neural network control, and the control and coordination of autonomous ground and aerial vehicles.


S. Jagannathan (M'89–SM'99) received the B.S. degree in electrical engineering from the College of Engineering, Guindy, Anna University, Madras, India, in 1987, the M.Sc. degree in electrical engineering from the University of Saskatchewan, Saskatoon, SK, Canada, in 1989, and the Ph.D. degree in electrical engineering from The University of Texas in 1994. During 1986 to 1987, he was a Junior Engineer with Engineers India Ltd., New Delhi, India; a Research Associate and an Instructor from 1990 to 1991 with the University of Manitoba, Winnipeg, MB, Canada; and a Consultant during 1994 to 1998 with the Systems and Controls Research Division, Caterpillar Inc., Peoria, IL. During 1998 to 2001, he was with The University of Texas, San Antonio, and since September 2001, he has been with Missouri University of Science and Technology, Rolla, where he is currently a Rutledge-Emerson Distinguished Professor and the Site Director for the National Science Foundation (NSF) Industry/University Cooperative Research Center on Intelligent Maintenance Systems. He has coauthored over 78 peer-reviewed journal articles, 150 IEEE conference articles, several book chapters, and three books: "Neural network control of robot manipulators and nonlinear systems" (Taylor & Francis, 1999), "Discrete-time neural network control of nonlinear discrete-time systems" (CRC Press, 2006), and "Wireless Ad Hoc and Sensor Networks: Performance, Protocols and Control" (CRC Press, 2007). He is the holder of 17 patents with several pending. His research interests include adaptive and neural network control, computer/communication/sensor networks, prognostics, and autonomous systems/robotics. Dr. Jagannathan was the recipient of the NSF Career Award in 2000, the Caterpillar Research Excellence Award in 2001, the Boeing Pride Achievement Award in 2007, and many others. He served as an Associate Editor for the IEEE Transactions on Neural Networks, the IEEE Transactions on Control Systems Technology, and the IEEE Systems Journal. He served on a number of IEEE Conference Committees.


Mariesa L. Crow (M’80–SM’94–F’10) received the B.S.E. degree from the University of Michigan, Ann Arbor, and the Ph.D. degree from the University of Illinois, Urbana. She is currently the Director of the Energy Research and Development Center and the F. Finley Distinguished Professor of electrical engineering with Missouri University of Science and Technology, Rolla. Her research interests include developing computational methods for dynamic security assessment and the application of power electronics in bulk power systems.
