
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 25, NO. 5, MAY 2014

Robust Adaptive Dynamic Programming and Feedback Stabilization of Nonlinear Systems

Yu Jiang, Student Member, IEEE, and Zhong-Ping Jiang, Fellow, IEEE

Abstract— This paper studies the robust optimal control design for a class of uncertain nonlinear systems from the perspective of robust adaptive dynamic programming (RADP). The objective is to fill a gap in the past literature of adaptive dynamic programming (ADP), where dynamic uncertainties, or unmodeled dynamics, are not addressed. A key strategy is to integrate tools from modern nonlinear control theory, such as the robust redesign and backstepping techniques as well as the nonlinear small-gain theorem, with the theory of ADP. The proposed RADP methodology can be viewed as an extension of ADP to uncertain nonlinear systems. Practical learning algorithms are developed in this paper and applied to the controller design problems for a jet engine and a one-machine power system.

Index Terms— Adaptive dynamic programming (ADP), nonlinear uncertain systems, robust optimal control.

I. INTRODUCTION

A. Background

REINFORCEMENT learning (RL) [38] is an important branch of machine learning theory. It is concerned with how an agent should modify its actions, based on the reward from its reactive, unknown environment, so as to achieve a long-term goal. Originally observed in biological learning behavior, RL was brought to the computer science and control science literature as a way to study artificial intelligence in the 1960s [28], [29], [45]. Since then, numerous contributions to RL have been made; see [4], [39], and [47]. Meanwhile, dynamic programming (DP) [6] is widely used for solving optimal control problems. In [9], an iterative technique called policy iteration (PI) was devised by Howard for Markov decision processes. Computing the optimal solution through successive approximations, PI is closely related to learning methods. Werbos [48] pointed out that PI can be employed to perform RL. Since then, many real-time RL methods for finding online optimal control policies have emerged; they are broadly called approximate/adaptive DP (ADP) [24], [26], [46], [49]–[52] or neuro-DP [7].

So far, various ADP-based algorithms have been applied to compute optimal control policies for uncertain systems. In [50], the action-dependent heuristic DP (or Q-learning [47])

Manuscript received November 5, 2012; revised August 12, 2013; accepted December 8, 2013. Date of publication December 13, 2013; date of current version April 10, 2014. This work was supported by the National Science Foundation under Grant DMS-0906659, Grant ECCS-1101401, and Grant ECCS-1230040. The authors are with the Department of Electrical and Computer Engineering, Polytechnic School of Engineering of New York University, Brooklyn, NY 11201 USA (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2013.2294968

was developed for discrete-time systems. This methodology has been widely extended and applied to many different areas [2], [25], [42], [54]. In the continuous-time setting, discretization-based methods were proposed in [3] and [8]. Alternative algorithms were developed in [31], where the state derivatives must be used. Exact methods (i.e., without discretization) for continuous-time linear systems can be found in [44], where the state derivatives and the system matrix A are not assumed to be known, but precise knowledge of the input matrix B is required. This assumption on B was further relaxed in [11]. In [43], an ADP strategy was studied for nonlinear continuous-time systems with partially unknown dynamics, employing Galerkin's method [5]. In the presence of a disturbance input, ADP methods and game-theoretic approaches have been brought together to solve H∞ robust control problems for uncertain systems [1], [2].

In the past literature of ADP, it is commonly assumed that the system order is known and that the state variables are either fully available or reconstructible from the output [25]. However, the system order may be unknown due to the presence of dynamic uncertainties (or unmodeled dynamics), which arise in engineering applications where an exact mathematical model of a physical system is not easy to obtain. Dynamic uncertainties also make sense in the mathematical modeling of other branches of science, such as biology and economics. This problem, often formulated in the context of robust control theory, cannot be viewed as a special case of output feedback control. In addition, the ADP methods developed in the past literature may fail to guarantee not only optimality but even the stability of the closed-loop system when dynamic uncertainty occurs. Werbos [53] also pointed out the related issue that the performance of learning may deteriorate when incomplete data are used in ADP.
To fill the above-mentioned gap in the past literature of ADP, we recently developed a new theory of robust ADP (RADP) [12]–[15], which can be viewed as an extension of ADP to linear and partially linear systems [40], [41] with dynamic uncertainties.

B. Contributions

The primary objective of this paper is to study RADP designs for genuinely nonlinear systems in the presence of dynamic uncertainties. We first decompose the open-loop system into two parts: 1) the system model (ideal environment), with known system order and fully accessible state, and 2) the dynamic uncertainty, with unknown system order, dynamics, and unmeasured states, interacting with the ideal environment.

2162-237X © 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


To handle the dynamic interaction between the two systems, we resort to the gain assignment idea [18], [33]. More specifically, we need to assign a suitable gain to the system model with disturbance, in the sense of Sontag's input-to-state stability (ISS) [36], [37] (see the Appendix). The backstepping, robust redesign, and small-gain techniques of modern nonlinear control theory are incorporated into the RADP framework, such that the system model is made ISS with an arbitrarily small gain. To perform stability analysis for the interconnected systems, we apply the nonlinear small-gain theorem [18], which has proved to be an efficient tool for nonlinear system analysis and synthesis. As a consequence, the proposed RADP method can be viewed as a nonlinear variant of [11]. Moreover, it solves the semiglobal stabilization problem [40], in the sense that the domain of attraction of the closed-loop system can be made as large as desired.

The remainder of the paper is organized as follows. Section II reviews the online PI technique for affine nonlinear systems. Section III studies the methodology of robust optimal design and gives a practical algorithm. Section IV extends the RADP theory to nonlinear systems with unmatched dynamic uncertainties. Two numerical examples, namely the controller designs for a jet engine and for a synchronous power generator, are provided in Section V. Finally, concluding remarks are given in Section VI.

Throughout this paper, we use R to denote the set of real numbers. Vertical bars | · | represent the Euclidean norm for vectors, or the induced matrix norm for matrices. For any piecewise continuous function u, ‖u‖ denotes sup{|u(t)|, t ≥ 0}. A function γ : R+ → R+ is said to be of class K if it is continuous, strictly increasing, and satisfies γ(0) = 0. It is of class K∞ if, additionally, γ(s) → ∞ as s → ∞. A function β : R+ × R+ → R+ is of class KL if β(·, t) is of class K for every fixed t ≥ 0 and β(s, t) → 0 as t → ∞ for each fixed s ≥ 0.
A control law is also called a policy, and it is said to be stabilizing if, under the policy, the closed-loop system is asymptotically stable at the origin in the sense of Lyapunov [20]. The notation γ1 > γ2 means γ1(s) > γ2(s), ∀s > 0, while γ1 ∘ γ2 denotes the composition of the two functions, i.e., γ1 ∘ γ2(s) = γ1(γ2(s)) for all s ≥ 0.

II. PRELIMINARIES

In this section, we review a PI technique for solving optimal control problems [34]. To begin with, consider the system

    ẋ = f(x) + g(x)u    (1)

where x ∈ R^n is the system state, u ∈ R is the control input, and f, g : R^n → R^n are locally Lipschitz functions. The cost to be minimized, associated with (1), is defined as

    J(x_0, u) = ∫_0^∞ [Q(x) + r u²] dt,  x(0) = x_0    (2)

where Q(·) is a positive definite function and r > 0 is a constant. In addition, assume there exists an admissible control policy u = u_0(x), in the sense that, under this policy, the system (1) is globally asymptotically stable and the cost (2) is finite. By [23], the control policy that minimizes the cost (2) can be solved from the following Hamilton–Jacobi–Bellman equation:

    0 = ∇V(x)^T f(x) + Q(x) − (1/4r)[∇V(x)^T g(x)]²    (3)

with the boundary condition V(0) = 0. Indeed, if the solution V*(x) of (3) exists, the optimal control policy is given by

    u*(x) = −(1/2r) g(x)^T ∇V*(x).    (4)

In general, the analytical solution of (3) is difficult to obtain. However, if V*(x) exists, it can be approximated using the PI technique [34].
1) Find an admissible control policy u_0(x).
2) For any integer i ≥ 0, solve for V_i(x), with V_i(0) = 0, from

    0 = ∇V_i(x)^T [f(x) + g(x)u_i(x)] + Q(x) + r u_i²(x).    (5)

3) Update the control policy by

    u_{i+1}(x) = −(1/2r) g(x)^T ∇V_i(x).    (6)

Convergence of the PI (5) and (6) is concluded in the following theorem, whose proof follows the same lines of reasoning as the proof of [34, Th. 4].

Theorem 2.1: Consider V_i(x) and u_{i+1}(x) defined in (5) and (6). Then, for all i = 0, 1, ...

    0 ≤ V_{i+1}(x) ≤ V_i(x)  ∀x ∈ R^n    (7)

and u_{i+1}(x) is admissible. In addition, if the solution V*(x) of (3) exists, then for each fixed x, {V_i(x)}_{i=0}^∞ and {u_i(x)}_{i=0}^∞ converge pointwise to V*(x) and u*(x), respectively.

III. ONLINE LEARNING VIA RADP

In this section, we develop the RADP methodology for nonlinear systems of the form

    ẇ = Δ_w(w, x)    (8)
    ẋ = f(x) + g(x)[u + Δ(w, x)]    (9)

where x ∈ R^n is the measured component of the state available for feedback control, w ∈ R^p is the unmeasurable part of the state with unknown order p, u ∈ R is the control input, Δ_w : R^p × R^n → R^p and Δ : R^p × R^n → R are unknown locally Lipschitz functions, and f and g are defined as in (1) but are assumed to be unknown. Our design objective is to find online a control policy that stabilizes the system at the origin. In addition, in the absence of the dynamic uncertainty (i.e., Δ = 0 and the w-subsystem is absent), the control policy should reduce to the optimal control policy that minimizes (2).
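The PI steps (5) and (6) can be made concrete on a scalar example. The following minimal sketch (our own illustration, not code from the paper) runs the iteration on ẋ = x + u with Q(x) = x² and r = 1, where V_i(x) = p_i x² and u_i(x) = k_i x, so that (5) reduces to an algebraic equation for p_i and (6) to k_{i+1} = −p_i:

```python
# Policy iteration (5)-(6) on the scalar system xdot = x + u,
# cost J = integral of (x^2 + u^2) dt.  With V_i(x) = p_i x^2, u_i(x) = k_i x:
#   (5): 2 p_i (1 + k_i) + 1 + k_i^2 = 0  ->  p_i = -(1 + k_i^2) / (2 (1 + k_i))
#   (6): u_{i+1}(x) = -(1/2r) g dV_i/dx   ->  k_{i+1} = -p_i
# The exact optimal value is p* = 1 + sqrt(2) (scalar Riccati equation).

def policy_iteration(k0=-2.0, n_iter=10):
    """Return the value coefficients p_i, starting from an admissible
    (stabilizing) gain k0, i.e., 1 + k0 < 0."""
    k, ps = k0, []
    for _ in range(n_iter):
        p = -(1.0 + k * k) / (2.0 * (1.0 + k))  # solve (5) for V_i
        ps.append(p)
        k = -p                                   # policy update (6)
    return ps

ps = policy_iteration()
p_star = 1.0 + 2.0 ** 0.5
# Monotone non-increasing values, as guaranteed by (7) in Theorem 2.1
assert all(ps[i + 1] <= ps[i] + 1e-12 for i in range(len(ps) - 1))
assert abs(ps[-1] - p_star) < 1e-9
```

The monotone decrease of p_i mirrors inequality (7), and the limit is the Riccati solution, in line with the pointwise convergence asserted in Theorem 2.1.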

A. Online PI

The iterative technique introduced in Section II relies on the knowledge of both f(x) and g(x). To remove this requirement, we develop a novel online PI technique, which can be viewed as a nonlinear extension of [11].


To begin with, notice that (9) can be rewritten as

    ẋ = f(x) + g(x)u_i(x) + g(x)v_i    (10)

where v_i = u + Δ − u_i. For each i ≥ 0, the time derivative of V_i(x) along the solutions of (10) satisfies

    V̇_i = ∇V_i(x)^T [f(x) + g(x)u_i(x) + g(x)v_i]
        = −Q(x) − r u_i²(x) + ∇V_i(x)^T g(x)v_i
        = −Q(x) − r u_i²(x) − 2r u_{i+1}(x)v_i.    (11)

Integrating both sides of (11) on any time interval [t, t + T], it follows that

    V_i(x(t + T)) − V_i(x(t)) = ∫_t^{t+T} [−Q(x) − r u_i²(x) − 2r u_{i+1}(x)v_i] dτ.    (12)

Notice that, if an admissible control policy u_i(x) is given, the unknown functions V_i(x) and u_{i+1}(x) can be approximated using (12). To be more specific, for any given compact set Ω ⊂ R^n containing the origin as an interior point, let {φ_j(x)}_{j=1}^∞ be an infinite sequence of linearly independent smooth basis functions on Ω, with φ_j(0) = 0 for all j = 1, 2, .... Then, by approximation theory [32], for each i = 0, 1, ..., the cost function and the control policy can be approximated by

    V̂_i(x) = Σ_{j=1}^{N1} ĉ_{i,j} φ_j(x)    (13)
    û_i(x) = Σ_{j=1}^{N2} ŵ_{i,j} φ_j(x)    (14)

where N1 > 0 and N2 > 0 are two sufficiently large integers, and ĉ_{i,j} and ŵ_{i,j} are constant weights to be determined. Replacing V_i(x), u_i(x), and u_{i+1}(x) in (12) with their approximations, we obtain

    Σ_{j=1}^{N1} ĉ_{i,j} [φ_j(x(t_{k+1})) − φ_j(x(t_k))]
        = −∫_{t_k}^{t_{k+1}} 2r Σ_{j=1}^{N2} ŵ_{i,j} φ_j(x) v̂_i dt − ∫_{t_k}^{t_{k+1}} [Q(x) + r û_i²(x)] dt + e_{i,k}    (15)

where û_0 = u_0, v̂_i = u + Δ − û_i, {t_k}_{k=0}^l is a strictly increasing sequence with l > 0 a sufficiently large integer, and e_{i,k} is the approximation error. Then, the weights ĉ_{i,j} and ŵ_{i,j} can be solved in the sense of least squares (i.e., by minimizing Σ_{k=0}^l e_{i,k}²).

Now, starting from u_0(x), two sequences {V̂_i(x)}_{i=0}^∞ and {û_{i+1}(x)}_{i=0}^∞ can be generated via the online PI technique (15). Next, we show the convergence of these sequences to V_i(x) and u_{i+1}(x), respectively.

Assumption 3.1: There exist l_0 > 0 and δ > 0, such that, for all l ≥ l_0, we have

    (1/l) Σ_{k=0}^{l} θ_{i,k} θ_{i,k}^T ≥ δ I_{N1+N2}    (16)

where

    θ_{i,k}^T = [φ_1(x(t_{k+1})) − φ_1(x(t_k)), φ_2(x(t_{k+1})) − φ_2(x(t_k)), ..., φ_{N1}(x(t_{k+1})) − φ_{N1}(x(t_k)),
                 2r ∫_{t_k}^{t_{k+1}} φ_1(x) v̂_i dt, 2r ∫_{t_k}^{t_{k+1}} φ_2(x) v̂_i dt, ..., 2r ∫_{t_k}^{t_{k+1}} φ_{N2}(x) v̂_i dt] ∈ R^{N1+N2}.

Assumption 3.2: The closed-loop system composed of (8), (9), and

    u = u_0(x) + e    (17)

is ISS when e, the exploration noise, is considered as the input.

Remark 3.1: The reason for imposing Assumption 3.2 is twofold. First, as in many other policy-iteration-based ADP algorithms, an initial admissible control policy is needed. Inspired by [55], we further assume that the initial control policy is stabilizing in the presence of dynamic uncertainties. Such an assumption is feasible and realistic by means of the designs in [17] and [33]. Second, by adding the exploration noise, we are able to satisfy Assumption 3.1 and simultaneously keep the system solutions bounded.

Under Assumption 3.2, we can find a compact set Ω_0 that is an invariant set of the closed-loop system composed of (8), (9), and u = u_0(x). In addition, we can also let Ω_0 contain Ω_{i*} as its subset. Then, the compact set for approximation can be selected as Ω = {x : ∃w, such that (w, x) ∈ Ω_0}.

Theorem 3.1: Under Assumptions 3.1 and 3.2, for each i ≥ 0 and any given ε > 0, there exist N1* > 0 and N2* > 0, such that

    |Σ_{j=1}^{N1} ĉ_{i,j} φ_j(x) − V_i(x)| < ε    (18)
    |Σ_{j=1}^{N2} ŵ_{i,j} φ_j(x) − u_{i+1}(x)| < ε    (19)

for all x ∈ Ω, if N1 > N1* and N2 > N2*.

Proof: See the Appendix.

Corollary 3.1: Assume that V*(x) and u*(x) exist. Then, under Assumptions 3.1 and 3.2, for any ε > 0, there exist integers i* > 0, N1** > 0, and N2** > 0, such that

    |Σ_{j=1}^{N1} ĉ_{i*,j} φ_j(x) − V*(x)| ≤ ε    (20)
    |Σ_{j=1}^{N2} ŵ_{i*,j} φ_j(x) − u*(x)| ≤ ε    (21)

for all x ∈ Ω, if N1 > N1** and N2 > N2**.

Proof: By Theorem 2.1, there exists i* > 0, such that

    |V_{i*}(x) − V*(x)| ≤ ε/2    (22)
    |u_{i*+1}(x) − u*(x)| ≤ ε/2  ∀x ∈ Ω.    (23)

By Theorem 3.1, there exist N1** > 0 and N2** > 0, such that, for all N1 > N1** and N2 > N2**,

    |Σ_{j=1}^{N1} ĉ_{i*,j} φ_j(x) − V_{i*}(x)| ≤ ε/2    (24)
    |Σ_{j=1}^{N2} ŵ_{i*,j} φ_j(x) − u_{i*+1}(x)| ≤ ε/2  ∀x ∈ Ω.    (25)

The corollary is thus proved using the triangle inequality.
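To see how the integral relation (15) is used numerically, the following self-contained sketch (our own illustration, not the authors' code) runs the data-driven iteration on the scalar linear system ẋ = x + u, using a single basis function x² for the cost and x for the policy. The simulator uses f and g only to generate data; the learner sees only sampled states and inputs, and the same batch of data is reused at every iteration:

```python
import numpy as np

# Data-driven PI via (15) on xdot = a x + b u (a, b unknown to the learner).
# Behavior policy: u = k0 x + e(t), with exploration noise e (Assumption 3.2).
a, b, q, r = 1.0, 1.0, 1.0, 1.0
k0 = -2.0                      # admissible initial gain
dt, T_int, t_end = 1e-3, 0.05, 5.0

# --- data collection (simulation stands in for the physical plant) ---
n = int(t_end / dt)
x = np.empty(n + 1); x[0] = 0.5
u_app = np.empty(n)
t = np.arange(n) * dt
for i in range(n):
    u_app[i] = k0 * x[i] + 0.2 * np.sin(t[i]) + 0.1 * np.sin(3 * t[i])
    x[i + 1] = x[i] + dt * (a * x[i] + b * u_app[i])

m = int(T_int / dt)            # samples per window [t_k, t_{k+1}]
# --- off-policy iteration on the stored data ---
k = k0
for _ in range(8):
    rows, rhs = [], []
    for s in range(0, n - m, m):
        xs, us = x[s:s + m], u_app[s:s + m]
        v = us - k * xs                       # v_i = u + Delta - u_i (Delta = 0)
        col1 = x[s + m] ** 2 - x[s] ** 2      # phi_1(x(t_{k+1})) - phi_1(x(t_k))
        col2 = 2 * r * np.sum(xs * v) * dt    # integral of 2r phi_1(x) v_i dt
        rows.append([col1, col2])
        rhs.append(-np.sum((q + r * k * k) * xs ** 2) * dt)  # -(Q + r u_i^2) term
    theta, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    k = theta[1]               # new policy gain: u_{i+1}(x) = w x

# Exact LQR answer for comparison: p* = 1 + sqrt(2), k* = -p*
assert abs(k + (1.0 + np.sqrt(2.0))) < 0.1
```

The two regression columns play the roles of the two halves of the vector θ_{i,k}; the exploration noise keeps the least-squares problem well conditioned, in the spirit of Assumption 3.1.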

B. Robust Redesign

In the presence of the dynamic uncertainty, we redesign the approximated optimal control policy so as to achieve robust stability. The proposed method is an integration of optimal control theory [23] with the gain assignment technique [18], [33]. To begin with, let us make the following assumptions.

Assumption 3.3: There exists a function α of class K∞, such that, for i = 0, 1, ...

    α(|x|) ≤ V_i(x)  ∀x ∈ R^n.    (26)

In addition, assume there exists a constant ε > 0 such that Q(x) − ε²|x|² is a positive definite function.

Notice that we can also find a class K∞ function ᾱ, such that, for i = 0, 1, ...

    V_i(x) ≤ ᾱ(|x|)  ∀x ∈ R^n.    (27)

Assumption 3.4: Consider (8). There exist functions λ, λ̄ ∈ K∞, κ_1, κ_2, κ_3 ∈ K, and positive definite functions W and κ_4, such that, for all w ∈ R^p and x ∈ R^n, we have

    λ(|w|) ≤ W(w) ≤ λ̄(|w|)    (28)
    |Δ(w, x)| ≤ max{κ_1(|w|), κ_2(|x|)}    (29)

together with the following implication:

    W(w) ≥ κ_3(|x|) ⇒ ∇W(w)^T Δ_w(w, x) ≤ −κ_4(w).    (30)

Assumption 3.4 implies that the w-subsystem (8) is ISS [36], [37] when x is considered as the input. See the Appendix for a detailed review of the ISS property.

Now, consider the following type of control policy:

    u_ro(x) = [1 + (r/2)ρ²(|x|²)] û_{i*+1}(x)    (31)

where i* > 0 is a sufficiently large integer as defined in Corollary 3.1 and ρ is a smooth, nondecreasing function with ρ(s) > 0 for all s ≥ 0. Notice that u_ro can be viewed as a robust redesign of the approximated optimal control policy û_{i*+1}. As in [17], let us define a class K∞ function γ by

    γ(s) = (1/2) ρ(s²) s  ∀s ≥ 0.    (32)

In addition, define

    e_ro(x) = (r/2) ρ²(|x|²) [û_{i*+1}(x) − u_{i*+1}(x)] + û_{i*+1}(x) − u_{i*+1}(x).    (33)

Fig. 1. Illustration of Algorithm 1. (a) The initial stabilizing control policy is employed, and online information of the state variables of the x-subsystem, the input signal u, and the output of the dynamic uncertainty is used to approximate the optimal cost and the optimal control policy. (b) The exploration noise is terminated after convergence to the optimal control policy is attained. (c) The robust optimal control policy is applied as soon as the system trajectory enters the invariant set Ω_{i*}.

Theorem 3.2: Under Assumptions 3.3 and 3.4, suppose

    γ > max{κ_2, κ_1 ∘ λ⁻¹ ∘ κ_3 ∘ α⁻¹ ∘ ᾱ}    (34)

and the following implication holds for some constant d > 0:

    0 < V_{i*}(x) ≤ d ⇒ |e_ro(x)| < γ(|x|).    (35)

Then, the closed-loop system composed of (8), (9), and (31) is asymptotically stable at the origin. In addition, there exists σ ∈ K∞, such that Ω_{i*} = {(w, x) : max[σ(V_{i*}(x)), W(w)] ≤ σ(d)} is an estimate of the region of attraction of the closed-loop system.

Proof: See the Appendix.

Remark 3.2: In the absence of the dynamic uncertainty (i.e., Δ = 0 and the w-subsystem is absent), the control policy (31) can be replaced by û_{i*+1}(x), which is an approximation of the optimal control policy u*(x) that minimizes the cost

    J(x_0, u) = ∫_0^∞ [Q(x) + r u²] dt,  x(0) = x_0.    (36)
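The effect of the gain-boosting factor in (31) can be seen on a scalar example. In the sketch below (our own toy construction; a bounded external signal d(t) stands in for the uncertainty term g(x)Δ(w, x)), the redesigned policy keeps the trajectory bounded and shrinks the response to the disturbance relative to the unredesigned policy:

```python
import numpy as np

# Robust redesign (31) on xdot = x + u + d(t), with approximate optimal
# policy u*(x) = -p x (p = 1 + sqrt(2)) and redesigned policy
#   u_ro(x) = -(1 + (r/2) * rho(|x|^2)^2) * p * x,  rho(s) = rho_gain * s.
p, r, dt = 1.0 + np.sqrt(2.0), 1.0, 1e-3

def simulate(rho_gain, t_end=30.0, x0=1.0):
    n = int(t_end / dt)
    x, traj = x0, np.empty(n)
    for i in range(n):
        d = 5.0 * np.sin(i * dt)                          # bounded disturbance
        u = -(1.0 + 0.5 * r * (rho_gain * x * x) ** 2) * p * x
        x = x + dt * (x + u + d)
        traj[i] = x
    return traj

tail_nom = np.max(np.abs(simulate(0.0)[-5000:]))   # rho = 0: no redesign
tail_ro  = np.max(np.abs(simulate(1.0)[-5000:]))   # rho(s) = s
assert tail_nom < 10.0 and tail_ro < 10.0          # both remain bounded (ISS)
assert tail_ro < tail_nom                          # redesign shrinks the gain
```

Raising ρ strengthens the state-dependent damping, which is exactly the mechanism by which (31) assigns an arbitrarily small ISS gain to the x-subsystem.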

Remark 3.3: It is of interest to note that the constant d in (35) can be chosen arbitrarily large. Therefore, the proposed control scheme solves the semiglobal stabilization problem [40].

C. RADP Algorithm

The RADP algorithm is given in Algorithm 1, and a graphical illustration is shown in Fig. 1.

IV. RADP WITH UNMATCHED DYNAMIC UNCERTAINTY

In this section, we extend the RADP methodology to nonlinear systems with unmatched dynamic uncertainties. To begin with, consider the system

    ẇ = Δ_w(w, x)    (38)
    ẋ = f(x) + g(x)[z + Δ(w, x)]    (39)
    ż = f_1(x, z) + u + Δ_1(w, x, z)    (40)
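Section IV handles this structure by backstepping: a virtual policy ξ(x) is assigned to the x-subsystem, and the error ζ = z − ξ(x) is then driven to zero. The mechanism can be seen on a toy chain (our own minimal example with hypothetical gains, not the paper's design):

```python
# Backstepping on the toy chain  xdot = x + z,  zdot = u  (no uncertainty):
# virtual policy xi(x) = -2x stabilizes the x-subsystem; with zeta = z - xi(x),
#   zeta_dot = u - xi'(x)*(x + z) = u + 2*(x + z)
# and u = -2*(x + z) - x - 3*zeta cancels the coupling, kills the cross term
# x*zeta, and damps zeta (V = x^2/2 + zeta^2/2 gives Vdot = -x^2 - 3 zeta^2).
dt, n = 1e-3, 20000
x, z = 1.0, -0.5
for i in range(n):
    zeta = z - (-2.0 * x)
    u = -2.0 * (x + z) - x - 3.0 * zeta
    x, z = x + dt * (x + z), z + dt * u
assert abs(x) < 1e-3 and abs(z) < 1e-3
```

In the paper's setting ξ is the learned robust policy u_ro, and the cancellation terms must themselves be approximated from data, which is what the two-phase learning scheme below provides.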


Algorithm 1 RADP Algorithm

where [x^T, z]^T ∈ R^n × R is the measured component of the state available for feedback control; w, u, Δ_w, f, g, and Δ are defined in the same way as in (8) and (9); and f_1 : R^n × R → R and Δ_1 : R^p × R^n × R → R are locally Lipschitz functions that are assumed to be unknown.

Assumption 4.1: There exist class K functions κ_5, κ_6, and κ_7, such that the following inequality holds:

    |Δ_1(w, x, z)| ≤ max{κ_5(|w|), κ_6(|x|), κ_7(|z|)}.    (41)

A. Online Learning

Let us define a virtual control policy ξ = u_ro, where u_ro is the same as in (31). Then, a state transformation can be performed as ζ = z − ξ. Along the trajectories of (39) and (40), it follows that

    ζ̇ = f̄_1(x, z) + u + Δ_1 − ḡ_1(x)Δ    (42)

where f̄_1(x, z) = f_1(x, z) − (∂ξ/∂x)f(x) − (∂ξ/∂x)g(x)z and ḡ_1(x) = (∂ξ/∂x)g(x). By approximation theory [32], f̄_1(x, z) and ḡ_1(x) can be approximated by

    f̂_1(x, z) = Σ_{j=1}^{N3} ŵ_{f,j} ψ_j(x, z)    (43)
    ĝ_1(x) = Σ_{j=0}^{N4−1} ŵ_{g,j} φ_j(x)    (44)

where {ψ_j(x, z)}_{j=1}^∞ is a sequence of linearly independent basis functions on some compact set Ω_1 ⊂ R^{n+1} containing the origin as an interior point, φ_0(x) ≡ 1, and ŵ_{f,j} and ŵ_{g,j} are constant weights to be trained.

As in the matched case, a similar assumption on the initial control policy is given as follows.

Assumption 4.2: The closed-loop system composed of (38)–(40) and

    u = ū_0(x, z) + e    (45)

is ISS when e, the exploration noise, is considered as the input.

Under Assumption 4.2, we can find an invariant set Ω_{1,0} for the closed-loop system composed of (38)–(40) and (45), and approximate the unknown functions on the set

    Ω_1 = {(x, z) : ∃w, s.t. (w, x, z) ∈ Ω_{1,0}}.    (46)

Then, we give the following two-phase learning scheme.

1) Phase-One Learning: Similarly as in (15), to approximate the virtual control input ξ for the x-subsystem, we solve for the weights ĉ_{i,j} and ŵ_{i,j} by the least-squares method from

    Σ_{j=1}^{N1} ĉ_{i,j} [φ_j(x(t_{k+1})) − φ_j(x(t_k))]
        = −∫_{t_k}^{t_{k+1}} 2r Σ_{j=1}^{N2} ŵ_{i,j} φ_j(x) ṽ_i dt − ∫_{t_k}^{t_{k+1}} [Q(x) + r û_i²(x)] dt + ẽ_{i,k}    (47)

where ṽ_i = z + Δ − û_i, {t_k}_{k=0}^l is a strictly increasing sequence with l > l_0 a sufficiently large integer, and ẽ_{i,k} is the approximation error.

2) Phase-Two Learning: To approximate the unknown functions f̄_1 and ḡ_1, the constant weights can be solved, in the sense of least squares, from

    (1/2)[ζ²(t_{k+1}) − ζ²(t_k)]
        = ∫_{t_k}^{t_{k+1}} [Σ_{j=1}^{N3} ŵ_{f,j} ψ_j(x, z) − Σ_{j=0}^{N4−1} ŵ_{g,j} φ_j(x)Δ] ζ dt + ∫_{t_k}^{t_{k+1}} (u + Δ_1)ζ dt + ē_k    (48)

where ē_k denotes the approximation error. Similarly as in the previous section, let us introduce the following assumption.

Assumption 4.3: There exist l_1 > 0 and δ_1 > 0, such that, for all l ≥ l_1, we have

    (1/l) Σ_{k=0}^{l} θ̄_k θ̄_k^T ≥ δ_1 I_{N3+N4}    (49)

where

    θ̄_k^T = [∫_{t_k}^{t_{k+1}} ψ_1(x, z)ζ dt, ∫_{t_k}^{t_{k+1}} ψ_2(x, z)ζ dt, ..., ∫_{t_k}^{t_{k+1}} ψ_{N3}(x, z)ζ dt,
              ∫_{t_k}^{t_{k+1}} φ_0(x)Δζ dt, ∫_{t_k}^{t_{k+1}} φ_1(x)Δζ dt, ..., ∫_{t_k}^{t_{k+1}} φ_{N4−1}(x)Δζ dt] ∈ R^{N3+N4}.
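The integral identification idea behind (48) and (49) can be sketched on a scalar example (our own illustration with hypothetical dynamics, not the paper's model): unknown drift coefficients are recovered from differences of ζ², without ever differentiating the measured data:

```python
import numpy as np

# Integral identification in the spirit of (48): for zeta_dot = f(zeta) + u
# with unknown f(zeta) = w1*zeta + w2*zeta**3, integrate zeta*zeta_dot:
#   (1/2)[zeta^2(t_{k+1}) - zeta^2(t_k)]
#     = integral of (w1*zeta + w2*zeta**3)*zeta dt + integral of u*zeta dt + err
# and solve for (w1, w2) by least squares.  No derivative of zeta is needed.
w_true = np.array([-2.0, -0.5])
dt, T_int, t_end = 1e-3, 0.05, 10.0
n, m = int(t_end / dt), int(0.05 / dt)

z = np.empty(n + 1); z[0] = 1.0
u = np.empty(n)
for i in range(n):
    u[i] = np.sin(i * dt) + 0.5 * np.sin(3 * i * dt)     # exploring input
    z[i + 1] = z[i] + dt * (w_true[0] * z[i] + w_true[1] * z[i] ** 3 + u[i])

rows, rhs = [], []
for s in range(0, n - m, m):
    zs, us = z[s:s + m], u[s:s + m]
    rows.append([np.sum(zs * zs) * dt, np.sum(zs ** 4) * dt])   # regressors
    rhs.append(0.5 * (z[s + m] ** 2 - z[s] ** 2) - np.sum(us * zs) * dt)
w_hat, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
assert np.max(np.abs(w_hat - w_true)) < 0.05
```

The richness condition of Assumption 4.3 corresponds here to the regressor matrix having well-separated columns, which the exploring input provides.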


Algorithm 2 RADP Algorithm

Theorem 4.1: Consider (x(0), z(0)) ∈ Ω_1. Then, under Assumption 4.3, we have

    lim_{N3,N4→∞} f̂_1(x, z) = f̄_1(x, z)    (50)
    lim_{N3,N4→∞} ĝ_1(x) = ḡ_1(x)  ∀(x, z) ∈ Ω_1.    (51)

B. Robust Redesign

Next, we study the robust stabilization of the system (38)–(40). To this end, let κ_8 be a function of class K such that

    κ_8(|x|) ≥ |ξ(x)|  ∀x ∈ R^n.    (52)

Then, Assumption 4.1 implies

    |Δ_1| ≤ max{κ_5(|w|), κ_6(|x|), κ_7(|z|)}
        ≤ max{κ_5(|w|), κ_6(|x|), κ_7(|ζ| + κ_8(|x|))}
        ≤ max{κ_5(|w|), κ_9(|X_1|)}

where X_1 = [x^T, ζ]^T and κ_9(s) = max{κ_6(s), κ_7 ∘ κ_8(2s), κ_7(2s)}, ∀s ≥ 0. In addition, we denote κ̃_1 = max{κ_1, κ_5}, κ̃_2 = max{κ_2, κ_9}, γ_1(s) = (1/2)ρ((1/2)s²)s, and

    U_{i*}(X_1) = V_{i*}(x) + (1/2)ζ².    (53)

Notice that, under Assumptions 3.3 and 3.4, there exist ᾱ_1, α_1 ∈ K∞, such that α_1(|X_1|) ≤ U_{i*}(X_1) ≤ ᾱ_1(|X_1|).

The control policy can be approximated by

    u_ro1 = −f̂_1(x, z) + 2r û_{i*+1}(x) − (1/4)ĝ_1²(x)ρ_1²(|X_1|²)ζ − (1/2)ρ_1²(|X_1|²)ζ − (1/4)ρ²(ζ²)ζ − (ε²/2)ζ − (1/2)ρ²(|x|²)ζ    (54)

where ρ_1(s) = 2ρ((1/2)s). Next, define the approximation error

    e_ro1(X_1) = −[f̄_1(x, z) − f̂_1(x, z)] + 2r[u_{i*+1}(x) − û_{i*+1}(x)] − (1/4)[ḡ_1²(x) − ĝ_1²(x)]ρ_1²(|X_1|²)ζ.    (55)

Then, the conditions for asymptotic stability are summarized in the following theorem.

Theorem 4.2: Under Assumptions 3.3, 3.4, and 4.1, if

    γ_1 > max{κ̃_2, κ̃_1 ∘ λ⁻¹ ∘ κ_3 ∘ α_1⁻¹ ∘ ᾱ_1}    (56)

and the following implication holds for some constant d_1 > 0:

    0 < U_{i*}(X_1) ≤ d_1 ⇒ max{|e_ro1(X_1)|, |e_ro(x)|} < γ_1(|X_1|)

then the closed-loop system composed of (38)–(40) and (54) is asymptotically stable at the origin. In addition, there exists σ_1 ∈ K∞, such that Ω_{1,i*} = {(w, X_1) : max[σ_1(U_{i*}(X_1)), W(w)] ≤ σ_1(d_1)} is an estimate of the region of attraction.

Proof: See the Appendix.

Remark 4.1: In the absence of the dynamic uncertainty (i.e., Δ = 0, Δ_1 = 0, and the w-subsystem is absent), the smooth functions ρ and ρ_1 can all be replaced by zero, and the system becomes

    Ẋ_1 = F_1(X_1) + G_1 u_o1    (57)

where

    F_1(X_1) = [f(x) + g(x)ζ + g(x)ξ ; −∇V_{i*}(x)^T g(x)],  G_1 = [0 ; 1]

and u_o1 = −(ε²/2)ζ. As a result, it can be concluded that the control policy u = u_o1 is an approximate optimal control policy with respect to the cost functional

    J_1(X_1(0), u) = ∫_0^∞ [Q_1(x, ζ) + (1/ε²)u²] dt    (58)

with X_1(0) = [x_0^T, z_0 − u_{i*}(x_0)]^T and Q_1(x, ζ) = Q(x) + (1/4r)[∇V_{i*}(x)^T g(x)]² + (ε²/2)ζ².

Remark 4.2: As in the matched case, semiglobal stabilization is achieved by selecting d_1 in Theorem 4.2 large enough.

C. RADP Algorithm With Unmatched Dynamic Uncertainty

The RADP algorithm with unmatched dynamic uncertainty is given in Algorithm 2.

V. APPLICATIONS

In this section, we apply the proposed online RADP schemes to the design of robust optimal control policies for a jet engine and a one-machine power system.

A. Jet Engine

Consider the following dynamic model of jet engine surge and stall [21], [30]:

    Φ̇ = −Ψ + ψ_C(Φ) − 3RΦ    (60)
    Ψ̇ = (1/β²)(Φ − Φ_T(Ψ))    (61)
    Ṙ = σR(1 − Φ² − R)    (62)

where Φ is the scaled annulus-averaged flow, Ψ is the plenum pressure rise, and R > 0 is the normalized rotating stall amplitude. The functions ψ_C(Φ) and Φ_T(Ψ) are the compressor and throttle characteristics, respectively. According to [30], Φ_T is assumed to take the form

    Φ_T(Ψ) = γ√Ψ − 1    (63)

and ψ_C is assumed to satisfy ψ_C(0) = 1 + ψ_C0, with ψ_C0 a constant describing the shutoff pressure rise. The equilibrium of the system is

    R_e = 0,  Φ_e = 1,  Ψ_e = ψ_C(Φ_e) = ψ_C0 + 2.

Performing the state and input transformations [21]

    φ = Φ − Φ_e    (64)
    ψ = Ψ − Ψ_e    (65)
    u = (1/β²)[φ + 2 − γ√(ψ + ψ_C0 + 2)]    (66)

the system (60)–(62) can be converted to

    Ṙ = −σR² − σR(2φ + φ²),  R(0) ≥ 0    (67)
    φ̇ = ψ_C(φ + 1) − 2 − ψ_C0 − (ψ + 3Rφ + 3R)    (68)
    ψ̇ = u.    (69)

Notice that this system is in the form of (38)–(40) if we choose w = R, x = φ, and z = ψ. The function ψ_C and the constant σ are assumed to be uncertain but to satisfy

    −(1/2)φ³ − 2φ² ≤ ψ_C(φ + 1) − 2 − ψ_C0 ≤ −(1/2)φ³    (70)

and 0.1 < σ < 1. The initial stabilizing policy and the initial virtual control policy are selected as u = 6φ − 2ψ and ψ = 3φ, with V(R, φ, ψ) = R + (1/2)φ² + (1/2)(ψ − 3φ)² a Lyapunov function of the closed-loop system.

For simulation purposes, we set σ = 0.3, β = 0.702, ψ_C0 = 1.7, and ψ_C(φ + 1) = ψ_C0 + 2 − (3/2)φ² − (1/2)φ³ [30]. We set Q(φ) = 4(φ² + φ³ + φ⁴) and r = 1. For the robust redesign, we set ρ(s) = 0.01s. The basis functions are selected to be polynomials of φ and ψ of order less than or equal to four. The exploration noise is set to e = 10 cos(0.1t). The RADP learning starts at the beginning of the simulation and finishes at t = 10 s, when the control policy has been updated after six iterations and the convergence criterion (59) in Algorithm 1 with ε_1 = 10⁻⁶ is satisfied. The approximated cost functions before and after phase-one learning are shown in Fig. 2. The state trajectories of the closed-loop system are shown in Figs. 3–5.

Fig. 2. Approximated cost function.
Fig. 3. Trajectory of the normalized rotating stall amplitude.
Fig. 4. Trajectory of the mass flow.
Fig. 5. Trajectory of the plenum pressure rise.

B. One-Machine Infinite-Bus Power System

The power system considered in this paper is a synchronous generator connected to an infinite bus, as shown in Fig. 6.

Fig. 6. One-machine infinite-bus synchronous generator with speed governor.

A model of the generator with both excitation and power control loops can be written as follows [22]:

    δ̇ = Δω    (71)
    Δω̇ = −(D/2H)Δω + (ω_0/2H)(P_m − P_e)    (72)


Fig. 7. Trajectory of the dynamic uncertainty.
Fig. 8. Trajectory of the deviation of the rotor angle.


Fig. 9. Trajectory of the relative frequency.
Fig. 10. Trajectory of the deviation of the mechanical power.

    Ė′_q = −(1/T′_d)E′_q + ((x_d − x′_d)/(T′_d0 x′_dΣ)) V_s cos δ + (1/T′_d0)E_f    (73)
    Ṗ_m = −(1/T_G)P_m − K_G Δω + u    (74)

where δ, Δω, P_e, P_m, E′_q, and u are the rotor angle, the relative rotor speed, the active power delivered to the infinite bus, the mechanical input power, the EMF in the quadrature axis, and the control input to the system, respectively; x_d, x_T, and x_L are the reactances of the direct axis, the transformer, and the transmission line, respectively; x′_d is the direct axis transient reactance; K_G is the regulation constant; H is the inertia constant; T′_d0 is the direct axis transient short-circuit time constant; and V_s is the voltage of the infinite bus.

Define the following transformations:

    w = (V_s/x′_dΣ)(E′_q − E_q0)    (75)
    x_1 = δ − δ_0    (76)
    x_2 = Δω    (77)
    z = P_m − P_0    (78)

where the constants δ_0, P_0, and E_q0 denote the steady-state values of the rotor angle, the mechanical power input, and the EMF, respectively. The system can then be converted to

    ẇ = −a_1 w + a_2 sin(x_1/2) sin(x_1/2 + a_3)    (79)
    ẋ_1 = x_2    (80)
    ẋ_2 = −b_1 x_2 − b_2 cos(x_1/2 + a_3) sin(x_1/2) + b_3 [z − w sin(x_1 + a_3)]    (81)
    ż = −c_1 z − c_2 x_2 + u    (82)

where a_1 = 1/T′_d, a_2 = ((x_d − x′_d)/T′_d0)(V_s²/(x_dΣ x′_dΣ)), a_3 = δ_0, b_1 = D/(2H), b_2 = (ω_0/H)(V_s/x′_dΣ)E_q0, b_3 = ω_0/(2H), c_1 = 1/T_G, and c_2 = K_G, with x_dΣ = x_T + x_L + x_d and x′_dΣ = x_T + x_L + x′_d.

For simulation purposes, the parameters are specified as follows: D = 5, H = 4, ω_0 = 314.159 rad/s, x_T = 0.127, x_L = 0.4853, x_d = 1.863, x′_d = 0.257, T′_d0 = 0.5 s, δ_0 = 1.2566 rad, V_s = 1 p.u., T_T = 2 s, K_G = 1, K_T = 1, and T_G = 0.2 s. We set Q(x_1, x_2) = 10x_1² + x_2² and r = 1, and we pick ρ(s) = 1 for the robust redesign. The basis functions are selected to be polynomials of x_1, x_2, and z of order less than or equal to four. Suppose the bounds of the parameters are given as 0.5 < a_1 ≤ 1, 0 < a_2 ≤ 1.5, 0 < a_3 ≤ 1, 0.5 < b_1 ≤ 1, 0 < b_2 ≤ 150, 0 < b_3 ≤ 50, 0 < c_1 ≤ 1, and 0 < c_2 ≤ 0.1. Then, we select the initial control policy u = −x_1, with V(w, x_1, x_2, z) = w² + x_1² + x_2² + z² the Lyapunov function of the closed-loop system. The exploration noise is set to e = 0.001 sin(t). The initial virtual control policy is z = −x_1. The algorithm stops after nine iterations, when the stopping criterion in (59) with ε_1 = 0.01 is satisfied. The RADP learning is finished within 2 s. The initial cost function and the cost function obtained from phase-one learning are shown in Fig. 11.

It is worth pointing out that attenuating oscillations of the power frequency is an important issue in power system control. From the simulation results shown in Figs. 7–10,


we see that the postlearning performance of the system is remarkably improved and the oscillation is attenuated.

Fig. 11. Approximated cost function.

VI. CONCLUSION

In this paper, robust and adaptive optimal control design has been studied for nonlinear systems with dynamic uncertainties. Both the matched and the unmatched cases have been studied. We have presented, for the first time, a recursive, online, adaptive optimal controller design in the presence of dynamic uncertainties characterized by ISS systems with unknown order and unknown states/dynamics. We have achieved this goal by integrating ADP theory with tools recently developed within the nonlinear control community. Systematic RADP-based online learning algorithms have been developed to obtain semiglobally stabilizing controllers with optimality properties. The effectiveness of the proposed methodology has been validated by its application to the robust optimal control policy designs for a jet engine and a one-machine power system.

APPENDIX

REVIEW OF ISS AND SMALL-GAIN THEOREM

In the Appendix, some important tools from modern nonlinear control are reviewed; see [10], [16], [18], [20], [27], [35], [36], and the references therein for details. See [19] for more recent developments in nonlinear systems and control. Consider the system

    ẋ = f(x, u)    (83)

where x ∈ R^n is the state, u ∈ R^m is the input, and f : R^n × R^m → R^n is locally Lipschitz.

Definition 1.1 ([35], [36]): The system (83) is said to be ISS with gain γ if, for any measurable, essentially bounded input u and any initial condition x(0), the solution x(t) exists for every t ≥ 0 and satisfies

    |x(t)| ≤ β(|x(0)|, t) + γ(‖u‖)    (84)

where β is of class KL and γ is of class K.

Definition 1.2 ([37]): A continuously differentiable function V is said to be an ISS Lyapunov function for the system (83) if V is positive definite and proper, and satisfies the implication

    |x| ≥ χ(|u|) ⇒ ∇V(x)^T f(x, u) ≤ −κ(|x|)    (85)

where κ is positive definite and χ is of class K.

Next, consider an interconnected system described by

    ẋ_1 = f_1(x_1, x_2, v)    (86)
    ẋ_2 = f_2(x_1, x_2, v)    (87)

where, for i = 1, 2, x_i ∈ R^{ni}, v ∈ R^{nv}, and f_i : R^{n1} × R^{n2} × R^{nv} → R^{ni} is locally Lipschitz.

Assumption 1.1: For each i = 1, 2, there exists an ISS Lyapunov function V_i for the x_i-subsystem such that the following hold.
1) There exist functions α_i, ᾱ_i ∈ K∞, such that

    α_i(|x_i|) ≤ V_i(x_i) ≤ ᾱ_i(|x_i|)  ∀x_i ∈ R^{ni}.    (88)

2) There exist class K functions χ_i and γ_i and class K∞ functions α_i, such that

    ∇V_1(x_1)^T f_1(x_1, x_2, v) ≤ −α_1(V_1(x_1))    (89)

if V_1(x_1) ≥ max{χ_1(V_2(x_2)), γ_1(|v|)}, and

    ∇V_2(x_2)^T f_2(x_1, x_2, v) ≤ −α_2(V_2(x_2))    (90)

if V_2(x_2) ≥ max{χ_2(V_1(x_1)), γ_2(|v|)}.

With the ISS Lyapunov functions, the following theorem gives the small-gain condition under which the ISS of the interconnected system can be established.

Theorem 1.1 ([16]): Under Assumption 1.1, if the following small-gain condition holds:

    χ_1 ∘ χ_2(s) < s  ∀s > 0    (91)

then the interconnected system (86) and (87) is ISS with respect to v as the input.

Under Assumption 1.1 and the small-gain condition (91), let χ̂_1 be a function of class K∞ such that:
1) χ̂_1(s) ≤ χ_1⁻¹(s), ∀s ∈ [0, lim_{s→∞} χ_1(s));
2) χ_2(s) ≤ χ̂_1(s), ∀s ≥ 0.
Then, as shown in [16], there exists a class K∞ function σ(s), continuously differentiable over (0, ∞), that satisfies (dσ/ds)(s) > 0 and χ_2(s) < σ(s) < χ̂_1(s), ∀s > 0. It is also shown in [16] that the function

    V_12(x_1, x_2) = max{σ(V_1(x_1)), V_2(x_2)}    (92)

is positive definite and proper. In addition,

    V̇_12(x_1, x_2) < 0    (93)

holds almost everywhere in the state space whenever

    V_12(x_1, x_2) ≥ η(|v|) > 0    (94)

for some class K∞ function η.

PROOF OF THEOREM 3.1

To begin with, given û_i, let Ṽ_i(x) be the solution of the following equation with Ṽ_i(0) = 0:

    ∇Ṽ_i(x)^T [f(x) + g(x)û_i(x)] + Q(x) + r û_i²(x) = 0    (95)

and denote ũ_{i+1}(x) = −

1 g(x)T ∇ V˜i (x)T . 2r
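The policy-evaluation equation (95) together with the improvement step ũi+1 = −(1/(2r)) gᵀ∇V˜i can be sanity-checked on a scalar linear-quadratic special case, where every iterate has a closed form (Kleinman's algorithm). This is only an illustrative sketch, not the paper's nonlinear RADP algorithm; the system ẋ = ax + bu and the weights q, r below are hypothetical choices.

```python
import math

# Policy iteration for the scalar system dx/dt = a*x + b*u with cost ∫(q x² + r u²)dt.
# With V_i(x) = p_i x² and policy u_i(x) = -k_i x, the evaluation equation (95) reads
#   2 p_i (a - b k_i) x² + (q + r k_i²) x² = 0   =>   p_i = (q + r k_i²) / (2 (b k_i - a)),
# and the improvement step u_{i+1} = -(1/(2r)) g dV_i/dx gives k_{i+1} = (b / r) p_i.

def policy_iteration(a, b, q, r, k0, n_iter=20):
    k = k0
    assert a - b * k < 0, "initial policy must be stabilizing"
    for _ in range(n_iter):
        p = (q + r * k * k) / (2.0 * (b * k - a))  # policy evaluation
        k = (b / r) * p                            # policy improvement
    return k

# Closed-form optimal gain from the scalar algebraic Riccati equation
# 2 a p + q - b² p² / r = 0  =>  k* = (a + sqrt(a² + b² q / r)) / b.
a, b, q, r = -1.0, 1.0, 10.0, 1.0   # hypothetical example data
k_star = (a + math.sqrt(a * a + b * b * q / r)) / b
k_pi = policy_iteration(a, b, q, r, k0=1.0)
```

Starting from any stabilizing gain, the iterates converge monotonically to the optimal gain, mirroring the convergence argument of the proof that follows.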

Lemma 1.1: For each i ≥ 0, we have

lim_{N1,N2→∞} V̂i(x) = V˜i(x),  lim_{N1,N2→∞} ûi+1(x) = ũi+1(x),  ∀x ∈ Ω.

Proof: By definition

V˜i(x(tk+1)) − V˜i(x(tk)) = −∫_{tk}^{tk+1} [Q(x) + r ûi²(x) + 2r ũi+1(x) v̂i] dt.  (96)

Let c̃i,j and w̃i,j be the constant weights such that V˜i(x) = Σ_{j=1}^∞ c̃i,j φj(x) and ũi+1(x) = Σ_{j=1}^∞ w̃i,j φj(x). Then, by (15) and (96), we have ei,k = θi,kᵀ W̄i + ξi,k, where

W̄i = [c̃i,1 c̃i,2 ··· c̃i,N1 w̃i,1 w̃i,2 ··· w̃i,N2]ᵀ − [ĉi,1 ĉi,2 ··· ĉi,N1 ŵi,1 ŵi,2 ··· ŵi,N2]ᵀ
ξi,k = Σ_{j=N1+1}^∞ c̃i,j [φj(x(tk+1)) − φj(x(tk))] + Σ_{j=N2+1}^∞ w̃i,j ∫_{tk}^{tk+1} 2r φj(x) v̂i dt.

Since the weights are found using the least-squares method, we have

Σ_{k=1}^l e²i,k ≤ Σ_{k=1}^l ξ²i,k.

In addition, notice that

Σ_{k=1}^l (θi,kᵀ W̄i)² = Σ_{k=1}^l (ei,k − ξi,k)².

Then, under Assumption 3.1, it follows that

|W̄i|² ≤ (4/(lδ)) Σ_{k=1}^l ξ²i,k ≤ (4/δ) max_{1≤k≤l} ξ²i,k.

Therefore, given any arbitrary ε > 0, we can find N1⁰ > 0 and N2⁰ > 0 such that, when N1 > N1⁰ and N2 > N2⁰, we have

|V̂i(x) − V˜i(x)| ≤ Σ_{j=1}^{N1} |c̃i,j − ĉi,j||φj(x)| + Σ_{j=N1+1}^∞ |c̃i,j φj(x)| ≤ ε/2 + ε/2 = ε  ∀x ∈ Ω.  (97)

Similarly, |ûi+1(x) − ũi+1(x)| ≤ ε. The proof is complete.

We now prove Theorem 3.1 by induction.
1) If i = 0, we have V˜0(x) = V0(x) and ũ1(x) = u1(x). Hence, the convergence can directly be proved by Lemma 1.1.
2) Suppose for some i > 0, we have lim_{N1,N2→∞} V̂i−1(x) = Vi−1(x) and lim_{N1,N2→∞} ûi(x) = ui(x), ∀x ∈ Ω. By definition, we have

|Vi(x(t)) − V˜i(x(t))| ≤ r |∫_t^∞ [ûi(x)² − ui(x)²] dt| + 2r |∫_t^∞ ui+1(x)[ûi(x) − ui(x)] dt| + 2r |∫_t^∞ [ũi+1(x) − ui+1(x)] v̂i dt|  ∀x ∈ Ω.  (98)

By the induction assumptions, we know

lim_{N1,N2→∞} ∫_t^∞ [ûi(x)² − ui(x)²] dt = 0  (99)
lim_{N1,N2→∞} ∫_t^∞ ui+1(x)[ûi(x) − ui(x)] dt = 0.  (100)

Then, it follows that

lim_{N1,N2→∞} |Vi(x) − V˜i(x)| = 0  and  lim_{N1,N2→∞} |ui+1(x) − ũi+1(x)| = 0.  (101)

Finally, since |V̂i(x) − Vi(x)| ≤ |Vi(x) − V˜i(x)| + |V˜i(x) − V̂i(x)| and by the induction assumption, we have

lim_{N1,N2→∞} |Vi(x) − V̂i(x)| = 0.  (102)

Similarly, we can show

lim_{N1,N2→∞} |ui+1(x) − ûi+1(x)| = 0.  (103)

The proof is thus complete.

PROOF OF THEOREM 3.2

Define

ēro(x) = { ero(x), Vi∗(x) ≤ d;  0, Vi∗(x) > d }  (104)

and

u(x) = ui∗(x) + (r/2) ρ²(|x|²) ui∗+1(x) + ēro(x).  (105)

Then, along the solutions of (9), by completing the squares, we have

V̇i∗ ≤ −Q(x) + ε²|x|² − [4γ²(|x|) − (ε + ēro(x))²]/ρ²(|x|²)
    = −(Q(x) − ε²|x|²) − [4γ²(|x|) − (ε + ēro(x))²]/ρ²(|x|²)
    ≤ −Q0(x) − 4[γ²(|x|) − max{κ1²(|w|), κ2²(|x|), ēro²(|x|)}]/ρ²(|x|²)

where Q0(x) = Q(x) − ε²|x|² is a positive definite function of x. Therefore, under Assumptions 3.3 and 3.4 and the gain condition (34), we have the following implication:

Vi∗(x) ≥ ᾱ ∘ γ⁻¹ ∘ κ1 ∘ λ⁻¹(W(w))
⇒ |x| ≥ γ⁻¹ ∘ κ1 ∘ λ⁻¹(W(w))
⇒ γ(|x|) ≥ κ1(|w|)
⇒ γ(|x|) ≥ max{κ1(|w|), κ2(|x|), ēro(|x|)}
⇒ V̇i∗(x) ≤ −Q0(x).  (106)

In addition, under Assumption 3.4, we have

W(w) ≥ κ3 ∘ α⁻¹(Vi∗(x)) ⇒ W(w) ≥ κ3(|x|) ⇒ ∇W(w)ᵀ ẇ ≤ −κ4(|w|).  (107)

Finally, under the gain condition (34), it follows that

γ(s) > κ1 ∘ λ⁻¹ ∘ κ3 ∘ α⁻¹ ∘ ᾱ(s)
⇒ γ ∘ ᾱ⁻¹(s′) > κ1 ∘ λ⁻¹ ∘ κ3 ∘ α⁻¹(s′)
⇒ s′ > ᾱ ∘ γ⁻¹ ∘ κ1 ∘ λ⁻¹ ∘ κ3 ∘ α⁻¹(s′)  (108)

where s′ = ᾱ(s). Hence, the following small-gain condition holds:

(ᾱ ∘ γ⁻¹ ∘ κ1 ∘ λ⁻¹) ∘ (κ3 ∘ α⁻¹)(s) < s  ∀s > 0.  (109)

Denoting χ1 = ᾱ ∘ γ⁻¹ ∘ κ1 ∘ λ⁻¹ and χ2 = κ3 ∘ α⁻¹, by Theorem 1.1, the system composed of (8), (9), and (105) is globally asymptotically stable at the origin. In addition, by Theorem 1.1, there exists a continuously differentiable class K∞ function σ(s) such that the set

Ωi∗ = {(w, x) : max[σ(Vi∗(x)), W(w)] ≤ σ(d)}  (110)

is an estimate of the region of attraction of the closed-loop system. The proof is thus complete.

PROOF OF THEOREM 4.2

Define

ēro1(X1) = { ero1(X1), Ui∗(X1) ≤ d1;  0, Ui∗(X1) > d1 }
ē̄ro(x)  = { ēro(x),  Ui∗(X1) ≤ d1;  0, Ui∗(X1) > d1 }.

Along the solutions of (38)–(40) with the control policy

u = −f̄1(x, z) + 2r ûi∗+1(x) − (ḡ²(x)ρ1²(|X1|²)/4) ζ − (ρ1²(|X1|²)/4) ζ − (ρ²(ζ²)/4) ζ − (1/(2ρ²(|x|²))) ζ − ēro1(X1)

it follows that

U̇i∗ ≤ −Q0(x) − (1/2)ζ²
     − 4[γ²(|X1|) − max{κ̃1²(|w|), κ̃2²(|X1|), ē̄ro²(x)}]/ρ²(|x|²)
     − 4[γ1²(|X1|) − max{κ̃1²(|w|), κ̃2²(|X1|), ē̄ro²(x)}]/ρ1²(|X1|²)
     − 4[γ1²(|X1|) − max{κ̃1²(|w|), κ̃2²(|X1|), ēro1²(X1)}]/ρ1²(|X1|²).

As a result

Ui∗(X1) ≥ ᾱ1 ∘ γ1⁻¹ ∘ κ̃1 ∘ λ⁻¹(W(w)) ⇒ U̇i∗ ≤ −Q0(x) − (1/2)|ζ|².

The rest of the proof follows the same reasoning as in the proof of Theorem 3.2.
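The small-gain condition invoked in these proofs (stated as (91) in Theorem 1.1) can be spot-checked numerically for concrete gain functions. The gains below are hypothetical toy choices, not the paper's χ1, χ2; a grid check like this is only a necessary test, since a proof must cover all s > 0.

```python
# Numerical spot-check of the small-gain condition  χ1(χ2(s)) < s  for all s > 0,
# using a pair of hypothetical class-K gains:
#   χ1(s) = 0.8 s,   χ2(s) = s / (1 + s),
# for which χ1(χ2(s)) = 0.8 s / (1 + s) < s holds for every s > 0.

def chi1(s):
    return 0.8 * s

def chi2(s):
    return s / (1.0 + s)

def small_gain_holds(chi_a, chi_b, samples):
    """Check chi_a(chi_b(s)) < s on a grid of positive samples."""
    return all(chi_a(chi_b(s)) < s for s in samples)

grid = [10.0 ** k for k in range(-6, 7)]  # s from 1e-6 to 1e6
ok = small_gain_holds(chi1, chi2, grid)
```

Replacing χ1 with, say, s ↦ 2s makes the composite exceed the identity near the origin, and the check correctly fails there.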

REFERENCES

[1] M. Abu-Khalaf and F. L. Lewis, "Neurodynamic programming and zero-sum games for constrained control systems," IEEE Trans. Neural Netw., vol. 19, no. 7, pp. 1243–1252, Jul. 2008.
[2] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Model-free Q-learning designs for linear discrete-time zero-sum games with application to H-infinity control," Automatica, vol. 43, no. 3, pp. 473–481, Mar. 2007.
[3] L. C. Baird, "Reinforcement learning in continuous time: Advantage updating," in Proc. IEEE Int. Conf. Neural Netw., World Congr. Comput. Intell., vol. 4, Jul. 1994, pp. 2448–2453.
[4] A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," IEEE Trans. Syst., Man, Cybern., vol. 13, no. 5, pp. 835–846, Oct. 1983.
[5] R. Beard, G. Saridis, and J. Wen, "Galerkin approximations of the generalized Hamilton-Jacobi-Bellman equation," Automatica, vol. 33, no. 12, pp. 2159–2177, Dec. 1997.
[6] R. E. Bellman, Dynamic Programming. Princeton, NJ, USA: Princeton Univ. Press, 1957.
[7] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA, USA: Athena Scientific, 1996.
[8] K. Doya, "Reinforcement learning in continuous time and space," Neural Comput., vol. 12, no. 1, pp. 219–245, Jan. 2000.
[9] R. Howard, Dynamic Programming and Markov Processes. Cambridge, MA, USA: MIT Press, 1960.
[10] A. Isidori, Nonlinear Control Systems, vol. 2. New York, NY, USA: Springer-Verlag, 1999.
[11] Y. Jiang and Z. P. Jiang, "Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics," Automatica, vol. 48, no. 10, pp. 2699–2704, Oct. 2012.
[12] Y. Jiang and Z. P. Jiang, "Robust adaptive dynamic programming with an application to power systems," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 7, pp. 1150–1156, Jul. 2013.
[13] Y. Jiang and Z. P. Jiang, "Robust approximate dynamic programming and global stabilization with nonlinear dynamic uncertainties," in Proc. 50th IEEE CDC-ECC, Orlando, FL, USA, Dec. 2011, pp. 115–120.
[14] Y. Jiang and Z. P. Jiang, "Robust adaptive dynamic programming," in Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, F. L. Lewis and D. Liu, Eds. New York, NY, USA: Wiley, 2012.
[15] Z. P. Jiang and Y. Jiang, "Robust adaptive dynamic programming for linear and nonlinear systems: An overview," Eur. J. Control, vol. 19, no. 5, pp. 417–425, Sep. 2013.
[16] Z. P. Jiang, I. Mareels, and Y. Wang, "A Lyapunov formulation of the nonlinear small gain theorem for interconnected ISS systems," Automatica, vol. 32, no. 8, pp. 1211–1215, Aug. 1996.
[17] Z. P. Jiang and I. M. Y. Mareels, "A small-gain control method for nonlinear cascaded systems with dynamic uncertainties," IEEE Trans. Autom. Control, vol. 42, no. 3, pp. 292–308, Mar. 1997.
[18] Z. P. Jiang, A. R. Teel, and L. Praly, "Small-gain theorem for ISS systems and applications," Math. Control, Signals, Syst., vol. 7, no. 2, pp. 95–120, 1994.
[19] I. Karafyllis and Z. P. Jiang, Stability and Stabilization of Nonlinear Systems. New York, NY, USA: Springer-Verlag, 2011.
[20] H. K. Khalil, Nonlinear Systems, 3rd ed. Englewood Cliffs, NJ, USA: Prentice-Hall, 2002.
[21] M. Krstic, I. Kanellakopoulos, and P. V. Kokotovic, Nonlinear and Adaptive Control Design. New York, NY, USA: Wiley, 1995.
[22] P. Kundur, N. J. Balu, and M. G. Lauby, Power System Stability and Control. New York, NY, USA: McGraw-Hill, 1994.
[23] F. L. Lewis, D. Vrabie, and V. L. Syrmos, Optimal Control, 3rd ed. New York, NY, USA: Wiley, 2012.
[24] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits Syst. Mag., vol. 9, no. 3, pp. 32–50, Apr./Jun. 2009.
[25] F. L. Lewis and K. G. Vamvoudakis, "Reinforcement learning for partially observable dynamic processes: Adaptive dynamic programming using measured output data," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 41, no. 1, pp. 14–23, Feb. 2011.
[26] F. L. Lewis and D. Liu, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control. New York, NY, USA: Wiley, 2012.
[27] R. Marino and P. Tomei, Nonlinear Control Design: Geometric, Adaptive, Robust. New York, NY, USA: Wiley, 1995.
[28] J. M. Mendel and R. W. McLaren, "Reinforcement learning control and pattern recognition systems," in Adaptive, Learning and Pattern Recognition Systems: Theory and Applications, J. M. Mendel and K. S. Fu, Eds. New York, NY, USA: Academic, 1970, pp. 287–318.
[29] M. Minsky, "Steps toward artificial intelligence," Proc. IRE, vol. 49, no. 1, pp. 8–30, Jan. 1961.
[30] F. K. Moore and E. M. Greitzer, "A theory of post-stall transients in axial compression systems—Part I: Development of equations," J. Eng. Gas Turbines Power, vol. 108, no. 1, pp. 68–76, Jan. 1986.
[31] J. J. Murray, C. J. Cox, and G. G. Lendaris, "Adaptive dynamic programming," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 32, no. 2, pp. 140–153, May 2002.
[32] M. J. D. Powell, Approximation Theory and Methods. Cambridge, U.K.: Cambridge Univ. Press, 1981.
[33] L. Praly and Y. Wang, "Stabilization in spite of matched unmodeled dynamics and an equivalent definition of input-to-state stability," Math. Control, Signals, Syst., vol. 9, no. 1, pp. 1–33, 1996.
[34] G. N. Saridis and C.-S. G. Lee, "An approximation theory of optimal control for trainable manipulators," IEEE Trans. Syst., Man, Cybern., vol. 9, no. 3, pp. 152–159, Mar. 1979.
[35] E. D. Sontag, "Smooth stabilization implies coprime factorization," IEEE Trans. Autom. Control, vol. 34, no. 4, pp. 435–443, Apr. 1989.
[36] E. D. Sontag, "Further facts about input to state stabilization," IEEE Trans. Autom. Control, vol. 35, no. 4, pp. 473–476, Apr. 1990.
[37] E. D. Sontag and Y. Wang, "On characterizations of the input-to-state stability property," Syst. Control Lett., vol. 24, no. 5, pp. 351–359, Apr. 1995.
[38] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 1998.
[39] R. S. Sutton, "Learning to predict by the methods of temporal differences," Mach. Learn., vol. 3, no. 1, pp. 9–44, Aug. 1988.
[40] A. Teel and L. Praly, "Tools for semiglobal stabilization by partial state and output feedback," SIAM J. Control Optim., vol. 33, no. 5, pp. 1443–1488, Sep. 1995.
[41] J. Tsinias, "Partial-state global stabilization for general triangular systems," Syst. Control Lett., vol. 24, no. 2, pp. 139–145, Jan. 1995.
[42] K. G. Vamvoudakis and F. L. Lewis, "Multi-player non zero sum games: Online adaptive learning solution of coupled Hamilton–Jacobi equations," Automatica, vol. 47, no. 8, pp. 1556–1569, Aug. 2011.
[43] D. Vrabie and F. Lewis, "Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems," Neural Netw., vol. 22, no. 3, pp. 237–246, Apr. 2009.
[44] D. Vrabie, O. Pastravanu, M. Abu-Khalaf, and F. L. Lewis, "Adaptive optimal control for continuous-time linear systems based on policy iteration," Automatica, vol. 45, no. 2, pp. 477–484, Feb. 2009.
[45] M. D. Waltz and K. S. Fu, "A heuristic approach to reinforcement learning control systems," IEEE Trans. Autom. Control, vol. 10, no. 4, pp. 390–398, Oct. 1965.
[46] F. Y. Wang, H. Zhang, and D. Liu, "Adaptive dynamic programming: An introduction," IEEE Comput. Intell. Mag., vol. 4, no. 2, pp. 39–47, May 2009.
[47] C. Watkins, "Learning from delayed rewards," Ph.D. thesis, King's College, Cambridge Univ., Cambridge, U.K., May 1989.
[48] P. J. Werbos, The Elements of Intelligence. Namur, Belgium: Cybernetica, 1968, no. 3.
[49] P. J. Werbos, "Beyond regression: New tools for prediction and analysis in the behavioral sciences," Ph.D. thesis, Committee Appl. Math., Harvard Univ., Cambridge, MA, USA, 1974.
[50] P. J. Werbos, "Neural networks for control and system identification," in Proc. 28th IEEE Conf. Decision Control, vol. 1, Dec. 1989, pp. 260–265.
[51] P. J. Werbos, "A menu of designs for reinforcement learning over time," in Neural Networks for Control, W. T. Miller, R. S. Sutton, and P. J. Werbos, Eds. Cambridge, MA, USA: MIT Press, 1991, pp. 67–95.
[52] P. J. Werbos, "Approximate dynamic programming for real-time control and neural modeling," in Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, D. A. White and D. A. Sofge, Eds. New York, NY, USA: Van Nostrand, 1992.
[53] P. J. Werbos, "Intelligence in the brain: A theory of how it works and how to build it," Neural Netw., vol. 22, no. 3, pp. 200–212, Apr. 2009.
[54] H. Zhang, Q. Wei, and D. Liu, "An iterative adaptive dynamic programming method for solving a class of nonlinear zero-sum differential games," Automatica, vol. 47, no. 1, pp. 207–214, Jan. 2011.
[55] Y. Zhang, P. Y. Peng, and Z. P. Jiang, "Stable neural controller design for unknown nonlinear systems using backstepping," IEEE Trans. Neural Netw., vol. 11, no. 6, pp. 1347–1360, Nov. 2000.

Yu Jiang (S’11) was born in Xi’an, China. He received the B.Sc. degree in applied mathematics from Sun Yat-Sen University, Guangzhou, China, in 2006, and the M.Sc. degree in automation science and engineering from the South China University of Technology, Guangzhou, in 2009. Currently, he is pursuing the Ph.D. degree with the Control and Networks Laboratory, Department of Electrical and Computer Engineering, Polytechnic School of Engineering, New York University, Brooklyn, NY, USA. He was a Research Intern with the Mitsubishi Electric Research Laboratories, Cambridge, MA, USA, in 2013. His current research interests include adaptive dynamic programming, nonlinear optimal control, and control applications in biological/biomedical and power systems. Yu Jiang received the Shimemura Young Author Award (with Z. P. Jiang) at the 9th Asian Control Conference, Istanbul, Turkey, in 2013.

Zhong-Ping Jiang (M’94–SM’02–F’08) received the B.Sc. degree in mathematics from the University of Wuhan, Wuhan, China, in 1988, the M.Sc. degree in statistics from the University of Paris XI, Paris, France, in 1989, and the Ph.D. degree in automatic control and mathematics from Ecole des Mines de Paris, Paris, in 1993. He is currently a Professor of electrical and computer engineering with the Polytechnic School of Engineering, New York University, Brooklyn, NY, USA. He is the co-author of the books Stability and Stabilization of Nonlinear Systems (Springer, 2011) and Nonlinear Control of Dynamic Networks (CRC Press, Taylor & Francis, 2014). His current research interests include stability theory, robust, adaptive and distributed nonlinear control, adaptive dynamic programming, and their applications to information, mechanical and biological systems. Prof. Jiang is a Fellow of IFAC. He has served as an Associate Editor for numerous journals and is a Deputy-Editor-in-Chief for the newly launched Journal of Control and Decision. He is a recipient of the National Science Foundation (NSF) CAREER Award and the Outstanding Overseas Chinese Scholars Award from the NSF of China.
