
Robust Adaptive Dynamic Programming With an Application to Power Systems

Yu Jiang, Student Member, IEEE, and Zhong-Ping Jiang, Fellow, IEEE

Abstract— This brief presents a novel framework of robust adaptive dynamic programming (robust-ADP) aimed at computing globally stabilizing and suboptimal control policies in the presence of dynamic uncertainties. A key strategy is to integrate ADP theory with techniques from modern nonlinear control, with the objective of filling a gap in the existing ADP literature, which has not taken dynamic uncertainties into account. Neither the system dynamics nor the system order is required to be precisely known. As an illustrative example, the computational algorithm is applied to the controller design of a two-machine power system.

Index Terms— Nonlinear uncertain systems, optimal control, reinforcement learning.

I. INTRODUCTION

Adaptive control is a model-based approach to the control of uncertain systems [14], [18], [22]. It relies on the design of parameter adaptation laws, which is not trivial, and the resulting adaptive system is highly nonlinear. It has also been found that conventional adaptive systems respond slowly to system parameter variations. By contrast, adaptive dynamic programming (ADP) [27]–[30] is a non-model-based method that aims to approximate, through learning and real-time data, optimal controllers with guaranteed stability for uncertain dynamic systems. In recent years, various ADP algorithms for feedback control design have emerged and have been used in several applications (see [16], [25] for two review papers, and [1], [2], [4], [17], [23], [26], [31], [32] for some recently developed results). In the existing literature, optimality has been the main concern in developing ADP-based control algorithms, and a common assumption is that both the system order and the state measurements are fully known. Further research is needed to relax these two restrictive conditions.

As is widely recognized, biological systems learn to achieve enhanced robustness (or a greater chance of survival) through interaction with the unknown environment. However, due to the complexity of the real-world environment, biological systems may only be able to make decisions based on partial-state information. In order to capture and model this feature of biological learning, this brief proposes a new concept of robust adaptive dynamic programming, a natural extension of ADP to uncertain dynamic systems. It is worth noting that our research takes into account dynamic uncertainties with unmeasured state variables and unknown system order.

As the first distinctive feature of the proposed robust-ADP framework, the control policy design is addressed from the point of view of robust control with disturbance attenuation. Specifically, by means of the popular backstepping approach [12], we show that a robust control policy, or adaptive critic, can be synthesized to yield an arbitrarily small L2-gain with respect to the disturbance input. In addition, by studying the relationship between optimality and robustness, it is shown that, in the absence of the disturbance input, the robust control policy also preserves optimality with respect to some iteratively constructed cost function.

Our study of the effects of dynamic uncertainties, or unmodeled dynamics, is motivated by engineering applications in which the exact mathematical model of a physical system is not easy to obtain. The presence of dynamic uncertainty gives rise to interconnected systems for which the control policy design and robustness analysis become technically challenging. With this observation in mind, we adopt the notions of input-to-output stability and strong unboundedness observability introduced in the nonlinear control community; see, for instance, [3], [9], and [20]. We achieve the robust stability and suboptimality properties for the overall interconnected system by means of Lyapunov and small-gain techniques [9], [10], [12].

The rest of this brief is organized as follows. Section II formulates the problem and reviews the policy iteration technique. Section III investigates the relationship between optimality and robustness for a general class of partially linear, uncertain composite systems [19]. Section IV presents a robust-ADP scheme for partial-state feedback design. Section V demonstrates the proposed robust-ADP design methodology by means of a two-machine power system. Concluding remarks are contained in Section VI.

II. PROBLEM FORMULATION AND PRELIMINARIES

A. Problem Formulation

Consider the following partially linear composite system:


ẇ = f(w, y)                                     (1)
ẋ = Ax + B[z + Δ₁(w, y)]                        (2)
ż = Ex + Fz + G[u + Δ₂(w, y)]                   (3)
y = Cx                                          (4)


where [xᵀ, zᵀ]ᵀ ∈ Rⁿ × Rᵐ is the system state vector; w ∈ R^{nw} is the state of the dynamic uncertainty; y ∈ R^q is the output; u ∈ Rᵐ is the control input; A ∈ R^{n×n}, B ∈ R^{n×m}, C ∈ R^{q×n}, E ∈ R^{m×n}, F ∈ R^{m×m}, and G ∈ R^{m×m} are unknown constant matrices, with the pair (A, B) stabilizable and G nonsingular; Δ₁(w, y) = DΔ(w, y) and Δ₂(w, y) = HΔ(w, y) are the outputs of the dynamic uncertainty, with D, H ∈ R^{m×p} two unknown constant matrices; the unknown functions f : R^{nw} × R^q → R^{nw} and Δ : R^{nw} × R^q → R^p are locally Lipschitz and satisfy f(0, 0) = 0, Δ(0, 0) = 0. In addition, assume that upper bounds on the norms of B, D, H, and G⁻¹ are known. Our objective is to find online a robust optimal control policy that globally asymptotically stabilizes (1)–(4) at the origin.

B. ADP for Continuous-Time Linear Systems

Consider the system

ẋ = Ax + Bu                                     (5)

where A and B are defined the same as in (2). By linear optimal control theory [15], the cost

J = ∫_0^∞ (xᵀQx + uᵀRu) dτ                      (6)

is minimized under the control policy u = −K*x, where K* = R⁻¹BᵀP* and P* = P*ᵀ > 0 is the solution of the algebraic Riccati equation (ARE)

P*A + AᵀP* + Q − P*BR⁻¹BᵀP* = 0                 (7)

where Q = Qᵀ ≥ 0, R = Rᵀ > 0, and the pair (A, Q^{1/2}) is assumed to be observable. Given K₀ such that A − BK₀ is Hurwitz, we can define two sequences {Pk} and {Kk} by

0 = (A − BKk)ᵀPk + Pk(A − BKk) + Q + KkᵀRKk     (8)
Kk+1 = R⁻¹BᵀPk.                                 (9)

In [11], it has been shown that Ak = A − BKk is Hurwitz for all k ∈ Z₊, and that lim_{k→∞} Pk = P* and lim_{k→∞} Kk = K*. Notice that (8) and (9) are called policy evaluation and policy improvement, respectively, in the policy iteration technique. When both A and B are unknown, (8) and (9) can be implemented online using the following equation [6]:

xᵀPk x |_t^{t+T} = 2 ∫_t^{t+T} (u + Kk x)ᵀ R Kk+1 x dτ − ∫_t^{t+T} xᵀ(Q + KkᵀRKk) x dτ     (10)

where A and B are not involved. It is proved in [6] that, under certain conditions, Pk and K k+1 can be uniquely determined. This scheme will be combined in this brief with tools from nonlinear control theory to design robust-ADP-based control policies for (1)–(4).
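To make the model-based iteration (8)–(9) concrete, the following minimal sketch (an illustration only; the function name and tolerances are not part of the original development, and NumPy/SciPy are assumed available) implements policy evaluation and policy improvement for known A and B. The data-driven variant used later in this brief replaces the Lyapunov solve by a least-squares problem assembled from measured trajectories via (10).

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def kleinman_policy_iteration(A, B, Q, R, K0, max_iter=50, tol=1e-8):
    """Model-based policy iteration (8)-(9); K0 must be stabilizing."""
    K, P_prev = K0, None
    for _ in range(max_iter):
        Ak = A - B @ K
        # Policy evaluation (8): Ak' P + P Ak = -(Q + K' R K)
        P = solve_continuous_lyapunov(Ak.T, -(Q + K.T @ R @ K))
        # Policy improvement (9): K_{k+1} = R^{-1} B' P_k
        K = np.linalg.solve(R, B.T @ P)
        if P_prev is not None and np.max(np.abs(P - P_prev)) < tol:
            break
        P_prev = P
    return P, K
```

By the result in [11], Pk decreases monotonically to the ARE solution P* and Kk converges to K* whenever the initial gain is stabilizing, which is why a stopping test on successive iterates suffices.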

III. OPTIMALITY AND ROBUSTNESS

In this section, we show the existence of a robust optimal control policy that globally asymptotically stabilizes the overall system (1)–(4). To this end, let us make a few assumptions on (1), which are often required in the literature on nonlinear control design [3], [8], [12].

Assumption 1: The w-subsystem (1) has the strong unboundedness observability (SUO) property with zero offset [9], and is input-to-output stable (IOS) with respect to y as the input and Δ as the output [9], [20].

Assumption 2: There exist a continuously differentiable, positive definite, radially unbounded function U : R^{nw} → R₊ and a constant c ≥ 0 such that

U̇ = (∂U(w)/∂w) f(w, y) ≤ −2|Δ|² + c|y|²         (11)

for all w ∈ R^{nw} and y ∈ R^q.

We now show that an arbitrarily small L2-gain can be obtained for the subsystem (2)–(4). Given any arbitrarily small constant γ > 0, we can choose Q and R in (7) such that there exists a constant α ∈ [0, 1) satisfying αQ ≥ γ⁻¹CᵀC and αR⁻¹ ≥ DDᵀ. Define K* = R⁻¹BᵀP*, ξ = z + K*x, Ac = A − BK*, and let S* > 0 be the symmetric solution of the ARE

F̄ᵀS* + S*F̄ + W − S*GR₁⁻¹GᵀS* = 0                (12)

where W > 0, αR₁⁻¹ ≥ D̄D̄ᵀ, F̄ = F + K*B, and D̄ = H + G⁻¹K*BD. Further, define Ē = E + K*Ac − FK*, M* = R₁⁻¹GᵀS*, and N* = S*Ē.

The following theorem gives the small-gain condition for the robust asymptotic stability of the overall system (1)–(4).

Theorem 3.1: Under Assumptions 1 and 2, the control policy

u* = −[(M*ᵀR₁)⁻¹(N* + RK*) + M*K*]x − M*z         (13)

globally asymptotically stabilizes the closed-loop system comprised of (1)–(4), provided the small-gain condition

γc ≤ 1                                           (14)

holds.

Proof: Define

V(x, z, w) = xᵀP*x + ξᵀS*ξ + U(w).               (15)

Then, along the solutions of the closed-loop system comprised of (1)–(4) and (13), by completing squares, it follows that

V̇ ≤ −(1 − α)xᵀ(Q + P*BR⁻¹BᵀP*)x − (1 − α)ξᵀ(W + S*GR₁⁻¹GᵀS*)ξ.

By Assumption 1, all solutions of the closed-loop system are globally bounded. Moreover, a direct application of LaSalle's Invariance Principle [10] yields the global asymptotic stability (GAS) of the trivial solution of the closed-loop system.

Next, we show that the control policy (13) is suboptimal, i.e., it is optimal with respect to some cost function in the absence of the dynamic uncertainty. Notice that, with Δ ≡ 0, (2)–(3) can be rewritten in the more compact matrix form

Ẋ = A₁X + G₁v                                    (16)

where X = [xᵀ, ξᵀ]ᵀ, v = u + G⁻¹[Ē + (S*)⁻¹BᵀP*]x, A₁ = [Ac, B; −(S*)⁻¹BᵀP*, F̄] (rows separated by semicolons), and G₁ = [0; G].

Proposition 1: Under the conditions in Theorem 3.1, the performance index

J₁ = ∫_0^∞ (XᵀQ₁X + vᵀR₁v) dτ                    (17)

for system (16) is minimized under the control policy

v* = u* + G⁻¹[Ē + (S*)⁻¹BᵀP*]x.                  (18)

Proof: It is easy to check that P̄* = block diag(P*, S*) is the solution to the following ARE:

A₁ᵀP̄* + P̄*A₁ + Q₁ − P̄*G₁R₁⁻¹G₁ᵀP̄* = 0            (19)

where Q₁ = block diag(Q + K*ᵀRK*, W). Therefore, by linear optimal control theory [15], we obtain the optimal control policy

v* = −R₁⁻¹G₁ᵀP̄*X = −M*ξ.                         (20)

The proof is thus complete.

Remark 1: It is of interest to note that Theorem 3.1 can be generalized to higher dimensional systems with a lower triangular structure, by a repeated application of backstepping and small-gain techniques in nonlinear control.

Remark 2: The cost function introduced here is different from the ones used in game theory [1], [24], where the policy iterations are developed based on the game algebraic Riccati equation (GARE). The existence of a solution of the GARE cannot be guaranteed when the input–output gain is arbitrarily small. Therefore, a significant advantage of our method versus the game-theoretic approach in [1] and [23] is that we are able to render the gain arbitrarily small.

IV. ROBUST-ADP DESIGN

In this section, we develop a novel robust-ADP scheme to approximate the robust optimal control policy (13). This scheme contains two learning phases. Phase-one computes the matrices K* and P*. Then, based on its results, the second learning phase further computes the matrices S*, M*, and N*. It is worth noticing that the knowledge of A, B, E, and F is not required in our learning algorithm. In addition, we will analyze the robust asymptotic stability of the overall system under the approximated control policy obtained from our algorithm.

A. Phase-One Learning

Assume all the conditions in Theorem 3.1 are satisfied. Along the trajectories of (2), it follows that

xᵀPk x |_t^{t+T} = 2 ∫_t^{t+T} (z + Δ₁ + Kk x)ᵀ R Kk+1 x dτ − ∫_t^{t+T} xᵀ(Q + KkᵀRKk) x dτ.     (21)

Using the Kronecker product representation, (21) can be rewritten as

(xᵀ ⊗ xᵀ)|_t^{t+T} vec(Pk) = 2 [∫_t^{t+T} xᵀ ⊗ (z + Δ₁ + Kk x)ᵀ dτ] (In ⊗ R) vec(Kk+1) − [∫_t^{t+T} xᵀ ⊗ xᵀ dτ] vec(Q + KkᵀRKk).     (22)

For any φ ∈ R^{nφ}, ψ ∈ R^{nψ}, and some sufficiently large l > 0, we define the operators δφψ : R^{nφ} × R^{nψ} → R^{l×nφnψ} and Iφψ : R^{nφ} × R^{nψ} → R^{l×nφnψ} such that

δφψ = [φ ⊗ ψ |_{t0}^{t1}, φ ⊗ ψ |_{t1}^{t2}, ..., φ ⊗ ψ |_{t(l−1)}^{tl}]ᵀ
Iφψ = [∫_{t0}^{t1} φ ⊗ ψ dτ, ∫_{t1}^{t2} φ ⊗ ψ dτ, ..., ∫_{t(l−1)}^{tl} φ ⊗ ψ dτ]ᵀ

where 0 ≤ t0 < t1 < ... < tl are constants. Then, (22) implies the following matrix form of linear equations:

Θk [vec(Pk); vec(Kk+1)] = Ξk                     (23)

where the two vectorized unknowns are stacked into a single column, and Θk ∈ R^{l×n(n+m)} and Ξk ∈ R^l are defined as

Θk = [δxx, −2Ixx(In ⊗ KkᵀR) − 2(Ixz + IxΔ₁)(In ⊗ R)]
Ξk = −Ixx vec(Q + KkᵀRKk).

Given Kk such that A − BKk is Hurwitz, if there is a unique pair of matrices (Pk, Kk+1), with Pk = Pkᵀ, satisfying (23), we are able to replace (8) and (9) with (23). In this way, the iterative process does not need the knowledge of A or B.
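The assembly of (23) from measured data can be illustrated with a short sketch. This is an illustration only: the function name, the per-interval data structures, and the symbols Θk, Ξk follow the reconstruction above, and the interval integrals are assumed to have already been approximated from sampled measurements (e.g., by the trapezoidal rule). Each interval contributes one row to the regression, and (Pk, Kk+1) is recovered by least squares under a column-major (Fortran-order) vectorization, which matches the Kronecker identities used in (22).

```python
import numpy as np

def phase_one_update(x0s, x1s, Ixxs, Ixvs, Q, R, K):
    """One data-driven policy-iteration step solving (23) by least squares.
    Per interval [t_i, t_{i+1}] the caller supplies (as 1-D NumPy arrays):
      x0, x1 : state x at the two interval endpoints,
      Ixx    : integral of kron(x, x) dtau over the interval,
      Ixv    : integral of kron(x, z + Delta_1) dtau over the interval.
    Returns the symmetric P_k and the improved gain K_{k+1}."""
    n, m = Q.shape[0], R.shape[0]
    InKT = np.kron(np.eye(n), K.T)        # used to form the integral of kron(x, K x)
    InR = np.kron(np.eye(n), R)           # maps vec(K_{k+1}) to vec(R K_{k+1})
    qk = (Q + K.T @ R @ K).flatten(order="F")
    rows, rhs = [], []
    for x0, x1, Ixx, Ixv in zip(x0s, x1s, Ixxs, Ixvs):
        d_xx = np.kron(x1, x1) - np.kron(x0, x0)   # delta_xx row, multiplies vec(P_k)
        k_blk = -2.0 * (Ixv + Ixx @ InKT) @ InR    # block multiplying vec(K_{k+1})
        rows.append(np.concatenate([d_xx, k_blk]))
        rhs.append(-Ixx @ qk)
    sol, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    P = sol[: n * n].reshape(n, n, order="F")
    P = 0.5 * (P + P.T)                            # enforce symmetry of P_k
    K_next = sol[n * n :].reshape(m, n, order="F")
    return P, K_next
```

A full phase-one run repeats this update with the same recorded data until, say, the change in Pk between iterations falls below a chosen threshold.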

Next, we approximate the matrices S*, M*, and N*, which also appear in (13).

B. Phase-Two Learning

For the matrix Kk ∈ R^{m×n} obtained from phase-one learning, let us define ξ̂ = z + Kk x. Then

dξ̂/dt = Ek x + Fk ξ̂ + G(u + Δ₂) + Kk B Δ₁        (24)

where Ek = E + Kk(A − BKk) − FKk and Fk = F + Kk B. Similarly to phase-one learning, we seek an online implementation of the following iterative equations:

0 = Sk,j Fk,j + Fk,jᵀ Sk,j + W + Mk,jᵀ R₁ Mk,j      (25)
Mk,j+1 = R₁⁻¹ Gᵀ Sk,j                               (26)

where Fk,j = Fk − G Mk,j, and we assume there exists Mk,0 such that Fk,0 is Hurwitz.


Now, along the solutions of (24), we have

ξ̂ᵀSk,j ξ̂ |_t^{t+T} = −∫_t^{t+T} ξ̂ᵀ(W + Mk,jᵀR₁Mk,j)ξ̂ dτ + 2 ∫_t^{t+T} (û + Mk,j ξ̂)ᵀ R₁ Mk,j+1 ξ̂ dτ + 2 ∫_t^{t+T} ξ̂ᵀ Nk,j x dτ + 2 ∫_t^{t+T} Δ₁ᵀ Lk,j ξ̂ dτ

where û = u + Δ₂, Nk,j = Sk,j Ek, and Lk,j = BᵀKkᵀSk,j. Then, we obtain the following linear equations that can be used to approximate the solution to the ARE (12):

Θk,j [vec(Sk,j); vec(Mk,j+1); vec(Nk,j); vec(Lk,j)] = Ξk,j          (27)

where Θk,j ∈ R^{l×m(n+3m)} and Ξk,j ∈ R^l are defined as

Θk,j = [δξ̂ξ̂, −2Iξ̂ξ̂(Im ⊗ Mk,jᵀR₁) − 2Iξ̂û(Im ⊗ R₁), −2Ixξ̂, −2Iξ̂Δ₁]
Ξk,j = −Iξ̂ξ̂ vec(W + Mk,jᵀR₁Mk,j).

Notice that δξ̂ξ̂, Iξ̂û, Iξ̂ξ̂, Iξ̂Δ₁ ∈ R^{l×m²} and Ixξ̂ ∈ R^{l×nm} can be obtained by

δξ̂ξ̂ = δzz + 2δxz(Kkᵀ ⊗ Im) + δxx(Kkᵀ ⊗ Kkᵀ)
Iξ̂û = Izû + Ixû(Kkᵀ ⊗ Im)
Iξ̂ξ̂ = Izz + 2Ixz(Kkᵀ ⊗ Im) + Ixx(Kkᵀ ⊗ Kkᵀ)
Ixξ̂ = Ixz + Ixx(In ⊗ Kkᵀ)
Iξ̂Δ₁ = IxΔ₁(Kkᵀ ⊗ Im) + IzΔ₁.

Clearly, (27) does not rely on the knowledge of E, F, or G.

C. Implementation Issues

As in previous policy-iteration-based algorithms, an initial stabilizing control policy is required in the learning phases. Here, we assume there exist an initial control policy u₀ = −Kx x − Kz z and a positive definite matrix P̄ = P̄ᵀ satisfying P̄ > P̄*, such that along the trajectories of the closed-loop system comprised of (1)–(4) and u₀, we have

d/dt [XᵀP̄X + U(w)] ≤ −ε|X|²                       (28)

where ε > 0 is a constant. Notice that this initial stabilizing control policy u₀ can be obtained using the idea of gain assignment [9]. In addition, to satisfy the rank condition in Lemma 1 below, additional exploration noise may need to be added to the control signal. The robust-ADP scheme can thus be summarized in Algorithm 1.

Algorithm 1: Robust-ADP Algorithm.
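The phase-two regression (27) can be assembled and solved in the same least-squares fashion as (23). The sketch below is an illustration of one inner-loop update of phase-two learning under the same assumptions as the phase-one sketch (precomputed interval integrals, column-major vectorization); whether the integrals involving Δ₁ and û = u + Δ₂ are available depends on the application and is itself an assumption here.

```python
import numpy as np

def phase_two_update(xi0s, xi1s, Ixixis, Ixius, Ixxis, IxiD1s, W, R1, M):
    """One phase-two update solving (27) by least squares.
    Per interval the caller supplies (as 1-D NumPy arrays):
      xi0, xi1 : xi_hat = z + K_k x at the interval endpoints,
      Ixixi    : integral of kron(xi_hat, xi_hat) dtau,
      Ixiu     : integral of kron(xi_hat, u_hat) dtau, with u_hat = u + Delta_2,
      Ixxi     : integral of kron(x, xi_hat) dtau,
      IxiD1    : integral of kron(xi_hat, Delta_1) dtau.
    Returns S_{k,j}, M_{k,j+1}, N_{k,j}, L_{k,j}."""
    m = R1.shape[0]
    n = Ixxis[0].size // m
    ImMR = np.kron(np.eye(m), M.T @ R1)   # handles the known M_{k,j}-dependent term
    ImR1 = np.kron(np.eye(m), R1)
    wk = (W + M.T @ R1 @ M).flatten(order="F")
    rows, rhs = [], []
    for xi0, xi1, Ixixi, Ixiu, Ixxi, IxiD1 in zip(xi0s, xi1s, Ixixis, Ixius, Ixxis, IxiD1s):
        d_xixi = np.kron(xi1, xi1) - np.kron(xi0, xi0)   # block for vec(S_{k,j})
        m_blk = -2.0 * (Ixixi @ ImMR + Ixiu @ ImR1)      # block for vec(M_{k,j+1})
        rows.append(np.concatenate([d_xixi, m_blk, -2.0 * Ixxi, -2.0 * IxiD1]))
        rhs.append(-Ixixi @ wk)
    sol, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    S = sol[: m * m].reshape(m, m, order="F")
    S = 0.5 * (S + S.T)
    M_next = sol[m * m : 2 * m * m].reshape(m, m, order="F")
    N = sol[2 * m * m : 2 * m * m + m * n].reshape(m, n, order="F")
    L = sol[2 * m * m + m * n :].reshape(m, m, order="F")
    return S, M_next, N, L
```

Iterating this update for j = 0, 1, 2, ... with fixed recorded data plays the role of the inner loop of the two-phase scheme, and the exploration noise mentioned above is what guarantees the rank condition needed for a unique least-squares solution.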

D. Convergence Analysis

Lemma 1: Suppose Ak and Fk,j are Hurwitz and there exists an integer l₀ > 0 such that the following holds for all l ≥ l₀:

rank([Ixx, Ixz, Izz, Ixû, Izû, IxΔ₁, IzΔ₁]) = n(n + 1)/2 + m(m + 1)/2 + 3mn + 2m².      (29)

Then:
1) there exist unique Pk = Pkᵀ and Kk+1 satisfying (23);
2) there exist unique Sk,j = Sk,jᵀ, Mk,j+1, Nk,j, and Lk,j satisfying (27).

Proof: See the Appendix.

Theorem 4.1: Given K₀ and Mk,0 such that A₀ and Fk,0 are Hurwitz, under the rank condition in Lemma 1, we have:
1) Ak is Hurwitz, lim_{k→∞} Pk = P*, and lim_{k→∞} Kk = K*;
2) Fk,j is Hurwitz, lim_{k,j→∞} Sk,j = S*, lim_{k,j→∞} Mk,j = M*, lim_{k,j→∞} Nk,j = S*Ē, and lim_{k,j→∞} Lk,j = BᵀK*ᵀS*.

Proof: If Pk = Pkᵀ is the solution of (23), Kk+1 is uniquely determined by R⁻¹BᵀPk. By (21), we know that Pk and Kk+1 satisfy (23). Together with Lemma 1, we conclude that the matrices Pk = Pkᵀ and Kk+1 satisfying (23) must be the unique solution of (8) and (9). Therefore, the iteration in (23) is equivalent to (8) and (9). Hence, property 1) of Theorem 4.1 is proved by means of the theorem in [11]. Following the same reasoning, property 2) of Theorem 4.1 can also be proved.
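As a purely model-based sanity check of the limits in Theorem 4.1 (not part of the data-driven algorithm), the matrices P*, K*, S*, M*, and N* can be computed directly from the AREs (7) and (12) when A, B, E, F, G and the weights are known, and the gains in (13) can then be formed explicitly. The following sketch assumes all matrices are given and uses the SciPy ARE solver; it does not check the weight conditions αQ ≥ γ⁻¹CᵀC, αR⁻¹ ≥ DDᵀ, and αR₁⁻¹ ≥ D̄D̄ᵀ required for the robustness claims.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

def robust_policy_gains(A, B, E, F, G, Q, R, W, R1):
    """Model-based construction of the gains in (13); returns (Kx, Kz)
    such that u = -Kx x - Kz z."""
    # Phase-one ARE (7): P* A + A' P* + Q - P* B R^{-1} B' P* = 0
    P = solve_continuous_are(A, B, Q, R)
    K = np.linalg.solve(R, B.T @ P)           # K* = R^{-1} B' P*
    Ac = A - B @ K
    Fbar = F + K @ B                          # F_bar = F + K* B
    Ebar = E + K @ Ac - F @ K                 # E_bar = E + K* Ac - F K*
    # Phase-two ARE (12): F_bar' S + S F_bar + W - S G R1^{-1} G' S = 0
    S = solve_continuous_are(Fbar, G, W, R1)
    M = np.linalg.solve(R1, G.T @ S)          # M* = R1^{-1} G' S*
    N = S @ Ebar                              # N* = S* E_bar
    # Control policy (13): u = -[(M' R1)^{-1}(N + R K) + M K] x - M z
    Kx = np.linalg.solve(M.T @ R1, N + R @ K) + M @ K
    Kz = M
    return Kx, Kz
```

Comparing these gains with the converged phase-one and phase-two iterates is a convenient way to validate an implementation of the learning scheme on a simulated plant.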

Define ṽ = ũ + G⁻¹[Ē + (S*)⁻¹BᵀP*]x. Then, the following theorem shows that the robust asymptotic stability of the overall system (1)–(4) can be achieved under the approximated control policy ũ obtained from Algorithm 1.

Theorem 4.2: There exists a sufficiently small threshold ε > 0 for Algorithm 1, such that the control policy ũ obtained from Algorithm 1 globally asymptotically stabilizes (1)–(4) at the origin.

Proof: By Theorem 4.1, for any constant β ∈ (0, 1) there exists ε > 0 such that

ṽᵀR₁ṽ − v*ᵀR₁v* + 2ξᵀS*G(ṽ − v*)/(1 − α) ≤ βXᵀQ₁X.      (30)

Now, consider the function V as defined in Theorem 3.1. Along the solutions of the closed-loop system comprised of (1)–(4) and ũ, we have

V̇ ≤ −(1 − α)(XᵀQ₁X + ṽᵀR₁ṽ) + 2ξᵀS*G(ṽ − v*) + (1 − α)(ṽᵀR₁ṽ − v*ᵀR₁v*)
  ≤ −(1 − α)(1 − β)(XᵀQ₁X + ṽᵀR₁ṽ).                      (31)

Finally, the GAS property can be proved using LaSalle's Invariance Principle [10].

Remark 3: Define J₁ = ∫_0^∞ (XᵀQ₁X + ṽᵀR₁ṽ) dt, and let J₁* be the cost associated with the optimal control policy v* in the absence of the dynamic uncertainty. Further, let μ₁ > 0 and μ₂ > 0 be constants such that XᵀQ₁X + ṽᵀR₁ṽ ≤ μ₁|X|² for all X ∈ R^{n+m} and μ₂P̄* ≥ P̄ − P̄*. Then, we have the following inequality:

J₁ ≤ μ₃[J₁* + U(w(0))]                                    (32)

where μ₃ = (1 + μ₂) max{μ₁, 1/[(1 − α)(1 − β)]}.

V. APPLICATION: SYNCHRONOUS GENERATORS

The power system considered in this brief is an interconnection of two synchronous generators described by [13]

δ̇ᵢ = ωᵢ                                                  (33)
ω̇ᵢ = −(Dᵢ/(2Hᵢ))ωᵢ + (ω₀/(2Hᵢ))(Pmᵢ + Peᵢ)                (34)
Ṗmᵢ = (1/Tᵢ)(−Pmᵢ − kᵢωᵢ + uᵢ),    i = 1, 2               (35)

where, for the ith generator, δᵢ, ωᵢ, Pmᵢ, and Peᵢ are the deviations of the rotor angle, relative rotor speed, mechanical input power, and active power, respectively. The control signal uᵢ represents the deviation of the valve opening. Hᵢ, Dᵢ, ω₀, kᵢ, and Tᵢ are constant system parameters. The active power Pe₁ is defined as

Pe₁ = −(E₁E₂/X)[sin(δ₁ − δ₂) − sin(δ₁₀ − δ₂₀)]            (36)

and Pe₂ = −Pe₁, where δ₁₀ and δ₂₀ are the steady-state angles of the first and second generators, and E₁ and E₂ are two constants. The second synchronous generator is treated as the dynamic uncertainty, and it has a fixed controller u₂ = −a₁δ₂ − a₂ω₂ − a₃Pm₂, with a₁, a₂, and a₃ its feedback gains. Our goal is to design a robust optimal control policy u₁ for the interconnected power system.

For simulation purposes, the parameters are specified as follows: D₁ = 1, H₁ = 3, ω₀ = 314.159 rad/s, T₁ = 5 s, δ₁₀ = 1 rad, δ₂₀ = 1.2 rad, E₁ = 2, E₂ = 3, D₂ = 1, T₂ = 5, X = 15, k₂ = 0, H₂ = 3, a₁ = 0.2236, a₂ = −0.2487, and a₃ = −7.8992. The weighting matrices are Q = block diag(5, 0.0001), R = 1, W = 0.01, and R₁ = 100. The exploration noise employed for this simulation is a sum of sinusoidal functions with different frequencies.

In the simulation, the two generators operated at their steady states from t = 0 s to t = 1 s. An impulse disturbance on the load was simulated at t = 1 s, and the overall system started to oscillate. The robust-ADP algorithm was applied to the first generator from t = 2 s to t = 3 s. Convergence is attained after six iterations of phase-one learning followed by ten iterations of phase-two learning, when the stopping criteria |Pk − Pk−1| ≤ 10⁻⁶ and |Sk,j − Sk,j−1| ≤ 10⁻⁶ are both satisfied. The linear control policy obtained from the robust-ADP algorithm is

ũ₁ = −256.9324δ₁ − 44.4652ω₁ − 153.1976Pm₁.

The ideal robust optimal control policy is given for comparison:

u₁* = −259.9324δ₁ − 44.1761ω₁ − 153.1983Pm₁.

Trajectories of the output variables and the convergence of the feedback gain matrices are shown in Figs. 1 and 2. The new control policy for Generator 1 is applied from t = 3 s to the end of the simulation. It can be seen that the oscillation has been significantly reduced after robust-ADP-based online learning.

[Fig. 1. Trajectories of the rotor angle. Panels: Generator 1 and Generator 2; rotor angle (degree) versus time (sec); curves: Robust ADP and Unlearned, with the oscillation start and controller update instants marked.]

[Fig. 2. Trajectories of the angular velocity. Panels: Generator 1 and Generator 2; frequency (Hz) versus time (sec); curves: Robust ADP and Unlearned, with the oscillation start and controller update instants marked.]
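The closed-loop behavior can be reproduced qualitatively by integrating the model (33)–(36) with the reported parameters and the learned gains. The sketch below is an illustration only: the load disturbance, the exploration noise, and the online learning phases of the actual experiment are not reproduced, the unreported gain k₁ is assumed to be zero, and the rotor angles entering the feedback laws are assumed to act through their deviations from the steady-state values.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Parameters reported in Section V
D1, H1, T1 = 1.0, 3.0, 5.0
D2, H2, T2, k2 = 1.0, 3.0, 5.0, 0.0
k1 = 0.0                                      # not reported; assumed zero here
w0, X, E1, E2 = 314.159, 15.0, 2.0, 3.0
d10, d20 = 1.0, 1.2                           # steady-state rotor angles (rad)
a1, a2, a3 = 0.2236, -0.2487, -7.8992         # fixed feedback gains of generator 2
K1 = np.array([256.9324, 44.4652, 153.1976])  # learned robust-ADP gains for u1

def two_machine(t, s):
    """States: rotor angles d1, d2 (rad); speed and mechanical-power deviations."""
    d1, w1, Pm1, d2, w2, Pm2 = s
    Pe1 = -(E1 * E2 / X) * (np.sin(d1 - d2) - np.sin(d10 - d20))
    Pe2 = -Pe1
    u1 = -K1 @ np.array([d1 - d10, w1, Pm1])      # learned policy (deviation feedback)
    u2 = -a1 * (d2 - d20) - a2 * w2 - a3 * Pm2    # fixed policy of generator 2
    return [w1,
            -D1 / (2 * H1) * w1 + w0 / (2 * H1) * (Pm1 + Pe1),
            (-Pm1 - k1 * w1 + u1) / T1,
            w2,
            -D2 / (2 * H2) * w2 + w0 / (2 * H2) * (Pm2 + Pe2),
            (-Pm2 - k2 * w2 + u2) / T2]

# A small initial perturbation stands in for the load disturbance at t = 1 s
s0 = [d10 + 0.1, 0.0, 0.0, d20 + 0.05, 0.0, 0.0]
sol = solve_ivp(two_machine, (0.0, 10.0), s0, max_step=0.01)
```

Running the same integration without the updated gain for Generator 1 gives the qualitative contrast between the learned and unlearned responses shown in Figs. 1 and 2.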


VI. CONCLUSION

We proposed a framework of robust-ADP to compute globally asymptotically stabilizing control policies with suboptimality and robustness properties in the presence of dynamic uncertainties. A learning algorithm was provided for the online computation of partial-state feedback control laws. The novel control scheme was developed by integrating ADP with tools developed within the nonlinear control community. Unlike previous ADP schemes, the robust-ADP framework can handle systems with dynamic uncertainties whose state variables and system order are not precisely known. As an illustrative example, the proposed algorithm was applied to the robust optimal control of a two-machine power system. It is of interest to extend this result to multimachine power systems [7] and to electric smart grids with distributed generators.

APPENDIX
PROOF OF LEMMA 1

The proof of 1) has been given in [6], and is restated here to make this brief self-contained. In fact, we only need to show that, given any constant matrices P = Pᵀ ∈ R^{n×n} and K ∈ R^{m×n}, if

Θk [vec(P); vec(K)] = 0                          (37)

then P = 0 and K = 0. By definition, we have

Θk [vec(P); vec(K)] = Ixx vec(Y) + 2(Ixz + IxΔ₁) vec(Z)          (38)

where

Y = AkᵀP + PAk + Kkᵀ(BᵀP − RK) + (PB − KᵀR)Kk                     (39)
Z = BᵀP − RK.                                                     (40)

Notice that Y is symmetric. We can define x̄ ∈ R^{n(n+1)/2}, Ŷ ∈ R^{n(n+1)/2}, and Ix̄ ∈ R^{l×n(n+1)/2} such that

x̄ = [x₁², x₁x₂, ..., x₁xₙ, x₂², x₂x₃, ..., x_{n−1}xₙ, xₙ²]ᵀ        (41)
Ŷ = [y₁₁, 2y₁₂, ..., 2y₁ₙ, y₂₂, 2y₂₃, ..., 2y_{n−1,n}, yₙₙ]ᵀ       (42)
Ix̄ = [∫_{t0}^{t1} x̄ dτ, ∫_{t1}^{t2} x̄ dτ, ..., ∫_{t(l−1)}^{tl} x̄ dτ]ᵀ.   (43)

Now, (37) and (38) imply

0 = Ix̄ Ŷ + (Ixz + IxΔ₁) vec(2Z).                                  (44)

Under the rank condition in Lemma 1, we have

rank([Ixx, Ixz + IxΔ₁]) ≥ rank([Ixx, Ixz, Izz, Ixû, Izû, IxΔ₁, IzΔ₁]) − 2mn − m(m + 1)/2 − 2m² = n(n + 1)/2 + mn

which implies that [Ix̄, Ixz + IxΔ₁] has full column rank. Hence, Y = Yᵀ = 0 and Z = 0. Finally, since Ak is Hurwitz for each k ∈ Z₊, the only matrices P = Pᵀ and K simultaneously satisfying (39) and (40) are P = 0 and K = 0.

Now we prove 2). Similarly, suppose there exist some constant matrices S, M, L ∈ R^{m×m}, with S = Sᵀ, and N ∈ R^{m×n} satisfying

Θk,j [vec(S); vec(M); vec(N); vec(L)] = 0.

Then, we have

0 = Iξ̂ξ̂ vec(SFk,j + Fk,jᵀS + Mk,jᵀ(GᵀS − R₁M) + (SG − MᵀR₁)Mk,j) + 2Iξ̂û vec(GᵀS − R₁M) + 2Ixξ̂ vec(SEk − N) + 2Iξ̂Δ₁ vec(BᵀKkᵀS − L).

By definition, it holds that

[Ixx, Iξ̂ξ̂, Ixξ̂, Iξ̂û, Ixû, IxΔ₁, Iξ̂Δ₁] = [Ixx, Ixz, Izz, Ixû, Izû, IxΔ₁, IzΔ₁] Tn

where Tn is a nonsingular matrix. Therefore,

m(m + 1)/2 + 2m² + mn ≥ rank([Iξ̂ξ̂, Iξ̂û, Ixξ̂, Iξ̂Δ₁])
                      ≥ rank([Ixx, Iξ̂ξ̂, Ixξ̂, Iξ̂û, Ixû, IxΔ₁, Iξ̂Δ₁]) − n(n + 1)/2 − 2mn
                      = rank([Ixx, Ixz, Izz, Ixû, Izû, IxΔ₁, IzΔ₁]) − n(n + 1)/2 − 2mn
                      = m(m + 1)/2 + 2m² + mn.

Following the same reasoning from (38) to (44), we obtain

0 = SFk,j + Fk,jᵀS + Mk,jᵀ(GᵀS − R₁M) + (SG − MᵀR₁)Mk,j            (45)
0 = GᵀS − R₁M                                                      (46)
0 = SEk − N                                                        (47)
0 = BᵀKkᵀS − L                                                     (48)

where [S, M, N, L] = 0 is the only possible solution.

REFERENCES

[1] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, “Model-free Q-learning designs for linear discrete-time zero-sum games with application to H-infinity control,” Automatica, vol. 43, no. 3, pp. 473–481, Jan. 2007. [2] J. Fu, H. He, and X. Zhou, “Adaptive learning and control for MIMO system based on adaptive dynamic programming,” IEEE Trans. Neural Netw., vol. 22, no. 7, pp. 1133–1148, Jul. 2011. [3] A. Isidori, Nonlinear Control Systems. New York, USA: Springer-Verlag, 1999. [4] Y. Jiang and Z. P. Jiang, “Robust adaptive dynamic programming,” Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, F. L. Lewis and D. Liu, Eds. New York, USA: Wiley, 2012, pp. 281–302. [5] Y. Jiang and Z. P. Jiang, “Robust approximate dynamic programming and global stabilization with nonlinear dynamic uncertainties,” in Proc. Joint IEEE Conf. Decision Control Eur. Control Conf., Dec. 2011, pp. 115–120.


[6] Y. Jiang and Z. P. Jiang, “Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics,” Automatica, vol. 48, no. 10, pp. 2699–2704, 2012. [7] Y. Jiang and Z. P. Jiang, “Robust adaptive dynamic programming for large-scale systems with an application to multimachine power systems,” IEEE Trans. Circuits Syst. II Exp. Briefs, vol. 59, no. 10, pp. 693–697, Oct. 2012. [8] Z. P. Jiang and L. Praly, “Design of robust adaptive controllers for nonlinear systems with dynamic uncertainties,” Automatica, vol. 34, no. 7, pp. 825–840, 1998. [9] Z. P. Jiang, A. R. Teel, and L. Praly, “Small-gain theorem for ISS systems and applications,” Math. Control Signals Syst., vol. 7, no. 2, pp. 95–120, 1994. [10] H. K. Khalil, Nonlinear Systems, 3rd ed. Englewood Cliffs, NJ, USA: Prentice-Hall, 2002. [11] D. Kleinman, “On an iterative technique for Riccati equation computations,” IEEE Trans. Autom. Control, vol. 13, no. 1, pp. 114–115, Feb. 1969. [12] M. Krstic, I. Kanellakopoulos, and P. V. Kokotovic, Nonlinear and Adaptive Control Design. New York, USA: Wiley, 1995. [13] P. Kundur, N. J. Balu, and M. G. Lauby, Power System stability and Control. New York, USA: McGraw-Hill, 1994. [14] P. A. Ioannou and J. Sun, Robust Adaptive Control. Upper Saddle River, NJ, USA: Prentice-Hall, 1996. [15] F. L. Lewis and V. L. Syrmos, Optimal Control. New York, USA: Wiley, 1995. [16] F. L. Lewis and D. Vrabie, “Reinforcement learning and adaptive dynamic programming for feedback control,” IEEE Trans. Circuits Syst. Mag., vol. 9, no. 3, pp. 32–50, Aug. 2009. [17] F. L. Lewis and K. G. Vamvoudakis, “Reinforcement learning for partially observable dynamic processes: Adaptive dynamic programming using measured output data,” IEEE Trans. Syst. Man, Cybern. B, vol. 41, no. 1, pp. 14–23, Feb. 2011. [18] I. Mareels and J. W. Polderman, Adaptive Systems: An Introduction. Cambridge, MA, USA: Birkhäuser, 1996. [19] A. Saberi, P. V. Kokotovic, and H. J. Sussmann, “Global stabilization of partially linear composite systems,” SIAM J. Control Optim., vol. 2, no. 6, pp. 1491–1503, 1990. [20] E. D. Sontag, “Input to state stability: Basic concepts and results,” Nonlinear and Optimal Control Theory, P. Nistri and G. Stefani, Eds. New York, USA: Springer-Verlag, 2007, pp. 163–220.


[21] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 1998. [22] G. Tao, Adaptive Control Design and Analysis. New York, USA: Wiley, 2003. [23] D. Vrabie, O. Pastravanu, M. Abu-Khalaf, and F. L. Lewis, “Adaptive optimal control for continuous-time linear systems based on policy iteration,” Automatica, vol. 45, no. 2, pp. 477–484, Feb. 2009. [24] D. Vrabie and F. L. Lewis, “Adaptive dynamic programming algorithm for finding online the equilibrium solution of the two-player zero-sum differential game,” in Proc. IEEE Joint Conf. Neural Netw., Jul. 2010, pp. 1–8. [25] F.-Y. Wang, H. Zhang, and D. Liu, “Adaptive dynamic programming: An introduction,” IEEE Comput. Intell. Mag., vol. 4, no. 2, pp. 39–47, May 2009. [26] F.-Y. Wang, N. Jin, D. Liu, and Q. Wei, “Adaptive dynamic programming for finite horizon optimal control of discrete-time nonlinear systems with epsilon-error bound,” IEEE Trans. Neural Netw., vol. 22, no. 1, pp. 24–36, Jan. 2011. [27] P. J. Werbos, “Beyond regression: New tools for prediction and analysis in the behavior sciences,” Ph.D. dissertation, Committee Appl. Math. Harvard Univ., Cambridge, MA, USA, 1974. [28] P. J. Werbos, “Neural networks for control and system identification,” in Proc. Conf. Decision Control, vol. 1. Dec. 1989. pp. 260–265. [29] P. J. Werbos, “A menu of designs for reinforcement learning over time,” in Neural Networks for Control, W. T. Miller, R. S. Sutton, and P. J. Werbos, Eds. Cambridge, MA, USA: MIT Press, 1991, pp. 67–95. [30] P. J. Werbos, “Approximate dynamic programming for real-time control and neural modeling,” in Handbook of Intelligent Control, D. A. White and D. A. Sofge, Eds. New York, USA: Van Nostrand, 1992. [31] H. Xu, S. Jagannathan, and F. L. Lewis, “Stochastic optimal control of unknown linear networked control system in the presence of random delays and packet losses,” Automatica, vol. 48, no. 6, pp. 1017–1030, Jun. 2012. [32] H. Zhang, Y. Luo, and D. Liu, “Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints,” IEEE Trans. Neural Netw., vol. 20, no. 9, pp. 1490–1503, Sep. 2009.
