Adaptive Optimal Control of Unknown Constrained-Input Systems Using Policy Iteration and Neural Networks

Hamidreza Modares, Frank L. Lewis, Fellow, IEEE, and Mohammad-Bagher Naghibi-Sistani

Abstract— This paper presents an online policy iteration (PI) algorithm to learn the continuous-time optimal control solution for unknown constrained-input systems. The proposed PI algorithm is implemented on an actor–critic structure where two neural networks (NNs) are tuned online and simultaneously to generate the optimal bounded control policy. The requirement of complete knowledge of the system dynamics is obviated by employing a novel NN identifier in conjunction with the actor and critic NNs. It is shown how the identifier weights estimation error affects the convergence of the critic NN. A novel learning rule is developed to guarantee that the identifier weights converge to small neighborhoods of their ideal values exponentially fast. To provide an easy-to-check persistence of excitation condition, the experience replay technique is used. That is, recorded past experiences are used simultaneously with current data for the adaptation of the identifier weights. Stability of the whole system consisting of the actor, critic, system state, and system identifier is guaranteed while all three networks undergo adaptation. Convergence to a near-optimal control law is also shown. The effectiveness of the proposed method is illustrated with a simulation example.

Index Terms— Input constraints, neural networks, optimal control, reinforcement learning, unknown dynamics.

I. INTRODUCTION

THE theory of optimal control is concerned with finding a control policy that steers a dynamical system to a desired target in an optimal way with respect to a cost function. Traditional optimal control design methods are generally offline and require complete knowledge of the system dynamics [1]. On the other hand, adaptive control [2], [3] covers efficient techniques for the control of uncertain systems. However, classical adaptive control methods are generally far from optimal. During the last few decades, reinforcement learning (RL) [4]–[6] has successfully provided a means to design adaptive controllers in an optimal manner. A class of iterative RL-based adaptive optimal controllers, called approximate dynamic programming (ADP), was first developed by Werbos

Manuscript received April 6, 2013; revised July 5, 2013; accepted July 26, 2013. Date of publication August 21, 2013; date of current version September 27, 2013. This work was supported in part by the NSF under Grant ECCS-1128050, in part by the ARO under Grant W91NF-05-1-0314, in part by the AFOSR under Grant FA9550-09-1-0278, in part by the China NNSF under Grant 61120106011, and in part by the China Education Ministry Project 111 under Grant B08015. H. Modares and M.-B. Naghibi-Sistani are with the Department of Electrical Engineering, Ferdowsi University of Mashhad, Mashhad, Iran (e-mail: [email protected]; [email protected]). F. L. Lewis is with University of Texas at Arlington Research Institute, Fort Worth, TX 76118, USA (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2013.2276571

[7]–[9] for discrete-time (DT) systems. ADP controllers are implemented on actor–critic structures [10], where two coupled learning networks, called the critic and the actor, are tuned online to approximate the optimal value function and the optimal control solution. To obviate the need to know the system dynamics, Werbos [7]–[9] employed a third network in conjunction with the actor–critic structure to approximate the unknown DT system dynamics. Owing to their universal approximation, learning, and adaptation abilities, neural networks (NNs) have been used efficiently in ADP controllers to approximate the value function and the unknown DT system dynamics. Until recently, the convergence of NN-based ADP controllers was established under the condition that the iterative control at each iteration can be obtained accurately. This condition is not practical because of the existence of NN function approximation errors. Recently, however, an effective ADP method based on finite approximation errors was presented in [11], which established convergence conditions that take the NN function approximation errors into account. The performance of ADP-based controllers has also been verified successfully on several real-world applications, such as cellular networks [12], grasping control of a three-finger gripper [13], turbogenerators [14], and engine torque and air–fuel ratio control [15]. A survey of ADP-based feedback control designs can be found in [16]–[19].

Extensions of RL-based controllers to continuous-time (CT) systems have been considered by many researchers [20]–[30]. The existing RL algorithms for CT nonlinear systems either lack a rigorous stability analysis [20]–[22] and/or require at least partial knowledge of the system dynamics [21]–[28]. To obviate the need for complete knowledge of the system dynamics, two different approaches were presented. In the first approach, a partially model-free RL algorithm, namely integral reinforcement learning (IRL), was introduced in [25] and [26] by developing a novel Bellman equation. The extension of the IRL algorithm to systems with a discounted performance function was presented in [27]. In the second approach, presented in [28], an NN was employed in conjunction with an actor–critic controller to identify the unknown system drift dynamics. It was shown that the state estimate of the NN identifier converges to the true state. However, it was not guaranteed that the NN identifier weights stay bounded in a compact neighborhood of their ideal values. It is shown in this paper that the convergence of the identifier weights is a crucial task to be accomplished to ensure the convergence of the policy iteration (PI) algorithm to a near-optimal control solution. Moreover, this method still requires complete knowledge of the input dynamics.


To our knowledge, no RL-based algorithm has been designed in the literature to learn the optimal control solution for completely unknown CT nonlinear systems. For linear systems, however, Lee et al. [29] and Jiang and Jiang [30] used the IRL idea to present online RL algorithms for completely unknown systems.

Another important issue that is not considered in the existing RL-based controllers for uncertain CT systems [25]–[30] is the amplitude limitation on the control inputs. In fact, these methods offer no guarantee of keeping the control inputs within their permitted bounds during and after learning. This may result in performance degradation or even system instability. For constrained-input systems with completely known dynamics, an offline PI algorithm in [31] and an online PI algorithm in our recent work [32] were presented to solve the optimal H2 and H∞ control problems, respectively. However, system uncertainties make it a tremendous challenge to obtain tuning laws for the actor and identifier NN weights that guarantee the stability of the whole system while converging to a near-optimal solution.

This paper investigates a new actor–critic-based PI algorithm for nonlinear systems as an approach to address these limitations, including input constraints and completely unknown dynamics. A suitable nonquadratic cost functional is used to encode the input constraints into the optimization problem. To avoid the requirement of knowing the system dynamics, inspired by the work of Werbos [7]–[9], a novel NN identifier is developed to identify the unknown system dynamics. It is shown here how the identifier weights estimation error affects convergence of the critic NN to the optimal solution. To guarantee convergence of the identifier weights to their true values, the existing system identifiers generally need the standard persistence of excitation condition to be satisfied, which is very difficult or even impossible to check online. To overcome this problem, a new NN identifier structure is developed that guarantees exponential convergence of the identifier weights to a small region around their ideal values, provided that an easy-to-check persistence of excitation condition is satisfied. To this end, the experience replay technique, which was introduced in the context of RL [33]–[35] and was used later in the context of adaptive control [36], [37], is employed for updating the NN identifier weights. That is, recorded past experiences are used simultaneously with current data for the adaptation of the identifier weights. It is also shown that this technique decreases the error bound of the identifier weights. The learning of the actor, critic, and NN identifier is continuous and simultaneous, and it is shown how tuning of the system model influences tuning of the actor and critic networks. The stability of the overall system and the boundedness of the actor and critic NN weights are assured by using Lyapunov theory.

The main contributions of this paper include the following.
1) Unlike the existing RL algorithms for CT systems, which require complete or at least partial knowledge of the system dynamics, the proposed method does not require any knowledge of the system dynamics. To this end, an actor–critic algorithm along with a novel NN identifier is developed to learn the optimal control solution.
2) It is shown that the convergence of the identifier weights to their true values is essential to ensure the convergence of the RL algorithm to a near-optimal control solution. Accordingly, the experience replay technique is used to develop a novel NN identifier that guarantees convergence of the identifier weights to their true values, provided that an easy-to-check persistence of excitation condition is satisfied.
3) To avoid performance degradation or even system instability, and unlike the existing RL methods for uncertain CT systems, actuator saturation is taken into account in the adaptive optimal control design.

This paper is organized as follows. The next section provides preliminary definitions. An overview of the optimal control of constrained-input CT systems is given in Section III. A novel system identifier is developed to identify the unknown system models in Section IV. The proposed online PI algorithm is presented in Section V. Sections VI and VII present the simulation results and conclusion, respectively.

II. NOTATIONS AND DEFINITIONS

Throughout the paper, R denotes the real numbers, R^n denotes the real n-vectors, and R^{m×n} denotes the real m × n matrices. For a scalar v, |v| denotes the absolute value of v. For a vector x, ‖x‖ denotes the Euclidean norm of x. For a matrix M, ‖M‖ denotes the induced 2-norm of M, and λ_min(M) denotes the minimum eigenvalue of M.

Definition 1 (Uniformly Ultimately Bounded (UUB) Stability [38]): Consider the nonlinear system

\dot{x} = f(x, t)    (1)

with state x(t) ∈ R^n. The equilibrium point x_0 is said to be UUB stable if there exists a compact set Ω ⊂ R^n such that, for all x(t_0) ∈ Ω, there exist a bound B and a time T(B, x_0) such that ‖x(t) − x_0‖ ≤ B for all t ≥ t_0 + T.

Definition 2 (Zero-State Observability [39]): System (1) with measured output y = h(x) is zero-state observable if y(t) ≡ 0 ∀t ≥ 0 implies that x(t) ≡ 0 ∀t ≥ 0.

Definition 3 (Persistently Exciting (PE) [39]): The bounded vector signal z(t) is PE over the interval [t, t + T_1] if there exist γ_1 > 0 and γ_2 > 0 such that, for all t,

\gamma_1 I \le \int_t^{t+T_1} z(\tau)\, z^T(\tau)\, d\tau \le \gamma_2 I.    (2)
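Definition 3 is used repeatedly later for the regressors β̄ and β̄_1. For intuition only, a sampled-data check of (2) might look like the sketch below; uniform sampling and the helper names are assumptions, not a construction used in the paper.

```python
# Illustrative sketch (an assumption): numerically test the PE condition (2)
# for a sampled signal z(t) over a window of length T1 = N*dt.
import numpy as np

def pe_gram(z_samples, dt):
    """Rectangle-rule approximation of the integral of z(tau) z(tau)^T over the window."""
    Z = np.asarray(z_samples)          # shape (N, d): N samples of z within the window
    return dt * Z.T @ Z

def is_persistently_exciting(z_samples, dt, gamma1):
    """True if the windowed Gram matrix is bounded below by gamma1 * I."""
    M = pe_gram(z_samples, dt)
    return float(np.linalg.eigvalsh(M)[0]) >= gamma1
```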

III. BACKGROUND: OPTIMAL CONTROL AND POLICY ITERATION FOR CONSTRAINED-INPUT SYSTEMS

In this section, the optimal control problem for CT nonlinear systems with input constraints is formulated, and an offline PI algorithm is given for solving it.

Consider the affine CT dynamical system described by

\dot{x} = f(x) + g(x)\, u(x)    (3)

where x ∈ R^n is the measurable system state vector, f(x) ∈ R^n is the drift dynamics of the system, g(x) ∈ R^{n×m} is the input dynamics of the system, and u(x) ∈ R^m is the control input. We denote by Ω_u = {u : u ∈ R^m, |u_i(x)| ≤ λ, i = 1, ..., m} the set of all inputs satisfying the input constraints, where λ is a known saturating bound for the actuators. It is assumed that f(0) = 0, that f(x) + g(x)u is Lipschitz, and that the system is stabilizable.

Define the performance index

V(x(t)) = \int_t^{\infty} \big( Q(x(\tau)) + U(u(\tau)) \big)\, d\tau    (4)

where Q(x) is a positive-definite monotonically increasing function and U(u) is a positive-definite integrand function. In this general performance index, if there are no constraints on the system states and the control inputs, Q(x) and U(u) are usually chosen as the quadratic forms Q(x) = x^T S x, S = S^T > 0, and U(u) = u^T R u, R = R^T > 0.

Assumption 1: The performance functional (4) satisfies zero-state observability.

Definition 4 (Admissible Control [31], [40]): A control policy μ(x) is said to be admissible with respect to (4) on Ω, denoted by μ(x) ∈ π(Ω), if μ(x) is continuous on Ω, μ(0) = 0, u(x) = μ(x) stabilizes (3) on Ω, and V(x) is finite ∀x ∈ Ω.

To deal with the input constraints, i.e., to ensure that u(x) ∈ Ω_u, the following generalized nonquadratic functional was employed in the literature [31], [41], [42]:

U(u) = 2 \int_0^{u} \big( \lambda\, \beta^{-1}(v/\lambda) \big)^T R\, dv    (5)

where v ∈ R^m, β(·) ∈ R^m is a continuous one-to-one bounded function satisfying |β(·)| ≤ 1 with β(0) = 0, and R = diag(r_1, ..., r_m) > 0 is assumed to be diagonal for simplicity of analysis. Moreover, β(·) is a monotonic odd function and its first derivative is bounded by a constant. In this paper, we use the well-known hyperbolic tangent β(·) = tanh(·), and therefore

U(u) = 2 \int_0^{u} \big( \lambda \tanh^{-1}(v/\lambda) \big)^T R\, dv.    (6)

Using (6) in (4) and differentiating V along the system trajectories, the following Bellman equation is obtained:

Q(x) + 2 \int_0^{u} \big( \lambda \tanh^{-1}(v/\lambda) \big)^T R\, dv + \nabla V^T(x) \big( f(x) + g(x)u(x) \big) = 0    (7)
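The scalar integrals in (6) have a closed form, which is convenient when the cost or the Bellman residual is evaluated in simulation. The sketch below (not from the paper; the function names and the SciPy cross-check are assumptions) evaluates U(u) for the diagonal R and saturation bound λ of (5)–(6).

```python
# Sketch: closed-form evaluation of the nonquadratic cost U(u) in (6),
# U(u) = 2 * sum_i r_i * integral_0^{u_i} lam*atanh(v/lam) dv,
# cross-checked against numerical quadrature.
import numpy as np
from scipy.integrate import quad

def U_closed_form(u, lam, R_diag):
    """2*r_i*lam*u_i*atanh(u_i/lam) + r_i*lam^2*ln(1 - (u_i/lam)^2), summed over i."""
    u = np.asarray(u, dtype=float)
    return float(np.sum(2.0 * R_diag * lam * u * np.arctanh(u / lam)
                        + R_diag * lam**2 * np.log(1.0 - (u / lam)**2)))

def U_quadrature(u, lam, R_diag):
    """Direct numerical evaluation of (6), component by component."""
    total = 0.0
    for ui, ri in zip(np.atleast_1d(u), np.atleast_1d(R_diag)):
        val, _ = quad(lambda v: 2.0 * ri * lam * np.arctanh(v / lam), 0.0, ui)
        total += val
    return total

if __name__ == "__main__":
    lam, R_diag, u = 1.0, np.array([1.0]), np.array([0.7])
    print(U_closed_form(u, lam, R_diag), U_quadrature(u, lam, R_diag))  # should agree
```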

where ∇V(x) = ∂V(x)/∂x ∈ R^n. Let V*(x) be the optimal cost function, defined as

V^*(x(t)) = \min_{u(\tau) \in \pi(\Omega),\; t \le \tau < \infty} \int_t^{\infty} \big( Q(x(\tau)) + U(u(\tau)) \big)\, d\tau.    (8)

IV. SYSTEM IDENTIFICATION USING NEURAL NETWORKS

The unknown dynamics f(x) and g(x) are approximated on a compact set by single-layer NNs as f(x) = θ*ξ(x) + ε_f(x) and g(x) = ψ*ς(x) + ε_g(x), where θ* and ψ* are the unknown ideal identifier weights, ξ(x) and ς(x) are known basis (activation) functions, and ε_f and ε_g are bounded reconstruction errors with ‖ε_f(x)‖ ≤ ε*_f and ‖ε_g(x)‖ ≤ ε*_g. The system (3) can then be written as

\dot{x} = \theta^* \xi(x) + \psi^* \varsigma(x)\, u + \varepsilon_T    (17)

where ε_T = ε_f(x) + ε_g(x)u is the overall reconstruction error, which is bounded because the control input is bounded. Collecting the weights as φ* = [θ*  ψ*] and the regressor as z(x, u) = [ξ(x)^T  (ς(x)u)^T]^T ∈ R^d, the system (17) becomes

\dot{x} = \phi^* z(x, u) + \varepsilon_T.    (19)

Adding and subtracting the term Ax, with A = aI_n and a > 0, on the right-hand side of the system (19) results in

\dot{x} = -A x + \phi^* z(x, u) + A x + \varepsilon_T.    (20)

In the following lemma, we develop a filtered regressor form for (19). A different filtered regressor approach was used in [44] for dynamic neural networks.

Lemma 1: The system (3), (19) can be expressed as

x = \phi^* h(x) + a\, l(x) + \varepsilon    (21)

\dot{h}(x) = -a\, h(x) + z(x, u), \quad h(0) = 0; \qquad \dot{l}(x) = -A\, l(x) + x, \quad l(0) = 0    (22)

where h(x) ∈ R^d is the filtered regressor version of z(x, u), ε(t) = e^{-At} x(0) + \int_0^t e^{-A(t-\tau)} \varepsilon_T\, d\tau, and x(0) is the initial state of the system (19).

Proof: Equation (20) can be written componentwise as

\dot{x}_i = -a x_i + \phi_i^* z(x, u) + a x_i + \varepsilon_{T,i}, \quad i = 1, \ldots, n.    (23)

In the following, we describe how the experience replay technique can be used for adaptation of the identifier weights to provide an easy-to-check condition for convergence of the NN identifier weights error to a small region around zero. To develop an adaptive law that does not require the system to be stable, each side of (21) is divided by a normalizing signal n_s to obtain

\bar{x} = \phi^* \bar{h}(x) + a\, \bar{l}(x) + \bar{\varepsilon}    (29)

where n_s = 1 + h^T h + l^T l, and x̄ = x/n_s, h̄ = h/n_s, l̄ = l/n_s, and ε̄ = ε/n_s are the normalized versions of x, h, l, and ε, respectively. Based on Lemma 1 and (29), consider the identifier weights estimator of the form

\hat{\bar{x}}(t) = \hat{\phi}(t)\, \bar{h}(x(t)) + a\, \bar{l}(x(t))    (30)

where φ̂(t) = [θ̂(t)  ψ̂(t)] is the estimated value of the identifier weights matrix φ* at time t. Since the control input u is bounded, ε_T in (17), and consequently ε in (21) and ε̄ in (29), are bounded. Define the state estimation error as

e(t) = \hat{\bar{x}}(t) - \bar{x}(t).    (31)

Using (29)–(31), the state estimation error is given by

e(t) = \tilde{\phi}(t)\, \bar{h}(x(t)) - \bar{\varepsilon}    (32)
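In discrete time, the filters of Lemma 1 and the normalized prediction error of (29)–(32) might be propagated as sketched below. The explicit Euler integration, the step size dt, and the helper names are assumptions made for illustration, not the paper's implementation.

```python
# Sketch: filtered regressors h, l of (22) and the normalized estimation error of (30)-(32).
import numpy as np

def step_filters(h, l, x, z, a, dt):
    """One Euler step of (22): h' = -a*h + z(x,u), l' = -a*l + x (with A = a*I)."""
    h = h + dt * (-a * h + z)
    l = l + dt * (-a * l + x)
    return h, l

def prediction_error(phi_hat, h, l, x, a):
    """e = phi_hat @ h_bar + a*l_bar - x_bar, cf. (29)-(32); returns e and h_bar."""
    n_s = 1.0 + h @ h + l @ l            # normalizing signal of (29)
    h_bar, l_bar, x_bar = h / n_s, l / n_s, x / n_s
    e = phi_hat @ h_bar + a * l_bar - x_bar
    return e, h_bar
```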


where φ̃(t) = φ̂(t) − φ* is the identifier weights estimation error at time t.

The idea of experience replay is used to provide an easy-to-check persistence of excitation condition. This excitation condition is based on stored or recorded past data. Let

Z = [\bar{h}(x(t_1)), \ldots, \bar{h}(x(t_p))]    (33)

denote the recorded data collected and stored in the history stack at past times t_1, ..., t_p. Define

e(t_j) = \hat{\bar{x}}(t, t_j) - \bar{x}(t_j), \quad j = 1, \ldots, p    (34)

as the state estimation error obtained for the jth stored sample, where

\hat{\bar{x}}(t, t_j) = \hat{\phi}(t)\, \bar{h}(x(t_j)) + a\, \bar{l}(x(t_j))    (35)

is the state estimate at time t_j using the current estimated weight matrix φ̂(t). Substituting x̄(t_j) from (29) in (34) yields

e(t_j) = \tilde{\phi}(t)\, \bar{h}(x(t_j)) - \bar{\varepsilon}(t_j).    (36)

The proposed experience-replay-based learning algorithm for the identifier weights is given by

\dot{\hat{\phi}}(t) = -\Gamma\, e(t)\, \bar{h}(x(t))^T - \Gamma \sum_{j=1}^{p} e(t_j)\, \bar{h}(x(t_j))^T    (37)

where Γ > 0 denotes a positive-definite learning rate matrix, and p is the number of data points stored in the history stack.

Remark 3: In the experience replay tuning law (37), the second term depends on the history stack of previous estimation errors and normalized filtered regressors. The next theorem shows that the experience replay can decrease the error bound of the identifier weights.

We show in the next theorem that, using the experience replay update law (37), φ̃ converges to a small region around zero if the following condition is satisfied.

Condition 1: The recorded data Z in (33) contains as many linearly independent elements as the dimension of the basis of the uncertainty h(x) in (21). That is, rank(Z) = d.

Remark 4: This condition is related to the persistence of excitation condition. However, in contrast to the standard persistence of excitation condition, which is very difficult or even impossible to verify online, Condition 1 can easily be checked online.

Theorem 1: Consider the system (19). Let the online identifier weights estimation law be given by the update law (37) with the filtered regressor (22). Then, if the recorded data points satisfy Condition 1:
a) for ε_T = 0 (no reconstruction error), the identifier weights estimation error converges to zero exponentially fast;
b) for bounded model approximation error, the identifier weights estimation error is UUB.
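A discrete-time sketch of the experience-replay update (37), together with the rank test of Condition 1, is given below. The HistoryStack container, the scalar learning rate Gamma, and the Euler step are illustrative assumptions; the paper states the law in continuous time with a matrix learning rate.

```python
# Sketch of (37) with replayed data and the Condition 1 rank check.
import numpy as np

class HistoryStack:
    def __init__(self, max_points):
        self.h_bars, self.l_bars, self.x_bars = [], [], []
        self.max_points = max_points

    def maybe_record(self, h_bar, l_bar, x_bar):
        if len(self.h_bars) < self.max_points:
            self.h_bars.append(h_bar); self.l_bars.append(l_bar); self.x_bars.append(x_bar)

    def condition_1_holds(self, d):
        """Condition 1: rank of Z = [h_bar(t_1) ... h_bar(t_p)] equals d."""
        if not self.h_bars:
            return False
        Z = np.column_stack(self.h_bars)
        return np.linalg.matrix_rank(Z) == d

def update_identifier(phi_hat, stack, h_bar, l_bar, x_bar, a, Gamma, dt):
    """Euler step of (37): current error plus the replayed errors from the stack."""
    e_now = phi_hat @ h_bar + a * l_bar - x_bar                 # (30)-(31)
    dphi = -Gamma * np.outer(e_now, h_bar)
    for hj, lj, xj in zip(stack.h_bars, stack.l_bars, stack.x_bars):
        e_j = phi_hat @ hj + a * lj - xj                        # (34)-(35)
        dphi -= Gamma * np.outer(e_j, hj)
    return phi_hat + dt * dphi
```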


Proof: Using (32), (36), and (37), one has

\dot{\hat{\phi}}(t) = -\Gamma e(t)\bar{h}(x(t))^T - \Gamma \sum_{j=1}^{p} e(t_j)\bar{h}(x(t_j))^T
= -\Gamma \tilde{\phi}(t)\Big[ \bar{h}(x(t))\bar{h}(x(t))^T + \sum_{j=1}^{p} \bar{h}(x(t_j))\bar{h}(x(t_j))^T \Big]
+ \Gamma \Big[ \bar{\varepsilon}(t)\bar{h}(x(t))^T + \sum_{j=1}^{p} \bar{\varepsilon}(t_j)\bar{h}(x(t_j))^T \Big].    (38)

Consider the Lyapunov function

V_\phi = \frac{1}{2}\,\mathrm{tr}\big( \tilde{\phi}(t)\, \Gamma^{-1} \tilde{\phi}(t)^T \big).    (39)

Differentiating V_φ along the trajectories of (37), and noting that φ̇̃(t) = φ̇̂(t), results in

\dot{V}_\phi = \mathrm{tr}\big( \dot{\tilde{\phi}}(t)\, \Gamma^{-1} \tilde{\phi}(t)^T \big)
= -\mathrm{tr}\Big\{ \tilde{\phi}(t)\Big[ \bar{h}(x(t))\bar{h}(x(t))^T + \sum_{j=1}^{p} \bar{h}(x(t_j))\bar{h}(x(t_j))^T \Big] \tilde{\phi}(t)^T \Big\}
+ \mathrm{tr}\Big\{ \Big[ \bar{\varepsilon}(t)\bar{h}(x(t))^T + \sum_{j=1}^{p} \bar{\varepsilon}(t_j)\bar{h}(x(t_j))^T \Big] \tilde{\phi}(t)^T \Big\}.    (40)

If Condition 1 is satisfied, then

\sum_{j=1}^{p} \bar{h}(x(t_j))\, \bar{h}(x(t_j))^T > 0.

Therefore, for ε̄ = 0, we obtain V̇_φ < 0. This implies that φ̃(t) converges to zero exponentially fast and completes the proof of part a).

For the proof of part b), define

\bar{B} = \bar{h}(x(t))\bar{h}(x(t))^T + \sum_{j=1}^{p} \bar{h}(x(t_j))\bar{h}(x(t_j))^T    (41)

\varepsilon_N = \bar{\varepsilon}(t)\bar{h}(x(t))^T + \sum_{j=1}^{p} \bar{\varepsilon}(t_j)\bar{h}(x(t_j))^T    (42)

where B̄ is positive definite as long as Condition 1 is satisfied. Then, using (41) and (42) in (40), it can be concluded that V̇_φ is negative if

\|\tilde{\phi}\| \ge \frac{\|\varepsilon_N\|}{\lambda_{\min}(\bar{B})}.    (43)

Using the definitions of ε_N, ε̄, ε, and ε_T in (42), (29), (21), and (17), the following bound for ε_N is obtained:

\|\varepsilon_N\| \le \frac{\sqrt{p+1}\,\big( \varepsilon_f^* + \lambda \varepsilon_g^* \big)}{a\, n_s}    (44)

where n_s is the normalization signal defined in (29) and a is the identifier design parameter. From (44), it is obvious that one can decrease ‖ε_N‖ by choosing a larger design parameter a. Using (44) in (43), it can be concluded that V̇_φ is negative if

\|\tilde{\phi}\| > \sqrt{p+1}\; \frac{\varepsilon_f^* + \lambda \varepsilon_g^*}{a\, \lambda_{\min}(B)}    (45)


where B = n_s B̄ or, equivalently,

B = h(x(t))\, h(x(t))^T + \sum_{j=1}^{p} h(x(t_j))\, h(x(t_j))^T.    (46)

In other words, V̇_φ is negative outside of the compact set

\Omega_{\tilde{\phi}} = \Big\{ \tilde{\phi} : \|\tilde{\phi}\| \le \sqrt{p+1}\; \frac{\varepsilon_f^* + \lambda \varepsilon_g^*}{a\, \lambda_{\min}(B)} \Big\}.    (47)

This completes the proof of part b). Note that, since ‖h̄‖ < 1 and ε̄ is bounded, the state estimation error in (32) is also bounded.

Remark 5: The NN weights estimation error is bounded even if the system is unstable. As will be shown in the next section, the identifier weights estimation error affects the convergence of the critic NN weights. To decrease this error bound, one can choose a large identifier design parameter a. Moreover, as can be seen from (47), one can record data points that maximize λ_min(B) to decrease the error bound.
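One way to act on Remark 5 is to manage the history stack so that λ_min(B) stays as large as possible. The greedy selection rule below is an assumption made in the spirit of the concurrent-learning literature cited as [36]; the paper itself does not prescribe a particular recording policy.

```python
# Sketch (assumption): keep or swap recorded regressors to increase lambda_min(B).
import numpy as np

def lambda_min_of_stack(h_list):
    B = sum(np.outer(h, h) for h in h_list)       # B as in (46), sum of outer products
    return float(np.linalg.eigvalsh(B)[0])

def maybe_record(h_list, h_new, max_points):
    """Greedy update of the history stack that tries to keep lambda_min(B) large."""
    if len(h_list) < max_points:
        h_list.append(h_new)
        return h_list
    base = lambda_min_of_stack(h_list)
    best_gain, best_idx = 0.0, None
    for i in range(len(h_list)):
        trial = h_list[:i] + h_list[i + 1:] + [h_new]
        gain = lambda_min_of_stack(trial) - base
        if gain > best_gain:
            best_gain, best_idx = gain, i
    if best_idx is not None:
        h_list[best_idx] = h_new
    return h_list
```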

V. ONLINE POLICY ITERATION ALGORITHM FOR UNKNOWN SYSTEM DYNAMICS

An online PI algorithm is now given to learn the optimal control solution for unknown constrained-input CT systems using the NN identifier introduced in the previous section. The offline PI Algorithm 3.1 is used to motivate the structure of this online PI algorithm. The learning structure uses the value function approximator with two NNs, an actor and a critic NN, which approximate the Bellman equation (13) and its corresponding policy (14). Instead of sequentially updating the critic and actor NNs, as in Algorithm 3.1, both networks are updated simultaneously in real time. We call this synchronous online PI. Fig. 1 shows the block diagram of the proposed PI-based adaptive optimal controller.

Fig. 1. Proposed PI-based adaptive optimal control law.

Before presenting our main results in Section V-C, the critic NN structure and its tuning and convergence are presented in the following two subsections.

A. Value Function Approximation Using Neural Networks

Assuming the value function is a smooth function, then, according to the Weierstrass high-order approximation theorem [45], there exists a single-layer NN such that the solution V and its gradient can be uniformly approximated as

V(x) = W_1^T \sigma(x) + \varepsilon_v(x)    (48)

\nabla V(x) = \nabla\sigma(x)^T W_1 + \nabla\varepsilon_v(x)    (49)

where σ(x) ∈ R^l is a basis function vector, ε_v(x) is the approximation error, W_1 ∈ R^l is a constant parameter vector, and l is the number of neurons. Equation (48) defines a critic NN with weights W_1. Based on Assumption 2, we have

\|\varepsilon_v(x)\| \le b_\varepsilon, \qquad \|\nabla\varepsilon_v(x)\| \le b_{\varepsilon x}    (50)

\|\sigma(x)\| \le b_\sigma, \qquad \|\nabla\sigma(x)\| \le b_{\sigma x}.    (51)
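A concrete choice of σ(x) and ∇σ(x) for a two-state system is sketched below; the short exponent list is illustrative only (the simulation in Section VI uses even polynomial terms up to order eight, cf. (77)).

```python
# Sketch: a polynomial critic basis sigma(x) and its gradient, as needed in (48)-(49).
import numpy as np

# each tuple (i, j) represents the monomial x1^i * x2^j
EXPONENTS = [(2, 0), (0, 2), (1, 1), (4, 0), (0, 4), (3, 1), (2, 2), (1, 3)]

def sigma(x):
    x1, x2 = x
    return np.array([x1**i * x2**j for i, j in EXPONENTS])

def grad_sigma(x):
    """Jacobian of sigma, shape (l, n): row k holds d(sigma_k)/dx."""
    x1, x2 = x
    rows = []
    for i, j in EXPONENTS:
        d1 = i * x1**(i - 1) * x2**j if i > 0 else 0.0
        d2 = j * x1**i * x2**(j - 1) if j > 0 else 0.0
        rows.append([d1, d2])
    return np.array(rows)

def value_and_gradient(W1, x):
    """V_hat(x) = W1^T sigma(x) and grad V_hat(x) = grad_sigma(x)^T W1, cf. (48)-(49)."""
    return W1 @ sigma(x), grad_sigma(x).T @ W1
```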

Assuming that the optimal value function is approximated by (48), and using its gradient (49) in (12), the HJB equation can be written as

Q + W_1^T \nabla\sigma\, f + \lambda^2 \bar{R} \ln\!\big(1 - \tanh^2(D)\big) + \varepsilon_{\mathrm{HJB}} = 0    (52)

where D = (1/2λ) R^{-1} g^T ∇σ^T W_1 and

\varepsilon_{\mathrm{HJB}} = \lambda^2 \bar{R} \ln\!\big(1 - \tanh^2(D + 0.5\,\lambda R^{-1} g^T \nabla\varepsilon)\big) - \lambda^2 \bar{R} \ln\!\big(1 - \tanh^2(D)\big) + \nabla\varepsilon^T f    (53)

is the residual error due to the function reconstruction error. For each constant ε_h, we can construct an appropriate NN so that sup_{∀x} ‖ε_HJB‖ ≤ ε_h [31]. Note that in (52), and in the sequel, the variable x is dropped for ease of exposition.

B. Tuning and Convergence of the Critic NN Weights

This subsection presents the tuning and convergence of the critic NN weights for a fixed control policy, in effect designing an observer for the unknown value function. The effect of the NN identifier weights estimation error on the convergence of the critic NN is shown.

For a given control policy, the critic NN is used to approximate the value function [8], [9]. Consider a fixed control policy u(x) and assume that its corresponding value function is approximated by (48). Then, using (49), the Bellman equation (13) becomes

Q + 2 \int_0^{u} \big(\lambda \tanh^{-1}(v/\lambda)\big)^T R\, dv + W_1^T \nabla\sigma\,(f + g u) = \varepsilon_B    (54)

where the residual error due to the function reconstruction error is

\varepsilon_B = -\nabla\varepsilon^T (f + g u).    (55)

Under Assumption 2, this residual error is bounded on the compact set Ω, i.e., sup_{∀x∈Ω} ‖ε_B‖ ≤ ε_max.

The ideal weights of the critic NN, i.e., W_1, which provide the best approximate solution of (54), and the ideal identifier weights θ* and ψ* in (17) that best describe the functions f and g, are unknown and must be approximated in real time.


Hence, the output of the critic NN and the approximate Bellman equation can be written as

\hat{V}(x) = \hat{W}_1^T \sigma    (56)

\delta = Q + 2 \int_0^{u} \big(\lambda \tanh^{-1}(v/\lambda)\big)^T R\, dv + \hat{W}_1^T \nabla\sigma\, (\hat{\theta}\xi + \hat{\psi}\varsigma\, u)    (57)

where Ŵ_1, θ̂, and ψ̂ are the current estimated values of W_1, θ*, and ψ*, respectively. Note that the error δ is the CT counterpart of the temporal difference (TD) error [4]. The problem of finding the value function is now converted to adjusting the critic NN weights such that the TD error δ is minimized. Consider the objective function

E = \frac{1}{2}\, \delta^2.    (58)

From (57) and using the chain rule, the gradient-descent algorithm for E is given by

\dot{\hat{W}}_1 = -\frac{\alpha_1}{(1+\hat{\beta}^T\hat{\beta})^2} \frac{\partial E}{\partial \hat{W}_1} = -\alpha_1\, \delta\, \frac{\hat{\beta}}{(1+\hat{\beta}^T\hat{\beta})^2}    (59)

where β̂ = ∇σ(θ̂ξ + ψ̂ς u), the factor (1 + β̂^Tβ̂)² is used for normalization, and α_1 > 0 is the learning rate.

Theorem 2: Let u(x) be any admissible bounded control policy and consider the adaptive law (59) for tuning the critic NN weights, along with (37) for tuning the identifier NN weights. If β̄ = β̂/(1 + β̂^Tβ̂) is PE, then:
a) for ε_B = 0 and ε_T = 0 (no reconstruction error), the critic weights estimation error converges to zero exponentially fast;
b) for bounded reconstruction error, the critic weights estimation error converges exponentially to a residual set.

Proof: Using (54),

Q + 2 \int_0^{u} \big(\lambda \tanh^{-1}(v/\lambda)\big)^T R\, dv = -W_1^T \nabla\sigma\,(f + g u) + \varepsilon_B.    (60)

Substituting (60) into (57) and doing some manipulations, the TD error becomes

\delta = -\tilde{W}_1^T \nabla\sigma\,(\hat{\theta}\xi + \hat{\psi}\varsigma\, u) - W_1^T \nabla\sigma\,(\tilde{\phi} z + \varepsilon_T) + \varepsilon_B    (61)

where W̃_1 = W_1 − Ŵ_1 is the critic weights estimation error, and z and φ̃ are defined in (19). Substituting (61) into (59), and denoting β̄ = β̂/(1 + β̂^Tβ̂) and m_s = 1 + β̂^Tβ̂, the critic weights error dynamics is obtained as

\dot{\tilde{W}}_1 = -\alpha_1 \bar{\beta}\bar{\beta}^T \tilde{W}_1 + \alpha_1 \frac{\bar{\beta}}{m_s} \big( W_1^T \nabla\sigma\,(\tilde{\phi} z + \varepsilon_T) + \varepsilon_B \big).    (62)

Viewing (62) as a linear time-varying system with input W_1^T∇σ(φ̃z + ε_T) + ε_B, the solution W̃_1 is given by

\tilde{W}_1(t) = \varphi(t, t_0)\tilde{W}_1(0) + \alpha_1 \int_{t_0}^{t} \varphi(t, \tau)\, \frac{\bar{\beta}}{m_s} \big( W_1^T \nabla\sigma\,(\tilde{\phi} z + \varepsilon_T) + \varepsilon_B \big)\, d\tau    (63)

with the state transition matrix defined by

\frac{\partial \varphi(t, t_0)}{\partial t} = -\alpha_1 \bar{\beta}\bar{\beta}^T \varphi(t, t_0), \qquad \varphi(t_0, t_0) = I.    (64)

For the time-varying system (64), the equilibrium state is exponentially stable provided that β̄ is PE [39]. Therefore, if β̄ is PE, it gives

\|\tilde{W}_1\| \le \eta_1 e^{-\eta_2 t} + \frac{\alpha_1}{m_s \eta_2}\, b_{\sigma x} \|W_1\| \big( \|z\|\, \|\tilde{\phi}\| + \|\varepsilon_T\| \big) + \frac{\alpha_1}{m_s \eta_2}\, \|\varepsilon_B\|    (65)

for some η_1, η_2 > 0. For ε_B = 0 and ε_T = 0, φ̃ converges to zero exponentially fast, as was proved in Theorem 1, and (65) becomes

\|\tilde{W}_1\| \le \eta_1 e^{-\eta_2 t} + \frac{\alpha_1 \eta_3}{m_s \eta_2}\, b_{\sigma x} \|W_1\|\, \|z\|\, e^{-\eta_4 t}    (66)

for some η_3, η_4 > 0. Based on Assumption 2 and the boundedness of the control input, z is bounded. Therefore, W̃_1 converges to zero exponentially fast, and this completes the proof of a).

For ε_B ≠ 0 and ε_T ≠ 0, using ‖ε_B‖ ≤ ε_max and (47), and the definition of ε_T in (17), (65) becomes

\|\tilde{W}_1\| \le \eta_1 e^{-\eta_2 t} + \frac{\alpha_1}{m_s \eta_2}\, b_{\sigma x} \|W_1\|\, \|z\|\, \sqrt{p+1}\, \frac{\varepsilon_f^* + \lambda \varepsilon_g^*}{a\, n_s\, \lambda_{\min}(B)} + \frac{\alpha_1}{m_s \eta_2}\, b_{\sigma x} \|W_1\| \big( \varepsilon_f^* + \lambda \varepsilon_g^* \big) + \frac{\alpha_1}{m_s \eta_2}\, \varepsilon_{\max}.    (67)

This completes the proof of b).

Remark 6: The effect of the identifier weights estimation error φ̃ on the convergence of the critic NN is the appearance of the second term in (65). It can be seen from this equation that it is essential to design a system identifier that guarantees the boundedness of its weights to ensure convergence of the critic NN weights to their true values. As was shown in Theorem 1, the bound of the proposed NN identifier weights error can be decreased by increasing the identifier design parameter a and also by recording efficient data in the history stack.

Remark 7: The requirement for the regressor β̄ to be PE is crucial for proper convergence of the critic NN. Future efforts can focus on obtaining an easy-to-check persistence of excitation condition by employing the experience replay to reuse recorded TD errors for learning of the critic NN weights.

C. Actor NN and the Proposed Synchronous PI Algorithm

This subsection presents our main results. An online PI algorithm is given that involves simultaneous and synchronous tuning of the actor, critic, and identifier NNs. First, the actor NN structure is developed.

In the policy improvement step (14) of Algorithm 3.1, an improved control policy is found according to the current estimated value function. Assume that Ŵ_1 and ψ̂ are the current estimates of the optimal critic NN weights in (56) and the ideal parameters of the system input dynamics, respectively. Then, according to (14), one can update the control policy as

u_1 = -\lambda \tanh\!\Big( \frac{1}{2\lambda} R^{-1} (\hat{\psi}\varsigma)^T \nabla\sigma^T \hat{W}_1 \Big).    (68)

However, this policy improvement does not guarantee the stability of the system.


Therefore, to ensure stability in a Lyapunov sense (as will be discussed later), the following modified policy update law is used:

\hat{u}_1 = -\lambda \tanh\!\Big( \frac{1}{2\lambda} R^{-1} (\hat{\psi}\varsigma)^T \nabla\sigma^T \hat{W}_2 \Big)    (69)

where Ŵ_2 denotes the actor NN weights, which provide the current estimated value of W_1. Define the actor NN estimation error as

\tilde{W}_2 = W_1 - \hat{W}_2.    (70)

Assumption 3 [19]: The following assumptions are considered on the system dynamics:
a) ‖f(x)‖ ≤ b_f ‖x‖;
b) g(x) is bounded by a constant, i.e., ‖g(x)‖ ≤ b_g.

We now present the main theorem, which provides the tuning laws for the actor and critic NNs that ensure convergence of the proposed PI algorithm to a near-optimal control law while guaranteeing stability. Define

\hat{D} = \frac{1}{2\lambda} R^{-1} (\hat{\psi}\varsigma)^T \nabla\sigma^T \hat{W}_2    (71)

\hat{U} = \hat{W}_2^T \nabla\sigma\, \hat{\psi}\varsigma\, \lambda \tanh(\hat{D}) + \lambda^2 \bar{R} \ln\!\big(1 - \tanh^2(\hat{D})\big).    (72)

Theorem 3: Consider the system (17) with θ* and ψ* unknown. Let the NN identifier weights be updated by (37) along with the filtered regressor (22). Let the tuning law for the critic NN be

\dot{\hat{W}}_1 = -\alpha_1 \frac{\hat{\beta}_1}{(1+\hat{\beta}_1^T\hat{\beta}_1)^2} \big( Q + \hat{U} + \hat{W}_1^T \hat{\beta}_1 \big)    (73)

where β̂_1 = ∇σ(θ̂ξ + ψ̂ς û_1). Let the control law be given by (69) and the actor NN be tuned as

\dot{\hat{W}}_2 = -\alpha_2 \Big( Y_1 \hat{W}_2 + \nabla\sigma\, \hat{\psi}\varsigma\, \lambda \tanh(\hat{D}) - \nabla\sigma\, \hat{\psi}\varsigma\, \lambda \big[ \tanh(\hat{D}) - \mathrm{sgn}(\hat{D}) \big] \frac{\bar{\beta}_1^T}{m_{s1}} \hat{W}_1 \Big)    (74)

where β̄_1 = β̂_1/(1 + β̂_1^Tβ̂_1), m_{s1} = 1 + β̄_1^Tβ̄_1, and Y_1 > 0. Let β̄_1 be PE, and let Assumptions 1–3 and Condition 1 hold. Then the closed-loop system state and the critic and actor NN weights estimation errors are UUB.

Proof: See the Appendix.

Remark 8: The actor and critic NN weights are updated based on the TD error δ in (57), whereas the NN identifier weights are estimated using the system state estimation error e in (31). The identifier therefore does not require the system to be stable, and its design is decoupled from the design of the actor and critic NNs.
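The pieces of the synchronous loop of Theorem 3 might be coded as below: the bounded control (69), the critic update (73), and the actor update (74), all stepped forward together with the identifier update sketched earlier. The shapes, the Euler discretization, and the helper names are assumptions; in particular, beta1_hat is built as ∇σ(θ̂ξ + ψ̂ς û_1), exactly as in the critic sketch, and U_hat is the quantity (72).

```python
# Sketch of the control law (69), critic law (73) and actor law (74), per time step.
import numpy as np

def control_69(W2_hat, g_hat, grad_sig, R_inv, lam):
    """u_hat_1 = -lam*tanh( (1/(2*lam)) * R^{-1} g_hat^T grad(sigma)^T W2_hat ), cf. (69)."""
    D_hat = (1.0 / (2.0 * lam)) * R_inv @ g_hat.T @ grad_sig.T @ W2_hat
    return -lam * np.tanh(D_hat), D_hat

def critic_update_73(W1_hat, Q_x, U_hat, beta1_hat, alpha1, dt):
    dW1 = -alpha1 * beta1_hat * (Q_x + U_hat + W1_hat @ beta1_hat) \
          / (1.0 + beta1_hat @ beta1_hat) ** 2
    return W1_hat + dt * dW1

def actor_update_74(W2_hat, W1_hat, D_hat, g_hat, grad_sig, beta1_hat, Y1, alpha2, lam, dt):
    beta1_bar = beta1_hat / (1.0 + beta1_hat @ beta1_hat)
    m_s1 = 1.0 + beta1_bar @ beta1_bar
    term = grad_sig @ g_hat * lam                      # (l x m) block multiplying tanh / (tanh - sgn)
    dW2 = -alpha2 * (Y1 @ W2_hat
                     + term @ np.tanh(D_hat)
                     - term @ (np.tanh(D_hat) - np.sign(D_hat)) * (beta1_bar @ W1_hat) / m_s1)
    return W2_hat + dt * dW2
```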

VI. SIMULATION RESULTS

Consider the following nonlinear system [26]:

\dot{x}_1 = p_1 x_1 + p_2 x_2 + p_3 x_1 (x_1^2 + x_2^2)
\dot{x}_2 = p_4 x_1 + p_5 x_2 + p_6 x_2 (x_1^2 + x_2^2) + p_7 u    (75)

with p = [1, 1, −1, −1, 1, −1, 1]. The control input is limited as |u| ≤ 1. To compare our results with the results obtained by the offline PI method in [31], the cost functional and the form of the critic NN are chosen the same as those in [31]. That is, the nonquadratic cost functional is chosen as

V(x_1, x_2) = \int_0^{\infty} \Big( x_1^2 + x_2^2 + 2 \int_0^{u} \tanh^{-1}(v)\, dv \Big) dt    (76)

and the critic NN is chosen as

V(x_1, x_2) = W_{c1} x_1^2 + W_{c2} x_2^2 + W_{c3} x_1 x_2 + W_{c4} x_1^4 + W_{c5} x_2^4 + W_{c6} x_1^3 x_2 + W_{c7} x_1^2 x_2^2 + W_{c8} x_1 x_2^3 + W_{c9} x_1^6 + W_{c10} x_2^6 + W_{c11} x_1^5 x_2 + W_{c12} x_1^4 x_2^2 + W_{c13} x_1^3 x_2^3 + W_{c14} x_1^2 x_2^4 + W_{c15} x_1 x_2^5 + W_{c16} x_1^8 + W_{c17} x_2^8 + W_{c18} x_1^7 x_2 + W_{c19} x_1^6 x_2^2 + W_{c20} x_1^5 x_2^3 + W_{c21} x_1^4 x_2^4 + W_{c22} x_1^3 x_2^5 + W_{c23} x_1^2 x_2^6 + W_{c24} x_1 x_2^7.    (77)

To show that, using the experience replay update law (37), the identifier weights converge to their true values, it is assumed that the system structure is known but the system parameter vector p is unknown. The proposed PI algorithm is implemented as in Theorem 3. The number of stored data points in the history stack is 15. The learning rate for the NN identifier weights update law (37) is selected as Γ = 15I. The learning rates of the critic and actor NNs are chosen as α_1 = 10 and α_2 = 10, respectively. The design parameter in (74) is chosen as Y_1 = 40I. The actor and critic NN weights are initialized randomly in the range [−5, 5], while all the system parameters are initialized in the range [0, 2]. In fact, the initial critic and actor NN weights are chosen as

W_I = [W_{c1}, \ldots, W_{c24}] = [2.92, 1.84, 0.96, 1.96, 1.39, −1.07, −1.42, −1.00, −1.57, −0.88, 1.14, 0.76, 3.69, 3.02, 2.66, −0.35, 1.22, −1.30, −3.35, −0.12, 2.12, −0.01, 2.49, 2.13].    (78)
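For reproducing the example, the plant (75) with the bound |u| ≤ 1 can be simulated as sketched below; the Euler step size is an assumption.

```python
# Sketch of the simulated plant (75) with the actuator bound lambda = 1.
import numpy as np

P = np.array([1.0, 1.0, -1.0, -1.0, 1.0, -1.0, 1.0])   # p1..p7 from the text

def plant_75(x, u, p=P):
    x1, x2 = x
    r2 = x1**2 + x2**2
    dx1 = p[0]*x1 + p[1]*x2 + p[2]*x1*r2
    dx2 = p[3]*x1 + p[4]*x2 + p[5]*x2*r2 + p[6]*u
    return np.array([dx1, dx2])

def step(x, u, dt=1e-3):
    u = float(np.clip(u, -1.0, 1.0))                   # enforce |u| <= 1
    return x + dt * plant_75(x, u)
```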


Fig. 2. Convergence of the system parameters.

Fig. 3. Convergence of the critic NN weights.

Fig. 4. State trajectory during online learning.

Fig. 5. The system states for the fixed initial control law.

Fig. 6. The control effort for the fixed initial control law.

Fig. 7. Solid lines: the system states for the control law (80). Dashed lines: the system states for the control law of [31].

Fig. 2 shows how the system identifier weights converge to their true values. Fig. 3 shows the convergence of the first 10 parameters of the critic. In fact, the critic NN weights converge to

W_F = [W_{c1}, \ldots, W_{c24}] = [7.50, 5.65, 3.91, 0.09, 2.06, −1.90, −1.78, −1.02, 2.07, 0.10, 2.32, 2.48, 2.62, 2.29, 0.58, 1.90, 1.13, −1.13, −1.31, −1.09, −1.48, −0.94, 1.21, 0.76].    (79)

Then, using (69), the control law is given by

u = -\tanh\big( 2.82 x_1 + 3.91 x_2 + 4.12 x_2^3 - 0.95 x_1^3 - 1.80 x_1^2 x_2 - 1.54 x_1 x_2^2 - 0.32 x_2^5 + 1.16 x_1^5 + 2.48 x_1^4 x_2 + 3.93 x_1^3 x_2^2 + 4.58 x_1^2 x_2^3 + 1.44 x_1 x_2^4 + 4.51 x_2^7 - 0.56 x_1^7 - 1.31 x_1^6 x_2 - 1.63 x_1^5 x_2^2 - 2.96 x_1^4 x_2^3 - 2.23 x_1^3 x_2^4 + 3.628 x_1^2 x_2^5 + 2.645 x_1 x_2^6 \big).    (80)

The states of the system during the online simulation are shown in Fig. 4, where a probing noise is added to the control input to ensure that the PE condition is satisfied. After 200 s, the PE condition is no longer required and the probing noise is removed. After that, the states remain very close to zero, as required.

To show the improvement of the control law due to the proposed online learning and its convergence to a near-optimal control law, the performance of the control law (80), found at the end of the learning phase, is compared with the performance of the initial control law from which the learning is started and with the near-optimal control law found using the offline method of [31]. Figs. 5 and 6 show the performance of the initial control law, starting from a specific initial condition. Note that the initial control law is obtained by substituting the initial weights W_I [see (78)] into (69). The performance of the control law (80) (found at the end of the online learning) and of the near-optimal control law found in [31] are shown in Figs. 7 and 8. From Figs. 5 and 6, it is obvious that, for the initial control law, the system states and the control input are oscillatory.


Fig. 8. Solid line: the control effort for the control law (80). Dashed line: the control effort for the control law of [31].

Comparing Figs. 5 and 6 with Figs. 7 and 8, it is clear that, starting from an oscillatory control law, the control law is improved during learning and successively converges to a near-optimal control law with good performance. Finally, from Figs. 7 and 8, it is obvious that the performance of the control law (80) is slightly better than that obtained in [31], as both the control effort and the system states converge to zero faster for the proposed control law. These results confirm that the proposed method converges to a near-optimal control solution, as the theoretical results suggest.

VII. CONCLUSION

An adaptive algorithm that converges to the optimal state feedback law for unknown CT systems in the presence of input constraints was presented. The learning algorithm was implemented on an actor–critic structure. To eliminate the need for knowledge of the system dynamics, a novel system identifier network was used in conjunction with the actor–critic structure. This was inspired by the work of Werbos [7]–[9] for DT systems. The proposed algorithm is capable of learning the system dynamics and the optimal policy online and simultaneously. The stability of the overall system was also guaranteed. A successful simulation example was presented.

APPENDIX
PROOF OF THEOREM 3

Before starting the proof, note that, whenever needed, θ*ξ + ε_f and ψ*ς + ε_g are used as equivalent to the functions f and g, respectively. Consider the Lyapunov function

J(t) \equiv V(x) + \underbrace{\tfrac{1}{2}\tilde{W}_1^T \alpha_1^{-1} \tilde{W}_1(t)}_{J_1(t)} + \underbrace{\tfrac{1}{2}\tilde{W}_2^T \alpha_2^{-1} \tilde{W}_2}_{J_2(t)}    (A.1)

where V is the optimal value function. The derivative of J is given by

\dot{J} \equiv \dot{V} + \dot{J}_1 + \dot{J}_2.    (A.2)

The first term of (A.2) is

\dot{V} = \big( W_1^T \nabla\sigma + \nabla\varepsilon^T \big)(f + g\hat{u}_1) = W_1^T \nabla\sigma\, f - W_1^T \nabla\sigma\, g \lambda \tanh(\hat{D}) + \varepsilon_0    (A.3)

where D̂ is defined in (71) and ε_0(x) = ∇ε^T (f − gλ tanh(D̂)). Using (50) and Assumption 3,

\|\varepsilon_0\| = \big\| \nabla\varepsilon^T \big( f - g\lambda \tanh(\hat{D}) \big) \big\| \le b_{\varepsilon x} b_f \|x\| + \lambda b_{\varepsilon x} b_g.    (A.4)

From the HJB equation (52), one has

W_1^T \nabla\sigma\, f = -Q - U + W_1^T \nabla\sigma\, g \lambda \tanh(D) + \varepsilon_{\mathrm{HJB}}    (A.5)

where

U = W_1^T \nabla\sigma\, g \lambda \tanh(D) + \lambda^2 \bar{R} \ln\!\big(1 - \tanh^2(D)\big)    (A.6)

is the cost (6) for the approximate optimal control input u = −λ tanh((1/2λ) R^{-1} g^T ∇σ^T W_1). Substituting (A.5) into (A.3), V̇ becomes

\dot{V} = -Q - U + W_1^T \nabla\sigma\, g \lambda \tanh(D) - W_1^T \nabla\sigma\, g \lambda \tanh(\hat{D}) + \varepsilon_{\mathrm{HJB}} + \varepsilon_0.    (A.7)

For the third term of (A.7), based on (51) and Assumption 3, we have

\|W_1^T \nabla\sigma\, g \lambda \tanh(D)\| \le \lambda b_g b_{\sigma x} \|W_1\|.    (A.8)

Also, using W_1 = Ŵ_2 + W̃_2, g = ψ̂ς + ψ̃ς + ε_g, and the fact that x^T tanh(x) > 0 ∀x, for the fourth term of (A.7) one has

W_1^T \nabla\sigma\, g \lambda \tanh(\hat{D}) > \tilde{W}_2^T \nabla\sigma\, \hat{\psi}\varsigma\, \lambda \tanh(\hat{D}) + W_1^T \nabla\sigma\, (\tilde{\psi}\varsigma + \varepsilon_g)\, \lambda \tanh(\hat{D}).    (A.9)

Using (A.4), (A.8), (A.9), and the boundedness of ε_HJB, i.e., sup_{∀x} ‖ε_HJB‖ ≤ ε_h, (A.7) becomes

\dot{V} < -Q + b_{\varepsilon x} b_f \|x\| + \lambda b_{\varepsilon x} b_g + \lambda b_{\tilde{g}} b_{\sigma x} \|W_1\| + \varepsilon_h - \tilde{W}_2^T \nabla\sigma\, \hat{\psi}\varsigma\, \lambda \tanh(\hat{D})    (A.10)

where b_g̃ is defined as the bound for g̃ = ψ̃ς + ε_g. Note that, as was shown in Theorem 1, the identifier weights estimation error ψ̃ is bounded, regardless of the stability of the system. Denoting k_1 = b_εx b_f and k_2 = λ b_g̃ b_σx ‖W_1‖ + λ b_εx b_g + ε_h, and noting that, since Q(x) > 0, there exists q such that x^T q x < Q(x) for x ∈ Ω, (A.10) becomes

\dot{V} < -x^T q\, x + k_1 \|x\| + k_2 - \tilde{W}_2^T \nabla\sigma\, \hat{\psi}\varsigma\, \lambda \tanh(\hat{D}).    (A.11)

For the second term of (A.2), using (73), the HJB equation (52), and (A.6), one has

\dot{J}_1 = \tilde{W}_1^T \frac{\hat{\beta}_1}{(1+\hat{\beta}_1^T\hat{\beta}_1)^2} \Big[ Q + \hat{U} + \hat{W}_1^T \hat{\beta}_1 - Q - U - W_1^T \nabla\sigma \big( \theta^*\xi - \psi^*\varsigma\, \lambda \tanh(D) + \varepsilon_T \big) + \varepsilon_{\mathrm{HJB}} \Big]    (A.12)

where β̂_1 = ∇σ(θ̂ξ − ψ̂ς λ tanh(D̂)) and Û is the cost (6) for the control input û_1 in (69). Adding and subtracting the term W̃_1^T (β̂_1/(1 + β̂_1^Tβ̂_1)²) W_1^T ∇σ(θ̂ξ − ψ̂ς λ tanh(D̂)) on the right-hand side of (A.12) yields

\dot{J}_1 = \tilde{W}_1^T \frac{\hat{\beta}_1}{(1+\hat{\beta}_1^T\hat{\beta}_1)^2} \Big[ \hat{U} - U - \tilde{W}_1^T \hat{\beta}_1 + W_1^T \nabla\sigma\, g \lambda \tanh(D) - W_1^T \nabla\sigma\, \tilde{\theta}\xi - W_1^T \nabla\sigma\, \hat{\psi}\varsigma\, \lambda \tanh(\hat{D}) + W_1^T \nabla\sigma\, \varepsilon_f + \varepsilon_{\mathrm{HJB}} \Big].    (A.13)

For the first term on the right-hand side of (A.13), using (72) and (A.6), one has

\hat{U} - U = \hat{W}_2^T \nabla\sigma\, \hat{\psi}\varsigma\, \lambda \tanh(\hat{D}) + \lambda^2 \bar{R} \ln\!\big(1 - \tanh^2(\hat{D})\big) - W_1^T \nabla\sigma\, g \lambda \tanh(D) - \lambda^2 \bar{R} \ln\!\big(1 - \tanh^2(D)\big).    (A.14)

The term ln(1 − tanh²(D)) can be written as

\ln\!\big(1 - \tanh^2(D)\big) = \ln 4 - 2D - 2\ln\!\big(1 + e^{-2D}\big) = \ln 4 - 2D\,\mathrm{sgn}(D) + \varepsilon_D    (A.15)

where the last equality is obtained using the approximation

\ln\!\big(1 + e^{-2D}\big) \cong \begin{cases} 0, & D > 0 \\ -2D, & D < 0. \end{cases}

Using (A.15), evaluated at D and D̂, in (A.14) leads to the terms

- W_1^T \nabla\sigma\, g \lambda \tanh(D) + W_1^T \nabla\sigma\, g\, \lambda \big( \mathrm{sgn}(D) - \mathrm{sgn}(\hat{D}) \big) + W_1^T \nabla\sigma\, \tilde{\psi}\varsigma\, \lambda\, \mathrm{sgn}(\hat{D}) + \lambda^2 \bar{R} \big( \varepsilon_{\hat{D}} - \varepsilon_D \big)

where φ̃ and z are defined in (19). Note that, under Assumptions 2 and 3, if Condition 1 of Theorem 1 holds, the following bounds can be obtained:

\|M\| \le k_M, \quad \|N\| \le k_N, \quad \|W_1^T \nabla\sigma\, \tilde{\phi}\, z(x, \hat{u}_d)\| \le a_2.    (A.20)

The actor tuning law (74) contains the term Y_1 Ŵ_2 with Y_1 > 0. This adds to J̇ the terms

\tilde{W}_2^T Y_1 \hat{W}_2 = \tilde{W}_2^T Y_1 W_1 - \tilde{W}_2^T Y_1 \tilde{W}_2    (A.24)

and generally, J̇ becomes

\dot{J} \le -x^T q\, x + k_1 \|x\| + k_2 - \tilde{W}_1^T \bar{\beta}_1 \bar{\beta}_1^T \tilde{W}_1 + k_3 \|\tilde{W}_1^T \bar{\beta}_1\| - \tilde{W}_2^T Y_1 \tilde{W}_2 + k_4 \|\tilde{W}_2\|    (A.25)

where k_4 = k_N + ‖Y_1‖ ‖W_1‖. Using (A.25), it can be shown that J̇ is negative if

\|x\| > \frac{k_1}{2\lambda_{\min}(q_1)} + \sqrt{ \frac{k_1^2}{4\lambda_{\min}(q_1)^2} + \frac{k_2}{\lambda_{\min}(q_1)} }    (A.26)

\|\bar{\beta}_1^T \tilde{W}_1\| > \frac{k_3}{r}    (A.27)

\|\tilde{W}_2\| > \frac{k_4}{\lambda_{\min}(Y_1)}.    (A.28)

This completes the proof. Note that, since (A.27) holds on the output β̄_1^T W̃_1 of the error dynamics (62), if β̄_1 is PE, as was shown in Theorem 2, W̃_1 is UUB. In fact, the PE condition is an observability-like condition on β̄_1^T W̃_1, considered as an output of the error dynamics (62). This PE condition shows that, under (A.27), the state W̃_1 is also bounded. See the proof of Theorem 2.

REFERENCES

[1] F. L. Lewis, D. Vrabie, and V. Syrmos, Optimal Control, 3rd ed. New York, NY, USA: Wiley, 2012. [2] P. Ioannou and B. Fidan, Advances in Design and Control, Adaptive Control Tutorial. Philadelphia, PA, USA: SIAM, 2006. [3] S. Sastry and M. Bodson, Adaptive Control: Stability, Convergence, and Robustness. Englewood Cliffs, NJ, USA: Prentice-Hall, 1989. [4] R. S. Sutton and A. G. Barto, Reinforcement Learning - An Introduction. Cambridge, MA, USA: MIT Press, 1998. [5] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA, USA: Athena Scientific, 1996. [6] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality. New York, NY, USA: Wiley, 2007.


[7] P. J. Werbos, “Neural networks for control and system identification,” in Proc. IEEE Conf. Decision Control, Dec. 1989, pp. 260–265. [8] P. J. Werbos, “A menu of designs for reinforcement learning over time,” in Neural Networks for Control, W. T. Miller, R. S. Sutton, and P. J. Werbos, Eds. Cambridge, MA, USA: MIT Press, 1991, pp. 67–95. [9] P. J. Werbos, “Approximate dynamic programming for real time control and neural modelling,” in Handbook of Intelligent Control, D. A. White and D. A. Sofge, Eds. Brentwood, U.K.: Multiscience Press, 1992. [10] A. G. Barto, R. S. Sutton, and C. W. Anderson, “Neuronlike adaptive elements that can solve difficult learning control problems,” IEEE Trans. Syst., Man Cybern., vol. SMC-13, no. 5, pp. 834–846, Sep./Oct. 1983. [11] D. Liu and Q. Wei, “Finite-approximation-error-based optimal control approach for discrete-time nonlinear systems,” IEEE Trans. Cybern., vol. 43, no. 2, pp. 779–789, Apr. 2013. [12] D. Liu, Y. Zhang, and H. Zhang, “A self-learning call admission control scheme for CDMA cellular networks,” IEEE Trans. Neural Netw., vol. 16, no. 5, pp. 1219–1228, Sep. 2005. [13] S. Jagannathan and G. Galan, “Adaptive critic neural network-based object grasping control using a three-finger gripper,” IEEE Trans. Neural Netw., vol. 15, no. 2, pp. 395–407, Mar. 2004. [14] G. Venayagamoorthy, R. Harley, and D. Wunsch, “Comparison of heuristic dynamic programming and dual heuristic programming adaptive critics for neurocontrol of a turbogenerator,” IEEE Trans. Neural Netw., vol. 13, no. 3, pp. 764–773, May 2002. [15] D. Liu, H. Javaherian, O. Kovalenko, and T. Huang, “Adaptive critic learning techniques for engine torque and air-fuel ratio control,” IEEE Trans. Syst., Man Cybern. Part B, Cybern., vol. 38, no. 4, pp. 988–993, Aug. 2008. [16] F. L. Lewis and D. Vrabie, “Reinforcement learning and adaptive dynamic programming for feedback control,” IEEE Circle Syst. Mag., vol. 9, no. 3, pp. 32–50, Jan. 2009. [17] F. Y. Wang, H. Zhang, and D. Liu, “Adaptive dynamic programming: An introduction,” IEEE Comput. Int. Mag., vol. 4, no. 2, pp. 39–47, May 2009. [18] S. N. Balakrishnan, J. Ding, and F. L. Lewis, “Issues on stability of ADP feedback controllers for dynamical system,” IEEE Trans. Syst., Man, Cybern., Part B, Cybern., vol. 38, no. 4, pp. 913–917, Aug. 2008. [19] D. V. Prokhorov and D. C. Wunsch, “Adaptive critic designs,” IEEE Trans. Neural Netw., vol. 8, no. 5, pp. 997–1007, Sep. 1997. [20] L. Baird, “Reinforcement learning in continuous time: Advantage updating,” in Proc. Int. Conf. Neural Netw., Jun. 1994, pp. 2448–2453. [21] K. Doya, “Reinforcement learning in continuous time and space,” Neural Comput., vol. 12, no. 1, pp. 219–245, 2000. [22] T. Hanselmann, L. Noakes, and A. Zaknich, “Continuous-time adaptive critics,” IEEE Trans. Neural Netw., vol. 18, no. 3, pp. 631–647, May 2007. [23] K. Vamvoudakis and F. L. Lewis, “Online actor-critic algorithm to solve the continuous infinite-time horizon optimal control problem,” Automatica, vol. 46, pp. 878–888, May 2010. [24] T. Dierks and S. Jagannathan, “Optimal control of affine nonlinear continuous-time systems,” in Proc. IEEE Amer. Control Conf., Jul. 2010, pp. 1568–1573. [25] D. Vrabie, O. Pastravanu, M. Abu-Khalaf, and F. L. Lewis, “Adaptive optimal control for continuous-time linear systems based on policy iteration,” Automatica, vol. 45, no. 2, pp. 477–484, Feb. 2009. [26] D. Vrabie and F. L. 
Lewis, “Neural network approach to continuoustime direct adaptive optimal control for partially unknown nonlinear systems,” Neural Netw., vol. 22, pp. 237–246, Apr. 2009. [27] D. Liu, X. Yang, and H. Li, “Adaptive optimal control for a class of continuous-time affine nonlinear systems with unknown internal dynamics,” Neural Comput. Appl., Nov. 2012, to be published. [28] S. Bhasin, R. Kamalapurkar, M. Johnson, K. Vamvoudakis, F. L. Lewis, and W. Dixon, “A novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear systems,” Automatica, vol. 49, no. 1, pp. 82–92, Jan. 2013. [29] J. Y. Lee, J. B. Park, and Y. H. Choi, “Integral Q-learning and explorized policy iteration for adaptive optimal control of continuous-time linear systems,” Automatica, vol. 48, no. 11, pp. 2850–2859, Nov. 2012.

[30] Y. Jiang and Z. P. Jiang, “Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics,” Automatica, vol. 48, no. 10, pp. 2699–2704, Oct. 2012. [31] M. Abu-Khalaf and F. L. Lewis, “Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach,” Automatica, vol. 41, pp. 779–791, May 2005. [32] H. Modares, F. L. Lewis, and M. B. Naghibi-Sistani, “Online solution of nonquadratic two-player zero-sum games arising in the control of constrained input systems,” Int. J. Adapt. Control Signal Process., Oct. 2012, to be published. [33] L. J. Lin, “Self-improving reactive agents based on reinforcement learning, planning and teaching,” Mach. Learn., vol. 8, nos. 3–4, pp. 293–321, 1992. [34] S. Adam, L. Busoniu, and R. Babuska, “Experience replay for realtime reinforcement learning control,” IEEE Trans. Syst., Man, Cybern., Part C, Appl. Rev., vol. 42, no. 2, pp. 201–212, Mar. 2012. [35] P. Wawrzynski, “Real-time reinforcement learning by sequential actorcritics and experience replay,” Neural Netw., vol. 22, pp. 1484–1497, Jan. 2012. [36] G. V. Chowdhary, “Concurrent learning for convergence in adaptive control without persistency of excitation,” Ph.D. dissertation, Georgia Inst. Technol., Atlanta, GA, USA, 2010. [37] A. H. Kingravi, G. Chowdhary, P. A. Vela, and E. N. Johnson, “Reproducing kernel Hilbert space approach for the online update of radial bases in neuro-adaptive control,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 7, pp. 1130–1141, Sep. 2012. [38] F. L. Lewis, S. Jagannathan, and A. Yesildirek, Neural Network Control of Robot Manipulators and Nonlinear Systems. New York, NY, USA: Taylor & Francis, 1999. [39] P. Ioannou and J. Sun, Robust Adaptive Control. Upper Saddle River, NJ, USA: Prentice-Hall, 1996. [40] R. W. Beard, “Improving the closed-loop performance of nonlinear systems,” Ph.D. dissertation, Dept. Electr. Eng., Rensselaer Polytech. Inst., Troy, NY, USA, 1995. [41] S. E. Lyshevski, “Optimal control of nonlinear continuous-time systems: Design of bounded controllers via generalized nonquadratic functionals,” in Proc. IEEE Amer. Control Conf., Jun. 1998, pp. 205–209. [42] H. Zhang, Y. Luo, and D. Liu, “Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints,” IEEE Trans. Neural Netw., vol. 20, no. 9, pp. 1490–1503, Sep. 2009. [43] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural Netw., vol. 2, pp. 359–366, Jan. 1985. [44] E. B. Kosmatopoulos, M. M. Polycarpou, M. A. Christodoulou, and P. A. Ioannou, “High-order neural network structures for identification of dynamical systems,” IEEE Trans. Neural Netw., vol. 6, no. 2, pp. 422–431, Mar. 1995. [45] B. A. Finlayson, The Method of Weighted Residuals and Variational Principles. New York, NY, USA: Academic Press, 1990.

Hamidreza Modares received the B.S. degree from the University of Tehran, Tehran, Iran, in 2004, and the M.S. degree from the Shahrood University of Technology, Shahrood, Iran, in 2006. He is currently pursuing the Ph.D. degree with the Ferdowsi University of Mashhad, Mashhad, Iran. He joined the Shahrood University of Technology as a University Lecturer from 2006 to 2009. Since August 2012, he has been a Research Assistant with the University of Texas at Arlington Research Institute, Fort Worth, TX, USA. His current research interests include optimal control, reinforcement learning, approximate dynamic programming, neural adaptive control, and pattern recognition.


Frank L. Lewis (S’70–M’81–SM’86–F’94) received the bachelor’s degree in physics/EE and the M.S.E.E. degree from Rice University, Houston, TX, USA, the M.S. degree in aeronautical engineering from University of West Florida, Pensacola, FL, USA, and the Ph.D. degree from the Georgia Institute of Technology, Atlanta, GA, USA. He works in feedback control, reinforcement learning, intelligent systems, and distributed control systems. He is the author of six U.S. patents, 273 journal papers, 375 conference papers, 15 books, 44 chapters, and 11 journal special issues. Dr. Lewis is a fellow of IFAC and the U.K. Institute of Measurement & Control, PE Texas, U.K. Chartered Engineer, Distinguished Scholar Professor, Distinguished Teaching Professor, and a Moncrief-O’Donnell Chair with The University of Texas at Arlington Research Institute. He is the IEEE Control Systems Society Distinguished Lecturer. He received the Fulbright Research Award, the NSF Research Initiation Grant, the ASEE Terman Award, International Neural Network Society Gabor Award in 2009, the U.K. Institute Measurement & Control Honeywell Field Engineering Medal in 2009. He received the IEEE Computational Intelligence Society Neural Networks Pioneer Award in 2012. He is a Distinguished Foreign Scholar, Nanjing University of Science and Technology. He is a Project 111 Professor with Northeastern University, China. He received the Outstanding Service Award from Dallas IEEE Section, selected as Engineer of the Year by Fort Worth IEEE Section. He is listed in Fort Worth Business Press Top 200 Leaders in Manufacturing, and has been honored with the 2010 IEEE Region 5 Outstanding Engineering Educator Award and the 2010 UTA Graduate Dean’s Excellence in Doctoral Mentoring Award. He was elected to UTA Academy of Distinguished Teachers in 2012. He served on the NAE Committee on Space Station in 1995. He is a Founding Member of the Board of Governors of the Mediterranean Control Association. He is a recipient the IEEE Control Systems Society Best Chapter Award (as Founding Chairman of DFW Chapter), the National Sigma Xi Award for Outstanding Chapter (as President of UTA Chapter), and the US SBA Tibbets Award in 1996 (as Director of ARRI’s SBIR Program).


Mohammad-Bagher Naghibi-Sistani received the B.Sc. and M.Sc. (Hons.) degrees in control engineering from the University of Tehran, Tehran, Iran, in 1991 and 1995, respectively, and the Ph.D. degree from the Department of Electrical Engineering, Ferdowsi University of Mashhad, Mashhad, Iran, in 2005. He was a Lecturer with the Ferdowsi University of Mashhad from 2001 to 2005, where he is currently an Assistant Professor. His current research interests include artificial intelligence, reinforcement learning, and control systems.
