
Model-free optimal controller design for continuous-time nonlinear systems by adaptive dynamic programming based on a precompensator

Jilie Zhang a,b, Huaguang Zhang a,*, Zhenwei Liu a, Yingchun Wang a

a School of Information Science and Engineering, Northeastern University, Shenyang, Liaoning 110819, PR China
b School of Information Science and Technology, Southwest Jiaotong University, Chengdu 610031, PR China

Article history: Received 22 October 2013; received in revised form 29 June 2014; accepted 31 August 2014. This paper was recommended for publication by Dr. Q.-G. Wang.

Abstract

In this paper, we consider the problem of designing a controller for continuous-time nonlinear systems whose governing equations are unknown. Using measured data, two new online schemes based on adaptive dynamic programming (ADP) are presented for synthesizing a controller without building or assuming a model of the system. To circumvent the need for prior knowledge of the system, a precompensator is introduced to construct an augmented system. The corresponding Hamilton–Jacobi–Bellman (HJB) equation is solved by adaptive dynamic programming, which combines the least-squares technique, a neural network approximator and the policy iteration (PI) algorithm. The main idea of our method is to sample the state, the state derivative and the input, and to use these samples to update the weights of the neural network by the least-squares technique; the update process is implemented in the framework of PI. Finally, several examples are given to illustrate the effectiveness of our schemes. © 2014 ISA. Published by Elsevier Ltd. All rights reserved.

Keywords: Model-free controller; Optimal control; Precompensator; Adaptive dynamic programming

1. Introduction

Since the optimal control problem is ubiquitous in the real world [1–4], it is one of the most important problems in the control community. Recently, with the rise of adaptive dynamic programming (ADP), which combines adaptive control [5–7] with dynamic programming, ADP-based methods have been developed in a variety of areas for controlling systems using knowledge of the system model [8–16]. However, these control schemes are suited to systems whose dynamics can be characterized precisely. When the plant dynamics are poorly modeled, such controls cannot provide satisfactory responses. This motivates the development of a control procedure that does not require a model of the underlying system. Accordingly, some model-free or partially model-free schemes [17–22] have been studied in recent years. A model-free controller is designed in [17], but it is not optimal. The controllers designed in [18] and [19] address the case where the internal dynamics of the system is unknown, but the control matrix is still required, so they are referred to as partially model-free controllers.

* Corresponding author. E-mail addresses: [email protected] (J. Zhang), [email protected] (H. Zhang), [email protected] (Z. Liu), [email protected] (Y. Wang).

In contrast, the controllers designed in [20,21] and [22] address systems without any a priori knowledge. Although these methods are effective for designing model-free controls, they have some restrictions and disadvantages. For instance, [20] and [21] design model-free controls for linear discrete-time and continuous-time systems, respectively, but they are restricted to linear systems, and the schemes are difficult to apply to nonlinear problems. [22] designs a model-free control by identifying the system parameters, but the identification process is known to respond slowly to parameter variations in the plant, which reduces the response speed of the designed optimal control. These restrictions and disadvantages motivate our research on model-free control for continuous-time nonlinear systems.

Owing to the prevalence of digital computers, most continuous-time systems can be treated in a discretized version. Here, a precompensator [23] is employed to eliminate the dependence on prior knowledge of the system, as in [24,25]. Based on these ideas, we design a fully model-free control for continuous-time nonlinear systems by two schemes, rather than by identifying the system. The first is to solve the corresponding Hamilton–Jacobi–Bellman (HJB) equation for the augmented system with a precompensator, in discretized version, by adaptive dynamic programming, which combines the least-squares technique, a neural network approximator (as in [26]) and the policy iteration (PI) algorithm.


The main idea of our method is to sample the state, the state derivative and the input, and to use these samples to update the weights of the neural network by the least-squares technique; the update process is implemented in the framework of PI. The second scheme is an improved version, in which the structure of the algorithm is reduced while the results change little, a point verified by the simulations. The contributions of the paper include the following:

- The fully model-free control problem for continuous-time nonlinear systems is solved by adaptive dynamic programming, avoiding the restrictions and disadvantages of previous techniques such as [18–22].
- The structure of the algorithm in [18] is improved and reduced. In particular, the cost function V(x) need not be solved from the differential equation $\dot V(x) = Q(x) + u^T R u$ as in Fig. 1 of [18].
- Our method also reduces the neural-network framework of [22], in which three neural networks are used to obtain the optimal control: one identifies the controlled system, the second approximates the value function (critic network), and the last approximates the control (action network). We use only one neural network, which approximates the value function (critic network), to obtain the optimal control. We thus remove two networks, saving a large number of parameters to be identified.

The rest of this paper is organized as follows. In Section 2, the foundations of adaptive dynamic programming for nonlinear systems are introduced. The augmented system with a precompensator is constructed in Section 3. Sections 4 and 5 give a new algorithm and its improved version, respectively. Finally, numerical examples are given in Section 6 to illustrate the effectiveness of our method.

2. Preliminaries of nonlinear adaptive dynamic programming

Consider the nonlinear dynamical system of the form
$$\dot x(t) = f(x(t)) + g(x(t))\,u(x(t)), \qquad x_0 = x(t_0), \tag{1}$$
where $x(t) \in \Omega \subset \mathbb{R}^n$ and $u(t) \in \Omega_u \subset \mathbb{R}^m$, and $f(x) + g(x)u(x(t))$ is Lipschitz continuous on a domain $\Omega \subset \mathbb{R}^n$ into $\mathbb{R}^n$, with $f:\Omega \to \mathbb{R}^n$ and $g:\Omega \to \mathbb{R}^{n\times m}$.

Definition 1 (Admissible control). Given the system (1), a control $u:\mathbb{R}^n \to \mathbb{R}^m$ is said to be admissible with respect to the utility function $U(\cdot,\cdot)$ on $\Omega$, written $u \in \Omega_u$, if

a. $u$ is continuous on $\Omega$ and $u(0)=0$;
b. $u$ stabilizes the system (1) on $\Omega$;
c. $\int_{t_0}^{\infty} U(x(x_0,\tau), u(x_0,\tau))\,d\tau < \infty$ for all $x \in \Omega$.

Assumption 1. Assume that $f(0)=0$, that $f+gu$ also has an equilibrium at the origin, and that the dynamical system is stabilizable on $\Omega$, i.e. there exists a continuous control function $u(t) \in \Omega_u$ such that the system is asymptotically stable on $\Omega$.

Define the continuously differentiable performance index on $\Omega$ as
$$J(x(t_0)) = \int_{t_0}^{\infty} U(x(x_0,\tau), u(x_0,\tau))\,d\tau, \tag{2}$$
with the input-quadratic utility function $U(x(x_0,\tau), u(x_0,\tau)) = Q(x(x_0,\tau)) + u(x_0,\tau)^T R\,u(x_0,\tau)$, where $x(x_0,\tau)$ is the state trajectory of the system (1) starting from the initial state $x_0$ over the time interval $[t_0,\infty)$, and $u(x_0,\tau)$ is the input signal for $\tau \ge t_0$. Here $Q(x(x_0,\tau)) > 0$ for all $x \ne 0$, $Q(x(x_0,\tau)) = 0$ when $x = 0$, and $R \in \mathbb{R}^{m\times m}$ is a positive definite matrix. Note that, for the sake of simplicity, $x$ is used to denote $x(x_0,t)$ in some places; other variables are abbreviated similarly.

[Fig. 1. The structure of the system with adaptive controller.]

Remark 1. In most existing papers, the assumption on system (1) is that f(x) and g(x) are each Lipschitz continuous, and theoretically such conditions are sufficient for our method. However, in practical situations the sampling period T cannot be infinitesimal, so the sampling process requires a relatively mild state trajectory (one without tremendous changes) of the closed-loop system; only then can the sampled data accurately capture the characteristics of the system. Otherwise, if the state trajectory of the closed-loop system changes abruptly, a sampling process with a relatively large period T cannot accurately capture the characteristics of the closed-loop system at the places where the trajectory changes suddenly. Therefore, we assume for system (1) that $f(x)+g(x)u$ is Lipschitz continuous, which guarantees that the state trajectory of the closed-loop system is relatively mild. The condition $f(0)=0$ means that the origin is an equilibrium point of the open-loop system; since the control u is admissible, the closed-loop system $f(x)+g(x)u$ also has an equilibrium at the origin. In addition, stabilizability is a necessary and common assumption for designing a control for a controlled system.

The value function associated with any admissible control policy is
$$V_u(x(t_0)) = \int_{t_0}^{\infty} U(x(\tau), u(\tau))\,d\tau. \tag{3}$$

To solve the optimal control problem, that is, to find an admissible control that minimizes the performance index (2) associated with the system (1), the Hamiltonian is defined as
$$H(x, u, V_x) = U(x, u) + (\nabla V_x)^T (f(x) + g(x)u), \tag{4}$$
where $\nabla V_x$ is the gradient of the value function $V(x)$ with respect to $x$ under the control $u$, as the value function does not depend explicitly on time. The optimal value function $V^*(x)$ satisfies the HJB equation
$$0 = \min_{u \in \Omega_u} H(x, u, V^*_x). \tag{5}$$

Eq. (5) can be used to solve for the cost $V^*(x)$.


By Bellman's principle of optimality, the optimal control and the optimal value function satisfy the desired relationship
$$u^* = -\tfrac{1}{2} R^{-1} g^T(x)\, \nabla V^*_x. \tag{6}$$
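As a quick check (not spelled out in the text), (6) follows from the stationarity of the Hamiltonian (4) in u with the quadratic utility $U(x,u) = Q(x) + u^T R u$:

```latex
% Setting the gradient of H(x,u,V_x^*) with respect to u to zero:
\frac{\partial H}{\partial u} = 2Ru + g^{T}(x)\,\nabla V_x^{*} = 0
\quad\Longrightarrow\quad
u^{*} = -\tfrac{1}{2}\,R^{-1} g^{T}(x)\,\nabla V_x^{*}.
```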

Inserting (6) into (5), we can rewrite the Hamilton–Jacobi–Bellman (HJB) equation as
$$0 = Q(x) - \tfrac{1}{4} (\nabla V^*_x)^T g(x) R^{-1} g^T(x) \nabla V^*_x + (\nabla V^*_x)^T f(x). \tag{7}$$

To solve the optimal control problem, one only needs to obtain the solution of the HJB equation (7) for the value function. However, this is generally difficult. Therefore, the policy iteration algorithm is introduced to solve the HJB equation, as in [27,28].

2.1. Policy iteration algorithm

Let $u^0$ (the initial control) be an admissible policy; then the iteration between

1. (policy evaluation) solving for $V^i(x)$ using
$$V^i(x(t)) = \int_{t}^{\infty} U(x(\tau), u^i(\tau))\,d\tau, \tag{8}$$

2. (policy improvement) updating the control policy using
$$u^{i+1} = -\tfrac{1}{2} R^{-1} g^T(x)\, \nabla V^i_x, \tag{9}$$

converges to the optimal control $u^*$ with the corresponding cost $V^*(x)$. Obviously, this conventional algorithm requires knowledge of the system in advance. In the next section, we use a precompensator to avoid this requirement and design a model-free control for the nonlinear system.
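To make the iteration (8)–(9) concrete, the following is a minimal sketch of policy iteration on the linear special case $\dot x = Ax + Bu$, $Q(x) = x^T S x$, where policy evaluation reduces to a Lyapunov equation and policy improvement to a gain update (Kleinman's algorithm). The matrices A, B, S, R here are illustrative assumptions, not taken from the paper, whose own method is model-free:

```python
# Policy iteration (8)-(9) on the linear special case xdot = Ax + Bu,
# Q(x) = x'Sx: evaluation solves a Lyapunov equation for V^i(x) = x'P x,
# improvement gives u^{i+1} = -Kx with K = R^{-1} B' P.
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

A = np.array([[0.0, 1.0], [-1.0, -2.0]])  # illustrative stable plant
B = np.array([[0.0], [1.0]])
S, R = np.eye(2), np.eye(1)

K = np.zeros((1, 2))  # initial admissible policy u = -Kx (A itself is stable)
for i in range(50):
    Acl = A - B @ K
    # policy evaluation: Acl'P + P Acl + S + K'RK = 0, the analogue of (8)
    P = solve_continuous_lyapunov(Acl.T, -(S + K.T @ R @ K))
    # policy improvement: the analogue of (9)
    K_new = np.linalg.solve(R, B.T @ P)
    if np.linalg.norm(K_new - K) < 1e-9:
        break
    K = K_new
print("converged gain K =", K)
```

On this linear case the iteration converges to the LQR gain; the point of the paper is to achieve the same evaluation/improvement cycle without A and B, as developed next.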

3. The augmented system with precompensation technique

To avoid our ADP algorithm depending on the system model, we use the precompensation technique in [24] to obtain an augmented system. First, we design a precompensator that treats u as a state variable,
$$\dot u = a(u) + b(u)\,w, \tag{10}$$
with a singularity at $(u=0, w=0)$, where $a(u)$ and $b(u)$ are Lipschitz continuous. Moreover, $a(u)$ and $b(u)$ should be designed so that the precompensator consumes as little energy as possible. The augmented system is obtained by combining the precompensator (10) with the system (1):
$$\dot{\bar x} = F(\bar x) + G(\bar x)\,w, \tag{11}$$
with the state vector $\bar x = [x; u]^T \in \chi \subset \mathbb{R}^{n+m}$ and a singularity at $(\bar x = 0, w = 0)$, where
$$F(\bar x) = \begin{bmatrix} f(x)+g(x)u \\ a(u) \end{bmatrix} : \chi \to \mathbb{R}^{n+m}, \qquad G(\bar x) = \begin{bmatrix} 0 \\ b(u) \end{bmatrix} : \chi \to \mathbb{R}^{(n+m)\times m},$$
and $w \in \Omega_w \subset \mathbb{R}^m$ is the control input of the augmented system. The value function can be written as
$$V(\bar x(t)) = \int_{t}^{\infty} U(\bar x(\tau), w(\tau))\,d\tau. \tag{12}$$
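A minimal sketch of how (10)–(11) are used in simulation: only the designer-chosen precompensator (a, b) appears explicitly, while f and g stay hidden inside a measured state derivative. The Euler integration and the function names are illustrative assumptions; a(u) = -u and b(u) = sin²(u) anticipate the choice made in Example 1:

```python
# Augmented system (11): xbar = [x; u]; the plant enters only through its
# measured state derivative, so f and g are never written down.
import numpy as np

def precompensator(u, w):
    a = -u                   # a(u), designed (Example 1 choice)
    b = np.sin(u) ** 2       # b(u), designed (Example 1 choice)
    return a + b * w         # udot = a(u) + b(u) w, Eq. (10)

def augmented_step(x, u, w, measured_xdot, dt):
    """One Euler step of (11); measured_xdot is the sampled derivative of (1)."""
    x_next = x + dt * measured_xdot
    u_next = u + dt * precompensator(u, w)
    return x_next, u_next
```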

Further, (12) can be rewritten in the discretized form, as in [18,31]:
$$V(\bar x(t)) = \int_{t}^{t+T} U(\bar x(\tau), w(\tau))\,d\tau + \int_{t+T}^{\infty} U(\bar x(\tau), w(\tau))\,d\tau, \tag{13}$$
where $U(\bar x(t), w(t)) = Q(\bar x(t)) + w(t)^T R\,w(t)$ and T is the sampling period. It can also be rewritten as
$$V(\bar x(t)) - V(\bar x(t+T)) = \int_{t}^{t+T} U(\bar x(\tau), w(\tau))\,d\tau; \tag{14}$$
then, taking the derivative of both sides of equation (14) with respect to time t,
$$\nabla V^T_{\bar x_t}\,\dot{\bar x}(t) - \nabla V^T_{\bar x_{t+T}}\,\dot{\bar x}(t+T) = U(\bar x(t+T), w(t+T)) - U(\bar x(t), w(t)), \tag{15}$$
where $\nabla V_{\bar x_t}$ denotes the gradient of $V(\bar x(t))$ with respect to $\bar x(t)$. Similarly to (6), the admissible control w has the following relation with the cost function $V(\bar x)$:
$$w = -\tfrac{1}{2} R^{-1} G^T(\bar x)\, \nabla V_{\bar x}. \tag{16}$$

4. Neural network approximation of the cost function

As is well known, thanks to the universal approximation property [30], neural networks are natural candidates for approximating smooth functions on compact sets (see [26] for a similar use). Therefore, in order to solve equation (15), we approximate the value function by the following neural network:
$$V(\bar x) = \sum_{j=1}^{N} c_j\,\phi_j(\bar x) = C^T \Phi(\bar x), \tag{17}$$
where the $c_j$ are the output-layer weights, the $\phi_j(\bar x)$ are the activation functions, and N is the number of neurons in the hidden layer. C and $\Phi(\bar x)$ are the vectors collecting the $c_j$ and the $\phi_j(\bar x)$, respectively ($j = 1, \ldots, N$), i.e. $C = [c_1, c_2, \ldots, c_N]^T$ and $\Phi(\bar x) = [\phi_1(\bar x), \phi_2(\bar x), \ldots, \phi_N(\bar x)]^T$. Note that the set $\{\phi_j(\bar x)\}_1^N$ must be linearly independent. Inserting (17) into (15),
$$C^T\big(\nabla\Phi(\bar x(t))\,\dot{\bar x}(t) - \nabla\Phi(\bar x(t+T))\,\dot{\bar x}(t+T)\big) = U(\bar x(t+T), w(t+T)) - U(\bar x(t), w(t)), \tag{18}$$
with $\nabla\Phi(\bar x(t))$ being the gradient of $\Phi(\bar x(t))$ with respect to $\bar x(t)$. Let
$$z_k = \nabla\Phi(\bar x(t_0+(k-1)T))\,\dot{\bar x}(t_0+(k-1)T) - \nabla\Phi(\bar x(t_0+kT))\,\dot{\bar x}(t_0+kT)$$
and
$$y_k = U(\bar x(t_0+kT), w(t_0+kT)) - U(\bar x(t_0+(k-1)T), w(t_0+(k-1)T)),$$
where $t_0$ is the initial sampling time, and let $Z = [z_1, z_2, \ldots, z_k]$ and $Y = [y_1, y_2, \ldots, y_k]$ (k is the sampling step index). From the sampled data, the weight vector can be obtained by the least-squares technique as
$$C^T = Y Z^{-1}. \tag{19}$$
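A minimal sketch of the policy-evaluation computation (18)–(19), under the assumption that a routine grad_phi returning the N×(n+m) Jacobian of Φ is available; the data layout and function names are illustrative:

```python
# Build z_k and y_k from samples at t0 + kT and solve C^T = Y Z^{-1}
# in the least-squares sense (pseudo-inverse when k > N), Eq. (19).
import numpy as np

def evaluate_weights(xbar, xbar_dot, U_vals, grad_phi):
    """xbar[k], xbar_dot[k]: samples at t0 + kT; U_vals[k] = U(xbar_k, w_k)."""
    z_cols, y_vals = [], []
    for k in range(1, len(xbar)):
        # z_k from (18): consecutive differences of grad(Phi) @ xbar_dot
        z_cols.append(grad_phi(xbar[k - 1]) @ xbar_dot[k - 1]
                      - grad_phi(xbar[k]) @ xbar_dot[k])
        # y_k: consecutive differences of the utility
        y_vals.append(U_vals[k] - U_vals[k - 1])
    Z = np.column_stack(z_cols)   # N x k
    Y = np.array(y_vals)          # length k
    # C^T Z = Y  <=>  Z^T C = Y^T; lstsq returns the pseudo-inverse solution
    C, *_ = np.linalg.lstsq(Z.T, Y, rcond=None)
    return C
```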

For equation (19) to have a solution, Z must be invertible (or its pseudo-inverse must exist); therefore, the number of samples k is at least N, i.e. $N \le k$. The following Theorem 1 shows that Z is invertible. Before presenting Theorem 1, we first give Lemma 1.

Lemma 1 (Vrabie and Lewis [18]). Let $u \in \Omega_u$ be such that $f+gu$ is asymptotically stable. Given that the set $\{\phi_j\}_1^N$ is linearly independent, then $\exists\, T > 0$ such that $\forall x(t) \in \Omega \setminus \{0\}$, the set $\{\phi_j(x(t)) - \phi_j(x(t+T))\}_1^N$ is also linearly independent.

Theorem 1. Let $u \in \Omega_u$ be such that $f+gu$ is asymptotically stable. Given that the set $\{\phi_j\}_1^N$ is linearly independent, then $\exists\, T > 0$ such that $\forall x \in \Omega \setminus \{0\}$, the set $\{\nabla\phi_j(x(t+T))\,\dot x(t+T) - \nabla\phi_j(x(t))\,\dot x(t)\}_1^N$ is also linearly independent.


Proof 1. Since $u \in \Omega_u$, the cost function $V(x(t))$ is a Lyapunov function for the system $\dot x = f+gu$, and $V(x(t))$ satisfies
$$V(x(t)) - V(x(t+T)) = -\int_{t}^{t+T} (\nabla V_x)^T (f+gu)\,d\tau \tag{20}$$
over the time interval $[t, t+T]$. Taking the derivative of both sides with respect to time t and inserting (17) into (20), equation (20) can be written as
$$C^T \dot\Phi(x(t)) - C^T \dot\Phi(x(t+T)) = C^T \nabla\Phi(x(t))\,\dot x(t) - C^T \nabla\Phi(x(t+T))\,\dot x(t+T). \tag{21}$$
Suppose that Theorem 1 is not true; then $\forall\, T > 0$ there exists a nonzero constant vector $C \in \mathbb{R}^N$ such that
$$C^T\big[\dot\Phi(x(t)) - \dot\Phi(x(t+T))\big] \equiv 0, \qquad \forall x(t) \in \Omega. \tag{22}$$
This implies, by integrating (22) over time, that
$$C^T\big[\Phi(x(t)) - \Phi(x(t+T))\big] \equiv 0. \tag{23}$$
This means that $\{\phi_j(x(t)) - \phi_j(x(t+T))\}_1^N$ is not linearly independent, contradicting Lemma 1. Thus $\exists\, T > 0$ such that $\forall x(t_0) \in \Omega$ the set $\{\dot\phi_j(x(t)) - \dot\phi_j(x(t+T))\}_1^N$ is linearly independent, and by (21) the set $\{\nabla\phi_j(x(t+T))\,\dot x(t+T) - \nabla\phi_j(x(t))\,\dot x(t)\}_1^N$ is also linearly independent. □

Remark 2. Since we solve the discretized version of the HJB equation and the sampled data are linearly independent (according to Theorem 1), our method is feasible and reasonable for synthesizing the optimal control by computer in practice. Theorem 1 applies directly to solving Eq. (18) associated with the augmented system (11).

4.1. Policy iteration algorithm with NN

Let $w^0$ (the initial control) be an admissible policy; then the iteration between

1. (policy evaluation) solving for $C^i$ using
$$C^{iT} = Y^i (Z^i)^{-1}, \tag{24}$$

2. (policy improvement) updating the control policy using
$$w^{i+1} = -\tfrac{1}{2} R^{-1} G^T(\bar x)\, \nabla\Phi^T(\bar x)\, C^i, \tag{25}$$

converges to the optimal control $w^*$ with the corresponding weight $C^*$.

Remark 3. Obviously, the algorithm circumvents the requirement for knowledge of the system (1); that is, the system dynamics f(x) and the control matrix g(x) are not required.

4.2. Control system structure and implementation process

The structure of the system with the adaptive controller is shown in Fig. 1. From Fig. 1, we first sample the data (state, state derivative and control) at every period, then obtain the data $y_k$ and $z_k$ ($y_k$ is a scalar and $z_k$ is an $N \times 1$ vector) by calculation at that time instant. Such data are collected into two storages (i.e. Y and Z) until $k \ge N$; note that Z is an $N \times k$ matrix and Y is a $1 \times k$ vector. The proposed optimal algorithm only requires samples of the state, the state derivative and the corresponding control along the system trajectory. The number of samples is k over the time kT, and the sampling time sequence is $\{t_0+T, \ldots, t_0+kT\}$. In order to guarantee that Z is invertible, $z_k$ and $y_k$ are calculated from those data and held by the zero-order holder until $k \ge N$. Further, the data Z and Y are used to update the weight C. The update process is repeated within the structure of the PI algorithm along the system trajectories until $\|C^i - C^{i-1}\| < \varepsilon$ (the stopping condition of the iteration, where ε is an optional parameter). The algorithm flowchart is shown in Fig. 2.

[Fig. 2. Flowchart of PI algorithm.]
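A sketch of the improvement step (25): because $G(\bar x) = [0; b(u)]$ involves only the designed precompensator, the updated policy can be evaluated without any plant knowledge. The layout and names are illustrative assumptions for a single-input plant (m = 1, scalar R):

```python
# Policy improvement (25): w = -1/2 R^{-1} G(xbar)' grad(Phi(xbar))' C.
# Only b(u) from the designed precompensator enters G(xbar) = [0; b(u)],
# so the update needs no plant model.
import numpy as np

def improve_policy(C, R, b, grad_phi, n):
    """Return the improved policy w(xbar), where xbar = [x_1..x_n, u]."""
    def w(xbar):
        u = xbar[n]                       # last entry of xbar is u
        G = np.zeros(n + 1)
        G[n] = b(u)                       # G(xbar) = [0; b(u)]
        return -0.5 * (1.0 / R) * (G @ grad_phi(xbar).T @ C)
    return w
```

For instance, with the Example 1 choices one could call improve_policy(C, 1.0, lambda u: np.sin(u)**2, grad_phi, 2) after each weight update.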

5. The improvement algorithm

Hereinafter, we present a simpler form. Taking the derivative of both sides of equation (12), we get
$$(\nabla V_{\bar x})^T (F(\bar x) + G(\bar x)w) = -U(\bar x, w). \tag{26}$$
Then, inserting the neural network (17) into (26),
$$C^T \nabla\Phi(\bar x)\,(F(\bar x) + G(\bar x)w) = -U(\bar x, w). \tag{27}$$
To calculate the weight C by the least-squares method, we sample the data on both sides of equation (27) and store them in memory:
$$\eta_k = \nabla\Phi(\bar x(t_0+kT))\big(F(\bar x(t_0+kT)) + G(\bar x(t_0+kT))\,w(t_0+kT)\big) = \nabla\Phi(\bar x(t_0+kT))\,\dot{\bar x}(t_0+kT)$$
and $\gamma_k = U(\bar x(t_0+kT), w(t_0+kT))$. Letting $\Pi = [\eta_1, \eta_2, \ldots, \eta_N]$ and $\Gamma = [\gamma_1, \gamma_2, \ldots, \gamma_N]$, the weight C can be calculated from
$$C^T = -\Gamma\,\Pi^{-1}. \tag{28}$$

5.1. Improvement policy iteration algorithm with NN

Let $w^0$ (the initial control) be an admissible policy; then the iteration between

1. (policy evaluation) solving for $C^i$ using
$$C^{iT} = -\Gamma^i (\Pi^i)^{-1}, \tag{29}$$

2. (policy improvement) updating the control policy using
$$w^{i+1} = -\tfrac{1}{2} R^{-1} G^T(\bar x)\, \nabla\Phi^T(\bar x)\, C^i, \tag{30}$$

converges to the optimal control $w^*$ with the corresponding weight $C^*$.
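A minimal sketch of the improved evaluation step (27)–(28): each sample now yields one equation $C^T \eta_k = -\gamma_k$ directly, so no differencing of consecutive samples (and no value at t+T) is needed. Names and layout are illustrative:

```python
# Improved policy evaluation (28): C^T = -Gamma Pi^{-1}, using
# eta_k = grad(Phi(xbar_k)) @ xbar_dot_k and gamma_k = U(xbar_k, w_k).
import numpy as np

def evaluate_weights_improved(xbar, xbar_dot, U_vals, grad_phi):
    """Samples taken at t0 + kT; U_vals[k] = U(xbar_k, w_k)."""
    eta = [grad_phi(xb) @ xd for xb, xd in zip(xbar, xbar_dot)]
    Pi = np.column_stack(eta)       # N x k
    Gamma = np.array(U_vals)        # length k
    # C^T Pi = -Gamma  <=>  Pi^T C = -Gamma^T (least squares when k > N)
    C, *_ = np.linalg.lstsq(Pi.T, -Gamma, rcond=None)
    return C
```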


For equation (28) to have a solution, Π must be invertible (or its pseudo-inverse must exist); therefore, the number of samples k is at least N, i.e. $k \ge N$. The following Lemma 2 shows that Π is invertible.

Lemma 2 (Beard et al. [32]). If the set $\{\phi_j\}_1^N$ is linearly independent and $u \in \Omega_u$, then the set $\{\nabla\phi_j^T (f+gu)\}_1^N$ is also linearly independent.

Proof 2. If the vector field $f+gu$ is asymptotically stable, then along the trajectories $x(t) = \varphi(t; x_0, u)$, $x_0 \in \Omega$, we have
$$\Phi(x) = -\int_0^{\infty} \frac{d\Phi(x(\tau))}{dt}\,d\tau = -\int_0^{\infty} \frac{d\Phi(x(\tau))}{dx}\,(f+gu)\,d\tau.$$
Now suppose that Lemma 2 is not true. Then there exists a nonzero $\alpha \in \mathbb{R}^N$ such that
$$\alpha^T \nabla\Phi\,(f+gu) \equiv 0.$$
This implies that, for all $x \in \Omega$,
$$\int_0^{\infty} \alpha^T \nabla\Phi\,(f+gu)\,dt = 0,$$
and then
$$\alpha^T \Phi(x) \equiv 0,$$
which contradicts the linear independence of $\{\phi_j\}_1^N$. □

5.2. Control system structure and implementation process

The structure of the system with the improved adaptive controller is shown in Fig. 3. Obviously, the improved structure is simpler. From Fig. 3, we only sample the data at every period, then obtain the data $\gamma_k$ and $\eta_k$ ($\gamma_k$ is a scalar and $\eta_k$ is an $N \times 1$ vector). In order to guarantee that Π is invertible, such data are collected into two storages (i.e. Γ and Π) until $k \ge N$; note that Π is an $N \times k$ matrix and Γ is a $1 \times k$ vector. The optimal adaptive algorithm requires samples of the state, the state derivative and the corresponding control along the system trajectories. The number of samples is k over the time kT, at the discrete instants $\{t_0+T, \ldots, t_0+kT\}$. Then $\eta_k$ and $\gamma_k$ are obtained and held by the zero-order holder until $k \ge N$. Further, the data Π and Γ are used to update the weight C. The update process is repeated within the structure of the PI algorithm along the system trajectories until the control converges. The PI algorithm flowchart is shown in Fig. 4.

[Fig. 3. The structure of the system with the improvement adaptive controller.]
[Fig. 4. Flowchart of PI algorithm.]

6. Simulation

6.1. Example 1

Consider the following nonlinear continuous-time system:
$$\begin{aligned} \dot x_1 &= -x_1 + x_2 + 2x_2^3, \\ \dot x_2 &= -\tfrac{1}{2}(x_1 + x_2) + \sin(x_1)\,u. \end{aligned} \tag{31}$$

[Fig. 5. The evolution of weights.]
[Fig. 6. The iteration evolution of control.]

Design the precompensator as $a(u) = -u$ and $b(u) = \sin^2(u)$ for the system (31). Letting $Q(\bar x) = \bar x^T S \bar x$ (S = I) and R = 1, and taking $\varepsilon = 0.02$, T = 0.1 s, k = 20 and $\Phi(\bar x) = [x_1^2, x_1 x_2, x_1 u, x_2^2, x_2 u, u^2]^T$ (N = 6), we use the method in Section 4 to obtain the ideal weights, as shown in Fig. 5. Fig. 5 shows that the weights converge to the ideal value $C^* = [0.5000, 0.0000, 0.0000, 1.0000, 0.0000, 0.5133]$ after 8 iterative steps (16 s), with the initial weights $C^0 = [0, 1.5, -0.1, 0, 0, 0.3]$. We can then obtain the optimal control by (25): $w^* = -\tfrac{1}{2} R^{-1} G^T(\bar x)\,\nabla\Phi^T(\bar x)\,C^*$.

Fig. 6 shows the iteration evolution of the control, that is, the control trajectories corresponding to each updated weight in the iteration process. Fig. 7 shows that the state trajectory of the augmented system is asymptotically stable under the optimal control $w^*(t)$: after 10 s, the state goes to zero. Here u is the control of the system (1).

6.2. Example 2. The improvement scheme

Consider the same system and precompensator as in Example 1, with all other parameters also the same as in Example 1. We use the improved scheme to obtain the ideal weights, as shown in Fig. 8. Fig. 8 shows that the weights converge to the ideal value $C^* = [0.5000, 0.0000, 0.0000, 1.0000, 0.0000, 0.5020]$ after 8 iterative steps (16 s), with the initial weights $C^0 = [0, 1.5, -0.1, 0, 0, 0.3]$. We can then obtain the optimal control by (25): $w^* = -\tfrac{1}{2} R^{-1} G^T(\bar x)\,\nabla\Phi^T(\bar x)\,C^*$.

Fig. 9 shows the iteration evolution of the control: the control trajectories corresponding to each updated weight in the iteration process. Fig. 10 shows that the state trajectory of the augmented system is asymptotically stable under the optimal control $w^*(t)$: after 10 s, the state goes to zero. Again, u is the control of the system (1).

Remark 4. Comparing the above two schemes in Figs. 5 and 8, the iteration processes for updating the weights are different; however, they converge to the identical ideal value in the end. Moreover, every updated weight corresponds to a control trajectory, and when the control trajectory no longer changes, the optimal control has been obtained.

6.3. Comparison

Next, we give the simulation results of updating the weights by the method in [22], shown in Figs. 11 and 12. Obviously, the ideal weights are obtained only after 400 s, whereas the method in this paper obtains the ideal weights after 16 s (see Fig. 5 or Fig. 8). Our method is quicker than that in [18]. The reason is that the method in [18] requires identifying the controlled system by neural networks and identifying the weights of three networks, and the identification process is known to respond slowly to parameter variations in the plant, which reduces the response speed of the designed optimal control. In addition, the method in [18] requires three neural networks to obtain the optimal control: one identifies the controlled system, the second approximates the value function (critic network), and the last approximates the control (action network). We use only one neural network, which approximates the value function (critic network). For the network framework, we remove two networks, saving a large number of parameters to be identified.
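To make Example 1 reproducible, the following is a sketch of its setup: the quadratic basis Φ over $\bar x = [x_1, x_2, u]$, its Jacobian, and the stated parameters. Everything here comes from the example's own choices except the function names, which are illustrative:

```python
# Example 1 setup: basis Phi(xbar) = [x1^2, x1 x2, x1 u, x2^2, x2 u, u^2],
# Q(xbar) = xbar' S xbar with S = I, R = 1, T = 0.1 s, k = 20, N = 6.
import numpy as np

def phi(xbar):
    x1, x2, u = xbar
    return np.array([x1*x1, x1*x2, x1*u, x2*x2, x2*u, u*u])

def grad_phi(xbar):                      # 6 x 3 Jacobian of phi
    x1, x2, u = xbar
    return np.array([[2*x1, 0.0,  0.0],
                     [x2,   x1,   0.0],
                     [u,    0.0,  x1 ],
                     [0.0,  2*x2, 0.0],
                     [0.0,  u,    x2 ],
                     [0.0,  0.0,  2*u]])

T, k, R = 0.1, 20, 1.0
C0 = np.array([0.0, 1.5, -0.1, 0.0, 0.0, 0.3])   # initial weights (paper)
# reported converged weights (scheme 1): [0.5, 0, 0, 1, 0, 0.5133]
```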

[Fig. 7. The evolution of state of the augmented system with precompensator.]
[Fig. 8. The evolution of weights.]
[Fig. 9. The iteration evolution of control.]
[Fig. 10. The evolution of state of the augmented system with precompensator.]

[Fig. 11. The evolution of weights for action network.]
[Fig. 12. The evolution of weights for critic network.]

7. Conclusion

In this paper, an online model-free optimal control algorithm has been proposed via two new implementation schemes based on adaptive dynamic programming. The Hamilton–Jacobi–Bellman (HJB) equation is solved by adaptive dynamic programming, which combines the least-squares technique, a neural network approximator and the policy iteration (PI) algorithm. The main idea of our method is to sample the state, the state derivative and the input, and to use these samples to update the weights of the neural network by the least-squares technique; the update process is implemented under the framework of PI. Finally, the examples have demonstrated the effectiveness of our schemes.

Our method is a new ADP algorithm. ADP techniques are used widely in aircraft [25], robot arms [33], micro-electro-mechanical-system actuators [34], turbocharged diesel engines [21], and so on. At present we have studied our method only in theory; however, since it is a kind of ADP algorithm, it has great potential for application. Future research will focus on applying our methods to nonlinear systems with delay, to switched systems and to practical problems. In particular, we will consider the model-free optimal control of time-delay systems using the methods in [35,36] in future work.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61034005, 61433004, 61203046, 61273027), the National High Technology Research and Development Program of China (2012AA040104) and the IAPI Fundamental Research Funds (2013ZCX14). This work was also supported by the development project of the Key Laboratory of Liaoning Province and the Science and Technology Planning Project of Liaoning Province, China (2013219005).

References

[1] Howlett P. Optimal strategies for the control of a train. Automatica 1996;32(4):519–32.
[2] Singla M, Shieh L-S, Song G, Xie L, Zhang Y. A new optimal sliding mode controller design using scalar sign function. ISA Transactions 2014;53(2):267–79.
[3] Xu Z, Song Q, Wang D. Intelligent optimal control of single-link flexible robot arm. IEEE Transactions on Industrial Electronics 2004;51(1):201–20.
[4] Alipouri Y, Poshtan J. Optimal controller design using discrete linear model for a four tank benchmark process. ISA Transactions 2013;52(5):644–51.
[5] Li Y, Tong S, Li T. Adaptive fuzzy output feedback control of uncertain nonlinear systems with unknown backlash-like hysteresis. Information Sciences 2012;198:130–46.
[6] Jin X, Yang G, Che W. Adaptive pinning control of deteriorated nonlinear coupling networks with circuit realization. IEEE Transactions on Neural Networks and Learning Systems 2012;23(9):1345–55.
[7] Jin X, Yang G. Adaptive synchronization of a class of uncertain complex networks against network deterioration. IEEE Transactions on Circuits and Systems I 2011;58(6):1396–409.
[8] Zhang H, Luo Y, Liu D. Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints. IEEE Transactions on Neural Networks 2009;20(9):1490–503.
[9] Zhang H, Liu D, Luo Y, Wang D. Adaptive dynamic programming for control: algorithms and stability. London: Springer-Verlag; 2013.
[10] Zhang H, Cui L, Luo Y. Near-optimal control for nonzero-sum differential games of continuous-time nonlinear systems using single-network ADP. IEEE Transactions on Cybernetics 2013;43(1):206–16.
[11] Zhang J, Zhang H, Luo Y, Liang H. Nearly optimal control scheme using adaptive dynamic programming based on generalized fuzzy hyperbolic model. Acta Automatica Sinica 2013;39(2):142–9.
[12] Abu-Khalaf M, Lewis F. Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach. Automatica 2005;41(5):779–91.
[13] Vamvoudakis K, Lewis F. Online actor–critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica 2010;46(5):878–88.
[14] Zhang J, Zhang H, Luo Y, Feng T. Model-free optimal control design for a class of linear discrete-time systems with multiple delays using adaptive dynamic programming. Neurocomputing 2014;135(5):163–70.
[15] Wei Q, Liu D. An iterative ϵ-optimal control scheme for a class of discrete-time nonlinear systems with unfixed initial state. Neural Networks 2012;32(6):236–44.
[16] Zhang H, Zhang J, Yang G, Luo Y. Leader-based optimal coordination control for the consensus problem of multi-agent differential games via fuzzy adaptive dynamic programming. IEEE Transactions on Fuzzy Systems, http://dx.doi.org/10.1109/TFUZZ.2014.2310238.
[17] Spall J, Cristion J. Model-free control of nonlinear stochastic systems with discrete-time measurements. IEEE Transactions on Automatic Control 1998;43(9):1198–210.

[18] Vrabie D, Lewis F. Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems. Neural Networks 2009;22(3):237–46.
[19] Vrabie D, Pastravanu O, Abu-Khalaf M, Lewis F. Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica 2009;45(2):477–84.
[20] Lewis F, Vamvoudakis K. Reinforcement learning for partially observable dynamic processes: adaptive dynamic programming using measured output data. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 2011;41(1):14–23.
[21] Jiang Y, Jiang Z. Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics. Automatica 2012;48(10):2699–704.
[22] Zhang H, Cui L, Zhang X, Luo Y. Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method. IEEE Transactions on Neural Networks 2011;22(12):2226–36.
[23] Li Y, Tong S, Li T. Adaptive fuzzy backstepping control of static var compensator based on state observer. Nonlinear Dynamics 2013;73(1–2):133–42.
[24] Saeks R, Cox C. Adaptive critic control and functional link networks. In: Proceedings of the IEEE Conference on Systems, Man, and Cybernetics; 1998. p. 1652–7.
[25] Cox C, Stepniewski S, Jorgensen C, Saeks R, Lewis C. On the design of a neural network autolander. International Journal of Robust and Nonlinear Control 1999;9(14):1071–96.
[26] Tong S, Li Y, Zhang H. Adaptive neural network decentralized backstepping output-feedback control for nonlinear large-scale systems with time delays. IEEE Transactions on Neural Networks 2011;22(7):1073–86.
[27] Modares H, Sistani M-B, Lewis F. A policy iteration approach to online optimal control of continuous-time constrained-input systems. ISA Transactions 2013;52(5):611–21.
[28] Bradtke S, Ydstie B, Barto A. Adaptive linear quadratic control using policy iteration. In: Proceedings of the American Control Conference; 1994. 3(3): p. 3475–9.
[30] Hornik K, Stinchcombe M, White M. Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks 1990;3:551–60.
[31] Jadbabaie A. Receding horizon control of nonlinear systems: a control Lyapunov function approach [Ph.D. thesis]. Pasadena, CA: California Institute of Technology; 2000.
[32] Beard R, Saridis G, Wen J. Galerkin approximations of the generalized Hamilton–Jacobi–Bellman equation. Automatica 1997;33(11):2159–77.
[33] Khan S, Herrmann G, Lewis F, Pipe T, Melhuish C. Reinforcement learning and optimal adaptive control: an overview and implementation examples. Annual Reviews in Control 2012;36(1):42–59.
[34] Padhi R, Unnikrishnan N, Wang X, Balakrishnan S. A single network adaptive critic (SNAC) architecture for optimal control synthesis for a class of nonlinear systems. Neural Networks 2006;19(10):1648–60.
[35] Hua C, Wang Q, Guan X. Robust adaptive controller design for nonlinear time-delay systems via T–S fuzzy approach. IEEE Transactions on Fuzzy Systems 2009;17(4):901–10.
[36] Hua C, Wang Q, Guan X. Adaptive fuzzy output-feedback controller design for nonlinear time-delay systems with unknown control direction. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 2009;39(2):363–74.

Jilie Zhang was born in Fushun, China, in 1984. He received the B.S. degree in automation from Liaoning University of Technology, Jinzhou, China, in 2007, the M.S. degree in control theory and control engineering from Kunming University of Science and Technology, Kunming, China, in 2010, and the Ph.D. degree from the College of Information Science and Engineering, Northeastern University, PR China, in 2014. His main research interests include fault diagnosis, approximate dynamic programming, reinforcement learning, game theory and multi-agent systems.

Huaguang Zhang received the B.S. and M.S. degrees in control engineering from Northeast Dianli University of China, Jilin City, China, in 1982 and 1985, respectively, and the Ph.D. degree in thermal power engineering and automation from Southeast University, Nanjing, China, in 1991. He joined the Department of Automatic Control, Northeastern University, Shenyang, China, in 1992, as a Postdoctoral Fellow for two years. Since 1994, he has been a Professor and Head of the Institute of Electric Automation, School of Information Science and Engineering, Northeastern University, Shenyang, China. His main research interests are fuzzy control, stochastic system control, neural-network-based control, nonlinear control, and their applications. He has authored and coauthored over 280 journal and conference papers and six monographs, and has co-invented 90 patents. He is the Chair of the Adaptive Dynamic Programming and Reinforcement Learning Technical Committee of the IEEE Computational Intelligence Society. He is an Associate Editor of Automatica, IEEE Transactions on Neural Networks, IEEE Transactions on Cybernetics, and Neurocomputing, and was an Associate Editor of IEEE Transactions on Fuzzy Systems (2008–2013). He was awarded the Outstanding Youth Science Foundation Award from the National Natural Science Foundation Committee of China in 2003, was named Cheung Kong Scholar by the Education Ministry of China in 2005, and is a recipient of the IEEE Transactions on Neural Networks 2012 Outstanding Paper Award.

Zhenwei Liu received the B.S. degree in electrical engineering and automation from the Changchun Institute of Technology, Changchun, China, in 2004. He is currently pursuing the Ph.D. degree in control theory and control engineering with Northeastern University, Shenyang, China. His current research interests include stability analysis and control of delayed systems, neural networks-based control, and nonlinear control.

Yingchun Wang was born in Liaoning Province, China, in 1974. He received the B.S., M.S., and Ph.D. degrees from Northeastern University, Shenyang, China, in 1997, 2003, and 2006, respectively. He was an Associate Professor in the School of Information Science and Engineering, Northeastern University. His research interests include network control, complex systems, fuzzy control and fuzzy systems, stochastic control, neural networks, and cooperation control.
