IEEE TRANSACTIONS ON CYBERNETICS, VOL. 44, NO. 12, DECEMBER 2014

Finite-Approximation-Error-Based Discrete-Time Iterative Adaptive Dynamic Programming

Qinglai Wei, Member, IEEE, Fei-Yue Wang, Fellow, IEEE, Derong Liu, Fellow, IEEE, and Xiong Yang

Abstract—In this paper, a new iterative adaptive dynamic programming (ADP) algorithm is developed to solve optimal control problems for infinite-horizon discrete-time nonlinear systems with finite approximation errors. First, a new generalized value iteration algorithm of ADP is developed to make the iterative performance index function converge to the solution of the Hamilton–Jacobi–Bellman equation. The generalized value iteration algorithm permits an arbitrary positive semi-definite function to initialize it, which overcomes the disadvantage of traditional value iteration algorithms. When the iterative control law and iterative performance index function cannot be obtained accurately in each iteration, a new "design method of the convergence criteria" for the finite-approximation-error-based generalized value iteration algorithm is established for the first time. A suitable approximation error can be designed adaptively to make the iterative performance index function converge to a finite neighborhood of the optimal performance index function. Neural networks are used to implement the iterative ADP algorithm. Finally, two simulation examples are given to illustrate the performance of the developed method.

Index Terms—Adaptive critic designs, adaptive dynamic programming (ADP), approximate dynamic programming, approximation error, neural networks, neuro-dynamic programming, nonlinear systems, optimal control, reinforcement learning, value iteration.

I. INTRODUCTION

Optimal control of nonlinear systems has been the focus of control fields for many decades [1]–[11]. Dynamic programming has been a useful technique in handling optimal control problems for many years, though it is often computationally untenable to obtain the optimal solutions with it directly [12]. Characterized by strong self-learning and adaptive abilities, adaptive dynamic programming (ADP), proposed by Werbos [13], [14], has demonstrated the capability to find the optimal control policy and solve the Hamilton–Jacobi–Bellman (HJB) equation

Manuscript received February 9, 2014; revised May 26, 2014 and July 29, 2014; accepted July 31, 2014. Date of publication September 26, 2014; date of current version November 13, 2014. This work was supported in part by the National Natural Science Foundation of China under Grant 61034002, Grant 61233001, Grant 61273140, Grant 61304086, Grant 61374105, and Grant 71232006, in part by Beijing Natural Science Foundation under Grant 4132078, and in part by the Early Career Development Award of SKLMCCS. This paper was recommended by Associate Editor S. X. Yang. The authors are with the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCYB.2014.2354377

in a practical way [15]–[20]. In [21]–[26], data-driven ADP algorithms were developed to design the optimal control, where mathematical models of the control systems were unnecessary. In [27]–[29], hierarchical ADP with multiple-goal representation networks was investigated, and the implementation efficiency of ADP was improved. Iterative methods are primary tools in ADP for obtaining the solution of the HJB equation indirectly and have received more and more attention [30]–[37]. Policy iteration algorithms for optimal control of continuous-time nonlinear systems were given in [38] and [39]. In 2014, Liu and Wei [40] developed a policy iteration algorithm for discrete-time nonlinear systems. Value iteration algorithms for optimal control of discrete-time nonlinear systems were given in [41]. Convergence of the value iteration algorithm was proven in [42]–[44]. In [45]–[47], value iteration algorithms with discount factors were discussed, where convergence rates of the value iteration algorithms were derived to improve the convergence of the iterative performance index function. Value iteration algorithms of ADP avoid the initial admissible control law required by policy iteration algorithms and hence have gained much attention [42]–[51], while most previous value iteration algorithms choose a zero performance index function, i.e., V0(xk) ≡ 0, to initialize the algorithm. On the other hand, most previous research on iterative ADP algorithms required the accurate iterative control law and accurate iterative performance index function to guarantee the effectiveness of the proposed ADP algorithms. These can be called "accurate iterative ADP algorithms." However, for most real-world control systems, the accurate iterative control laws of the iterative ADP algorithms cannot be obtained. Hence, it is necessary to investigate ADP algorithms with approximation errors.
How to analyze and design the approximation errors to guarantee the convergence of iterative ADP algorithms is the key to their real-world implementation. Although convergence properties of neural networks for implementing ADP algorithms were discussed in several papers [52]–[54], the resulting convergence criteria of neural networks were generally difficult to obtain. In [55], based on the iterative θ-ADP algorithm, an "error bound" analysis method was proposed to obtain a uniform convergence criterion. To the best of our knowledge, all the convergence criteria in the existing literature are difficult to obtain, and there is no discussion of how to design a convergence criterion that makes iterative ADP algorithms converge. This motivates our research. In this paper, a new finite-approximation-error-based discrete-time generalized value iteration algorithm will be

2168-2267 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

constructed. Two contributions are emphasized. First, the new generalized value iteration algorithm is developed to overcome the disadvantage of traditional value iteration algorithms. Next, considering approximation errors, the convergence properties of the finite-approximation-error-based generalized value iteration algorithm are analyzed. We emphasize that, for the first time, a new "design method of the convergence criteria" for the finite-approximation-error-based generalized value iteration algorithm is established. It permits the developed generalized value iteration algorithm to design a suitable approximation error adaptively so that the iterative performance index functions converge to a finite neighborhood of the optimal performance index function.

II. PROBLEM STATEMENT

In this paper, we will study the following discrete-time nonlinear system:

xk+1 = F(xk, uk), k = 0, 1, 2, . . .   (1)

where xk ∈ Ωx is the n-dimensional state vector and uk ∈ Ωu is the m-dimensional control vector. Let Ωx and Ωu be the domains of definition for the state and control, defined as Ωx = {xk | xk ∈ Rn and ‖xk‖ < ∞} and Ωu = {uk | uk ∈ Rm and ‖uk‖ < ∞}, respectively, where ‖·‖ denotes the Euclidean norm. Let x0 be the initial state and F(xk, uk) be the system function. Let u̲k = (uk, uk+1, . . .) be an arbitrary sequence of controls from k to ∞. The performance index function for state x0 under the control sequence u̲0 = (u0, u1, . . .) is defined as

J(x0, u̲0) = Σ_{k=0}^{∞} U(xk, uk)   (2)

where U(xk, uk) > 0 for ∀ xk, uk ≠ 0 is the stage cost function. The goal of this paper is to find an optimal control scheme that stabilizes system (1) and simultaneously minimizes the performance index function (2). For convenience of analysis, the results of this paper are based on the following assumptions.

Assumption 1: System (1) is controllable.
Assumption 2: The system state xk = 0 is an equilibrium state of system (1) under the control uk = 0, i.e., F(0, 0) = 0.
Assumption 3: The feedback control uk = u(xk) satisfies uk = u(xk) = 0 for xk = 0.
Assumption 4: The stage cost function U(xk, uk) is a positive definite function of xk and uk.

Define the control sequence set as U̲k = {u̲k : u̲k = (uk, uk+1, . . .), ∀ uk+i ∈ Rm, i = 0, 1, . . .}. Then, for an arbitrary control sequence u̲k ∈ U̲k, the optimal performance index function can be defined as

J*(xk) = inf_{u̲k} { J(xk, u̲k) : u̲k ∈ U̲k }.   (3)

According to Bellman's principle of optimality, J*(xk) satisfies the discrete-time HJB equation

J*(xk) = inf_{uk} { U(xk, uk) + J*(F(xk, uk)) }.   (4)

Then, the optimal control law can be expressed as

u*(xk) = arg inf_{uk} { U(xk, uk) + J*(F(xk, uk)) }.   (5)

Hence, the HJB equation (4) can be written as

J*(xk) = U(xk, u*(xk)) + J*(F(xk, u*(xk))).   (6)

Generally, J*(xk) is a highly nonlinear and nonanalytic function, which makes the optimal control nearly impossible to obtain by directly solving the HJB equation (6). In this paper, a generalized value iteration ADP algorithm will be developed to obtain the optimal solution of the HJB equation indirectly.

III. PROPERTIES OF THE FINITE-APPROXIMATION-ERROR-BASED GENERALIZED ITERATIVE ADP ALGORITHM

In this section, a new finite-approximation-error-based generalized value iteration algorithm is developed. A new convergence analysis method will be established, and a new design method of the convergence criteria will also be developed.

A. Derivation of the Finite-Approximation-Error-Based Generalized Value Iteration Algorithm

In the developed generalized value iteration algorithm, the performance index function and control law are updated by iterations, with the iteration index i increasing from 0 to ∞. For all xk ∈ Ωx, let the initial function Ψ(xk) ≥ 0 be an arbitrary positive semi-definite function, and let the initial performance index function be V̂0(xk) = Ψ(xk). The iterative control law v̂0(xk) can be computed as

v̂0(xk) = arg min_{uk} { U(xk, uk) + V̂0(xk+1) } + ρ0(xk)   (7)

where V̂0(xk+1) = Ψ(xk+1), and the performance index function can be updated as

V̂1(xk) = U(xk, v̂0(xk)) + V̂0(F(xk, v̂0(xk))) + π0(xk)   (8)

where ρ0(xk) and π0(xk) are the approximation error functions of the iterative control law and iterative performance index function, respectively. For i = 1, 2, . . . , the iterative ADP algorithm will iterate between

v̂i(xk) = arg min_{uk} { U(xk, uk) + V̂i(xk+1) } + ρi(xk)   (9)

and

V̂i+1(xk) = U(xk, v̂i(xk)) + V̂i(F(xk, v̂i(xk))) + πi(xk)   (10)

where ρi(xk) and πi(xk) are finite approximation error functions of the iterative control law and performance index function, respectively. For convenience of analysis, for ∀ i = 0, 1, . . . , we assume that ρi(xk) = 0 and πi(xk) = 0 for xk = 0.

B. Special Case of the Accurate Generalized Value Iteration Algorithm

In this subsection, we first discuss the properties of the generalized value iteration ADP algorithm when the iterative performance index function and the iterative control law can be obtained accurately in each iteration. For ∀ i = 0, 1, . . . , the generalized value iteration algorithm reduces to the following equations:

vi(xk) = arg min_{uk} { U(xk, uk) + Vi(F(xk, uk)) }
Vi+1(xk) = min_{uk} { U(xk, uk) + Vi(F(xk, uk)) }   (11)
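To make the accurate recursion (11) concrete, here is a minimal numerical sketch. The scalar system xk+1 = 0.9 xk + uk, the quadratic stage cost U(x, u) = x² + u², the grids, and the initial function Ψ(x) = 5x² are invented for illustration and are not taken from the paper; the grids simply make the minimization over uk exact.

```python
# A minimal sketch of the accurate generalized value iteration (11).
# The scalar system x_{k+1} = 0.9 x_k + u_k and the stage cost
# U(x, u) = x^2 + u^2 are illustrative assumptions, not from the paper.
# State and control are discretized so the minimization over u_k is exact.

N = 101
X = [-1.0 + 2.0 * i / (N - 1) for i in range(N)]   # state grid
U = X                                              # control grid

def F(x, u):                     # system function
    return 0.9 * x + u

def cost(x, u):                  # positive definite stage cost U(x, u)
    return x * x + u * u

def interp(V, x):
    """Piecewise-linear interpolation of V on the grid X (clamped)."""
    if x <= X[0]:
        return V[0]
    if x >= X[-1]:
        return V[-1]
    t = (x - X[0]) / (X[1] - X[0])
    j = int(t)
    return V[j] + (t - j) * (V[j + 1] - V[j])

def value_iteration(V0, n_iter=50):
    """Iterate V_{i+1}(x) = min_u { U(x, u) + V_i(F(x, u)) } as in (11)."""
    V = list(V0)
    for _ in range(n_iter):
        V = [min(cost(x, u) + interp(V, F(x, u)) for u in U) for x in X]
    return V

# Theorem 1: any positive semi-definite initialization converges to J*,
# so the traditional start V_0 = 0 and a generalized start V_0 = Psi
# reach the same limit.
V_zero = value_iteration([0.0] * N)                 # traditional VI
V_psi = value_iteration([5.0 * x * x for x in X])   # generalized VI
print(max(abs(a - b) for a, b in zip(V_zero, V_psi)))
```

Both runs settle on the same grid fixed point, illustrating that the zero initialization of traditional value iteration is just the special case Ψ ≡ 0.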

where V0(xk) = Ψ(xk) is a positive semi-definite function.

Remark 1: In [42], for V0(xk) ≡ 0, it was proven that the iterative performance index function in (11) is monotonically nondecreasing and converges to the optimum. In [45]–[47], the convergence property of value iteration for the discounted case was investigated, and it was proven that the iterative performance index function converges to the optimum if the discount factor α satisfies 0 < α < 1. For the undiscounted performance index function (2) and a positive semi-definite initial function, i.e., α ≡ 1 and V0(xk) = Ψ(xk), the convergence analysis methods in [42]–[47] are invalid. Hence, a new convergence analysis method will be developed. In [44], [56], and [57], a "function bound" method was proposed for the traditional value iteration algorithm with the zero initial performance index function. Based on the previous contributions on value iteration algorithms [42]–[47], and inspired by [44], a new convergence analysis method for the generalized value iteration algorithm is developed in this subsection.

Lemma 1: Let Ψ(xk) ≥ 0 be an arbitrary positive semi-definite function. Under Assumptions 1–4, for i = 1, 2, . . . , the iterative performance index function Vi(xk) in (11) is positive definite for all xk ∈ Ωx.

Theorem 1: For i = 0, 1, . . . , let Vi(xk) and vi(xk) be updated by (11). Then the iterative performance index function Vi(xk) converges to J*(xk) as i → ∞.

Proof: First, according to Assumptions 1–4, we know that J*(xk) is a positive definite function of xk. Let ε > 0 be an arbitrarily small positive number. Define Ωε = {xk | xk ∈ Ωx, ‖xk‖ < ε}. For all xk ∈ Ωε, according to Lemma 1, for ∀ i = 0, 1, . . . , we have Vi(xk) → J*(xk) as ε → 0. Hence, the conclusion holds for all xk ∈ Ωε as ε → 0. On the other hand, for an arbitrarily small ε > 0, xk ∈ Ωx \ Ωε, and uk ∈ Ωu, we have 0 < U(xk, uk) < ∞, 0 ≤ V0(xk) < ∞, 0 < J*(xk) < ∞, and 0 ≤ J*(F(xk, uk)) < ∞. Hence, there exist constants γ, δ̲, and δ̄ that satisfy J*(F(xk, uk)) ≤ γ U(xk, uk) and δ̲ J*(xk) ≤ V0(xk) ≤ δ̄ J*(xk), where 0 < γ < ∞ and 0 ≤ δ̲ ≤ 1 ≤ δ̄. As J*(xk) is unknown, the values of γ, δ̲, and δ̄ cannot be obtained directly. In the following, we will prove that Vi(xk) converges to the optimum for arbitrary γ, δ̲, and δ̄, so that estimating the values of γ, δ̲, and δ̄ can be omitted. We will first prove that for ∀ i = 0, 1, . . . , the following inequality holds:

Vi(xk) ≥ (1 + (δ̲ − 1)/(1 + γ⁻¹)^i) J*(xk).   (12)

Mathematical induction is introduced to prove the conclusion. For i = 0, inequality (12) reduces to V0(xk) ≥ δ̲ J*(xk), which holds by the definition of δ̲. Next, for i = 1, we have

V1(xk) = min_{uk} { U(xk, uk) + V0(xk+1) }
       ≥ min_{uk} { (1 + γ(δ̲ − 1)/(1 + γ)) U(xk, uk) + (δ̲ − (δ̲ − 1)/(1 + γ)) J*(xk+1) }
       = (1 + (δ̲ − 1)/(1 + γ⁻¹)) J*(xk).   (13)

Assume that the conclusion holds for i = l, l = 1, 2, . . . . Then, for i = l + 1, we have

V_{l+1}(xk) = min_{uk} { U(xk, uk) + V_l(xk+1) }
           ≥ min_{uk} { U(xk, uk) + (1 + (δ̲ − 1)/(1 + γ⁻¹)^l) J*(xk+1) }
           ≥ (1 + (δ̲ − 1)/(1 + γ⁻¹)^{l+1}) J*(xk).   (14)

The mathematical induction is completed. On the other hand, following the idea of (13)–(14), for ∀ i = 0, 1, . . . , we can get

Vi(xk) ≤ (1 + (δ̄ − 1)/(1 + γ⁻¹)^i) J*(xk).

Letting i → ∞,

we can get the conclusion. ∎

Remark 2: If for ∀ xk we let the initial performance index function V0(xk) ≡ 0, then the generalized value iteration algorithm reduces to the traditional value iteration algorithms [42]–[44], [48]–[51], and the convergence of the traditional value iteration algorithms follows directly from Theorem 1 of this paper. Hence, the traditional value iteration algorithm is a special case of the generalized value iteration algorithm.

C. Properties of the Finite-Approximation-Error-Based Generalized Value Iteration Algorithm

Considering approximation errors, we generally have v̂i(xk) ≠ vi(xk) and V̂i+1(xk) ≠ Vi+1(xk), i = 0, 1, . . . , and the convergence property in Theorem 1 for the accurate generalized value iteration algorithm becomes invalid. Hence, in this subsection, a new "design method of the convergence criteria" will be established. It permits the developed generalized value iteration algorithm to design a suitable approximation error adaptively so that the iterative performance index function converges to a finite neighborhood of the optimal one. Define the target iterative performance index function as

Γi(xk) = min_{uk} { U(xk, uk) + V̂i−1(xk+1) }   (15)

where V̂0(xk) = Γ0(xk) = Ψ(xk). For ∀ i = 1, 2, . . . , let σi > 0 be a finite approximation error that satisfies

V̂i(xk) ≤ σi Γi(xk).   (16)

From (16), we can see that the target iterative performance index function Γi(xk) can be larger or smaller than V̂i(xk), and the convergence properties are different for different values of σi. As the convergence analysis methods for 0 < σi < 1 and σi ≥ 1 are different, the convergence properties for the two cases will be discussed separately.

Theorem 2: Let xk ∈ Ωx be an arbitrary controllable state and let Assumptions 1–4 hold. For ∀ i = 1, 2, . . . , let Γi(xk) be expressed as in (15) and V̂i(xk) be expressed as in (10). If for ∀ i = 1, 2, . . . , there exists 0 < σi < 1 that makes (16) hold, then the iterative performance index function converges to a bounded neighborhood of the optimal performance index function J*(xk).

Proof: If 0 < σi < 1, according to (16), we have 0 ≤ V̂i(xk) < Γi(xk). Using mathematical induction, we can prove

that for ∀ i = 1, 2, . . . , the following inequality holds:

0 < V̂i(xk) < Vi(xk).   (17)

According to Theorem 1, we have Vi(xk) → J*(xk). Then, for ∀ i = 0, 1, . . . , V̂i(xk) is upper bounded and

0 < lim_{i→∞} V̂i(xk) ≤ lim_{i→∞} Vi(xk) = J*(xk).   (18)

The proof is completed. ∎

Theorem 3: Let xk ∈ Ωx be an arbitrary controllable state and let Assumptions 1–4 hold. For ∀ i = 1, 2, . . . , let Γi(xk) be expressed as in (15) and V̂i(xk) be expressed as in (10). Let 0 < γi < ∞ be a constant that makes

Vi(F(xk, uk)) ≤ γi U(xk, uk)   (19)

hold. If for ∀ i = 1, 2, . . . , there exists 1 ≤ σi < ∞ that makes (16) hold uniformly, then we have

V̂i(xk) ≤ σi [ 1 + Σ_{j=1}^{i−1} σi−1 σi−2 · · · σi−j+1 (σi−j − 1) × (γi−1 γi−2 · · · γi−j) / ((γi−1 + 1)(γi−2 + 1) · · · (γi−j + 1)) ] Vi(xk)   (20)

where we define Σ_{j}^{i} (·) = 0 for ∀ j > i, i, j = 0, 1, . . . , and σi−1 σi−2 · · · σi−j+1 (σi−j − 1) = (σi−1 − 1) for j = 1.

Proof: The theorem can be proven by mathematical induction. First, let i = 1. We can easily obtain Γ1(xk) = V1(xk). According to (16), we have V̂1(xk) ≤ σ1 V1(xk). Thus, the conclusion holds for i = 1. Next, let i = 2. According to (15),

Γ2(xk) = min_{uk} { U(xk, uk) + V̂1(F(xk, uk)) }
       ≤ min_{uk} { (1 + γ1(σ1 − 1)/(γ1 + 1)) U(xk, uk) + (σ1 − (σ1 − 1)/(γ1 + 1)) V1(xk+1) }
       = (1 + γ1(σ1 − 1)/(γ1 + 1)) min_{uk} { U(xk, uk) + V1(xk+1) }
       = (1 + γ1(σ1 − 1)/(γ1 + 1)) V2(xk).   (21)

According to (16), we can obtain

V̂2(xk) ≤ σ2 (1 + γ1(σ1 − 1)/(γ1 + 1)) V2(xk)   (22)

which shows that (20) holds for i = 2. Assume that (20) holds for i = l − 1, where l = 2, 3, . . . . Then, for i = l, we can obtain (23). According to (16), we then obtain (20), which proves the conclusion for ∀ i = 0, 1, . . . . ∎

From (20), we can see that for i = 0, 1, . . . , there exists an error between V̂i(xk) and Vi(xk). As i → ∞, the bound of the accumulated approximation error may increase to infinity. Thus, in the following, we will establish the convergence of the iterative ADP algorithm (7)–(10) using the error-bound method. Before presenting the next theorem, the following lemma is necessary.

Lemma 2: Let {bi}, i = 1, 2, . . . , be a sequence of positive numbers, and let 0 < λi < ∞ be a bounded positive constant for

Γl(xk) = min_{uk} { U(xk, uk) + V̂l−1(F(xk, uk)) }
 ≤ min_{uk} { U(xk, uk) + σl−1 [ 1 + Σ_{j=1}^{l−2} σl−2 σl−3 · · · σl−j (σl−j−1 − 1) × (γl−2 γl−3 · · · γl−j−1) / ((γl−2 + 1)(γl−3 + 1) · · · (γl−j−1 + 1)) ] Vl−1(xk+1) }
 ≤ [ 1 + Σ_{j=1}^{l−1} σl−1 σl−2 · · · σl−j+1 (σl−j − 1) × (γl−1 γl−2 · · · γl−j) / ((γl−1 + 1)(γl−2 + 1) · · · (γl−j + 1)) ] min_{uk} { U(xk, uk) + Vl−1(xk+1) }
 = [ 1 + Σ_{j=1}^{l−1} σl−1 σl−2 · · · σl−j+1 (σl−j − 1) × (γl−1 γl−2 · · · γl−j) / ((γl−1 + 1)(γl−2 + 1) · · · (γl−j + 1)) ] Vl(xk)   (23)


∀ i = 1, 2, . . . , and let ai = λi bi. If Σ_{i=1}^{∞} bi is finite, then Σ_{i=1}^{∞} ai is finite.

Proof: Since λi is finite for ∀ i = 1, 2, . . . , we can let λ̄ = sup{λ1, λ2, . . .}. Then, Σ_{i=1}^{∞} ai = Σ_{i=1}^{∞} λi bi ≤ λ̄ Σ_{i=1}^{∞} bi, which is finite. ∎

Theorem 4: Let xk ∈ Ωx be an arbitrary controllable state and let Assumptions 1–4 hold. If for ∀ i = 1, 2, . . . , V̂i(xk) satisfies (20) and the approximation error σi+1 satisfies

1 ≤ σi+1 ≤ qi (γi + 1)/γi   (24)

where qi is an arbitrary constant that satisfies γi/(γi + 1) < qi < 1, then as i → ∞, the iterative performance index function V̂i(xk) of the generalized value iteration algorithm (7)–(10) converges to a bounded neighborhood of the optimal performance index function J*(xk).

Proof: For (20) in Theorem 3, if we let

Δi = Σ_{j=1}^{i−1} σi−1 σi−2 · · · σi−j+1 (σi−j − 1) × (γi−1 γi−2 · · · γi−j) / ((γi−1 + 1)(γi−2 + 1) · · · (γi−j + 1))   (25)

then we can get

Δi = Σ_{j=1}^{i−1} (σi−1 σi−2 · · · σi−j γi−1 γi−2 · · · γi−j) / ((γi−1 + 1)(γi−2 + 1) · · · (γi−j + 1))
   − Σ_{j=1}^{i−1} (σi−1 σi−2 · · · σi−j+1 γi−1 γi−2 · · · γi−j) / ((γi−1 + 1)(γi−2 + 1) · · · (γi−j + 1))   (26)

where we define σi−1 σi−2 · · · σi−j+1 = 1 for j = 1. If we let

aij = (σi−1 σi−2 · · · σi−j γi−1 γi−2 · · · γi−j) / ((γi−1 + 1)(γi−2 + 1) · · · (γi−j + 1))   (27)

and

bij = (σi−1 σi−2 · · · σi−j+1 γi−1 γi−2 · · · γi−j) / ((γi−1 + 1)(γi−2 + 1) · · · (γi−j + 1))   (28)

where i = 1, 2, . . . , and j = 1, 2, . . . , i − 1, then we have Δi = Σ_{j=1}^{i−1} aij − Σ_{j=1}^{i−1} bij. We know that if Σ_{j=1}^{i−1} aij and Σ_{j=1}^{i−1} bij are both finite as i → ∞, then lim_{i→∞} Δi is finite. According to (28), we have bij / bi(j−1) = σi−j+1 γi−j / (γi−j + 1). If bij / bi(j−1) ≤ qi−j < 1, then we can get σi−j+1 ≤ qi−j (γi−j + 1)/γi−j. Let ℓ = i − j. We can obtain σℓ+1 ≤ qℓ (γℓ + 1)/γℓ, where ℓ = 1, 2, . . . , i − 1. Letting i → ∞, we can obtain (24). Let q = sup{q1, q2, . . .}, where 0 < q < 1. We can get bij ≤ ((γi−1 + 1)/γi−1) q^{j−1}. Thus, we can obtain

Σ_{j=1}^{i−1} bij ≤ Σ_{j=1}^{i−1} ((γi−1 + 1)/γi−1) q^{j−1}.   (29)

As γi/(γi + 1) < q < 1 and γi−1 is finite for ∀ i = 1, 2, . . . , letting i → ∞, we have that lim_{i→∞} Σ_{j=1}^{i−1} bij is finite. On the other hand, for ∀ i = 1, 2, . . . and ∀ j = 1, 2, . . . , i − 1, we have aij = σi−j bij. As 1 ≤ σi−j < ∞ is finite for ∀ i and ∀ j, according to Lemma 2, lim_{i→∞} Σ_{j=1}^{i−1} aij must also be finite. Therefore, lim_{i→∞} Δi is finite. According to Theorem 1, we have lim_{i→∞} Vi(xk) = J*(xk). Hence, the iterative performance index function V̂i(xk) converges to a bounded neighborhood of the optimal performance index function J*(xk). The proof is completed. ∎

Remark 3: From Theorem 4, we can see that there is no constraint on the approximation error σ1 in (24). If we let the parameter γ0 satisfy

V0(F(xk, uk)) ≤ γ0 U(xk, uk)   (30)

where V0(F(xk, uk)) = Ψ(F(xk, uk)), and let the approximation error σ1 satisfy 1 ≤ σ1 < (γ0 + 1)/γ0, then the convergence criterion (24) is valid for ∀ i = 0, 1, . . . .

Combining Theorems 2 and 4, the convergence criterion of the finite-approximation-error-based generalized value iteration algorithm can be established.

Theorem 5: Let xk ∈ Ωx be an arbitrary controllable state and let Assumptions 1–4 hold. If for ∀ i = 0, 1, . . . , the inequality

0 < σi+1 ≤ qi (γi + 1)/γi   (31)

holds, where 0 < qi < 1 is an arbitrary constant, then as i → ∞, the iterative performance index function V̂i(xk) converges to a bounded neighborhood of the optimal performance index function J*(xk).

D. Design of the Convergence Criteria With Finite Approximation Errors

In the previous subsection, we discussed the convergence property of the generalized value iteration algorithm. From (31), for ∀ i = 0, 1, . . . , the admissible approximation error σi+1 is a function of γi. Hence, if we can obtain the parameter γi for i = 1, 2, . . . , then we can design an iterative approximation error σi+1 that guarantees the convergence of the iterative performance index function V̂i+1(xk). In the following, the detailed design method of the convergence criteria will be discussed. According to (19), we give the following definition. Define Γγi as

Γγi = {γi | γi U(xk, uk) ≥ Vi(F(xk, uk)), ∀ xk ∈ Ωx, ∀ uk ∈ Ωu}.   (32)

Because of the approximation errors, the accurate iterative performance index function Vi(xk) cannot be obtained in general. Hence, the parameter γi cannot be obtained directly from (19). In this subsection, several methods will be introduced for estimating γi.
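The role of the criterion (24)/(31) can be checked numerically: as long as each σi+1 stays below qi(γi + 1)/γi with qi < 1, the accumulated error factor Δi of (25) remains bounded as i grows. The constant sequences γi ≡ 2 and qi ≡ 0.9 below are assumed purely for illustration.

```python
# Numeric illustration of Theorems 4-5: if the per-iteration errors satisfy
# sigma_{i+1} <= q_i (gamma_i + 1)/gamma_i with 0 < q_i < 1, the accumulated
# error factor Delta_i from (25) remains bounded as i grows.  The sequences
# gamma_i and q_i here are assumptions for illustration only.

def delta(sigmas, gammas, i):
    """Delta_i = sum_{j=1}^{i-1} sigma_{i-1}...sigma_{i-j+1}(sigma_{i-j}-1)
       * gamma_{i-1}...gamma_{i-j} / ((gamma_{i-1}+1)...(gamma_{i-j}+1))."""
    total = 0.0
    for j in range(1, i):
        prod_sigma = 1.0
        for m in range(1, j):              # sigma_{i-1} ... sigma_{i-j+1}
            prod_sigma *= sigmas[i - m]
        prod_gamma = 1.0
        for m in range(1, j + 1):          # gamma_{i-1} ... gamma_{i-j}
            g = gammas[i - m]
            prod_gamma *= g / (g + 1.0)
        total += prod_sigma * (sigmas[i - j] - 1.0) * prod_gamma
    return total

gammas = [2.0] * 200                       # assumed bound gamma_i in (19)
q = 0.9                                    # q_i, satisfying gamma/(gamma+1) < q < 1
sigmas = [1.0] + [1.0 + 0.5 * (q * (g + 1.0) / g - 1.0) for g in gammas]
# each sigma_{i+1} sits halfway between 1 and the bound q*(gamma_i+1)/gamma_i

bounds = [delta(sigmas, gammas, i) for i in (10, 50, 150)]
print(bounds)      # increasing in i but saturating: the error stays bounded
```

The three values grow with i but converge to a finite limit, which is exactly the boundedness of Δi that Theorem 4 establishes.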


Algorithm 1 Method I for Estimating γi
1: Give an arbitrary admissible control law μ(xk) for nonlinear system (1);
2: Construct an iterative performance index function Pi(xk) by (34), i = 1, 2, . . . , where P0(xk) = V0(xk) = Ψ(xk);
3: Compute the constant γ̃i by (35);
4: return γ̃i and let γi = γ̃i.
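Method I can be sketched in a few lines. The scalar system xk+1 = 0.9 xk + uk, the cost U(x, u) = x² + u², the initial function Ψ, and the stabilizing law μ(x) = −0.5 x are all hypothetical choices used only to illustrate the recursion (34) and the bound (35).

```python
# Sketch of Algorithm 1 (Method I): given an admissible control law mu(x),
# build P_i by the recursion (34) and take gamma_tilde_i as the smallest
# constant with gamma_tilde_i * U(x,u) >= P_i(F(x,u)) over sampled states
# and controls, cf. (35).  All numeric choices here are assumptions.

def F(x, u):
    return 0.9 * x + u

def cost(x, u):
    return x * x + u * u

def mu(x):                        # an admissible (stabilizing) control law
    return -0.5 * x

def P(i, x):
    """P_i(x) = U(x, mu(x)) + P_{i-1}(F(x, mu(x))), with P_0 = Psi."""
    if i == 0:
        return 5.0 * x * x        # Psi(x), the semi-definite initial function
    return cost(x, mu(x)) + P(i - 1, F(x, mu(x)))

def gamma_tilde(i, xs, us):
    """Smallest gamma with gamma * U(x,u) >= P_i(F(x,u)) on the samples.
       States x = 0 are excluded because U(0,0) = 0."""
    return max(P(i, F(x, u)) / cost(x, u) for x in xs for u in us)

xs = [x / 10.0 for x in range(-10, 11) if x != 0]
us = [u / 10.0 for u in range(-10, 11)]
g1 = gamma_tilde(1, xs, us)
print(g1)                         # an estimate gamma_tilde_1 in Gamma_gamma_1
```

Since Pi(xk) upper-bounds Vi(xk) (Lemma 3), any γ̃i valid for Pi is also valid for Vi, which is the content of Theorem 6.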

Algorithm 2 Method II for Estimating γi
1: For ∀ i = 1, 2, . . . , obtain V̂i(xk) by (7)–(10);
2: Obtain the corrected iterative performance index function V̄i(xk) by (43);
3: Let V̂i(xk) = V̄i(xk);
4: Obtain the parameter γ̂i by (41);
5: return γ̂i and let γi = γ̂i.
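Method II can likewise be sketched for a single update step: an inexact update (10) whose error may be negative is corrected as in (43) by adding back |π|, so the corrected value never falls below the accurate one and the estimate from (41) is valid. The random error model and all numeric choices are assumptions.

```python
import random

# Sketch of Method II (Algorithm 2): when the update error pi_0(x) may be
# negative, (43) adds |pi_0(x)| back so the corrected value function
# Vbar_1 = Vhat_1 + |pi_0| never under-shoots the accurate V_1, making the
# constant obtained from (41) a valid gamma estimate.  The scalar system,
# cost, and synthetic error model are illustrative assumptions.

random.seed(0)
US = [u / 50.0 for u in range(-100, 101)]          # control samples

def F(x, u):
    return 0.9 * x + u

def cost(x, u):
    return x * x + u * u

def V0(x):                       # Psi: positive semi-definite initialization
    return 5.0 * x * x

def pi0(x):                      # synthetic approximation error, sign varies
    return 0.01 * (2.0 * random.random() - 1.0) * x * x

def V1(x):                       # accurate update (11)
    return min(cost(x, u) + V0(F(x, u)) for u in US)

def V1_hat(x):                   # inexact update (10): may drop below V1
    return V1(x) + pi0(x)

def V1_bar(x):                   # corrected update (43): never below V1
    e = pi0(x)
    return V1(x) + e + abs(e)

XS = [x / 10.0 for x in range(-10, 11) if x != 0]  # nonzero state samples
gamma_hat1 = max(V1_bar(F(x, u)) / cost(x, u) for x in XS for u in US)
print(gamma_hat1)
```

Because e + |e| ≥ 0 for any sign of the error e, V̄1 ≥ V1 holds pointwise, which is what makes the γ̂ estimate safe without knowing the accurate V1.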

Lemma 3: Let νi(xk) be an arbitrary control law. Define Vi+1(xk) as in (11) and define Λi+1(xk) as

Λi+1(xk) = U(xk, νi(xk)) + Λi(xk+1).   (33)

If V0(xk) = Λ0(xk) = Ψ(xk), then for ∀ i = 1, 2, . . . , we have Vi(xk) ≤ Λi(xk).

Theorem 6: Let μ(xk) be an arbitrary admissible control law of nonlinear system (1), and let

Pi+1(xk) = U(xk, μ(xk)) + Pi(xk+1)   (34)

where P0(xk) = V0(xk) = Ψ(xk). If there exists a constant γ̃i that satisfies

γ̃i U(xk, uk) ≥ Pi(F(xk, uk))   (35)

then we have γ̃i ∈ Γγi.

Proof: As μ(xk) is an admissible control law, according to Lemma 3, we have Pi(xk) ≥ Vi(xk). If γ̃i satisfies (35), then we can get

γ̃i U(xk, uk) ≥ Pi(F(xk, uk)) ≥ Vi(F(xk, uk)).   (36)

The proof is completed. ∎

From Theorem 6, the first estimation method for the parameter γi can be described as in Algorithm 1.

Remark 4: Theorem 6 shows that if we find an arbitrary admissible control law μ(xk), then for ∀ i = 1, 2, . . . , the parameter γ̃i can be estimated and the convergence criterion can be provided. The merit of this method is the convenience of estimating γ̃i. However, an admissible control law μ(xk) of system (1) is generally difficult to obtain; only in [40] is a neural-network-based method presented to obtain the admissible control law μ(xk) by experiments. Thus, Algorithm 1 is generally difficult to implement, and more effective methods will be developed.

Lemma 4: For ∀ i = 1, 2, . . . , let Γi(xk) be expressed as in (15) and V̂i(xk) be expressed as in (10). If for ∀ i = 1, 2, . . . , the inequality

V̂i(xk) ≥ Γi(xk)   (37)

holds, then we have

V̂i(xk) ≥ Vi(xk)   (38)

where Vi(xk) is defined in (11).

Proof: This conclusion can be proven by mathematical induction. First, consider i = 1. We have

V̂1(xk) ≥ Γ1(xk) = min_{uk} { U(xk, uk) + V0(xk+1) } = V1(xk).   (39)

So, the conclusion holds for i = 1. Assume that the conclusion holds for i = l − 1, l = 2, 3, . . . . Then, for i = l, we can obtain

V̂l(xk) ≥ Γl(xk) = min_{uk} { U(xk, uk) + V̂l−1(xk+1) } ≥ min_{uk} { U(xk, uk) + Vl−1(xk+1) } = Vl(xk).   (40)

The conclusion is proven. ∎

We can now derive the following theorem.

Theorem 7: For ∀ i = 0, 1, . . . , let V̂i(xk) be obtained by (7)–(10). If for ∀ i = 0, 1, . . . , the approximation error of the iterative performance index function satisfies πi(xk) ≥ 0, and there exists γ̂i that satisfies

γ̂i U(xk, uk) ≥ V̂i(F(xk, uk))   (41)

then we have γ̂i ∈ Γγi.

Proof: Consider the generalized value iteration algorithm (7)–(10). Because of the approximation error ρ0(xk) in the iterative control law, for i = 1 we have

Γ1(xk) = min_{uk} { U(xk, uk) + V0(xk+1) } ≤ U(xk, v̂0(xk)) + V̂0(xk+1)   (42)

where V̂0(xk) = Γ0(xk) = Ψ(xk). If π0(xk) ≥ 0, then we can get V̂1(xk) ≥ V1(xk). Using mathematical induction, we can prove that V̂i(xk) ≥ Vi(xk) holds for ∀ i = 1, 2, . . . . According to (41), we can then get γ̂i U(xk, uk) ≥ Vi(F(xk, uk)), which proves γ̂i ∈ Γγi. ∎

From Theorem 7, we can see that if we can force the approximation error πi(xk) ≥ 0 for ∀ i = 0, 1, . . . , then the parameter γi can be estimated. An effective method is to add another approximation error |πi(xk)| to the iterative performance index function V̂i+1(xk). The new iterative performance index function can be defined as

V̄i+1(xk) = U(xk, v̂i(xk)) + V̂i(xk+1) + πi(xk) + |πi(xk)| = V̂i+1(xk) + |πi(xk)|.   (43)

Hence, the second estimation method for the parameter γi can be described as in Algorithm 2.

Remark 5: In Algorithm 2, we use the corrected iterative performance index function V̄i(xk) to obtain the parameter γ̂i by (41). According to Lemma 4, we have V̄i(xk) ≥ Vi(xk). Hence, we can get γ̂i ∈ Γγi, and the admissible control law μ(xk) is unnecessary. This is a merit of Algorithm 2. However, introducing the additional error |πi(xk)| may enlarge the difference between V̄i(xk) and Γi(xk), which makes it possible for the iterative performance index function to diverge as i → ∞. This is


the disadvantage of Algorithm 2. In the following, another method is developed to overcome this difficulty. For ∀ i = 1, 2, . . . , there exists a finite constant 0 < σ i ≤ 1 that makes Vˆ i (xk ) ≥ σ i i (xk ).

(44)

If V̂_i(x_k) ≥ Γ_i(x_k), then we define σ_i = 1. We can derive the following results.

Lemma 5: Let x_k ∈ Ω_x be an arbitrary controllable state and let Assumptions 1–4 hold. Let V̂_0(x_k) = V_0(x_k) = Ψ(x_k). For ∀ i = 1, 2, …, let Γ_i(x_k) be expressed as in (15) and V̂_i(x_k) be expressed as in (10). Let 0 < γ_i < ∞ be a constant that satisfies (19). If for ∀ i = 1, 2, …, there exists 0 < σ_i ≤ 1 that satisfies (44), then we have

V̂_i(x_k) ≥ σ_i [ 1 + Σ_{j=1}^{i−1} σ_{i−1} σ_{i−2} ⋯ σ_{i−j+1} (σ_{i−j} − 1) × (γ_{i−1} γ_{i−2} ⋯ γ_{i−j}) / ((γ_{i−1} + 1)(γ_{i−2} + 1) ⋯ (γ_{i−j} + 1)) ] V_i(x_k)   (45)

where we define Σ_{j}^{i}(·) = 0 for ∀ j > i, i, j = 0, 1, …, and we let σ_{i−1} σ_{i−2} ⋯ σ_{i−j+1}(σ_{i−j} − 1) = (σ_{i−1} − 1) for j = 1.

Theorem 8: For ∀ i = 1, 2, …, let 0 < γ_i < ∞ be a constant that satisfies (19). Define a new iterative performance index function as

V̲_i(x_k) = σ_i [ 1 + Σ_{j=1}^{i−1} σ_{i−1} σ_{i−2} ⋯ σ_{i−j+1} (σ_{i−j} − 1) × (γ_{i−1} γ_{i−2} ⋯ γ_{i−j}) / ((γ_{i−1} + 1)(γ_{i−2} + 1) ⋯ (γ_{i−j} + 1)) ] V_i(x_k)   (46)

where 0 < σ_i ≤ 1. Then for ∀ i = 1, 2, …, we have

0 < V̲_i(x_k) ≤ V_i(x_k).   (47)

Proof: The conclusion can be proven by mathematical induction. First, let i = 1. We have V̲_1(x_k) = σ_1 V_1(x_k) ≤ V_1(x_k), which shows that (47) holds for i = 1. Assume that the inequality (47) holds for i = l − 1, l = 2, 3, …. Then for i = l, we can get

V̲_l(x_k) = σ_l [ 1 + Σ_{j=1}^{l−1} σ_{l−1} σ_{l−2} ⋯ σ_{l−j+1} (σ_{l−j} − 1) × (γ_{l−1} γ_{l−2} ⋯ γ_{l−j}) / ((γ_{l−1} + 1)(γ_{l−2} + 1) ⋯ (γ_{l−j} + 1)) ] V_l(x_k)
= σ_l [ 1 + (γ_{l−1} / (γ_{l−1} + 1)) ( σ_{l−1} ( 1 + Σ_{j=1}^{l−2} σ_{l−2} σ_{l−3} ⋯ σ_{l−j} (σ_{l−j−1} − 1) × (γ_{l−2} γ_{l−3} ⋯ γ_{l−j−1}) / ((γ_{l−2} + 1)(γ_{l−3} + 1) ⋯ (γ_{l−j−1} + 1)) ) − 1 ) ] V_l(x_k).   (48)

As for ∀ i = 1, 2, …, 0 < σ_i ≤ 1 and 0 < γ_i < ∞, we have

0 < σ_{l−1} ( 1 + Σ_{j=1}^{l−2} σ_{l−2} σ_{l−3} ⋯ σ_{l−j} (σ_{l−j−1} − 1) × (γ_{l−2} γ_{l−3} ⋯ γ_{l−j−1}) / ((γ_{l−2} + 1)(γ_{l−3} + 1) ⋯ (γ_{l−j−1} + 1)) ) ≤ 1.   (49)

Substituting (49) into (48), we can conclude that (47) holds for i = l. The proof is completed.

From Theorem 8, we can estimate the lower bound of the iterative performance index function V_i(x_k). Hence, we can derive the following theorem.

Theorem 9: For ∀ i = 1, 2, …, let V̂_i(x_k) be the iterative performance index function that satisfies (37). If for ∀ i = 1, 2, …, there exists γ̂_i that satisfies (41), then we have

γ_i = γ̂_i / δ_i   (50)

where

δ_i = σ_i [ 1 + Σ_{j=1}^{i−1} σ_{i−1} σ_{i−2} ⋯ σ_{i−j+1} (σ_{i−j} − 1) × (γ_{i−1} γ_{i−2} ⋯ γ_{i−j}) / ((γ_{i−1} + 1)(γ_{i−2} + 1) ⋯ (γ_{i−j} + 1)) ].   (51)

Proof: The conclusion can be proven by mathematical induction. First, let i = 1. According to (51), δ_1 = σ_1, and we have

γ̂_1 U(x_k, u_k) ≥ V̂_1(F(x_k, u_k)) ≥ σ_1 V_1(F(x_k, u_k))   (52)

which shows that γ_1 = γ̂_1 / σ_1. Assume that the conclusion holds for i = l − 1, l = 2, 3, …. For i = l, according to Lemma 5, we can get V̂_l(x_k) ≥ δ_l V_l(x_k). From (41), we can obtain

γ̂_l U(x_k, u_k) ≥ V̂_l(F(x_k, u_k)) ≥ δ_l V_l(F(x_k, u_k))   (53)

which shows that γ_l = γ̂_l / δ_l. The proof is completed.

The third estimation method of the parameter γ_i is described in Algorithm 3.

Algorithm 3 Method III for Estimating γ_i
1: For ∀ i = 1, 2, …, record σ_0, σ_1, …, σ_{i−1} and γ_0, γ_1, …, γ_{i−1};
2: Obtain V̂_i(x_k) and Γ_i(x_k) by (10) and by (15), respectively;
3: If V̂_i(x_k) ≥ Γ_i(x_k), then let σ_i = 1. Otherwise, obtain the approximation error σ_i by (44);
4: Obtain γ̂_i according to V̂_i(x_k) and U(x_k, u_k), and solve γ_i by (50);
5: return γ_i.

Remark 6: We can see that if we record σ_j and γ_j, j = 1, 2, …, i − 2, then we can compute δ_{i−1} by (51).
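This bookkeeping is direct to implement. A minimal sketch computing the coefficient δ_i of (45)/(51) from the recorded sequences (function and list names are illustrative, not from the paper):

```python
def delta(sigmas, gammas, i):
    """Coefficient delta_i = sigma_i * [1 + sum_{j=1}^{i-1} ...] from (45)/(51).

    sigmas[j] = sigma_j with 0 < sigma_j <= 1; gammas[j] = gamma_j with
    0 < gamma_j < inf. Both sequences must cover indices 0, ..., i.
    """
    bracket = 1.0
    for j in range(1, i):
        # sigma_{i-1} sigma_{i-2} ... sigma_{i-j+1} (sigma_{i-j} - 1);
        # for j = 1 this reduces to (sigma_{i-1} - 1).
        prod_sigma = 1.0
        for m in range(1, j):
            prod_sigma *= sigmas[i - m]
        prod_sigma *= sigmas[i - j] - 1.0
        # gamma_{i-1} ... gamma_{i-j} / ((gamma_{i-1} + 1) ... (gamma_{i-j} + 1))
        prod_gamma = 1.0
        for m in range(1, j + 1):
            prod_gamma *= gammas[i - m] / (gammas[i - m] + 1.0)
        bracket += prod_sigma * prod_gamma
    return sigmas[i] * bracket
```

With all σ_j = 1 (no approximation error) the sum vanishes and δ_i = 1, recovering exact value iteration; Theorem 8 guarantees 0 < δ_i ≤ 1 in general.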

Then we can obtain γ_{i−1} by (50). For ∀ i = 2, 3, …, the convergence criterion (31) can be justified. Hence, it is unnecessary to construct a new iterative performance index function to estimate the parameter γ_i. On the other hand, from the expression in (51), we know that 0 < δ_i ≤ 1 holds for ∀ i = 1, 2, …, which shows that the convergence criterion based on (50) is feasible. Therefore, it is recommended to use Algorithm 3 to estimate γ_i. In this paper, we give comparisons of these three methods.

E. Summary of the Finite-Approximation-Error-Based Generalized Value Iteration Algorithm

Now, we summarize the finite-approximation-error-based generalized value iteration algorithm in Algorithm 4.

Algorithm 4 Finite-Approximation-Error-Based Generalized Value Iteration Algorithm
Initialization:
Choose randomly an array of system states x_k in Ω_x;
Choose a positive semi-definite function Ψ(x_k) ≥ 0;
Choose a convergence precision ζ;
Give a sequence {q_i}, i = 0, 1, …, where 0 < q_i < 1;
Give two constants 0 < ς < 1 and 0 < ϖ < 1.
Iteration:
1: Let the iteration index i = 0;
2: Let V_0(x_k) = Ψ(x_k) and obtain γ_0 by (30);
3: Compute v̂_0(x_k) by (7) and obtain V̂_1(x_k) by (8);
4: Obtain σ_1 by (16). If σ_1 satisfies (24), then estimate γ_1 by Algorithm Υ, Υ = 1, 2, 3, let i = i + 1, and go to the next step. Otherwise, decrease ρ_0(x_k) and π_0(x_k), i.e., ρ_0(x_k) = ς ρ_0(x_k) and π_0(x_k) = ϖ π_0(x_k), respectively, and go to Step 3;
5: Compute v̂_i(x_k) by (9) and obtain V̂_{i+1}(x_k) by (10);
6: Obtain σ_{i+1} by (16). If σ_{i+1} satisfies (24), then estimate γ_{i+1} by Algorithm Υ, Υ = 1, 2, 3, and go to the next step. Otherwise, decrease ρ_i(x_k) and π_i(x_k), i.e., ρ_i(x_k) = ς ρ_i(x_k) and π_i(x_k) = ϖ π_i(x_k), respectively, and go to Step 5;
7: If |V̂_{i+1}(x_k) − V̂_i(x_k)| ≤ ζ, then the optimal performance index function is obtained; go to Step 8. Else let i = i + 1 and go to Step 5;
8: return v̂_i(x_k) and V̂_i(x_k).

Remark 7: Generally, in iterative ADP algorithms, the difference between V̂_i(x_k) and Γ_i(x_k) is obtained, that is

V̂_i(x_k) − Γ_i(x_k) = ε_i(x_k)   (54)

where ε_i(x_k) is the approximation error function. According to the definition of σ_i in (16) and the convergence criterion (24), we can easily obtain the following equivalent convergence criterion:

ε_i(x_k) ≤ (1 / (γ_{i−1} + 1)) V̂_i(x_k).   (55)

From (55), we can see that for different system states x_k, different approximation errors ε_i are required to guarantee the convergence of V̂_i(x_k). If ‖x_k‖ is large, then the developed generalized value iteration algorithm permits large approximation errors, and if ‖x_k‖ is small, then small approximation errors are required to ensure the convergence of the generalized value iteration algorithm. This is an important property for the neural network implementation of the finite-approximation-error-based generalized value iteration algorithm. For large ‖x_k‖, the approximation errors of the neural networks can be large. As ‖x_k‖ → 0, the approximation errors of the neural networks are also required to approach zero. As is known, the approximation of neural networks cannot be globally accurate with zero approximation error, so the implementation of the generalized value iteration algorithm by neural networks may be invalid as ‖x_k‖ → 0. On the other hand, in the real world, neural networks are generally trained under a globally uniform training precision or approximation error. Thus, the developed generalized value iteration algorithm requires a high training precision and small approximation errors, so that the iterative performance index function is convergent for most of the state space.

IV. SIMULATION STUDIES

To evaluate the performance of our finite-approximation-error-based generalized value iteration algorithm, we choose two examples for numerical experiments.

Example 1: In our first example, we consider a spring-mass-damper system [59]. The system function can be expressed by

M d²y/dt² + b dy/dt + κ y = u

where y is the position and u is the control input. Let M = 0.1 kg denote the mass of the object, let κ = 2 kgf/m be the stiffness coefficient of the spring, and let b = 0.1 be the wall friction. Let x_1 = y and x_2 = dy/dt. Discretizing the system function using the Euler method [60] with the sampling interval Δt = 0.1 s leads to

[x_{1(k+1)}; x_{2(k+1)}] = [1, Δt; −Δt κ/M, 1 − Δt b/M] [x_{1k}; x_{2k}] + [0; Δt/M] u_k.   (56)

Let the initial state be x_0 = [1, −1]^T. Let the performance index function be expressed by (2). The stage cost function is expressed as U(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k, where Q = R = I and I denotes the identity matrix with a suitable dimension. Let the state space be Ω_x = {x | −1 ≤ x_1 ≤ 1, −1 ≤ x_2 ≤ 1}. We choose 5000 states in Ω_x to implement the generalized value iteration algorithm to obtain the optimal control law. Neural networks are used to implement the developed algorithm. The critic and the action networks are both chosen as three-layer backpropagation (BP) neural networks with structures of 2–8–1 and 2–8–1, respectively. The implementations of ADP by neural networks can be found in [34], [40], and [58], and the details are omitted here. To illustrate the effectiveness of the algorithm, four different initial performance index functions are considered, expressed by Ψ_j(x_k) = x_k^T P_j x_k, j = 1, …, 4.
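As a numerical cross-check of this setup, the discretized model (56) and the core generalized value iteration update V_{i+1}(x_k) = min_u {U(x_k, u_k) + V_i(x_{k+1})} can be sketched on a coarse grid. The grid, the candidate-control set, and the nearest-neighbor value lookup below are illustrative stand-ins for the paper's neural-network approximators:

```python
import numpy as np

# Euler discretization of M*y'' + b*y' + kappa*y = u with dt = 0.1 (Example 1).
M, b, kappa, dt = 0.1, 0.1, 2.0, 0.1

xs = np.linspace(-1.0, 1.0, 21)          # grid per state dimension (Omega_x)
us = np.linspace(-3.0, 3.0, 61)          # candidate controls (Omega_u)
X1, X2 = np.meshgrid(xs, xs, indexing="ij")
h = xs[1] - xs[0]

def nearest(v):
    """Index of the grid point nearest to v, clipped to the grid."""
    return np.clip(np.round((v - xs[0]) / h).astype(int), 0, len(xs) - 1)

V = np.zeros((len(xs), len(xs)))         # V_0 = Psi_1 = 0 (traditional VI case)
for _ in range(30):                      # value iteration sweeps
    Vnew = np.full_like(V, np.inf)
    for u in us:
        # successor state under (56)
        n1 = X1 + dt * X2
        n2 = -dt * kappa / M * X1 + (1.0 - dt * b / M) * X2 + dt / M * u
        # U(x_k, u_k) = x'Qx + u'Ru with Q = R = I, then the Bellman minimum
        cost = X1**2 + X2**2 + u**2 + V[nearest(n1), nearest(n2)]
        Vnew = np.minimum(Vnew, cost)
    V = Vnew
```

The iterates are nondecreasing from V_0 = 0 and approach a tabular approximation of the optimal performance index function; the paper instead trains critic and action networks over 5000 sampled states.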


Fig. 1. Trajectories of γ's by Ψ_1(x_k)–Ψ_4(x_k): γ_I, γ_II, and γ_III by (a) Ψ_1(x_k), (b) Ψ_2(x_k), (c) Ψ_3(x_k), and (d) Ψ_4(x_k).

P_1 = 0, and P_2–P_4 are initialized by arbitrary positive definite matrices of the forms P_2 = [8.60, −0.66; −0.66, 3.09], P_3 = [1.06, −1.66; −1.66, 13.09], and P_4 = [60, 4; 4, 4], respectively. Let γ_I, γ_II, and γ_III denote the γ's obtained by Algorithms 1–3, respectively. Let q_i = 0.9999 for ∀ i = 0, 1, …, and let ς = ϖ = 0.5. For the linear system (56), it is easy to find an admissible control law μ(x_k) = [0.30, −0.30] x_k. Let the control space be Ω_u = {u | −3 ≤ u ≤ 3}. Initialized by Ψ_j(x_k), j = 1, …, 4, we implement the finite-approximation-error-based generalized value iteration algorithm. We choose 20 000 data in Ω_x × Ω_u to compute γ_I, γ_II, and γ_III. The corresponding trajectories are presented in Fig. 1(a)–(d), respectively. Define the admissible error as

ε̄_i(x_k) = (1 / (γ_{i−1} + 1)) V̂_i(x_k).   (57)

Fig. 2. Plots of the admissible errors corresponding to Ψ_1(x_k) and Ψ_2(x_k). Admissible errors by (a) γ_I and Ψ_1(x_k), (b) γ_II and Ψ_1(x_k), (c) γ_III and Ψ_1(x_k), (d) γ_I and Ψ_2(x_k), (e) γ_II and Ψ_2(x_k), and (f) γ_III and Ψ_2(x_k).

From (57), we can see that for different initial performance index functions Ψ_j(x_k), j = 1, 2, 3, 4, and for different iteration indices i, we obtain different V̂_i(x_k), which means that ε̄_i(x_k) will be different. As V̂_i(x_k) is a function of x_k, ε̄_i(x_k) is also a function of x_k. From Fig. 1, we can see that the parameters γ_I, γ_II, and γ_III obtained by Algorithms 1–3 are different. Hence, if we choose different Algorithms 1–3 to obtain γ_{i−1}, then the admissible errors ε̄_i(x_k) will be different. These properties are displayed in Figs. 2 and 3, respectively, where "In" denotes the initial iteration and "Lm" denotes the limiting iteration. According to γ_I, γ_II, and γ_III, the plots of the admissible errors corresponding to Ψ_1(x_k) are shown in Fig. 2(a)–(c), respectively, and the plots corresponding to Ψ_2(x_k) are shown in Fig. 2(d)–(f), respectively. According to γ_I, γ_II, and γ_III, the plots of the admissible errors corresponding to Ψ_3(x_k) are shown in Fig. 3(a)–(c), respectively, and the plots corresponding to Ψ_4(x_k) are shown in Fig. 3(d)–(f), respectively. Let V^{γ_I}, V^{γ_II}, and V^{γ_III} be the iterative performance index functions obtained by γ_I, γ_II, and γ_III, respectively.
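The state dependence of the bound (57) is straightforward to compute once γ_{i−1} and the approximated V̂_i are available; a minimal sketch (names illustrative):

```python
def admissible_error(v_hat, gamma_prev):
    """Admissible error (57): eps_bar_i(x_k) = V_hat_i(x_k) / (gamma_{i-1} + 1).

    The per-state approximation error eps_i(x_k) of the critic must stay
    below this bound (cf. (55)) for the iteration to remain convergent.
    """
    return v_hat / (gamma_prev + 1.0)

# States with larger V_hat_i tolerate larger errors; as V_hat_i -> 0 the
# admissible error shrinks to 0, which is why a uniform network training
# precision only guarantees convergence over most of the state space.
```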

Fig. 3. Plots of the admissible errors corresponding to Ψ_3(x_k) and Ψ_4(x_k). Admissible errors by (a) γ_I and Ψ_3(x_k), (b) γ_II and Ψ_3(x_k), (c) γ_III and Ψ_3(x_k), (d) γ_I and Ψ_4(x_k), (e) γ_II and Ψ_4(x_k), and (f) γ_III and Ψ_4(x_k).
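For later comparison, the optimal solution of the linear model (56) can be reproduced by iterating the Riccati recursion, to which value iteration reduces in the linear-quadratic case; a sketch (the iteration cap and tolerance are illustrative):

```python
import numpy as np

# Discretized spring-mass-damper model (56): x_{k+1} = A x_k + B u_k.
M, b, kappa, dt = 0.1, 0.1, 2.0, 0.1
A = np.array([[1.0, dt], [-dt * kappa / M, 1.0 - dt * b / M]])
B = np.array([[0.0], [dt / M]])
Q, R = np.eye(2), np.eye(1)

# Value iteration for the LQ case is the Riccati recursion
# P_{i+1} = Q + A'P_i(A - B K_i), K_i = (R + B'P_i B)^{-1} B'P_i A,
# whose fixed point P* gives J*(x_k) = x_k' P* x_k and u*(x_k) = -K x_k.
P = np.zeros((2, 2))                     # V_0(x) = 0, i.e., Psi_1
for _ in range(5000):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P_next = Q + A.T @ P @ (A - B @ K)
    if np.max(np.abs(P_next - P)) < 1e-12:
        P = P_next
        break
    P = P_next
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
```

The resulting P* and gain can be compared with the optimal values quoted in the text.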

The trajectories of the iterative performance index functions initialized by Ψ_1(x_k)–Ψ_4(x_k) are shown in Fig. 4(a)–(d), respectively. From Fig. 4, we can see that for different initial performance index functions, under γ_I, γ_II, and γ_III, the iterative performance index functions converge to a finite neighborhood of the optimal performance index function. The iterative states and iterative controls by Ψ_1(x_k) and γ_I are shown in Fig. 5(a) and (b), respectively, where we can see that the iterative states and controls are also convergent. For the linear system (56), the optimal performance index function can be expressed analytically as J*(x_k) = x_k^T P* x_k, where P*



Fig. 4. Trajectories of the iterative performance index functions. V^{γ_I}, V^{γ_II}, and V^{γ_III} by (a) Ψ_1(x_k), (b) Ψ_2(x_k), (c) Ψ_3(x_k), and (d) Ψ_4(x_k).


Fig. 5. Iterative trajectories and comparisons. (a) States by Ψ_1(x_k) and γ_I. (b) Control signals by Ψ_1(x_k) and γ_I. (c) Comparisons of states. (d) Comparisons of controls.

can be obtained by solving the discrete algebraic Riccati equation, i.e., P* = [26.61, 1.81; 1.81, 1.90], and the optimal control law can be expressed as u*(x_k) = [0.67, −0.65] x_k. The comparisons of the states and controls obtained by the developed algorithm with the optimal ones are shown in Fig. 5(c) and (d). Therefore, the effectiveness of the developed finite-approximation-error-based generalized value iteration algorithm is verified for linear systems. In the next example, the effectiveness of the developed algorithm will be illustrated for nonlinear systems.
Remark 8: If V_0(x_k) = Ψ_1(x_k), then the developed generalized value iteration algorithm reduces to the traditional value iteration algorithms [42]–[44], [48]–[51], and all the properties of traditional value iteration can be verified by the generalized value iteration algorithm. On the other hand, initialized by an arbitrary positive semi-definite function, the developed generalized value iteration algorithm can


also be implemented. Thus, the traditional value iteration algorithm is a special case of the developed generalized value iteration algorithm, which is a superiority of the developed algorithm.
Example 2: We now examine the performance of the developed algorithm on a torsional pendulum system [40], [55], [58]. The dynamics of the pendulum is as follows:

dθ/dt = ω
J dω/dt = u − M g l sin θ − f_d dθ/dt   (58)

where M = 1/3 kg and l = 2/3 m are the mass and length of the pendulum bar, respectively. The system states are the current angle θ and the angular velocity ω. Let J = (4/3) M l² and f_d = 0.2 be the rotary inertia and frictional factor, respectively, and let g = 9.8 m/s² be the gravitational acceleration. Discretizing the system function and the performance index function using the Euler and trapezoidal methods [60] with the sampling interval Δt = 0.1 s leads to

[x_{1(k+1)}; x_{2(k+1)}] = [0.1 x_{2k} + x_{1k}; −0.49 sin(x_{1k}) − 0.1 f_d x_{2k} + x_{2k}] + [0; 0.1] u_k   (59)

where x_{1k} = θ_k and x_{2k} = ω_k. Let the initial state be x_0 = [1, −1]^T. We choose 10 000 states in Ω_x. The critic and the action networks are also chosen as three-layer BP neural networks with structures of 2–8–1 and 2–8–1, respectively. To illustrate the effectiveness of the algorithm, we again choose four different initial performance index functions expressed by Ψ_j(x_k) = x_k^T P_j x_k, j = 1, …, 4. Let P_1 = 0 and let P_2–P_4 be initialized by the positive definite matrices P_2 = [2.35, 3.31; 3.31, 9.28], P_3 = [5.13, −5.72; −5.72, 15.13], and P_4 = [100.78, 5.96; 5.96, 20.51], respectively. Let q_i = 0.9999 for ∀ i = 0, 1, …, and let ς = ϖ = 0.5. We use the action network to obtain the admissible control law [40] with the expression μ(x_k) = W_initial φ(Y_initial x_k + b_initial), where φ(·) denotes the sigmoid activation function [40], [58].
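The discretized model (59) can be sketched directly as a step function, implementing (59) as printed rather than re-deriving the coefficients from (58):

```python
import numpy as np

fd, dt = 0.2, 0.1   # frictional factor and sampling interval from the text

def pendulum_step(x, u):
    """One step of the discretized torsional pendulum (59); x = (theta, omega)."""
    x1, x2 = x
    x1_next = x1 + dt * x2
    x2_next = x2 - 0.49 * np.sin(x1) - dt * fd * x2 + dt * u
    return np.array([x1_next, x2_next])
```

The origin x = (0, 0) with u = 0 is a fixed point of this map, consistent with the regulation objective.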
The weight matrices are obtained as

W_initial = [4.15, −0.14, 2.64, 0.93, −3.74, 0.39, −3.67, −0.67, −0.25, −2.59, 0.27, −1.58; 1.13, 1.40, 0.24, 1.13, −2.69, −4.77, 3.49, 4.91, 0.60, −3.55, 2.97, 0.22]^T

Y_initial = [0.01, 1.89, −0.12, 0.10, −0.02, 0.18, −0.01, −0.01, −1.75, 0.03, −0.00, 0.04]

and

b_initial = [−5.88, 0.81, −2.72, −2.07, −0.03, −0.23, −0.41, 1.59, 0.88, 1.97, −2.92, −4.28]^T.

Initialized by Ψ_j(x_k), j = 1, …, 4, we choose 50 000 data in Ω_x × Ω_u to compute γ_I, γ_II, and γ_III. The corresponding trajectories are presented in Fig. 6(a)–(d), respectively. According to γ_I, γ_II, and γ_III, the plots of the admissible errors corresponding to Ψ_1(x_k) are shown in Fig. 7(a)–(c), respectively. The plots of



Fig. 6. Trajectories of γ's by Ψ_1(x_k)–Ψ_4(x_k): γ_I, γ_II, and γ_III by (a) Ψ_1(x_k), (b) Ψ_2(x_k), (c) Ψ_3(x_k), and (d) Ψ_4(x_k).

Fig. 7. Plots of the admissible errors corresponding to Ψ_1(x_k) and Ψ_2(x_k). Admissible errors by (a) γ_I and Ψ_1(x_k), (b) γ_II and Ψ_1(x_k), (c) γ_III and Ψ_1(x_k), (d) γ_I and Ψ_2(x_k), (e) γ_II and Ψ_2(x_k), and (f) γ_III and Ψ_2(x_k).

the admissible errors corresponding to Ψ_2(x_k) are shown in Fig. 7(d)–(f), respectively. According to γ_I, γ_II, and γ_III, the plots of the admissible errors corresponding to Ψ_3(x_k) are shown in Fig. 8(a)–(c), respectively, and the plots corresponding to Ψ_4(x_k) are shown in Fig. 8(d)–(f), respectively. Initialized by Ψ_1(x_k)–Ψ_4(x_k), the trajectories of the iterative performance index functions V^{γ_I}, V^{γ_II}, and V^{γ_III} are shown in Fig. 9(a)–(d), respectively. The iterative control and state trajectories by Ψ_1(x_k) and γ_I are shown in Figs. 10(a) and 11(a), respectively; those by Ψ_2(x_k) and γ_II in Figs. 10(b) and 11(b); those by Ψ_3(x_k) and γ_III in Figs. 10(c) and 11(c); and those by Ψ_4(x_k) and γ_III in Figs. 10(d) and 11(d), respectively.

Fig. 8. Plots of the admissible errors corresponding to Ψ_3(x_k) and Ψ_4(x_k). Admissible errors by (a) γ_I and Ψ_3(x_k), (b) γ_II and Ψ_3(x_k), (c) γ_III and Ψ_3(x_k), (d) γ_I and Ψ_4(x_k), (e) γ_II and Ψ_4(x_k), and (f) γ_III and Ψ_4(x_k).


Fig. 9. Trajectories of the iterative performance index functions. V^{γ_I}, V^{γ_II}, and V^{γ_III} by (a) Ψ_1(x_k), (b) Ψ_2(x_k), (c) Ψ_3(x_k), and (d) Ψ_4(x_k).

From Figs. 9–11, we can see that for different initial performance index functions under γ_I, γ_II, and γ_III, the iterative performance index functions converge to a finite neighborhood of the optimal performance index function. The corresponding iterative controls and iterative states are also convergent. Therefore, the effectiveness of the developed finite-approximation-error-based generalized value iteration algorithm is shown for nonlinear systems. On the other hand, we should note that if the initial weights of the neural networks, such as the weights of the initial admissible control law, are changed, then the convergence property will also be different. Now, we again use the action network to obtain an admissible control law [40] with the expression μ′(x_k) = W′_initial φ(Y′_initial x_k + b′_initial), and the weight



Fig. 10. Iterative control signals. Trajectories by (a) Ψ_1(x_k) and γ_I, (b) Ψ_2(x_k) and γ_II, (c) Ψ_3(x_k) and γ_III, and (d) Ψ_4(x_k) and γ_III.


Fig. 11. Iterative state trajectories. Trajectories by (a) Ψ_1(x_k) and γ_I, (b) Ψ_2(x_k) and γ_II, (c) Ψ_3(x_k) and γ_III, and (d) Ψ_4(x_k) and γ_III.

matrices are obtained as

W′_initial = [−1.92, −2.35, 4.59, −0.44, −2.10, −3.97, 4.45, 4.27, −1.55, 4.89, 4.49, −2.81; 1.41, −4.61, −3.37, 2.99, −3.48, −2.39, 4.64, 1.53, 3.32, 3.81, 3.30, 4.24]^T

Y′_initial = [−0.064, 0.002, 0.013, 0.012, −0.101, 0.015, 0.008, 0.019, −0.178, −0.002, −0.123, −0.091]

and

b′_initial = [−4.87, 3.90, −3.09, 2.21, 1.35, 0.44, −0.45, −1.33, −2.16, 3.08, −3.92, −4.86]^T.

We choose Ψ_3(x_k) as the initial performance index function to facilitate our analysis. The trajectories of γ_I by μ(x_k) and γ′_I by μ′(x_k) are shown in Fig. 12(a). We can see that if the initial weights of the neural networks for the admissible control law are changed, then the parameter γ_I is also changed. For

Fig. 12. Comparisons for different μ(x_k). (a) γ_I and γ′_I by μ(x_k) and μ′(x_k). (b) V^{γ_I} and V^{γ′_I}. (c) System states by γ_I and γ′_I. (d) Optimal control by γ_I and γ′_I.

different γ_I, according to Theorem 4, we know that the iterative performance index function can be convergent under different approximation errors. The trajectories of the iterative performance index functions by γ_I and γ′_I are shown in Fig. 12(b), where we can see that for different initial weights of the neural networks for the admissible control law, the iterative performance index function converges to different values. Hence, the initial weights of the neural networks make the convergence properties different. The optimal trajectories of the system states and controls by γ_I and γ′_I are shown in Fig. 12(c) and (d), respectively.

V. CONCLUSION

In this paper, an effective generalized value iteration algorithm is developed to solve optimal control problems for infinite-horizon discrete-time nonlinear systems. The developed generalized value iteration algorithm of ADP permits an arbitrary positive semi-definite function to initialize the algorithm, which overcomes the disadvantage of traditional value iteration algorithms. When the iterative control law and the iterative performance index function cannot be obtained accurately in each iteration, a new "design method of the convergence criteria" for the finite-approximation-error-based generalized value iteration algorithm is established, for the first time, to make the iterative performance index functions converge to a finite neighborhood of the optimal performance index function. BP neural networks are used to implement the finite-approximation-error-based generalized value iteration algorithm. Finally, two simulation examples are given to illustrate the performance of the developed algorithm.

REFERENCES

[1] W. J. Rugh, "System equivalence in a class of nonlinear optimal control problems," IEEE Trans. Autom. Control, vol. 16, no. 2, pp. 189–194, Feb. 1971.


[2] R. R. Mohler and W. J. Kolodziej, “Optimal control of a class of nonlinear stochastic systems,” IEEE Trans. Autom. Control, vol. 26, no. 5, pp. 1048–1054, May 1981. [3] A. Sootla, A. Rantzer, and G. Kotsalis, “Multivariable optimizationbased model reduction,” IEEE Trans. Autom. Control, vol. 54, no. 10, pp. 2477–2480, Oct. 2009. [4] R. Ruiz-Cruz, E. N. Sanchez, F. Ornelas-Tellez, A. G. Loukianov, and R. G. Harley, “Particle swarm optimization for discrete-time inverse optimal control of a doubly fed induction generator,” IEEE Trans. Cybern., vol. 43, no. 6, pp. 1698–1709, Dec. 2013. [5] Z. Sunberg, S. Chakravorty, and R. S. Erwin, “Information space receding horizon control,” IEEE Trans. Cybern., vol. 43, no. 6, pp. 2255–2260, Dec. 2013. [6] Q. Wang, N. Sharma, M. Johnson, C. M. Gregory, and W. E. Dixon, “Adaptive inverse optimal neuromuscular electrical stimulation,” IEEE Trans. Cybern., vol. 43, no. 6, pp. 1710–1718, Dec. 2013. [7] C. Yang, Z. Li, and J. Li, “Trajectory planning and optimized adaptive control for a class of wheeled inverted pendulum vehicle models,” IEEE Trans. Cybern., vol. 43, no. 1, pp. 24–36, Jan. 2013. [8] U. Halder, S. Das, and D. Maity, “A cluster-based differential evolution algorithm with external archive for optimization in dynamic environments,” IEEE Trans. Cybern., vol. 43, no. 3, pp. 881–897, Jun. 2013. [9] P. Giselsson and A. Rantzer, “On feasibility, stability and performance in distributed model predictive control,” IEEE Trans. Autom. Control, vol. 59, no. 4, pp. 1031–1036, Apr. 2014. [10] A. L. Jennings and R. Ordonez, “Optimal inverse functions created via population-based optimization,” IEEE Trans. Cybern., vol. 44, no. 6, pp. 950–965, Jun. 2014. [11] A. Zhou, Y. Jin, and Q. Zhang, “A population prediction strategy for evolutionary dynamic multiobjective optimization,” IEEE Trans. Cybern., vol. 44, no. 1, pp. 40–53, Jan. 2014. [12] R. E. Bellman, Dynamic Programming. Princeton, NJ, USA: Princeton Univ. Press, 1957. [13] P. 
J. Werbos, “Advanced forecasting methods for global crisis warning and models of intelligence,” Gen. Syst. Yearbook, vol. 22, pp. 25–38, 1977. [14] P. J. Werbos, “A menu of designs for reinforcement learning over time,” in Neural Networks for Control, W. T. Miller, R. S. Sutton and P. J. Werbos, Eds. Cambridge, MA, USA: MIT Press, 1991, pp. 67–95. [15] Q. Wei and D. Liu, “Stable iterative adaptive dynamic programming algorithm with approximation errors for discrete-time nonlinear systems,” Neural Comput. Appl., vol. 24, no. 6, pp. 1355–1367, May 2014. [16] Q. Wei, D. Wang, and D. Zhang, “Dual iterative adaptive dynamic programming for a class of discrete-time nonlinear systems with timedelays,” Neural Comput. Appl., vol. 23, nos. 7–8, pp. 1851–1863, Dec. 2013. [17] F. L. Lewis and K. G. Vamvoudakis, “Reinforcement learning for partially observable dynamic processes: Adaptive dynamic programming using measured output data,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 41, no. 1, pp. 14–25, Jan. 2011. [18] D. Liu, Y. Zhang, and H. Zhang, “A self-learning call admission control scheme for CDMA cellular networks,” IEEE Trans. Neural Netw., vol. 16, no. 5, pp. 1219–1228, Sep. 2005. [19] T. Mori and S. Ishii, “Incremental state aggregation for value function estimation in reinforcement learning,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 41, no. 5, pp. 1407–1416, Oct. 2011. [20] D. Molina, G. K. Venayagamoorthy, J. Liang, and R. G. Harley, “Intelligent local area signals based damping of power system oscillations using virtual generators and approximate dynamic programming,” IEEE Trans. Smart Grid, vol. 4, no. 1, pp. 498–508, Jan. 2013. [21] B. Kiumarsi, F. L. Lewis, H. Modares, A. Karimpour, and M. B. Naghibi-Sistani, “Reinforcement Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics,” Automatica, vol. 50, no. 4, pp. 1167–1175, Apr. 2014. [22] Y. Jiang and Z. P. 
Jiang, “Robust adaptive dynamic programming and feedback stabilization of nonlinear systems,” IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 5, pp. 882–893, May 2014. [23] Q. Wei and D. Liu, “Data-driven neuro-optimal temperature control of water gas shift reaction using stable iterative adaptive dynamic programming,” IEEE Trans. Ind. Electron., vol. 61, no. 11, pp. 6399–6408, Nov. 2014. [24] Q. Wei and D. Liu, “Adaptive dynamic programming for optimal tracking control of unknown nonlinear systems with application to coal gasification,” IEEE Trans. Autom. Sci. Eng., DOI: 10.1109/TASE.2013.2284545


[25] H. Zhang, L. Cui, X. Zhang, and Y. Luo, “Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method,” IEEE Trans. Neural Netw., vol. 22, no. 12, pp. 2226–2236, Dec. 2011. [26] X. Zhong, H. He, H. Zhang, and Z. Wang, “Optimal control for unknown discrete-time nonlinear Markov jump systems using adaptive dynamic programming,” IEEE Trans. Neural Netw. Learn. Syst., DOI: 10.1109/TNNLS.2014.2305841 [27] R. A. C. Bianchi, M. F. Martins, C. H. C. Ribeiro, and A. H. R. Costa, “Heuristically-accelerated multiagent reinforcement learning,” IEEE Trans. Cybern., vol. 44, no. 2, pp. 252–265, Apr. 2014. [28] Z. Ni, H. He, and J. Wen, “Adaptive learning in tracking control based on the dual critic network design,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 6, pp. 913–928, Jun. 2013. [29] Z. Ni, H. He, J. Wen, and X. Xu, “Goal representation heuristic dynamic programming on maze navigation,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 12, pp. 2038–2050, Dec. 2013. [30] H. Xu and S. Jagannathan, “Stochastic optimal controller design for uncertain nonlinear networked control system via neuro dynamic programming,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 3, pp. 471–484, Mar. 2013. [31] A. Heydari and S. N. Balakrishnan, “Finite-horizon control-constrained nonlinear optimal control using single network adaptive critics,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 1, pp. 145–157, Jan. 2013. [32] Y. Jiang and Z. P. Jiang, “Robust adaptive dynamic programming and feedback stabilization of nonlinear systems,” IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 5, pp. 882–893, May 2014. [33] Q. Wei and D. Liu, “An iterative -optimal control scheme for a class of discrete-time nonlinear systems with unfixed initial state,” Neural Netw., vol. 32, no. 6, pp. 236–244, Aug. 2012. [34] Q. Wei and D. 
Liu, “A novel iterative θ -Adaptive dynamic programming for discrete-time nonlinear systems,” IEEE Trans. Autom. Sci. Eng., DOI: 10.1109/TASE.2013.2280974 [35] Q. Wei and D. Liu, “Numerical adaptive learning control scheme for discrete-time nonlinear systems,” IET Control Theory Appl., vol. 7, no. 11, pp. 1472–1486, Jul. 2013. [36] H. Zhang, Q. Wei, and D. Liu, “An iterative adaptive dynamic programming method for solving a class of nonlinear zero-sum differential games,” Automatica, vol. 47, no. 1, pp. 207–214, Jan. 2011. [37] E. Zhou, “Optimal stopping under partial observation: Near-value iteration,” IEEE Trans. Autom. Control, vol. 58, no. 2, pp. 500–506, Feb. 2013. [38] J. J. Murray, C. J. Cox, G. G. Lendaris, and R. Saeks, “Adaptive dynamic programming,” IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 32, no. 2, pp. 140–153, May 2002. [39] M. Abu-Khalaf and F. L. Lewis, “Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach,” Automatica, vol. 41, no. 5, pp. 779–791, May 2005. [40] D. Liu and Q. Wei, “Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems,” IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 3, pp. 621–634, Mar. 2014. [41] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA, USA: Athena Scientific, 1996. [42] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, “Discrete-time nonlinear HJB solution using approximate dynamic programming: Convergence proof,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4, pp. 943–949, Aug. 2008. [43] D. P. Bertsekas, Dynamic Programming and Optimal Control, 3rd ed. Belmont, MA, USA: Athena Scientific, 2007. [44] B. Lincoln and A. Rantzer, “Relaxing dynamic programming,” IEEE Trans. Autom. Control, vol. 51, no. 8, pp. 1249–1260, Aug. 2006. [45] A. Almudevar and E. F. 
de Arruda, “Optimal approximation schedules for a class of iterative algorithms, with an application to multigrid value iteration,” IEEE Trans. Autom. Control, vol. 57, no. 12, pp. 3132–3146, Dec. 2012. [46] E. A. Feinberg and J. Huang, “The value iteration algorithm is not strongly polynomial for discounted dynamic programming,” Oper. Res. Lett., vol. 42, no. 2, pp. 130–131, Mar. 2014. [47] E. F. Arruda, F. O. Ourique, J. LaCombe, and A. Almudevar, “Accelerating the convergence of value iteration by using partial transition functions,” Eur. J. Oper. Res., vol. 229, no. 1, pp. 190–198, Aug. 2013. [48] H. Zhang, Q. Wei, and Y. Luo, “A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4, pp. 937–942, Jul. 2008.


[49] D. Liu, D. Wang, D. Zhao, Q. Wei, and N. Jin, “Neural-network-based optimal control for a class of unknown discrete-time nonlinear systems using globalized dual heuristic programming,” IEEE Trans. Autom. Sci. Eng., vol. 9, no. 3, pp. 628–634, Jul. 2012. [50] Q. Wei, H. Zhang, and J. Dai, “Model-free multiobjective approximate dynamic programming for discrete-time nonlinear systems with general performance index functions,” Neurocomputing, vol. 72, nos. 7–9, pp. 1839–1848, Mar. 2009. [51] F. L. Lewis, D. Vrabie, and K. G. Vamvoudakis, “Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers,” IEEE Control Syst., vol. 32, no. 6, pp. 76–105, Dec. 2012. [52] K. G. Vamvoudakis and F. L. Lewis, “Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem,” Automatica, vol. 46, no. 5, pp. 878–888, 2010. [53] Q. Yang and S. Jagannathan, “Reinforcement learning controller design for affine nonlinear discrete-time systems using online approximators,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 2, pp. 377–390, Apr. 2012. [54] H. Zhang, L. Cui, and Y. Luo, “Near-optimal control for nonzero-sum differential games of continuous-time nonlinear systems using singlenetwork ADP,” IEEE Trans. Cybern., vol. 43, no. 1, pp. 206–216, Feb. 2013. [55] D. Liu and Q. Wei, “Finite-approximation-error-based optimal control approach for discrete-time nonlinear systems,” IEEE Trans. Cybern., vol. 43, no. 2, pp. 779–789, Apr. 2013. [56] L. Grune and A. Rantzer, “On the infinite horizon performance of receding horizon controllers,” IEEE Trans. Autom. Control, vol. 53, no. 9, pp. 2100–2111, Sep. 2008. [57] A. Rantzer, “Relaxed dynamic programming in switching systems,” IEE Proc. Control Theory Appl., vol. 153, no. 5, pp. 567–574, Sep. 2006. [58] J. Si and Y.-T. Wang, “On-line learning control by association and reinforcement,” IEEE Trans. Neural Netw., vol. 12, no. 2, pp. 
264–276, Mar. 2001. [59] R. C. Dorf and R. H. Bishop, Modern Control Systems, 12th ed. New York, NY, USA: Prentice Hall, 2011. [60] R. Padhi, N. Unnikrishnan, X. Wang, and S. N. Balakrishnan, “A single network adaptive critic (SNAC) architecture for optimal control synthesis for a class of nonlinear systems,” Neural Netw., vol. 19, no. 10, pp. 1648–1660, 2006.

Fei-Yue Wang (S’87–M’89–SM’94–F’03) received the Ph.D. degree in computer and systems engineering from Rensselaer Polytechnic Institute, Troy, NY, USA, in 1990. He joined the University of Arizona, Tucson, AZ, USA, in 1990, and became a Professor and Director of the Robotics and Automation Laboratory and the Program for Advanced Research in Complex Systems. In 1999, he founded the Intelligent Control and Systems Engineering Center with the Chinese Academy of Sciences (CAS), Beijing, China, under the support of the Outstanding Overseas Chinese Talents Program. Since 2002, he has been the Director of the Key Laboratory of Complex Systems and Intelligence Science with CAS, and is currently the Director of the State Key Laboratory of Management and Control for Complex Systems. His current research interests include social computing, web science, complex systems, and intelligent control. Prof. Wang was the recipient of the National Prize in Natural Sciences of China and the Outstanding Scientist Award by ACM for his work in intelligent control and social computing, in 2007. He was an Editor-in-Chief of the International Journal of Intelligent Control and Systems and the World Scientific Series in Intelligent Control and Intelligent Automation, from 1995 to 2000. He is currently an Editor-in-Chief of the IEEE T RANSACTIONS ON I NTELLIGENT T RANSPORTATION S YSTEMS. He has served as chairs for over 20 IEEE, ACM, INFORMS, and ASME conferences. He was the President of IEEE Intelligent Transportation Systems Society, from 2005 to 2007, the Chinese Association for Science and Technology, New York, NY, USA, in 2005, and the U.S. Zhu Kezhen Education Foundation, New York, from 2007 to 2008. He is currently the Vice President of the ACM China Council and the Vice President/Secretary-General of Chinese Association of Automation. He is a member of Sigma Xi and an Elected Fellow of INCOSE, IFAC, ASME, and AAAS.

Qinglai Wei (M’11) received the B.S. degree in automation and the M.S. and Ph.D. degrees in control theory and control engineering from Northeastern University, Shenyang, China, in 2002, 2005, and 2008, respectively. From 2009 to 2011, he was a Post-Doctoral Fellow with the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China, where he is currently an Associate Professor. His current research interests include neural-networks-based control, adaptive dynamic programming, optimal control, nonlinear systems, and their industrial applications.

Xiong Yang received the B.S. degree in mathematics and applied mathematics from Central China Normal University, Wuhan, China, in 2008, the M.S. degree in pure mathematics from Shandong University, Jinan, China, in 2011, and the Ph.D. degree in control theory and control engineering from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2014. He is currently an Assistant Professor with the Institute of Automation, Chinese Academy of Sciences. His current research interests include adaptive dynamic programming, reinforcement learning, adaptive control, and neural networks.

Derong Liu (S’91–M’94–SM’96–F’05) received the Ph.D. degree in electrical engineering from the University of Notre Dame, Notre Dame, IN, USA, in 1994. He was a Staff Fellow with the General Motors Research and Development Center, Warren, MI, USA, from 1993 to 1995. He was an Assistant Professor with the Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, NJ, USA, from 1995 to 1999. He joined the University of Illinois at Chicago, Chicago, IL, USA, in 1999, and became a Full Professor of Electrical and Computer Engineering and of Computer Science in 2006. He was selected for the “100 Talents Program” by the Chinese Academy of Sciences in 2008. He has published 15 books (six research monographs and nine edited volumes). Prof. Liu was the recipient of the Michael J. Birck Fellowship from the University of Notre Dame in 1990, the Harvey N. Davis Distinguished Teaching Award from the Stevens Institute of Technology in 1997, the Faculty Early Career Development (CAREER) Award from the National Science Foundation in 1999, the University Scholar Award from the University of Illinois from 2006 to 2009, and the Overseas Outstanding Young Scholar Award from the National Natural Science Foundation of China in 2008. He is currently the Editor-in-Chief of the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS. He is a fellow of the International Neural Network Society.
