
Model-Free Dual Heuristic Dynamic Programming

Zhen Ni, Haibo He, Senior Member, IEEE, Xiangnan Zhong, and Danil V. Prokhorov, Senior Member, IEEE

Abstract—Model-based dual heuristic dynamic programming (MB-DHP) is a popular approach to approximating optimal solutions in control problems. Yet, it usually requires offline training of the model network, which results in extra computational cost. In this brief, we propose a model-free DHP (MF-DHP) design based on a finite-difference technique. In particular, we adopt a multilayer perceptron with one hidden layer for both the action and the critic networks, and use delayed objective functions to train both networks online over time. We test both the MF-DHP and MB-DHP approaches with a discrete time example and a continuous time example under the same parameter settings. Our simulation results demonstrate that the MF-DHP approach can obtain a control performance competitive with that of the traditional MB-DHP approach while requiring less computational resources.

Index Terms—Action-dependent dual heuristic dynamic programming (DHP), adaptive critic designs (ACDs), adaptive dynamic programming (ADP), online learning, reinforcement learning.

Manuscript received December 18, 2013; revised April 14, 2015; accepted April 16, 2015. Date of publication May 5, 2015; date of current version July 15, 2015. This work was supported in part by the U.S. Army Research Office under Grant W911NF-12-1-0378 and in part by the Division of Electrical, Communications and Cyber Systems through the National Science Foundation under Grant ECCS 1053717. Z. Ni, H. He, and X. Zhong are with the Department of Electrical, Computer and Biomedical Engineering, University of Rhode Island, Kingston, RI 02881 USA (e-mail: [email protected]; [email protected]; [email protected]). D. V. Prokhorov is with the Toyota Technical Center, Toyota Research Institute of North America, Ann Arbor, MI 48105 USA (e-mail: [email protected]). Digital Object Identifier 10.1109/TNNLS.2015.2424971

I. INTRODUCTION

Adaptive dynamic programming (ADP) has demonstrated great potential in solving optimal control problems in a principled way [1]–[4]. Dual heuristic dynamic programming (DHP) is one of the powerful techniques in the ADP framework and shows promising control performance for various nonlinear systems and complex applications [5]–[8]. In these papers, a model network is generally employed to predict the future system state and the subsequent future partial derivatives of the value function. Usually, offline training of the model network is the key prerequisite for the (traditional) model-based DHP (MB-DHP) design. This motivates our present research on the model-free DHP (MF-DHP) design.

ADP, and more specifically, adaptive critic design (ACD), has been categorized into three major groups [8]–[10]: 1) heuristic dynamic programming (HDP); 2) DHP; and 3) globalized DHP (GDHP). In [11], online learning control with direct HDP was first presented. The action network was used to provide the control action, and the critic network evaluated the control performance with a value function over time. The authors proposed to implement the objective function of the critic network backward in time: they used the previous and the current value function for the temporal difference (TD) error, rather than the current and the next value function. This design clearly simplified the implementation of ADP and reduced its computational burden. A boundedness result, as part of the theoretical analysis of this design, was provided in [12].

This direct HDP approach also showed promising control performance in practical applications, such as large power system control and helicopter stabilization [13]–[15]. Many other researchers have also contributed to the ADP/HDP approach, in both theoretical and practical respects [16]–[21]. Optimal control based on iterative ADP designs for both continuous time and discrete time systems has been studied in [22]–[25]. Such designs have been supported by solid mathematical proofs and rigorous convergence analysis in a number of recent papers [26]–[31]. Recently, a new design called goal representation HDP (GrHDP) has been studied on various tracking control and balancing control benchmarks [32]–[34]. A hierarchical HDP design with multiple goal representation was also proposed and studied in [35] and [36]. Maze navigation of the learning agent was studied with the GrHDP approach in [33] and [37]. All these GrHDP designs were implemented without any model network and demonstrated their control performance with all the neural networks learning from scratch. Rigorous proofs of convergence for the GrHDP design were included in [32], [33], and [38]–[40]. We also investigated several real applications for smart grid and energy storage systems in [41] and [42].

In the DHP design, the action network is likewise assigned to approximate the control policy, while the critic network is adopted to evaluate the control performance with the partial derivatives of the value function. In general, the DHP design relies on a previously obtained (e.g., trained on actual system data) and sufficiently accurate model. This model network is used to predict the future system state and the subsequent future partial derivatives of the value function. The objective functions incorporate the forward TD error (i.e., between the k-step and the (k + 1)-step). For instance, in [5] and [6], the authors first built the model network for the turbogenerator in the power system, and then tested the DHP/GDHP approaches with the offline trained parameters. In [8] and [43]–[45], the authors discussed various DHP/GDHP algorithms, emphasizing the importance of an accurate model network. In addition, the researchers in [7] and [46]–[48] performed convergence analysis for the DHP/GDHP approaches. They first proved the stability of the model network under certain conditions, and then demonstrated the convergence of both the value function and its partial derivatives under certain assumptions/constraints. However, a sufficiently accurate model network might not always be available, which may negatively affect the performance of MB-DHP.

Although many publications have mentioned that the model network is not strictly needed in ADP, this idea still remains to be verified. We are inspired by the model-free HDP (MF-HDP) design in [11], and adopt the finite-difference approach to realize the learning process for both the action and the critic networks. We contend that the value of this brief is in introducing a novel concept of the MF-DHP reinforced by successful simulation results. To the best of our knowledge, this is the first time that MF-DHP is implemented and verified in examples.1 In this MF-DHP design, we back up one step to save both the previous (k − 1)-step and the current k-step partial derivatives of the value function.

1 Our design belongs to the group of action-dependent (AD) adaptive critic designs, two of which, ADHDP and ADDHP, are discussed in this brief. They feature a direct connection between the action and the critic networks. We do not use the prefix AD when referring to the HDP and the DHP designs for simplicity of notation.



Fig. 1. Schematic of the online MF-DHP design. Solid arrow and dashed arrow: signal path and backpropagation path, respectively. Gray elliptical block: mathematical calculation in (7). The vector Y is defined as the input for the critic network, including the state vector and control action.

Similar to [11], we adopt delayed errors for both the action and the critic networks, and employ the gradient descent method to update the weights in both networks over time. Compared with the (traditional) MB-DHP design [8], we replace the backpropagation paths through the model network with finite differences as surrogates for the appropriate derivatives. We compare the control performance of the proposed MF-DHP approach with that of the traditional MB-DHP approach under the same settings. The statistical simulation results demonstrate that the proposed MF-DHP approach requires less computational cost, while staying very competitive in performance with the MB-DHP approach.

The rest of this brief is organized as follows. Section II presents both the MF-HDP design and the proposed MF-DHP design. Section III provides our simulation studies on two typical examples, followed by the discussion and conclusion in Section IV.

II. MODEL-FREE HDP AND DHP DESIGNS

In this section, we contrast the MF-HDP design with the proposed MF-DHP design. We first discuss the MF-HDP design with the definitions of the objective functions for both the action and the critic networks, and then introduce our proposed MF-DHP design and its simplified learning algorithm.

A. Model-Free HDP Design

In the MF-HDP design [11], the ultimate objective for the controller is to seek the strategy that minimizes the value function J, expressed as

$$ J(k) = \sum_{n=0}^{\infty} \gamma^{n} r(k+n) \qquad (1) $$

where γ is a discount factor and r is the reward function (also called the local cost/utility function). In [11], the critic network is used to approximate the J function and is trained backward in time, which is of great importance for online model-free operation. The objective function of the critic network is thus defined as

$$ e_c^{\mathrm{HDP}} = \gamma J(k) - [J(k-1) - r(k-1)]. \qquad (2) $$

The weight updating for the critic network is expressed as

$$ \Delta W_c = -\eta_c \cdot \gamma\big(\gamma J(k) - [J(k-1) - r(k-1)]\big) \cdot \frac{\partial J(k)}{\partial W_c(k)} \qquad (3) $$

where η_c is the learning rate and W_c denotes the weight parameters of the critic network. The action network is designed to seek the control policy u that optimizes the overall cost expressed in (1). To obtain the gradient of the cost function J with respect to the weights of the action network (W_a), we simply backpropagate J through the critic network. This gives us ∂J/∂u and ∂J/∂W_a for all the inputs in the vector and the weights, respectively.
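To make the delayed (backward-in-time) update concrete, the following is a minimal Python/NumPy sketch of one MF-HDP critic/actor update step. It is an illustration only: the single-hidden-layer tanh networks without bias terms, the helper names, and the discount factor value are our assumptions, not the authors' implementation.

```python
import numpy as np

class MLP:
    """One-hidden-layer tanh MLP (no bias terms, for brevity)."""
    def __init__(self, n_in, n_hid, n_out, rng):
        self.W1 = rng.uniform(-0.5, 0.5, (n_hid, n_in))
        self.W2 = rng.uniform(-0.5, 0.5, (n_out, n_hid))

    def forward(self, x):
        self.x = x
        self.h = np.tanh(self.W1 @ x)
        return self.W2 @ self.h

    def grad_wrt_input(self, dy):
        """Backpropagate an upstream gradient dy to the network input."""
        dh = (self.W2.T @ dy) * (1.0 - self.h ** 2)
        return self.W1.T @ dh

    def grad_step(self, dy, lr):
        """One gradient-descent step on the weights for upstream gradient dy."""
        dh = (self.W2.T @ dy) * (1.0 - self.h ** 2)
        self.W2 -= lr * np.outer(dy, self.h)
        self.W1 -= lr * np.outer(dh, self.x)

def mfhdp_step(critic, actor, x, J_prev, r_prev, gamma=0.95, lr=0.1):
    """One delayed-TD update: critic per (2)-(3), actor by backpropagating J through the critic."""
    u = actor.forward(x)
    J = critic.forward(np.concatenate([x, u]))[0]
    # dJ/du, obtained by backpropagating J through the critic to its u-inputs (before any update)
    dJ_du = critic.grad_wrt_input(np.array([1.0]))[x.size:]
    # Critic: e_c = gamma*J(k) - [J(k-1) - r(k-1)]; descend 0.5*e_c^2 through the gamma*J(k) term, per (3)
    e_c = gamma * J - (J_prev - r_prev)
    critic.grad_step(np.array([gamma * e_c]), lr)
    # Action network: descend J with respect to its weights via dJ/du, as described above
    actor.grad_step(dJ_du, lr)
    return u, J
```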

B. Model-Free DHP Design

In the DHP design, as presented in Fig. 1, the output of the critic network is defined as λ(k) = ∂J(k)/∂Y(k), and the input of the critic network is defined as Y(k) = [x(k)^T u(k)^T]^T. Motivated by the MF-HDP design [11], we adopt the same technique to obtain the TD error between λ(k − 1) and λ(k) in our MF-DHP design. Starting from (2), we can obtain the error function for the MF-DHP design as

$$ e_c = \gamma\left[\frac{\partial Y(k)}{\partial Y(k-1)}\right]^{T}\cdot \lambda(k) - \left[\lambda(k-1) - \frac{\partial r(k-1)}{\partial Y(k-1)}\right]. \qquad (4) $$

The objective function for the critic network can thus be written as

$$ E_c = \frac{1}{2}\, e_c^{T} e_c. \qquad (5) $$

From the chain rule, we can obtain the weight updates as

$$ \Delta W_c = -\eta_c\,\frac{\partial E_c}{\partial W_c} = -\eta_c \cdot e_c \cdot \gamma\left[\frac{\partial Y(k)}{\partial Y(k-1)}\right]^{T}\cdot \frac{\partial \lambda(k-1)}{\partial W_c(k-1)} \qquad (6) $$

where λ(k − 1) is the previous vector of partial derivatives of the value function. If x and u are both scalar, we have

$$ \frac{\partial Y(k)}{\partial Y(k-1)} = \begin{bmatrix} \dfrac{\partial x(k)}{\partial x(k-1)} & \dfrac{\partial u(k)}{\partial x(k-1)} \\[2mm] \dfrac{\partial x(k)}{\partial u(k-1)} & \dfrac{\partial u(k)}{\partial u(k-1)} \end{bmatrix}. \qquad (7) $$

If x or u is a vector, we need to implement (7) with respect to the components of x or u in a component-by-component vector format. In Fig. 1, we show the error functions for both the action network (e_a) and the critic network (e_c) with the dashed arrows. As both Y and λ are vectors in our MF-DHP design, we need to implement (4) and (6) with respect to the components of Y and λ in a component-by-component vector format. Therefore, the error function in (4) can be written in vector form as e_c = [e_c^{x_1}, e_c^{x_2}, ..., e_c^{x_M}, e_c^{u_1}, e_c^{u_2}, ..., e_c^{u_N}]^T and λ = [λ^{x_1}, λ^{x_2}, ..., λ^{x_M}, λ^{u_1}, λ^{u_2}, ..., λ^{u_N}]^T, where M is the dimension of x and N is the dimension of u. In this way, we can realize the error with the finite-difference approach [49], [50] for implementing (7), and obtain the error function e_c via the calculation on each component without any use of a model network.

For the action network, we adopt an error function similar to [8]. We define the error of the action network as

$$ e_a = \lambda^{u}(k) \qquad (8) $$

where λ^u is part of λ and can be written as λ^u = [λ^{u_1}, λ^{u_2}, ..., λ^{u_N}]^T. The objective function is thus defined as

$$ E_a = \frac{1}{2}\, e_a^{T} e_a. \qquad (9) $$
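The brief does not spell out the exact finite-difference formula, so the following Python sketch shows one plausible first-order realization of (7) and (4): each entry of ∂Y(k)/∂Y(k−1) is approximated by the ratio of consecutive increments of the observed signals, and the reward gradient ∂r(k−1)/∂Y(k−1) is written out for the quadratic reward used later in Section III. The differencing scheme, the epsilon guard, and the function names are our assumptions.

```python
import numpy as np

def fd_jacobian(Y_k, Y_km1, Y_km2, eps=1e-6):
    """First-order finite-difference estimate of dY(k)/dY(k-1) in (7), built from
    three consecutive samples of Y = [x; u] (assumed scheme): J[i, j] ~ dY_i(k)/dY_j(k-1)."""
    dY_k = Y_k - Y_km1        # increment of Y at step k
    dY_km1 = Y_km1 - Y_km2    # increment of Y at step k-1
    J = np.zeros((Y_k.size, Y_km1.size))
    for i in range(Y_k.size):
        for j in range(Y_km1.size):
            # crude guard against division by a vanishing increment
            J[i, j] = dY_k[i] / (dY_km1[j] if abs(dY_km1[j]) > eps else eps)
    return J

def critic_error(lam_k, lam_km1, Y_km1, Jac, Q, R, M, gamma=0.95):
    """Component-wise critic error e_c of (4) for the quadratic reward r = x^T Q x + u^T R u."""
    x_km1, u_km1 = Y_km1[:M], Y_km1[M:]
    dr_dY = np.concatenate([2.0 * Q @ x_km1, 2.0 * R @ u_km1])  # dr(k-1)/dY(k-1)
    return gamma * Jac.T @ lam_k - (lam_km1 - dr_dY)
```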


Fig. 2. Comparison of the controlled evolution of x1 between the MF-DHP approach (solid line) and the MB-DHP approach (dashed line).

Fig. 3. Comparison of the controlled evolution of x2 between the MF-DHP approach (solid line) and the MB-DHP approach (dashed line).

The weight updating for the action network can be written as

$$ \Delta W_a = -\eta_a \cdot \sum_{j=1}^{N} \left[\lambda^{u_j}(k) \cdot \frac{\partial u_j(k)}{\partial W_a(k)}\right] \qquad (10) $$

where η_a is the learning rate for the weight parameters in the action network. The learning of the action network is not carried out through the internal backpropagation paths of the critic network leading from its output to the input u, since this would be equivalent to the HDP design. Instead, we use λ^u from the output of the critic network directly. It has been discussed in [8] that the error function in (8) satisfies all the action-dependent forms of ACDs. In the learning process, we first update the weights in the critic network and then update the weights in the action network, as sketched below. We adopt two criteria (i.e., an error threshold and a maximum iteration number) for the iterative learning process. The learning process at each time step is regarded as completed once both networks finish their iterative learning.
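As an illustration of this critic-then-action update order and the two stopping criteria, here is a minimal Python sketch of one MF-DHP time step. It reuses the MLP class sketched earlier; the thresholds and iteration caps mirror the values reported later in Section III, and the function names, the way λ(k−1), Y(k−1), the Jacobian estimate, and the reward gradient are passed in, as well as the exact gradient path (we read (6) as applying the error through the stored previous-step critic evaluation), are our own scaffolding rather than the authors' code.

```python
import numpy as np

def mfdhp_time_step(critic, actor, x_k, lam_km1, Y_km1, Jac, dr_dY_km1,
                    gamma=0.95, eta_c=0.1, eta_a=0.1,
                    Tc=1e-10, Ta=1e-10, Nc=60, Na=50):
    """One online MF-DHP step: iterate the critic update (4)-(6), then the action update (8)-(10),
    each until its squared error falls below the threshold or the iteration cap is reached."""
    M = x_k.size

    # --- critic: repeat gradient steps on E_c = 0.5 * e_c^T e_c ---
    for _ in range(Nc):
        u_k = actor.forward(x_k)
        lam_k = critic.forward(np.concatenate([x_k, u_k]))       # lambda(k)
        e_c = gamma * Jac.T @ lam_k - (lam_km1 - dr_dY_km1)      # (4)
        if 0.5 * e_c @ e_c < Tc:
            break
        critic.forward(Y_km1)                                    # re-cache activations at Y(k-1)
        critic.grad_step(gamma * Jac @ e_c, eta_c)               # (6), as we read it

    # --- action network: repeat gradient steps on E_a = 0.5 * e_a^T e_a with e_a = lambda^u(k) ---
    for _ in range(Na):
        u_k = actor.forward(x_k)
        lam_u = critic.forward(np.concatenate([x_k, u_k]))[M:]   # lambda^u(k), per (8)
        if 0.5 * lam_u @ lam_u < Ta:
            break
        actor.grad_step(lam_u, eta_a)                            # (10): dE_a/dW_a = lambda^u * du/dW_a
    return actor.forward(x_k)
```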

Fig. 4. Typical trajectories of λ with the proposed MF-DHP approach.

Fig. 5. Typical trajectories of λ with the traditional MB-DHP approach.

III. SIMULATION STUDIES

We adopt a multilayer perceptron (MLP) with one hidden layer for both the action and the critic networks in our MF-DHP design. The action network observes the system state variables and then provides a control output. An MLP builds a relationship between the system variables and the output actions, and therefore it is a controller that learns based on data generated by the closed-loop system. We first test both the MB-DHP and MF-DHP approaches on a discrete time example, and show that the proposed MF-DHP approach can obtain a control performance competitive with that of the traditional MB-DHP approach. Then, we further test both approaches on a popular continuous time example.

A. Case Study One

The nonlinear discrete time system is defined as follows (derived from [51]):

$$ x(k+1) = \begin{bmatrix} -\sin(0.5\,x_1(k)) \\ -\cos(1.4\,x_2(k))\,\sin(0.9\,x_1(k)) \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \end{bmatrix} u(k) \qquad (11) $$

where x(k) is the input vector for the action network and x(k) = [x_1(k) x_2(k)]^T. The control output u(k) (u(k) ∈ R) is computed as the output of the action network (a one-hidden-layer MLP). We define the cost/reward function as r(k) = x^T(k)Qx(k) + u^T(k)Ru(k), where Q and R are identity matrices of the corresponding dimensions. The MLP structures for the action and the critic networks are 2-6-1 (i.e., two input nodes, six hidden nodes, and one output node) and 3-6-3, respectively. The activation function is the hyperbolic tangent (tanh).
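For concreteness, the following short Python sketch collects this benchmark: the plant of (11) as reconstructed above (the display of (11) is partially garbled in the source, so this form should be treated as an assumption), the quadratic reward, and the network sizes reported in this case study.

```python
import numpy as np

def plant_step(x, u):
    """Discrete-time system (11) as reconstructed: x(k+1) = f(x(k)) + [0, 1]^T u(k) (assumed form)."""
    x1, x2 = x
    f = np.array([-np.sin(0.5 * x1),
                  -np.cos(1.4 * x2) * np.sin(0.9 * x1)])
    return f + np.array([0.0, 1.0]) * u

def reward(x, u, Q=np.eye(2), R=np.eye(1)):
    """Quadratic cost r(k) = x^T Q x + u^T R u with identity weighting matrices."""
    u = np.atleast_1d(u)
    return float(x @ Q @ x + u @ R @ u)

# Network sizes and initial state reported in Section III-A
ACTION_NET = (2, 6, 1)   # 2-6-1 MLP, tanh hidden layer
CRITIC_NET = (3, 6, 3)   # 3-6-3 MLP: input Y = [x1, x2, u], output lambda
x0 = np.array([0.5, -0.5])
```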

The initial system state is x(0) = [0.5 −0.5]^T, and the initial weights for both the action and the critic networks are uniformly randomly selected from [−0.5, 0.5]. The learning for both networks is terminated when the squared error falls below the threshold or the iteration number exceeds the maximum iteration number. Here, we set the error thresholds for both the action and the critic networks as T_a = T_c = 10^{-10}, and the maximum iteration numbers for the action and the critic networks are N_a = 50 and N_c = 60, respectively. The learning rates for both networks are set to 0.1.

The control performance is presented in Figs. 2 and 3, where one can see the typical system trajectories for both the MF-DHP approach (solid line) and the MB-DHP approach (dashed line). Both figures show that the MF-DHP approach obtains results competitive with those of the MB-DHP approach. In Figs. 4 and 5, we provide the corresponding trajectories of λ for the proposed MF-DHP approach and the traditional MB-DHP approach, respectively. The results indicate that the MF-DHP approach does not sacrifice much in terms of learning accuracy. We also measure the computational cost of the backpropagation algorithm in both approaches.


The MF-DHP approach needs 0.070 s for the backpropagation algorithm out of a total simulation time of 0.971 s, whereas the MB-DHP approach requires 0.151 s for the backpropagation algorithm out of a total simulation time of 2.137 s. Our proposed MF-DHP approach thus shows a relatively efficient learning process while achieving a competitive control performance.

B. Case Study Two

In order to assess the control performance of the proposed MF-DHP design from a statistical viewpoint, we test it on a more complex, continuous time example. We test both the MB-DHP and MF-DHP approaches on the ball and beam balancing example [36], which has been widely studied as a popular control benchmark because of its nonlinearity and instability. The objective of the learning controller is to balance the ball as long as possible and to keep the angle of the beam as small as possible. The dynamics are assumed to be unknown to the MF-DHP approach. We also note that, although we use a continuous time example here, we implement it in discrete time. Similar to [32] and [35], we call the continuous time system model function [e.g., the state-space function (23)] with the ode45 function in MATLAB with an integration step size of 0.02 s. For the MB-DHP approach, in contrast, we first need data to train the model network offline, and then employ the DHP approach to control the system.

From [36], we can obtain the motion equations from the Lagrange equation

$$ \left(m + \frac{I_b}{r^2}\right)\ddot{x}' + \frac{1}{r}(mr^2 + I_b)\,\ddot{\alpha} - m x' \dot{\alpha}^2 = m g \sin\alpha \qquad (12) $$

$$ [m(x')^2 + I_b + I_\omega]\,\ddot{\alpha} + (2m\dot{x}'x' + bl^2)\,\dot{\alpha} + Kl^2\alpha + \frac{1}{r}(mr^2 + I_b)\,\ddot{x}' - m g x' \cos\alpha = u\,l\cos\alpha \qquad (13) $$

where
m = 0.0162 kg, the mass of the ball;
r = 0.02 m, the roll radius of the ball;
I_b = 4.32 × 10^{-5} kg·m^2, the inertia moment of the ball;
b = 1 N/m, the friction coefficient of the drive mechanics;
l = 0.48 m, the radius of force application;
l_ω = 0.5 m, the radius of the beam;
K = 0.001 N/m, the stiffness of the drive mechanics;
g = 9.8 N/kg, the gravitational acceleration;
I_ω = 0.14025 kg·m^2, the inertia moment of the beam;
u, the force of the drive mechanics.

In order to simplify the system model function, we define x_1 = x' as the position of the ball, x_2 = ẋ' as the velocity of the ball, x_3 = α as the angle of the beam with respect to the horizontal axis, and x_4 = α̇ as the angular velocity of the beam. In this way, the system functions (12) and (13) can be transformed into the following form:

$$ \left(m + \frac{I_b}{r^2}\right)\dot{x}_2 + \frac{1}{r}(mr^2 + I_b)\,\dot{x}_4 = m x_1 x_4^2 + m g \sin x_3 \qquad (14) $$

$$ \frac{1}{r}(mr^2 + I_b)\,\dot{x}_2 + \left(m x_1^2 + I_b + I_\omega\right)\dot{x}_4 = (u l + m g x_1)\cos x_3 - (2 m x_2 x_1 + b l^2) x_4 - K l^2 x_3. \qquad (15) $$

To be more clear, we rewrite (14) and (15) in matrix form as

$$ \begin{bmatrix} A & B \\ C & D \end{bmatrix}\cdot\begin{bmatrix} \dot{x}_2 \\ \dot{x}_4 \end{bmatrix} = \begin{bmatrix} P_1 \\ P_2 \end{bmatrix} \qquad (16) $$

Fig. 6. Comparison of the required number of trials in 100 runs. Both the MF-DHP and MB-DHP approaches are tested with different choices of hidden nodes in both the action and critic networks.

where

$$ P_1 = m x_1 x_4^2 + m g \sin x_3 \qquad (17) $$
$$ P_2 = (u l + m g x_1)\cos x_3 - (2 m x_2 x_1 + b l^2) x_4 - K l^2 x_3 \qquad (18) $$
$$ A = m + \frac{I_b}{r^2} \qquad (19) $$
$$ B = \frac{1}{r}(m r^2 + I_b) \qquad (20) $$
$$ C = \frac{1}{r}(m r^2 + I_b) \qquad (21) $$
$$ D = m x_1^2 + I_b + I_\omega. \qquad (22) $$

Therefore, we can obtain the general form of this problem as

$$ \begin{bmatrix} \dot{x}_2 \\ \dot{x}_4 \end{bmatrix} = \begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1}\begin{bmatrix} P_1 \\ P_2 \end{bmatrix} \qquad (23) $$

and the other two terms in the state vector can be expressed as ẋ_1 = x_2 and ẋ_3 = x_4. Thus, the input for the action network is defined as X = [x_1 x_2 x_3 x_4]. The control output u is generated by the action network based on the observation of X, and is then applied to the system.

All simulation results presented here are based on 100 independent runs, and each run consists of a maximum of 1000 trials. A run is considered successful if the last trial (i.e., a trial number less than or equal to 1000) can maintain balancing of the ball for 6000 time steps with a step size of 0.02 s. A run is regarded as a failed run if the position of the ball (x_1) goes out of bounds (i.e., outside [−0.48 m, 0.48 m]) or the angle of the beam exceeds the maximum deviation (i.e., outside [−0.24 rad, 0.24 rad]). In this simulation, we train both approaches offline and test them with the final learned control policy. The system starts with random initial conditions (e.g., x_1 and x_3 are uniformly initialized in [−0.2 m, 0.2 m] and [−0.15 rad, 0.15 rad], respectively), and these initial conditions are independent among the 100 runs. The other parameter settings are kept the same as those in Section III-A.

We compare both the MF-DHP and MB-DHP approaches with different choices of hidden nodes in both the critic and the action networks. The boxplot of the successful trial numbers (i.e., the required number of trials to reach success in each run) is provided in Fig. 6. The number of hidden nodes is chosen as 4, 7, 10, and 20 for both the action and the critic networks.
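As an illustration of how (14)–(23) can be simulated, here is a short Python sketch of the ball-and-beam state derivative together with a fixed-step RK4 integrator used as a stand-in for MATLAB's ode45 (the paper integrates with ode45 at a 0.02 s step; the RK4 substitute and the function names are our assumptions).

```python
import numpy as np

# Physical parameters from Section III-B
m, r, Ib = 0.0162, 0.02, 4.32e-5     # ball mass, roll radius, ball inertia moment
b, l, K = 1.0, 0.48, 0.001           # friction coefficient, force-application radius, drive stiffness
g, Iw = 9.8, 0.14025                 # gravity, beam inertia moment

def ball_beam_deriv(X, u):
    """State derivative [x1., x2., x3., x4.] from (16)-(23); X = [x1, x2, x3, x4]."""
    x1, x2, x3, x4 = X
    P1 = m * x1 * x4**2 + m * g * np.sin(x3)                                                    # (17)
    P2 = (u * l + m * g * x1) * np.cos(x3) - (2 * m * x2 * x1 + b * l**2) * x4 - K * l**2 * x3  # (18)
    A = m + Ib / r**2                                                                           # (19)
    B = C = (m * r**2 + Ib) / r                                                                 # (20), (21)
    D = m * x1**2 + Ib + Iw                                                                     # (22)
    x2dot, x4dot = np.linalg.solve(np.array([[A, B], [C, D]]), np.array([P1, P2]))              # (23)
    return np.array([x2, x2dot, x4, x4dot])

def rk4_step(X, u, h=0.02):
    """One fixed-step RK4 integration step (stand-in for ode45 with a 0.02 s step)."""
    k1 = ball_beam_deriv(X, u)
    k2 = ball_beam_deriv(X + 0.5 * h * k1, u)
    k3 = ball_beam_deriv(X + 0.5 * h * k2, u)
    k4 = ball_beam_deriv(X + h * k3, u)
    return X + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
```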


Fig. 7. Comparison of the average computation time per successful run. Both the MF-DHP and MB-DHP approaches are compared only in terms of the backpropagation computational cost in each successful run (i.e., 6000 time steps).

Both approaches achieve a 100% success rate. Meanwhile, the MF-DHP approach shows performance very competitive with that of the MB-DHP approach in terms of the required average number of trials to learn successful balancing, across the different choices of hidden nodes. In Fig. 7, we provide the required computational time per successful run for both approaches. We note that we only account for the backpropagation training time at every time step and sum those times over each successful run (i.e., the 6000 time steps). As expected, the MF-DHP approach takes less computational time than the MB-DHP approach. In Fig. 7, we can also see the growing computational advantage of the MF-DHP approach over the MB-DHP approach as the number of hidden nodes increases. These results indicate that the MF-DHP approach can reduce the computational burden while still maintaining a competitive performance.

IV. DISCUSSION AND CONCLUSION

In this brief, we propose an adaptive critic design, the MF-DHP design, based on finite differences. Our proposed design has a simpler architecture than the traditional DHP designs because it does not require any model network. This may result in significant savings in terms of computational cost, as presented in Fig. 7. We adopt the finite-difference approach to approximate the derivatives in the backpropagation paths [i.e., ∂Y(k)/∂Y(k − 1)], resulting in a control performance (e.g., the required average number of trials to reach success) very competitive with that of the existing MB-DHP design.

In addition, we would also like to share our thoughts on the controller stability, control policy convergence, and noise tolerance of this new MF-DHP design. Similar to the (direct) MF-HDP in [11], the stability of this design can be studied with a Lyapunov analysis similar to the one in [12], in terms of the boundedness of the state variables and weights attained under certain conditions. The convergence of the value function J and of the partial derivatives of the value function λ has been presented in [46] and [47]; the convergence of λ in our design can be analyzed in a similar way. As for the noise tolerance of our proposed approach, we tested both the MB-DHP and MF-DHP approaches with 5% uniform noise on the position of the ball in the second case study (the noise is added in the same way as in [11] and [34]).

In 100 independent runs (with seven hidden nodes in both the action and the critic networks), the simulation results show that, to reach success, the MF-DHP approach requires an average of 45 trials, while the MB-DHP approach requires an average of 39.1 trials. We can see that the performance of the MF-DHP approach is consistently close to that of the MB-DHP approach, albeit both required average trial numbers are slightly higher than the corresponding numbers under noise-free conditions presented in Fig. 6.

In summary, we present encouraging results toward designing the DHP architecture without any model prediction/network. The MF-DHP design can not only reduce the computational burden by not using a model network, but also keep the control performance at a level very competitive with that of the MB-DHP design. The simplified learning algorithm is also elaborated. We demonstrate the control performance of the proposed MF-DHP design on two simulation examples and compare it with the results of the traditional MB-DHP design. In addition, compared with the MF-HDP design, in which the critic network approximates J values, the proposed MF-DHP design attempts to approximate the derivatives of the J function. In general, using the derivatives of the optimization criterion, rather than the optimization criterion itself, is regarded as advantageous in searching for the optimal control policy [8], [43], [44]. Therefore, the proposed MF-DHP design is expected to have better learning and control performance than the existing MF-HDP design or other reinforcement learning approaches [11], [19]. In comparison with the system-theoretic control approaches [52], [53], our proposed approach is more general because it does not assume a specific functional form for the system to be controlled. While ADP approaches might not be as fast in adapting to modeling uncertainties as the approaches in [52] and [53], our approach is still applicable when dealing with uncertainties because of learning through interaction in the closed-loop system by both the action and the critic networks. It should be noted that in our MF-DHP design, we employ a simple, first-order approximation of the derivatives ∂J/∂Y. Though a high level of matching between the performances of MF-DHP and MB-DHP is observed, it may not always be achieved, as the derivative approximation accuracy depends on the sampling rate of the dynamics and may not be adequate for some problems.

REFERENCES

[1] D. P. Bertsekas, Dynamic Programming and Optimal Control. Belmont, MA, USA: Athena Scientific, 1995. [2] J. Si, A. G. Barto, W. B. Powell, and D. Wunsch, Eds., Handbook of Learning and Approximate Dynamic Programming. Hoboken, NJ, USA: Wiley, 2004. [3] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality. Hoboken, NJ, USA: Wiley, 2007. [4] F. L. Lewis and D. Liu, Eds., Reinforcement Learning and Approximate Dynamic Programming for Feedback Control. Hoboken, NJ, USA: Wiley, 2013. [5] S. Ray, G. K. Venayagamoorthy, B. Chaudhuri, and R. Majumder, “Comparison of adaptive critic-based and classical wide-area controllers for power systems,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4, pp. 1002–1007, Aug. 2008. [6] G. K. Venayagamoorthy, R. G. Harley, and D. C. Wunsch, “Dual heuristic programming excitation neurocontrol for generators in a multimachine power system,” IEEE Trans. Ind. Appl., vol. 39, no. 2, pp. 382–394, Mar./Apr. 2003. [7] H. Zhang, Y. Luo, and D. Liu, “Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints,” IEEE Trans. Neural Netw., vol.
20, no. 9, pp. 1490–1503, Sep. 2009. [8] D. V. Prokhorov and D. C. Wunsch, “Adaptive critic designs,” IEEE Trans. Neural Netw., vol. 8, no. 5, pp. 997–1007, Sep. 1997. [9] F.-Y. Wang, H. Zhang, and D. Liu, “Adaptive dynamic programming: An introduction,” IEEE Comput. Intell. Mag., vol. 4, no. 2, pp. 39–47, May 2009.


[10] F. L. Lewis and D. Vrabie, “Reinforcement learning and adaptive dynamic programming for feedback control,” IEEE Circuits Syst. Mag., vol. 9, no. 3, pp. 32–50, Third Quarter 2009. [11] J. Si and Y.-T. Wang, “Online learning control by association and reinforcement,” IEEE Trans. Neural Netw., vol. 12, no. 2, pp. 264–276, Mar. 2001. [12] F. Liu, J. Sun, J. Si, W. Guo, and S. Mei, “A boundedness result for the direct heuristic dynamic programming,” Neural Netw., vol. 32, pp. 229–235, Aug. 2012. [13] C. Lu, J. Si, and X. Xie, “Direct heuristic dynamic programming for damping oscillations in a large power system,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4, pp. 1008–1013, Aug. 2008. [14] R. Enns and J. Si, “Helicopter trimming and tracking control using direct neural dynamic programming,” IEEE Trans. Neural Netw., vol. 14, no. 4, pp. 929–939, Jul. 2003. [15] Y. Tang, H. He, Z. Ni, J. Wen, and X. Sui, “Reactive power control of grid-connected wind farm based on adaptive dynamic programming,” Neurocomputing, vol. 125, pp. 125–133, Feb. 2014. [16] D. Liu, D. Wang, F.-Y. Wang, H. Li, and X. Yang, “Neural-networkbased online HJB solution for optimal robust guaranteed cost control of continuous-time uncertain nonlinear systems,” IEEE Trans. Cybern., vol. 44, no. 12, pp. 2834–2847, Dec. 2014. [17] Q. Wei, F.-Y. Wang, D. Liu, and X. Yang, “Finite-approximation-errorbased discrete-time iterative adaptive dynamic programming,” IEEE Trans. Cybern., vol. 44, no. 12, pp. 2820–2833, Dec. 2014. [18] D. Liu and Q. Wei, “Finite-approximation-error-based optimal control approach for discrete-time nonlinear systems,” IEEE Trans. Cybern., vol. 43, no. 2, pp. 779–789, Apr. 2013. [19] B. Xu, C. Yang, and Z. Shi, “Reinforcement learning output feedback NN control using deterministic learning technique,” IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 3, pp. 635–641, Mar. 2014. [20] D. Liu, H. Li, and D. Wang, “Online synchronous approximate optimal learning algorithm for multi-player non-zero-sum games with unknown dynamics,” IEEE Trans. Syst., Man, Cybern., Syst., vol. 44, no. 8, pp. 1015–1027, Aug. 2014. [21] H. Li, D. Liu, and D. Wang, “Integral reinforcement learning for linear continuous-time zero-sum games with completely unknown dynamics,” IEEE Trans. Autom. Sci. Eng., vol. 11, no. 3, pp. 706–714, Jul. 2014. [22] D. Liu, D. Wang, and H. Li, “Decentralized stabilization for a class of continuous-time nonlinear interconnected systems using online learning optimal control approach,” IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 2, pp. 418–428, Feb. 2014. [23] Q. Wei and D. Liu, “Data-driven neuro-optimal temperature control of water–gas shift reaction using stable iterative adaptive dynamic programming,” IEEE Trans. Ind. Electron., vol. 61, no. 11, pp. 6399–6408, Nov. 2014. [24] D. Liu and Q. Wei, “Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems,” IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 3, pp. 621–634, Mar. 2014. [25] X. Yang, D. Liu, and Y. Huang, “Neural-network-based online optimal control for uncertain non-linear continuous-time systems with control constraints,” IET Control Theory Appl., vol. 7, no. 17, pp. 2037–2047, Nov. 2013. [26] D. Wang, D. Liu, and Q. Wei, “Finite-horizon neuro-optimal tracking control for a class of discrete-time nonlinear systems using adaptive dynamic programming approach,” Neurocomputing, vol. 78, no. 1, pp. 14–22, Feb. 2012. [27] Q. Wei and D. 
Liu, “An iterative ε-optimal control scheme for a class of discrete-time nonlinear systems with unfixed initial state,” Neural Netw., vol. 32, pp. 236–244, Aug. 2012. [28] D. Liu, D. Wang, and X. Yang, “An iterative adaptive dynamic programming algorithm for optimal control of unknown discrete-time nonlinear systems with constrained inputs,” Inf. Sci., vol. 220, pp. 331–342, Jan. 2013. [29] H. Li and D. Liu, “Optimal control for discrete-time affine non-linear systems using general value iteration,” IET Control Theory Appl., vol. 6, no. 18, pp. 2725–2736, Dec. 2012. [30] D. Liu, H. Li, and D. Wang, “Neural-network-based zero-sum game for discrete-time nonlinear systems via iterative adaptive dynamic programming algorithm,” Neurocomputing, vol. 110, pp. 92–100, Jun. 2013. [31] Q. Wei and D. Liu, “Numerical adaptive learning control scheme for discrete-time non-linear systems,” IET Control Theory Appl., vol. 7, no. 11, pp. 1472–1486, Jul. 2013. [32] Z. Ni, H. He, and J. Wen, “Adaptive learning in tracking control based on the dual critic network design,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 6, pp. 913–928, Jun. 2013.


[33] Z. Ni, H. He, J. Wen, and X. Xu, “Goal representation heuristic dynamic programming on maze navigation,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 12, pp. 2038–2050, Dec. 2013. [34] H. He, Z. Ni, and J. Fu, “A three-network architecture for on-line learning and optimization based on adaptive dynamic programming,” Neurocomputing, vol. 78, no. 1, pp. 3–13, Feb. 2012. [35] Z. Ni, H. He, D. Zhao, and D. V. Prokhorov, “Reinforcement learning control based on multi-goal representation using hierarchical heuristic dynamic programming,” in Proc. IEEE Int. Joint Conf. Neural Netw. (IJCNN), Brisbane, Queensland, Australia, Jun. 2012, pp. 1–8. [36] H. He, Z. Ni, and D. Zhao, “Learning and optimization in hierarchical adaptive critic design,” in Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, F. L. Lewis and D. Liu, Eds. Hoboken, NJ, USA: Wiley, 2013. [37] Z. Ni and H. He, “Heuristic dynamic programming with internal goal representation,” Soft Comput., vol. 17, no. 11, pp. 2101–2108, Nov. 2013. [38] Z. Ni, X. Fang, H. He, D. Zhao, and X. Xu, “Real-time tracking on adaptive critic design with uniformly ultimately bounded condition,” in Proc. IEEE Symp. Adapt. Dyn. Program. Reinforcement Learn. (ADPRL), Singapore, Apr. 2013, pp. 39–46. [39] Z. Ni, X. Zhong, and H. He, “A boundedness theoretical analysis for GrADP design: A case study on maze navigation,” in Proc. IEEE Int. Joint Conf. Neural Netw. (IJCNN), Killarney, Ireland, Jul. 2015, pp. 1–8. [40] X. Zhong, H. He, H. Zhang, and Z. Wang, “Optimal control for unknown discrete-time nonlinear Markov jump systems using adaptive dynamic programming,” IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 12, pp. 2141–2155, Dec. 2014. [41] Y. Tang, H. He, J. Wen, and J. Liu, “Power system stability control for a wind farm based on adaptive dynamic programming,” IEEE Trans. Smart Grid, vol. 6, no. 1, pp. 166–177, Jan. 2015. [42] X. Sui, Y. Tang, H. He, and J. Wen, “Energy-storage-based lowfrequency oscillation damping control using particle swarm optimization and heuristic dynamic programming,” IEEE Trans. Power Syst., vol. 29, no. 5, pp. 2539–2548, Sep. 2014. [43] D. V. Prokhorov, R. A. Santiago, and D. C. Wunsch, II, “Adaptive critic designs: A case study for neurocontrol,” Neural Netw., vol. 8, no. 9, pp. 1367–1372, 1995. [44] Z. Ni, H. He, D. Zhao, X. Xu, and D. V. Prokhorov, “GrDHP: A general utility function representation for dual heuristic dynamic programming,” IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 3, pp. 614–627, Mar. 2015. [45] Z. Ni, Y. Tang, H. He, and J. Wen, “Multi-machine power system control based on dual heuristic dynamic programming,” in Proc. IEEE Symp. Comput. Intell. Appl. Smart Grid (CIASG), Orlando, FL, USA, Dec. 2014, pp. 1–7. [46] D. Liu, D. Wang, D. Zhao, Q. Wei, and N. Jin, “Neural-network-based optimal control for a class of unknown discrete-time nonlinear systems using globalized dual heuristic programming,” IEEE Trans. Autom. Sci. Eng., vol. 9, no. 3, pp. 628–634, Jul. 2012. [47] D. Wang, D. Liu, Q. Wei, D. Zhao, and N. Jin, “Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive dynamic programming,” Automatica, vol. 48, no. 8, pp. 1825–1832, Aug. 2012. [48] D. Wang and D. Liu, “Neuro-optimal control for a class of unknown nonlinear dynamic systems using SN-DHP technique,” Neurocomputing, vol. 121, pp. 218–225, Dec. 2013. [49] T. Weiland, “Time domain electromagnetic field computation with finite difference methods,” Int. J. Numer. 
Model., Electron. Netw., Devices Fields, vol. 9, no. 4, pp. 295–319, 1996. [50] V. Moncrief, “Finite difference approach to solving operator equations of motion in quantum theory,” Phys. Rev. D, vol. 28, no. 10, pp. 2485–2490, Nov. 1983. [51] D. Liu and D. Wang, “Optimal control of unknown nonlinear discretetime systems using the iterative globalized dual heuristic programming algorithm,” in Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, F. L. Lewis and D. Liu, Eds. Hoboken, NJ, USA: Wiley, 2013. [52] S. S. Ge, C. Yang, and T. H. Lee, “Adaptive predictive control using neural network for a class of pure-feedback systems in discrete time,” IEEE Trans. Neural Netw., vol. 19, no. 9, pp. 1599–1614, Sep. 2008. [53] C. Yang, S. S. Ge, C. Xiang, T. Chai, and T. H. Lee, “Output feedback NN control for two classes of discrete-time systems with unknown control directions in a unified approach,” IEEE Trans. Neural Netw., vol. 19, no. 11, pp. 1873–1886, Nov. 2008.
