
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 26, NO. 4, APRIL 2015

Adaptive Optimal Control of Highly Dissipative Nonlinear Spatially Distributed Processes With Neuro-Dynamic Programming

Biao Luo, Huai-Ning Wu, and Han-Xiong Li, Fellow, IEEE

Abstract— Highly dissipative nonlinear partial differential equations (PDEs) are widely employed to describe the system dynamics of industrial spatially distributed processes (SDPs). In this paper, we consider the optimal control problem of general highly dissipative SDPs and propose an adaptive optimal control approach based on neuro-dynamic programming (NDP). Initially, Karhunen-Loève decomposition is employed to compute empirical eigenfunctions (EEFs) of the SDP based on the method of snapshots. These EEFs, together with the singular perturbation technique, are then used to obtain a finite-dimensional slow subsystem of ordinary differential equations that accurately describes the dominant dynamics of the PDE system. Subsequently, the optimal control problem is reformulated on the basis of the slow subsystem and converted to the solution of a Hamilton–Jacobi–Bellman (HJB) equation. The HJB equation is a nonlinear PDE that has proven impossible to solve analytically. Thus, an adaptive optimal control method is developed via NDP that solves the HJB equation online, using a neural network (NN) to approximate the value function; an online NN weight tuning law is proposed that does not require an initial stabilizing control policy. Moreover, by accounting for the NN estimation error, we prove that the original closed-loop PDE system with the adaptive optimal control policy is semiglobally uniformly ultimately bounded. Finally, the developed method is tested on a nonlinear diffusion-convection-reaction process and applied to a temperature cooling fin of a high-speed aerospace vehicle, and the achieved results show its effectiveness.
Index Terms— Adaptive optimal control, empirical eigenfunction (EEF), highly dissipative partial differential equations (PDEs), neuro-dynamic programming (NDP), spatially distributed processes (SDPs).

I. INTRODUCTION

For many industrial systems, significant spatial variations widely exist owing to the underlying physical

Manuscript received September 17, 2013; revised February 12, 2014; accepted April 20, 2014. Date of publication May 19, 2014; date of current version March 16, 2015. This work was supported in part by the National Basic Research Program of China (973 Program) under Grant 2012CB720003, in part by the National Natural Science Foundation of China under Grant 61121003 and Grant 51175519, and in part by the General Research Fund through the Research Grants Council of Hong Kong SAR under Grant CityU 116212. B. Luo is with the Science and Technology on Aircraft Control Laboratory, Beihang University, Beijing 100191, China, and also with the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China (e-mail: [email protected]). H.-N. Wu is with the Science and Technology on Aircraft Control Laboratory, Beihang University, Beijing 100191, China (e-mail: [email protected]). H.-X. Li is with the Department of Systems Engineering and Engineering Management, City University of Hong Kong, Hong Kong (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2014.2320744

phenomena, such as diffusion, convection, phase dispersion, vibration, and flow. These spatially distributed processes (SDPs) are naturally described by sets of nonlinear partial differential equations (PDEs) with homogeneous or mixed boundary conditions. Typical examples include rapid thermal chemical vapor deposition [1] in microelectronics manufacturing, the temperature profile [2] of high-speed aerospace vehicles, catalytic packed-bed reactors [3], [4] used to convert methanol to formaldehyde, the Czochralski crystallization [5] of high-purity crystals, the tokamak device [6] for nuclear fusion, in which an intangible doughnut-shaped bottle created by magnetic field lines confines the high-temperature plasma, the wavy behavior [7] of excitable media in biology and chemistry, as well as several fluid dynamic systems [8], [9]. Most of these SDPs are described by highly dissipative PDEs, whose important feature is that their long-term dynamic behavior is characterized by a finite number of degrees of freedom. This means that their dominant dynamics can be described by a few modes, a promising feature that makes it possible to extend the control theories and methods of ordinary differential equation (ODE) systems to highly dissipative PDE systems. One popular approach to the controller design of highly dissipative PDEs involves the application of model reduction techniques to derive an ODE system that describes the dynamics of the dominant (slow) modes of the PDE system, which is subsequently used as the basis for the synthesis of finite-dimensional controllers [10], [11]. Following this line of thought, many control approaches have been developed over the last few decades.
Approximate inertial manifolds [12] were employed to derive lower order ODE systems of parabolic PDE systems [13], which were further used for the synthesis of finite-dimensional nonlinear feedback controllers that enforce stability and output tracking in the closed-loop system. In [14] and [15], an adaptive model reduction technique based on Karhunen-Loève decomposition (KLD) was applied to derive finite-dimensional ODE models of highly dissipative PDE systems, which were then used to design nonlinear controllers via state feedback [14] and output feedback [15], [16]. Optimal control of PDE systems refers to a class of methods that can be applied to synthesize a control policy yielding the best possible behavior with respect to a prescribed performance criterion. In [17], based on the empirical eigenfunctions (EEFs) computed via KLD, Bendersky and Christofides discretized the infinite-dimensional optimization problem of transport-reaction processes to a set of

2162-237X © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


low-dimensional nonlinear programming problems, which were then iteratively solved with successive quadratic programming (SQP); this method was further extended to highly dissipative PDE systems in [4], where an efficient nonlinear programming approach was presented. In [18], the PDE system was formulated as an ODE system via spatial discretization, and an optimal feedback control approach was then proposed. By taking a fixed structure for the control law based on Ritz and extended Ritz approximation schemes, a mathematical programming problem was formulated and solved with SQP in [19]. In [2], a dual heuristic programming (DHP) approach was introduced to solve the temperature profile control problem of high-speed aerospace vehicles. Xu et al. [20] considered a class of bilinear parabolic PDE systems and suggested a sequential linear quadratic (LQ) regulator approach based on an iterative scheme, and in [6] this approach was successfully applied to the optimal tracking control of the current profile in tokamaks. Luo and Wu [21] considered a class of parabolic PDE systems with nonlinear spatial differential operators and developed a neural network (NN)-based optimal control method by solving the Hamilton–Jacobi–Bellman (HJB) equation. However, most of these optimal control approaches are designed offline and are thus fragile to the changing dynamics of a system during operation. To the best of our knowledge, the adaptive optimal control problem of general highly dissipative PDE systems, based on online design with neuro-dynamic programming (NDP), has rarely been studied. NDP is a class of reinforcement learning (RL) that has been widely investigated as a machine learning technique in the artificial intelligence community [22]–[24]. NDP solves the dynamic programming problem forward in time using an NN as the value function approximator, thereby avoiding the so-called curse of dimensionality.
In recent years, some NDP approaches have been introduced to synthesize controllers for ODE systems [25]–[27]. For discrete-time (DT) optimal control problems, several NDP methods [28]–[33] were developed for solving the DT HJB equation in an offline iterative manner. Al-Tamimi et al. [28] suggested a heuristic dynamic programming (HDP) approach for solving the optimal control problem of nonlinear DT systems. In [29], the nonlinear DT HJB equation was converted to a sequence of linear generalized HJB equations. In [30], the infinite-time optimal tracking control problem with a new type of performance index was solved using the greedy HDP iteration algorithm. The optimal control problems for classes of nonlinear DT systems with control constraints [31] and time delays [34], respectively, were also solved with HDP methods. Fu et al. [35] investigated adaptive learning and control for multiple-input multiple-output systems based on approximate dynamic programming (ADP). A finite-horizon optimal control problem was studied in [32] by introducing an ε-error bound, and a finite-time problem with control constraints was considered by proposing a DHP scheme [36] with a single NN. These works assumed that there are no NN reconstruction errors. To account for the effects of NN approximation errors, a neural HDP method in [37] was applied to learn state and output feedback adaptive critic control policies of nonlinear DT affine systems with disturbances. Dierks and Jagannathan [38]


proposed a time-based ADP, which is an online control method without value and policy iterations (PIs). Liu et al. [39]–[41] presented globalized DHP algorithms using three NNs for estimating the system dynamics, the cost function and its derivatives, and the control policy, where the model NN construction error was considered. For continuous-time (CT) optimal control problems, Beard et al. [42] applied Galerkin approximation to solve the generalized HJB equation, and in [43] this method was extended to the constrained control problem by introducing a nonquadratic performance functional and using an NN to approximate the cost function. Doya [44] discussed employing appropriate estimators for approximating the value function such that the temporal difference error is minimized in the RL approach. Murray et al. [45] suggested two policy iteration (PI) algorithms that avoid the necessity of knowing the internal system dynamics. Vrabie et al. [46] extended their result and proposed a new PI algorithm to solve the LQ regulation problem online along a single state trajectory. A nonlinear version of this algorithm was presented in [47] using an NN approximator. Vamvoudakis and Lewis [48] also gave a so-called synchronous PI algorithm, which synchronously tunes the weights of both actor and critic NNs for the optimal control problem of nonlinear ODE systems. Luo et al. [49], [50] proposed off-policy RL approaches for optimal control and H∞ control design of CT systems. Off-policy RL is a kind of data-based method, which learns the control policy from data generated with arbitrary policies rather than the evaluated policy. Liu et al. [51] proposed a novel online learning optimal control approach to deal with the decentralized stabilization problem for a class of CT nonlinear interconnected systems. However, due to the infinite-dimensional nature of PDE systems, it is still difficult to directly apply these control design methods for ODE systems to PDE systems.
In this paper, we consider the optimal control problem of general highly dissipative PDE systems and develop an adaptive optimal control method based on NDP. The main contributions of this paper can be briefly summarized as follows.

1) An adaptive optimal control method based on NDP is proposed for nonlinear highly dissipative PDE systems. To the best of our knowledge, NDP is rarely used for the control design of PDE systems, and it still remains an open problem.

2) The developed adaptive optimal control method can solve the HJB equation online without requiring an initial stabilizing control law.

3) In the presence of the NN approximation error, we prove with singular perturbation (SP) theory that the developed adaptive optimal control policy can ensure that the original closed-loop PDE system is semiglobally uniformly ultimately bounded (SGUUB).

4) To show the effectiveness of the developed adaptive optimal control method, simulation studies are conducted on a general nonlinear convection-diffusion-reaction process with four cases, and the method is then applied to a complex temperature cooling fin of a high-speed aerospace vehicle.


The rest of this paper is arranged as follows. The problem description is given in Section II. KLD and the SP technique are used for model reduction in Section III. In Section IV, an adaptive optimal control method based on NDP is proposed and the stability of the closed-loop PDE system is proved. Simulation studies are conducted in Section V, and a brief conclusion is given in Section VI.

Notation: $\mathbb{R}$, $\mathbb{R}^n$, and $\mathbb{R}^{n\times m}$ are the set of real numbers, the $n$-dimensional Euclidean space, and the set of all real $n\times m$ matrices, respectively. $|\cdot|$, $\|\cdot\|$, and $\langle\cdot,\cdot\rangle_{\mathbb{R}^n}$ denote the absolute value for scalars, the Euclidean norm, and the inner product for vectors, respectively. Let $\mathbb{R}^\infty$ be the vector space of infinite sequences $a \triangleq [a_1, \ldots, a_\infty]^T$ of real numbers equipped with the norm $\|a\|_{\mathbb{R}^\infty} \triangleq \big(\sum_{i=1}^{\infty} a_i^2\big)^{1/2}$, which is a natural generalization of the space $\mathbb{R}^n$. The superscript $T$ is used for the transpose, $I$ denotes the identity matrix of appropriate dimension, and $\nabla \triangleq \partial/\partial x$ denotes the gradient operator. $\bar{\sigma}(\cdot)$ and $\underline{\sigma}(\cdot)$ denote the maximum and minimum singular values. For a symmetric matrix $M$, $M > (\geq)\, 0$ means that it is a positive (semipositive) definite matrix. $\|v\|_M^2 \triangleq v^T M v$ for a real vector $v$ and a symmetric matrix $M > (\geq)\, 0$ of appropriate dimensions. $C^1(\Omega)$ is the space of functions on $\Omega$ whose first derivatives are continuous. $L^2([z_1, z_2], \mathbb{R}^n)$ is an infinite-dimensional Hilbert space of $n$-dimensional square-integrable vector functions $\omega(z) \in L^2([z_1, z_2], \mathbb{R}^n)$, $z \in [z_1, z_2] \subset \mathbb{R}$, equipped with the inner product $\langle \omega_1(\cdot), \omega_2(\cdot) \rangle \triangleq \int_{z_1}^{z_2} \langle \omega_1(z), \omega_2(z) \rangle_{\mathbb{R}^n}\, dz$ and the norm $\|\omega_1(\cdot)\|_2 \triangleq \langle \omega_1(\cdot), \omega_1(\cdot) \rangle^{1/2}$ for all $\omega_1(z), \omega_2(z) \in L^2([z_1, z_2], \mathbb{R}^n)$.

II. PROBLEM DESCRIPTION

We consider a general class of CT SDPs, which are described by highly dissipative nonlinear PDEs with the following state-space representation:

$$\begin{cases} \dfrac{\partial y(z,t)}{\partial t} = L\left(y, \dfrac{\partial y}{\partial z}, \dfrac{\partial^2 y}{\partial z^2}, \ldots, \dfrac{\partial^{n_0} y}{\partial z^{n_0}}\right) + B(z)u(t) \\[6pt] y_h(t) = \displaystyle\int_{z_1}^{z_2} H(z)\, y(z,t)\, dz \end{cases} \qquad (1)$$

subjected to the mixed-type boundary conditions

$$\left. q\left(y, \frac{\partial y}{\partial z}, \frac{\partial^2 y}{\partial z^2}, \ldots, \frac{\partial^{n_0-1} y}{\partial z^{n_0-1}}\right) \right|_{z=z_1 \text{ or } z_2} = 0 \qquad (2)$$

and the initial condition

$$y(z, 0) = y_0(z) \qquad (3)$$

where $z \in [z_1, z_2] \subset \mathbb{R}$ is the spatial coordinate, $t \in [0, \infty)$ is the temporal coordinate, $y(z,t) = [y_1(z,t), \ldots, y_n(z,t)]^T \in \mathbb{R}^n$ is the state, and $u(t) \in \mathbb{R}^p$ is the manipulated input. $L \in \mathbb{R}^n$ is a sufficiently smooth nonlinear vector function that involves a highly dissipative, possibly nonlinear, spatial differential operator of order $n_0$ (an even number). $q$ is a sufficiently smooth nonlinear vector function, $B(z)$ and $H(z)$ are known sufficiently smooth matrix functions of appropriate dimensions, and $y_0(z)$ is a smooth vector function representing the initial state profile.

Remark 1: The highly dissipative nonlinear PDE system (1)–(3) represents a large number of complex industrial SDPs. Representative examples include transport-reaction processes with significant diffusive and dispersive mechanisms that are naturally described by nonlinear parabolic PDEs (such as chemical reactors [52], [53], the catalytic rod [4], heat transfer [2], rapid thermal chemical vapor deposition [1], crystal growth processes [5], the FitzHugh–Nagumo equation [7], [15], etc.), and several fluid dynamic systems (such as Burgers' equation [8] for gas dynamics and traffic flow, the Kuramoto–Sivashinsky equation [9], the Navier–Stokes equations [9], etc.).

In this paper, we consider the optimal control problem of the PDE system (1)–(3) with the following infinite-horizon LQ cost functional:

$$V(y_0(\cdot)) \triangleq \int_0^{+\infty} \left( \|y_h(t)\|^2 + \|u(t)\|_R^2 \right) dt \qquad (4)$$

where $R > 0$. The goal is to find a control policy $u^*(t)$ such that the above cost functional is minimized, i.e.,

$$u(t) \triangleq u^*(t) \triangleq \arg\min_u V(y_0(\cdot)). \qquad (5)$$

III. MODEL REDUCTION BASED ON KLD AND SP TECHNIQUE

It is known [14]–[16], [54] that the main feature of highly dissipative PDE systems is that they involve a spatial differential operator whose eigenspectrum can be partitioned into a finite-dimensional slow part and an infinite-dimensional stable fast complement. This means that their dominant dynamic behavior can be accurately described by finite-dimensional systems. However, for many real industrial SDPs with nonlinear spatial differential operators [1], [4], [20], [54]–[56], it is impossible to compute analytic expressions for the eigenvalues and eigenfunctions, and thus the direct use of basis functions to derive finite-dimensional approximations of the PDE system is prohibited. To overcome this difficulty, we initially compute a set of EEFs (dominant spatial patterns) of the PDE system using KLD based on the method of snapshots. These EEFs, together with the SP technique, will subsequently be applied to obtain a slow subsystem that accurately describes the dominant dynamics of the PDE system.

A. EEFs Computation via KLD

KLD is a popular statistical pattern analysis method for seeking the dominant structures in an ensemble of data from a high-dimensional process and obtaining low-dimensional approximate descriptions; it is used in many engineering fields. Given an ensemble of data, KLD yields a set of orthogonal EEFs for the representation of the ensemble, as well as a measure of the relative contribution of each EEF to the total energy (mean square fluctuation) of the ensemble. In this sense, the EEFs provide an optimal basis for a truncated series representation, which has a smaller mean square error than a representation by any other basis of the same dimension. In other words, the projection onto the first few EEFs captures more of the energy than any other projection. These properties make the


Algorithm 1 KLD Implementation Procedure
Step 1: For the real SDP, collect an ensemble of states $\{y_i(z)\}$, $i = 1, \ldots, M$.
Step 2: Compute the matrix $C = (c_{kj})_{M\times M}$ with $c_{kj} = \frac{1}{M}\int_{z_1}^{z_2} y_k^T(\xi)\, y_j(\xi)\, d\xi$.
Step 3: Solve the eigenvalue problem $C\alpha_i = \lambda_i \alpha_i$, and assume that $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_M$.
Step 4: Normalize the eigenvector $\alpha_i$ to satisfy $\langle \alpha_i, \alpha_i \rangle_{\mathbb{R}^M} = \frac{1}{M\lambda_i}$. Then, the EEFs can be computed as $\phi_i = Y(z)\alpha_i$, where $Y(z) = [y_1(z) \cdots y_M(z)]$.

EEFs a natural choice when performing model reduction. KLD has already been applied to compute EEFs for PDE systems [4], [16], [57], [58]. Here, we omit the derivation of KLD and directly give the implementation procedure in Algorithm 1.

B. Model Reduction With SP Technique

For convenience, we denote $y(\cdot, t) \triangleq y(z, t)$ and $M(\cdot) \triangleq M(z)$, $z \in [z_1, z_2]$, for some space-varying matrix function $M(z)$. To simplify the notation, we consider the PDE system (1)–(3) with $n = 1$ without loss of generality. Assume that the PDE state $y(z, t)$ can be represented as an infinite weighted sum of a complete set of orthogonal basis functions $\{\phi_i(z)\}$:

$$y(z, t) = \sum_{i=1}^{\infty} x_i(t)\, \phi_i(z) \qquad (6)$$

where $x_i(t)$ is a time-varying coefficient named the mode of the PDE system. By taking the inner product with $\phi_i(z)$ ($i = 1, 2, \ldots$) on both sides of the PDE system (1)–(3), we obtain the following infinite-dimensional ODE system:

$$\begin{cases} \dot{x}(t) = f_s(x(t), x_f(t)) + B u(t) \\ \dot{x}_f(t) = L_f(x(t), x_f(t)) + B_f u(t) \\ y_h(t) = H_s x(t) + H_f x_f(t) \\ x(0) = x_0, \quad x_f(0) = x_{f0} \end{cases} \qquad (7)$$

where

$$x(t) = \langle y(\cdot, t), \Phi_s(\cdot) \rangle \triangleq [x_1(t), \ldots, x_N(t)]^T \qquad (8)$$

and $x_f(t) = \langle y(\cdot, t), \Phi_f(\cdot) \rangle \triangleq [x_{N+1}(t), \ldots, x_\infty(t)]^T \in \mathbb{R}^\infty$, $f_s(x, x_f) \triangleq \langle L, \Phi_s(\cdot) \rangle$, $L_f(x, x_f) \triangleq \langle L, \Phi_f(\cdot) \rangle$, $B \triangleq \langle B(\cdot), \Phi_s(\cdot) \rangle$, $B_f \triangleq \langle B(\cdot), \Phi_f(\cdot) \rangle$, $H_s \triangleq \int_{z_1}^{z_2} H(z) \Phi_s^T(z)\, dz$, $H_f \triangleq \int_{z_1}^{z_2} H(z) \Phi_f^T(z)\, dz$, $x_0 \triangleq \langle y_0(\cdot), \Phi_s(\cdot) \rangle$, and $x_{f0} \triangleq \langle y_0(\cdot), \Phi_f(\cdot) \rangle$, with $\Phi_s(z) \triangleq [\phi_1(z), \ldots, \phi_N(z)]^T$ and $\Phi_f(z) \triangleq [\phi_{N+1}(z), \ldots, \phi_\infty(z)]^T$. Owing to the highly dissipative nature of the PDE system (1)–(3), it is reasonable [55], [59] to assume that

$$L_f(x, x_f) = \frac{1}{\varepsilon} A_{f\varepsilon} x_f + f_f(x, x_f)$$

where $\varepsilon > 0$ is a small positive parameter quantifying the separation between the slow (dominant) and fast (negligible) modes, $A_{f\varepsilon}$ is a stable matrix (in the sense that the state of the system $\dot{x}_f = A_{f\varepsilon} x_f$ tends exponentially to zero), and $f_f(x, x_f)$ satisfies

$$\|f_f(x, x_f)\| \leq k_1 \|x\| + k_2 \|x_f\|_{\mathbb{R}^\infty} \qquad (9)$$

for $\|x\| \leq \beta_1$ and $\|x_f\|_{\mathbb{R}^\infty} \leq \beta_2$ with $k_1, k_2 > 0$. Then, system (7) can be rewritten in the following standard singularly perturbed form:

$$\begin{cases} \dot{x}(t) = f_s(x(t), x_f(t)) + B u(t) \\ \varepsilon \dot{x}_f(t) = A_{f\varepsilon} x_f(t) + \varepsilon f_f(x(t), x_f(t)) + \varepsilon B_f u(t) \\ y_h(t) = H_s x(t) + H_f x_f(t) \\ x(0) = x_0, \quad x_f(0) = x_{f0}. \end{cases} \qquad (10)$$

Introducing the fast time scale $\tau = t/\varepsilon$ and setting $\varepsilon = 0$, we obtain the following infinite-dimensional fast subsystem:

$$\frac{dx_f}{d\tau} = A_{f\varepsilon} x_f \qquad (11)$$

which is exponentially stable [55], [59]. Then, setting $\varepsilon = 0$ in system (10), we have $x_f = 0$, and thus the following slow subsystem is obtained:

$$\begin{cases} \dot{x}(t) = f(x(t)) + B u(t) \\ x(0) = x_0 \end{cases} \qquad (12)$$

where $f(x) \triangleq f_s(x, 0)$. Observe that the LQ cost functional (4) can be rewritten as

$$V = \int_0^{+\infty} \left( \|x(t)\|_{Q_s}^2 + \|u(t)\|_R^2 \right) dt + \mathcal{R}(x, x_f) \qquad (13)$$

where $\mathcal{R}(x, x_f) \triangleq \int_0^{+\infty} (2 x^T Q_{sf} x_f + x_f^T Q_f x_f)\, dt$, $Q_s \triangleq H_s^T H_s$, $Q_{sf} \triangleq H_s^T H_f$, and $Q_f \triangleq H_f^T H_f$. Using $x_f = 0$, we have $\mathcal{R}(x, x_f) = 0$, and the LQ cost functional (13) becomes

$$V(x_0) = \int_0^{+\infty} \left( \|x(\tau)\|_{Q_s}^2 + \|u(\tau)\|_R^2 \right) d\tau. \qquad (14)$$

Now, the optimal control problem of the PDE system (1)–(3) can be reformulated as: find a state feedback control policy to minimize the LQ cost function (14) based on the slow subsystem (12). In this paper, we use the EEFs for model reduction, i.e., the basis function $\phi_i(z)$ is computed with KLD. The dimension of $\Phi_s$ (i.e., $N$) is chosen such that it satisfies

$$\sum_{i=1}^{N} \lambda_i \Big/ \sum_{j=1}^{M} \lambda_j \geq 1 - \zeta \qquad (15)$$

for a small positive real number $\zeta > 0$.

Remark 2: The use of EEFs for model reduction of nonlinear PDE systems has two merits compared with analytical eigenfunctions. The first is that EEFs can be computed for PDE systems with any form of nonlinear spatial differential operator, while analytical eigenfunctions are only available for PDE systems with a known linear spatial differential operator. Another important merit of EEFs is that they can be computed online on the basis of collected system state data by using KLD, which does not require a mathematical system model. Thus, high complexity or unavailability of the system dynamic model has no effect on the computation of EEFs.
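Algorithm 1 and the truncation criterion (15) can be sketched in a few lines of NumPy. The function and variable names below are our own, the snapshot ensemble in the usage example is synthetic, and the trapezoidal quadrature is an illustrative choice rather than the paper's implementation.

```python
import numpy as np

def kld_eefs(Y, z, zeta=0.01):
    """EEFs via the method of snapshots (a sketch of Algorithm 1).

    Y    : (nz, M) array whose columns are M snapshots y_i(z) on the grid z
    z    : (nz,) spatial grid on [z1, z2]
    zeta : energy-truncation tolerance in criterion (15)
    """
    def integ(f):                         # trapezoidal quadrature on grid z
        return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(z)))

    M = Y.shape[1]
    # Step 2: correlation matrix c_kj = (1/M) * integral of y_k(xi) y_j(xi)
    C = np.array([[integ(Y[:, k] * Y[:, j]) for j in range(M)]
                  for k in range(M)]) / M
    # Step 3: eigendecomposition, eigenvalues sorted in descending order
    lam, alpha = np.linalg.eigh(C)
    order = np.argsort(lam)[::-1]
    lam, alpha = lam[order], alpha[:, order]
    lam = np.clip(lam, 1e-12, None)       # guard against tiny negative values
    # Step 4: scale alpha_i so <alpha_i, alpha_i> = 1/(M lam_i),
    # which makes each EEF phi_i = Y @ alpha_i of unit L2-norm
    alpha = alpha / np.sqrt(M * lam)
    Phi_s = Y @ alpha                     # columns are the EEFs phi_i(z)
    # Criterion (15): smallest N capturing a (1 - zeta) energy fraction
    N = int(np.searchsorted(np.cumsum(lam) / lam.sum(), 1.0 - zeta)) + 1
    return Phi_s[:, :N], lam, N
```

For a synthetic two-mode ensemble, the routine recovers a small $N$ and unit-norm EEFs, consistent with the optimality property of the KLD basis.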


IV. ADAPTIVE OPTIMAL CONTROL DESIGN WITH NDP

In this section, we design an NDP-based adaptive optimal controller for the PDE system (1)–(3) on the basis of the finite-dimensional ODE model (12).

A. Adaptive Optimal Control Design

It is known [60] that the optimal control problem of the nonlinear finite-dimensional ODE system (12) with cost function (14) relies on the solution of the following HJB equation:

$$(\nabla V^*)^T f + \|x\|_{Q_s}^2 - \frac{1}{4} (\nabla V^*)^T B R^{-1} B^T \nabla V^* = 0 \qquad (16)$$

where $V^*(x) \in C^1(\Omega)$. Then, the optimal control policy $u^*$ is given by

$$u^*(x) = -\frac{1}{2} R^{-1} B^T \nabla V^*(x). \qquad (17)$$

Note that the HJB equation (16) is a nonlinear PDE that has proven impossible to solve analytically. To overcome this difficulty, we propose an adaptive optimal control method based on NDP, which can learn the solution of the HJB equation online. NDP is implemented on an actor-critic structure, where actor and critic NNs are used for approximating the control policy and the value function, respectively. From the well-known high-order Weierstrass approximation theorem [61], it follows that a continuous function can be represented by an infinite-dimensional linearly independent basis function set $\{\psi_k(x)\}_{k=1}^{\infty}$. For real practical applications, it is usually required to approximate the function on a compact set with a finite-dimensional function set. Thus, we consider the critic NN for approximating the solution $V^*(x)$ of the HJB equation (16) on a compact set $\Omega$. Let $\Phi(x) \triangleq [\psi_1(x), \psi_2(x), \ldots, \psi_L(x)]^T$ be a vector of linearly independent activation functions, where $L$ is the number of neurons in the NN hidden layer. Then, $V^*(x)$ can be expressed with $\Phi(x)$ as follows:

$$V^*(x) = \theta^{*T} \Phi(x) + \delta_1(x) \qquad (18)$$

where $\theta^* \triangleq [\theta_1^*, \theta_2^*, \ldots, \theta_L^*]^T$ is the unknown ideal constant weight vector and $\delta_1(x)$ is the NN approximation error. The optimal control law (17) is rewritten as

$$u^*(x) = -\frac{1}{2} R^{-1} B^T [\nabla \Phi(x)]^T \theta^* - \frac{1}{2} R^{-1} B^T \nabla \delta_1(x). \qquad (19)$$
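A control law of the form (19)/(23) is, in code, just a matrix-vector expression in the critic weights. The quadratic basis below (for a hypothetical two-dimensional slow state with $L = 3$) is our own illustrative stand-in for the paper's activation functions, not its actual choice.

```python
import numpy as np

def phi(x):
    """Illustrative activation vector Phi(x) for n = 2, L = 3."""
    x1, x2 = x
    return np.array([x1**2, x1*x2, x2**2])

def grad_phi(x):
    """Jacobian of phi: row i is the gradient of psi_i(x), shape (L, n)."""
    x1, x2 = x
    return np.array([[2*x1, 0.0],
                     [x2,   x1],
                     [0.0, 2*x2]])

def actor(x, theta, B, Rinv):
    """Control policy induced by critic weights: u = -1/2 R^{-1} B^T (grad Phi)^T theta."""
    grad_V = grad_phi(x).T @ theta        # approximate value-function gradient
    return -0.5 * Rinv @ B.T @ grad_V
```

For example, with $B = [0, 1]^T$, $R = 1$, and $\theta = [1, 0, 1]^T$ (so the critic represents $V = x_1^2 + x_2^2$), the policy reduces to $u = -x_2$.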

For convenience, define the matrix $D \triangleq B R^{-1} B^T$. It follows from (18) that the HJB equation (16) is represented as

$$\theta^{*T} [\nabla \Phi] f + \|x\|_{Q_s}^2 - \frac{1}{4} \big\| [\nabla \Phi]^T \theta^* \big\|_D^2 + \delta_2(x) = 0 \qquad (20)$$

where $\delta_2(x)$ is the residual error of the HJB equation due to the critic NN error, given by

$$\delta_2(x) \triangleq [\nabla \delta_1(x)]^T (f(x) + B u^*(x)) - \frac{1}{4} \|\nabla \delta_1(x)\|_D^2. \qquad (21)$$

Although the ideal critic NN weight vector $\theta^*$ provides the best approximate solution of the HJB equation (16), it is unknown. Actually, the output of the critic NN is

$$\hat{V}(x) = \hat{\theta}^T \Phi(x) \qquad (22)$$

where $\hat{\theta} \triangleq [\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_L]^T$ is the estimate of $\theta^*$. Thus, the actual output of the actor NN is

$$\hat{u}(x) = -\frac{1}{2} R^{-1} B^T [\nabla \Phi(x)]^T \hat{\theta}. \qquad (23)$$

It is observed from (22) and (23) that the critic and actor NNs share the same weight vector $\hat{\theta}$, which implies that only the critic NN is required. Due to the NN approximation error, replacing $V^*(x)$ in the HJB equation (16) with $\hat{V}(x)$ of (22) yields the following residual error:

$$(\nabla \hat{V}(x))^T f(x) + \|x\|_{Q_s}^2 - \frac{1}{4} \|\nabla \hat{V}(x)\|_D^2 = \delta_3(x)$$

where

$$\delta_3(x) \triangleq \hat{\theta}^T [\nabla \Phi(x)] f(x) + \|x\|_{Q_s}^2 - \frac{1}{4} \big\| [\nabla \Phi(x)]^T \hat{\theta} \big\|_D^2. \qquad (24)$$

It is observed from (20) and (24) that when $\hat{\theta} \to \theta^*$, $\delta_3 \to -\delta_2$. Therefore, it is desired to select the tuning law of $\hat{\theta}$ to minimize the squared residual error

$$E(\hat{\theta}) = \frac{1}{2} \delta_3^2(x).$$

Besides, the tuning law is also expected to ensure stability of the closed-loop PDE system (under the adaptive optimal control law (23)) during the learning process of the critic NN weight vector. To attain these two goals, we select the tuning law of $\hat{\theta}$ as follows:

$$\dot{\hat{\theta}} = -\alpha \mu \rho(x) \delta_3(x) + \alpha \kappa \nabla \Phi(x)\, D\, \nabla V(x) \qquad (25)$$

where $\alpha > 0$ is the adaptive gain of the critic NN, $\rho(x) \triangleq \nabla \Phi(x) (f(x) + B \hat{u}(x))$, $\mu \triangleq 1/(1 + \rho^T \rho)^2$,

$$\kappa = \begin{cases} 0 & \text{if } (\nabla V)^T (f + B\hat{u}) < 0 \\ 0.5 & \text{else} \end{cases} \qquad (26)$$

and $V(x) \in C^1(\Omega)$ is a radially unbounded Lyapunov function satisfying

$$(\nabla V)^T f - \frac{1}{2} (\nabla V)^T D (\nabla \Phi)^T \theta^* \leq 0. \qquad (27)$$

Remark 3: It is noted that the weight vector tuning law (25) of $\hat{\theta}$ contains two parts. The first part is a normalized gradient descent scheme for minimizing the squared residual error. The second part is designed for stability purposes and is activated when the control policy is not stabilizing. This means that in the proposed adaptive optimal control method, the initial critic NN weight vector $\hat{\theta}(0)$ can be chosen randomly, and thus no initial admissible control policy is needed.

B. Stability Analysis

Notice that the adaptive optimal control is designed based on the slow subsystem. Thus, we should analyze the stability of the original closed-loop PDE system. Before starting, we introduce a definition and some reasonable assumptions.

Definition 1 [62]: Consider system (12). The solution $x(t)$ is SGUUB if, for any compact set $\Omega \subset \mathbb{R}^N$ and $x(t_0) = x_0 \in \Omega$, there exist positive constants $\mu_1$ and $T(\mu_1, x_0)$ such that $\|x(t)\| < \mu_1$ for all $t > t_0 + T$.
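A discrete-time sketch of the tuning law (25) with the switch (26) is given below, using forward-Euler integration and the Lyapunov candidate $V(x) = 0.5\, x^T x$ (so $\nabla V = x$), the choice used later in the paper's simulations. All function names are ours, and the caller is assumed to supply the Jacobian $\nabla\Phi(x)$; this is an illustration under those assumptions, not the paper's implementation.

```python
import numpy as np

def critic_update(theta, x, f, B, Rinv, Qs, grad_phi, alpha=10.0, dt=0.01):
    """One Euler step of the critic weight tuning law (25)-(26).

    f        : drift f(x) of the slow subsystem (12)
    grad_phi : function returning the (L, n) Jacobian of the activation vector
    """
    GP = grad_phi(x)                               # (L, n) Jacobian of Phi
    D = B @ Rinv @ B.T                             # D = B R^{-1} B^T
    u = -0.5 * Rinv @ B.T @ (GP.T @ theta)         # actor output (23)
    xdot = f(x) + B @ u                            # closed-loop slow dynamics
    rho = GP @ xdot                                # rho = grad_phi (f + B u)
    gv = GP.T @ theta                              # approximate value gradient
    # Residual error (24): theta^T gradPhi f + ||x||^2_Qs - 1/4 ||gv||^2_D
    delta3 = theta @ (GP @ f(x)) + x @ (Qs @ x) - 0.25 * gv @ (D @ gv)
    mu = 1.0 / (1.0 + rho @ rho) ** 2              # normalization term
    kappa = 0.0 if x @ xdot < 0.0 else 0.5         # switch (26), grad V = x
    dtheta = -alpha * mu * rho * delta3 + alpha * kappa * (GP @ (D @ x))
    return theta + dt * dtheta
```

The stabilizing second term contributes only when $(\nabla V)^T(f + B\hat{u}) \geq 0$, i.e., when the current policy fails to decrease the Lyapunov candidate.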


Assumption 1: 1) The ideal weight vector of the critic NN is bounded, i.e., $\|\theta^*\| \leq \theta_M$, where $\theta_M$ is a positive number; 2) the critic NN approximation error $\delta_1(x)$ and its derivative $\nabla \delta_1(x)$ are bounded on a compact set containing $\Omega$, i.e., $|\delta_1| \leq \delta_{1M}$ and $\|\nabla \delta_1\| \leq \delta_{1DM}$, where $\delta_{1M}$ and $\delta_{1DM}$ are positive numbers; 3) the critic NN activation functions and their gradients are bounded on a compact set containing $\Omega$, i.e., $\|\Phi\| \leq \psi_M$ and $\|\nabla \Phi\| \leq \psi_{DM}$, where $\psi_M$ and $\psi_{DM}$ are positive numbers; 4) $f(x)$ is bounded on a compact set containing $\Omega$, i.e., $\|f\| \leq \upsilon_{fM}$, where $\upsilon_{fM}$ is a positive number.

Define the critic NN weight estimation error as $\tilde{\theta}(t) \triangleq \hat{\theta}(t) - \theta^*$, the value function estimation error as $\tilde{V}(x) \triangleq \hat{V}(x) - V^*(x)$, and the control estimation error as $\tilde{u}(x) \triangleq \hat{u}(x) - u^*(x)$. Then, we have the following result.

Theorem 1: Consider the optimal control problem of the PDE system (1)–(3) and the adaptive optimal control policy (23) with the tuning law (25) for the critic NN weight vector. Let Assumption 1 hold. Then, there exist positive real constants $\sigma_1$, $\sigma_2$, and $\varepsilon^*$ such that if $\|x_0\| \leq \sigma_1$, $\|x_{f0}\|_{\mathbb{R}^\infty} \leq \sigma_2$, and $\varepsilon \in (0, \varepsilon^*)$, it follows that: 1) the critic NN weight estimation error $\tilde{\theta}$ is SGUUB, and the state $y(z, t)$ of the original closed-loop PDE system (1)–(3) is SGUUB in the $L^2$-norm, i.e., there exists a positive number $\sigma_3$ such that $\|y(\cdot, t)\|_2 \leq \sigma_3$; 2) the state of the slow subsystem $x(t)$ converges to the compact set $\Omega_x = \{x \mid \|x\| \leq B_x\}$, where $B_x$ is defined in (A23); 3) the value function estimation error $\tilde{V}(x)$ and the control estimation error $\tilde{u}(x)$ are also SGUUB.

Proof: See the Appendix.

Remark 4: From Assumption 1 and the definition of the bound $B_x$, it is noted that the compact set $\Omega$ for NN approximation can be determined before control design.

Remark 5: The convergence of $\tilde{V}(x)$ and $\tilde{u}(x)$ shows that, with the adaptive optimal control policy (23) and the critic NN weight tuning law (25), $\hat{V}$ and $\hat{u}$ will approach $V^*$ and $u^*$, respectively.
To summarize briefly, the adaptive optimal control policy (23) and its weight tuning law (25) are based on NDP for solving the HJB equation online. It is noted that the control approach does not require an initial stabilizing control policy, and thus the initial critic NN weights can be set randomly. This is very important because an initial stabilizing control policy is often difficult to obtain for practical complex SDPs. Since the adaptive optimal controller is designed based on the slow subsystem, SP theory is employed to prove the convergence of the original closed-loop PDE system in Theorem 1, where the NN estimation error is taken into account.

V. SIMULATION STUDIES

In this section, we first test the effectiveness of the developed adaptive optimal control on a general convection-diffusion-reaction process. Then, it is applied to a temperature cooling fin of a high-speed aerospace vehicle.


A. Effectiveness Test on a General Convection-Diffusion-Reaction Process

Consider the following general nonlinear diffusion-convection-reaction process [4], [14], [56]:

$$\begin{cases} \dfrac{\partial y(z,t)}{\partial t} = (1 - \alpha_{dc})\, f_d\big(y, \partial y/\partial z, \partial^2 y/\partial z^2\big) - \alpha_{dc}\, f_c(\partial y/\partial z) + f_r(z, y) + b(z)u(t) \\[6pt] y_h(t) = \displaystyle\int_0^{\pi} y(z, t)\, dz \end{cases} \qquad (28)$$

subjected to the Dirichlet boundary conditions y(0, t) = y(π, t) = 0

(29)

and the initial condition

$$y(z, 0) = y_0(z) = 0.3 \sin(3z) \qquad (30)$$

where $f_c = \partial y/\partial z$, $f_d = \partial/\partial z\,(k(y)\,\partial y/\partial z)$, $f_r = \beta_T(z)(e^{-\gamma/(1+y)} - e^{-\gamma}) - \beta_U y$, and $b(z) = \beta_U \bar{b}(z)$; $y$ is the PDE state, $z \in [0, \pi]$, $k(y)$ is the diffusion coefficient that may be constant or dependent on the state, $\beta_T(z)$ is the heat of reaction, which is spatially varying, $\beta_U$ is the heat transfer coefficient, $\gamma$ is the activation energy, and $\bar{b}(z)$ is the actuator distribution function. These parameters are given as $k(y) = 0.5 + 0.7/(1+y)$, $\beta_T(z) = 16(\cos(z) + 1)$, $\beta_U = 1$, $\gamma = 4$, and $\bar{b}(z) = H(z - 0.2\pi) - H(z - 0.4\pi)$, where $H(\cdot)$ is the standard Heaviside function. The terms $f_c$, $f_d$, and $f_r$ correspond to the convection, diffusion, and reaction phenomena in the process, respectively. The coefficient $\alpha_{dc}$ ($0 \leq \alpha_{dc} \leq 1$) represents the relative weight of convection and diffusion in the process. It is known [1]–[4], [6], [14], [20] that the diffusion term $f_d$, which appears in parabolic PDE systems, is highly dissipative. Thus, the dominant system dynamics can be represented by a lower order slow subsystem together with an infinite-dimensional fast subsystem that is exponentially stable. In contrast, the convection term $f_c$, which appears in hyperbolic PDE systems [63]–[66], does not have the slow-fast separation feature because the eigenmodes contain the same, or nearly the same, amount of energy. This implies that for hyperbolic PDE systems there does not exist an exponentially decaying fast subsystem, and thus an infinite number of modes is required to accurately describe their dynamic behavior. Controller design methods based on a slow subsystem are therefore usually difficult for hyperbolic systems. For the convection-diffusion-reaction process (28)–(30), as the weight coefficient $\alpha_{dc}$ increases, the weight of the convection term $f_c$ increases and the weight of the diffusion term $f_d$ decreases, and thus the control problem becomes more and more difficult for the developed adaptive optimal control method.
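For reference, the open-loop process (28)–(30) can be simulated with a simple explicit finite-difference scheme. The grid sizes, the time step, the use of central differences via `np.gradient`, and the sign convention for the convection term reflect our own reading of (28); this is an illustrative sketch, not the paper's discretization.

```python
import numpy as np

def simulate(alpha_dc=0.2, nz=101, dt=1e-4, t_end=0.05, u=0.0):
    """Explicit Euler / central-difference sketch of the open-loop process (28)-(30)."""
    z = np.linspace(0.0, np.pi, nz)
    dz = z[1] - z[0]
    y = 0.3 * np.sin(3 * z)                      # initial condition (30)
    betaT = 16.0 * (np.cos(z) + 1.0)             # spatially varying heat of reaction
    b_bar = ((z >= 0.2 * np.pi) & (z < 0.4 * np.pi)).astype(float)
    gamma, betaU = 4.0, 1.0
    for _ in range(int(t_end / dt)):
        k = 0.5 + 0.7 / (1.0 + y)                # state-dependent diffusivity k(y)
        dydz = np.gradient(y, dz)
        fd = np.gradient(k * dydz, dz)           # f_d = d/dz (k(y) dy/dz)
        fc = dydz                                # f_c = dy/dz
        fr = betaT * (np.exp(-gamma / (1.0 + y)) - np.exp(-gamma)) - betaU * y
        y = y + dt * ((1 - alpha_dc) * fd - alpha_dc * fc + fr + betaU * b_bar * u)
        y[0] = y[-1] = 0.0                       # Dirichlet BCs (29)
    return z, y
```

With `dt = 1e-4` and `nz = 101`, the explicit scheme satisfies the diffusive stability restriction for the parameter values above; a stiff or implicit solver would be the more robust choice for longer horizons.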
To test the efficiency of the developed method and show the influence of the weight coefficient αdc between convection and diffusion, we conduct simulations for four cases: αdc = 0.2, 0.5, 0.7, and 0.9. The same parameter settings are used in all four cases. The weighting matrix R in (4) is given as R = 1. The time and space sampling intervals are Δt = 0.05 s and Δz = 0.0628. Using different initial conditions and input signals, we collect 5000 snapshots for computing EEFs with KLD. The first three EEFs are employed to compute the state
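The KLD step can be sketched as follows; computing the EEFs through an SVD of the snapshot matrix (rather than forming the covariance explicitly) and the unit-L2 normalization are implementation choices assumed here, not prescribed by the paper.

```python
import numpy as np

# Hedged sketch of Karhunen-Loeve decomposition via the method of
# snapshots: the EEFs are the dominant right singular vectors of the
# snapshot matrix, normalized to unit L2 norm over the spatial domain.
def compute_eefs(snapshots, n_modes=3, dz=np.pi / 50):
    """snapshots: (M, nz) array, one spatial profile per row."""
    _, s, vt = np.linalg.svd(snapshots, full_matrices=False)
    eefs = vt[:n_modes]
    norms = np.sqrt(np.sum(eefs**2, axis=1) * dz)   # Riemann-sum L2 norm
    eefs = eefs / norms[:, None]
    energy = s**2 / np.sum(s**2)                    # energy captured per mode
    return eefs, energy[:n_modes]
```

Projecting the PDE state onto the returned EEFs then gives the modal amplitudes of the slow subsystem.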


Fig. 1. State profiles of open-loop PDE system. (a) αdc = 0.2. (b) αdc = 0.5. (c) αdc = 0.7. (d) αdc = 0.9.

of the slow subsystem. To use the developed adaptive optimal control method, we select the critic NN activation function vector φ(x) of size 18 (i.e., L = 18) as

φ(x) = [x1², x1x2, x1x3, x2², x2x3, x3², x1²x2², x1²x3², x2²x3², x1x2³, x1x3³, x2x3³, x1³x2, x1³x3, x2³x3, x1⁴, x2⁴, x3⁴]ᵀ (31)
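For concreteness, the activation vector (31) and a Jacobian for it (needed wherever ∇φ appears) can be assembled as below; the numerical differentiation is an illustrative shortcut, since the analytic Jacobian is equally easy to write out.

```python
import numpy as np

# The 18 polynomial activation functions of (31) for the 3-D slow state.
def phi(x):
    x1, x2, x3 = x
    return np.array([
        x1**2, x1*x2, x1*x3, x2**2, x2*x3, x3**2,
        x1**2*x2**2, x1**2*x3**2, x2**2*x3**2,
        x1*x2**3, x1*x3**3, x2*x3**3,
        x1**3*x2, x1**3*x3, x2**3*x3,
        x1**4, x2**4, x3**4,
    ])

def grad_phi(x, eps=1e-6):
    # Central-difference Jacobian of shape (L, 3); an analytic version
    # would be preferred in an actual controller loop.
    g = np.zeros((18, 3))
    for j in range(3):
        e = np.zeros(3)
        e[j] = eps
        g[:, j] = (phi(x + e) - phi(x - e)) / (2 * eps)
    return g
```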

Fig. 2. First three EEFs (blue solid, green dash, and red dot lines represent EEFs φ1 (z) ∼ φ3 (z), respectively). (a) αdc = 0.2. (b) αdc = 0.5. (c) αdc = 0.7. (d) αdc = 0.9.

and each element of the initial NN weight vector θ̂(0) is set to 1 in all four cases (it can also be generated randomly). Using the adaptive optimal control policy (23) and the NN weight tuning law (25) with V̄ = 0.5xᵀx and α = 10, closed-loop simulations are conducted. To show the effect of the neglected fast dynamics, we compute the estimation error between the actual closed-loop PDE state y(z, t) and the approximated PDE state ŷ(z, t) ≜ Σ_{i=1}^{N} x_i(t)φ_i(z)

y_error(z, t) ≜ y(z, t) − ŷ(z, t). (32)
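The reduced-order reconstruction and the error measure (32) amount to an L2 projection onto the EEFs; a minimal sketch, assuming a uniform grid and a simple Riemann-sum inner product:

```python
import numpy as np

# Hedged sketch of the reconstruction behind (32): the modal amplitudes
# x_i(t) are L2 projections of the PDE state onto the EEFs, and
# y_hat = sum_i x_i * phi_i is the truncated reconstruction.
def reconstruct_and_error(y, eefs, dz):
    """y: (nz,) PDE state sample; eefs: (N, nz), unit-L2 rows."""
    x = eefs @ y * dz        # modal amplitudes x_i = <y, phi_i>
    y_hat = x @ eefs         # truncated state reconstruction
    return y_hat, y - y_hat  # (y_hat, y_error) as in (32)
```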

The simulation results for the four cases are shown in Figs. 1–7. Fig. 1 shows the state profiles of the open-loop PDE system and Fig. 2 shows the first three EEFs. Conducting closed-loop simulations on the PDE system with the developed adaptive optimal control method, Fig. 3 shows the first three representative critic NN weights. Figs. 4–7 show the control action, the state profiles of the closed-loop PDE system, the state trajectories of the slow subsystem, and the PDE state estimation error, respectively. It can be observed that for the convection weights αdc = 0.2, 0.5, and 0.7 (i.e., diffusion weights of 0.8, 0.5, and 0.3, respectively), the PDE system converges to the neighborhood of zero at about 1, 2, and 3 s, respectively, and the PDE state estimation error is negligible as time increases. When the convection weight increases to 0.9, the control problem becomes much more difficult: the PDE system converges slowly to the neighborhood of zero, and the PDE state estimation error is also large because the effects of the fast dynamics can no longer be neglected. Remark 6: Based on the well-known high-order Weierstrass approximation theorem [61], the solution of the HJB equation is a continuous function that can be represented by any

Fig. 3. First three representative critic NN weights (blue solid, green dash, and red dot lines represent θ̂1(t) ∼ θ̂3(t), respectively). (a) αdc = 0.2. (b) αdc = 0.5. (c) αdc = 0.7. (d) αdc = 0.9.

infinite-dimensional linearly independent basis function set {ψi(z)}_{i=1}^{∞}. For the truncated case (22), the selection of the finite-dimensional basis function set {ψi(z)}_{i=1}^{L} and of its size L is often experience-based. As with all function approximation techniques, a larger basis function set can improve the estimation performance at the price of higher computational effort. Thus, an appropriate selection of the basis function set and its size is useful to balance performance against computation. However, it remains difficult to develop a general optimal selection method for all systems, simply because the optimal selection usually differs from system to system. In short, for a specific system, prior experience is helpful in selecting the basis function set and its size. Remark 7: It is noted from Theorem 1 that the state of the closed-loop PDE system is SGUUB in L2-norm. The error bound of the state mainly results from two aspects: 1) the


Fig. 4. Control action. (a) αdc = 0.2. (b) αdc = 0.5. (c) αdc = 0.7. (d) αdc = 0.9.

Fig. 6. Actual state trajectories of closed-loop slow subsystem (blue solid, green dash, and red dot lines represent x1 (t) ∼ x3 (t), respectively). (a) αdc = 0.2. (b) αdc = 0.5. (c) αdc = 0.7. (d) αdc = 0.9.

Fig. 5. State profiles of the closed-loop PDE system. (a) αdc = 0.2. (b) αdc = 0.5. (c) αdc = 0.7. (d) αdc = 0.9.

Fig. 7. State estimation error profiles of the closed-loop PDE system. (a) αdc = 0.2. (b) αdc = 0.5. (c) αdc = 0.7. (d) αdc = 0.9.

error between the original PDE state and the approximated PDE state obtained via the slow subsystem modes x and 2) the estimation error of the critic NN in approximating the value function of the HJB equation. To reduce the error bound, four methods are potentially helpful: 1) increase the dimension N of the slow subsystem (12); 2) increase the number L of neurons in the critic NN hidden layer; 3) construct NN activation functions well suited to the specific system; and 4) use a more sophisticated value function approximation structure, such as nonlinear function approximation [67]. The first two methods increase the computational load and thus require a tradeoff between accuracy and computation. The third and fourth methods are promising: using system information collected in practice, it is possible to learn basis functions automatically for value function approximation. Nonlinear function approximation is still an open and difficult problem in the RL community, because it may cause the system to become unstable. These issues still require deep investigation and are left

for future research. In fact, although there exists a theoretical error bound on the PDE system state, it is found from the simulation results (Figs. 5 and 7) that both the PDE state and the reconstruction error approach zero with high accuracy for the cases αdc = 0.2, 0.5, and 0.7.

B. Application on Temperature Cooling Fin of High-Speed Aerospace Vehicle

In this section, the developed control method is applied to a complex temperature cooling fin of a high-speed aerospace vehicle [2], whose system dynamics are described by the following parabolic PDE:

ρC ∂T(l, t)/∂t = k ∂²T(l, t)/∂l² − (Ph/A)(T − T∞1) − (Pεσ/A)(T⁴ − T∞2⁴) + B̂(l)u(t) (33)
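A hedged sketch of how (33) with the flux boundary conditions (34) might be integrated numerically; every parameter value below is an arbitrary placeholder, since Table I is not reproduced in this excerpt, and eps_sig stands for the product εσ.

```python
import numpy as np

# Hedged sketch: explicit Euler / finite-difference integration of the
# cooling-fin PDE (33). All parameter values are placeholders, not Table I.
rho_C, k, P, h, A, eps_sig = 4.0e6, 50.0, 0.1, 25.0, 1.0e-3, 2.8e-8
T_inf1, T_inf2, L_fin = 300.0, 250.0, 0.1

def step_fin(T, dl, dt, u=None, B_hat=None):
    """One explicit time step of (33); the control defaults to zero."""
    lap = (T[2:] - 2.0 * T[1:-1] + T[:-2]) / dl**2          # d2T/dl2
    rhs = (k * lap
           - (P * h / A) * (T[1:-1] - T_inf1)               # convective loss
           - (P * eps_sig / A) * (T[1:-1]**4 - T_inf2**4))  # radiative loss
    if u is not None and B_hat is not None:
        rhs = rhs + B_hat[1:-1] @ u                         # distributed input
    Tn = T.copy()
    Tn[1:-1] = T[1:-1] + (dt / rho_C) * rhs
    Tn[0] = Tn[1] - dl     # dT/dl = 1 at l = 0, cf. (34)
    Tn[-1] = Tn[-2]        # dT/dl = 0 at l = L, cf. (34)
    return Tn
```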


TABLE I. SYSTEM PARAMETERS AND THEIR VALUES

subjected to the boundary conditions

∂T/∂l |_{l=0} = 1,  ∂T/∂l |_{l=L} = 0 (34)

and the initial condition

T(l, 0) = T0(l) (35)

where T is the temperature of the fin, l ∈ [0, L], u(t) = [u1(t) u2(t) u3(t)]ᵀ is the control input vector, and B̂(l) = [b̂1(l) b̂2(l) b̂3(l)]ᵀ describes how the control actions are distributed over the spatial domain. The system parameters and their values are shown in Table I, and the open-loop temperature profile is shown in Fig. 8(a). The control objective is to reach a desired constant temperature Td = 700 °C at minimized cost. To simplify the description of system (33)–(35), we define the dimensionless temperature, desired temperature, and spatial position variables as y ≜ T/T̄, yd ≜ Td/T̄, and z ≜ l/L, where T̄ = 1000 (any sufficiently large number would also serve). Then, system (33) is rewritten in the dimensionless formulation

∂y(z, t)/∂t = (k/(ρCL²)) ∂²y/∂z² − (Ph/(ρCA))(y − T∞1/T̄) − (Pεσ/(ρCAT̄))(T̄⁴y⁴ − T∞2⁴) + (1/(ρCT̄)) B̂(Lz)u(t). (36)
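The nondimensionalization can be checked numerically: substituting T = T̄y and l = Lz into (33) must reproduce the right-hand side of (36) after dividing by T̄. All parameter values below are arbitrary placeholders (Table I is not reproduced here), eps_sig stands for εσ, and the control term is omitted since the scaling only rescales it by 1/(ρCT̄).

```python
import numpy as np

# Hedged consistency check of the scaling from (33) to (36).
rho_C, k, P, h, A, eps_sig = 2.0, 1.5, 0.8, 1.2, 0.6, 1.0e-9
T_inf1, T_inf2, L_fin, T_bar = 300.0, 250.0, 3.0, 1000.0

def rhs_dimensional(T, d2T_dl2):
    # dT/dt from (33), control omitted
    return (k * d2T_dl2 - (P * h / A) * (T - T_inf1)
            - (P * eps_sig / A) * (T**4 - T_inf2**4)) / rho_C

def rhs_dimensionless(y, d2y_dz2):
    # dy/dt from (36), control omitted
    return (k / (rho_C * L_fin**2) * d2y_dz2
            - P * h / (rho_C * A) * (y - T_inf1 / T_bar)
            - P * eps_sig / (rho_C * A * T_bar) * (T_bar**4 * y**4 - T_inf2**4))

# With T = T_bar*y and l = L*z: dT/dt = T_bar*dy/dt and
# d2T/dl2 = (T_bar/L^2)*d2y/dz2, so the two must agree up to rounding.
y, d2y = 0.7, 0.2
lhs = rhs_dimensional(T_bar * y, (T_bar / L_fin**2) * d2y) / T_bar
rhs = rhs_dimensionless(y, d2y)
```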

Fig. 8. Simulation results on the temperature cooling fin of high-speed aerospace vehicle. (a) Temperature profile of open-loop system. (b) First three EEFs (blue solid, green dash, and red dot lines represent EEFs φ1(z) ∼ φ3(z), respectively). (c) First three representative critic NN weights (blue solid, green dash, and red dot lines represent θ̂1(t) ∼ θ̂3(t), respectively). (d) Actual state trajectories of closed-loop slow subsystem (blue solid, green dash, and red dot lines represent x1(t) ∼ x3(t), respectively). (e) Control action (blue solid, green dash, and red dot lines represent û1(t) ∼ û3(t), respectively). (f) Temperature profile of closed-loop system. (g) State estimation error of the closed-loop PDE system. (h) Cost J(t) with respect to time (blue solid and green dash lines are obtained by the adaptive optimal control and the existing method in [52], respectively).

Define the state error ȳ ≜ y − yd and four coefficients α1 ≜ k/(ρCL²), α2 ≜ −Ph/(ρCA), α3 ≜ −Pεσ/(ρCAT̄), and α4 ≜ 1/(ρCT̄). Then, system (33)–(35) is briefly represented as

∂ȳ(z, t)/∂t = α1 ∂²ȳ/∂z² + α2 ȳ + α3 T̄⁴(ȳ + yd)⁴ + ud + B̄(z)u(t) (37)

subjected to the boundary conditions

∂ȳ/∂z |_{z=0} = L/T̄,  ∂ȳ/∂z |_{z=1} = 0 (38)

and the initial condition

ȳ(z, 0) = ȳ0(z) (39)

where ȳ0(z) ≜ y0(z) − yd, ud ≜ α2(yd − T∞1/T̄) − α3 T∞2⁴, and B̄(z) ≜ α4 B̂(Lz), which is given by B̄(z) = [cos(πz) cos(1.5πz) cos(2πz)]ᵀ; the initial state is ȳ0(z) = 0.5 cos(πz), i.e., T0(l) = 0.5T̄ cos(πl/L) + Td. The objective is then to design a control policy u such that ȳ(z, t) of system (37)–(39) approaches zero while minimizing the LQ cost functional (4) with R a unit matrix and objective output yh(t) = ∫₀¹ ȳ(z, t)dz.

Using KLD to compute the EEFs, an ensemble of size 5000 (i.e., M = 5000) is collected from 50 independent simulations with different initial conditions and inputs. Fig. 8(b) shows the first three EEFs, which are used for computing the state of the slow subsystem. To solve the optimal control problem with the developed method, let V̄ = 0.5xᵀx, α = 10, and the critic NN


activation function vector be selected as in (31). Then, the closed-loop simulation is conducted, and the results are demonstrated in Fig. 8. It is noted from Fig. 8(f) that the temperature of the cooling fin converges to the desired constant temperature profile Td = 700 °C. We also conduct a comparative simulation study between the developed adaptive optimal control and the existing optimal control approach based on the LQ regulator [52]. Since the closed-loop temperature profiles, slow subsystem state trajectories, and control trajectories of the two methods are similar, the corresponding figures for the approach of [52] are omitted. To compare the costs generated by the two control methods, we define the cost with respect to time as J(t) ≜ ∫_t^{+∞} (yh²(τ) + ‖u(τ)‖²_R)dτ. Fig. 8(h) shows the curves of J(t), from which it is observed that the adaptive optimal control achieves a smaller cost than the optimal method in [52] as time increases.

VI. CONCLUSION

In this paper, we have developed an NDP-based adaptive optimal control method for SDPs described by highly dissipative nonlinear PDEs. KLD is first employed to compute EEFs with the method of snapshots, based on which a finite-dimensional ODE model is derived via the SP technique. The ODE model is then used as the basis for the optimal control design, which reduces to solving the HJB equation. Subsequently, we propose an NDP-based adaptive optimal control method that solves the HJB equation online, and derive a novel critic NN weight tuning law that removes the requirement of an initial stabilizing control policy. Based on Lyapunov theory, the stability of the original closed-loop PDE system is proved by accounting for the NN estimation error. Finally, we demonstrate the effectiveness of the developed adaptive optimal control method on a nonlinear diffusion-convection-reaction process and a temperature cooling fin of a high-speed aerospace vehicle.

APPENDIX

A.
Proof of Theorem 1

1) Under the adaptive optimal control policy (23) with the tuning law (25) of θ̂, the closed-loop PDE system is written as

∂y/∂t = L(y, ∂y/∂z, . . . , ∂^{n0}y/∂z^{n0}) − (1/2)B(z)R⁻¹Bᵀ[∇φ]ᵀθ̂
dθ̂/dt = −αμρ(x)δ3(x) + ακ∇φ(x)D∇V̄(x) (A1)

Using the model reduction procedure presented in Section III-B, system (A1) is equivalently rewritten as

ẋ = fs(x, 0) − (1/2)BR⁻¹Bᵀ[∇φ]ᵀθ̂ + fs(x, xf) − fs(x, 0)
εẋf = A_{fε}xf + εff(x, xf) − (ε/2)Bf R⁻¹Bᵀ[∇φ]ᵀθ̂
dθ̂/dt = −αμρδ3 + ακ∇φD∇V̄ (A2)

Since L is a sufficiently smooth nonlinear vector function, it is obvious that there exists a constant k3 > 0 such that

‖fs(x, xf) − fs(x, 0)‖ ≤ k3‖xf‖_{R∞}  for ‖x‖ ≤ σ1. (A3)


According to the exponential stability property of the xf-subsystem (11) and the converse Lyapunov theorem [68], there exist a Lyapunov function candidate Vf(xf) and positive real numbers l1, l2, l3, and l4 such that the following conditions hold:

l1‖xf‖²_{R∞} ≤ Vf(xf) ≤ l2‖xf‖²_{R∞}
V̇f(xf) ≤ (1/ε)∇Vfᵀ A_{fε}xf ≤ −(l3/ε)‖xf‖²_{R∞} (A4)
‖∇Vf‖_{R∞} ≤ l4‖xf‖_{R∞}

where ∇Vf ≜ ∂Vf/∂xf. Before proving the stability of the closed-loop PDE system, we derive some useful conditions as follows:

‖∇V*‖ = ‖(∇φ)ᵀθ* + ∇δ1‖ ≤ υDVM (A5)
‖u*‖ = ‖−(1/2)R⁻¹Bᵀ∇V*‖ ≤ υuM (A6)
‖f + Bu*‖ ≤ υfBM (A7)
υDm ≤ ‖D‖ = ‖BR⁻¹Bᵀ‖ ≤ υDM (A8)

where υRm ≜ σmin(R), υRM ≜ σmax(R), υBm ≜ σmin(B), υBM ≜ σmax(B), υDVM ≜ ψDM θM + δ1DM, υuM ≜ υRm⁻¹υBM υDVM/2, υfBM ≜ υfM + υBM υuM, υDm ≜ υRM⁻¹υBm², and υDM ≜ υRm⁻¹υBM². Moreover, considering the fact that ∫_{z1}^{z2} Bᵀ(z)B(z)dz = BᵀB + BfᵀBf, we have

‖Bfᵀ∇Vf‖ = ((∇Vf)ᵀBfBfᵀ∇Vf)^{1/2} ≤ l4l5‖xf‖_{R∞} (A9)

where l5 ≜ σmax(∫_{z1}^{z2} Bᵀ(z)B(z)dz − BᵀB)^{1/2}. Now, let us consider the following Lyapunov function candidate:

L(t) = V*(x) + Vf(xf) + V̄(x) + Vo(θ̃) (A10)

where V*(x) is the solution of the HJB equation (16) and Vo(θ̃) = θ̃ᵀθ̃/(2α). Differentiating V*(x), Vf(xf), V̄(x), and Vo(θ̃) with respect to time along the trajectories of system (A2) yields

V̇*(x) = (∇V*)ᵀ(fs(x, 0) + Bû + fs(x, xf) − fs(x, 0))
= −‖x‖²_{Qs} − ‖u*‖²_R + (u*)ᵀBᵀ(∇φ)ᵀθ̃ − (u*)ᵀBᵀ∇δ1 + (∇V*)ᵀ[fs(x, xf) − fs(x, 0)]
≤ −ϵ1‖x‖² + ϵ2‖xf‖_{R∞} + ϵ3‖θ̃‖ + ϵ4 (A11)

where ϵ1 ≜ υQm, ϵ2 ≜ k3υDVM, ϵ3 ≜ υuM υBM ψDM, and ϵ4 ≜ υuM υBM δ1DM, with υQm ≜ σmin(Qs) and υQM ≜ σmax(Qs), and

V̇f(xf) = (1/ε)(∇Vf)ᵀA_{fε}xf + (∇Vf)ᵀff(x, xf) + (∇Vf)ᵀBf u* − (1/2)(∇Vf)ᵀBf R⁻¹Bᵀ(∇φ)ᵀθ̃ + (1/2)(∇Vf)ᵀBf R⁻¹Bᵀ∇δ1
≤ (−l3/ε + ϵ5)‖xf‖²_{R∞} + ϵ6‖x‖‖xf‖_{R∞} + ϵ7‖xf‖_{R∞}‖θ̃‖ + ϵ8‖xf‖_{R∞} (A12)


where ϵ5 ≜ l4k2, ϵ6 ≜ l4k1, ϵ7 ≜ l4l5υRm⁻¹υBM ψDM/2, and ϵ8 ≜ l4l5(υuM + υRm⁻¹υBM δ1DM/2). Further,

dV̄(x)/dt = (∇V̄)ᵀ(f + Bû) = (∇V̄)ᵀf − (1/2)(∇V̄)ᵀD(∇φ)ᵀ(θ̃ + θ*) (A13)

and

V̇o(θ̃) = −μθ̃ᵀρδ3 + κθ̃ᵀ∇φD∇V̄
= −μθ̃ᵀ∇φ(f + Bû)δ3 + κθ̃ᵀ∇φD∇V̄
= [−μθ̃ᵀ∇φ(f + Bu*) + (μ/2)‖(∇φ)ᵀθ̃‖²_D − (μ/2)θ̃ᵀ∇φD∇δ1] × [(θ̃ᵀ∇φ − (∇δ1)ᵀ)(f + Bu*) − (1/4)‖(∇φ)ᵀθ̃‖²_D + (1/4)‖∇δ1‖²_D] + κθ̃ᵀ∇φD∇V̄
≤ −ϵ9‖θ̃‖⁴ + ϵ10‖θ̃‖³ + ϵ11‖θ̃‖² + ϵ12‖θ̃‖ + κθ̃ᵀ∇φD∇V̄ (A14)

where ϵ9 ≜ μψDm²υDm²/8, ϵ10 ≜ 3μψDM³υDM υfBM/4 + μψDM³δ1DM υDM²/8, ϵ11 ≜ μψDM²δ1DM υDM υfBM/2, and ϵ12 ≜ μψDM δ1DM υfBM² + 3μψDM²δ1DM υDM υfBM/3 + μψDM²δ1DM²υDM υfBM/2 + μψDM³δ1DM³υDM²/8. Moreover, it is easy to verify that with the choice of κ in (26), the following condition holds:

dV̄(x)/dt + κθ̃ᵀ∇φD∇V̄ ≤ 0. (A15)

This is because, according to the definition (26) of κ and using (A13) and (27), we have that when κ = 0, dV̄(x)/dt + κθ̃ᵀ∇φD∇V̄ = (∇V̄)ᵀ(f + Bû) ≤ 0, and when κ = 0.5,

dV̄(x)/dt + κθ̃ᵀ∇φD∇V̄ = (∇V̄)ᵀ(f + Bû) + 0.5θ̃ᵀ∇φD∇V̄ = (∇V̄)ᵀf − 0.5(∇V̄)ᵀD(∇φ)ᵀθ* ≤ 0.

Thus, using (A11), (A12), (A14), and (A15), the time derivative of L(t) in (A10) satisfies

L̇(t) = V̇*(x) + V̇f(xf) + dV̄(x)/dt + V̇o(θ̃) ≤ Lθ(θ̃) − νxθᵀΛ(ε)νxθ + bᵀνxθ + dV̄(x)/dt + κθ̃ᵀ∇φD∇V̄ (A16)

where νxθ ≜ [‖x‖ ‖xf‖_{R∞} ‖θ̃‖]ᵀ, Lθ(θ̃) ≜ −ϵ9‖θ̃‖⁴ + ϵ10‖θ̃‖³ + (ϵ11 + ϵ13)‖θ̃‖² + (ϵ3 + ϵ12)‖θ̃‖ + ϵ4 with ϵ13 > 0, b ≜ [0 ϵ2 + ϵ8 0]ᵀ, and

Λ(ε) = [ ϵ1      −ϵ6/2       0
         −ϵ6/2   l3/ε − ϵ5  −ϵ7/2
         0       −ϵ7/2       ϵ13 ].

It is noted that the function Lθ(θ̃) can be rewritten as

Lθ(θ̃) = α1(‖θ̃‖) + α2(‖θ̃‖) + α3(‖θ̃‖) + α4(‖θ̃‖) (A17)

where α1(‖θ̃‖) ≜ −‖θ̃‖³(ϵ9‖θ̃‖/4 − ϵ10), α2(‖θ̃‖) ≜ −‖θ̃‖²(ϵ9‖θ̃‖²/4 − (ϵ11 + ϵ13)), α3(‖θ̃‖) ≜ −‖θ̃‖(ϵ9‖θ̃‖³/4 − (ϵ3 + ϵ12)), and α4(‖θ̃‖) ≜ −(ϵ9‖θ̃‖⁴/4 − ϵ4). Letting α1(‖θ̃‖) ≤ 0, α2(‖θ̃‖) ≤ 0, α3(‖θ̃‖) ≤ 0, and α4(‖θ̃‖) ≤ 0, we get, respectively, ‖θ̃‖ ≥ ϑ1 with ϑ1 ≜ 4ϵ10/ϵ9, ‖θ̃‖ ≥ ϑ2 with ϑ2 ≜ (4(ϵ11 + ϵ13)/ϵ9)^{1/2}, ‖θ̃‖ ≥ ϑ3 with ϑ3 ≜ (4(ϵ3 + ϵ12)/ϵ9)^{1/3}, and ‖θ̃‖ ≥ ϑ4 with ϑ4 ≜ (4ϵ4/ϵ9)^{1/4}. Defining ϑmax ≜ max{ϑ1, ϑ2, ϑ3, ϑ4}, it is clear from (A17) that if

‖θ̃‖ ≥ ϑmax (A18)

then

Lθ(θ̃) ≤ 0. (A19)

Letting ε* ≜ min{ε1, ε2}, where ε1 ≜ 4ϵ1l3/(4ϵ1ϵ5 + ϵ6²) and ε2 ≜ 4ϵ1ϵ13l3/(ϵ7²ϵ1 + ϵ6²ϵ13 + 4ϵ1ϵ5ϵ13), we have that if ε ∈ (0, ε*), then Λ(ε) > 0. Thus,

−νxθᵀΛ(ε)νxθ + bᵀνxθ ≤ −ϵ14‖νxθ‖² + (ϵ2 + ϵ8)‖νxθ‖ ≤ 0 (A20)

where ϵ14 ≜ σmin(Λ), if

‖νxθ‖ ≥ (ϵ2 + ϵ8)/ϵ14. (A21)

Therefore, if (A18) and (A21) hold, it follows from (A16), (A19), and (A20) that L̇(t) ≤ 0. That is to say, for the closed-loop system (A2), we have

‖νxθ‖ < (ϵ2 + ϵ8)/ϵ14. (A22)

This means that x, xf, and θ̃ are SGUUB. Due to the fact that ‖y(·, t)‖₂² = ‖x(t)‖² + ‖xf(t)‖²_{R∞}, the state y(z, t) of the original closed-loop PDE system (1)–(3) is SGUUB in L2-norm. This completes the proof of part 1) of Theorem 1.

2) Define

Bx ≜ (ϵ2 + ϵ8)/ϵ14. (A23)

According to (A22) and the definition of νxθ, we have ‖x‖² + ‖xf‖²_{R∞} + ‖θ̃‖² < Bx², and thus ‖x‖ ≤ Bx. This completes the proof of part 2) of Theorem 1.

3) It follows from part 1) of Theorem 1 that θ̃ is SGUUB; thus, there exists υθ̃M > 0 such that ‖θ̃‖ ≤ υθ̃M. Considering

Ṽ(x) = θ̃ᵀφ(x) − δ1

and

ũ(x) = −(1/2)R⁻¹Bᵀ(∇φ)ᵀθ̃ + (1/2)R⁻¹Bᵀ∇δ1

we have

‖Ṽ‖ ≤ ‖θ̃ᵀφ‖ + ‖δ1‖ ≤ υθ̃M ψM + δ1M
‖ũ‖ ≤ (1/2)‖R⁻¹Bᵀ(∇φ)ᵀθ̃‖ + (1/2)‖R⁻¹Bᵀ∇δ1‖ ≤ (1/2)υRm⁻¹υBM(ψDM υθ̃M + δ1DM).

This implies that Ṽ and ũ are SGUUB.




R EFERENCES [1] P. D. Christofides, Nonlinear and Robust Control of PDE Systems: Methods and Applications to Transport-Reaction Processes. Boston, MA, USA: Birkhäuser, 2001. [2] V. Yadav, R. Padhi, and S. Balakrishnan, “Robust/optimal temperature profile control of a high-speed aerospace vehicle using neural networks,” IEEE Trans. Neural Netw., vol. 18, no. 4, pp. 1115–1128, Jul. 2007. [3] H.-N. Wu and H.-X. Li, “Adaptive neural control design for nonlinear distributed parameter systems with persistent bounded disturbances,” IEEE Trans. Neural Netw., vol. 20, no. 10, pp. 1630–1644, Oct. 2009. [4] A. Armaou and P. D. Christofides, “Dynamic optimization of dissipative PDE systems using nonlinear order reduction,” Chem. Eng. Sci., vol. 57, no. 24, pp. 5083–5114, 2002. [5] A. Armaou and P. D. Christofides, “Crystal temperature control in the Czochralski crystal growth process,” AIChE J., vol. 47, no. 1, pp. 79–106, 2001. [6] Y. Ou et al., “Optimal tracking control of current profile in tokamaks,” IEEE Trans. Control Syst. Technol., vol. 19, no. 2, pp. 432–441, Mar. 2011. [7] J.-W. Wang, H.-N. Wu, and H.-X. Li, “Distributed proportional–spatial derivative control of nonlinear parabolic systems via fuzzy PDE modeling approach,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 3, pp. 927–938, Jun. 2012. [8] M. Krstic, “On global stabilization of Burgers’ equation by boundary control,” Syst. Control Lett., vol. 37, no. 3, pp. 123–141, 1999. [9] P. Holmes, J. L. Lumley, and G. Berkooz, Turbulence, Coherent Structures, Dynamical Systems and Symmetry. Cambridge, U.K.: Cambridge Univ. Press, 1998. [10] M. J. Balas, “The Galerkin method and feedback control of linear distributed parameter systems,” J. Math. Anal. Appl., vol. 91, no. 2, pp. 527–546, 1983. [11] W. H. Ray, Advanced Process Control. New York, NY, USA: McGraw-Hill, 1981. [12] E. S. Titi, “On approximate inertial manifolds to the Navier–Stokes equations,” J. Math. Anal. Appl., vol. 149, no. 2, pp. 
540–557, 1990. [13] A. Armaou and P. D. Christofides, “Robust control of parabolic PDE systems with time-dependent spatial domains,” Automatica, vol. 37, no. 1, pp. 61–69, 2001. [14] A. Varshney, S. Pitchaiah, and A. Armaou, “Feedback control of dissipative PDE systems using adaptive model reduction,” AIChE J., vol. 55, no. 4, pp. 906–918, 2009. [15] S. Pitchaiah and A. Armaou, “Output feedback control of distributed parameter systems using adaptive proper orthogonal decomposition,” Ind. Eng. Chem. Res., vol. 49, no. 21, pp. 10496–10509, 2010. [16] S. Pitchaiah and A. Armaou, “Output feedback control of dissipative PDE systems with partial sensor information based on adaptive model reduction,” AIChE J., vol. 59, no. 3, pp. 747–760, 2013. [17] E. Bendersky and P. D. Christofides, “Optimization of transport-reaction processes using nonlinear model reduction,” Chem. Eng. Sci., vol. 55, no. 19, pp. 4349–4366, 2000. [18] D. V. Prokhorov, “Optimal neurocontrollers for discretized distributed parameter systems,” in Proc. Amer. Control Conf., vol. 1. Jun. 2003, pp. 549–554. [19] A. Alessandri, M. Gaggero, and R. Zoppoli, “Feedback optimal control of distributed parameter systems by using finite-dimensional approximation schemes,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 6, pp. 984–996, Jun. 2012. [20] C. Xu, Y. Ou, and E. Schuster, “Sequential linear quadratic control of bilinear parabolic PDEs based on POD model reduction,” Automatica, vol. 47, no. 2, pp. 418–426, 2011. [21] B. Luo and H.-N. Wu, “Approximate optimal control design for nonlinear one-dimensional parabolic PDE systems using empirical eigenfunctions and neural network,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 6, pp. 1538–1549, Dec. 2012. [22] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, U.K.: Cambridge Univ. Press, 1998. [23] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA, USA: Athena Scientific, 1996. [24] L. P. 
Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: A survey,” J. Artif. Intell. Res., vol. 4, pp. 237–285, 1996.


[25] F.-Y. Wang, H. Zhang, and D. Liu, “Adaptive dynamic programming: An introduction,” IEEE Comput. Intell. Mag., vol. 4, no. 2, pp. 39–47, May 2009. [26] F. L. Lewis and D. Vrabie, “Reinforcement learning and adaptive dynamic programming for feedback control,” IEEE Circuits Syst. Mag., vol. 9, no. 3, pp. 32–50, Third Quarter, 2009. [27] F. L. Lewis and D. Liu, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, vol. 17. Hoboken, NJ, USA: Wiley, 2013. [28] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, “Discrete-time nonlinear HJB solution using approximate dynamic programming: Convergence proof,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4, pp. 943–949, Aug. 2008. [29] Z. Chen and S. Jagannathan, “Generalized Hamilton–Jacobi–Bellman formulation-based neural network control of affine nonlinear discretetime systems,” IEEE Trans. Neural Netw., vol. 19, no. 1, pp. 90–106, Jan. 2008. [30] H. Zhang, Q. Wei, and Y. Luo, “A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4, pp. 937–942, Aug. 2008. [31] H. Zhang, Y. Luo, and D. Liu, “Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints,” IEEE Trans. Neural Netw., vol. 20, no. 9, pp. 1490–1503, Sep. 2009. [32] F.-Y. Wang, N. Jin, D. Liu, and Q. Wei, “Adaptive dynamic programming for finite-horizon optimal control of discrete-time nonlinear systems with ε-error bound,” IEEE Trans. Neural Netw., vol. 22, no. 1, pp. 24–36, Jan. 2011. [33] D. Liu and Q. Wei, “Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems,” IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 3, pp. 621–634, Mar. 2014. [34] H. Zhang, R. Song, Q. Wei, and T. 
Zhang, “Optimal tracking control for a class of nonlinear discrete-time systems with time delays based on heuristic dynamic programming,” IEEE Trans. Neural Netw., vol. 22, no. 12, pp. 1851–1862, Dec. 2011. [35] J. Fu, H. He, and X. Zhou, “Adaptive learning and control for MIMO system based on adaptive dynamic programming,” IEEE Trans. Neural Netw., vol. 22, no. 7, pp. 1133–1148, Jul. 2011. [36] A. Heydari and S. N. Balakrishnan, “Finite-horizon control-constrained nonlinear optimal control using single network adaptive critics,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 1, pp. 145–157, Jan. 2013. [37] Q. Yang and S. Jagannathan, “Reinforcement learning controller design for affine nonlinear discrete-time systems using online approximators,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 2, pp. 377–390, Apr. 2012. [38] T. Dierks and S. Jagannathan, “Online optimal control of affine nonlinear discrete-time systems with unknown internal dynamics by using timebased policy update,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 7, pp. 1118–1129, Jul. 2012. [39] D. Liu, D. Wang, D. Zhao, Q. Wei, and N. Jin, “Neural-network-based optimal control for a class of unknown discrete-time nonlinear systems using globalized dual heuristic programming,” IEEE Trans. Autom. Sci. Eng., vol. 9, no. 3, pp. 628–634, Jul. 2012. [40] D. Wang, D. Liu, Q. Wei, D. Zhao, and N. Jin, “Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive dynamic programming,” Automatica, vol. 48, no. 8, pp. 1825–1832, 2012. [41] D. Liu, D. Wang, and X. Yang, “An iterative adaptive dynamic programming algorithm for optimal control of unknown discrete-time nonlinear systems with constrained inputs,” Inform. Sci., vol. 220, pp. 331–342, Jan. 2013. [42] R. W. Beard, G. N. Saridis, and J. T. Wen, “Galerkin approximations of the generalized Hamilton–Jacobi–Bellman equation,” Automatica, vol. 33, no. 12, pp. 2159–2177, 1997. [43] M. Abu-Khalaf and F. L. 
Lewis, “Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach,” Automatica, vol. 41, no. 5, pp. 779–791, 2005. [44] K. Doya, “Reinforcement learning in continuous time and space,” Neural Comput., vol. 12, no. 1, pp. 219–245, 2000. [45] J. J. Murray, C. J. Cox, G. G. Lendaris, and R. Saeks, “Adaptive dynamic programming,” IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 32, no. 2, pp. 140–153, May 2002.


[46] D. Vrabie, O. Pastravanu, M. Abu-Khalaf, and F. L. Lewis, “Adaptive optimal control for continuous-time linear systems based on policy iteration,” Automatica, vol. 45, no. 2, pp. 477–484, 2009. [47] D. Vrabie and F. L. Lewis, “Neural network approach to continuoustime direct adaptive optimal control for partially unknown nonlinear systems,” Neural Netw., vol. 22, no. 3, pp. 237–246, 2009. [48] K. G. Vamvoudakis and F. L. Lewis, “Online actor–critic algorithm to solve the continuous-time infinite horizon optimal control problem,” Automatica, vol. 46, no. 5, pp. 878–888, 2010. [49] B. Luo, H.-N. Wu, T. Huang, and D. Liu. (2013). Data-based approximate policy iteration for nonlinear continuous-time optimal control design. arXiv preprint arXiv:1311.0396 [Online]. Available: http://arxiv.org/abs/1311.0396 [50] B. Luo, H.-N. Wu, and T. Huang, “Off-policy reinforcement learning for H∞ control design,” IEEE Trans. Cybern., 2014, doi: 10.1109/TCYB.2014.2319577. [51] D. Liu, D. Wang, and H. Li, “Decentralized stabilization for a class of continuous-time nonlinear interconnected systems using online learning optimal control approach,” IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 2, pp. 418–428, Feb. 2014. [52] M. Li and P. D. Christofides, “Optimal control of diffusion-convectionreaction processes using reduced-order models,” Comput. Chem. Eng., vol. 32, no. 9, pp. 2123–2135, 2008. [53] M. Li and P. D. Christofides, “An input/output approach to the optimal transition control of a class of distributed chemical reactors,” Chem. Eng. Sci., vol. 62, no. 11, pp. 2979–2988, 2007. [54] J. Baker and P. D. Christofides, “Finite-dimensional approximation and control of non-linear parabolic PDE systems,” Int. J. Control, vol. 73, no. 5, pp. 439–456, 2000. [55] A. Armaou and P. D. Christofides, “Finite-dimensional control of nonlinear parabolic PDE systems with time-dependent spatial domains using empirical eigenfunctions,” Int. J. Appl. Math. Comput. Sci., vol. 11, no. 
2, pp. 287–318, 2001. [56] H.-X. Li and C. Qi, Spatio-Temporal Modeling of Nonlinear Distributed Parameter Systems. New York, NY, USA: Springer-Verlag, 2011. [57] B. Luo, H.-N. Wu, and H.-X. Li, “Data-based suboptimal neuro-control design with reinforcement learning for dissipative spatially distributed processes,” Ind. Eng. Chem. Res., 2014, doi: 10.1021/ie4031743. [58] H.-N. Wu and B. Luo, “L 2 disturbance attenuation for highly dissipative nonlinear spatially distributed processes via HJI approach,” J. Process Control, vol. 24, no. 5, pp. 550–567, 2014. [59] P. D. Christofides, “Robust control of parabolic PDE systems,” Chem. Eng. Sci., vol. 53, no. 16, pp. 2949–2965, 1998. [60] F. L. Lewis, D. Vrabie, and V. L. Syrmos, Optimal Control. Hoboken, NJ, USA: Wiley, 2013. [61] R. Courant and D. Hilbert, Methods of Mathematical Physics, vol. 1. New York, NY, USA: Wiley, 2004. [62] S. Ge, C. C. Hang, T. H. Lee, and T. Zhang, Stable Adaptive Neural Network Control. Norwell, MA, USA: Kluwer, 2001. [63] P. D. Christofides and P. Daoutidis, “Robust control of hyperbolic PDE systems,” Chem. Eng. Sci., vol. 53, no. 1, pp. 85–105, 1998. [64] H.-N. Wu and B. Luo, “Heuristic dynamic programming algorithm for optimal control design of linear continuous-time hyperbolic PDE systems,” Ind. Eng. Chem. Res., vol. 51, no. 27, pp. 9310–9319, 2012. [65] I. Aksikas, A. Fuxman, J. F. Forbes, and J. J. Winkin, “LQ control design of a class of hyperbolic PDE systems: Application to fixed-bed reactor,” Automatica, vol. 45, no. 6, pp. 1542–1548, 2009. [66] B. Luo and H.-N. Wu, “Online policy iteration algorithm for optimal control of linear hyperbolic PDE systems,” J. Process Control, vol. 22, no. 7, pp. 1161–1170, 2012. [67] H. R. Maei, C. Szepesvári, S. Bhatnagar, D. Precup, D. Silver, and R. S. Sutton, “Convergent temporal-difference learning with arbitrary smooth function approximation,” in Advances in Neural Information Processing Systems. 
Red Hook, NY, USA: Curran & Associates Inc., 2009, pp. 1204–1212. [68] H. K. Khalil and J. Grizzle, Nonlinear Systems, vol. 3. Upper Saddle River, NJ, USA: Prentice-Hall, 2002.

Biao Luo received the B.E. degree in measuring and control technology and instrumentations and the M.E. degree in control theory and control engineering from Xiangtan University, Xiangtan, China, in 2006 and 2009, respectively. He is currently pursuing the Ph.D. degree in control science and engineering with Beihang University, Beijing, China. He was a Research Assistant with the Department of System Engineering and Engineering Management, City University of Hong Kong, Hong Kong, and a Research Assistant with the Department of Mathematics and Science, Texas A&M University at Qatar, Doha, Qatar, in 2013. His current research interests include distributed parameter systems, optimal control, data-based control, fuzzy/neural modeling and control, hypersonic entry/reentry guidance, learning and control from big data, reinforcement learning, approximate dynamic programming, and evolutionary computation. Mr. Luo was a recipient of the Excellent Master Dissertation Award of Hunan Province in 2011.

Huai-Ning Wu was born in Anhui, China, in 1972. He received the B.E. degree in automation from the Shandong Institute of Building Materials Industry, Jinan, China, and the Ph.D. degree in control theory and control engineering from Xi'an Jiaotong University, Xi'an, China, in 1992 and 1997, respectively. He was a Post-Doctoral Researcher with the Department of Electronic Engineering, Beijing Institute of Technology, Beijing, China, from 1997 to 1999. In 1999, he joined the School of Automation Science and Electrical Engineering, Beihang University, Beijing. From 2005 to 2006, he was a Senior Research Associate with the Department of Manufacturing Engineering and Engineering Management (MEEM), City University of Hong Kong, Hong Kong, where he was a Research Fellow with the Department of MEEM from 2006 to 2008 and in 2010, and the Department of Systems Engineering and Engineering Management in 2011 and 2013. He is currently a Professor with Beihang University. His current research interests include robust control, fault-tolerant control, distributed parameter systems, and fuzzy/neural modeling and control. Dr. Wu serves as an Associate Editor of the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS. He is a member of the Committee of Technical Process Failure Diagnosis and Safety, Chinese Association of Automation.

Han-Xiong Li (S'94–M'97–SM'00–F'11) received the B.E. degree in aerospace engineering from the National University of Defense Technology, Changsha, China, the M.E. degree in electrical engineering from the Delft University of Technology, Delft, The Netherlands, and the Ph.D. degree in electrical engineering from the University of Auckland, Auckland, New Zealand. He is currently a Professor with the Department of Systems Engineering and Engineering Management, City University of Hong Kong, Hong Kong. He has been involved in different fields, including military service, industry, and academia, over the last 30 years. He authored more than 140 SCI journal papers with an SCI h-index of 27. His current research interests include system intelligence and control, process design and control integration, and distributed parameter systems with applications to electronics packaging. Dr. Li serves as an Associate Editor of the IEEE TRANSACTIONS ON CYBERNETICS and the IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS. He was a recipient of the Distinguished Young Scholar (overseas) from the China National Science Foundation in 2004, the Chang Jiang Professor Award from the Ministry of Education, China, in 2006, and the National Professorship in the China Thousand Talents Program in 2010. He serves as a Distinguished Expert for the Hunan Government and the China Federation of Returned Overseas.
