
Reinforcement Learning for Port-Hamiltonian Systems

Olivier Sprangers, Robert Babuška, Subramanya P. Nageshrao, and Gabriel A. D. Lopes, Member, IEEE

Abstract—Passivity-based control (PBC) for port-Hamiltonian systems provides an intuitive way of achieving stabilization by rendering a system passive with respect to a desired storage function. However, in most instances the control law is obtained without any performance considerations and it has to be calculated by solving a complex partial differential equation (PDE). In order to address these issues we introduce a reinforcement learning (RL) approach into the energy-balancing passivity-based control (EB-PBC) method, which is a form of PBC in which the closed-loop energy is equal to the difference between the stored and supplied energies. We propose a technique to parameterize EB-PBC that preserves the system's PDE matching conditions, does not require the specification of a global desired Hamiltonian, includes performance criteria, and is robust. The parameters of the control law are found by using actor-critic (AC) RL, enabling the search for near-optimal control policies satisfying a desired closed-loop energy landscape. The advantage is that the solutions learned can be interpreted in terms of energy shaping and damping injection, which makes it possible to numerically assess stability using passivity theory. From the RL perspective, our proposal allows for the class of port-Hamiltonian systems to be incorporated in the AC framework, speeding up the learning thanks to the resulting parameterization of the policy. The method has been successfully applied to the pendulum swing-up problem in simulations and real-life experiments.

Index Terms—Actor-critic (AC), energy-balancing (EB), passivity-based control (PBC), port-Hamiltonian (PH) systems, reinforcement learning (RL).

I. INTRODUCTION

PASSIVITY-based control (PBC) [1] is a methodology that achieves the control objective by rendering a system passive with respect to a desired storage function [2]. Different forms of PBC have been successfully applied to design robust controllers [3] for mechanical systems and electrical circuits [2], [4]. In this paper, we consider the passivity-based control (PBC) of systems endowed with a special structure, called port-Hamiltonian (PH) systems. These are important as they establish a unified energy-based structure that can be used to model and interconnect elements from multiple domains. PBC has been applied to PH systems in various applications [5], [6]. Although PBC of PH systems is general

Manuscript received August 22, 2013; revised February 13, 2014 and May 14, 2014; accepted July 9, 2014. Date of publication August 26, 2014; date of current version April 13, 2015. This paper was recommended by Associate Editor H. Gao. The authors are with the Delft Center for Systems and Control, Delft University of Technology, Delft 2628 CD, The Netherlands (e-mail: [email protected]; [email protected]; [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCYB.2014.2343194

and provides a physical interpretation of the resulting control law, it has a number of challenges that hinder its widespread use:
1) The geometric structure of PH systems reformulates the PBC problem in terms of solving a set of partial differential equations (PDEs) or nonlinear ordinary differential equations, which can be challenging to solve [2].
2) Most solutions are founded on stability considerations and overlook performance.
3) Model uncertainty can affect the performance of the PBC law.

The aim of this paper is to address these three hurdles by incorporating learning. Reinforcement learning (RL) [7] is a semi-supervised learning control method that can solve optimal (stochastic) control problems for nonlinear systems, without the need for a process model or for explicitly solving complex equations. In RL the controller receives an immediate numerical reward as a function of the process state and possibly the control action. The goal is to find an optimal control policy that maximizes the cumulative long-term rewards, which corresponds to maximizing a value function [7]. In this paper, we use actor-critic (AC) techniques [8], which are a class of RL methods in which a separate actor and critic are learned. The critic approximates the value function and the actor the policy (control law). AC-RL is suitable for problems with continuous state and action spaces. It employs an explicitly parameterized policy function, which fits well with the PH approach.

A general disadvantage of RL is that the progress of learning can be very slow and nonmonotonic. However, it has already been shown in the literature that by incorporating (partial) model knowledge, learning can be sped up [9]. This is also the approach we take in this paper. We propose a learning control structure within the PH framework that retains important properties of PBC. To this end, first a parameterization of a particular type of PBC, called energy-balancing passivity-based control (EB-PBC), is proposed such that the PDE arising in EB-PBC can be split into a nonassignable part satisfying the matching condition and an assignable part that can be parameterized. Then, by applying AC-RL the parameterized part can be learned while automatically satisfying the matching PDE. This can be seen as a paradigm shift from the traditional model-based control synthesis for PH systems. We do not seek to synthesize a controller in closed form, but instead aim to learn one online with proper structural constraints. This brings a number of advantages.



1) It makes it possible to specify the control goal in a "local" fashion through a reward function, without having to consider the entire global behavior of the system. The simplest example to illustrate this idea is a reward function that is 1 when the system is in a small neighborhood of the desired goal and 0 everywhere else [7]. The learning algorithm will eventually find a global control policy, whereas in model-based PBC synthesis one needs to specify a desired global Hamiltonian.
2) Learning brings performance in addition to the intrinsic stability properties of PBC. The structure of RL is such that the rewards are maximized, and these can include performance criteria, such as minimal time, energy consumption, etc.
3) Learning offers additional robustness and adaptability, since it tolerates model uncertainty in the PH framework.

From a learning control point of view, we present a systematic way of incorporating a priori knowledge into the RL problem. The approach proposed in this paper yields, after learning, a controller that can be interpreted in terms of energy-shaping control strategies. The same interpretability is typically not found in traditional RL solutions. Thus, this paper combines the advantages of both aforementioned control techniques, PBC and RL, and mitigates some of their respective disadvantages. Historically, the trends in control synthesis have oscillated between performance and stabilization. PBC of PH systems is rooted in the stability of multidomain nonlinear systems. By including learning we aim to address performance in the PH framework.

In the experimental section of the paper, we show that our method is also robust to unmodeled nonlinearities, such as control input saturation. Control input saturation in PBC for PH systems has been addressed explicitly in the literature [10]–[16]. We show that our approach solves the problem of control input saturation on the learning side, without the need of augmenting the model-based PBC.

The work presented in this paper draws an interesting parallel with the application of iterative feedback tuning (IFT) [17] in the PH framework [18]. Both techniques optimize the parameters of the controller online, with the difference that in IFT the objective is to minimize the error between the desired output and the measured output of the system, while our approach aims at maximizing a reward function, which can be very general. The choice of RL is warranted by its semi-supervised paradigm, as opposed to other traditional fully supervised learning techniques, such as artificial neural networks or fuzzy approximators, where the control specification (function approximation information) is input/output data instead of reward functions. Such fully supervised techniques can be used within RL as function approximators to represent the value functions and control policies. Genetic algorithms can also be considered as an alternative to RL, since they rely on fitness functions that are analogous to the reward functions of RL. However, these methods are usually computationally involved and therefore not always suitable for real-time learning control. We aim to explore such classes of algorithms in our future work. Finally, the challenges faced by data-driven approaches in the process industry [19] can offer an interesting benchmark problem for the approach we propose in this paper.


The theoretical background on PH systems and AC-RL is described in Sections II and III, respectively. In Section IV, our proposal for a parameterization of input-saturated EB-PBC control, compatible with AC-RL, is introduced. We then specialize this result to mechanical systems in Section V. Section VI provides simulation and experimental results for the problem of swinging up an input-saturated inverted pendulum and Section VII concludes the paper.

II. PH SYSTEMS

PH systems are a natural way of representing a physical system in terms of its energy exchange with the environment through ports [4]. The general framework of PH systems was introduced in [20] and was formalized in [3] and [21]. In this paper, we consider the input-state-output representation of the PH system, which is of the form¹

Σ:  ẋ = [J(x) − R(x)] ∇_x H(x) + g(x) u,   y = g^T(x) ∇_x H(x)    (1)

where x ∈ R^n is the state vector, y ∈ R^m is the process output, u ∈ R^m, m ≤ n, is the control input, J, R : R^n → R^{n×n} with J(x) = −J^T(x) and R(x) = R^T(x) ≥ 0 are the interconnection and damping matrix, respectively, H : R^n → R is the Hamiltonian, which is the energy stored in the system, u and y are conjugated variables whose product has the unit of power, and g : R^n → R^{n×m} is the input matrix, assumed to be full rank. For the remainder of this paper, we denote

F(x) := J(x) − R(x).    (2)

This matrix satisfies F(x) + F^T(x) = −2R(x) ≤ 0. System (1) satisfies the power-balance equation [2]

Ḣ(x) = (∇_x H(x))^T ẋ = −(∇_x H(x))^T R(x) ∇_x H(x) + u^T y.    (3)

Since R(x) ≥ 0, we obtain

Ḣ(x) ≤ u^T y    (4)

which is called the passivity inequality if H(x) is positive semi-definite, and the cyclo-passivity inequality if H(x) is not positive semi-definite nor bounded from below [2]. Hence, systems satisfying (4) are called (cyclo-)passive systems. The goal is to obtain the target closed-loop system

Σ_cl:  ẋ = [J(x) − R_d(x)] ∇_x H_d(x)    (5)

through energy shaping using EB-PBC [2] and damping injection, such that H_d(x) is the desired closed-loop energy, which has a minimum at the desired equilibrium x* and satisfies

Ḣ_d(x) = −(∇_x H_d(x))^T R_d(x) ∇_x H_d(x)    (6)

which implies (cyclo-)passivity according to (3) and (4) if the desired damping R_d(x) ≥ 0. Hence, the control objective is achieved by rendering the closed-loop system passive with respect to the desired storage function H_d(x).

¹We use the notation ∇_x := ∂/∂x. Furthermore, all (gradient) vectors are column vectors.
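To make the structure of (1)–(4) concrete, the sketch below simulates a hypothetical damped pendulum written in PH form and checks the power balance (3) and the passivity inequality (4) numerically. The parameter values, the Euler integration, and the constant input are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Hypothetical damped pendulum in PH form, state x = [q, p] (assumed values)
m, l, g0, b = 1.0, 1.0, 9.81, 0.1   # mass, length, gravity, viscous damping

def grad_H(x):
    """Gradient of H(q, p) = p^2 / (2 m l^2) + m g0 l (1 + cos q)."""
    q, p = x
    return np.array([-m * g0 * l * np.sin(q), p / (m * l**2)])

J = np.array([[0.0, 1.0], [-1.0, 0.0]])   # skew-symmetric interconnection, J = -J^T
R = np.array([[0.0, 0.0], [0.0, b]])      # damping matrix, R = R^T >= 0
g = np.array([[0.0], [1.0]])              # input matrix

def f(x, u):
    """System dynamics (1): x_dot = (J - R) grad_H(x) + g u."""
    return (J - R) @ grad_H(x) + (g @ u)

x, u, dt = np.array([0.5 * np.pi, 0.0]), np.array([0.2]), 1e-3
for _ in range(5000):
    dH = grad_H(x)
    y = g.T @ dH                           # passive output y = g^T grad_H(x)
    H_dot = dH @ f(x, u)                   # power balance (3)
    assert H_dot <= float(u @ y) + 1e-9    # passivity inequality (4)
    x = x + dt * f(x, u)                   # simple Euler step (illustrative only)
```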


A. Energy Shaping

Define the added energy function

H_a(x) := H_d(x) − H(x).    (7)

A state-feedback law u_es(x) is said to satisfy the EB property if it satisfies

Ḣ_a(x) = −u_es^T(x) y.    (8)

If (8) holds, the desired energy H_d(x) is the difference between the stored and supplied energy. Assuming g(x) ∈ R^{n×m}, m < n, rank{g(x)} = m, the control law

u_es(x) = g^†(x) F(x) ∇_x H_a(x)    (9)

with g^†(x) = (g^T(x) g(x))^{-1} g^T(x) solves the EB-PBC problem, with H_a(x) a solution of the following set of PDEs [2]:

[g^⊥(x) F^T(x); g^T(x)] ∇_x H_a(x) = 0    (10)

with g^⊥(x) ∈ R^{(n−m)×n} the full-rank left-annihilator of g(x), i.e., g^⊥(x) g(x) = 0.

B. Damping Injection

Damping is injected by feeding back the (new) passive output g^T(x) ∇_x H_d(x)

u_di(x) = −K(x) g^T(x) ∇_x H_d(x)    (11)

with K(x) ∈ R^{m×m}, K(x) = K^T(x) ≥ 0, such that

R_d(x) = R(x) + g(x) K(x) g^T(x).    (12)

If the second part of the matching condition (10) is satisfied, then g^T(x) ∇_x H_d(x) can be rewritten as

g^T(x) ∇_x H_d(x) = g^T(x)(∇_x H(x) + ∇_x H_a(x)) = g^T(x) ∇_x H(x) = y    (13)

that is, the new passive output is nothing but the original system output. Hence, the full control law consists of an energy-shaping part and a damping-injection part

u(x) = u_es(x) + u_di(x) = g^†(x) F(x) ∇_x H_a(x) − K(x) g^T(x) ∇_x H_d(x) = g^†(x) F(x) ∇_x H_a(x) − K(x) y.    (14)
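The full EB-PBC law (14) combines (9) and (11). A minimal sketch of how it could be evaluated numerically is given below; the function and argument names are our own, and the user is assumed to supply J, R, g, ∇H, ∇H_a, and K as callables.

```python
import numpy as np

def eb_pbc_control(x, J, R, g, grad_H, grad_Ha, K):
    """Energy shaping plus damping injection, (14):
    u = g†(x) F(x) ∇Ha(x) − K(x) y,  with F = J − R, g† = (gᵀg)⁻¹gᵀ and y = gᵀ∇H(x)."""
    F = J(x) - R(x)
    gx = g(x)
    g_pinv = np.linalg.inv(gx.T @ gx) @ gx.T   # left pseudo-inverse of the full-rank g(x)
    y = gx.T @ grad_H(x)                       # original passive output
    u_es = g_pinv @ F @ grad_Ha(x)             # energy-shaping term (9)
    u_di = -K(x) @ y                           # damping-injection term (11), using (13)
    return u_es + u_di
```

Here ∇H_a would normally come from a solution of the matching PDE (10); the learning approach of Section IV replaces that step by a parameterization.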

III. AC-RL

In RL, the system to be controlled (called "environment" in the RL literature) is modeled as a Markov decision process (MDP). In a deterministic setting, this MDP is defined by the tuple M(X, U, f, ρ), where X is the state space, U the action space, and f : X × U → X the state transition function that describes the process to be controlled. This function returns the state x_{k+1} after applying action u_k in state x_k. The vector x_k is obtained by applying a zero-order-hold discretization x_k = x(kT_s), with T_s the sampling time. The reward function


is defined by ρ : X × U → R and returns a scalar reward r_{k+1} = ρ(x_{k+1}, u_k) after each transition. The goal of RL is to find an optimal control policy π : X → U by maximizing an expected cumulative or total reward described as some function of the immediate expected rewards. In this paper, we consider the discounted sum of rewards. The value function V^π : X → R

V^π(x) = Σ_{i=0}^{∞} γ^i r^π_{k+i+1} = Σ_{i=0}^{∞} γ^i ρ(x_{k+i+1}, π(x_{k+i})),   x = x_k    (15)

approximates this discounted sum of rewards while following the policy π, where γ ∈ [0, 1) is the discount factor. For an MDP with a discrete state space and a finite action space, the value function can be rewritten in a recursive form called the Bellman equation (for definition and properties see [7]). Then (15) can be solved recursively, resulting in an optimal value function. From this value function, an optimal control action can easily be obtained by a one-step search. This is an indirect approach. Alternatively, there exist RL methods that directly search for an optimal control law. Depending on whether we search for a value function, a control law, or both, RL methods can be broadly classified into three categories [22].
1) Actor-only learning algorithms search explicitly for an optimal control law.
2) Critic-only algorithms first learn an optimal value function, from which an optimal control law is derived by a one-step optimization.
3) AC algorithms [8], [23] search explicitly for an optimal control law. Additionally, they also learn a value function, which provides an evaluation of the controller's performance.

Using actor-only methods one directly obtains a policy that can result in an optimal controller. However, actor-only algorithms generally suffer from high variance, resulting in slow and nonmonotonic convergence [8], [24]. In addition, they are computationally involved and therefore not always suitable for real-time learning control. Some of the widely used RL algorithms using temporal-difference (TD) learning fall in the category of critic-only methods. They are based on learning a state-value function, and the policy is only implicitly defined via the value function. As such, they are not suitable for our purpose. The AC methods combine the advantages of both actor-only and critic-only methods. While the actor provides a continuous action using the parameterized policy, the critic provides a low-variance gradient, thus resulting in a faster convergence of the actor [8], [24].

Generally, most physical systems operate on a continuous state space, requiring continuous control laws. For these scenarios it is necessary to approximate the value function and the policy. AC algorithms learn a separate parameterized actor (policy π) and critic (value function V^π). In every time sample, the critic improves the value function. The actor then updates its parameters in the direction of that improvement.



The actor and the critic are usually defined by a differentiable parameterization, such that gradient ascent can be used to update the parameters. This is beneficial when dealing with continuous action spaces [25]. In this paper, the temporal-difference based standard actor-critic (S-AC) algorithm from [9] is used. Define the approximated policy π̂ : R^n → R^m and the approximated value function V̂ : R^n → R. Denote the parameterization of the actor by ϑ ∈ R^p and of the critic by θ ∈ R^q. The TD error [7]

δ_{k+1} := r_{k+1} + γ V̂(x_{k+1}, θ_k) − V̂(x_k, θ_k)    (16)

is used to update the critic parameters using the following gradient ascent update rule:

θ_{k+1} = θ_k + α_c δ_{k+1} ∇_θ V̂(x_k, θ_k)    (17)

in which α_c > 0 is the learning rate. In the above temporal-difference update rule, the critic is updated using information obtained in two consecutive time steps, while the reward received is often the result of a series of steps. As a consequence, the plain TD update results in slow and sample-inefficient learning. Eligibility traces e_k ∈ R^q offer a way of assigning credit also to states visited several steps earlier and as such can speed up learning. The update for the critic parameters becomes

e_{k+1} = γ λ e_k + ∇_θ V̂(x_k, θ_k)    (18)

θ_{k+1} = θ_k + α_c δ_{k+1} e_{k+1}    (19)

where λ ∈ [0, 1) is the trace-decay rate. The policy approximation can be updated in a similar fashion, as described below.

RL needs exploration in order to visit new, unseen parts of the state-action space so as to possibly find better policies. This is achieved by perturbing the policy with an exploration term Δu_k. Many techniques have been developed for choosing the type of exploration term (see [26]). In this paper, we consider Δu_k to be random with zero mean. In the experimental section we choose Δu_k to have a normal distribution. The overall control action now becomes

u_k = π̂(x_k, ϑ_k) + Δu_k.    (20)

The policy update is such that the policy parameters are updated toward (away from) Δu_k if the TD error (16) is positive (negative). This leads to the following policy update rule:

ϑ_{k+1} = ϑ_k + α_a δ_{k+1} Δu_k ∇_ϑ π̂(x_k, ϑ_k)    (21)

with α_a > 0 the actor learning rate.
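For linear-in-parameters approximators, the updates (16)–(21) reduce to the few lines sketched below; the feature functions, learning rates, and the scalar-action assumption are illustrative, not the paper's exact implementation.

```python
import numpy as np

class StandardActorCritic:
    """S-AC step following (16)-(21), assuming V̂(x,θ) = θᵀφc(x) and π̂(x,ϑ) = ϑᵀφa(x)."""

    def __init__(self, phi_c, phi_a, n_c, n_a, gamma=0.97, lam=0.65,
                 alpha_c=0.1, alpha_a=1e-3, sigma=1.0, seed=0):
        self.phi_c, self.phi_a = phi_c, phi_a
        self.gamma, self.lam = gamma, lam
        self.alpha_c, self.alpha_a, self.sigma = alpha_c, alpha_a, sigma
        self.theta, self.e = np.zeros(n_c), np.zeros(n_c)   # critic parameters and trace
        self.vartheta = np.zeros(n_a)                       # actor parameters
        self.rng = np.random.default_rng(seed)

    def act(self, x):
        du = self.rng.normal(0.0, self.sigma)               # exploration term Δu_k, cf. (20)
        return float(self.vartheta @ self.phi_a(x)) + du, du

    def update(self, x, du, r_next, x_next):
        fc, fc_next = self.phi_c(x), self.phi_c(x_next)
        delta = r_next + self.gamma * self.theta @ fc_next - self.theta @ fc   # TD error (16)
        self.e = self.gamma * self.lam * self.e + fc                           # trace (18)
        self.theta = self.theta + self.alpha_c * delta * self.e                # critic (19)
        self.vartheta = self.vartheta + self.alpha_a * delta * du * self.phi_a(x)  # actor (21)
        return delta
```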

IV. EB-AC

In this section, we present our main results. We use the PDE (10) and split it into an assignable, parameterizable part and an unassignable part that satisfies the matching condition. In this way, it is possible to parameterize the desired closed-loop Hamiltonian H_d(x) and simultaneously satisfy (10). After that, we parameterize the damping matrix K(x). The two parameterized variables, the desired closed-loop energy H_d(x) and the damping K(x), are then suitable for AC-RL by defining two actors for these variables.

First, we reformulate the PDE (10) in terms of the desired closed-loop energy H_d(x) by applying (7)

A(x) (∇_x H_d(x) − ∇_x H(x)) = 0,   A(x) := [g^⊥(x) F^T(x); g^T(x)]    (22)

and we denote the kernel of A(x) as

ker(A(x)) = {N(x) ∈ R^{n×b} : A(x) N(x) = 0}    (23)

such that (22) reduces to

∇_x H_d(x) − ∇_x H(x) = N(x) a    (24)

with a ∈ R^b. Suppose that (an example is given further on) the state vector x can be written as x = [w^T z^T]^T, where z ∈ R^c and w ∈ R^d, c + d = n, corresponding to the zero and nonzero elements of N(x), such that

[∇_w H_d(x); ∇_z H_d(x)] − [∇_w H(x); ∇_z H(x)] = [N_w(x); 0] a.    (25)

We assume that the matrix N_w(x) is rank d, which is always true for fully actuated mechanical systems (see Section V). It is clear that ∇_z H_d(x) = ∇_z H(x), which we call the matching condition, and hence ∇_z H_d(x) cannot be chosen freely. Thus, only the desired closed-loop energy gradient vector ∇_w H_d(x) is free for assignment. We consider a ξ-parameterized total desired energy of the following form:

Ĥ_d(x, ξ) := H(x) + ξ^T φ_H(w) + H̄_d(w) + C    (26)

where ξ^T φ_H(w) represents a linear-in-parameters basis function approximator (ξ ∈ R^e a parameter vector and φ_H(w) ∈ R^e the basis functions, with e chosen sufficiently large to represent the assignable desired closed-loop energy), H̄_d(w) is an arbitrary function of w, and C is chosen to render Ĥ_d(x, ξ) nonnegative. The function Ĥ_d(x, ξ) automatically verifies (25). To guarantee that the system is passive in relation to the storage function Ĥ_d(x, ξ), the basis functions φ_H(w) should be chosen to be bounded, such that ξ^T φ_H(w) does not grow unbounded toward −∞ when ||w|| → ∞. Moreover, it is important to constrain the minima of Ĥ_d(x, ξ) to be the desired equilibrium x*, via the choice of the basis functions. The elements of the desired damping matrix K(x) of (14), denoted K̂(x, Ψ), can be parameterized in a similar way

[K̂(x, Ψ)]_{ij} = Σ_{l=1}^{f} [Ψ]_{ijl} [φ_K(x)]_l    (27)

with Ψ ∈ R^{m×m×f} a parameter such that

[Ψ]_{ijl} = [Ψ]_{jil}    (28)

which implies K̂(x, Ψ) = K̂^T(x, Ψ), (i, j) = 1, . . . , m, and φ_K(x) ∈ R^f basis functions. We purposefully do not impose K̂(x, Ψ) ≥ 0, to allow the injection of energy into the system via the damping term. This idea has been used in [27] to overcome the dissipation obstacle when synthesizing controllers by interconnection. Although this breaches the passivity criterion of (4) in our particular case, we show that local stability can still be numerically demonstrated using passivity analysis in Section VI-C.
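For concreteness, the sketch below evaluates a parameterization in the spirit of (26)–(28): a linear-in-parameters assignable energy ξᵀφ_H(w) added to H(x), and a damping matrix K̂(x, Ψ) that is symmetric by construction. The basis functions, dimensions, and the choices H̄_d = 0 and C = 0 are placeholders.

```python
import numpy as np

def H_d_hat(x, xi, H, phi_H, n_w, H_bar=lambda w: 0.0, C=0.0):
    """Desired energy (26): Ĥd(x, ξ) = H(x) + ξᵀφH(w) + H̄d(w) + C,
    with w taken here as the first n_w components of x (an assumption)."""
    w = x[:n_w]
    return H(x) + float(xi @ phi_H(w)) + H_bar(w) + C

def K_hat(x, Psi, phi_K):
    """Damping (27): [K̂(x, Ψ)]ij = Σ_l [Ψ]ijl [φK(x)]l, with the symmetry (28)
    enforced explicitly so that K̂(x, Ψ) = K̂ᵀ(x, Ψ)."""
    Psi_sym = 0.5 * (Psi + np.transpose(Psi, (1, 0, 2)))   # [Ψ]ijl = [Ψ]jil
    return np.einsum('ijl,l->ij', Psi_sym, phi_K(x))
```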



This choice is made based on the knowledge that the standard EB-PBC method (without any extra machinery to accommodate saturation) cannot stabilize a saturated-input pendulum system in the up position starting from the down position. As such, this choice illustrates the power of RL in finding alternative routes to obtain control policies, such as injecting energy through the damping term. In other settings, enforcing K̂(x, Ψ) ≥ 0 benefits the stability analysis. The control law (14) now becomes (when no ambiguity is present, the function arguments are dropped to improve readability)

u(x, ξ, Ψ) = g^†(x) F(x) [∇_w Ĥ_d(x, ξ) − ∇_w H(x); 0] − K̂(x, Ψ) g^T(x) ∇_x Ĥ_d(x, ξ)
           = g^† F [(D_w φ_H(w))^T ξ + ∇_w H̄_d(w); 0] − K̂ y.    (29)
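A sketch of how the parameterized law (29) could be assembled for a given state is shown below. It assumes the user supplies the model quantities (J, R, g, ∇_x H, ∇_w H̄_d) and the basis quantities (the Jacobian of φ_H and the values of φ_K); all names are ours, and the split w = x[:n_w] mirrors the mechanical case of Section V.

```python
import numpy as np

def eb_ac_policy(x, xi, Psi, J, R, g, grad_H, dHbar_dw, dphiH_dw, phi_K, n_w):
    """Parameterized EB-PBC law (29):
    u = g†F [ (DwφH)ᵀξ + ∇wH̄d ; 0 ] − K̂(x, Ψ) gᵀ ∇x Ĥd(x, ξ)."""
    F = J(x) - R(x)
    gx = g(x)
    g_pinv = np.linalg.inv(gx.T @ gx) @ gx.T
    grad_w_assign = dphiH_dw(x[:n_w]).T @ xi + dHbar_dw(x[:n_w])   # ∇w(ξᵀφH + H̄d)
    shaped = np.concatenate([grad_w_assign, np.zeros(len(x) - n_w)])
    grad_Hd = grad_H(x) + shaped                                   # ∇xĤd = ∇xH + shaped term
    K = np.einsum('ijl,l->ij', Psi, phi_K(x))                      # K̂(x, Ψ) from (27)
    return g_pinv @ F @ shaped - K @ (gx.T @ grad_Hd)
```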

Algorithm 1 Energy-Balancing Actor-Critic
Input: System (1), λ, γ, α_a for each actor, α_c
1:  e_0(x) = 0 ∀x
2:  Initialize x_0, θ_0, ξ_0, Ψ_0
3:  k ← 1
4:  loop
5:    Execute:
6:      Draw Δu_k ∼ N(0, σ²); apply the action u_k = ς(π̂(x_k, ξ_k, Ψ_k) + Δu_k) and compute Δū_k = u_k − π̂(x_k, ξ_k, Ψ_k)
7:      Observe the next state x_{k+1} and calculate the reward r_{k+1} = ρ(x_{k+1}, u_k)
8:    Critic:
9:      TD: δ_{k+1} = r_{k+1} + γ V̂(x_{k+1}, θ_k) − V̂(x_k, θ_k)
10:     Eligibility trace: e_{k+1} = γ λ e_k + ∇_θ V̂(x_k, θ_k)
11:     Critic update: θ_{k+1} = θ_k + α_c δ_{k+1} e_{k+1}
12:   Actors:
13:     Actor 1 (Ĥ_d(x, ξ)): ξ_{k+1} = ξ_k + α_{a,ξ} δ_{k+1} Δū_k ∇_ξ ς(π̂(x_k, ξ_k, Ψ_k))
14:     Actor 2 (K̂(x, Ψ)):
15:     for i, j = 1, . . . , m do
16:       [Ψ_{k+1}]_{ij} = [Ψ_k]_{ij} + α_{a,[Ψ]_{ij}} δ_{k+1} Δū_k ∇_{[Ψ_k]_{ij}} ς(π̂(x_k, ξ_k, Ψ_k))
17:     end for
18:  end loop

Now, we are ready to introduce the update equations for the parameter vectors ξ, [Ψ]_{ij}. Denote by ξ_k, [Ψ_k]_{ij} the values of the parameters at the discrete time step k. The policy π̂ of the AC-RL algorithm is chosen equal to the control law parameterized by (29)

π̂(x_k, ξ_k, Ψ_k) := u(x_k, ξ_k, Ψ_k).    (30)

In this paper, we take the control input saturation problem into account by considering a generic saturation function ς : R^m → S, S ⊂ R^m, such that

ς(u(x)) ∈ S   ∀u    (31)

where S is the set of valid control inputs. The control action with exploration (20) becomes

u_k = ς(π̂(x_k, ξ_k, Ψ_k) + Δu_k)    (32)

where Δu_k is drawn from a desired distribution. The exploration term to be used in the actor update (21) must be adjusted to respect the saturation

Δū_k = u_k − π̂(x_k, ξ_k, Ψ_k).    (33)

Note that due to this step, the exploration term Δū_k used in the learning algorithm is no longer drawn from the chosen distribution of Δu_k. Furthermore, we obtain the following gradients of the saturated policy:

∇_ξ ς(π̂) = ∇_π̂ ς(π̂) ∇_ξ π̂    (34)

∇_{[Ψ]_{ij}} ς(π̂) = ∇_π̂ ς(π̂) ∇_{[Ψ]_{ij}} π̂.    (35)

Although not explicitly indicated in the previous equations, the (lack of) differentiability of the saturation function ς has to be considered for the problem at hand, such that the gradient of ς can be computed. For a traditional saturation in u_i ∈ R of the form max(u_min, min(u_max, u_i)), i.e., assuming each input u_i is bounded by u_min and u_max, the gradient of ς is the zero matrix outside the unsaturated set S (i.e., when u_i < u_min or u_i > u_max). For other types of saturation the function ∇ς must be computed. Finally, the actor parameters ξ_k, [Ψ_k]_{ij} are updated according to (21), respecting the saturated policy gradients. For the parameters of the desired Hamiltonian we obtain

ξ_{k+1} = ξ_k + α_{a,ξ} δ_{k+1} Δū_k ∇_ξ ς(π̂(x_k, ξ_k, Ψ_k))    (36)

and for the parameters of the desired damping we have

[Ψ_{k+1}]_{ij} = [Ψ_k]_{ij} + α_{a,[Ψ]_{ij}} δ_{k+1} Δū_k ∇_{[Ψ_k]_{ij}} ς(π̂(x_k, ξ_k, Ψ_k))    (37)

where (i, j) = 1, . . . , m, while observing (28). Algorithm 1 gives the entire EB-AC algorithm with input saturation.

The dynamics of the EB-AC Algorithm 1 raise a number of questions regarding stability and convergence: are the good stability properties of the traditional EB-PBC lost? This is not the case: if the parameter ξ is fixed, then stability is preserved, in the sense that the system is passive with respect to the storage function Ĥ_d(x, ξ), assuming bounded basis functions describing Ĥ_d(x, ξ), as discussed after (26), and that the dissipation matrix K̂ is positive semi-definite. A related question is whether, during learning (while the parameter ξ is evolving), the Hamiltonian Ĥ_d(x, ξ) will capture the desired control specification. One cannot assume that the desired Hamiltonian immediately fulfils the control specification, since if that were the case then no learning would be needed. In the RL community it is generally accepted that during learning no stability and convergence guarantees can be given [7], as exploration is a necessary component of the framework. In our framework, we cannot guarantee convergence during learning, but by constraining the desired Hamiltonian we can prevent the total energy from growing unbounded, avoiding possible instabilities.
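As a rough illustration, Algorithm 1 could be implemented along the following lines for a single-input system. The environment step function, the policy and gradient callables, and all learning rates are assumptions introduced here for the sketch; the exact experimental settings are given in Table II.

```python
import numpy as np

def eb_ac_learn(env, policy, policy_grads, value_feats, saturate, sat_grad,
                x0, xi0, Psi0, n_trials=200, n_steps=300, gamma=0.97, lam=0.65,
                alpha_c=0.1, alpha_xi=1e-3, alpha_Psi=1e-3, sigma=1.0, seed=0):
    """EB-AC loop in the spirit of Algorithm 1 (single input assumed).
    policy(x, xi, Psi) -> scalar action; policy_grads -> (dpi/dxi, dpi/dPsi);
    env(x, u) -> (x_next, reward); sat_grad(u) -> gradient of ς evaluated at π̂."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(len(value_feats(x0)))
    xi, Psi = xi0.copy(), Psi0.copy()
    for _ in range(n_trials):
        x, e = x0.copy(), np.zeros_like(theta)
        for _ in range(n_steps):
            pi = policy(x, xi, Psi)
            du = rng.normal(0.0, sigma)                      # exploration Δu_k
            u = saturate(pi + du)                            # saturated action (32)
            du_bar = u - pi                                  # adjusted exploration (33)
            x_next, r = env(x, u)
            delta = (r + gamma * theta @ value_feats(x_next)
                     - theta @ value_feats(x))               # TD error (16)
            e = gamma * lam * e + value_feats(x)             # eligibility trace (18)
            theta = theta + alpha_c * delta * e              # critic update (19)
            dpi_dxi, dpi_dPsi = policy_grads(x, xi, Psi)
            s = sat_grad(pi)                                 # ∇ς(π̂), see (34)-(35)
            xi = xi + alpha_xi * delta * du_bar * s * dpi_dxi        # actor 1 (36)
            Psi = Psi + alpha_Psi * delta * du_bar * s * dpi_dPsi    # actor 2 (37)
            x = x_next
    return theta, xi, Psi
```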



Another relevant question is: will RL converge in this setting? The control law in (29) can be rewritten as

u = φ̄_1(x) + Σ_i ξ_i φ̄_{2,i}(x) + Σ_j Ψ̄_j φ̄_{3,j}(x)

with Ψ̄ representing the stacked version of Ψ, meaning that the policy is parameterized in an affine way. This is similar to most of the standard linear-in-parameters representations found in the AC literature. However, at this moment we do not yet take advantage of the existing AC-RL convergence proofs to demonstrate convergence of Algorithm 1. In practice, not only does the RL component converge faster than traditional model-free RL (see Section VI-E), but throughout learning the system never becomes unstable, as presented in Section VI. Additionally, the resulting policy performs as well as standard model-free approximated RL algorithms. The last point to consider is that, since EB-PBC suffers from the dissipation obstacle [2], limiting its applicability to special classes of systems such as mechanical systems, the algorithm we present contains the same limitation. Eliminating this limitation is ongoing work (see also the results presented in [27]).

V. MECHANICAL SYSTEMS

To illustrate an application of the method, consider a fully actuated mechanical system of the form

Σ_m:  [q̇; ṗ] = [0  I; −I  −R̄] [∇_q H(q, p); ∇_p H(q, p)] + [0; I] u,   y = [0  I] [∇_q H(q, p); ∇_p H(q, p)]    (38)

with q ∈ R^n̄, p ∈ R^n̄ (n̄ = n/2, n even) the generalized positions and momenta, respectively, and R̄ ∈ R^{n̄×n̄} the damping matrix. The system admits (1) with R̄ > 0 and the Hamiltonian

H(q, p) = ½ p^T M^{-1}(q) p + P(q)    (39)

with M(q) = M^T(q) > 0 the inertia matrix and P(q) the potential energy. For the system (38) it holds that rank{g(x)} = n̄ and the state vector can be split into the part w = [q_1, q_2, . . . , q_n̄]^T and the part z = [p_1, p_2, . . . , p_n̄]^T. Since g(x) = [0 I]^T, its annihilator can be written as g^⊥(x) = [ḡ(x) 0], for an arbitrary matrix ḡ(x). This means that only the potential energy can be shaped, which is widely known in EB-PBC for mechanical systems. The approximated desired closed-loop energy (26) reads

Ĥ_d(x, ξ) = ½ p^T M^{-1}(q) p + ξ^T φ_H(q)    (40)

where the first term represents the unassignable part, i.e., the kinetic energy of the system Hamiltonian (39), and the second term ξ^T φ_H(q) the assignable desired potential energy. The actor updates can then be defined for each parameter according to (36) and (37). For underactuated mechanical systems, e.g., g(x) ≠ [0 I]^T, the split state vector z is enlarged with those q-coordinates that cannot be actuated directly, because these coordinates correspond to the zero elements of N(x) (i.e., the matrix N_q(x) is no longer rank n̄).
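The split in (25) can be checked numerically for the structure (38). The sketch below builds A(x) = [g⊥Fᵀ; gᵀ] for a hypothetical one-degree-of-freedom system and computes its kernel: the momentum component of the kernel basis is (numerically) zero, confirming that only the potential-energy gradient is assignable. The numerical values are illustrative.

```python
import numpy as np
from scipy.linalg import null_space

# Hypothetical 1-DoF fully actuated mechanical system, x = [q, p]
R_bar = 0.1
J = np.array([[0.0, 1.0], [-1.0, 0.0]])
R = np.array([[0.0, 0.0], [0.0, R_bar]])
F = J - R
g = np.array([[0.0], [1.0]])
g_perp = np.array([[1.0, 0.0]])          # left annihilator: g_perp @ g = 0

A = np.vstack([g_perp @ F.T, g.T])       # A(x) from (22)
N = null_space(A)                        # kernel N(x), cf. (23)
print(N)                                 # first (q) component nonzero, second (p) component ~ 0
```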

Fig. 1. Inverted pendulum setup.

VI. EXAMPLE: PENDULUM SWING-UP

To validate our method, the problem of swinging up an inverted pendulum subject to control saturation is studied in simulation and using the actual physical setup depicted in Fig. 1. The pendulum swing-up is a low-dimensional, but highly nonlinear control problem commonly used as a benchmark in the RL literature [9] and it has also been studied in PBC [11]. The equations of motion admit (38) and read

Σ_p:  [q̇; ṗ] = [0  1; −1  −R̄(q̇)] [∇_q H(q, p); ∇_p H(q, p)] + [0; K_p/R_p] u,   y = [0  K_p/R_p] [∇_q H(q, p); ∇_p H(q, p)]    (41)

with q the angle of the pendulum and p the angular momentum; thus we denote the full measurable state x = [q, p]^T. The damping term is

R̄(q̇) = b_p + K_p²/R_p + σ_p/|q̇|    (42)

for which it holds that R̄(q̇) > 0, ∀q̇. Note that the fraction σ_p/|q̇| arises from the modeled Coulomb friction f_c = σ_p sign(q̇) = (σ_p/|q̇|) q̇ = (σ_p/|q̇|) p/J_p. Furthermore, we denote the Hamiltonian

H(q, p) = p²/(2J_p) + P(q)    (43)

with

P(q) = M_p g_p l_p (1 + cos q).    (44)

The model parameters are given in Table I. The desired Hamiltonian (26) reads

Ĥ_d(x, ξ) = p²/(2J_p) + ξ^T φ_H(q).    (45)

Only the potential energy can be shaped, which we denote by P̂_d(q, ξ) = ξ^T φ_H(q). Furthermore, as there is only one input, K̂(x, Ψ) becomes a scalar

K̂(x, ψ) = ψ^T φ_K(x).    (46)


TABLE I. Inverted Pendulum Model Parameters

TABLE II. Simulation Parameters

Thus, control law (29) results in

u(x, ξ, ψ) = g^† F [ξ^T ∇_q φ_H − ∇_q P; 0] − K̂ g^T [ξ^T ∇_q φ_H; J_p^{-1} p]
           = −(R_p/K_p)(ξ^T ∇_q φ_H + M_p g_p l_p sin(q)) − (K_p/R_p) ψ^T φ_K q̇    (47)

which we define as the policy π̂(x, ξ, ψ). Hence, we have two actor updates

ξ_{k+1} = ξ_k + α_{a,ξ} δ_{k+1} Δū_k ∇_ξ ς(π̂(x_k, ξ_k, ψ_k))    (48)

ψ_{k+1} = ψ_k + α_{a,ψ} δ_{k+1} Δū_k ∇_ψ ς(π̂(x_k, ξ_k, ψ_k))    (49)

for the desired potential energy P̂_d(q, ξ) and the desired damping K̂(x, ψ), respectively.

A. Function Approximation

To approximate the critic and the two actors, function approximators are necessary. In this paper, we use the Fourier basis [28] because of its ease of use, the possibility to incorporate information about the symmetry in the system, and the ability to ascertain properties useful for stability analysis of this specific problem. The periodicity of the function approximators obtained via a Fourier basis is compatible with the topology of the configuration space of the pendulum, defined to be S¹ × R. We define a multivariate Nth-order² Fourier basis for n dimensions as

φ_i(x̄) = cos(π c_i^T x̄),   i ∈ {1, . . . , (N + 1)^n}    (50)

with c_i ∈ Z^n, which means that all possible N + 1 integer values, or frequencies, are combined in a vector in Z^n to create a matrix c ∈ Z^{n×(N+1)^n} containing all possible frequency combinations. For example,

c_1 = [0 0]^T, c_2 = [1 0]^T, . . . , c_{(3+1)²} = [3 3]^T    (51)

for a 3rd-order Fourier basis in 2-D. The state x is scaled according to

x̄_i = (x_i − x_{i,min})/(x_{i,max} − x_{i,min}) (x̄_{i,max} − x̄_{i,min}) + x̄_{i,min}    (52)

for i = 1, . . . , n with (x̄_{i,min}, x̄_{i,max}) = (−1, 1).

²"Order" refers to the order of approximation; "dimensions" to the number of states in the system.
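The Fourier features (50) with the scaling (52) can be generated as in the sketch below; the ordering of the frequency vectors and the placeholder state bounds are implementation choices, not the paper's exact code.

```python
import numpy as np
from itertools import product

def fourier_basis(order, x_min, x_max):
    """Multivariate Fourier basis (50) on states scaled to [-1, 1] via (52)."""
    n = len(x_min)
    C = np.array(list(product(range(order + 1), repeat=n)))   # all (order+1)^n frequency vectors
    def phi(x):
        x_bar = 2.0 * (np.asarray(x) - x_min) / (x_max - x_min) - 1.0   # scaling (52)
        return np.cos(np.pi * C @ x_bar)
    return phi, C

# Example: 3rd-order basis over (q, p); Jp is a placeholder for the inertia listed in Table I
Jp = 1.0
phi_c, C = fourier_basis(3, x_min=np.array([-np.pi, -8 * np.pi * Jp]),
                            x_max=np.array([np.pi, 8 * np.pi * Jp]))
print(phi_c([0.0, 0.0]).shape)   # (16,) features, as used for the critic in (56)
```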

Projecting the state variables onto this symmetrical range has several advantages. First, this means that the policy is periodic with period T = 2, such that it wraps around (i.e., modulo 2π) and prevents discontinuities at the boundary values of the angle (x = [π ± ε, p], ε very small). This way we are taking into consideration the topology of the system: for the inverted pendulum we have that q ∈ S¹ and p ∈ R. Second, learning is faster because updating the value function and policy for some x also applies to the sign-opposite value of x. Third, ∇_q P̂_d(0, ξ) = 0 by the choice of parameterization, which is beneficial for stability analysis. Although the momentum will now also be periodic in the value function and policy, this is not a problem because the value function and policy approximation are restricted to a domain and the momentum itself is also restricted to the same domain.

We adopt the adjusted learning rate from [28] such that

α_{a_i,ξ} = α_{a_b,ξ} / ||c_i||_2,   α_{a_i,ψ} = α_{a_b,ψ} / ||c_i||_2    (53)

for i = 1, . . . , (N + 1)^n, with α_{a_b,ξ}, α_{a_b,ψ} the base learning rates for the two actors (Table II) and α_{a_1,ξ} = α_{a_b,ξ}, α_{a_1,ψ} = α_{a_b,ψ} to avoid division by zero for c_1 = [0 0]^T. Equation (53) implies that parameters corresponding to basis functions with higher (lower) frequencies are learned slower (faster). Note that in this specific example we assign two different values to the base learning rate due to the different scalars premultiplying the basis functions in (47). This results in a "balanced" learning process for the parameters ξ and ψ. The parameterizations described above result in ∇_x Ĥ_d(x*, ξ) = 0 for all ξ, where x* = [0, 0]^T is the goal state. This entails that the goal state is a critical point of the Hamiltonian throughout the learning process, in effect speeding up the RL convergence.

B. Simulation

The task is to learn to swing up and stabilize the pendulum from the initial position pointing down, x_0 = [π, 0]^T, to the desired equilibrium position at the top, x* = [0, 0]^T. Since the control action is saturated, the system is not able to swing up the pendulum directly; rather, it must swing back and forth to build up momentum to eventually reach the equilibrium. The reward function ρ is defined such that it has its maximum in the desired unstable equilibrium and penalizes other states via

ρ(x, u) = Q_r (cos(q) − 1) − R_r p²    (54)

with

Q_r = 25,   R_r = 0.1/J_p².    (55)



This reward function is consistent with the mapping S¹ × R → R for the pair angle and angular velocity, and proved to improve performance over a purely quadratic reward, such as the one used in [9]. For the critic, we define the basis function approximation as

V̂(x, θ) = θ^T φ_c(x)    (56)

with φ_c(x) a 3rd-order Fourier basis, resulting in 16 learnable parameters θ in the domain [q_min, q_max] × [p_min, p_max] = [−π, π] × [−8π J_p, 8π J_p]. Actor 1 (P̂_d(q, ξ)) is parameterized using a 3rd-order Fourier basis in the range [−π, π], resulting in four learnable parameters. Actor 2 (K̂(x, ψ)) is also parameterized using a 3rd-order Fourier basis for the full state space, in the same domain as the critic. Exploration is done at every time step by randomly perturbing the action with normally distributed zero-mean white noise with standard deviation σ = 1, that is

Δu ∼ N(0, 1).    (57)

Normally, the AC scheme requires that the exploration term decays to zero as time progresses. However, in practice it is difficult to find an appropriate decay schedule without extensive experiments with the system. An alternative, which we follow in this paper, is to choose a sufficiently small σ so that the learning converges to the neighborhood of the optimal policy. While the eventual policy may be suboptimal, it tends to be more robust with respect to disturbances, as it is learned under exploration. Moreover, since the initial position of the pendulum is a stable equilibrium, it is useful to introduce an initial perturbation, which is achieved via exploration. We incorporate saturation by defining the saturation function (31) as

ς(u_k) = u_k if |u_k| ≤ u_max,   ς(u_k) = sgn(u_k) u_max otherwise.    (58)

Recall that the saturation must be taken into account in the policy gradients by applying (34) and (35). The parameters were all initialized with zero vectors of appropriate dimensions, i.e., (θ_0, ξ_0, ψ_0) = 0. The algorithm was first run with the system simulated in MATLAB for 200 trials of 3 s each (with a near-optimal policy, the pendulum needs approximately one second to swing up). Each trial begins in the initial position x_0. This simulation was repeated 50 times to get an estimate of the average, minimum, maximum, and confidence regions for the learning curve. The simulation parameters are given in Table II.

Fig. 2 shows the average learning curve obtained after 50 simulations. The algorithm shows good convergence and on average needs about 2 min (40 trials) to reach a near-optimal policy. The initial drop in performance is caused by the zero-initialization of the value function (critic), which is too optimistic compared to the true value function. Therefore, the controller explores a large part of the state space and receives a lot of negative rewards before it learns the true value of the states. Note that the learning rates were tuned such that there are no failed runs, i.e., all 50 consecutive simulations converge to a near-optimal policy. We have observed that if the learning rates are set higher, faster learning can be achieved at the cost of a higher number of failed runs.
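For reference, the saturation (58) and the gradient factor it contributes to (34)–(35) can be written as in the sketch below (scalar input; the threshold u_max is whatever the hardware dictates).

```python
import numpy as np

def saturate(u, u_max):
    """Saturation ς(u) of (58) for a scalar input."""
    return u if abs(u) <= u_max else float(np.sign(u) * u_max)

def saturation_grad(u, u_max):
    """Gradient of ς used in the chain rule (34)-(35): 1 inside the unsaturated set, 0 outside."""
    return 1.0 if abs(u) <= u_max else 0.0

# In the actor updates (48)-(49) the gradients are scaled by saturation_grad(pi_hat, u_max),
# so the parameters stop changing whenever the unsaturated policy output itself exceeds the bounds.
```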

Fig. 2. Learning curve for the proposed energy-balancing actor-critic method for 50 simulations.

Fig. 3. Simulation results for (a) angle q (top) and momentum p (bottom) and (b) desired closed-loop Hamiltonian Hd (x, ξ, ψ) including the simulated trajectory (black dots) using the policy learned.

A simulation using the policy learned in a typical experiment is given in Fig. 3(a). As can be seen, the pendulum swings back once to build up momentum to eventually get to the equilibrium. The desired Hamiltonian Ĥ_d(x, ξ) (45), acquired through learning, is given in Fig. 3(b). There are three minima, of which one corresponds to the desired equilibrium. The other two equilibria are undesirable wells that come from the shaped potential energy P̂_d(q, ξ) [Fig. 4(a)]. These minima are the result of the algorithm trying to swing up the pendulum in a single swing, which is not possible due to the saturation. Hence, a swing-up strategy is necessary to avoid staying in these wells. The number of these undesirable wells is a function of the control saturation and of the number of basis functions chosen to approximate P̂_d(q, ξ). The learned damping K̂(x, ψ) [Fig. 4(b)] is positive (white) toward the equilibrium, thus extracting energy from the system, while it is negative (gray) in the region of the initial state. The latter corresponds to pumping energy into the system, which is necessary to build up momentum for the swing-up and to escape the undesirable wells of P̂_d(q, ξ) [see the discussion of expression (27)]. A disadvantage is that control law (47), with the suggested basis functions, is always zero for the set {x | x = (0 + jπ, 0), j = 1, 2, . . .}, which implies that it is zero not only at the desired equilibrium, but also at the initial state x_0. During learning this is not a problem



Fig. 4. (a) Desired potential energy and (b) desired damping (gray: negative, white: positive) for a typical learning experiment. The black dots indicate the individual samples of the simulation in Fig. 3(a).

Fig. 6. Learning curve for the proposed energy-balancing actor-critic method for 20 experiments with the real physical system.

Fig. 5. Signum of (a) Ḣ̂_d(x, ψ), indicating positive (white) and negative (gray) regions, and (b) Ḣ̂_{d,diff}(x, ψ) = sgn(Ḣ̂_d(x, ψ) − Ḣ̂_{d,sat}(x, ψ)), indicating regions where Ḣ̂_d(x, ψ) = Ḣ̂_{d,sat}(x, ψ) (gray) and Ḣ̂_d(x, ψ) ≠ Ḣ̂_{d,sat}(x, ψ) (white). Black dots indicate the simulated trajectory.

since there is constant exploration, but after learning the system should not be initialized at exactly x_0, otherwise it will stay in this point. This can be solved by initializing with a small perturbation ε around x_0. In real-life systems it will also be less of a problem, as noise is generally present in the sensor readings.

C. Stability of the Learned Controller

Since control saturation is present, the target dynamics do not satisfy (5). Hence, to conclude local stability of x* based on (6), we calculate Ḣ̂_d(x, ξ) for the unsaturated case [Fig. 5(a)]³ and for the saturated case (Ḣ̂_{d,sat}(x, ξ)) and compute the sign of the difference [Fig. 5(b)]. By looking at Fig. 5(b), it appears that ∃δ ⊂ R^n : |x − x*| < δ such that Ḣ̂_{d,sat}(x, ξ) = Ḣ̂_d(x, ξ). It can be seen from Fig. 5(b) that such a δ exists, i.e., a small gray region around the equilibrium x* exists. Hence, we can use Ḣ̂_d(x, ξ) around x* and assess stability using (6). From Fig. 3(b) it follows that Ĥ_d(x, ξ) > 0 for all states in Fig. 3(b). From Fig. 4(a) we infer that locally

arg min P̂_d(q, ξ) = x*;   ∇_q P̂_d(x*, ξ) = 0;   ∇_q² P̂_d(x*, ξ) > 0    (59)

the latter two of which naturally result from the basis function definition. Furthermore, from Fig. 4(b) it can be seen that around x*, K̂(x, ψ) > 0. Hence, in a region δ around x*

Ĥ_d(x, ξ) > 0;   Ḣ̂_d(x, ξ) ≤ 0;   Ḣ̂_d(x*, ξ) = 0    (60)

which implies local asymptotic stability of x*. Extensive simulations show that similar behavior is always achieved.

³Fig. 5(a) is sign-opposite to Fig. 4(b), which is logical, because the negative (positive) regions of K̂(x, ψ) correspond to negative (positive) damping, which corresponds to a positive (negative) value of Ḣ̂_d(x, ξ) based on (6).

D. Real-Time Experiments

Using the physical setup shown in Fig. 1, 20 learning experiments were run using identical settings as in the simulations. There were no failures in this set of experiments (i.e., the learning process always converged to a near-optimal policy capable of swinging up and stabilizing the pendulum). The results are illustrated in Fig. 6. The algorithm shows slightly slower convergence (about three minutes of learning, or 60 trials, to reach a near-optimal policy instead of 40) and a less consistent average when compared to Fig. 2. This can be attributed to a combination of model mismatch and the symmetrical basis functions (through which it is not possible to incorporate the nonsymmetrical friction that is present in the real system). Overall though, the performance can be considered good when compared to the simulation results. Also, the same performance dip is present, which can again be attributed to the optimistic value function initialization.

E. Comparison With Standard AC

We have applied the standard AC [9, Algorithm 1] to the same swing-up and stabilization task for the pendulum, represented by system (41). We considered a linearly parameterized policy (actor)

π̂(x, ϑ) = ϑ^T φ_a(x)    (61)

where ϑ is the parameter to be learned. The resulting control input is

u_k = ς(π̂(x, ϑ) + Δu_k)    (62)



Fig. 7. Learning curve for the standard AC method for 50 simulations.

where the exploration term Δu_k is a zero-mean random variable. Here, we have used the simulation parameters from Table II, together with the actor learning rate α_{a,ϑ} = 5 · 10^{−5}. Similarly to the energy-balancing actor-critic (EBAC) simulations of Section VI-B, the learning rate has been tuned such that no failed experiments occur. Fig. 7 shows the maximum, average, and minimum of the learning curve obtained from 50 consecutive simulations. The S-AC algorithm generally needs around 60 trials before a near-optimal policy is obtained, while EBAC needs around 40 trials, as mentioned in Section VI-B. Faster convergence for the S-AC can be achieved by increasing the learning rate α_{a,ϑ}. However, when doing so we observed a considerable number of failures. For S-AC we used a much larger number of basis functions: one hundred in total for the critic and actor, compared to 36 basis functions for EBAC. In addition to faster convergence and fewer basis functions, one major advantage of EBAC as compared to S-AC is the physical interpretability of the learned control law in terms of energy shaping and damping injection.

VII. CONCLUSION

In this paper, we have presented a method to systematically parameterize the policy in RL control, using the PH framework. The parameters are found by means of AC-RL. In this way, we are able to learn a closed-loop energy landscape for PH systems. The advantages are that optimal controllers can be generated using energy-based control techniques, there is no need to specify a global system Hamiltonian, and the solutions acquired by means of RL can be interpreted in terms of energy shaping and damping injection, which makes it possible to numerically assess stability using passivity theory. By making use of the model knowledge, the AC method is able to quickly learn near-optimal policies. We have found that the proposed EB-AC algorithm performs very well in a physical setup. Due to the intrinsic energy boundedness of the learned Hamiltonian, we have observed that the system never becomes unstable during learning.

In comparison with the standard AC method, we found that incorporating model knowledge through the PH framework makes the learning process more effective in terms of the learning speed within the individual trials, the overall learning performance over a series of trials, and the number of parameters in the approximators. The EB-AC method yields faster convergence with fewer basis functions and no failed trials throughout the entire learning experiment. An additional advantage of EBAC is the physical interpretation of the control law learned. We are currently working on the extension of the algorithms presented to an IDA-PBC setting, such that more classes of systems can be addressed and more freedom is given in shaping the desired Hamiltonian (e.g., kinetic energy shaping).

REFERENCES

[1] R. Ortega and M. W. Spong, "Adaptive motion control of rigid robots: A tutorial," Automatica, vol. 25, no. 6, pp. 877–888, Nov. 1989.
[2] R. Ortega, A. van der Schaft, F. Castanos, and A. Astolfi, "Control by interconnection and standard passivity-based control of port-Hamiltonian systems," IEEE Trans. Autom. Control, vol. 53, no. 11, pp. 2527–2542, Dec. 2008.
[3] A. van der Schaft, L2-Gain and Passivity Techniques in Nonlinear Control. Berlin, Germany: Springer, 2000.
[4] R. Ortega, A. van der Schaft, I. Mareels, and B. Maschke, "Putting energy back in control," IEEE Control Syst. Mag., vol. 21, no. 2, pp. 18–33, Apr. 2001.
[5] V. Duindam and S. Stramigioli, Modeling and Control for Efficient Bipedal Walking Robots: A Port-Based Approach. Berlin, Germany: Springer, 2009.
[6] C. Secchi, S. Stramigioli, and C. Fantuzzi, Control of Interactive Robotic Interfaces: A Port-Hamiltonian Approach. Berlin, Germany: Springer, 2007.
[7] R. Sutton and A. Barto, Reinforcement Learning: An Introduction. London, U.K.: MIT Press, 1998.
[8] V. Konda and J. Tsitsiklis, "On actor-critic algorithms," SIAM J. Control Optim., vol. 42, no. 4, pp. 1143–1166, 2003.
[9] I. Grondman, M. Vaandrager, L. Buşoniu, R. Babuška, and E. Schuitema, "Efficient model learning methods for actor-critic control," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 3, pp. 591–602, Jun. 2012.
[10] Y. W. A. Wei, "Stabilization and H∞ control of nonlinear port-controlled Hamiltonian systems subject to actuator saturation," Automatica, vol. 46, no. 12, pp. 2008–2013, 2010.
[11] K. Åström, J. Aracil, and F. Gordillo, "A family of smooth controllers for swinging up a pendulum," Automatica, vol. 44, no. 7, pp. 1841–1848, Jul. 2008.
[12] G. Escobar, R. Ortega, and H. Sira-Ramirez, "Output-feedback global stabilization of a nonlinear benchmark system using a saturated passivity-based controller," IEEE Trans. Control Syst. Technol., vol. 7, no. 2, pp. 289–293, Mar. 1999.
[13] K. Fujimoto and T. Sugie, "Iterative learning control of Hamiltonian systems: I/O based optimal control approach," IEEE Trans. Autom. Control, vol. 48, no. 10, pp. 1756–1761, Oct. 2003.
[14] A. Macchelli, "Port Hamiltonian systems: A unified approach for modeling and control finite and infinite dimensional physical systems," Ph.D. dissertation, Dept. Electron. Comput. Sci. Syst., Univ. Bologna, Bologna, Italy, 2002.
[15] A. Macchelli, C. Melchiorri, C. Secchi, and C. Fantuzzi, "A variable structure approach to energy shaping," in Proc. Eur. Control Conf., 2003.
[16] W. Sun, Z. Lin, and Y. Wang, "Global asymptotic and finite-gain L2 stabilization of port-controlled Hamiltonian systems subject to actuator saturation," in Proc. Amer. Control Conf., St. Louis, MO, USA, 2009.
[17] H. Hjalmarsson, "Iterative feedback tuning: An overview," Int. J. Adapt. Control Signal Process., vol. 16, no. 5, pp. 373–395, May 2002.
[18] K. Fujimoto and I. Koyama, "Iterative feedback tuning for Hamiltonian systems," in Proc. 17th Int. Federat. Autom. Control World Congr., Seoul, Korea, Jul. 2008, pp. 15678–15683.
[19] S. Yin, H. Luo, and S. Ding, "Real-time implementation of fault-tolerant control systems with performance optimization," IEEE Trans. Ind. Electron., vol. 61, no. 5, pp. 2402–2411, May 2014.
[20] B. Maschke and A. van der Schaft, "Port-controlled Hamiltonian systems: Modelling origins and system theoretic properties," in Proc. 3rd Conf. Nonlin. Control Syst. (NOLCOS), Bordeaux, France, 1992, pp. 282–288.
[21] A. van der Schaft and B. Maschke, "On the Hamiltonian formulation of nonholonomic mechanical systems," Rep. Math. Phys., vol. 34, no. 2, pp. 225–233, Aug. 1994.


[22] I. Grondman, L. Buşoniu, and R. Babuška, "Model learning actor-critic algorithms: Performance evaluation in a motion control task," in Proc. IEEE Conf. Decis. Control (CDC), Maui, HI, USA, Dec. 2012, pp. 5272–5277.
[23] A. Barto, R. Sutton, and C. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," IEEE Trans. Syst., Man, Cybern., Syst., vol. 13, no. 5, pp. 835–846, Sep./Oct. 1983.
[24] I. Grondman, L. Buşoniu, G. Lopes, and R. Babuška, "A survey of actor-critic reinforcement learning: Standard and natural policy gradients," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 42, no. 6, pp. 1291–1307, Nov. 2012.
[25] R. Sutton, D. McAllester, S. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," in Advances in Neural Information Processing Systems, vol. 12. Cambridge, MA, USA: MIT Press, 2000, pp. 1057–1063.
[26] J. Asmuth, L. Li, M. Littman, A. Nouri, and D. Wingate, "A Bayesian sampling approach to exploration in reinforcement learning," in Proc. 25th Conf. Uncertainty Artif. Intell., Edinburgh, U.K., 2009, pp. 19–26.
[27] J. Koopman and D. Jeltsema, "Casimir-based control beyond the dissipation obstacle," arXiv.org, Jun. 2012.
[28] G. Konidaris and S. Osentoski, "Value function approximation in reinforcement learning using the Fourier basis," Auton. Learn. Lab., Comput. Sci. Dept., Univ. Massachusetts, Amherst, MA, USA, Tech. Rep. 101, 2008.

Olivier Sprangers received the M.Sc. degree in systems and control from the Delft Center for Systems and Control, Delft University of Technology, Delft, The Netherlands, in 2012. He is currently an Investment Professional with HAL Investments, Rotterdam, The Netherlands. His current research interests include combining modelbased control technologies with machine-learning techniques.


Robert Babuška received the M.Sc. degree (cum laude) in control engineering from the Czech Technical University, Prague, Czech Republic, and the Ph.D. degree (cum laude) from the Delft University of Technology, Delft, The Netherlands, in 1990 and 1997, respectively. He has had faculty appointments with the Department of Technical Cybernetics, Czech Technical University, and the Electrical Engineering Faculty, Delft University of Technology. He is currently a Professor of Intelligent Control and Robotics with the Delft Center for Systems and Control, Delft University of Technology. His current research interests include reinforcement learning, neural and fuzzy systems, identification and state-estimation, nonlinear model-based and adaptive control, and dynamic multiagent systems.

Subramanya P. Nageshrao received the B.E. degree in electronics and communication from Visvesvaraya Technological University, Belgaum, India, in 2005, and the M.S. degree in mechatronics from Technische Universität Hamburg-Harburg, Hamburg, Germany, in 2011. He is currently pursuing the Ph.D. degree from the Delft Center for Systems and Control, Delft University of Technology, Delft, The Netherlands. From 2005 to 2009, he was a Software Engineer with Bosch Ltd., Bangalore, India. His current research interests include machine learning and nonlinear control, where he is actively combining reinforcement-learning techniques with passivity-based control.

Gabriel A. D. Lopes (M’03) received the M.Sc. and the Ph.D. degrees in electric engineering and computer science from the University of Michigan, Ann Arbor, MI, USA, in 2003 and 2007, respectively. He was with the GRASP Laboratory, University of Pennsylvania, Philadelphia, PA, USA, from 2005 to 2008. He is currently an Assistant Professor with the Delft Center for Systems and Control, Delft University of Technology, Delft, The Netherlands. His current research interests include nonlinear control, robotics, discrete-event systems, and machine learning.
