Biol Cybern (2014) 108:459–473 DOI 10.1007/s00422-014-0613-7

ORIGINAL PAPER

Adaptive dynamic programming as a theory of sensorimotor control

Yu Jiang · Zhong-Ping Jiang

Received: 27 March 2013 / Accepted: 22 May 2014 / Published online: 25 June 2014 © Springer-Verlag Berlin Heidelberg 2014

Abstract Many characteristics of sensorimotor control can be explained by models based on optimization and optimal control theories. However, most of the previous models assume that the central nervous system has access to precise knowledge of the sensorimotor system and its interacting environment. This viewpoint is difficult to justify theoretically and has not been convincingly validated by experiments. To address this problem, this paper presents a new computational mechanism for sensorimotor control from the perspective of adaptive dynamic programming (ADP), which shares some features of reinforcement learning. The ADP-based model for sensorimotor control suggests that a command signal for human movement is derived directly from real-time sensory data, without the need to identify the system dynamics. An iterative learning scheme based on the proposed ADP theory is developed, along with a rigorous convergence analysis. Interestingly, the computational model advocated here is able to reproduce the motor learning behavior observed in experiments where a divergent force field or a velocity-dependent force field was present. In addition, this modeling strategy provides a clear way to perform stability analysis of the overall system. Hence, we conjecture that human sensorimotor systems use an ADP-type mechanism to control movements and to achieve successful adaptation to uncertainties present in the environment.

This work has been supported in part by the National Science Foundation Grants DMS-0906659, ECCS-1101401, and ECCS-1230040. Y. Jiang · Z.-P. Jiang (B) Control and Networks Laboratory, Department of Electrical and Computer Engineering, Polytechnic School of Engineering, New York University, 5 Metrotech Center, Brooklyn, NY 11201, USA e-mail: [email protected]

Keywords Optimal control · Motor learning · Endpoint stiffness · Adaptive dynamic programming

1 Introduction

Over the years, extensive research has been performed on goal-oriented movements to study why human beings choose stereotyped movements, such as roughly straight hand paths with bell-shaped speed profiles (Morasso 1981), among infinitely many possibilities. The computational nature of sensorimotor control has been studied from many angles (Franklin and Wolpert 2011). One widely accepted view is that the central nervous system (CNS) favors trajectories for which a certain cost function is minimized. Indeed, many optimization and optimal feedback control models have been proposed (Diedrichsen et al. 2010; Flash and Hogan 1985; Harris and Wolpert 1998; Hogan and Flash 1987; Jiang et al. 2011b; Qian et al. 2013; Scott 2004; Todorov and Jordan 2002; Todorov 2005; Uno et al. 1989, and references therein). Although these models can explain many aspects of motor behavior, they assume that the motor system dynamics and the environment are known a priori. As a result, the CNS has to first identify the system dynamics and then calculate the optimal control policy. In the presence of perturbations, most of the previous models have relied on sensory prediction errors to form an estimate of the perturbation (Berniker and Kording 2008; Bhushan and Shadmehr 1999; Davidson and Wolpert 2003; Kording et al. 2007; Shadmehr and Mussa-Ivaldi 1994; Wolpert and Ghahramani 2000). However, this viewpoint is difficult to justify theoretically and has not been convincingly validated by experiments. Indeed, evidence against source-identified adaptation is reported by Hudson and Landy (2012), where a self-generated perturbation was created, but this perturbation was not identified or



used in formulating the control policy. Therefore, the mechanism of human sensorimotor control remains an important research topic that deserves further investigation, especially when uncertainties are taken into account. Sensorimotor control in the presence of uncertainties can be studied in the context of reinforcement learning (RL) (Sutton and Barto 1998). RL is a theory in machine learning that studies how an agent iteratively modifies its actions based on the observed responses from its interacting environment (Doya 2000; Doya et al. 2001; Sutton and Barto 1998). The study of RL was originally inspired by the decision-making processes of animals and humans, who are observed to take and improve actions based on rewards received from an uncertain environment. Doya et al. (2001) discussed brain areas that can realize the steps of RL and suggested a learning scheme for neurons based on temporal difference. An actor-critic-based optimal learner was used by Izawa and Shadmehr (2011), where an RL scheme was proposed to directly update the motor command. However, it is difficult to perform convergence and stability analysis for this model. Franklin et al. (2008) introduced a simple V-shape algorithm which accounts for the ability of the CNS to improve motor skills with practice by iteratively adjusting motor commands. This model was further validated by Tee et al. (2010), and its convergence and stability properties were studied by Yang et al. (2011). A model reference iterative learning control (ILC) method was developed by Zhou et al. (2012). Nevertheless, in Yang et al. (2011) and Zhou et al. (2012), the motor system dynamics have to be either fully known or identifiable using nonlinear adaptive control schemes. In addition, the resultant control policies are generally not optimal. The objective of this paper is to investigate sensorimotor control behavior from the perspective of adaptive dynamic programming (ADP). Research in ADP can be traced back to the seminal work of Werbos (1968, 1974); it is a systematic way to perform RL using iterative techniques from dynamic programming (Bellman 1957) while avoiding the so-called curse of dimensionality of dynamic programming (Powell 2011). As in other RL methods, ADP is a non-model-based approach that directly updates the control policy without the need to identify the system dynamics. Hence, it can reasonably be used to model sensorimotor learning behavior. ADP theory has been extensively studied for discrete-time systems. In particular, action-dependent heuristic dynamic programming (Werbos 1989), also known as Q-learning (Watkins 1989), has found numerous applications in engineering. ADP is considerably more difficult for continuous-time systems than for discrete-time systems, and many results developed for discrete-time systems cannot be directly generalized to the continuous-time setting (Lewis and Vrabie 2009). Early work for continuous-time systems was done by Murray et al. (2002) and Vrabie et al. (2009), where


partial knowledge of the system dynamics is still required. Recently, an ADP method for deterministic continuous-time systems with completely unknown dynamics was developed by Jiang and Jiang (2012a). To address the sensorimotor control problem, in this paper we first extend the main result of Jiang and Jiang (2012a) to stochastic linear systems with signal-dependent noise (Harris and Wolpert 1998) by using Itô calculus (Itô 1944). Through numerical studies and rigorous analysis, we argue that the recently developed ADP theory is a desirable candidate for the mechanism of human sensorimotor control. Fundamental limitations of previous optimization models are overcome by means of this approach. It is expected that the ADP approach provides further insight into how the CNS processes sensory information to coordinate movements in the presence of uncertainties. A detailed learning algorithm for the sensorimotor system is developed, and its convergence property is rigorously analyzed. In addition, stability analysis based on our proposed ADP method is surprisingly simple compared with previous optimization models. To validate this new computational mechanism of sensorimotor control, we show that it can reproduce results nearly identical to those observed in the experiments conducted by Burdet et al. (2001) and Franklin et al. (2003). It can also be used to study the modification of stiffness by the CNS.

2 ADP theory

2.1 Problem formulation

To study sensorimotor control, we consider the following system governed by stochastic differential equations:

$$ dx = Ax\,dt + Bu\,dt + B\sum_{i=1}^{q} C_i u\, d\eta_i \tag{1} $$

where A ∈ R^{n×n} and B ∈ R^{n×m} are constant matrices describing the system dynamics, with the pair (A, B) assumed to be stabilizable (i.e., there exists a constant matrix K_0 ∈ R^{m×n} such that A − BK_0 is a Hurwitz matrix, in the sense that all of its eigenvalues lie in the open left-half plane), u ∈ R^m is the control signal, the η_i are independent scalar Brownian motions, and the C_i ∈ R^{m×m} are constant matrices, for i = 1, 2, ..., q. The control objective is to determine a linear control policy

$$ u = -Kx \tag{2} $$

which minimizes the following cost function

$$ J = \int_0^{\infty} \left( x^T Q x + u^T R u \right) dt \tag{3} $$


for the nominal system of (1) (i.e., system (1) with η_i = 0 for all i = 1, 2, ..., q), where Q = Q^T ≥ 0, R = R^T > 0, and (A, Q^{1/2}) is observable. By observability, we mean that the solution x(t) of the system dx = Ax dt is identically zero whenever the output y = Q^{1/2} x is identically zero (Lewis and Syrmos 1995). According to linear optimal control theory (Lewis and Syrmos 1995), when both A and B are accurately known, the solution to this problem can be found by solving the following well-known algebraic Riccati equation (ARE)

$$ A^T P + P A + Q - P B R^{-1} B^T P = 0. \tag{4} $$

By the assumptions mentioned above, (4) has a unique symmetric positive definite solution P^* ∈ R^{n×n}. The optimal feedback gain matrix K^* in (2) can thus be determined by

$$ K^* = R^{-1} B^T P^*. \tag{5} $$
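As a point of reference for later sections, (4) and (5) can be evaluated numerically whenever A and B happen to be known. The following is a minimal sketch, not part of the original article, using SciPy's continuous-time ARE solver; the function name and variables are illustrative only.

```python
# Minimal sketch (illustrative, not from the paper): model-based solution of (4)-(5).
import numpy as np
from scipy.linalg import solve_continuous_are

def lqr_gain(A, B, Q, R):
    """Return (P*, K*) for the nominal infinite-horizon problem (3)."""
    P = solve_continuous_are(A, B, Q, R)   # unique symmetric positive definite solution of the ARE (4)
    K = np.linalg.solve(R, B.T @ P)        # K* = R^{-1} B^T P*, Eq. (5)
    return P, K
```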

In the presence of the signal-dependent noise η_i, the closed-loop system is mean-square stable (Kleinman 1969a) if

$$ 0 > (A - BK^*) \otimes I_n + I_n \otimes (A - BK^*) + \sum_{i=1}^{q} \left( BC_i K^* \right) \otimes \left( BC_i K^* \right). \tag{6} $$

If the constant matrices C_i, i = 1, 2, ..., q, are so small that (6) holds, the control policy u = −K^*x is called robust optimal, i.e., it is optimal in the absence of the noise η_i and remains stabilizing in the presence of η_i.

2.2 Policy iteration

To solve (4), which is nonlinear in P, the following policy iteration algorithm from reinforcement learning can be applied (Kleinman 1969b):

Policy iteration
1. Find an initial stabilizing feedback gain matrix K_0, such that A − BK_0 is Hurwitz. Let k = 0.
2. Solve P_k from
$$ 0 = (A - BK_k)^T P_k + P_k (A - BK_k) + Q + K_k^T R K_k \tag{7} $$
3. Improve the control policy by
$$ K_{k+1} = R^{-1} B^T P_k \tag{8} $$
4. Let k ← k + 1, and go to Step 2.

Kleinman (1969b) has proved that the sequences {P_k} and {K_k} iteratively determined from (7) and (8) have the following properties: 1) A − BK_k is Hurwitz, 2) P^* ≤ P_{k+1} ≤ P_k, and 3) lim_{k→∞} K_k = K^*, lim_{k→∞} P_k = P^*.

2.3 ADP for linear stochastic systems with signal-dependent noise

The policy iteration algorithm relies on perfect knowledge of the system dynamics, because the system matrices A and B appear in Eqs. (7) and (8). Jiang and Jiang (2012a) have shown that, in the deterministic case, when A and B are unknown, equivalent iterations can be achieved using online measurements. Here, we extend the methodology of Jiang and Jiang (2012a) to deal with stochastic linear systems with signal-dependent noise and to find the optimal control policy online without assuming a priori knowledge of A and B. To begin with, let us rewrite the original system (1) as

$$ dx = (A - BK_k)x\,dt + B(dw + K_k x\,dt) \tag{9} $$

where

$$ dw = u\,dt + \sum_{i=1}^{q} C_i u\, d\eta_i \tag{10} $$

represents the combined signal received by the motor plant from the input channel. Now, let us define A_k = A − BK_k, Q_k = Q + K_k^T R K_k, and M_k = B^T P_k B. Then, by Itô's lemma (Itô 1944), along the solutions of (9) it follows that

$$ d(x^T P_k x) = x^T (A_k^T P_k + P_k A_k) x\,dt + 2(K_k x\,dt + dw)^T B^T P_k x + dw^T B^T P_k B\,dw = -x^T Q_k x\,dt + 2(K_k x\,dt + dw)^T B^T P_k x + dw^T M_k\,dw \tag{11} $$

Notice that dw^T M_k dw ≠ 0; indeed, since (dt)^2 = 0, dη_i dη_j = 0 for i ≠ j, and (dη_i)^2 = dt,

$$ dw^T M_k\,dw = \Big( u\,dt + \sum_{i=1}^{q} C_i u\,d\eta_i \Big)^T M_k \Big( u\,dt + \sum_{i=1}^{q} C_i u\,d\eta_i \Big) = u^T M_k u\,(dt)^2 + \sum_{i=1}^{q} u^T C_i^T M_k C_i u\,(d\eta_i)^2 + \sum_{1 \le i \ne j \le q} u^T C_i^T M_k C_j u\,d\eta_i\,d\eta_j = \sum_{i=1}^{q} u^T C_i^T M_k C_i u\,dt $$

Next, integrating both sides of (11) from t to t + δt, we obtain


$$ x(t+\delta t)^T P_k\, x(t+\delta t) - x(t)^T P_k\, x(t) = -\int_t^{t+\delta t} \left( x^T Q x + u_k^T R u_k \right) dt + \int_t^{t+\delta t} dw^T M_k\, dw + 2 \int_t^{t+\delta t} (K_k x\,dt + dw)^T R\, K_{k+1} x \tag{12} $$

where u_k = −K_k x. Notice that (12) plays an important role in separating the system dynamics from the iterative process: the system matrices required in (7) and (8) can now be replaced by state and input information measured in real time. We now show that, given a matrix K_k such that A − BK_k is Hurwitz, a pair of matrices (P_k, K_{k+1}), with P_k = P_k^T > 0, satisfying (7) and (8) can be uniquely determined without knowing A or B. To this end, we define the following two operators, P ∈ R^{n×n} → P̂ ∈ R^{n(n+1)/2} and x ∈ R^n → x̄ ∈ R^{n(n+1)/2}, where

$$ \hat{P} = \left[ p_{11},\; 2p_{12},\; \ldots,\; 2p_{1n},\; p_{22},\; 2p_{23},\; \ldots,\; 2p_{n-1,n},\; p_{nn} \right]^T, $$
$$ \bar{x} = \left[ x_1^2,\; x_1 x_2,\; \ldots,\; x_1 x_n,\; x_2^2,\; x_2 x_3,\; \ldots,\; x_{n-1} x_n,\; x_n^2 \right]^T. $$

In addition, by the Kronecker product representation (Horn 1990), we have (K_k x dt + dw)^T R K_{k+1} x = [x ⊗ R(K_k x dt + dw)]^T vec(K_{k+1}). Further, for a sufficiently large positive integer l_k > 0, we define the matrices δ_{x,k} ∈ R^{l_k × n(n+1)/2}, I_{q,k} ∈ R^{l_k}, I_{xv,k} ∈ R^{l_k × mn}, and I_{u,k} ∈ R^{l_k × m(m+1)/2}, such that

$$ \delta_{x,k} = \begin{bmatrix} \bar{x}^T(t_{1,k}) - \bar{x}^T(t_{0,k}) \\ \bar{x}^T(t_{2,k}) - \bar{x}^T(t_{1,k}) \\ \vdots \\ \bar{x}^T(t_{l_k,k}) - \bar{x}^T(t_{l_k-1,k}) \end{bmatrix}, \quad I_{q,k} = \begin{bmatrix} \int_{t_{0,k}}^{t_{1,k}} \left( x^T Q x + u_k^T R u_k \right) dt \\ \int_{t_{1,k}}^{t_{2,k}} \left( x^T Q x + u_k^T R u_k \right) dt \\ \vdots \\ \int_{t_{l_k-1,k}}^{t_{l_k,k}} \left( x^T Q x + u_k^T R u_k \right) dt \end{bmatrix}, $$

$$ I_{xv,k} = \begin{bmatrix} \int_{t_{0,k}}^{t_{1,k}} x^T \otimes (K_k x\,dt + dw)^T R \\ \int_{t_{1,k}}^{t_{2,k}} x^T \otimes (K_k x\,dt + dw)^T R \\ \vdots \\ \int_{t_{l_k-1,k}}^{t_{l_k,k}} x^T \otimes (K_k x\,dt + dw)^T R \end{bmatrix}, \quad I_{u,k} = \begin{bmatrix} \int_{t_{0,k}}^{t_{1,k}} \bar{w}^T d\tau \\ \int_{t_{1,k}}^{t_{2,k}} \bar{w}^T d\tau \\ \vdots \\ \int_{t_{l_k-1,k}}^{t_{l_k,k}} \bar{w}^T d\tau \end{bmatrix} $$

where 0 ≤ t_{l_{k-1},k-1} ≤ t_{0,k} < t_{1,k} < ⋯ < t_{l_k,k} < t_{0,k+1}. Therefore, (12) implies the following compact form of linear equations

$$ \Theta_k \begin{bmatrix} \hat{P}_k \\ \hat{M}_k \\ \mathrm{vec}(K_{k+1}) \end{bmatrix} = -I_{q,k} \tag{13} $$

where Θ_k ∈ R^{l_k × (n(n+1)/2 + m(m+1)/2 + mn)} is defined as

$$ \Theta_k = \left[ \delta_{x,k},\; -I_{u,k},\; -2 I_{xv,k} \right]. $$

To guarantee the existence and uniqueness of the solution to (13), we assume that Θ_k has full column rank for all k ∈ Z_+. As a result, (13) can be directly solved as follows:

$$ \begin{bmatrix} \hat{P}_k \\ \hat{M}_k \\ \mathrm{vec}(K_{k+1}) \end{bmatrix} = -\left( \Theta_k^T \Theta_k \right)^{-1} \Theta_k^T I_{q,k}. \tag{14} $$

It is worth noticing that, when sensory noise is taken into account, numerical errors may occur when computing the matrices I_{q,k}, I_{xv,k}, and I_{u,k}. Consequently, the solution of (14) can be viewed as the least-squares solution of (13). Alternatively, an approximation of the unique solution of (13) can be obtained using a recursive least-squares method (Ljung 1999).

2.4 The ADP algorithm

Now, we are ready to give the following ADP algorithm for practical online implementation.

Adaptive dynamic programming algorithm
1. Find an initial stabilizing control policy u_0 = −K_0 x, and set k = 0.
2. Apply u_k = −K_k x as the control input on the time interval [t_{0,k}, t_{l_k,k}]. Compute δ_{x,k}, I_{q,k}, I_{xv,k}, and I_{u,k}.
3. Solve P_k, M_k, and K_{k+1} from (14).
4. Let k ← k + 1, and go to Step 2.

Compared with most of the existing models for motor adaptation, the proposed ADP algorithm can be used to study both online learning during a single trial and learning across different trials. In the latter case, the interval [t_{0,k}, t_{l_k,k}] should be taken from the time duration of a single trial.
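Viewed computationally, (13)–(14) amount to a single batch least-squares problem once the data matrices have been accumulated from measured trajectories. The sketch below is illustrative only and not part of the original article; the arguments delta_x, I_q, I_xv, and I_u stand for data matrices built exactly as δ_{x,k}, I_{q,k}, I_{xv,k}, and I_{u,k} above.

```python
# Illustrative sketch (not from the paper): one ADP policy update via Eq. (14).
import numpy as np

def vech_indices(n):
    """Index pairs (i <= j) in the ordering used for P-hat and x-bar above."""
    return [(i, j) for i in range(n) for j in range(i, n)]

def mat_from_phat(p_hat, n):
    """Rebuild a symmetric matrix from its weighted half-vectorization (off-diagonals stored as 2*p_ij)."""
    P = np.zeros((n, n))
    for k, (i, j) in enumerate(vech_indices(n)):
        P[i, j] = P[j, i] = p_hat[k] if i == j else p_hat[k] / 2.0
    return P

def adp_update(delta_x, I_q, I_xv, I_u, n, m):
    """Solve Eq. (14) and return (P_k, M_k, K_{k+1}) from one batch of trial data."""
    Theta = np.hstack([delta_x, -I_u, -2.0 * I_xv])       # regressor of Eq. (13)
    sol = np.linalg.lstsq(Theta, -I_q, rcond=None)[0]     # least-squares solution of Eq. (14)
    n_p, n_m = n * (n + 1) // 2, m * (m + 1) // 2
    P_k = mat_from_phat(sol[:n_p], n)
    M_k = mat_from_phat(sol[n_p:n_p + n_m], m)
    K_next = sol[n_p + n_m:].reshape((m, n), order="F")   # vec() stacks columns
    return P_k, M_k, K_next
```

Each call to such an update corresponds to Step 3 of the algorithm above.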


2.5 Convergence analysis

The convergence property of the proposed algorithm can be summarized in the following theorem.

Theorem 1 Suppose Θ_k has full column rank for all k ∈ Z_+ and A − BK_0 is Hurwitz. Then, the sequences {K_k}, {P_k}, and {M_k} obtained from (14) satisfy lim_{k→∞} P_k = P^*, lim_{k→∞} K_k = K^*, and lim_{k→∞} M_k = B^T P^* B.

Proof Given a stabilizing feedback gain matrix K_k, if P_k = P_k^T is the solution of (7), then K_{k+1} and M_k are uniquely determined by K_{k+1} = R^{-1} B^T P_k and M_k = B^T P_k B, respectively. By (12), we know that P_k, K_{k+1}, and M_k satisfy (14). On the other hand, let P = P^T ∈ R^{n×n}, M ∈ R^{m×m}, and K ∈ R^{m×n} be such that

$$ \Theta_k \begin{bmatrix} \hat{P} \\ \hat{M} \\ \mathrm{vec}(K) \end{bmatrix} = -I_{q,k}. \tag{15} $$

Then, we immediately have P̂ = P̂_k, M̂ = M̂_k, and vec(K) = vec(K_{k+1}). Since Θ_k has full column rank, P = P^T, M = M^T, and K are unique. In addition, by the definitions of P̂, M̂, and vec(K), the matrices P_k = P, M_k = M, and K_{k+1} = K are uniquely determined. Therefore, the policy iteration (14) is equivalent to (7) and (8). By Kleinman (1969b), the convergence is thus proved.


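Since Theorem 1 establishes equivalence with the model-based iteration (7)–(8), its convergence claims can also be checked numerically in the idealized case where A and B are available. The sketch below is illustrative only (not part of the paper) and compares the iteration against the ARE solution.

```python
# Illustrative check (not from the paper): Kleinman's iteration (7)-(8) converges to the ARE solution.
import numpy as np
from scipy.linalg import solve_continuous_are, solve_continuous_lyapunov

def kleinman_iteration(A, B, Q, R, K0, iters=10):
    """Run (7)-(8) from a stabilizing gain K0; P_k decreases monotonically toward P*."""
    K = K0
    for _ in range(iters):
        Ak = A - B @ K
        Qk = Q + K.T @ R @ K
        P = solve_continuous_lyapunov(Ak.T, -Qk)   # Eq. (7): Ak^T P + P Ak + Qk = 0
        K = np.linalg.solve(R, B.T @ P)            # Eq. (8): K_{k+1} = R^{-1} B^T P_k
    return P, K

# Usage idea: P_star = solve_continuous_are(A, B, Q, R); np.allclose(P, P_star) should hold.
```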

3 Numerical results

3.1 Open-loop model of the motor system

We adopted the proposed ADP algorithm to model arm movements in force fields and to reproduce results similar to those observed in experiments (Burdet et al. 2001; Franklin et al. 2003). For simulation purposes, we used the following mathematical model describing two-joint arm movements (Liu and Todorov 2007):

$$ dp = v\,dt \tag{16} $$
$$ m\,dv = (a - bv + f)\,dt \tag{17} $$
$$ \tau\,da = (u - a)\,dt + d\xi \tag{18} $$

where p = [p_x, p_y]^T, v = [v_x, v_y]^T, a = [a_x, a_y]^T, u = [u_x, u_y]^T, and f = [f_x, f_y]^T are the two-dimensional hand position, velocity, acceleration state, control signal, and external force generated by the field, respectively; m denotes the mass of the hand, b is the viscosity constant, τ is the time constant, and dξ denotes the signal-dependent noise, given by

$$ d\xi = \begin{bmatrix} c_1 & 0 \\ 0 & c_1 \end{bmatrix} \begin{bmatrix} u_x \\ u_y \end{bmatrix} d\eta_1 + \begin{bmatrix} 0 & c_2 \\ c_2 & 0 \end{bmatrix} \begin{bmatrix} u_x \\ u_y \end{bmatrix} d\eta_2 \tag{19} $$

Table 1 Parameters of the linear model

Parameters   Description          Value   Dimension
m            Hand mass            1.3     kg
b            Viscosity constant   10      N s/m
τ            Time constant        0.05    s
c1           Noise magnitude      0.075
c2           Noise magnitude      0.025

where η_1 and η_2 are two standard and independent Brownian motions, and c_1 > 0 and c_2 > 0 are constants describing the magnitude of the signal-dependent noise. The values of the parameters are specified in Table 1. It is worth mentioning that this model is similar to many other linear models describing arm movements (see, for example, Harris and Wolpert 1998; Flash and Hogan 1985; Tanaka et al. 2006; Izawa et al. 2008; Todorov and Jordan 2002; Wei and Körding 2010; Zhou et al. 2012), to which our ADP theory is also applicable.

3.2 Determining the initial stabilizing control policy

The proposed ADP-based online learning methodology requires an initial stabilizing control policy. To be more specific, we need to find an initial stabilizing feedback gain matrix K_0 ∈ R^{2×6} such that the closed-loop matrix A − BK_0 is Hurwitz. By robust control theory (Zhou et al. 1996), it is possible to find such a matrix K_0 if upper and lower bounds on the elements of both A and B are available and the pair (A, B) is stabilizable. Indeed, these bounds can be estimated by the CNS during the first several trials. For example, in the absence of disturbances (i.e., f ≡ 0), we have

$$ A - BK_0 = \begin{bmatrix} 0 & I_2 & 0 \\ 0 & -\frac{b}{m} I_2 & \frac{1}{m} I_2 \\ 0 & 0 & -\frac{1}{\tau} I_2 \end{bmatrix} - \begin{bmatrix} 0 \\ 0 \\ \frac{1}{\tau} I_2 \end{bmatrix} K_0 \tag{20} $$

Then, the first several trials in the NF can be interpreted as the exploration of an initial stabilizing feedback gain matrix K_0, by estimating the bounds on the parameters b, m, and τ, and solving a robust control problem. Indeed, if the CNS finds out that b ∈ [−8, 12], m ∈ [1, 1.5], and τ ∈ [0.03, 0.07] through the first several trials, the feedback gain matrix K_0 can thus be selected as

$$ K_0 = \begin{bmatrix} 100 & 0 & 10 & 0 & 10 & 0 \\ 0 & 100 & 0 & 10 & 0 & 10 \end{bmatrix}. \tag{21} $$

Once K 0 is obtained, the CNS can use the proposed ADP method to approximate the optimal control policy.
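To make Sects. 3.1 and 3.2 concrete, one can assemble the six-state matrices implied by (16)–(18) with the Table 1 parameters and verify that the gain in (21) is indeed stabilizing in the null field (f ≡ 0), following the block structure of (20). The code below is an illustrative sketch, not part of the original article.

```python
# Illustrative sketch (not from the paper): null-field arm model and stability check of K0 in (21).
import numpy as np

m, b, tau = 1.3, 10.0, 0.05                 # Table 1: hand mass, viscosity, time constant
I2, Z2 = np.eye(2), np.zeros((2, 2))

# State x = (p, v, a); dynamics (16)-(18) with f = 0, arranged as in Eq. (20).
A = np.block([[Z2, I2, Z2],
              [Z2, -(b / m) * I2, (1.0 / m) * I2],
              [Z2, Z2, -(1.0 / tau) * I2]])
B = np.vstack([Z2, Z2, (1.0 / tau) * I2])

K0 = np.array([[100.0, 0, 10, 0, 10, 0],
               [0, 100.0, 0, 10, 0, 10]])   # Eq. (21)

print("A - B K0 is Hurwitz:", bool(np.all(np.linalg.eigvals(A - B @ K0).real < 0)))
```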


Fig. 1 Illustration of the parameter θ and dm that determine the weighting matrix Q. p(t) and p ∗ (t) are the perturbed and unperturbed movement trajectories, respectively. dm denotes the magnitude of the largest deviation of the hand position from the unperturbed position. The angular position of this deviation with respect to the x-axis is denoted by θ. Notice that dm and θ need to be changed only when the CNS realizes that the force field has been changed

3.3 Selection of the weighting matrices

Here, we model the selection of Q and R by the CNS. To begin with, notice that the symmetric weighting matrices Q ∈ R^{6×6} and R ∈ R^{2×2} contain as many as 24 independent parameters. To reduce redundancy, we assume

$$ Q = \begin{bmatrix} Q_0 & 0 & 0 \\ 0 & 0.01 Q_0 & 0 \\ 0 & 0 & 0.00005 Q_0 \end{bmatrix}, \tag{22} $$
$$ R = I_2, \tag{23} $$

where Q_0 ∈ R^{2×2} is a symmetric matrix. Now, we assume the CNS specifies the task-dependent matrix Q_0 as follows:

$$ Q_0(d_m, \theta) = \begin{bmatrix} \alpha_0 & 0 \\ 0 & \alpha_1 \end{bmatrix} + \alpha_2 d_m \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix} \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix}^T, \tag{24} $$

where d_m denotes the magnitude of the largest deviation of the hand position in the previous trials from the unperturbed trajectory, and θ is the angular position of this largest deviation (see Fig. 1). Notice that both d_m and θ can be approximated by the CNS from previous trials. The positive constants α_0, α_1, and α_2 are determined based on the subject's long-term experience and are assumed to be invariant in our simulations. The model (24) provides a unified way for the CNS to specify the weighting matrices with respect to different tasks. In the next two subsections, we show that this model is compatible with experimental observations.
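A small illustrative sketch (not part of the original article) of how (22)–(24) turn the quantities (d_m, θ) and the constants (α_0, α_1, α_2) into the weighting matrices is given below; the default numerical values are the ones quoted in Sect. 3.4 for the VF simulation.

```python
# Illustrative sketch (not from the paper): task-dependent weighting matrices from Eqs. (22)-(24).
import numpy as np
from scipy.linalg import block_diag

def weighting_matrices(dm, theta, a0=5e4, a1=1e5, a2=1e6):
    """Return (Q, R) per Eqs. (22)-(24); theta is given in radians."""
    e = np.array([np.cos(theta), np.sin(theta)])
    Q0 = np.diag([a0, a1]) + a2 * dm * np.outer(e, e)   # Eq. (24)
    Q = block_diag(Q0, 0.01 * Q0, 0.00005 * Q0)         # Eq. (22)
    R = np.eye(2)                                       # Eq. (23)
    return Q, R

# Example: values used for the velocity-dependent field simulation in Sect. 3.4.
Q, R = weighting_matrices(dm=0.15, theta=np.deg2rad(144.0))
```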


3.4 Sensorimotor control in a velocity-dependent force field

We used the proposed ADP method to simulate the experiment conducted by Franklin et al. (2003); the simulation results are shown in Figs. 2, 3, 4, and 5. In that experiment, human subjects were seated and asked to move a parallel-link direct drive air-magnet floating manipulandum (PFM) to perform a series of forward arm reaching movements in the horizontal plane. All the subjects performed reaching movements from a start point located 0.25 m away from the target. The experiment tested human muscle stiffness and motor adaptation in a velocity-dependent force field (VF). The VF produced a stable interaction with the arm. The force exerted on the hand by the robotic interface in the VF was set to be

$$ \begin{bmatrix} f_x \\ f_y \end{bmatrix} = \chi \begin{bmatrix} 13 & -18 \\ 18 & 13 \end{bmatrix} \begin{bmatrix} v_x \\ v_y \end{bmatrix} \tag{25} $$

where χ ∈ [2/3, 1] is a constant that can be adjusted to the subject's strength. In our simulation, we set χ = 0.7. Subjects in the experiment (Franklin et al. 2003) first practiced in the null field (NF). Trials were considered successful if they ended inside the target within the prescribed time of 0.6 ± 0.1 s. After enough successful trials were completed, the force field was activated without notifying the subjects. Then, the subjects practiced in the VF until enough successful trials were achieved. After a short break, the subjects performed several movements in the NF. These trials were called after effects and were recorded to confirm that adaptation to the force field did occur. More details of the experimental setting can be found in Franklin et al. (2003) and Gomi and Kawato (1996).

First, we simulated the movements in the NF. The simulation started with an initial stabilizing control policy that can be found by the CNS as explained in Sect. 3.2. During each trial, we collected the online data to update the control policy once. After enough trials, an approximate optimal control policy in the NF can be obtained. The stiffness, which is defined as a graphical depiction of the elastic restoring force corresponding to a unit displacement of the hand for the subject in the force fields (Burdet et al. 2001), can also be numerically computed; it can be represented as an ellipse by plotting the elastic force produced by a unit displacement (Mussa-Ivaldi et al. 1985). We ran the simulation multiple times with different values of α_0 and α_1 and found that, by setting α_0 = 5 × 10^4 and α_1 = 1 × 10^5, the resultant stiffness ellipse has good consistency with the experimental observations (Franklin et al. 2003). Since in the NF we have d_m ≈ 0, the parameter α_2 does not contribute to the specification of Q_0.

Then, we proceeded with the simulations in the VF. The first trial was under the same control policy as used in the NF. After the first trial in the VF, the CNS can find the largest deviation of the first trial from the unperturbed movement. Hence, d_m and θ can be determined, and the CNS can modify the weighting matrix Q using the model (24). In our simulation, d_m and θ were set to 0.15 and 144°.



Fig. 2 Simulated movement trajectories using the proposed learning scheme. a Five successful movement trajectories of one subject in the NF. b The first five consecutive movement trajectories of the subject when exposed to the VF. c Five consecutive movement trajectories of the subject in the VF after 30 trials. d Five independent after-effect trials. The simulated movements in a and c correspond to the data we fit. Notice that the evolution in b and the after effects in d are predicted by our model

Then, the same Q matrix was kept unchanged for the remaining trials in our simulation, since the CNS can soon realize that the new movements were conducted in the same force field. The ADP-based learning mechanism was applied starting from the second trial. The initial feedback gain matrix we used for the second trial was obtained by tripling the gains of the control policy in the NF. This is to model the experimental observation by Franklin et al. (2008) that muscle activities increased dramatically after the first trial. We ran the simulation multiple times with different values of α_2 and found good consistency with the experimental results (Franklin et al. 2003) by setting α_2 = 1 × 10^6. The stiffness ellipses are shown in Fig. 4; one can compare them with the experimental observations in Franklin et al. (2003). After 30 trials, the feedback control gain was updated to

$$ K_{30} = \begin{bmatrix} 289.93 & 79.02 & 95.11 & -17.78 & 2.56 & -0.55 \\ -243.53 & 376.30 & -6.49 & 135.07 & -0.55 & 2.86 \end{bmatrix}. $$

For comparison, the optimal feedback gain matrix is provided as follows:

$$ K^{*} = \begin{bmatrix} 293.88 & 79.91 & 95.92 & -17.92 & 2.59 & -0.56 \\ -248.62 & 381.36 & -7.17 & 136.41 & -0.56 & 2.89 \end{bmatrix}. $$

The simulated movement trajectories, the velocity curves, and the endpoint force curves are shown in Figs. 2 and 3. It should be pointed out that the simulated performance in the NF and the after-learning behavior in the VF correspond to the data we fit for α_0, α_1, and α_2. By contrast, the evolution in the VF and the after effects are predicted by our model. It can be seen that the simulated movement trajectories in the NF are approximately straight lines, and the velocity curves along the y-axis are bell-shaped. These simulation results are consistent with experimental observations as well as with the curves produced by previous models (Morasso 1981; Flash and Hogan 1985; Todorov 2005).

After the subject was exposed to the VF, the first trial was simulated with the same feedback control policy as in the NF, because subjects in the experiment were not notified when the external force was activated. Apparently, this control policy was not optimal, because the system dynamics had changed and the cost function was also different. Then, the ADP algorithm proceeded. In Fig. 2, we see that the first trial gives a movement trajectory that deviated far from the straight path but eventually reached the target. Motor adaptation can be observed by comparing the first five consecutive trials. After 30 trials, the movement trajectories return to being straight lines, and the velocity curves become bell-shaped again. This implies that, after 30 trials in the VF, the CNS can learn the optimal control policy well using real-time data, without knowing or using the precise system parameters. Finally, our numerical study clearly shows the after effects of the subject's behavior when the VF was suddenly deactivated. To better illustrate the learning behavior in the VF, we define the movement time t_f of each trial as the time duration from the beginning of the trial until the hand path enters and remains in the target area. The movement time and distance were then calculated and are shown in Fig. 5.
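For completeness, the way the velocity-dependent field (25) enters the open-loop model of Sect. 3.1 can be written out explicitly: since f = χ[[13, −18],[18, 13]]v acts through (17), only the velocity block of A changes. The sketch below is illustrative and not part of the original article.

```python
# Illustrative sketch (not from the paper): adding the velocity-dependent field (25) to the arm model.
import numpy as np

m, b, tau, chi = 1.3, 10.0, 0.05, 0.7
I2, Z2 = np.eye(2), np.zeros((2, 2))
V = chi * np.array([[13.0, -18.0], [18.0, 13.0]])   # f = V v, Eq. (25)

A_nf = np.block([[Z2, I2, Z2],                      # null-field dynamics, state x = (p, v, a)
                 [Z2, -(b / m) * I2, (1.0 / m) * I2],
                 [Z2, Z2, -(1.0 / tau) * I2]])
A_vf = A_nf.copy()
A_vf[2:4, 2:4] += V / m                             # m dv = (a - b v + V v) dt in the VF
```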

3.5 Sensorimotor control in a divergent field

Now, let us describe how we simulated the sensorimotor control system in a divergent field (DF) using the proposed ADP theory (Figs. 6, 7, and 8). In the experiment conducted by Burdet et al. (2001), the DF produced a negative elastic force perpendicular to the target direction, computed as

$$ f = \begin{bmatrix} \beta & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} p_x \\ 0 \end{bmatrix} \tag{26} $$



Fig. 3 Simulated velocity and endpoint force curves show strong consistency with the experimental observations by Franklin et al. (2003). a Simulated trajectories of one subject in the NF. b Simulated trajectories of the subject when first exposed to the VF. c Simulated trajectories of the subject in the VF after 30 trials. d After-effect trials. Velocity curves are shown in the first and second rows, in which bell-shaped velocity curves along the y-axis (i.e., the movement direction) are clearly observed. Endpoint force curves are shown in the third and fourth rows. By comparing the first and third figures in the third row, we see that subjects adapted to the VF by generating a compensation force to counteract the force produced by the field. The shapes of the after-learning endpoint force curves are nearly identical to the experimentally measured endpoint forces reported by Franklin et al. (2003)

where β > 0 is a sufficiently large constant such that the overall system is unstable. In our simulations, we set β = 150. The simulation of the movements before the DF was applied is identical to the one described in the previous subsection, and an approximate optimal control policy in the NF has been obtained. However, this control policy is not stabilizing in the DF, and therefore an initial stabilizing control policy in the DF is needed. To be more specific, we need a matrix K_0 ∈ R^{2×6} such that

$$ A - BK_0 = \begin{bmatrix} 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ \frac{\beta}{m} & 0 & -\frac{b}{m} & 0 & \frac{1}{m} & 0 \\ 0 & 0 & 0 & -\frac{b}{m} & 0 & \frac{1}{m} \\ 0 & 0 & 0 & 0 & -\frac{1}{\tau} & 0 \\ 0 & 0 & 0 & 0 & 0 & -\frac{1}{\tau} \end{bmatrix} - \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ \frac{1}{\tau} & 0 \\ 0 & \frac{1}{\tau} \end{bmatrix} K_0 \tag{27} $$

is Hurwitz.

Therefore, we applied the same control policy learned in the NF to control the movements for the first five trials in the DF. As a result, unstable behaviors were observed in the first several trials (see Fig. 7b). Then, a stabilizing feedback control gain matrix K_0 was assumed available to the CNS, since the CNS has estimated the range of the unknown parameters β, b, m, and τ and should be able to find K_0 by solving a robust control problem. Here, we increased the first entry in the first row of the matrix K̂_nf by 350 and set the resultant matrix to be K_0, which is stabilizing. Then, we applied the proposed ADP algorithm with this K_0 as the initial stabilizing feedback gain matrix. Of course, K_0 can be selected in different ways; some alternative models describing the learning process from instability to stability can also be found in Franklin et al. (2008), Tee et al. (2010), Yang et al. (2011) and Zhou et al. (2012). The weighting matrix Q in the DF was determined using the same model (24).



Fig. 6 Illustration of the stiffness geometry in the DF. The stiffness in the DF (red) increased significantly in the direction of the external force, compared with the stiffness in the NF (green) (color figure online)

Values for the parameters α_0, α_1, and α_2 in (24) were the same as the ones used for the VF. Notice that in the DF we have θ = 0 and d_m = 0.03, which can be observed by the subject after the first trial in the DF. It can be seen in Figs. 7 and 8 that the simulated movement trajectories in the NF are approximately straight lines and their velocity curves are bell-shaped. It is easy to notice that the movement trajectories differ slightly from trial to trial. This is due to the motor output variability caused by the signal-dependent noise. When the subject was first exposed to the DF, these variations were further amplified by the DF. As a result, unstable behaviors were observed in the first several trials. In Figs. 7 and 8, it is clear that, once a stabilizing control policy is obtained, the proposed ADP scheme can be applied to generate an approximate optimal control policy. After 30 trials, the hand-path trajectories became approximately straight, as in the NF. This implies that the subject has learned to adapt to the dynamics of the DF. Indeed, after 30 trials in the DF, the feedback gain matrix has been updated to

$$ K_{30} = \begin{bmatrix} 911.62 & 0 & 104.04 & 0 & 2.18 & 0 \\ 0 & 312.94 & 0 & 69.47 & 0 & 2.06 \end{bmatrix}. $$

For comparison purposes, the optimal feedback gain matrix for the ideal case with no noise is shown below:

$$ K^{*}_{df} = \begin{bmatrix} 918.78 & 0 & 104.83 & 0 & 2.20 & 0 \\ 0 & 316.23 & 0 & 69.99 & 0 & 2.08 \end{bmatrix}. $$


Fig. 4 Illustration of the stiffness geometry in the VF. The stiffness in the VF (red) increased significantly in the direction of the external force, compared with the stiffness in the NF (green) (color figure online)


Fig. 5 a Movement duration as a function of the number of learning trials in the VF. b Movement distance as a function of the number of learning trials in the VF

Finally, we simulated the behavior of the subject when the force field is unexpectedly removed. From our simulation results, it is clear that the movement trajectories are even straighter than the trajectories in the NF. This is because the CNS has modified the weighting matrices and put more weight on the displacement along the x-axis.


Fig. 7 Simulated movement trajectories using the proposed learning scheme. a Five successful movement trajectories of one subject in the NF. b Five independent movement trajectories of the subject when exposed to the DF. The black lines on either side of trials in the DF indicate the safety zone, outside of which the force field was turned off. c Five consecutive movement trajectories of the subject in the divergent field after 30 trials. d Five consecutive after-effect trials

Fig. 8 Simulated velocity and endpoint force curves using the proposed learning scheme. a Simulated trajectories of one subject in the NF. b Simulated trajectories of the subject when first exposed to the DF. Some trials were terminated earlier than 0.6 s because they went outside the safety zone. c Simulated trajectories of the subject in the divergent force field after 30 trials. d After-effect trials. Velocity curves are shown in the first and second rows, in which bell-shaped velocity curves along the y-axis (i.e., the movement direction) are clearly observed. Endpoint force curves are shown in the third and fourth rows, in which we see that subjects adapted to the DF by generating a compensation force in the x-direction to counteract the force produced by the DF. In addition, the endpoint force curves are nearly identical to the experimentally measured data (Burdet et al. 2001)


Fig. 9 Log and power forms of Fitts's law. Crosses in the first row, the second row, and the third row represent movement times simulated in the NF, the VF, and the DF, respectively. Solid lines in a, c, e are least-squares fits using the log Fitts's law, and solid lines in b, d, f are least-squares fits using the power Fitts's law


Table 2 Data fitting for the log law and power law

Parameters      NF        VF        DF
a (Log law)     0.0840    0.1137    0.0829
b (Log law)     0.0254    −0.0376   0.0197
a (Power law)   0.3401    0.4101    0.3468
b (Power law)   −1.7618   −1.7796   −1.8048

As a result, the stiffness ellipses in the NF and the DF are apparently different, because the stiffness increased significantly in the direction of the divergent force (Fig. 6). The change of stiffness along the movement direction is not significant, as shown in our simulations. Again, our simulation results show that the CNS can learn and find an approximate optimal control policy using real-time data, without knowing the precise system parameters.

3.6 Fitts's law

According to Fitts (1954), the movement duration t_f required to rapidly move to a target area is a function of the distance d to the target and the size s of the target, and a logarithmic law is formulated to represent the relationship among the three variables t_f, d, and s as follows:

$$ t_f = a + b \log_2\!\left( \frac{d}{s} \right) \tag{28} $$

where a and b are two constants. In 1998, Schmidt and Lee (2011) proposed the following power law:

$$ t_f = a \left( \frac{d}{s} \right)^{b}. \tag{29} $$

Here, we validated our model using both the log law and the power law. The target size s is defined as its diameter, and the distance is fixed at d = 0.24 m. We simulated the movement times from the trials in the NF and the after-learning trials in the VF and the DF. The data fitting results are shown in Fig. 9 and Table 2. It can be seen that our simulation results are consistent with Fitts's law predictions.
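The least-squares fits in Fig. 9 and Table 2 can be reproduced in a few lines. A small illustrative sketch, not part of the original article, is given below; (29) is fitted after a log transformation, and the arrays s and t_f are placeholder data, not the simulated values used in the paper.

```python
# Illustrative sketch (not from the paper): fitting the log law (28) and power law (29).
import numpy as np

d = 0.24                                         # fixed target distance (m)
s = np.array([0.02, 0.03, 0.04, 0.06, 0.08])     # placeholder target diameters (m)
t_f = np.array([0.62, 0.55, 0.50, 0.44, 0.40])   # placeholder movement times (s)

# Log law (28): t_f = a + b*log2(d/s), linear in [1, log2(d/s)].
a_log, b_log = np.linalg.lstsq(np.column_stack([np.ones_like(s), np.log2(d / s)]),
                               t_f, rcond=None)[0]

# Power law (29): t_f = a*(d/s)**b, i.e., ln(t_f) = ln(a) + b*ln(d/s).
ln_a, b_pow = np.linalg.lstsq(np.column_stack([np.ones_like(s), np.log(d / s)]),
                              np.log(t_f), rcond=None)[0]
a_pow = np.exp(ln_a)
```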

4 Discussion

4.1 Non-model-based learning

Most of the previous models for sensorimotor control have assumed that the CNS has precise knowledge of the motor system and its interacting environment (Diedrichsen et al. 2010; Flash and Hogan 1985; Harris and Wolpert 1998; Jiang et al. 2011b; Qian et al. 2013; Scott 2004; Todorov and Jordan 2002; Todorov 2005; Uno et al. 1989). The computation of optimal control laws is based on this assumption. By contrast, the proposed ADP methodology is a non-model-based approach and suggests that the optimal control policy is derived using real-time sensory data and is robust to dynamic uncertainties such as signal-dependent noise. In the absence of external disturbances, our model can generate the typically observed position, velocity, and endpoint force curves, as produced by the previous models. As one of the key differences from existing sensorimotor models, our proposed computational mechanism suggests that, when confronted with unknown environments and imprecise dynamics, the CNS may update and improve its command signals for movement through learning and repeated trials.

In the presence of perturbations, most of the previous models have relied on sensory prediction errors to form an estimate of the perturbation (Berniker and Kording 2008; Kording et al. 2007; Wolpert and Ghahramani 2000). However, this viewpoint is difficult to justify theoretically and has not been convincingly validated by experiments. Indeed, evidence against this source-identified adaptation is reported by Hudson and Landy (2012), where a self-generated perturbation was created but was not identified or used in formulating the control policy. This is consistent with the learning scheme we proposed in this paper. Indeed, our ADP-based learning scheme does not identify the dynamics of the force fields. Instead, optimal control policies in the presence of force fields are directly obtained through successive approximations.

By simulating the experiments conducted by Burdet et al. (2001) and Franklin et al. (2003), we have found that our computational results match well with the experimental results. In particular, our simulation results show gradual adaptation to the unknown force fields, with nearly identical movement trajectories in the first several consecutive trials reported in the experiment (Franklin et al. 2003). The simulated post-learning velocity and endpoint force curves fit well with the experimental observations (Burdet et al. 2001; Franklin et al. 2003). Our simulations also clearly demonstrated the after-effect phenomenon.

4.2 Stability and convergence properties

Several reinforcement-learning-based models for motor adaptation have been developed in the past literature (Doya 2000; Izawa and Shadmehr 2011). However, it is not easy to analyze the convergence and the stability properties of these learning schemes. In this paper, we have extended ADP theory to continuous-time linear systems with signal-dependent noise and applied this theory to model sensorimotor control. An added value of the proposed ADP methodology is that a rigorous convergence and stability analysis is given and is surprisingly simple by means of linear optimal control theory.

4.3 Muscle stiffness

Several practical methods for experimentally measuring stiffness have been proposed in the past literature (Burdet et al. 2000; Gomi and Kawato 1996; Hogan 1985), and the changes of stiffness in force fields were reported by Burdet et al. (2001) and Franklin et al. (2003). However, how the CNS modifies the stiffness geometry and achieves optimal motor behavior remains a largely open question. The optimization mechanism suggested by Burdet et al. (2001), and later extended by Franklin et al. (2008), suggests that the CNS minimizes both the hand-path error relative to the undisturbed trajectory and the effort. Although this optimization mechanism predicts experimental results rather well, the model proposed by Burdet et al. (2001) may not be directly related to classical optimal control theory. On the other hand, stiffness may not be well analyzed using other models based on finite-horizon optimal control theory (see, for example, Todorov 2005; Harris and Wolpert 1998), because those models use time-varying control policies, leading to continuous change of the stiffness during the movement. In the ADP-based model, time-invariant control policies are computed, and it is comparatively easy to analyze the muscle stiffness by studying the position feedback gains. Our modeling methodology implies that the change of stiffness results from the modification of the weighting matrices by the CNS and the change of the system dynamics. In addition, our model suggests that the different stiffness geometries of different individuals may be a consequence of the different weighting matrices they select. Therefore, compared with other models of motor control and motor adaptation, our modeling strategy can naturally explain the change of stiffness observed in experiments from the viewpoint of optimal feedback control (Lewis and Syrmos 1995).

4.4 Data fitting for the weighting matrices

The weighting matrices we used in the numerical simulation are selected such that our computational results have qualitative consistency with the experimental results (Burdet et al. 2001; Franklin et al. 2003). If accurate human motor data become available, better fits for the weights can be obtained using a two-loop optimization approach (Jiang et al. 2011a). The inner loop uses the proposed ADP method to approximate an optimal control policy and generate the stiffness ellipses. The outer loop compares the error between the simulated stiffness ellipses and the experimental observations and adjusts the parameters α_0, α_1, and α_2 to minimize the error.

4.5 Infinite-horizon optimal control

During each trial, a time-invariant control policy is suggested by our methodology. The main advantage of a time-invariant control policy is that the movement duration does not need to be prefixed by the CNS. This seems more realistic, because the duration of each movement differs from trial to trial due to the signal-dependent noise and is difficult to prefix. Todorov and Jordan (2002) suggested that, if the target is not reached at the predicted reaching time, the CNS can plan an independent new trajectory between the actual position of the hand and the final target, and the final trajectory will be the superposition of all the trajectories. By contrast, our model matches the intuitive notion that the motor system keeps moving the hand toward the target until it is reached, and much less computational burden is required. Our simulation results match well with Fitts's law predictions. In addition, this type of control policy can also be used to analytically derive Fitts's law, as illustrated by Qian et al. (2013).

4.6 Comparison with iterative learning control

Iterative learning control (ILC) is an open-loop control scheme that iteratively learns a feedforward signal based on the tracking error of the previous trials (Bristow et al. 2006). It has also been used to model motor learning (Zhou et al. 2012). To some extent, the ILC theory is compatible with experimental observations in which the hand, after a perturbation, returned to the unperturbed trajectories (e.g., Milner 1993). However, conventional ILC methods use an open-loop control policy for each individual trial. Hence, they may not explain the feedback control mechanism involved in each individual movement, which is essential in generating bell-shaped velocity curves. Furthermore, compared with ILC, the ADP-based learning method is capable of handling perturbations caused by a change of the target during the movement.

5 Conclusions and future work

We have developed an ADP method for linear stochastic systems with signal-dependent noise with the objective of modeling goal-oriented sensorimotor control systems. An appealing feature of this new computational mechanism of sensorimotor control is that the CNS does not rely upon a priori knowledge of the system dynamics and the environment to generate a command signal for hand movement. Our model reproduced nearly identical behaviors to those observed in experiments (Burdet et al. 2001; Franklin et al. 2003). Our theory can explain the change of stiffness geometry from a perspective of adaptive optimal feedback control versus nonadaptive optimal control theory (Qian et al. 2013; Todorov and Jordan 2002; Todorov 2005). Therefore, we argue that human sensorimotor systems may use a mechanism in the spirit of ADP to control movements and to achieve successful adaptation when uncertainties are present in the environment. It will be interesting to extend the proposed methodology from the more general perspective of robust ADP (Jiang and Jiang 2012b, 2013b) and to deal with sensorimotor systems with nonlinear dynamics (such as the model of a robotic manipulator) and dynamic uncertainties. Some preliminary work in a deterministic setting has been accomplished by Jiang and Jiang (2013a, 2014). In addition, experimental human motor data will be collected to better validate and improve the proposed computational methodology.

Acknowledgments We would like to thank the Editor and anonymous reviewers for the constructive comments that are helpful for improving the presentation of this paper.

References

Bellman RE (1957) Dynamic programming. Princeton University Press, Princeton, NJ
Berniker M, Kording K (2008) Estimating the sources of motor errors for adaptation and generalization. Nat Neurosci 11(12):1454–1461
Bhushan N, Shadmehr R (1999) Computational nature of human adaptive control during learning of reaching movements in force fields. Biol Cybern 81(1):39–60
Bristow DA, Tharayil M, Alleyne AG (2006) A survey of iterative learning control. IEEE Control Syst Mag 26(3):96–114
Burdet E, Osu R, Franklin D, Yoshioka T, Milner T, Kawato M (2000) A method for measuring endpoint stiffness during multi-joint arm movements. J Biomech 33(12):1705–1709
Burdet E, Osu R, Franklin DW, Milner TE, Kawato M (2001) The central nervous system stabilizes unstable dynamics by learning optimal impedance. Nature 414(6862):446–449
Davidson PR, Wolpert DM (2003) Motor learning and prediction in a variable environment. Curr Opin Neurobiol 13(2):232–237
Diedrichsen J, Shadmehr R, Ivry RB (2010) The coordination of movement: optimal feedback control and beyond. Trends Cognit Sci 14(1):31–39
Doya K (2000) Reinforcement learning in continuous time and space. Neural Comput 12(1):219–245
Doya K, Kimura H, Kawato M (2001) Neural mechanisms of learning and control. IEEE Control Syst Mag 21(4):42–54
Fitts PM (1954) The information capacity of the human motor system in controlling the amplitude of movement. J Exp Psychol 47(6):381–391
Flash T, Hogan N (1985) The coordination of arm movements: an experimentally confirmed mathematical model. J Neurosci 5(7):1688–1703
Franklin DW, Wolpert DM (2011) Computational mechanisms of sensorimotor control. Neuron 72(3):425–442
Franklin DW, Burdet E, Osu R, Kawato M, Milner TE (2003) Functional significance of stiffness in adaptation of multijoint arm movements to stable and unstable dynamics. Exp Brain Res 151(2):145–157
Franklin DW, Burdet E, Tee KP, Osu R, Chew CM, Milner TE, Kawato M (2008) CNS learns stable, accurate, and efficient movements using a simple algorithm. J Neurosci 28(44):11165–11173
Gomi H, Kawato M (1996) Equilibrium-point control hypothesis examined by measured arm stiffness during multijoint movement. Science 272:117–120
Harris CM, Wolpert DM (1998) Signal-dependent noise determines motor planning. Nature 394:780–784
Hogan N (1985) The mechanics of multi-joint posture and movement control. Biol Cybern 52(5):315–331
Hogan N, Flash T (1987) Moving gracefully: quantitative theories of motor coordination. Trends Neurosci 10(4):170–174
Horn RA (1990) Matrix analysis. Cambridge University Press, Cambridge
Hudson TE, Landy MS (2012) Adaptation to sensory-motor reflex perturbations is blind to the source of errors. J Vis 12(1):1–10
Itô K (1944) Stochastic integral. Proc Jpn Acad Ser A Math Sci 20(8):519–524
Izawa J, Shadmehr R (2011) Learning from sensory and reward prediction errors during motor adaptation. PLoS Comput Biol 7(3):e1002012
Izawa J, Rane T, Donchin O, Shadmehr R (2008) Motor adaptation as a process of reoptimization. J Neurosci 28(11):2883–2891
Jiang Y, Jiang ZP (2012a) Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics. Automatica 48(10):2699–2704
Jiang Y, Jiang ZP (2012b) Robust adaptive dynamic programming. In: Liu D, Lewis F (eds) Reinforcement learning and adaptive dynamic programming for feedback control, Chap 13. Wiley, New York, pp 281–302
Jiang Y, Jiang ZP (2013a) Robust adaptive dynamic programming for optimal nonlinear control design. arXiv preprint arXiv:1303.2247v1 [math.DS]
Jiang ZP, Jiang Y (2013b) Robust adaptive dynamic programming for linear and nonlinear systems: an overview. Eur J Control 19(5):417–425
Jiang Y, Jiang ZP (2014) Robust adaptive dynamic programming and feedback stabilization of nonlinear systems. IEEE Trans Neural Netw Learn Syst 25(5):882–893
Jiang Y, Chemudupati S, Jorgensen JM, Jiang ZP, Peskin CS (2011a) Optimal control mechanism involving the human kidney. In: The 50th IEEE conference on decision and control and European control conference (CDC–ECC), Orlando, FL, pp 3688–3693
Jiang Y, Jiang ZP, Qian N (2011b) Optimal control mechanisms in human arm reaching movements. In: Proceedings of the 30th Chinese control conference, IEEE, Yantai, China, pp 1377–1382
Kleinman D (1969a) On the stability of linear stochastic systems. IEEE Trans Autom Control 14(4):429–430
Kleinman D (1969b) Optimal stationary control of linear systems with control-dependent noise. IEEE Trans Autom Control 14(6):673–677
Kording KP, Tenenbaum JB, Shadmehr R (2007) The dynamics of memory as a consequence of optimal adaptation to a changing body. Nat Neurosci 10(6):779–786
Lewis F, Syrmos V (1995) Optimal control. Wiley, New York
Lewis FL, Vrabie D (2009) Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circuits Syst Mag 9(3):32–50
Liu D, Todorov E (2007) Evidence for the flexible sensorimotor strategies predicted by optimal feedback control. J Neurosci 27(35):9354–9368
Ljung L (1999) System identification. Wiley, London
Milner TE (1993) Dependence of elbow viscoelastic behavior on speed and loading in voluntary movements. Exp Brain Res 93(1):177–180
Morasso P (1981) Spatial control of arm movements. Exp Brain Res 42(2):223–227
Murray JJ, Cox CJ, Lendaris GG, Saeks R (2002) Adaptive dynamic programming. IEEE Trans Syst Man Cybern C Appl Rev 32(2):140–153
Mussa-Ivaldi FA, Hogan N, Bizzi E (1985) Neural, mechanical, and geometric factors subserving arm posture in humans. J Neurosci 5(10):2732–2743
Powell WB (2011) Approximate dynamic programming: solving the curses of dimensionality, 2nd edn. Wiley, London
Qian N, Jiang Y, Jiang ZP, Mazzoni P (2013) Movement duration, Fitts's law, and an infinite-horizon optimal feedback control model for biological motor systems. Neural Comput 25(3):697–724
Schmidt RA, Lee TD (2011) Motor control and learning: a behavioral emphasis, 5th edn. Human Kinetics
Scott SH (2004) Optimal feedback control and the neural basis of volitional motor control. Nat Rev Neurosci 5(7):532–546
Shadmehr R, Mussa-Ivaldi FA (1994) Adaptive representation of dynamics during learning of a motor task. J Neurosci 14(5):3208–3224
Sutton RS, Barto AG (1998) Reinforcement learning: an introduction. MIT Press, Cambridge
Tanaka H, Krakauer JW, Qian N (2006) An optimization principle for determining movement duration. J Neurophysiol 95(6):3875–3886
Tee KP, Franklin DW, Kawato M, Milner TE, Burdet E (2010) Concurrent adaptation of force and impedance in the redundant muscle system. Biol Cybern 102(1):31–44
Todorov E (2005) Stochastic optimal control and estimation methods adapted to the noise characteristics of the sensorimotor system. Neural Comput 17(5):1084–1108
Todorov E, Jordan MI (2002) Optimal feedback control as a theory of motor coordination. Nat Neurosci 5(11):1226–1235
Uno Y, Kawato M, Suzuki R (1989) Formation and control of optimal trajectory in human multijoint arm movement: minimum torque-change model. Biol Cybern 61(2):89–101
Vrabie D, Pastravanu O, Abu-Khalaf M, Lewis F (2009) Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica 45(2):477–484
Watkins C (1989) Learning from delayed rewards. PhD thesis, University of Cambridge, Cambridge
Wei K, Körding K (2010) Uncertainty of feedback and state estimation determines the speed of motor adaptation. Front Comput Neurosci 4:1–9
Werbos P (1968) The elements of intelligence. Cybernetica (Namur) (3)
Werbos P (1974) Beyond regression: new tools for prediction and analysis in the behavioral sciences. PhD thesis, Harvard University, Cambridge, MA
Werbos PJ (1989) Neural networks for control and system identification. In: Proceedings of the 28th IEEE conference on decision and control, pp 260–265
Wolpert DM, Ghahramani Z (2000) Computational principles of movement neuroscience. Nat Neurosci 3:1212–1217
Yang C, Ganesh G, Haddadin S, Parusel S, Albu-Schaeffer A, Burdet E (2011) Human-like adaptation of force and impedance in stable and unstable interactions. IEEE Trans Robot 27(5):918–930
Zhou K, Doyle JC, Glover K (1996) Robust and optimal control. Prentice Hall, New Jersey
Zhou SH, Oetomo D, Tan Y, Burdet E, Mareels I (2012) Modeling individual human motor behavior through model reference iterative learning control. IEEE Trans Biomed Eng 59(7):1892–1901
