

Stochastic Abstract Policies: Generalizing Knowledge to Improve Reinforcement Learning

Marcelo L. Koga, Valdinei Freire, and Anna H. R. Costa, Member, IEEE

Abstract—Reinforcement learning (RL) enables an agent to learn behavior by acquiring experience through trial-and-error interactions with a dynamic environment. However, knowledge is usually built from scratch and learning to behave may take a long time. Here, we improve learning performance by leveraging prior knowledge, so that the learner shows proper behavior from the beginning of a target task, using the knowledge from a set of known, previously solved, source tasks. In this paper, we argue that building stochastic abstract policies that generalize over past experiences is an effective way to provide such improvement, and that this generalization outperforms the current practice of using a library of policies. We achieve this by contributing a new algorithm, AbsProb-PI-multiple, and a framework for transferring knowledge, represented as a stochastic abstract policy, to new RL tasks. Stochastic abstract policies offer an effective way to encode knowledge because the abstraction they provide not only generalizes solutions but also facilitates extracting the similarities among tasks. We perform experiments in a robotic navigation environment, analyze the agent's behavior throughout the learning process, and assess the transfer ratio for different numbers of source tasks. We compare our method with the transfer of a library of policies, and experiments show that the use of a generalized policy produces better results by more effectively guiding the agent when learning a target task.

Index Terms—Knowledge transfer, machine learning.

Manuscript received August 13, 2013; revised January 6, 2014 and April 12, 2014; accepted April 16, 2014. This work was supported in part by FAPESP under Proc. 2011/19280-8, Proc. 2012/02190-9, and Proc. 2012/19627-0, and in part by CNPq under Proc. 311058/2011-6. This paper was recommended by Associate Editor M. Last. M. L. Koga and A. H. R. Costa are with the Escola Politécnica, Universidade de São Paulo, São Paulo, SP 05508-970, Brazil (e-mail: [email protected]; [email protected]). V. Freire is with the Escola de Artes, Ciências e Humanidades, Universidade de São Paulo, São Paulo, SP 05508-970, Brazil (e-mail: [email protected]). Digital Object Identifier 10.1109/TCYB.2014.2319733

I. INTRODUCTION

There are many decision problems an autonomous agent faces for which a solution cannot be found in advance, because the agent may not have all the information needed to model them. Reinforcement learning (RL) is a very powerful framework to tackle such scenarios [1], [2], as the agent learns by interacting with the environment. One issue with RL is that it may take a long time to show appropriate behavior, since it is based on repetitive trial-and-error interactions of the agent with the environment. Besides their direct interaction with the current environment, agents can also benefit from their own past experiences, i.e., the knowledge acquired from solving previous source tasks may be used to solve a new target task more effectively. To improve RL, recent research shows that prior knowledge obtained from similar tasks can better guide the learner agent in the exploration of the environment, so that the agent shows better behavior from the start of the learning process [3], [4].

Efforts made on leveraging past experiences generally follow two main approaches: case-based reasoning and knowledge generalization. One of the main concerns of the former lies in retrieving, from a collection of past experiences (cases), the relevant case(s) to aid learning in the target task [5]. Examples of case-based reasoning are the use of options [6], which are a set of policies able to solve small subtasks that the agent might use when facing a similar situation, and analogical model formulation [7], which uses a set of solutions to different problems to retrieve an analogous solution to the new problem. Celiberto Jr. et al. [8] also used cases as heuristics to achieve transfer learning, combining case-based reasoning with heuristically accelerated RL. Fernández and Veloso [9]–[11] proposed an RL algorithm that reuses a policy library containing policies of previous tasks; during the learning process, the probabilities of reusing each policy in the library are adjusted according to its usefulness. On the knowledge generalization side, the focus is on how to combine and represent the solutions of a number of past experiences, extracting their similarities and hence generalizing previous knowledge. One example is the TILDE algorithm [12], which induces a first-order logical decision tree (that represents a policy) from examples of solved tasks. Martín and Geffner [13] also explored this kind of approach by creating generalized policies, which are policies that, based on a number of solved problem instances, are suitable to solve any problem in a domain.

Besides choosing the best approach to take advantage of past experience, one also needs to choose what type of knowledge is transferred from one task to another, and how to represent such knowledge. Different kinds of knowledge have been explored: the transfer of value functions [14], [15], features extracted from the value functions [16], [17], heuristics [18], and policies [19]; and different representation types have been proposed, for example, graphs [20], relational representations [21], and tables [15]. Our approach for transferring knowledge encompasses several features: we generalize all previous solutions of source tasks into a single abstract policy, which is represented in a relational way and is used during the learning of a new target task, so that the learner agent shows appropriate behavior early in learning.



We use a relational representation of states and actions because this representation provides a natural abstraction of domains, thus enabling generalization of states and actions (and therefore, policies). A single abstract state (or action) encodes several ground states (or actions) [22], [23]. This abstraction facilitates the transfer of knowledge between different, but related, tasks, which only need to share features at an abstract level.

We focus on methods that are policy-based, i.e., all the transferred knowledge is encoded by policies. An advantage of policy-based transfer is that it requires at most a mapping between the source and the target state spaces (as well as the action spaces) [10], and searching directly in the policy space outperforms value-based algorithms when using abstraction [24], since abstraction in policies allows higher levels of abstraction than abstraction in value functions. In particular, we explore the use of stochastic abstract policies. In previous work, we introduced AbsProb-PI [25], a planning algorithm that builds a stochastic policy at an abstract level. The algorithm uses an abstract relational representation of states and actions to find an approximation of an optimal abstract policy for a given problem. In this paper, we extend AbsProb-PI into AbsProb-PI-multiple, enabling it to build a single policy that generalizes past experiences from a set of tasks. Finally, we propose a framework that allows the learner to show appropriate behavior by orchestrating exploitation of past experiences, random exploration, and exploitation of the new knowledge being acquired during learning in the new task. This framework is based on the PRQ-learning algorithm [9]–[11], which we modified for the use of abstract policies.

Thus, the main contributions of this paper are: 1) a new algorithm, AbsProb-PI-multiple, that builds a stochastic abstract policy derived from a set of previously known tasks and 2) a framework that uses this abstract policy to improve the agent behavior from the beginning of the target task learning in a similar domain. We use a robotic navigation domain for the experiments, and we evaluate how a single abstract policy obtained from a set of source tasks can effectively guide and accelerate learning in a new target task. Its performance is compared against the use of a library of policies built with the PLPR algorithm [9], [11]. The assessment is made with various sizes of the set of source tasks. Our experiments show that the use of a single generalized policy presents better results than the use of a library of past policies.

The remainder of the paper is organized as follows. Section II describes how to represent the problem with a relational representation, the relational Markov decision process (RMDP), the abstraction derived from this representation, and how to find an abstract policy for transfer. Section III formalizes the problem we want to solve. Section IV presents how stochastic abstract policies can be built from the set of source tasks, and Section V describes how the use of abstract policies can accelerate the learning process of a new target task. Section VI shows the results of the experiments in a robotic navigation context and, finally, Section VII concludes the paper.


II. RELATIONAL MARKOV DECISION PROCESS

We are interested in sequential decision problems, i.e., at each time step the agent first observes the state of the system, then chooses and executes an action that leads it to another state. A formalism widely used for this kind of problem is the Markov decision process (MDP) [26]. Its key concept is the Markov property: every state encodes all the information needed for taking the optimal decision in that state.

The MDP is defined as a tuple ⟨S, {As}, T, R, G, b0⟩. S is the set of states and As is the set of feasible actions at state s ∈ S. Let A = ∪_{s∈S} As; then T : S × A × S → [0, 1] is a transition function such that T(s, a, s′) = P(st+1 = s′ | st = s, at = a) is the probability of reaching state s′ at time t + 1 when the agent is in state s and executes action a at time t; R : S → ℝ is a bounded reward function, such that rt = R(st) is the reward received when the agent is in state st at time t; and G ⊂ S is a set of goal states. No transition leaves any goal state, i.e., T(s, a, s′) = 0, ∀s ∈ G, ∀a ∈ A, ∀s′ ∈ S. Finally, b0 : S → [0, 1] is the initial state probability distribution, such that b0(s) is the probability of state s being the first one in an episode, i.e., P(s0 = s) = b0(s).

A. Definition of Relational MDP

An RMDP [27] is an extension of the MDP formalism that uses a relational alphabet to describe states and actions. A relational alphabet Σ is a set of constants and predicates. If t1, . . . , tn are terms, each one being a constant (represented with lower-case letters) or a variable (represented with capital letters), and if p/n is a predicate symbol p with arity n ≥ 0, then p(t1, . . . , tn) is an atom. If an atom does not contain any variable, it is called a ground atom. Consider a language L that uses predicates and terms constructed from names in the signature Sig(L) = ⟨P, C⟩, where P is the set of predicates and C is the set of constants (here we consider that there are no function symbols). The Herbrand base of L, HB_L, is the set of ground atoms built from the terms of C and the predicates of P. The set of Herbrand interpretations, HI_L, is the set of every interpretation of HB_L, i.e., every truth-value combination for the ground atoms in HB_L.

Consider a relational alphabet Σ = C ∪ PS ∪ PA such that C is a set of constants, which represents the objects of the environment; PS is a set of state predicates used to describe properties of and relations among objects; and PA is a set of action predicates. A background knowledge B is a set of sentences that constrains the use of Σ. Given the language LS with signature Sig(LS) = ⟨PS, C⟩, the state space S of the RMDP is defined as a subset of the Herbrand interpretations HI_LS that satisfies the constraints imposed by B, i.e., S ⊂ {s ∈ HI_LS | s |= B}. The action set A is a subset of HB_LA, where LA is a language with signature Sig(LA) = ⟨PA, C⟩, again satisfying B; and the set of feasible actions As at state s is the subset of HB_LA such that each action together with state s satisfies B, i.e., As = {a ∈ A | a, s |= B}. Then the RMDP is defined as a tuple ⟨Σ, B, S, {As}, T, R, G, b0⟩, where Σ, B, S, and {As} were previously defined and T, R, G, and b0 are defined as in an MDP.
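To make these definitions concrete, the sketch below (ours, not the authors' implementation; all names and the atom encoding are illustrative) represents ground atoms as predicate/argument tuples, ground states as Herbrand interpretations restricted by B, and gathers the (R)MDP components into one container.

```python
# Minimal sketch of the (R)MDP tuple of Section II; illustrative only.
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Set, Tuple

Atom = Tuple[str, Tuple[str, ...]]     # e.g., ("p2", ("t1", "t2")) for the ground atom p2(t1, t2)
State = FrozenSet[Atom]                # a Herbrand interpretation that satisfies B
Action = Atom                          # actions are single (ground) atoms

@dataclass
class RMDP:
    states: Set[State]                            # S
    feasible: Callable[[State], Set[Action]]      # s -> A_s, the actions allowed by B in s
    T: Callable[[State, Action, State], float]    # T(s, a, s'): transition probability
    R: Callable[[State], float]                   # bounded reward function
    goals: Set[State]                             # G (no transition leaves a goal state)
    b0: Dict[State, float]                        # initial-state distribution
```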


B. State and Action Abstractions

A relational representation enables us to aggregate states and actions by using variables instead of constants in the predicate terms. A substitution θ is a set {X1/t1, . . . , Xn/tn} binding each variable Xi to a term ti; it can be applied to a term, an atom, or a conjunction. We call a set of atoms a conjunction, and we call abstract state σ (and abstract action α) a conjunction with no ground atom, where each variable in a conjunction is implicitly assumed to be existentially quantified. A ground state s (and a ground action a) can be represented by an abstract state by replacing every constant with a variable. We denote by Sσ the set of ground states covered by an abstract state σ. The abstract states form a partition of the set of states S, and we use a fixed level of abstraction, so each ground state is abstracted to only one abstract state. Similarly, we define Aα(s) as the set of all ground actions covered by an abstract action α in ground state s. We also define Sab and Aab as the set of all abstract states and the set of all abstract actions in an RMDP, respectively. To simplify notation, we assume that if an atom does not appear in a ground sentence, its negation holds.

As an example, consider an abstract state σ = {p1(X1), p2(X1, X2)}, σ ∈ Sab; a ground state s1 ∈ S, s1 = {p1(t1), p2(t1, t2)}, is abstracted by σ with θ = {X1/t1, X2/t2}, and a ground state s2 ∈ S, s2 = {p1(t3), p2(t3, t4)}, is abstracted by σ with θ′ = {X1/t3, X2/t4}. In this case, σ represents an abstraction of both s1 and s2. Now, consider a background knowledge B = {a1(X) ⇒ p1(X)}. Since this sentence must be satisfied, action a1(X) is only available if p1(X) is true. Let us consider an abstract action α = a1(X). In this case, if the agent is in ground state s = {p1(t1), p2(t1, t2)}, the abstract action α covers ground action a = a1(t1) under substitution θ = {X/t1}, but it does not cover ground action a′ = a1(t2). On the other hand, if the agent is in ground state s′ = {p1(t2), p2(t1, t2)}, abstract action α does not cover ground action a, but it does cover ground action a′ under substitution θ′ = {X/t2}.

C. Stochastic Abstract Policies

The task of the agent in an (R)MDP is to find a policy. An optimal policy π* is a policy that maximizes some function Rt of the future rewards rt, rt+1, rt+2, . . . We consider the usual definition of the sum of discounted rewards over an infinite horizon, Rt = Σ_{t=0}^{∞} γ^t rt, where 0 ≤ γ < 1 is the discount factor. In a decision problem, it is of interest to determine the most specific class of policies that guarantees optimality. In a ground MDP, the set of deterministic policies, π : S → A, is known to be sufficient for optimality [26], [28]. Since abstract spaces aggregate states and actions, the Markov property of the ground level of an RMDP may not hold at an abstract level of the same RMDP [25]. In this case, stochastic policies are appropriate, because they are more flexible, offering more than one choice of action per state, and can thus be arbitrarily better than deterministic policies [29]. We define a stochastic abstract policy as πab : Sab × Aab → [0, 1], with P(α | σ) = πab(σ, α), σ ∈ Sab, α ∈ Aab.
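The following sketch (ours; it reuses the atom encoding assumed in the previous snippet, and the policy table at the end is purely illustrative) shows how an abstract state covers a ground state via a substitution, reproducing the σ, s1 example above, and how a stochastic abstract policy can be stored as a table of action probabilities per abstract state.

```python
# Illustrative sketch, not the authors' code: matching an abstract state (atoms whose
# arguments are variables, written with capital letters) against a ground state.
def is_variable(term):
    return term[0].isupper()

def covers(sigma, state, theta=None):
    """Return a substitution theta abstracting `state` by `sigma`, or None if none exists."""
    theta = dict(theta or {})
    atoms = list(sigma)
    if not atoms:
        return theta
    (pred, args), rest = atoms[0], atoms[1:]
    for gpred, gargs in state:                      # try to match the first abstract atom
        if gpred != pred or len(gargs) != len(args):
            continue
        trial, ok = dict(theta), True
        for term, const in zip(args, gargs):
            if is_variable(term):
                if trial.setdefault(term, const) != const:
                    ok = False
                    break
            elif term != const:
                ok = False
                break
        if ok:
            result = covers(rest, state, trial)     # match the remaining abstract atoms
            if result is not None:
                return result
    return None

sigma = [("p1", ("X1",)), ("p2", ("X1", "X2"))]
s1 = frozenset({("p1", ("t1",)), ("p2", ("t1", "t2"))})
print(covers(sigma, s1))    # {'X1': 't1', 'X2': 't2'}, as in the example above

# A stochastic abstract policy: one probability distribution over abstract actions per
# abstract state (indexed by a label for brevity; a2 is a hypothetical second action).
pi_ab = {"sigma_1": {("a1", ("X1",)): 0.7, ("a2", ("X2",)): 0.3}}
```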


We must define the way an abstract policy πab can be used in any ground RMDP. Given a ground state s observed by the agent, we find an abstract state σ such that s ∈ Sσ, given by a function ξ : S → Sab that maps ground states into abstract states. Then, for each abstract action αi ∈ Aab, πab assigns a probability P(αi | σ). An abstract action αi is mapped into a set of ground actions Aαi(s) for the underlying ground decision problem. To produce a particular ground action, we select an abstract action αi ∈ Aab according to P(αi | σ), and then we randomly select (with uniform probability) a ground action a ∈ Aαi(s) from the set of ground actions associated with the selected abstract action αi. This process of finding a random ground action a given a stochastic abstract policy πab and a ground state s is denoted by a = grounding(πab, s).

The use of a relational representation and RMDPs enables us to generalize experiences and to define abstract policies, making it easier to transfer knowledge between tasks in different domains. In previous work, we showed that abstract policies can effectively be used to accelerate the building of a ground policy in RL [30], [31].
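The grounding procedure just described is straightforward to implement. The sketch below is ours (illustrative, not the authors' code); it assumes the caller supplies ξ as `xi(s)` and the mapping Aα(s) as `ground_actions(alpha, s)`.

```python
import random

# grounding(pi_ab, s): sample an abstract action according to P(alpha | sigma), then pick
# uniformly one of the ground actions it covers in s (illustrative sketch).
def grounding(pi_ab, s, xi, ground_actions, rng=random):
    sigma = xi(s)                                   # abstract state covering s
    actions, probs = zip(*pi_ab[sigma].items())     # the distribution P(. | sigma)
    alpha = rng.choices(actions, weights=probs, k=1)[0]
    candidates = list(ground_actions(alpha, s))     # A_alpha(s)
    return rng.choice(candidates)                   # uniform choice of a ground action
```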

III. PROBLEM DEFINITION

We are interested in the general problem of RL, i.e., given interactions {(st, at, rt, st+1)} of an agent within an (R)MDP, the agent must find an optimal policy. At any time t, the agent observes the current state st, chooses an action at, receives a reward rt, and the process transits to the next state st+1. Although an (R)MDP underlies the RL problem, the agent knows neither the transition function T nor the reward function R, but it has access to samples from both functions. Because the agent guides the process by choosing actions at each time t, it faces the exploitation/exploration dilemma: either the agent exploits knowledge obtained from past experiences to improve its current profit, or the agent explores actions to improve its knowledge, and hence its profit in the long run. Here, we are interested in the problem of RL when the agent can exploit knowledge obtained from previous tasks.

We define a domain class DC as the tuple ⟨PS, PA, B⟩. This means that RMDP problems that can use the same predicates to describe their states and actions with the same background knowledge, or in other words, RMDP problems that possess the same spaces of abstract states Sab and abstract actions Aab, are in the same domain class. When the set of objects and a transition function are added, we obtain ⟨DC, C, T⟩, which specifies a domain D; it describes the dynamics of the world, and the numbers of states and actions are now fixed. Finally, a task is a tuple ⟨D, R, G, b0⟩, where D is a domain, R is the reward function, G is the set of goal states that indicates desirable states of the domain, and b0 indicates the initial states. A task fully describes an RMDP. Furthermore, in this paper, we focus on knowledge transfer among different tasks within the same domain class. This is because abstract policies can be transferred among problems that are described by the same Sab and Aab. The problem we want to solve is described as follows.

Problem 1: Consider that an agent knows a set of previous tasks 1, 2, . . . , n, all from the same domain class DC. Then, the agent is given a new and unknown task l from domain class DC, which must be solved by RL.

Here, knowing a task means that, besides knowing the complete task description ⟨D, R, G, b0⟩, the agent can also process this knowledge to obtain preprocessed knowledge. For example, the agent may calculate optimal policies for each task and make use of them somehow. However, the agent has little information about the new task, i.e., it does not know the outcomes of its actions (T) or when it will receive rewards (R). It is only known that the new task l belongs to the same domain class as the previously known tasks. All tasks also have a similar type of goals and rewards (e.g., the agent only receives a non-null reward when it reaches the goal). Our interest here is in preprocessing the previous tasks to obtain optimal policies. In this case, the problem is analogous to the supervised learning problem, in which task-solution pairs are usually given. Usually tasks and solutions can be described in metric spaces, and we consider that tasks near each other in those spaces have solutions near each other.

The agent is going to use RL to find a policy πl to accomplish this new task l, but we want it to use its past knowledge in order to improve performance. In the following sections, we demonstrate how to obtain a generalized stochastic abstract policy from the set of past tasks and then how to use it on the new task. Moreover, we also compare this approach to a case-based one, by building a library of policies from the past tasks and then using this library to help solve the new task.

IV. LEVERAGING PAST KNOWLEDGE

In this paper, the past knowledge is encoded in a policy, more specifically, in a stochastic abstract policy. We explain how a generalized policy can be built from a set of complete models of source tasks 1, 2, . . . , n; first we describe the AbsProb-PI algorithm [25] and then our algorithm AbsProb-PI-multiple. Next, we describe the method of creating and composing a library with the policies of the source tasks, the PLPR algorithm [9], [11]. Finally, we give an example that illustrates the advantages and disadvantages of the two approaches.

A. Building a Generalized Policy

We propose that the knowledge from the known tasks be encoded in a single generalized (stochastic abstract) policy. This single policy defines a suboptimal solution for a whole set of source tasks. AbsProb-PI [25] is an algorithm that, given an RMDP, uses cumulative discounted reward evaluation to build an abstract policy. It is a model-based algorithm, i.e., it requires prior knowledge of the whole model, including transition and reward functions. Here, we extend it to deal with a set of tasks, giving rise to a new algorithm called AbsProb-PI-multiple, summarized as follows.

First of all, AbsProb-PI-multiple is based on the popular policy iteration algorithm, but it is designed to operate at an abstract level. At each iteration, the idea is to find an improved version of the current policy.

It is a gradient-based policy iteration algorithm [32], in which a gradient over the policy value function is used to determine the improvement direction. As the abstract state space Sab does not necessarily hold the Markov property, it is important to consider the initial state distribution b0. In order to do so, we must define the policy transition function T^πab : S × S → [0, 1] of an abstract policy πab. With ξ : S → Sab being the function that maps ground states into abstract states, we define T^πab by

T^πab(s, s′) = Σ_{α∈Aab} πab(ξ(s), α) Σ_{a∈Aα(s)} (1/|Aα(s)|) T(s, a, s′).    (1)

The value function V^πab : S → ℝ measures the quality of an abstract policy πab and needs to be calculated at each iteration. Consider a matrix-vector representation of the functions R, V^πab, and T^πab. With I being the identity matrix, we define V^πab by

V^πab = (I − γ T^πab)^{−1} R.    (2)

The gradient function G : Sab × Aab → ℝ is defined over the value function to point in the direction of the greatest rate of increase of the value. In the AbsProb-PI-multiple algorithm, we must define a parameter ε > 0 that guarantees that the policy is an ε-soft policy, in which all actions have a minimum probability of ε/|Aab(σ)| for every state σ ∈ Sab; the policy can at most converge to an ε-greedy policy, in which the greedy action has probability 1 − ε + ε/|Aab(σ)| and all others have probability ε/|Aab(σ)|, for every state σ ∈ Sab. We must also choose the step size used in the gradient descent method through the function ρ(i). In our experiments, we use a decreasing step size, i.e., ρ(i) = 1/(1 + i), where i is the iteration step.

Fig. 1 presents the AbsProb-PI-multiple algorithm. Its goal is to build a stochastic abstract policy πab given a set of n source tasks. πab is initialized arbitrarily, and then the algorithm iteratively refines this policy until some stopping criterion is met. The first step in each iteration (line 4) corresponds to the evaluation of the current policy for task j, j ∈ [1, n]; the algorithm tackles one task at a time within each iteration. The product C ∈ ℝ^S (line 5) denotes the expected cumulative occurrence of each state, considering policy πab and b0. Next, for each abstract action α, a policy π_{α,ε} is defined; this policy chooses action α with probability 1 − ε and chooses uniformly among all the other actions with probability ε. T^{π_{α,ε}} is the matrix-vector representation of the transition function of policy π_{α,ε}, defined similarly to (1). The difference Δ^{α,πab} ∈ ℝ^S that each policy π_{α,ε} causes in the value of each state is calculated (line 6). Then, G_j is calculated for each abstract state-abstract action pair (σ, α) by adding the values of Δ^{α,πab}(s) weighted by C(s), for all s ∈ Sσ (line 7). The previous steps calculate G_j for each task j. To consider the whole set of tasks, G is calculated by summing all G_j (line 8). The value of G represents the gradient of the value function, and its maximum value indicates which abstract action should be executed more often to achieve a better result (line 9). The policy is then updated in the direction of the maximum G, using a step size ρ (line 10), which can change at each iteration.
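The evaluation step of each iteration reduces to linear algebra on the ground model. The sketch below (ours, illustrative; the array layouts and helper names are assumptions, and the occupancy measure is shown in one common discounted formulation) computes Eq. (1), Eq. (2), and an occupancy vector C for a small task given as numpy arrays.

```python
import numpy as np

# Assumed inputs (illustrative): T[s, a, s2] ground transition tensor, R[s] rewards, b0[s]
# initial distribution, xi[s] index of the abstract state covering s, ground_of[s][alpha]
# list of ground-action indices covered by abstract action alpha in s, and pi_ab[sigma]
# a probability vector over abstract actions (nonzero only for actions feasible in sigma).
def abstract_transition(T, pi_ab, xi, ground_of):
    n_states = T.shape[0]
    T_pi = np.zeros((n_states, n_states))
    for s in range(n_states):
        for alpha, p_alpha in enumerate(pi_ab[xi[s]]):
            grounds = ground_of[s][alpha]            # A_alpha(s)
            if p_alpha == 0.0 or not grounds:
                continue
            for a in grounds:                        # uniform choice inside A_alpha(s), Eq. (1)
                T_pi[s] += p_alpha * T[s, a] / len(grounds)
    return T_pi

def evaluate(T_pi, R, gamma):
    # V = (I - gamma * T_pi)^-1 R, Eq. (2)
    return np.linalg.solve(np.eye(len(R)) - gamma * T_pi, R)

def occupancy(T_pi, b0, gamma):
    # Expected (discounted) cumulative occurrence of each state under pi_ab and b0.
    return np.linalg.solve(np.eye(len(b0)) - gamma * T_pi.T, b0)
```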

Fig. 1. AbsProb-PI-multiple algorithm for finding a generalized stochastic abstract policy for a set of known tasks.

AbsProb-PI-multiple treats the set of tasks as if they were one single problem, thereby trying to find a solution that satisfies all of them. The application of the algorithm to the set of known tasks finds a generalized stochastic abstract policy, and this resulting policy is the knowledge that is transferred from the source tasks to be applied in a target task. It is worth noticing that the method described to acquire knowledge from source tasks does not use RL algorithms; in this paper, the source tasks are treated as planning tasks, because our concern is using this knowledge to accelerate RL in a target task.

B. Building a Library of Policies

We compare our approach to another algorithm for leveraging past knowledge, the PLPR algorithm [9]. It uses a library of policies instead of a single one. The library contains a collection of policies, each one being the solution to a source task. However, the library does not necessarily contain all possible policies; to prevent the library from having two policies that are too similar, and thereby avoid the overhead of selecting between them during the learning process, there must be a criterion to decide whether or not to include a policy in the library. This criterion is a key step of the PLPR algorithm [9], and we briefly describe it here.

The algorithm to build the library is given in Fig. 2.

Fig. 2. PLPR algorithm for building a library of policies [9].

To build the library, the source tasks are tackled one at a time, following a certain order. We use an algorithm to find a policy πi for each source task i (line 3). It can be any algorithm that finds an abstract policy for an RMDP, such as TILDE [12], ND-TILDE [30], or even AbsProb-PI [25] itself. Policy πi is added to the library (line 8) if the maximum reward obtained in task i using any of the policies currently in the library is smaller than δ times the average reward Wi obtained by following πi, where Wi is calculated by

Wi = Σ_{s∈S} b0(s) E[ Σ_{t=0}^{∞} γ^t rt | πi, s0 = s ].    (3)
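A sketch of the inclusion test as we read it (ours, illustrative; Wi and the per-policy rewards would in practice come from evaluating each policy on task i, e.g., via Eq. (3)):

```python
# Add pi_i to the library only if no stored policy already reaches a delta fraction of the
# reward W_i that pi_i itself obtains on task i (illustrative reading of the PLPR criterion).
def maybe_add_to_library(library, pi_i, W_i, W_library_on_task_i, delta):
    if not library or max(W_library_on_task_i) < delta * W_i:
        library.append(pi_i)
    return library
```

With δ = 0 the test never fires once the library is non-empty, and with δ = 1 almost every new policy is added, matching the behavior of δ described next.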

Parameter δ controls how permissive the library is to new policies: if δ = 0, only one policy (the first) is stored; on the other hand, if δ = 1, it is most likely that all the learned policies are included. In our experiments, various values of δ are used.

Before we proceed, we would like to illustrate the use of the two approaches with a simple example, so that we can give some intuition about the use of a method based either on cases or on generalizations.

C. Illustrative Example

Let us consider a simple example: a grid-world where the agent actions are to move one cell north, south, east, or west, resulting in the corresponding movement unless there is a wall in the desired direction. Besides, the agent cannot see farther than one cell, i.e., it can only perceive adjacent cells, and each cell represents a state. The initial state is always the top-left one, and the agent's task is to go to the target location, minimizing the number of steps. There are several different (stationary) environments in this world, and we define one task in each of these environments. There are two source tasks, illustrated in Fig. 3.

Fig. 3. Source tasks 1 and 2. White cells are dirt, light gray cells are grass, the dotted pattern cell is sand, and dark gray cells are rocks. Thick black lines represent walls, the initial state is the top-left cell, and the target is marked with circles. (a) Task 1. (b) Task 2.

Fig. 4. Transferred knowledge. Optimal policies for the source tasks (π1 for task 1 and π2 for task 2); πg is a generalized policy that suboptimally satisfies both tasks 1 and 2. The policies are actually stochastic; the arrows represent the most likely action in each state. (a) π1. (b) π2, πg.

One can notice that each task has six ground states. In this environment, there are four types of terrain: dirt, sand, grass, and rocks. Different states with the same type of terrain have the same abstract representation and, therefore, are in the same abstract state. There is thus a total of four abstract states in each task. As we focus on the transfer of abstract policies, actions are defined per abstract state. In this simple example, we use abstraction only for states; there is no abstraction for actions.

In other words, the abstract actions are exactly the same as the ground actions. Both tasks are in the same domain class, but as they have different transition functions, they are in different domains. In both tasks, the agent always starts on dirt terrain and has to reach a target on rocky terrain.

We can see that the agent has basically one major decision to make: should it follow the sandy or the grassy path? In task 1, the optimal policy is to follow the sandy path, whereas in task 2 the best choice is the grassy one. This is reflected in policies π1 and π2 (Fig. 4), which solve tasks 1 and 2, respectively, and both of which are included in the library of policies. If we follow the generalization approach, instead of having a library with two policies, a single generalized policy πg is built. The intuition is that, while each task has a different optimal path, in both tasks the grassy path can effectively lead the agent to the goal. The generalized policy thus assigns a higher probability to following the grassy path which, despite not being optimal for every source task, is the best global solution (following the sandy path in task 2 would result in a dead-end). In this example, coincidentally, πg is exactly equal to π2.

Now consider a target task 3, shown in Fig. 5(a), in which the agent is learning with RL. It is similar to the source tasks, but it is a new and unseen task.

Fig. 5. Target tasks 3 and 4: tasks to be learned with the aid of past policies (π1 and π2, or πg). Arrows indicate the optimal policy for these tasks. (a) Task 3. (b) Task 4.

Policies π1, π2, and πg are used to guide exploration in the environment, expecting the learning process to be faster than simply using random actions. Further details on how policies are applied to improve performance in RL are given in Section V. If the agent uses the policy library, it can choose either π1 or π2 to help it. When the agent chooses π1, the guidance provided by the policy is extremely helpful, as the most probable action for the dirt and sand cells is "go right," which matches the optimal policy for task 3. When the agent chooses π2, it provides some help, as it does lead the agent to the goal, but through a longer way than π1 (through the grassy path). If the agent uses the generalized policy, it does not have to perform any policy selection; it just uses πg to accelerate its learning as much as π2 would do. The purpose of using these past policies is to prevent the agent from struggling to find a way at the beginning of the RL process. With their help, the agent shows good behavior early and, as learning progresses, it eventually finds the optimal policy for that task.

Let us consider another target task, 4, shown in Fig. 5(b). In this case, π2 and πg are the ones that provide the best help. On the other hand, π1 is not very helpful; actually, it leads the agent to a dead-end. This example is designed to illustrate the intuition behind the choice of building either a generalized policy or a library. When using a library of policies, some policies might be helpful while others might not. It is important for the agent to rapidly realize this during the learning process of a target task to avoid negative transfer [33], i.e., to avoid decreasing the performance of the agent when compared to one without any transfer. The generalized policy approach is a solution to this issue, even if it may not always be the most effective guide. Nevertheless, our experiments show that, when looking for better results on average, the generalized approach performs better. Section VI covers this comparison in detail.

V. IMPROVING PERFORMANCE IN REINFORCEMENT LEARNING

An appealing idea to circumvent the inherent problem of erratic behavior in the early steps of RL methods is that of using knowledge from tasks that the agent has already faced and successfully solved. More specifically, the agent could speed up the learning process by applying to new situations (abstract) policies it has obtained from previous tasks. To achieve this acceleration through policy transfer, one can use the knowledge the policies convey to bias the exploration process. During the RL process, the agent must balance the exploration of the environment and the exploitation of both the past policy and the policy being learned. The framework we propose here combines the π-guidance exploration strategy with the PRQ-learning algorithm [11], both adapted to handle stochastic abstract policies. The π-guidance exploration strategy meets the need for balancing exploration and exploitation with the aid of a parameter ψ (0 ≤ ψ ≤ 1). It combines ideas of other reuse strategies [11], [31], modified for stochastic abstract policies, and is detailed in Fig. 6.


Fig. 6. π-guidance algorithm, used during episodes of a maximum of H steps.

Under the π-guidance exploration strategy, the agent follows the given (or selected) abstract policy with probability ψ and the policy being learned with probability 1 − ψ, in which case the learned policy is exploited following an ε-greedy strategy. This must be done because random exploration is always necessary for convergence. At each step within an episode, ψ decays exponentially by a factor given by parameter v. This prevents the agent from greedily following the past policy over the whole episode; sometimes the influence of the past policy may cause the agent to insist on bad behavior, which is why this influence decays over the episode. At the end of the π-guidance execution, the average reinforcement received in the episode is calculated and assigned to variable W. As this amount of reinforcement was obtained with the aid of the reused past policy, W can then be used as a similarity metric between the past policy and the policy being learned.

This idea is explored in the PRQ-learning algorithm [11], which speeds up the learning process of Q-learning by reusing one or more policies and probabilistically selecting the best policy to reuse at each episode, according to W. The objective of PRQ-learning is to solve a task with the aid of a set L = {π1, π2, . . . , πm} of m past policies. The policy being learned, πl, is added to the set so that all policies follow the same criteria for being selected. With Wi being the average reward received when reusing policy πi, 1 ≤ i ≤ m + 1, the algorithm selects a policy from L at each episode following a softmax strategy, i.e., the probability of choosing each policy πi is correlated with Wi. Once a policy is selected, it is used throughout the episode, following the π-guidance strategy, except when πl is selected, in which case the agent simply follows an ε-greedy strategy. The PRQ-learning algorithm is illustrated in Fig. 7. Here, we use a slightly modified version of the original PRQ-learning algorithm; we changed the exploration strategy to π-guidance (line 13), so it can handle stochastic abstract policies.
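A sketch of one π-guidance episode (ours; it follows the description above rather than Fig. 6 verbatim, and the environment interface, the Q-table `q` stored as a `defaultdict(float)`, and the helpers `grounding` and `feasible` are assumptions):

```python
import random

def pi_guidance_episode(env, q, pi_ab, grounding, feasible,
                        H=100, psi=1.0, v=0.95, eps=0.05, gamma=0.95, mu=0.05):
    """Run one episode, mixing reuse of the abstract policy with epsilon-greedy Q-learning."""
    s = env.reset()
    episode_return, discount = 0.0, 1.0
    for _ in range(H):
        if random.random() < psi:                    # exploit the past abstract policy
            a = grounding(pi_ab, s)
        elif random.random() < eps:                  # random exploration
            a = random.choice(list(feasible(s)))
        else:                                        # exploit the policy being learned
            a = max(feasible(s), key=lambda act: q[(s, act)])
        s2, r, done = env.step(a)
        target = r if done else r + gamma * max(q[(s2, a2)] for a2 in feasible(s2))
        q[(s, a)] += mu * (target - q[(s, a)])       # Q-learning update
        episode_return += discount * r
        discount *= gamma
        psi *= v                                     # psi decays within the episode
        s = s2
        if done:
            break
    return episode_return                            # feeds the estimate W of the reused policy
```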

Fig. 7. PRQ-learning algorithm. We modified the original algorithm [11] to use the π-guidance strategy (line 13).
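The selection step of PRQ-learning can be sketched as follows (ours, illustrative; τ is the temperature discussed in the next paragraph, and the running-average update of W is one common formulation):

```python
import math
import random

def select_policy(W, tau):
    """Softmax choice: policy i is picked with probability proportional to exp(tau * W[i])."""
    weights = [math.exp(tau * w) for w in W]
    return random.choices(range(len(W)), weights=weights, k=1)[0]

def update_stats(W, U, i, episode_return, tau, delta_tau):
    """Running average of the reward obtained when reusing policy i; tau grows each episode."""
    W[i] = (W[i] * U[i] + episode_return) / (U[i] + 1)
    U[i] += 1
    return tau + delta_tau
```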

Parameter τ grows along the learning process, with an increase of Δτ at each episode. The greater τ is, the more sensitive the probabilities of selecting each policy are to differences in W.

The framework for leveraging past knowledge via stochastic abstract policies is now fully defined. In summary, the PRQ-learning algorithm is used in conjunction with the π-guidance exploration strategy, whether the transferred knowledge is a library of policies or a generalized policy. In the former case, the set of policies comprises one or more policies, while in the latter, the set comprises a single abstract policy. The next section describes the experiments and evaluations of this framework when applied to robotic navigation tasks.

VI. EXPERIMENTS

This section describes the experiments made to evaluate the effectiveness of the use of stochastic abstract policies in accelerating RL. First, the domain class used is detailed, and then the results are presented.

A. Navigation Domain Class

Let us describe the domain class used in our experiments.



Fig. 8. Navigation domain. Black areas represent walls, darker cells represent rooms, and lighter cells represent corridors. Doors are marked with “d,” the goals of each source task with “S,” and the goals of each target task with “T.”

There is an indoor, stationary environment that contains two kinds of places: rooms and corridors, with doors connecting them. There are several markers in these places for identification (room markers, corridor markers), and these markers, as well as the doors, are used as objects in the relational representation. A robot (our agent) can navigate through the rooms and corridors, and the space is discretized in cells, each cell possessing a marker and corresponding to a different state. A task always defines a single location the robot must reach, which is always a cell inside a room. Thus, there is one goal state for each task, and the task is finished when the robot reaches it. One example can be seen in Fig. 8.

Our concern is to control the robot at a higher level of abstraction, so the robot can sense doors and identify whether it is inside a room or a corridor. Additionally, the robot also has a positioning system, making it able to tell whether it is near or far from the goal location and also whether some object is closer to or farther from the goal than the robot itself.

We describe each ground state using the vocabulary C ∪ PL ∪ PO ∪ PP, detailed below. C is the set of objects: room markers, corridor markers, and doors. Room markers are objects inside a room that identify the room. Analogously, corridor markers are objects inside corridors. Doors are passages from one room to another or to a corridor. The robot has a set of sensors that can sense a marker at a distance of one cell, while it may sense doors that are up to two cells away.

The state predicates are divided into three sets: PL, PO, and PP. PL contains predicates related to the location of the agent, PO contains predicates related to observations, and PP contains predicates related to the relative position of the objects. We describe them all below.

1) PL:
a) inRoom(R) indicates that the agent is inside a room and the closest marker is room marker R;
b) inCorridor(C) indicates that the agent is in a corridor and the closest marker is corridor marker C.

2) PO:
a) seeDoor(D) means that the agent can see a door, but it is at least two cells away (as the robot cannot see doors farther than two cells away, this predicate holds only when the robot is exactly two cells away from a door);
b) seeAdjRoom(R) indicates that the agent is close to a door (one cell away) and can even see that a room (with room marker R) lies through it;
c) seeAdjCorridor(C) indicates that the agent is close to a door (one cell away) and can even see that a corridor (with corridor marker C) lies through it;
d) seeEmptySpace(M) means that the agent sees a free space, close to a room or corridor marker M, to which it could move;
e) nearGoal is true if the agent is at a close distance to the goal, i.e., at a Manhattan distance to the goal lower than 5.

3) PP:
a) closeGoal(X) indicates that object X is closer than the agent to the goal;
b) farGoal(X) indicates that object X is farther than the agent from the goal.

Background knowledge B defines rules and restrictions for the state description. A state must always contain one, and only one, predicate from PL, which means that the agent is either inside a room or in a corridor, never in both or in neither. For each predicate from PO in the state description, there must always be one from PP, both with the same object, except if the predicate has zero arity (nearGoal). For example, one possible ground state is s1 = inRoom(r11) ∧ seeDoor(d1) ∧ closeGoal(d1) ∧ seeEmptySpace(r12) ∧ closeGoal(r12) ∧ seeEmptySpace(r13) ∧ farGoal(r13) ∧ nearGoal. The agent is inside a room and at a close distance to the goal. The closest marker is r11. seeDoor(d1) ∧ closeGoal(d1) means that the agent is seeing a door d1 that is closer to the goal than the agent is. It also sees two other markers in the room, r12 and r13, one closer to the goal and the other farther from the goal, respectively.

Regarding actions, they are described by the set of predicates PA. All action predicates have one of two suffixes, AppGoal(X) or AwayGoal(X), which indicate whether going toward object X takes the agent closer to or farther from the goal, respectively. Actions ending with AppGoal(X) are only available in states with closeGoal(X); likewise, actions ending with AwayGoal(X) are only available in states with farGoal(X). Remember that all actions are atoms, i.e., they are formed by just one predicate. The set of available actions PA contains:

1) goToDoor[App|Away]Goal(D) means that the agent must go closer to door D; this action is available in states with seeDoor(D) and the appropriate element from PP;


2) goToRoom[App|Away]Goal(R) means that the agent must enter the room with room marker R; this action is available in states with seeAdjRoom(R) and the appropriate element from PP;
3) goToCorridor[App|Away]Goal(C) means that the agent must enter the corridor with corridor marker C; this action is available in states with seeAdjCorridor(C) and the appropriate element from PP;
4) goToEmpty[App|Away]Goal(M) means that the agent must go to an area where the closest marker will be M; this action is available in states with seeEmptySpace(M) and the appropriate element from PP.

Not all actions are available in all states, e.g., goToDoorAppGoal(D) is only available in states with seeDoor(D) ∧ closeGoal(D) in their description. The suffixes may seem to be redundant information, but they play an important role at the abstract level: at that level, the agent can, for example, decide between going to a room that is closer to the goal (goToRoomAppGoal(X)) or going to a corridor that is farther from it (goToCorridorAwayGoal(Y)). Moreover, the ground level is not affected by this extra information; at any ground state there are at most four available ground actions (North, South, East, West), given the nature of each state.

Transition probabilities are greater than zero only between two adjacent states (two adjacent cells). The transition function is nondeterministic: there is always an error probability of 0.05, in which case the agent does not change state. In other words, after executing action a in state s, the agent transits to the expected next state s′ with probability 0.95 and stays in state s with probability 0.05.

For each task, the reward function is fairly simple: goal states give a positive reward and all others give zero reward,

R(s) = 1 if s ∈ G, and R(s) = 0 otherwise.

Each task has only one goal state sg, therefore G = {sg}. This goal state is always a state inside a room; the agent's task is always to arrive at a room location. The initial state probability distribution b0 is uniform over all states. We chose this domain not only because it is exactly the same as that used in the paper that introduced the concept of a policy library and the PRQ-learning algorithm [9], but also because similar domains have been used in other transfer learning works [21], [25], [31].

B. Experimental Setup

We conducted a series of experiments to test the effectiveness of the knowledge transfer in the learning process. All the experiments are executed in a simulated environment, the one illustrated in Fig. 8. Each experiment has two phases: knowledge building and target learning.

In the first phase, the agent learns one or more policies from a set of n source tasks 1, 2, . . . , n and keeps them encoded in some way as its knowledge. We use our proposed method to create a generalized policy: all source tasks are presented to the agent simultaneously, and AbsProb-PI-multiple (Fig. 1 in Section IV) is used to find one single generalized policy. In all the experiments, we compare our method to another policy transfer method, the PLPR algorithm (Fig. 2, described in Section IV). It builds a library of policies: the source tasks are presented sequentially, one at a time; an abstract policy is found for each one, and it may or may not be included in the library depending on δ. Furthermore, as the δ parameter of the library-building algorithm influences the size of the library, four values are used: δ = 0, δ = 0.25, δ = 0.50, and δ = 1.0. We call the libraries formed with each value L0, L25, L50, and L100, respectively.

In the second phase, the agent faces a new task, called the target task, which differs from the source tasks. PRQ-learning is applied to this new task, i.e., the agent learns the optimal policy with Q-learning but improves the performance of the learning process by reusing the policies learned in the first phase. The PRQ-learning algorithm uses a set of policies: in the case of the generalized policy, the set contains just one element; in the case of the library, it contains the whole library. Always starting in a random initial state (b0 is a uniform probability distribution over all states), the agent tries to reach the goal state in an episode of at most 100 steps. When it reaches the goal, or fails to find it after 100 steps, the episode ends. The agent runs 2000 episodes (K = 2000) for each of the four library versions (L25, L50, L100, and the library with one generalized policy). The parameters used for PRQ-learning are: H = 100, μ = 0.05, γ = 0.95, ε = 0.05, ψ0 = 1, v = 0.95, τ0 = 0, and Δτ = 0.005.

In our set of experiments, we also want to evaluate the evolution of the library sizes and the impact of the number of source tasks on the transfer. In order to do so, we first choose a total of 20 source tasks (depicted as S in Fig. 8). Then, given a random sequence of these 20 tasks, we run experiments by gradually increasing n from 1 to 20. In other words, the first experiment is run with n = 1, and the agent builds a generalized policy from just one task and also builds a library with just one policy. In the second experiment, n = 2, the next task is added, and the agent builds a new generalized policy aggregating both tasks as well as a new library of size ≤ 2, containing the previous policy and possibly the new policy that solves the second task (added or not depending on the value of δ). This process continues until n = 20. Additionally, as the order in which the source tasks are presented may change the contents of a library, the whole process is repeated four times, always randomizing the ordering of the 20 source tasks. Furthermore, in the target learning phase, 15 target tasks (depicted as T in Fig. 8) are used, and the results we show are the average over these 15 tasks.

These experiments assess the transfer among tasks in the same domain. To show the transfer effect on tasks in a different domain, the knowledge from the same 20 source tasks is also used to improve performance when learning new tasks in a broader, never-seen-before domain. These tasks (and the domain) are illustrated in Fig. 9.
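A compact sketch of the two-phase protocol just described (ours; the callables passed in are placeholders for the procedures of Sections IV and V, so this only fixes the structure of the experiment, not its components):

```python
# Illustrative driver. Assumed placeholders: build_generalized(tasks) -> policy
# (AbsProb-PI-multiple), build_policy(task) -> policy (e.g., AbsProb-PI),
# evaluate_on(policy, task) -> average reward, prq_learning(task, policies, episodes)
# -> per-episode rewards.
def run_experiment(source_tasks, target_task, n, delta,
                   build_generalized, build_policy, evaluate_on, prq_learning):
    # Phase 1: knowledge building.
    pi_g = build_generalized(source_tasks[:n])
    library = []
    for task in source_tasks[:n]:
        pi_i = build_policy(task)
        reuse = [evaluate_on(p, task) for p in library]
        if not library or max(reuse) < delta * evaluate_on(pi_i, task):
            library.append(pi_i)                     # PLPR inclusion test
    # Phase 2: target learning (2000 episodes of at most 100 steps in the experiments).
    return (prq_learning(target_task, [pi_g], 2000),
            prq_learning(target_task, library, 2000))
```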


Fig. 9. New domain. 10 target tasks to evaluate transfer among tasks of different domains.

Fig. 10. Performance results for n = 20. Performance of the generalized policy using knowledge of 20 source tasks, compared with libraries of policies (L0, L25, L50, and L100), and Q-learning without past knowledge (no transfer). Each point represents the average value of 100 executions and the error bars show the confidence interval of this average for a confidence level of 95%.

C. Results

Fig. 10 shows the results when n = 20, i.e., when the whole set of 20 source tasks is considered in knowledge building. After 20 source tasks, the average sizes of the libraries are: |L0| = 1, |L25| = 5, |L50| = 11, and |L100| = 20. Their performance is compared in terms of the cumulative average of the rewards received in each episode. The performance of an agent learning from scratch with Q-learning, without any kind of previous knowledge, is also shown for comparison (label "No transfer" in Fig. 10). This agent, like all the others, uses the ε-greedy exploration strategy with a fixed value ε = 0.05, a learning rate μ = 0.05, and a discount factor γ = 0.95.

First of all, we can see that the knowledge transfer methods do present better performance. The amount of reward received by the agent in the first episodes is much higher when it reuses previous knowledge, because the already-learned policies guide its exploration and thus lead it to higher rewards even at the beginning of the learning process. Without reuse, on the other hand, the agent takes more time to explore the environment and behaves more erratically while trying to solve the task. This shows that the previous policies did encode information relevant to the new tasks.

Comparing the two methods that use past knowledge, the generalized policy performs better than the policy library when averaging over all tasks. This is especially due to the fact that the library contains some policies that are not helpful at all, and may even be disadvantageous to the agent. The PRQ-learning algorithm may take some time to notice that some policies are unhelpful and to assign them a low probability of being chosen, whereas the generalized policy mostly provides some aid to the learning process. This means that, if we seek to always obtain the maximum reward, regardless of the number of episodes executed, the generalized policy is more suitable. L0 was expected to present the poorest performance because it is the one that preserves the least previous knowledge: just the policy of the first source task. All the other libraries, L25, L50, and L100, had quite similar performance. Indeed, Student's t-tests were conducted pointwise and show that the hypothesis that the learning curves of L25, L50, and L100 all have the same mean after the 10th episode cannot be rejected with a confidence level greater than 95%. On the other hand, the hypothesis that they have the same mean as the generalized policy is rejected after the 10th episode with a confidence level greater than 99.9%.

Fig. 10 shows the results when there is knowledge from the solution of 20 source tasks, but what if n is smaller? Focusing on just one target task, Fig. 11 shows the cumulative average of the rewards after 1000 episodes for different values of n, for the generalized policy, L25, L50, and L100.

Fig. 11. Evolution of the size of the source task set. Performance after 1000 episodes for different numbers of source tasks. Each point represents the average value of 100 executions.

Looking at the evolution of n, we notice that the generalized policy presents better performance regardless of the number of solved source tasks. For all the approaches, as n increases, the average reward also increases until it reaches a maximum.


Fig. 12. Results of transfer to new domain. Performance of policy reuse methods applying knowledge of 20 source tasks in a new domain. Each point represents the average value of 100 executions and the error bars show the 95% confidence interval of this average.

Note that L25 and L50 do not include all the policies in their libraries and eventually stop including any policy, which is why they show constant reward over some ranges of n. Additionally, the L100 curve presents decreasing performance for higher values of n, i.e., when the library has too many policies, its performance may be poorer, an indication that the online policy selection process imposes a significant overhead on the learning process.

The knowledge from the same 20 source tasks of Fig. 8 is also used to show the transfer effect on 10 target tasks in a different domain (Fig. 9). The arithmetic mean of the cumulative average of the rewards after 100 runs in each of the 10 tasks is shown in Fig. 12. The results are quite similar to the ones presented before: the generalized policy has an overall better result than the libraries. The library with δ = 0.25 presented slightly better performance than the ones with δ = 0.5 and δ = 1.0 (which contains all 20 policies), corroborating the earlier conclusion that too many policies may hurt transfer performance. This experiment shows the power of a relational representation and abstract policies: they enable the transfer process to reach a whole different domain, because the knowledge is kept at an abstract level.

VII. CONCLUSION

To sum up, the experiments show that the AbsProb-PI-multiple algorithm builds stochastic abstract policies that can indeed accelerate RL tasks. Due to their relational representation, abstract policies provide a way to encode knowledge that is not excessively bound to the source task, thus enabling transfer to new tasks. Moreover, the use of a generalized policy that encodes the knowledge of several previously solved tasks proves to be an effective approach to transfer learning. In our experiments, it achieves better results than a policy library. Although a library does contain more information, its knowledge base is scattered among all the policies, and the overhead of selecting and evaluating each of them is high. The use of a generalized policy provides better results on average, in terms of the reward the agent receives over time. Besides, when there are few source tasks, the generalized policy also performs better.


VII. CONCLUSION

To sum up, the experiments show that the AbsProb-PI-multiple algorithm builds stochastic abstract policies that can indeed accelerate RL tasks. Due to their relational representation, abstract policies provide a way to encode knowledge that is not excessively bound to the source task, thus enabling transfer to new tasks. Moreover, the use of a generalized policy that encodes the knowledge of several previously solved tasks proves to be an effective approach to transfer learning: in our experiments, it achieves better results than a policy library. Although a library does contain more information, its knowledge base is scattered among all the policies and the overhead of selecting and evaluating each of them is high. The use of a generalized policy provides better results on average, in terms of the reward the agent receives over time. Besides, when there are few source tasks, the generalized policy also performs better. In this case the overhead of selecting policies is lower, but with so few policies the knowledge is usually not generic enough, whereas the generalized policy combines the solutions of these few tasks and again presents better results on average. One could also use a library of generalized policies, an approach that would be a compromise between the two presented here. Instead of ignoring some policies or generalizing all of them into one single policy, one could use a measure to decide which solutions should be combined and which should not.

Unlike other alternatives for obtaining generalized policies [34], [35], our approach exploits the full description of the tasks to find generalized policies, rather than using only samples of optimal policies. Most inductive approaches that use optimal-policy samples build an ordered list of rules covering the samples, whereas our proposal describes abstract policies in terms of conjunctions of abstract atoms. However, both approaches (induction from the task description or from policy samples) rely on MDPs with a number of states for which value functions can be easily calculated. If the number of predicates that describe abstract atoms is very large, scalability also becomes a problem, as it does with approaches that use factored MDP descriptions. To alleviate this problem, we can use feature selection, which we have begun to investigate in the context of gradient-based policy search [36].

REFERENCES

[1] N. Navarro-Guerrero, C. Weber, P. Schroeter, and S. Wermter, "Real-world reinforcement learning for autonomous humanoid robot docking," Robot. Auton. Syst., vol. 60, no. 11, pp. 1400–1407, 2012.
[2] K. Conn and R. Peters, "Reinforcement learning with a supervisor for a mobile robot in a real-world environment," in Proc. 7th CIRA, Jacksonville, FL, USA, 2007, pp. 73–78.
[3] M. E. Taylor and P. Stone, "Transfer learning for reinforcement learning domains: A survey," J. Mach. Learn. Res., vol. 10, pp. 1633–1685, Dec. 2009.
[4] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
[5] Y.-B. Kang, S. Krishnaswamy, and A. Zaslavsky, "A retrieval strategy for case-based reasoning using similarity and association knowledge," IEEE Trans. Cybern., vol. 44, no. 4, pp. 473–487, Apr. 2014.
[6] R. S. Sutton, D. Precup, and S. Singh, "Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning," Artif. Intell., vol. 112, nos. 1–2, pp. 181–211, Aug. 1999.
[7] M. Klenk, D. W. Aha, and M. Molineaux, "The case for case-based transfer learning," AI Mag., vol. 32, no. 1, pp. 54–69, 2011.
[8] L. A. Celiberto, Jr., J. P. Matsuura, R. L. De Mantaras, and R. A. C. Bianchi, "Using cases as heuristics in reinforcement learning: A transfer learning application," in Proc. 22nd IJCAI, 2011, pp. 1211–1217.
[9] F. Fernández and M. Veloso, "Probabilistic policy reuse in a reinforcement learning agent," in Proc. 5th AAMAS, Hakodate, Japan, 2006, pp. 720–727.
[10] F. Fernández, J. García, and M. Veloso, "Probabilistic policy reuse for inter-task transfer learning," Robot. Auton. Syst., vol. 58, no. 7, pp. 866–871, Jul. 2010.
[11] F. Fernández and M. Veloso, "Learning domain structure through probabilistic policy reuse in reinforcement learning," Prog. Artif. Intell., vol. 2, no. 1, pp. 13–27, 2013.
[12] H. Blockeel and L. De Raedt, "Top-down induction of first-order logical decision trees," Artif. Intell., vol. 101, nos. 1–2, pp. 285–297, 1998.
[13] M. Martín and H. Geffner, "Learning generalized policies from planning examples using concept languages," Appl. Intell., vol. 20, no. 1, pp. 9–19, 2004.
[14] Y. Liu and P. Stone, "Value-function-based transfer for reinforcement learning using structure mapping," in Proc. 21st Nat. Conf. Artif. Intell., 2006, pp. 415–420.


[15] M. E. Taylor, P. Stone, and Y. Liu, "Transfer learning via inter-task mappings for temporal difference learning," J. Mach. Learn. Res., vol. 8, no. 1, pp. 2125–2167, 2007.
[16] B. Banerjee and P. Stone, "General game learning using knowledge transfer," in Proc. 20th IJCAI, 2007, pp. 672–677.
[17] G. Konidaris, I. Scheidwasser, and A. Barto, "Transfer in reinforcement learning via shared features," J. Mach. Learn. Res., vol. 13, pp. 1333–1371, 2012.
[18] R. Bianchi, M. Martins, C. Ribeiro, and A. Costa, "Heuristically-accelerated multiagent reinforcement learning," IEEE Trans. Cybern., vol. 44, no. 2, pp. 252–265, Feb. 2014.
[19] J. Ramon, K. Driessens, and T. Croonenborghs, "Transfer learning in reinforcement learning problems through partial policy recycling," in Proc. 18th ECML, Warsaw, Poland, 2007, pp. 699–707.
[20] C. Drummond, "Accelerating reinforcement learning by composing solutions of automatically identified subtasks," J. Artif. Intell. Res., vol. 16, no. 1, pp. 59–104, Jan. 2002.
[21] K. Kersting, C. Plagemann, A. Cocora, W. Burgard, and L. De Raedt, "Learning to transfer optimal navigation policies," Adv. Robot., vol. 21, no. 13, pp. 1565–1582, 2007.
[22] S. Džeroski, L. De Raedt, and K. Driessens, "Relational reinforcement learning," Mach. Learn., vol. 43, nos. 1–2, pp. 7–52, 2001.
[23] M. van Otterlo, "Reinforcement learning for relational MDPs," in Proc. 13th BENELEARN, 2004, pp. 138–145.
[24] L. Li, T. J. Walsh, and M. L. Littman, "Towards a unified theory of state abstraction for MDPs," in Proc. 9th Int. Symp. Artif. Intell. Math., 2006, pp. 531–539.
[25] V. F. Silva, F. A. Pereira, and A. H. R. Costa, "Finding memoryless probabilistic relational policies for inter-task reuse," in Proc. 14th Int. Conf. IPMU, Catania, Italy, 2012, pp. 107–116.
[26] M. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. New York, NY, USA: Wiley, 1994.
[27] M. van Otterlo, The Logic of Adaptive Behavior: Knowledge Representation and Algorithms for Adaptive Sequential Decision Making Under Uncertainty in First-Order and Relational Domains. Washington, DC, USA: IOS Press, 2009.
[28] M. L. Littman, "Memoryless policies: Theoretical limitations and practical results," in Proc. 3rd Int. Conf. SAB, Brighton, England, 1994, pp. 238–245.
[29] S. P. Singh, T. Jaakkola, and M. I. Jordan, "Learning without state-estimation in partially observable Markovian decision processes," in Proc. 11th ICML, San Francisco, CA, USA, 1994, pp. 284–292.
[30] T. Matos, Y. P. Bergamo, V. F. Silva, F. G. Cozman, and A. H. R. Costa, "Simultaneous abstract and concrete reinforcement learning," in Proc. 9th SARA, 2011, pp. 82–89.
[31] M. L. Koga, V. F. D. Silva, F. G. Cozman, and A. H. R. Costa, "Speeding-up reinforcement learning through abstraction and transfer learning," in Proc. 12th AAMAS, Saint Paul, MN, USA, 2013, pp. 119–126.
[32] X.-R. Cao and H.-T. Fang, "Gradient-based policy iteration: An example," in Proc. 41st IEEE CDC, Piscataway, NJ, USA, 2002, pp. 3367–3371.
[33] L. Torrey and J. Shavlik, "Transfer learning," in Handb. Res. Mach. Learn. Appl., 2009, pp. 1–22.
[34] S. Yoon, A. Fern, and R. Givan, "Inductive policy selection for first-order MDPs," in Proc. 18th Conf. UAI, San Francisco, CA, USA, 2002, pp. 568–576.
[35] M. Martín and H. Geffner, "Learning generalized policies from planning examples using concept languages," Appl. Intell., vol. 20, no. 1, pp. 9–19, 2004.
[36] K. O. M. Bogdan and V. F. Silva, "Forward and backward feature selection in gradient-based MDP algorithms," in Proc. 11th MICAI, San Luis Potosí, Mexico, 2013, pp. 383–394.


Marcelo L. Koga received the B.Sc. degree in electrical engineering and the M.Sc. degree in computer engineering from Escola Politécnica, Universidade de São Paulo, São Paulo, Brazil, in 2010 and 2013, respectively. He is currently a Researcher with MVisia, Brazil. His current research interests include the fields of computer vision and artificial intelligence, particularly on reinforcement learning and transfer learning.

Valdinei Freire received the Ph.D. degree in electrical engineering from the Universidade de São Paulo, São Paulo, Brazil, and the Ph.D. degree in electrical and computer engineering from the Technical University of Lisbon, Lisbon, Portugal. He is an Assistant Professor at the Escola de Artes, Ciências e Humanidades, Universidade de São Paulo. His current research interests include various aspects of learning and sequential decision making: Markov decision processes, reinforcement learning, preference elicitation, recommender systems, intelligent robotics, and video processing.

Anna H. R. Costa (M’13) received the M.Sc. and M.Eng. degrees in electrical engineering from the Centro Universitário da FEI, São Paulo, Brazil, in 1983 and 1989, respectively, and the Ph.D. degree in electrical engineering from Escola Politécnica, Universidade de São Paulo, São Paulo, Brazil, in 1994. She is currently a Full Professor with Escola Politécnica, Universidade de São Paulo. From 1998 to 1999, she was a Guest Researcher at Carnegie Mellon University, Pittsburgh, PA, USA. From 1983 to 1985, and from 1991 to 1992, she was a Research Scientist at the University of Karlsruhe, Karlsruhe, Germany. Her current research interests include cognitive robotics and perception, reasoning, and learning in mobile robotics.
