nips nips2012 nips2012-353 knowledge-graph by maker-knowledge-mining

353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning


Source: pdf

Author: Trung Nguyen, Tomi Silander, Tze Y. Leong

Abstract: We study how to automatically select and adapt multiple abstractions or representations of the world to support model-based reinforcement learning. We address the challenges of transfer learning in heterogeneous environments with varying tasks. We present an efficient, online framework that, through a sequence of tasks, learns a set of relevant representations to be used in future tasks. Without predefined mapping strategies, we introduce a general approach to support transfer learning across different state spaces. We demonstrate the potential impact of our system through improved jumpstart and faster convergence to near optimum policy in two benchmark domains. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We address the challenges of transfer learning in heterogeneous environments with varying tasks. [sent-5, score-0.366]

2 Without predefined mapping strategies, we introduce a general approach to support transfer learning across different state spaces. [sent-7, score-0.279]

3 1 Introduction In reinforcement learning (RL), an agent autonomously learns how to make optimal sequential decisions by interacting with the world. [sent-9, score-0.384]

4 A small change in the task or the environment may render the agent’s accumulated knowledge useless; costly re-learning from scratch is often needed. [sent-11, score-0.255]

5 Many existing techniques assume the same state space or state representation in different tasks. [sent-13, score-0.19]

6 While recent efforts have addressed inter-task transfer in different action or state spaces, specific mapping criteria have to be established through policy reuse [7], action correlation [14], state abstraction [22], inter-space relation [16], or other methods. [sent-14, score-0.798]

7 Such mappings are hard to define when the agent operates in complex environments with large state spaces and multiple goal states, with possibly different state feature distributions and world dynamics. [sent-15, score-0.58]

8 To efficiently accomplish varying tasks in heterogeneous environments, the agent has to learn to focus attention on the crucial features of each environment. [sent-16, score-0.437]

9 We propose a system that tries to transfer old knowledge, but at the same time evaluates new options to see if they work better. [sent-17, score-0.236]

10 The agent gathers experience during its lifetime and enters a new environment equipped with expectations on how different aspects of the world affect the outcomes of the agent’s actions. [sent-18, score-0.375]

11 The main idea is to allow an agent to collect a library of world models or representations, called views, that it can consult to focus its attention in a new task. [sent-19, score-0.458]

12 The reward model library can be learned in an analogous fashion. [sent-21, score-0.344]

13 Effective utilization of the library of world models allows the agent to capture the transition dynamics of the new environment quickly; this should lead to a jumpstart in learning and faster convergence to a near optimal policy. [sent-22, score-0.755]

14 We will next formalize the problem and describe the method of collecting views into a library. [sent-24, score-0.229]

15 We will then present an efficient implementation of the proposed transfer learning technique. [sent-25, score-0.184]

16 The goal is then to find a policy π that specifies an action to perform in each state so that the expected accumulated future reward (possibly giving higher weights to more immediate rewards) for each state is maximized [18]. [sent-28, score-0.738]

17 In model-based RL, the optimal policy is calculated based on the estimates of the transition model T and the reward model R which are obtained by interacting with the environment. [sent-29, score-0.431]

18 A key idea of this work is that the agent can represent the world dynamics from its sensory state space in different ways. [sent-30, score-0.443]

19 Such different views correspond to the agent’s decisions to focus attention on only some features of the state in order to quickly approximate the state transition function. [sent-31, score-0.643]

20 1 Decomposition of transition model To allow knowledge transfer from one state space to another, we assume that each state s in all the state spaces can be characterized by a d-dimensional feature vector f(s) ∈ R^d. [sent-33, score-0.628]

21 We use the idea of situation calculus [11] to decompose the transition model T in accordance with the possible action effects. [sent-35, score-0.305]

22 In the RL context, an action will stochastically create an effect that determines how the current state changes to the next one [2, 10, 14]. [sent-36, score-0.281]

23 For example, an attempt to move left in a grid world may cause the agent to move one step left or one step forward, with small probabilities. [sent-37, score-0.29]

24 CMDP is defined by a tuple (S, A, E, τ, η, f, R) in which the transition model T has been replaced by the terms E, τ, η, f, where E is an effect set and f is a function from states to their feature vectors. [sent-40, score-0.203]

25 τ : S × A × E → [0, 1] is an action model such that τ(s, a, e) = P(e | f(s), a) indicates the probability of achieving effect e upon performing action a at state s. [sent-41, score-0.429]

26 While the agent needs to learn the effects of the action, it is usually assumed to understand the meaning of the effects, i. [sent-43, score-0.328]

27 Different effects e will change a state s to a different next state s′ = η(s, e). [sent-47, score-0.252]

28 The MDP transition model T can be reconstructed from the CMDP by the equation: T(s, a, s′; τ) = P(s′ | f(s), a) = τ(s, a, e), (1) where e is the effect of action a that takes s to s′, if such an e exists; otherwise T(s, a, s′; τ) = 0. [sent-48, score-0.319]
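A minimal sketch of Equation 1 (not the authors' code): the MDP transition probability is recovered from the CMDP pieces τ and η. The grid-world effect set, the outcome function eta, and the fixed tau probabilities below are illustrative assumptions.

```python
# Sketch of Eq. 1: T(s, a, s') from the action model tau and outcome function eta.

EFFECTS = ["intended", "slip_left", "stay"]

def eta(state, effect):
    """Deterministic outcome function eta(s, e): the state an effect leads to."""
    x, y = state
    return {"intended": (x, y + 1), "slip_left": (x - 1, y), "stay": (x, y)}[effect]

def tau(state, action, effect):
    """Action model tau(s, a, e) = P(e | f(s), a); toy fixed numbers here."""
    return {"intended": 0.8, "slip_left": 0.1, "stay": 0.1}[effect]

def transition_prob(s, a, s_next):
    """T(s, a, s'; tau) = tau(s, a, e) for the effect e with eta(s, e) = s', else 0."""
    for e in EFFECTS:
        if eta(s, e) == s_next:   # the paper assumes at most one such effect
            return tau(s, a, e)
    return 0.0

print(transition_prob((2, 2), "up", (2, 3)))   # 0.8 via the "intended" effect
```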

29 We can therefore turn the learning of the transition model into a supervised online classification problem that can be solved by any standard online classification method. [sent-50, score-0.223]

30 More specifically, the classification task is to predict the effect e of an action a in a state s with features f (s). [sent-51, score-0.363]
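As a rough illustration of this classification reading of model learning, the sketch below keeps one multinomial logistic regression per action and updates it online after every observed effect; the feature layout, learning rate, and effect labels are assumptions rather than the paper's exact choices.

```python
import numpy as np

class EffectClassifier:
    """One online effect classifier (view) for a single action."""

    def __init__(self, n_features, n_effects, lr=0.1):
        self.W = np.zeros((n_effects, n_features))  # parameters Theta
        self.lr = lr

    def predict_proba(self, f_s):
        """P(e | f(s), a) as a softmax over effects."""
        logits = self.W @ f_s
        logits -= logits.max()                      # numerical stability
        p = np.exp(logits)
        return p / p.sum()

    def update(self, f_s, effect_idx):
        """One stochastic gradient step on the log-loss of the observed effect."""
        p = self.predict_proba(f_s)
        grad = np.outer(p, f_s)
        grad[effect_idx] -= f_s                     # (p - onehot(e)) f(s)^T
        self.W -= self.lr * grad

clf = EffectClassifier(n_features=4, n_effects=3)
f_s = np.array([1.0, 0.0, 1.0, 0.0])                # f(s), e.g. wall/surface bits
clf.update(f_s, effect_idx=0)                       # observed effect of the action
print(clf.predict_proba(f_s))                       # probability of effect 0 rises
```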

31 2 A multi-view transfer framework In our framework, the knowledge gathered and transferred by the agent is collected into a library T of online effect predictors or views. [sent-53, score-0.728]

32 A view consists of a structure component f̄ that picks the features which should be focused on, and a quantitative component Θ that defines how these features should be combined to approximate the distribution of action effects. [sent-54, score-0.296]
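A minimal sketch of this two-part structure; the field names are chosen for illustration only.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class View:
    action: str               # a(tau): the action this view specializes in
    feature_mask: np.ndarray  # structure component f-bar: 0/1 feature selector
    theta: np.ndarray         # quantitative component Theta: effect x feature weights
    score: float = 0.0        # running cumulative log-score (see scoring below)

    def effect_probs(self, f_s: np.ndarray) -> np.ndarray:
        """Distribution over effects using only the attended features."""
        logits = self.theta @ (f_s * self.feature_mask)
        logits -= logits.max()
        p = np.exp(logits)
        return p / p.sum()
```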

33 Each view τ is specialized in predicting the effects of one action a(τ ) ∈ A and it yields a probability distribution for the effects of the action a in any state. [sent-56, score-0.502]

34 This prediction is based on the features of the state and the parameters Θ(τ ) of the view that may be adjusted based on the actual effects observed in the task environment. [sent-57, score-0.353]

35 We denote the subset of views that specify the effects for action a by T^a ⊂ T. [sent-58, score-0.439]

36 The main challenge is to build and maintain a comprehensive set of views that can be used in new environments likely resembling the old ones, but at the same time allow adaptation to new tasks with completely new transition dynamics and feature distributions. [sent-59, score-0.629]

37 At the beginning of every new task, the existing library is copied into a working library which is also augmented with fresh, uninformed views, one for each action, that are ready to be adapted to new tasks. [sent-60, score-0.365]

38 This view is used to estimate the optimal policy based on the transition model specified in Equation 1, and the policy is used to pick the first action a. [sent-62, score-0.557]

39 The action effect is then used to score all the views in the working library and to adjust their parameters. [sent-63, score-0.651]

40 In each round the selection of views is repeated based on their scores, and the new optimal policy is calculated based on the new selections. [sent-64, score-0.326]

41 At the end of the task, the actual library is updated by possibly recruiting the views that have “performed well” and retiring those that have not. [sent-65, score-0.469]

42 Algorithm 1 TES: Transferring Expectations using a library of views. Input: T = {τ1, τ2, ...}: view library; CMDP_j: a new j-th task; Φ: view goodness evaluator. [sent-67, score-0.458]

43 Let T0 be a set of fresh views, one for each action; T_tmp ← T ∪ T0 /* THE WORKING LIBRARY FOR THE TASK */. For all a ∈ A do T̂[a] ← argmax_{τ ∈ T^a} Φ(τ, j) end for /* SELECTING VIEWS */. For t = 0, 1, 2, ... [sent-70, score-0.565]
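A compressed, hedged sketch of the per-task loop in Algorithm 1 in plain Python; the helpers plan_policy (solves the MDP induced by the selected views via Equation 1), goodness (the evaluator Φ), fresh_view, and the environment interface are assumed placeholders rather than the paper's implementation.

```python
import math

def run_tes_task(library, env, actions, n_steps, goodness, plan_policy, fresh_view):
    working = list(library) + [fresh_view(a) for a in actions]   # T_tmp = T ∪ T0
    for t in range(n_steps):
        # Select, for each action, the working view with the highest goodness.
        selected = {a: max((v for v in working if v.action == a), key=goodness)
                    for a in actions}
        policy = plan_policy(selected)          # optimal policy under the views
        s, f_s = env.state(), env.features()
        a = policy(s)
        e = env.step(a)                         # act; observe the actual effect
        for v in working:                       # score and adapt every view of a
            if v.action == a:
                v.score += math.log(v.effect_probs(f_s)[e])
                v.update(f_s, e)
        # (In practice the policy is re-planned only when the selection changes.)
    return working                              # candidates for recruiting below
```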

44 1 Scoring the views To assess the quality of a view τ , we measure its predictive performance by a cumulative log-score. [sent-75, score-0.343]
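A minimal sketch of this cumulative log-score, under an assumed (features, effect) format for the data D_a a view has seen for its action.

```python
import math

def cumulative_log_score(view, observations):
    """S(tau, D_a) = sum_t log P(e_t | f(s_t), a; tau)."""
    return sum(math.log(view.effect_probs(f_s)[e]) for f_s, e in observations)
```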

45 2 Growing the library After completing a task, the highest scoring new views for each action are considered for recruiting into the actual library. [sent-86, score-0.612]

46 The winners τ* that are adjusted versions of old views τ̄ are accepted as new members if they score significantly higher than their original versions, based on the logarithm of the prequential likelihood ratio [5]: Λ(τ*, τ̄) = S(τ*, D_a) − S(τ̄, D_a). [sent-89, score-0.391]
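The recruiting test stated above reduces to a one-line check; the default threshold matches the c = log 300 used in the experiments, while the function name and interface are assumptions.

```python
import math

def should_recruit(score_tau_star, score_tau_bar, c=math.log(300)):
    """Accept tau_star if Lambda(tau_star, tau_bar) = S(tau_star, D_a) - S(tau_bar, D_a) > c."""
    return (score_tau_star - score_tau_bar) > c
```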

47 3 Pruning the library To keep the library relatively compact, a plausible policy is to remove views that have not performed well for a long time, possibly because there are better predictors or they have become obsolete in the new tasks or environments. [sent-95, score-0.718]

48 To implement such a retiring scheme, each view τ maintains a list Hτ of task indices that indicates the tasks for which the view has been the best scoring predictor for its specialty action a(τ ). [sent-96, score-0.484]
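A minimal sketch of such a pruning rule, with an assumed `history` attribute standing in for Hτ and an assumed `patience` window.

```python
def prune_library(library, current_task_idx, patience=5):
    """Retire views that have not been the best-scoring predictor for a while."""
    return [view for view in library
            if view.history and current_task_idx - max(view.history) <= patience]
```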

49 3 A view learning algorithm In TES, a view can be implemented by any probabilistic classification model that can be quickly learned online. [sent-102, score-0.189]

50 4 Related work The survey by Taylor and Stone [20] offers a comprehensive exposition of recent methods to transfer various forms of knowledge in RL. [sent-132, score-0.237]

51 Not much research, however, has focused on transferring transition models. [sent-133, score-0.294]

52 Taylor proposes TIMBREL [19] to transfer observations from a source task to a target task via manually tailored inter-task mapping. [sent-135, score-0.233]

53 [7] transfers a library of policies learned in previous tasks to bias exploration in new tasks. [sent-137, score-0.276]

54 The method assumes a constant inter-task state space; otherwise, a state mapping strategy is needed. [sent-138, score-0.19]

55 Hester and Stone [8] describe a method to learn a decision tree for predicting state relative changes which are similar to our action effects. [sent-139, score-0.278]

56 This work, however, does not directly focus on transfer learning. [sent-142, score-0.184]

57 None of these studies, however, solve the problem of transferring knowledge in heterogeneous environments. [sent-147, score-0.269]

58 Atkeson and Santamaria introduce a locally weighted transfer learning technique called LWT to adapt previously learned transition models into a new situation [1]. [sent-148, score-0.317]

59 This study is among the very few that actually consider transferring the transition model to a new task [20]. [sent-149, score-0.343]

60 While their work is conducted in continuous state space using a fixed state similarity measure, it can be adapted to a discrete case. [sent-150, score-0.19]

61 This approach could also be extended to be compatible with our work by learning a library of state similarity measures and developing a method to choose among those similarities for each task. [sent-153, score-0.263]

62 [23] also address the problem of transfer in heterogeneous environments. [sent-155, score-0.266]

63 Their work demonstrates the method for reward models, and it is unclear how to extend the approach to transferring transition models. [sent-161, score-0.47]

64 5 Experiments We examine the performance of our expectation transfer algorithm TES, which transfers views to speed up the learning process across different environments in two benchmark domains. [sent-163, score-0.539]

65 We show that TES can efficiently: a) learn the appropriate views online, b) select views using the proposed scoring metric, c) achieve a good jump start, and d) perform well in the long run. [sent-164, score-0.524]

66 To better compare with some related work, we evaluate the performance of TES for transferring both transition models and reward models in RL. [sent-165, score-0.47]

67 TES can be adapted to transfer reward models as follows: assuming that the rewards follow a Gaussian distribution, a view of the expected reward model can be learned in a manner similar to that shown in Section 3. [sent-166, score-0.649]
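One plausible realization of such a Gaussian reward view, given as a sketch rather than the paper's exact estimator: the expected reward is modeled as a linear function of the state features, fitted online by a least-mean-squares update, and views are ranked by the Gaussian log-density of observed rewards.

```python
import numpy as np

class GaussianRewardView:
    def __init__(self, n_features, lr=0.05, sigma=1.0):
        self.w = np.zeros(n_features)
        self.lr, self.sigma = lr, sigma

    def mean(self, f_s):
        return float(self.w @ f_s)

    def update(self, f_s, reward):
        self.w += self.lr * (reward - self.mean(f_s)) * f_s   # LMS step

    def log_score(self, f_s, reward):
        """Gaussian log-density of the observed reward, used to rank views."""
        err = reward - self.mean(f_s)
        return -0.5 * (err / self.sigma) ** 2 - np.log(self.sigma * np.sqrt(2 * np.pi))
```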

68 When studying reward models, the transition models are assumed to be known. [sent-169, score-0.309]

69 1 Learning views for effective transfer In the first experiment, we compare TES with the locally weighted LWT approach by Atkeson et al. [sent-171, score-0.413]

70 The maximum size M of the views library, initially empty, is set to 20; the threshold c for growing the library is set to log 300. [sent-181, score-0.397]

71 Table 1: Transfer of reward models: Cumulative reward in the first episodes; Time to solve 15 tasks (in minutes), in which each is run with 200 episodes. [sent-185, score-0.408]

72 This demonstrates that TES can successfully learn the views and utilize them in novel tasks. [sent-238, score-0.264]

73 Similarly, the strategy for LWT that tries to learn one common model for transfer in various tasks often does not work well. [sent-242, score-0.301]

74 2 Multi-view transfer in complex environments In the second experiment, we evaluate TES in a more challenging domain, transferring transition models. [sent-244, score-0.578]

75 However, the agent also observes numerous other features in the environment. [sent-248, score-0.264]

76 The agent has to learn to focus on the relevant features to quickly achieve its goal. [sent-249, score-0.324]

77 6 Experiment set-up: The agent can perform four actions (move up, down, left, right), which will lead it to one of the four states around it, or leave it in its current state if it bumps into a wall. [sent-251, score-0.394]

78 A task ends when the agent reaches any exit door or fire. [sent-255, score-0.327]

79 In our experiments, the environment transition dynamics are generated using three different sets of multinomial logistic regression models, so that every combination of cell surfaces and walls around a cell leads to different transition dynamics at that cell. [sent-260, score-0.565]

80 In each scenario the agent is first allowed to experience fourteen tasks, over 100 episodes each, and it is then tested on the one remaining task. [sent-272, score-0.263]

81 No recency weighting is used to calculate the goodness of the views in the library. [sent-273, score-0.287]

82 We train and test TES on two environments which have the same dynamics and 200 irrelevant binary features that challenge the agent's ability to learn a compact model for transfer. [sent-277, score-0.226]

83 Figure 1a shows how much the other methods lose to TES in terms of accumulated reward in the test task. [sent-278, score-0.303]

84 loreRL is an implementation of TES equipped with the view learning algorithm that does not transfer knowledge. [sent-279, score-0.266]

85 fRmax is the factored Rmax [3] in which the network structures of transition models are provided by an oracle [17]; its parameter m is set to be 10 in all the experiments. [sent-280, score-0.198]

86 The results show that these oracle methods still have to spend time to learn the model parameters, so they gain less accumulated reward than TES. [sent-283, score-0.364]

87 (Figure 1: accumulated reward and accumulated reward difference vs. episode over the first 50 episodes; panel (a) compares fEpsG, fRmax, and loreRL against TES; panel (b) concerns view selection.) [sent-286, score-0.95]

88 Figure 1b shows how different views lead to different policies and accumulated rewards over the first 50 episodes in a given task. [sent-288, score-0.419]

89 The Rands curves show the accumulated reward difference to TES when the agent follows some random combinations of views from the library. [sent-289, score-0.763]

90 We compare the multi-view learning TES agent with a non-transfer agent loreRL, and an LWT agent that tries to learn only one good model for transfer. [sent-348, score-0.754]

91 When the earlier training tasks are similar to the test task, the LWT agent performs well. [sent-351, score-0.287]

92 However, the TES agent also quickly picks the correct views, so it never loses much but often gains a lot. [sent-352, score-0.256]

93 We also notice that TES achieves a higher accumulated reward than loreRL and fEpsG that are bound to make uninformed decisions in the beginning. [sent-353, score-0.365]

94 Table 2 shows the average cumulative reward after the first episode (the jumpstart effect) for each test task in the leave-one-out cross-validation. [sent-354, score-0.351]

95 We also notice that, owing to its ability to quickly capture the world dynamics, TES's running time is only slightly longer than LWT's and loreRL's, which do not perform extra work for view switching but need more time and data to learn the dynamics models. [sent-358, score-0.234]

96 This suggests that TES can successfully learn a good library of views in heterogeneous environments and efficiently utilize those views in novel tasks. [sent-369, score-0.843]

97 6 Conclusions We have presented a framework for learning and transferring multiple expectations or views about world dynamics in heterogeneous environments. [sent-370, score-0.621]

98 Our model learning method is independent of the policy learning task, so it can readily be coupled with any scalable approximate policy learning algorithm. [sent-376, score-0.194]

99 : Using cases as heuristics in reinforcement learning: A transfer learning application. [sent-401, score-0.279]

100 : Using homomorphisms to transfer options across continuous reinforcement learning domains. [sent-474, score-0.279]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('tes', 0.588), ('lwt', 0.249), ('agent', 0.231), ('views', 0.229), ('transfer', 0.184), ('reward', 0.176), ('library', 0.168), ('transferring', 0.161), ('lorerl', 0.16), ('action', 0.148), ('frmax', 0.142), ('transition', 0.133), ('accumulated', 0.127), ('environments', 0.1), ('policy', 0.097), ('hb', 0.095), ('reinforcement', 0.095), ('state', 0.095), ('fepsg', 0.089), ('heterogeneous', 0.082), ('view', 0.082), ('stone', 0.078), ('ttmp', 0.071), ('score', 0.068), ('effects', 0.062), ('world', 0.059), ('dynamics', 0.058), ('walls', 0.058), ('recency', 0.058), ('tasks', 0.056), ('wilson', 0.054), ('jumpstart', 0.053), ('santamaria', 0.053), ('environment', 0.053), ('wk', 0.052), ('task', 0.049), ('atkeson', 0.047), ('exit', 0.047), ('cmdp', 0.047), ('mdp', 0.047), ('wins', 0.046), ('rmax', 0.046), ('online', 0.045), ('episode', 0.041), ('ei', 0.04), ('factored', 0.039), ('gt', 0.039), ('logistic', 0.039), ('rl', 0.039), ('effect', 0.038), ('actions', 0.036), ('transferred', 0.036), ('da', 0.036), ('prequential', 0.036), ('rands', 0.036), ('recruiting', 0.036), ('retiring', 0.036), ('silander', 0.036), ('learn', 0.035), ('decisions', 0.033), ('features', 0.033), ('multinomial', 0.033), ('barto', 0.033), ('adjusted', 0.032), ('cumulative', 0.032), ('states', 0.032), ('expectations', 0.032), ('episodes', 0.032), ('littman', 0.032), ('scoring', 0.031), ('hester', 0.031), ('fteen', 0.031), ('rewards', 0.031), ('ki', 0.031), ('abstraction', 0.031), ('moved', 0.03), ('taylor', 0.029), ('uninformed', 0.029), ('wki', 0.029), ('konidaris', 0.029), ('abstractions', 0.029), ('pruning', 0.029), ('comprehensive', 0.027), ('aaai', 0.027), ('ijcai', 0.027), ('exploration', 0.026), ('tries', 0.026), ('old', 0.026), ('knowledge', 0.026), ('skill', 0.026), ('transfers', 0.026), ('oracle', 0.026), ('quickly', 0.025), ('si', 0.025), ('interacting', 0.025), ('prototype', 0.025), ('super', 0.025), ('fern', 0.024), ('calculus', 0.024), ('fresh', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

Author: Trung Nguyen, Tomi Silander, Tze Y. Leong

Abstract: We study how to automatically select and adapt multiple abstractions or representations of the world to support model-based reinforcement learning. We address the challenges of transfer learning in heterogeneous environments with varying tasks. We present an efficient, online framework that, through a sequence of tasks, learns a set of relevant representations to be used in future tasks. Without predefined mapping strategies, we introduce a general approach to support transfer learning across different state spaces. We demonstrate the potential impact of our system through improved jumpstart and faster convergence to near optimum policy in two benchmark domains. 1

2 0.18389845 245 nips-2012-Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions

Author: Jaedeug Choi, Kee-eung Kim

Abstract: We present a nonparametric Bayesian approach to inverse reinforcement learning (IRL) for multiple reward functions. Most previous IRL algorithms assume that the behaviour data is obtained from an agent who is optimizing a single reward function, but this assumption is hard to guarantee in practice. Our approach is based on integrating the Dirichlet process mixture model into Bayesian IRL. We provide an efficient Metropolis-Hastings sampling algorithm utilizing the gradient of the posterior to estimate the underlying reward functions, and demonstrate that our approach outperforms previous ones via experiments on a number of problem domains. 1

3 0.18226781 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

Author: Manuel Lopes, Tobias Lang, Marc Toussaint, Pierre-yves Oudeyer

Abstract: Formal exploration approaches in model-based reinforcement learning estimate the accuracy of the currently learned model without consideration of the empirical prediction error. For example, PAC-MDP approaches such as R- MAX base their model certainty on the amount of collected data, while Bayesian approaches assume a prior over the transition dynamics. We propose extensions to such approaches which drive exploration solely based on empirical estimates of the learner’s accuracy and learning progress. We provide a “sanity check” theoretical analysis, discussing the behavior of our extensions in the standard stationary finite state-action case. We then provide experimental studies demonstrating the robustness of these exploration measures in cases of non-stationary environments or where original approaches are misled by wrong domain assumptions. 1

4 0.17834534 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

Author: Jiarong Jiang, Adam Teichert, Jason Eisner, Hal Daume

Abstract: Users want inference to be both fast and accurate, but quality often comes at the cost of speed. The field has experimented with approximate inference algorithms that make different speed-accuracy tradeoffs (for particular problems and datasets). We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing [12]. Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is simply too large to explore naively. An attempt to counteract this by applying imitation learning algorithms also fails: the “teacher” follows a far better policy than anything in our learner’s policy space, free of the speed-accuracy tradeoff that arises when oracle information is unavailable, and thus largely insensitive to the known reward functfion. We propose a hybrid reinforcement/apprenticeship learning algorithm that learns to speed up an initial policy, trading off accuracy for speed according to various settings of a speed term in the loss function. 1

5 0.14994761 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

Author: Edouard Klein, Matthieu Geist, Bilal Piot, Olivier Pietquin

Abstract: This paper adresses the inverse reinforcement learning (IRL) problem, that is inferring a reward for which a demonstrated expert behavior is optimal. We introduce a new algorithm, SCIRL, whose principle is to use the so-called feature expectation of the expert as the parameterization of the score function of a multiclass classifier. This approach produces a reward function for which the expert policy is provably near-optimal. Contrary to most of existing IRL algorithms, SCIRL does not require solving the direct RL problem. Moreover, with an appropriate heuristic, it can succeed with only trajectories sampled according to the expert behavior. This is illustrated on a car driving simulator. 1

6 0.14486471 344 nips-2012-Timely Object Recognition

7 0.14467086 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

8 0.1406372 38 nips-2012-Algorithms for Learning Markov Field Policies

9 0.13594228 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

10 0.13381261 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

11 0.13192078 185 nips-2012-Learning about Canonical Views from Internet Image Collections

12 0.13086568 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

13 0.11964826 348 nips-2012-Tractable Objectives for Robust Policy Optimization

14 0.11944988 51 nips-2012-Bayesian Hierarchical Reinforcement Learning

15 0.11589382 160 nips-2012-Imitation Learning by Coaching

16 0.11072367 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes

17 0.10364041 183 nips-2012-Learning Partially Observable Models Using Temporally Abstract Decision Trees

18 0.094270423 153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model

19 0.089858212 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

20 0.087546453 364 nips-2012-Weighted Likelihood Policy Search with Model Selection


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.183), (1, -0.309), (2, -0.07), (3, -0.029), (4, -0.031), (5, 0.009), (6, 0.018), (7, 0.001), (8, 0.009), (9, 0.002), (10, -0.015), (11, -0.009), (12, -0.019), (13, 0.028), (14, 0.016), (15, 0.036), (16, -0.004), (17, 0.043), (18, 0.007), (19, -0.034), (20, 0.004), (21, 0.015), (22, -0.029), (23, 0.015), (24, 0.002), (25, -0.033), (26, 0.06), (27, -0.031), (28, 0.024), (29, 0.019), (30, 0.031), (31, 0.027), (32, 0.071), (33, -0.061), (34, -0.028), (35, 0.078), (36, 0.034), (37, -0.063), (38, 0.022), (39, -0.099), (40, 0.005), (41, -0.021), (42, 0.079), (43, 0.003), (44, -0.026), (45, 0.043), (46, 0.15), (47, 0.004), (48, -0.131), (49, 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9483667 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

Author: Trung Nguyen, Tomi Silander, Tze Y. Leong

Abstract: We study how to automatically select and adapt multiple abstractions or representations of the world to support model-based reinforcement learning. We address the challenges of transfer learning in heterogeneous environments with varying tasks. We present an efficient, online framework that, through a sequence of tasks, learns a set of relevant representations to be used in future tasks. Without predefined mapping strategies, we introduce a general approach to support transfer learning across different state spaces. We demonstrate the potential impact of our system through improved jumpstart and faster convergence to near optimum policy in two benchmark domains. 1

2 0.8416869 51 nips-2012-Bayesian Hierarchical Reinforcement Learning

Author: Feng Cao, Soumya Ray

Abstract: We describe an approach to incorporating Bayesian priors in the MAXQ framework for hierarchical reinforcement learning (HRL). We define priors on the primitive environment model and on task pseudo-rewards. Since models for composite tasks can be complex, we use a mixed model-based/model-free learning approach to find an optimal hierarchical policy. We show empirically that (i) our approach results in improved convergence over non-Bayesian baselines, (ii) using both task hierarchies and Bayesian priors is better than either alone, (iii) taking advantage of the task hierarchy reduces the computational cost of Bayesian reinforcement learning and (iv) in this framework, task pseudo-rewards can be learned instead of being manually specified, leading to hierarchically optimal rather than recursively optimal policies. 1

3 0.82305688 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

Author: Dongho Kim, Kee-eung Kim, Pascal Poupart

Abstract: In this paper, we consider Bayesian reinforcement learning (BRL) where actions incur costs in addition to rewards, and thus exploration has to be constrained in terms of the expected total cost while learning to maximize the expected longterm total reward. In order to formalize cost-sensitive exploration, we use the constrained Markov decision process (CMDP) as the model of the environment, in which we can naturally encode exploration requirements using the cost function. We extend BEETLE, a model-based BRL method, for learning in the environment with cost constraints. We demonstrate the cost-sensitive exploration behaviour in a number of simulated problems. 1

4 0.77201188 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

Author: Manuel Lopes, Tobias Lang, Marc Toussaint, Pierre-yves Oudeyer

Abstract: Formal exploration approaches in model-based reinforcement learning estimate the accuracy of the currently learned model without consideration of the empirical prediction error. For example, PAC-MDP approaches such as R- MAX base their model certainty on the amount of collected data, while Bayesian approaches assume a prior over the transition dynamics. We propose extensions to such approaches which drive exploration solely based on empirical estimates of the learner’s accuracy and learning progress. We provide a “sanity check” theoretical analysis, discussing the behavior of our extensions in the standard stationary finite state-action case. We then provide experimental studies demonstrating the robustness of these exploration measures in cases of non-stationary environments or where original approaches are misled by wrong domain assumptions. 1

5 0.74508643 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

Author: Arthur Guez, David Silver, Peter Dayan

Abstract: Bayesian model-based reinforcement learning is a formally elegant approach to learning optimal behaviour under model uncertainty, trading off exploration and exploitation in an ideal way. Unfortunately, finding the resulting Bayes-optimal policies is notoriously taxing, since the search space becomes enormous. In this paper we introduce a tractable, sample-based method for approximate Bayesoptimal planning which exploits Monte-Carlo tree search. Our approach outperformed prior Bayesian model-based RL algorithms by a significant margin on several well-known benchmark problems – because it avoids expensive applications of Bayes rule within the search tree by lazily sampling models from the current beliefs. We illustrate the advantages of our approach by showing it working in an infinite state space domain which is qualitatively out of reach of almost all previous work in Bayesian exploration. 1

6 0.71379745 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

7 0.69294286 245 nips-2012-Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions

8 0.68569171 31 nips-2012-Action-Model Based Multi-agent Plan Recognition

9 0.67039365 153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model

10 0.62817132 38 nips-2012-Algorithms for Learning Markov Field Policies

11 0.62439567 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

12 0.62373233 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

13 0.61715823 350 nips-2012-Trajectory-Based Short-Sighted Probabilistic Planning

14 0.59613132 160 nips-2012-Imitation Learning by Coaching

15 0.58733839 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes

16 0.5794673 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

17 0.56461042 183 nips-2012-Learning Partially Observable Models Using Temporally Abstract Decision Trees

18 0.55597395 344 nips-2012-Timely Object Recognition

19 0.53551376 250 nips-2012-On-line Reinforcement Learning Using Incremental Kernel-Based Stochastic Factorization

20 0.5176183 364 nips-2012-Weighted Likelihood Policy Search with Model Selection


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.028), (17, 0.02), (21, 0.034), (38, 0.117), (42, 0.04), (51, 0.151), (53, 0.013), (54, 0.076), (55, 0.017), (74, 0.073), (76, 0.141), (80, 0.096), (84, 0.014), (91, 0.011), (92, 0.042)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.87174773 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

Author: Trung Nguyen, Tomi Silander, Tze Y. Leong

Abstract: We study how to automatically select and adapt multiple abstractions or representations of the world to support model-based reinforcement learning. We address the challenges of transfer learning in heterogeneous environments with varying tasks. We present an efficient, online framework that, through a sequence of tasks, learns a set of relevant representations to be used in future tasks. Without predefined mapping strategies, we introduce a general approach to support transfer learning across different state spaces. We demonstrate the potential impact of our system through improved jumpstart and faster convergence to near optimum policy in two benchmark domains. 1

2 0.85752302 49 nips-2012-Automatic Feature Induction for Stagewise Collaborative Filtering

Author: Joonseok Lee, Mingxuan Sun, Guy Lebanon, Seung-jean Kim

Abstract: Recent approaches to collaborative filtering have concentrated on estimating an algebraic or statistical model, and using the model for predicting missing ratings. In this paper we observe that different models have relative advantages in different regions of the input space. This motivates our approach of using stagewise linear combinations of collaborative filtering algorithms, with non-constant combination coefficients based on kernel smoothing. The resulting stagewise model is computationally scalable and outperforms a wide selection of state-of-the-art collaborative filtering algorithms. 1

3 0.8269335 287 nips-2012-Random function priors for exchangeable arrays with applications to graphs and relational data

Author: James Lloyd, Peter Orbanz, Zoubin Ghahramani, Daniel M. Roy

Abstract: A fundamental problem in the analysis of structured relational data like graphs, networks, databases, and matrices is to extract a summary of the common structure underlying relations between individual entities. Relational data are typically encoded in the form of arrays; invariance to the ordering of rows and columns corresponds to exchangeable arrays. Results in probability theory due to Aldous, Hoover and Kallenberg show that exchangeable arrays can be represented in terms of a random measurable function which constitutes the natural model parameter in a Bayesian model. We obtain a flexible yet simple Bayesian nonparametric model by placing a Gaussian process prior on the parameter function. Efficient inference utilises elliptical slice sampling combined with a random sparse approximation to the Gaussian process. We demonstrate applications of the model to network data and clarify its relation to models in the literature, several of which emerge as special cases. 1

4 0.82509542 344 nips-2012-Timely Object Recognition

Author: Sergey Karayev, Tobias Baumgartner, Mario Fritz, Trevor Darrell

Abstract: In a large visual multi-class detection framework, the timeliness of results can be crucial. Our method for timely multi-class detection aims to give the best possible performance at any single point after a start time; it is terminated at a deadline time. Toward this goal, we formulate a dynamic, closed-loop policy that infers the contents of the image in order to decide which detector to deploy next. In contrast to previous work, our method significantly diverges from the predominant greedy strategies, and is able to learn to take actions with deferred values. We evaluate our method with a novel timeliness measure, computed as the area under an Average Precision vs. Time curve. Experiments are conducted on the PASCAL VOC object detection dataset. If execution is stopped when only half the detectors have been run, our method obtains 66% better AP than a random ordering, and 14% better performance than an intelligent baseline. On the timeliness measure, our method obtains at least 11% better performance. Our method is easily extensible, as it treats detectors and classifiers as black boxes and learns from execution traces using reinforcement learning. 1

5 0.82393831 36 nips-2012-Adaptive Stratified Sampling for Monte-Carlo integration of Differentiable functions

Author: Alexandra Carpentier, Rémi Munos

Abstract: We consider the problem of adaptive stratified sampling for Monte Carlo integration of a differentiable function given a finite number of evaluations to the function. We construct a sampling scheme that samples more often in regions where the function oscillates more, while allocating the samples such that they are well spread on the domain (this notion shares similitude with low discrepancy). We prove that the estimate returned by the algorithm is almost similarly accurate as the estimate that an optimal oracle strategy (that would know the variations of the function everywhere) would return, and provide a finite-sample analysis. 1

6 0.81637257 38 nips-2012-Algorithms for Learning Markov Field Policies

7 0.81568277 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

8 0.81434536 177 nips-2012-Learning Invariant Representations of Molecules for Atomization Energy Prediction

9 0.81232005 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization

10 0.81215358 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

11 0.81209373 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

12 0.81105447 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions

13 0.81005055 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

14 0.80357301 160 nips-2012-Imitation Learning by Coaching

15 0.8031581 348 nips-2012-Tractable Objectives for Robust Policy Optimization

16 0.80286121 234 nips-2012-Multiresolution analysis on the symmetric group

17 0.80214816 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

18 0.80083179 112 nips-2012-Efficient Spike-Coding with Multiplicative Adaptation in a Spike Response Model

19 0.80081207 303 nips-2012-Searching for objects driven by context

20 0.80074358 51 nips-2012-Bayesian Hierarchical Reinforcement Learning