nips nips2012 nips2012-51 knowledge-graph by maker-knowledge-mining

51 nips-2012-Bayesian Hierarchical Reinforcement Learning


Source: pdf

Author: Feng Cao, Soumya Ray

Abstract: We describe an approach to incorporating Bayesian priors in the MAXQ framework for hierarchical reinforcement learning (HRL). We define priors on the primitive environment model and on task pseudo-rewards. Since models for composite tasks can be complex, we use a mixed model-based/model-free learning approach to find an optimal hierarchical policy. We show empirically that (i) our approach results in improved convergence over non-Bayesian baselines, (ii) using both task hierarchies and Bayesian priors is better than either alone, (iii) taking advantage of the task hierarchy reduces the computational cost of Bayesian reinforcement learning and (iv) in this framework, task pseudo-rewards can be learned instead of being manually specified, leading to hierarchically optimal rather than recursively optimal policies. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We describe an approach to incorporating Bayesian priors in the MAXQ framework for hierarchical reinforcement learning (HRL). [sent-3, score-0.27]

2 We define priors on the primitive environment model and on task pseudo-rewards. [sent-4, score-0.413]

3 Since models for composite tasks can be complex, we use a mixed model-based/model-free learning approach to find an optimal hierarchical policy. [sent-5, score-0.144]

4 RL agents learn policies that map environment states to available actions while optimizing some measure of long-term utility. [sent-8, score-0.239]

5 For example, one approach introduces macro-operators for sequences of primitive actions. [sent-13, score-0.185]

6 Another idea is to decompose the task’s overall value function, for example by defining task hierarchies [5] or partial programs with choice points [6]. [sent-15, score-0.149]

7 The structure of the decomposition provides several benefits: first, for the “higher level” subtasks, policies are defined by calling “lower level” subtasks (which may themselves be quite complex); as a result policies for higher level subtasks may be expressed compactly. [sent-16, score-0.452]

8 Second, a task hierarchy or partial program can impose constraints on the space of policies by encoding knowledge about the structure of good policies and thereby reduce the search space. [sent-17, score-0.361]

9 Third, learning within subtasks allows state abstraction, that is, some state variables can be ignored because they do not affect the policy within that subtask. [sent-18, score-0.282]

10 Bayesian reinforcement learning addresses this issue by incorporating priors on models [7], value functions [8, 9] or policies [10]. [sent-22, score-0.335]

11 Specifying good priors leads to many benefits, including good initial policies, directed exploration towards regions of uncertainty, and faster convergence to the optimal policy. [sent-23, score-0.238]

12 In this paper, we propose an approach that incorporates Bayesian priors in hierarchical reinforcement learning. [sent-24, score-0.236]

13 We use the MAXQ framework [5], which decomposes the overall task into subtasks so that the value functions of the individual subtasks can be combined to recover the value function of the overall task. [sent-25, score-0.392]

14 We extend this framework by incorporating priors on the primitive environment model and on task pseudo-rewards. [sent-26, score-0.447]

15 We empirically evaluate our algorithm to understand the effect of the priors in addition to the task hierarchy. [sent-28, score-0.187]

16 For example, one source of prior information is multi-task reinforcement learning [11, 12], where an agent uses the solutions of previous RL tasks to build priors over models or policies for future tasks. [sent-33, score-0.413]

17 2 Background and Related Work In the MAXQ framework, each composite subtask Ti defines a semi-Markov decision process with parameters (Si, Xi, Ci, Gi). [sent-36, score-0.213]

18 An (s, a, s') triple where Pi(s) is false, Pi(s') is true, a ∈ Ci, and the transition probability P(s'|s, a) > 0, is called an exit of the subtask Ti. [sent-42, score-0.266]
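Formally, using only the notation of the sentence above, the exits of Ti form the set

$$\mathrm{Exits}(T_i) \;=\; \big\{(s, a, s') \;:\; \neg P_i(s),\; P_i(s'),\; a \in C_i,\; P(s' \mid s, a) > 0 \big\}.$$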

19 A pseudo-reward function R̃(s, a) can be defined over exits to express preferences over the possible exits of a subtask. [sent-43, score-0.138]

20 A hierarchical policy π for the overall task is an assignment of a local policy to each SMDP Ti . [sent-44, score-0.329]

21 A hierarchically optimal policy is a hierarchical policy that has the maximum expected reward. [sent-45, score-0.394]

22 A hierarchical policy is said to be recursively optimal if the local policy for each subtask is optimal given that all its subtask policies are optimal. [sent-46, score-0.764]

23 Given a task graph, model-free [5] or model-based [14] methods can be used to learn value functions for each task-subtask pair. [sent-47, score-0.13]

24 In the model-free method, a policy is produced by maintaining a value and a completion function for each subtask. [sent-48, score-0.187]

25 For a task i, the value V (a, s) denotes the expected value of calling child task a in state s. [sent-49, score-0.288]

26 The completion function C(i, s, a) denotes the expected reward obtained while completing i after having called a in s. [sent-51, score-0.185]
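For reference, these two quantities combine through Dietterich's MAXQ value-function decomposition [5], which the summary above refers to (standard notation; details may differ slightly from this paper's presentation):

$$Q(i, s, a) = V(a, s) + C(i, s, a), \qquad V(i, s) = \begin{cases} \max_{a} Q(i, s, a), & i \text{ composite},\\ \mathbb{E}\,[R(s, i)], & i \text{ primitive}. \end{cases}$$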

27 The model-based RMAXQ [14] algorithm extends RMAX [15] to MAXQ by learning models for all primitive and composite tasks. [sent-53, score-0.241]

28 An optimistic exploration strategy is used together with a parameter m that determines how often a transition or reward needs to be seen to be usable in the planning step. [sent-55, score-0.175]

29 In the MAXQ framework, pseudo-rewards must be manually specified to learn hierarchically optimal policies. [sent-56, score-0.198]

30 Recent work has attempted to directly learn hierarchically optimal policies for ALisp partial programs, which generalize MAXQ task hierarchies [6, 16], using a model-free approach. [sent-57, score-0.383]

31 Here, along with task value and completion functions, an “external” Q function QE is maintained for each subtask. [sent-58, score-0.164]

32 This function stores the reward obtained after the parent of a subtask exits. [sent-59, score-0.318]

33 In later work [16], this is addressed by recursively representing QE in terms of task value and completion functions, linked by conditional probabilities of parent exits given child exits. [sent-61, score-0.377]

34 Bayesian reinforcement learning methods incorporate probabilistic prior knowledge on models [7], value functions [8, 9], policies [10] or combinations [17]. [sent-63, score-0.234]

35 This model is then solved and actions are taken according to the policy obtained. [sent-67, score-0.169]

36 This approach adds priors to each subtask model and performs (separate) Bayesian model-based learning for each subtask. [sent-73, score-0.264]

37 Instead, we only maintain distributions over primitive actions, and use a mixed model-based/model-free learning algorithm that is naturally integrated with the standard MAXQ learning algorithm. [sent-75, score-0.185]

38 3 Bayesian MAXQ Algorithm In this section, we describe our approach to incorporating probabilistic priors into MAXQ. [sent-77, score-0.141]

39 As we explain below, pseudo-rewards are value functions; thus our approach uses priors both on models and value functions. [sent-79, score-0.153]

40 We first describe our approach to incorporating priors on environment models alone (assuming pseudo-rewards are fixed). [sent-81, score-0.182]

41 At the start of each episode, the BAYESIAN MAXQ routine is called with the Root task and the initial state for the current episode. [sent-86, score-0.15]

42 The MAXQ execution protocol is then followed, where each task chooses an action based on its current value function (initially random). [sent-87, score-0.164]

43 When a primitive action is reached and executed, it updates the posterior over model parameters (Line 3) and its own value estimate (which is just the reward function for primitive actions). [sent-88, score-0.588]
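The paper states (see the experimental setup below) that Dirichlet priors are placed on transition parameters and Normal-Gamma priors on rewards. As a purely illustrative sketch of the Line 3 posterior update, and of the resampling that RECOMPUTEVALUE performs later, a minimal conjugate-update class might look like the following; the class, method, and parameter names are ours, not the authors':

```python
import numpy as np

class PrimitiveModelPosterior:
    """Conjugate posterior over one primitive action's model: a Dirichlet
    over next-state transitions per state, and a Normal-Gamma over the
    (assumed Gaussian) reward per state."""

    def __init__(self, n_states, alpha0=1.0, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0):
        self.alpha = np.full((n_states, n_states), alpha0)  # Dirichlet counts
        self.mu = np.full(n_states, mu0)    # Normal-Gamma hyperparameters
        self.lam = np.full(n_states, lam0)
        self.a = np.full(n_states, a0)
        self.b = np.full(n_states, b0)

    def update(self, s, r, s_next):
        """Posterior update after observing (s, a, r, s') for this action
        (the step referred to as Line 3 above)."""
        self.alpha[s, s_next] += 1.0
        mu, lam, a, b = self.mu[s], self.lam[s], self.a[s], self.b[s]
        self.mu[s] = (lam * mu + r) / (lam + 1.0)
        self.lam[s] = lam + 1.0
        self.a[s] = a + 0.5
        self.b[s] = b + 0.5 * lam * (r - mu) ** 2 / (lam + 1.0)

    def sample(self, rng):
        """Draw one model (transition matrix and mean rewards) from the
        current posterior, as RECOMPUTEVALUE does when it resamples."""
        trans = np.array([rng.dirichlet(row) for row in self.alpha])
        precision = rng.gamma(self.a, 1.0 / self.b)
        mean_reward = rng.normal(self.mu, 1.0 / np.sqrt(self.lam * precision))
        return trans, mean_reward
```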

44 When a task exits and returns to its parent, the parent subsequently updates its completion function based on the current estimates of the value of the exit state (Lines 14 and 15). [sent-89, score-0.399]

45 Finally, each primitive action maintains a count of how many times it has been executed and each composite task maintains a count of how many child actions have been taken. [sent-91, score-0.563]

46 When k (an algorithm parameter) steps have been executed in a composite task, BAYESIAN MAXQ calls RECOMPUTEVALUE to re-estimate the value and completion functions (the check on k is shown in RECOMPUTEVALUE, Line 2). [sent-92, score-0.163]

47 When activated, this function recursively re-estimates the value/completion functions for all subtasks of the current task. [sent-93, score-0.218]

48 At the level of a primitive action, this simply involves resampling the reward and transition parameters from the current posterior over models. [sent-94, score-0.385]

49 We run this algorithm for Sim episodes, starting with the current subtask as the root, with the current pseudo-reward estimates (we explain below how these are obtained). [sent-96, score-0.199]

50 This algorithm recursively updates the completion function of the task graph below the current task. [sent-97, score-0.226]

51 Note that in this step, the subtasks with primitive actions use model-based updates. [sent-98, score-0.384]

52 That is, when a primitive action is “executed” in such tasks, the currently sampled transition function (part of Θ in Line 5) is used to find the next state, and then the associated reward is used to update the completion function. [sent-99, score-0.462]
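A minimal sketch of such a simulated step and the subsequent completion-function update is shown below. The update rule is the standard MAXQ-0 rule from [5]; the function names, the alpha/gamma defaults, the layout of the sampled model theta, and the `children` task-graph map are our own assumptions rather than the authors' implementation.

```python
import numpy as np

def simulated_primitive_step(theta, state, action, rng):
    """'Execute' a primitive action against the sampled model Theta
    (drawn in RECOMPUTEVALUE) rather than the real environment.
    Assumed layout: theta[action] = (transition_matrix, mean_reward)."""
    trans, mean_reward = theta[action]
    next_state = rng.choice(len(trans[state]), p=trans[state])
    return next_state, mean_reward[state]

def update_completion(C, V, children, task, state, action, next_state,
                      n_steps, alpha=0.1, gamma=0.95):
    """Standard MAXQ-0 completion-function update for composite `task`
    after child `action`, called in `state`, terminated in `next_state`
    having taken n_steps primitive steps."""
    best = max(V[(a, next_state)] + C[(task, next_state, a)]
               for a in children[task])
    C[(task, state, action)] += alpha * (gamma ** n_steps * best
                                         - C[(task, state, action)])
```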

53 A PR is a value associated with a subtask exit that defines how “good” that exit is for that subtask. [sent-105, score-0.35]

54 The ideal PR for an exit is the expected reward under the hierarchically optimal policy after exiting the subtask, until the global task (Root) ends; thus the PR is a value function. [sent-106, score-0.56]
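Restated in symbols (our notation, not a formula quoted from the paper): for an exit (s, a) of subtask Ti, the ideal pseudo-reward under the hierarchically optimal policy π* is

$$\tilde{R}(s, a) \;=\; \mathbb{E}_{\pi^*}\Big[\textstyle\sum_{t \ge 0} \gamma^{t} r_{t} \,\Big|\, T_i \text{ exits via } (s, a)\Big],$$

that is, the expected discounted reward accumulated from the moment of the exit until the Root task terminates.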

55 This PR would enable the subtask to choose the “right” exit in the context of what the rest of the task hierarchy is doing. [sent-107, score-0.403]

56 This is problematic because it presupposes (quite detailed) knowledge of the hierarchically optimal policy. [sent-109, score-0.145]

57 However, it is easy to construct examples where a recursively optimal policy is arbitrarily worse than the hierarchically optimal policy. [sent-112, score-0.528]

58 Algorithm 3 UPDATEPSEUDOREWARD. Input: taskStack, parent's pseudo-reward R̃p. 1: tempCR ← R̃p, Na ← 0, cr ← 0; 2: while taskStack is not empty do 3: ⟨a, s, N'a, cr'⟩ ← taskStack.pop(); 4: tempCR ← γ^Na · tempCR + cr; 5: update the pseudo-reward posterior Φ for R̃(a, s) using (a, s, tempCR); 6: resample R̃(a, s) from Φ; 7: Na ← N'a, cr ← cr'; 8: end while. [sent-113, score-0.443]

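A small Python rendering of this routine follows, assuming each pseudo-reward posterior Φ is a small object with update/sample methods (for example, Normal-Gamma machinery like the sketch given earlier); all names and the gamma default are ours, and details may differ from the authors' implementation.

```python
def update_pseudo_rewards(task_stack, parent_pr, pr_posterior, pr_estimate,
                          rng, gamma=0.95):
    """Walk the child-execution stack backwards (Algorithm 3): the discounted
    reward obtained after each child's exit becomes a new observation for that
    child's pseudo-reward posterior Phi, which is then resampled."""
    temp_cr, n_a, cr = parent_pr, 0, 0.0
    while task_stack:
        # child task a, the state s it was called in, its step count, its reward
        a, s, n_a_next, cr_next = task_stack.pop()
        temp_cr = (gamma ** n_a) * temp_cr + cr   # reward seen after a's exit
        pr_posterior[(a, s)].update(temp_cr)      # update Phi for R~(a, s)
        pr_estimate[(a, s)] = pr_posterior[(a, s)].sample(rng)  # resample R~(a, s)
        n_a, cr = n_a_next, cr_next
```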
59 That is, instead of exploration through action selection, Bayesian RL can perform exploration by sampling task parameters. [sent-121, score-0.174]

60 In this way, hierarchically optimal policies can be learned from scratch—a major advantage over the standard MAXQ setting. [sent-123, score-0.23]

61 For simplicity, we only focus on tasks with multiple exits, since otherwise, a PR has no effect on the policy (though the value function changes). [sent-125, score-0.145]

62 When a composite task executes, we keep track of each child task’s execution in a stack. [sent-126, score-0.179]

63 When the parent itself exits, we obtain a new observation of the PRs of each child by computing the discounted cumulative reward received after it exited, added to the current estimate of the parent’s PR (Algorithm 3). [sent-127, score-0.251]

64 Combined with the model-based priors above, we hypothesize that this procedure, iterated till convergence, will produce a hierarchically optimal policy. [sent-134, score-0.252]

65 4 Empirical Evaluation In this section, we evaluate our approach and test four hypotheses: First, does incorporating model-based priors help speed up the convergence of MAXQ to the optimal policy? [sent-135, score-0.185]

66 Second, does the task hierarchy still matter if very good priors are available for primitive actions? [sent-136, score-0.483]

67 Third, does Bayesian RL perform better (in terms of computational time) if a task hierarchy is available? [sent-138, score-0.161]

68 Finally, can our approach effectively learn PRs and policies that are hierarchically optimal? [sent-139, score-0.231]

69 The actions available to the agent consist of navigation actions and actions to pick up and put down the passenger. [sent-144, score-0.32]

70 The agent gets a reward of +20 upon completing the task, a constant −1 reward for every action and a −10 penalty for an erroneous action. [sent-145, score-0.381]

71 Here the state variables consist of the location of the agent, what the agent is carrying, whether a goldmine or forest is adjacent to its current location and whether a desired gold or wood quota has been met. [sent-153, score-0.294]

72 The actions available to the agent are to move to a specific location, chop gold or harvest wood, and to deposit the item it is carrying (if any). [sent-154, score-0.239]

73 In our experiments, the map contains two goldmines and two forests, with each goldmine containing two units of gold and each forest two units of wood, and the gold and wood quotas are set to three each. [sent-156, score-0.152]

74 The agent gets a +50 reward when it meets the gold/wood quota, a constant −1 reward for every action and an additional −1 for erroneous actions (such as trying to deposit when it is not carrying anything). [sent-157, score-0.508]

75 For the Bayesian methods, we use Dirichlet priors for the transition function parameters and Normal-Gamma priors for the reward function parameters. [sent-158, score-0.362]

76 We use two priors: an uninformed prior, set to approximate a uniform distribution, and a “good” prior, where a previously computed model posterior is used as the “prior.” [sent-159, score-0.209]

77 In general, this is not necessary; more complex priors could be used as long as we can sample from the posterior distribution. [sent-161, score-0.138]

78 In our implementation, the Bayesian model-based Q-learning uses the same code as the Bayesian MAXQ algorithm, with a “trivial” hierarchy consisting of the Root task with only the primitive actions as children. [sent-163, score-0.412]

79 From these results, comparing the Bayesian versions of MAXQ to standard MAXQ, we observe that for Taxi-World, the Bayesian version converges faster to the optimal policy even with the uninformed prior, while for Resource-collection, the convergence rates are similar. [sent-173, score-0.285]

80 We conjecture that, as the environment becomes more stochastic, errors in primitive model estimates may propagate into subtask models and hurt the performance of this algorithm. [sent-179, score-0.399]

81 In their analysis [14], the authors noted that the error in the transition function for a composite task is a function of the total number of terminal states in the subtask. [sent-180, score-0.18]

82 This would improve the accuracy of the primitive model, but would further hurt the convergence rate of the algorithm. [sent-183, score-0.219]

83 We note that in Taxi-World, with uninformed priors, though the “flat” method initially does worse, it soon catches up to standard MAXQ and then to Bayesian MAXQ. [sent-185, score-0.155]

84 This is probably because in this domain, the primitive models are relatively easy to acquire, and the task hierarchy provides no additional leverage. [sent-186, score-0.346]

85 The difference is that in this case, the task hierarchy encodes extra information that cannot be deduced just from the models. [sent-188, score-0.161]

86 In particular, the task hierarchy tells the agent that good policies consist of gold/wood collection moves followed by deposit moves. [sent-189, score-0.408]

87 Since the reward structure in this domain is very sparse, it is difficult to deduce this even if very good models are available. [sent-190, score-0.175]

88 Taken together, these results show that task hierarchies and model priors can be complementary: in general, Bayesian MAXQ outperforms both flat Bayesian RL and MAXQ (in speed of convergence, since here MAXQ can learn the hierarchically optimal policy). [sent-191, score-0.405]

89 This is because for the flat case, during the simulation in RECOMPUTEVALUE, a much larger task needs to be solved, while the Bayesian MAXQ approach is able to take into account the structure of the hierarchy to only simulate subtasks as needed, which ends up being much more efficient. [sent-199, score-0.294]

90 In Modified-Taxi-World, we allow dropoffs at any one of the four locations and do not provide a reward for task termination. [sent-207, score-0.204]

91 Thus the Navigate subtask needs a PR (corresponding to the correct dropoff location) to learn a good policy. [sent-208, score-0.214]

92 For these experiments, we use uninformed priors on the environment model. [sent-211, score-0.286]

93 From these results, we first observe that the methods with zero PR always do worse than those with “proper” PR, indicating that in these cases the recursively optimal policy is not the hierarchically optimal policy. [sent-221, score-0.338]

94 We observe that in each case, the Bayesian MAXQ approach is able to learn a policy that is just as good while starting with no pseudo-rewards; further, its convergence rates are often better. [sent-223, score-0.181]

95 Finally, we observe that the tripartite Q-decomposition of ALISPQ is also able to correctly learn hierarchically optimal policies; however, it converges slowly compared to Bayesian MAXQ or MAXQ with manual PRs. [sent-225, score-0.198]

96 In a sense, it is doing more work than is needed to capture the hierarchically optimal policy, because an exact Q-function may not be needed to capture the preference for the best exit; rather, a value that assigns it a sufficiently high reward compared to the other exits would suffice. [sent-228, score-0.393]

97 Taken together, these results indicate that our approach of incorporating Bayesian priors into MAXQ can successfully learn PRs from scratch and produce hierarchically optimal policies. [sent-229, score-0.33]

98 5 Conclusion In this paper, we have proposed an approach to incorporating probabilistic priors on environment models and task pseudo-rewards into HRL by extending the MAXQ framework. [sent-230, score-0.262]

99 Hierarchical reinforcement learning with the MAXQ value function decomposition. [sent-254, score-0.877]

100 Model-based learning with Bayesian and MAXQ value function decomposition for hierarchical task. [sent-331, score-0.987]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('maxq', 0.768), ('primitive', 0.185), ('subtask', 0.157), ('bayesian', 0.153), ('uninformed', 0.138), ('pr', 0.133), ('prs', 0.133), ('subtasks', 0.133), ('reward', 0.124), ('hierarchically', 0.119), ('rl', 0.116), ('hrl', 0.109), ('priors', 0.107), ('policy', 0.103), ('cr', 0.089), ('reinforcement', 0.086), ('exit', 0.085), ('policies', 0.085), ('ecompute', 0.085), ('hierarchy', 0.081), ('task', 0.08), ('agent', 0.076), ('exits', 0.069), ('actions', 0.066), ('episodes', 0.065), ('recursively', 0.064), ('episode', 0.063), ('completion', 0.061), ('na', 0.061), ('hallway', 0.061), ('sim', 0.057), ('composite', 0.056), ('qe', 0.048), ('taskstack', 0.048), ('tempcr', 0.048), ('hierarchies', 0.046), ('wood', 0.044), ('child', 0.043), ('hierarchical', 0.043), ('environment', 0.041), ('prior', 0.04), ('action', 0.04), ('ti', 0.039), ('parent', 0.037), ('alispq', 0.036), ('deposit', 0.036), ('flatq', 0.036), ('quota', 0.036), ('gold', 0.036), ('count', 0.035), ('flat', 0.035), ('incorporating', 0.034), ('pseudo', 0.033), ('posterior', 0.031), ('good', 0.03), ('update', 0.028), ('exploration', 0.027), ('learn', 0.027), ('navigation', 0.026), ('routine', 0.026), ('cumulative', 0.026), ('manual', 0.026), ('optimal', 0.026), ('manually', 0.026), ('carrying', 0.025), ('ray', 0.024), ('cleveland', 0.024), ('pdate', 0.024), ('proaches', 0.024), ('transition', 0.024), ('executed', 0.023), ('state', 0.023), ('value', 0.023), ('iv', 0.022), ('pack', 0.021), ('reserve', 0.021), ('soumya', 0.021), ('taxi', 0.021), ('abstraction', 0.021), ('domain', 0.021), ('current', 0.021), ('domains', 0.02), ('consist', 0.02), ('states', 0.02), ('tasks', 0.019), ('location', 0.019), ('dearden', 0.019), ('western', 0.019), ('convergence', 0.018), ('gi', 0.018), ('omnipress', 0.018), ('simulations', 0.017), ('scratch', 0.017), ('erroneous', 0.017), ('fifteenth', 0.017), ('initially', 0.017), ('needed', 0.016), ('calling', 0.016), ('cao', 0.016), ('hurt', 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999946 51 nips-2012-Bayesian Hierarchical Reinforcement Learning

Author: Feng Cao, Soumya Ray

Abstract: We describe an approach to incorporating Bayesian priors in the MAXQ framework for hierarchical reinforcement learning (HRL). We define priors on the primitive environment model and on task pseudo-rewards. Since models for composite tasks can be complex, we use a mixed model-based/model-free learning approach to find an optimal hierarchical policy. We show empirically that (i) our approach results in improved convergence over non-Bayesian baselines, (ii) using both task hierarchies and Bayesian priors is better than either alone, (iii) taking advantage of the task hierarchy reduces the computational cost of Bayesian reinforcement learning and (iv) in this framework, task pseudo-rewards can be learned instead of being manually specified, leading to hierarchically optimal rather than recursively optimal policies. 1

2 0.12961501 38 nips-2012-Algorithms for Learning Markov Field Policies

Author: Abdeslam Boularias, Jan R. Peters, Oliver B. Kroemer

Abstract: We use a graphical model for representing policies in Markov Decision Processes. This new representation can easily incorporate domain knowledge in the form of a state similarity graph that loosely indicates which states are supposed to have similar optimal actions. A bias is then introduced into the policy search process by sampling policies from a distribution that assigns high probabilities to policies that agree with the provided state similarity graph, i.e. smoother policies. This distribution corresponds to a Markov Random Field. We also present forward and inverse reinforcement learning algorithms for learning such policy distributions. We illustrate the advantage of the proposed approach on two problems: cart-balancing with swing-up, and teaching a robot to grasp unknown objects. 1

3 0.12787978 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

Author: Jiarong Jiang, Adam Teichert, Jason Eisner, Hal Daume

Abstract: Users want inference to be both fast and accurate, but quality often comes at the cost of speed. The field has experimented with approximate inference algorithms that make different speed-accuracy tradeoffs (for particular problems and datasets). We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing [12]. Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is simply too large to explore naively. An attempt to counteract this by applying imitation learning algorithms also fails: the “teacher” follows a far better policy than anything in our learner’s policy space, free of the speed-accuracy tradeoff that arises when oracle information is unavailable, and thus largely insensitive to the known reward function. We propose a hybrid reinforcement/apprenticeship learning algorithm that learns to speed up an initial policy, trading off accuracy for speed according to various settings of a speed term in the loss function. 1

4 0.12616037 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

Author: Manuel Lopes, Tobias Lang, Marc Toussaint, Pierre-yves Oudeyer

Abstract: Formal exploration approaches in model-based reinforcement learning estimate the accuracy of the currently learned model without consideration of the empirical prediction error. For example, PAC-MDP approaches such as R- MAX base their model certainty on the amount of collected data, while Bayesian approaches assume a prior over the transition dynamics. We propose extensions to such approaches which drive exploration solely based on empirical estimates of the learner’s accuracy and learning progress. We provide a “sanity check” theoretical analysis, discussing the behavior of our extensions in the standard stationary finite state-action case. We then provide experimental studies demonstrating the robustness of these exploration measures in cases of non-stationary environments or where original approaches are misled by wrong domain assumptions. 1

5 0.12335594 245 nips-2012-Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions

Author: Jaedeug Choi, Kee-eung Kim

Abstract: We present a nonparametric Bayesian approach to inverse reinforcement learning (IRL) for multiple reward functions. Most previous IRL algorithms assume that the behaviour data is obtained from an agent who is optimizing a single reward function, but this assumption is hard to guarantee in practice. Our approach is based on integrating the Dirichlet process mixture model into Bayesian IRL. We provide an efficient Metropolis-Hastings sampling algorithm utilizing the gradient of the posterior to estimate the underlying reward functions, and demonstrate that our approach outperforms previous ones via experiments on a number of problem domains. 1

6 0.11944988 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

7 0.11545683 153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model

8 0.11185574 344 nips-2012-Timely Object Recognition

9 0.11046167 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

10 0.10705162 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

11 0.10559835 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

12 0.097662412 348 nips-2012-Tractable Objectives for Robust Policy Optimization

13 0.096228756 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

14 0.090011194 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

15 0.084578268 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

16 0.074343525 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

17 0.070623942 160 nips-2012-Imitation Learning by Coaching

18 0.066586539 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification

19 0.060699776 296 nips-2012-Risk Aversion in Markov Decision Processes via Near Optimal Chernoff Bounds

20 0.05667701 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.13), (1, -0.234), (2, -0.038), (3, -0.009), (4, -0.079), (5, 0.001), (6, -0.003), (7, -0.01), (8, 0.018), (9, -0.015), (10, -0.018), (11, 0.007), (12, -0.017), (13, 0.019), (14, 0.006), (15, 0.028), (16, -0.002), (17, 0.032), (18, 0.014), (19, -0.033), (20, -0.034), (21, 0.04), (22, 0.003), (23, 0.016), (24, 0.02), (25, 0.015), (26, 0.028), (27, -0.056), (28, 0.033), (29, -0.011), (30, 0.012), (31, -0.018), (32, 0.035), (33, -0.037), (34, -0.053), (35, 0.033), (36, 0.038), (37, -0.03), (38, 0.028), (39, -0.026), (40, -0.007), (41, -0.029), (42, 0.057), (43, -0.002), (44, -0.042), (45, 0.008), (46, 0.032), (47, -0.034), (48, -0.069), (49, 0.053)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94565481 51 nips-2012-Bayesian Hierarchical Reinforcement Learning

Author: Feng Cao, Soumya Ray

Abstract: We describe an approach to incorporating Bayesian priors in the MAXQ framework for hierarchical reinforcement learning (HRL). We define priors on the primitive environment model and on task pseudo-rewards. Since models for composite tasks can be complex, we use a mixed model-based/model-free learning approach to find an optimal hierarchical policy. We show empirically that (i) our approach results in improved convergence over non-Bayesian baselines, (ii) using both task hierarchies and Bayesian priors is better than either alone, (iii) taking advantage of the task hierarchy reduces the computational cost of Bayesian reinforcement learning and (iv) in this framework, task pseudo-rewards can be learned instead of being manually specified, leading to hierarchically optimal rather than recursively optimal policies. 1

2 0.86089677 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

Author: Trung Nguyen, Tomi Silander, Tze Y. Leong

Abstract: We study how to automatically select and adapt multiple abstractions or representations of the world to support model-based reinforcement learning. We address the challenges of transfer learning in heterogeneous environments with varying tasks. We present an efficient, online framework that, through a sequence of tasks, learns a set of relevant representations to be used in future tasks. Without predefined mapping strategies, we introduce a general approach to support transfer learning across different state spaces. We demonstrate the potential impact of our system through improved jumpstart and faster convergence to near optimum policy in two benchmark domains. 1

3 0.81393147 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

Author: Arthur Guez, David Silver, Peter Dayan

Abstract: Bayesian model-based reinforcement learning is a formally elegant approach to learning optimal behaviour under model uncertainty, trading off exploration and exploitation in an ideal way. Unfortunately, finding the resulting Bayes-optimal policies is notoriously taxing, since the search space becomes enormous. In this paper we introduce a tractable, sample-based method for approximate Bayesoptimal planning which exploits Monte-Carlo tree search. Our approach outperformed prior Bayesian model-based RL algorithms by a significant margin on several well-known benchmark problems – because it avoids expensive applications of Bayes rule within the search tree by lazily sampling models from the current beliefs. We illustrate the advantages of our approach by showing it working in an infinite state space domain which is qualitatively out of reach of almost all previous work in Bayesian exploration. 1

4 0.79638278 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

Author: Dongho Kim, Kee-eung Kim, Pascal Poupart

Abstract: In this paper, we consider Bayesian reinforcement learning (BRL) where actions incur costs in addition to rewards, and thus exploration has to be constrained in terms of the expected total cost while learning to maximize the expected longterm total reward. In order to formalize cost-sensitive exploration, we use the constrained Markov decision process (CMDP) as the model of the environment, in which we can naturally encode exploration requirements using the cost function. We extend BEETLE, a model-based BRL method, for learning in the environment with cost constraints. We demonstrate the cost-sensitive exploration behaviour in a number of simulated problems. 1

5 0.77430171 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

Author: Jiarong Jiang, Adam Teichert, Jason Eisner, Hal Daume

Abstract: Users want inference to be both fast and accurate, but quality often comes at the cost of speed. The field has experimented with approximate inference algorithms that make different speed-accuracy tradeoffs (for particular problems and datasets). We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing [12]. Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is simply too large to explore naively. An attempt to counteract this by applying imitation learning algorithms also fails: the “teacher” follows a far better policy than anything in our learner’s policy space, free of the speed-accuracy tradeoff that arises when oracle information is unavailable, and thus largely insensitive to the known reward function. We propose a hybrid reinforcement/apprenticeship learning algorithm that learns to speed up an initial policy, trading off accuracy for speed according to various settings of a speed term in the loss function. 1

6 0.76470029 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

7 0.76283479 245 nips-2012-Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions

8 0.71458751 153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model

9 0.70201337 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

10 0.67984223 350 nips-2012-Trajectory-Based Short-Sighted Probabilistic Planning

11 0.6594243 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

12 0.63957161 38 nips-2012-Algorithms for Learning Markov Field Policies

13 0.61594856 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes

14 0.60217786 183 nips-2012-Learning Partially Observable Models Using Temporally Abstract Decision Trees

15 0.5997268 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

16 0.59278244 31 nips-2012-Action-Model Based Multi-agent Plan Recognition

17 0.59187263 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

18 0.58920705 348 nips-2012-Tractable Objectives for Robust Policy Optimization

19 0.5727517 160 nips-2012-Imitation Learning by Coaching

20 0.56684208 250 nips-2012-On-line Reinforcement Learning Using Incremental Kernel-Based Stochastic Factorization


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.059), (21, 0.017), (38, 0.103), (42, 0.028), (54, 0.081), (55, 0.014), (74, 0.052), (76, 0.114), (80, 0.095), (84, 0.278), (92, 0.037)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.77901876 234 nips-2012-Multiresolution analysis on the symmetric group

Author: Risi Kondor, Walter Dempsey

Abstract: There is no generally accepted way to define wavelets on permutations. We address this issue by introducing the notion of coset based multiresolution analysis (CMRA) on the symmetric group, find the corresponding wavelet functions, and describe a fast wavelet transform for sparse signals. We discuss potential applications in ranking, sparse approximation, and multi-object tracking. 1

2 0.75213969 102 nips-2012-Distributed Non-Stochastic Experts

Author: Varun Kanade, Zhenming Liu, Bozidar Radunovic

Abstract: We consider the online distributed non-stochastic experts problem, where the distributed system consists of one coordinator node that is connected to k sites, and the sites are required to communicate with each other via the coordinator. At each time-step t, one of the k site nodes has to pick an expert from the set {1, . . . , n}, and the same site receives information about payoffs of all experts for that round. The goal of the distributed system is to minimize regret at time horizon T , while simultaneously keeping communication to a minimum. The two extreme solutions to this problem are: (i) Full communication: This essentially simulates the nondistributed setting to obtain the optimal O( log(n)T ) regret bound at the cost of T communication. (ii) No communication: Each site runs an independent copy – the regret is O( log(n)kT ) and the communication is 0. This paper shows the √ difficulty of simultaneously achieving regret asymptotically better than kT and communication better than T . We give a novel algorithm that for an oblivious √ adversary achieves a non-trivial trade-off: regret O( k 5(1+ )/6 T ) and communication O(T /k ), for any value of ∈ (0, 1/5). We also consider a variant of the model, where the coordinator picks the expert. In this model, we show that the label-efficient forecaster of Cesa-Bianchi et al. (2005) already gives us strategy that is near optimal in regret vs communication trade-off. 1

same-paper 3 0.75125104 51 nips-2012-Bayesian Hierarchical Reinforcement Learning

Author: Feng Cao, Soumya Ray

Abstract: We describe an approach to incorporating Bayesian priors in the MAXQ framework for hierarchical reinforcement learning (HRL). We define priors on the primitive environment model and on task pseudo-rewards. Since models for composite tasks can be complex, we use a mixed model-based/model-free learning approach to find an optimal hierarchical policy. We show empirically that (i) our approach results in improved convergence over non-Bayesian baselines, (ii) using both task hierarchies and Bayesian priors is better than either alone, (iii) taking advantage of the task hierarchy reduces the computational cost of Bayesian reinforcement learning and (iv) in this framework, task pseudo-rewards can be learned instead of being manually specified, leading to hierarchically optimal rather than recursively optimal policies. 1

4 0.74667972 59 nips-2012-Bayesian nonparametric models for bipartite graphs

Author: Francois Caron

Abstract: We develop a novel Bayesian nonparametric model for random bipartite graphs. The model is based on the theory of completely random measures and is able to handle a potentially infinite number of nodes. We show that the model has appealing properties and in particular it may exhibit a power-law behavior. We derive a posterior characterization, a generative process for network growth, and a simple Gibbs sampler for posterior simulation. Our model is shown to be well fitted to several real-world social networks. 1

5 0.63127726 264 nips-2012-Optimal kernel choice for large-scale two-sample tests

Author: Arthur Gretton, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, Kenji Fukumizu, Bharath K. Sriperumbudur

Abstract: Given samples from distributions p and q, a two-sample test determines whether to reject the null hypothesis that p = q, based on the value of a test statistic measuring the distance between the samples. One choice of test statistic is the maximum mean discrepancy (MMD), which is a distance between embeddings of the probability distributions in a reproducing kernel Hilbert space. The kernel used in obtaining these embeddings is critical in ensuring the test has high power, and correctly distinguishes unlike distributions with high probability. A means of parameter selection for the two-sample test based on the MMD is proposed. For a given test level (an upper bound on the probability of making a Type I error), the kernel is chosen so as to maximize the test power, and minimize the probability of making a Type II error. The test statistic, test threshold, and optimization over the kernel parameters are obtained with cost linear in the sample size. These properties make the kernel selection and test procedures suited to data streams, where the observations cannot all be stored in memory. In experiments, the new kernel selection approach yields a more powerful test than earlier kernel selection heuristics.

6 0.60238159 287 nips-2012-Random function priors for exchangeable arrays with applications to graphs and relational data

7 0.59666759 177 nips-2012-Learning Invariant Representations of Molecules for Atomization Energy Prediction

8 0.59481043 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions

9 0.59150356 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

10 0.59048349 344 nips-2012-Timely Object Recognition

11 0.58816713 70 nips-2012-Clustering by Nonnegative Matrix Factorization Using Graph Random Walk

12 0.58491129 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

13 0.5811578 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

14 0.58101106 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

15 0.57922536 160 nips-2012-Imitation Learning by Coaching

16 0.57816744 38 nips-2012-Algorithms for Learning Markov Field Policies

17 0.57635581 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

18 0.57562709 292 nips-2012-Regularized Off-Policy TD-Learning

19 0.57502276 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

20 0.57319868 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models