nips nips2011 nips2011-48 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Thomas J. Walsh, Daniel K. Hewlett, Clayton T. Morrison
Abstract: We present theoretical and empirical results for a framework that combines the benefits of apprenticeship and autonomous reinforcement learning. Our approach modifies an existing apprenticeship learning framework that relies on teacher demonstrations and does not necessarily explore the environment. The first change is replacing previously used Mistake Bound model learners with a recently proposed framework that melds the KWIK and Mistake Bound supervised learning protocols. The second change is introducing a communication of expected utility from the student to the teacher. The resulting system only uses teacher traces when the agent needs to learn concepts it cannot efficiently learn on its own. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We present theoretical and empirical results for a framework that combines the benefits of apprenticeship and autonomous reinforcement learning. [sent-6, score-0.319]
2 Our approach modifies an existing apprenticeship learning framework that relies on teacher demonstrations and does not necessarily explore the environment. [sent-7, score-0.739]
3 The second change is introducing a communication of expected utility from the student to the teacher. [sent-9, score-0.19]
4 The resulting system only uses teacher traces when the agent needs to learn concepts it cannot efficiently learn on its own. [sent-10, score-0.923]
5 1 Introduction As problem domains become more complex, human guidance becomes increasingly necessary to improve agent performance. [sent-11, score-0.221]
6 For instance, apprenticeship learning, where teachers demonstrate behaviors for agents to follow, has been used to train agents to control complicated systems such as helicopters [1]. [sent-12, score-0.301]
7 However, most work on this topic burdens the teacher with demonstrating even the simplest nuances of a task. [sent-13, score-0.489]
8 By contrast, in autonomous reinforcement learning [2] a number of domain classes can be efficiently learned by an actively exploring agent, although this set of classes is provably smaller than the set learnable with the help of a teacher [3]. [sent-14, score-0.75]
9 Either the agent learns autonomously and eschews the larger learning capacity afforded by teacher interaction, or it overburdens the teacher by not exploring simple concepts it could garner on its own. [sent-16, score-1.341]
10 We show how to build a provably efficient learning system that balances teacher demonstrations and autonomous exploration. [sent-18, score-0.623]
11 Specifically, our protocol and algorithms cause a teacher to only step in when its advice will be significantly more helpful than autonomous exploration by the agent. [sent-19, score-0.704]
12 We extend a previously proposed apprenticeship learning protocol [3] where a learning agent and teacher take turns running trajectories. [sent-20, score-0.946]
13 This version of apprenticeship learning is fundamentally different from Inverse Reinforcement Learning [4] and imitation learning [5] because our agents are allowed to enact better policies than their teachers and observe reward signals. [sent-21, score-0.331]
14 In this setting, the number of times the teacher outperforms the student was proven to be related to the learnability of the domain class in a mistake bound predictor (MBP) framework. [sent-22, score-0.734]
15 Our work modifies previous apprenticeship learning efforts in two ways. [sent-23, score-0.179]
16 First, we will show that replacing the MBP framework with a different learning architecture called KWIK-MBP (based on a similar recently proposed protocol [6]) indicates areas where the agent should autonomously explore, and melds autonomous and apprenticeship learning. [sent-24, score-0.636]
17 However, this change alone is not sufficient to keep the teacher from intervening when an agent is capable of learning on its own. [sent-25, score-0.71]
18 Hence, we introduce a communication of the agent’s expected utility, which provides enough information for the teacher to decide whether or not to provide a trace (a property not shared by any of the previous efforts). [sent-26, score-0.673]
19 We then discuss how to relax the communication requirement when the teacher observes the student for many episodes. [sent-28, score-0.647]
20 This gives us the first apprenticeship learning framework where a teacher only shows demonstrations when they are needed for efficient learning, and gracefully blends autonomous exploration and apprenticeship learning. [sent-29, score-1.046]
21 2 Background The main focus of this paper is blending KWIK autonomous exploration strategies [7] and apprenticeship learning techniques [3], utilizing a framework for measuring mistakes and uncertainty based on KWIK-MB [6]. [sent-30, score-0.424]
22 Intuitively, KWIK caps the number of times the agent will admit uncertainty in its predictions. [sent-40, score-0.221]
23 Prior work [7] showed that if the transition and reward functions (T and R) of an MDP are KWIK learnable, then a PAC-MDP agent (which takes only a polynomial number of suboptimal steps with high probability) can be constructed for autonomous exploration. [sent-41, score-0.45]
24 Specifically, KWIK-learners LT and LR are built for T and R and the agent replaces any ⊥ predictions with transitions to a trap state with reward Rmax , causing the agent to explore these uncertain regions. [sent-43, score-0.601]
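To make the construction in the preceding sentence concrete, here is a minimal Python sketch of substituting ⊥ predictions with an Rmax trap state; the learner interface, constant names, and return types are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of optimism under uncertainty:
# wherever a KWIK learner answers "I don't know" (None stands in for ⊥),
# the planner is handed a transition to an absorbing trap state whose
# reward is R_max, which makes the unknown (s, a) pair look maximally
# attractive and so drives the agent to explore it.

R_MAX = 1.0             # assumed upper bound on the one-step reward
TRAP_STATE = "s_Rmax"   # absorbing state that self-loops with reward R_MAX

def optimistic_model(state, action, transition_learner, reward_learner):
    """Return (next-state distribution, reward) as seen by the planner."""
    t_pred = transition_learner.predict(state, action)  # dict {s': prob} or None (⊥)
    r_pred = reward_learner.predict(state, action)      # float or None (⊥)
    if t_pred is None or r_pred is None:
        return {TRAP_STATE: 1.0}, R_MAX   # unknown (s, a): make it worth exploring
    return t_pred, r_pred
```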
25 While the class of functions that is KWIK learnable includes tabular and factored MDPs, it does not cover many larger dynamics classes (such as STRIPS rules with conjunctions for pre-conditions) that are efficiently learnable in the apprenticeship setting. [sent-45, score-0.327]
26 2 Apprenticeship Learning with Mistake Bound Predictor We now describe an existing apprenticeship learning framework [3], which we will be modifying throughout this paper. [sent-47, score-0.179]
27 In that protocol, an agent is presented with a start state s0 and is asked to take actions according to its current policy πA , until a horizon H or a terminal state is reached. [sent-48, score-0.33]
28 After each of these episodes, a teacher is allowed to (but may choose not to) demonstrate their own policy πT starting from s0 . [sent-49, score-0.541]
29 The learning agent is able to fully observe each transition and reward received both in its own trajectories as well as those of the teacher, who may be able to provide highly informative samples. [sent-50, score-0.284]
30 For example, in an environment with n bits representing a combination lock that can only be opened with a single setting of the bits, the teacher can demonstrate the combination in a single trace, while an autonomous agent could spend 2n steps trying to open it. [sent-51, score-0.858]
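A toy sketch of the combination-lock contrast described above; the environment encoding and function names are our own illustration, chosen only to show the 2^n-versus-1 gap in attempts.

```python
import itertools

# Illustrative n-bit combination lock: exactly one setting of the n bits opens
# the lock, so an autonomous learner that must rule settings out one by one can
# need on the order of 2**n attempts, while a single teacher trace that
# demonstrates the correct setting opens it on the first try.

def autonomous_attempts(n, secret):
    """Count settings tried until the lock opens, enumerating exhaustively."""
    for count, setting in enumerate(itertools.product((0, 1), repeat=n), start=1):
        if setting == secret:
            return count
    return 2 ** n

def attempts_after_teacher_trace():
    """The demonstrated combination opens the lock immediately."""
    return 1

if __name__ == "__main__":
    n = 10
    secret = (1,) * n                      # worst case for this enumeration order
    print(autonomous_attempts(n, secret))  # 1024 attempts (2**10)
    print(attempts_after_teacher_trace())  # 1 attempt
```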
31 If ||Eh∗[xt] − yt|| > ε, then the agent has made a mistake. [sent-57, score-0.248]
32 An agent using MBP learners LT and LR for the MDP model components will be PAC-MDP-Trace. [sent-59, score-0.289]
33 In regions where the data is incomplete (such as our combination lock), the MBP learners default to predicting “false”, leading an agent to think its action will not work in those situations. [sent-63, score-0.289]
34 Such interpretations would be catastrophic in the autonomous case (where the agent would fail to explore such areas), but are permissible in apprenticeship learning where teacher traces will provide the missing data. [sent-64, score-1.198]
35 Notice that under a criterion where the number of teacher traces is to be minimized, MBP learning may overburden the teacher. [sent-65, score-0.674]
36 For example, in a simple flat MDP, an MBP-Agent picks actions that maximize utility in the part of the state space that has been exposed by the teacher, never exploring, so the number of teacher traces scales with |S||A|. [sent-66, score-0.732]
37 But a flat MDP is autonomously (KWIK) learnable, so no traces should be required. [sent-67, score-0.218]
38 Ideally an agent would explore the state space where it can learn efficiently, and only rely on the teacher for difficult to learn concepts (like conjunctions). [sent-68, score-0.799]
39 3 Teaching by Demonstration with Mixed Interpretations We now introduce a different criterion, with the goal of minimizing teacher traces while not forcing the agent to explore for exponentially many steps. [sent-69, score-0.925]
40 A good TI bound minimizes the teacher traces needed to achieve good behavior, but only requires the suboptimal exploration steps to be polynomially bounded, not minimized. [sent-72, score-0.818]
41 This reflects our judgement that teacher interactions are far more costly than autonomous agent steps, so as long as the latter are reasonably constrained, we should seek to minimize the former. [sent-73, score-0.803]
42 A PAC-MDP-Trace bound quantifies (with probability 1 − δ) the worst-case number of episodes where the student performs worse than the teacher, specifically where VA(s0) < VT(s0) − ε. [sent-77, score-0.225]
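For orientation, the two evaluation criteria discussed in these sentences can be restated compactly; the display below is a shorthand paraphrase in our own notation, not the paper's formal definitions.

```latex
% Shorthand paraphrase of the two evaluation criteria described in the text.
\begin{align*}
\textbf{PAC-MDP-Trace:}\quad &
  \text{w.p. } \ge 1-\delta,\;\;
  \#\bigl\{\text{episodes } t : V_{A_t}(s_0) < V_T(s_0) - \epsilon \bigr\}
  \;\le\; \mathrm{poly}\!\left(\tfrac{1}{\epsilon},\tfrac{1}{\delta},\tfrac{1}{1-\gamma},\ldots\right), \\[4pt]
\textbf{TI bound:}\quad &
  \#\{\text{teacher traces}\} \text{ is minimized, while }
  \#\{\text{suboptimal autonomous steps}\} \;\le\; \mathrm{poly}(\cdot)
  \text{ (bounded, not minimized).}
\end{align*}
```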
43 This would mean the domain was learnable with at most B1 teacher traces. [sent-79, score-0.588]
44 But this is a contradiction because no more traces are needed to keep the autonomous exploration steps polynomial. [sent-80, score-0.345]
45 1 The KWIK-MBP Protocol We would like to describe a supervised learning framework (like KWIK or MBP) that can quantify the number of changes made to a model through exploration and teacher demonstrations. [sent-82, score-0.554]
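A minimal sketch of what a KWIK-MBP supervised learner's interface could look like under the protocol sketched here: the learner either answers ⊥ (charged against a KWIK-style budget) or commits to a prediction that may later turn out to be a mistake (charged against a mistake-bound-style budget). Class and method names are assumptions for illustration.

```python
class KWIKMBPLearner:
    """Illustrative interface for a KWIK-MBP supervised learner.

    On each query the learner either admits uncertainty (returns None, i.e. ⊥,
    counted against a KWIK-style budget) or commits to a prediction, which may
    later be revealed to be a mistake (counted against a mistake-bound budget).
    """

    def __init__(self, kwik_budget, mistake_budget):
        self.kwik_budget = kwik_budget        # max number of ⊥ answers
        self.mistake_budget = mistake_budget  # max number of wrong predictions
        self.bottom_count = 0
        self.mistake_count = 0

    def predict(self, x):
        raise NotImplementedError  # a concrete learner fills this in

    def observe(self, x, prediction, label, epsilon):
        """Book-keeping once the true label arrives."""
        if prediction is None:                    # the learner answered ⊥
            self.bottom_count += 1
        elif abs(prediction - label) > epsilon:   # a committed prediction was wrong
            self.mistake_count += 1
        # a concrete learner would also update its hypothesis here
```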
46 Algorithm 1 (KWIK-MBP-Agent with Value Communication), line 1: The agent A knows ε, δ, S, A, H and a planner P. [sent-88, score-0.245]
47 Line 2: The teacher T has policy πT with expected value VT; line 3: initialize KWIK-MBP learners LT and LR to ensure ε/k value accuracy. [sent-89, score-0.609]
48 Line 7: A communicates its expected utility UA(s0) on this episode to T; line 8: if VT(s0) − (k−1)ε/k > UA(s0), then (line 9) T provides a trace τ starting from s0. [sent-94, score-0.218]
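A compact sketch of the episode loop that the recovered Algorithm 1 fragment describes: the agent plans, reports its expected utility UA(s0), and the teacher demonstrates only when VT(s0) − (k−1)ε/k exceeds that report. The planner, learner, environment, and teacher interfaces are hypothetical; only the line-8 comparison mirrors the algorithm text.

```python
def kwik_mbp_agent_episode(env, planner, L_T, L_R, teacher, s0, H, epsilon, k):
    """One episode of a KWIK-MBP agent with value communication (sketch only).

    The planner/learner/teacher/environment interfaces used here are
    illustrative assumptions; the communication test mirrors Algorithm 1, line 8.
    """
    # Plan in the mixed-interpretation model built from L_T and L_R.
    policy, U_A = planner.solve(L_T, L_R, s0, horizon=H)

    # Line 7: the agent reports its expected utility for this episode.
    # Line 8: the teacher compares it against its own value, with slack (k-1)*eps/k.
    if teacher.V_T(s0) - (k - 1) * epsilon / k > U_A:
        # Line 9: the teacher demonstrates a trace from s0; the agent trains its
        # learners on every observed transition and reward.
        for (s, a, r, s_next) in teacher.demonstrate(s0, H):
            L_T.update(s, a, s_next)
            L_R.update(s, a, r)
    else:
        # Otherwise the agent runs its own policy and learns from its experience.
        s = s0
        for _ in range(H):
            a = policy(s)
            s_next, r, done = env.step(s, a)
            L_T.update(s, a, s_next)
            L_R.update(s, a, r)
            s = s_next
            if done:
                break
```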
49 This is analogous to our TI criterion (traces minimized, exploration bounded), so we now examine a KWIK-MBP learner in the apprenticeship setting. [sent-103, score-0.311]
50 2 Mixing Optimism and Pessimism Algorithm 1 (KWIK-MBP-Agent) shows an apprenticeship learning agent built over KWIK-MBP learners LT and LR . [sent-105, score-0.468]
51 This has the effect of drawing the agent to explore explicitly uncertain regions (⊥) and to either explore on its own or rely on the teacher for areas where a mistake might be made. [sent-108, score-0.846]
52 For instance, in the experiments in Figure 2 (left), discussed in depth later, a KWIK-MBP agent only requires traces for learning the pre-conditions in a noisy blocks world but uses autonomous exploration to discover the noise probabilities. [sent-109, score-0.592]
53 4 Teaching by Demonstration with Explicit Communication Thus far we have not discussed communication from the student to the teacher in KWIK-MBP-Agent (line 7). [sent-110, score-0.647]
54 Suppose there was no communication in Algorithm 1 and the teacher provided a trace when πA was suboptimal. [sent-113, score-0.673]
55 This optimistic interpretation means the agent will expect success, and can learn autonomously. [sent-117, score-0.267]
56 However, the teacher will provide a trace to the agent since it sees it performing suboptimally during exploration. [sent-118, score-0.837]
57 The protocol in Algorithm 1 captures this intuition by providing a channel (line 7) where the student communicates its expected utility UA. [sent-121, score-0.19]
58 The teacher then only shows a trace to a pessimistic agent (line 8), but will “stand back” and let an over-confident student learn from its own mistakes. [sent-122, score-1.017]
59 The number of traces is related only to the MBP portion of the KWIK-MBP bound, and more specifically to the number of pessimistic mistakes (Definition 5). [sent-132, score-0.226]
60 Algorithm 1 with KWIK-MBP learners will have a TI bound that is polynomial in 1/ε, 1/δ, 1/(1−γ), and P, where P is the number of pessimistic mistakes (P ≤ M) made by LT and LR. [sent-136, score-0.256]
61 That original lemma categorized the three possible outcomes of an episode in an apprenticeship learning setting where the teacher always gives a trace and with LT and LR built to learn V within ε/2. [sent-139, score-0.892]
62 We need to extend the lemma to cover our change in protocol (the teacher may not step in on every episode) and in evaluation criteria (TI bound instead of PAC-MDP-Trace). [sent-144, score-0.605]
63 Specifically, we need to show: (i) The number of steps between traces where VA < VT − ε is polynomially bounded. [sent-145, score-0.216]
64 (ii) Only a polynomial number of traces are given, and they are all guaranteed to improve some parameter in the agent’s model with high probability. [sent-146, score-0.203]
65 (iii) Only pessimistic mistakes (Definition 5) cause a teacher intervention. [sent-147, score-0.616]
66 If UA ≥ VT − (k−1)ε/k, then no trace is given, but the agent’s policy is near optimal, so property (i) is not violated. [sent-157, score-0.179]
67 If UA < VT − (k−1)ε/k, then a trace is given, even in this exploit case, because the teacher does not know VA and cannot distinguish this case from the “explain” case above. [sent-158, score-0.616]
68 However, this trace will still be helpful, because UT ≤ UA, so at least UT < VT − ε/k (satisfying iii), and again by the simulation lemma, the trace will help us learn a parameter and there are a limited number of such mistakes, so (ii) holds. [sent-159, score-0.276]
69 In that case, the agent’s own experience will help it learn a parameter, but in terms of traces we have the following cases: UA ≥ VT − (k−1)ε/k and VA > UA + ε/k. [sent-161, score-0.208]
70 No trace is given here, but this is the classical exploration case (UA is optimistic, as in KWIK learning). [sent-164, score-0.192]
71 In either case, a trace will be provided, but UT ≤ UA, so at least UT < VT − ε/k and the trace will be helpful (satisfying property (ii)). [sent-167, score-0.254]
72 Pessimistic mistakes are causing the trace (property iii) since πT is undervalued. [sent-168, score-0.197]
73 The result also generalizes earlier apprenticeship learning results on ε/2-accurate learners [3] to ε/k-accuracy, while ensuring a more practical and stronger bound (TI instead of PAC-MDP-Trace). [sent-170, score-0.274]
74 For example, putDown(X, To) cannot execute unless the agent is holding X and To is clear and a block. [sent-181, score-0.221]
75 This is an interesting case because the effect probabilities can be learned autonomously while the conjunctive pre-conditions (of sizes 3 and 4) require teacher input (like our combination lock example). [sent-186, score-0.575]
76 Figure 2, column 1, shows KWIK, MBP, and KWIK-MBP agents as trained by a teacher who uses unreliable actions half the time. [sent-187, score-0.565]
77 The KWIK learner never receives traces (since its expected utility, shown in 1a, is always high), but spends an exponential (in the number of literals) time exploring the potential pre-conditions of actions (1b). [sent-188, score-0.284]
78 In contrast, the MBP and KWIK-MBP agents use the first trace to learn the pre-conditions. [sent-189, score-0.198]
79 The proportion of trials (out of 30) in which the MBP and KWIK-MBP learners received teacher traces across episodes is shown in the bar graphs 1c and 1d of Fig. 2. [sent-190, score-0.849]
80 The MBP learner continues to get traces for several episodes afterwards, using them to help learn the probabilities well after the pre-conditions are learned. [sent-192, score-0.339]
81 KWIK-MBP actually learns the probabilities faster than MBP because it targets areas it does not know about rather than relying on potentially redundant teacher samples. [sent-195, score-0.521]
82 The reason for this is that sometimes the learner may be unlucky and construct an inaccurate value estimate and the teacher then steps in and provides a trace. [sent-197, score-0.558]
83 Because of this, in our KWIK-MBP implementation, we combined a KWIK linear regression learner for the effect probabilities with an MBP learner for the pre-conditions. [Figure 2 caption fragment: (a) expected utility predictions UA(s0), (b) average undiscounted cumulative reward, and (c, d) the proportion of trials where MBP and KWIK-MBP received teacher traces.] [sent-210, score-0.674]
84 We constructed an “optimal hunting” teacher that finds the best combination of locations to shoot the wumpus from/at, but ignores the berries. [sent-214, score-0.607]
86 We concentrate on the ability of our algorithm to find a better policy than the teacher (i.e., learning to pick berries), while staying close enough to the teacher’s traces that it can hunt the wumpus effectively. [sent-215, score-0.541] [sent-217, score-0.311]
87 The KWIK learner starts with high UA that gradually descends (in 2a), but without traces the agent spends most of its time exploring fruitlessly (very slowly inclining slope of 2b). [sent-220, score-0.478]
88 The MBP agent learns to hunt from the teacher and quickly achieves good behavior, but rarely learns to pick berries (only gaining experience of the berries’ reward if it ends up in a completely unknown state and picks berries at random many times). [sent-221, score-1.042]
89 The KWIK-MBP learner starts with high expected utility and explores the structure of just the reward function, discovering berries but not the proper location combinations for killing the wumpus. [sent-222, score-0.227]
90 Once this crosses the teacher’s threshold, the teacher steps in with a number of traces showing the best way to hunt the wumpus—this is seen in plot 2d with the small bump in the proportion of trials with traces, starting at episode 2 and declining roughly linearly until episode 10. [sent-224, score-0.844]
91 The KWIK-MBP student is then able to fill in the CPTs with information from the teacher and reach an optimal policy that kills the wumpus and picks berries, avoiding both the over- and under-exploration of the KWIK and MBP agents. [sent-225, score-0.76]
92 5 Inferring Student Aptitude We now describe a method for a teacher to infer the student’s aptitude by using long periods without teacher interventions as observation phases. [sent-227, score-0.998]
93 This interaction protocol is an extension of Algorithm 1, but instead of using direct communication, the teacher will allow the student to run some number of trajectories m from a fixed start state and then decide whether to show a trace or not. [sent-228, score-0.808]
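A teacher-side sketch of the observation-phase variant described in the preceding sentence: the teacher watches m student episodes from s0, forms an empirical estimate of the student's value, and demonstrates only if that estimate falls well below VT(s0). The interfaces and the simple averaging test are assumptions for illustration, not the paper's exact decision rule.

```python
def observe_then_teach(env, student, teacher, s0, H, m, epsilon):
    """Teacher-side sketch of the inferred-aptitude protocol (illustrative only).

    The teacher lets the student run m episodes from s0, estimates the student's
    value empirically, and only steps in with a demonstration when that estimate
    is more than epsilon below its own value V_T(s0).
    """
    returns = []
    for _ in range(m):
        total, s = 0.0, s0
        for _ in range(H):
            a = student.act(s)
            s_next, r, done = env.step(s, a)
            student.learn(s, a, r, s_next)   # the student keeps learning while observed
            total += r
            s = s_next
            if done:
                break
        returns.append(total)

    estimated_student_value = sum(returns) / len(returns)
    if estimated_student_value < teacher.V_T(s0) - epsilon:
        return teacher.demonstrate(s0, H)    # provide a trace
    return None                              # stand back; no trace needed
```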
94 The ZTE (inspired by the zone of proximal development [11]) is the area of an MDP that an agent with background knowledge B and model learners LT and LR can act in with a polynomial number of suboptimal steps as judged only within that area. [sent-231, score-0.382]
95 Combining the ZTE, B, LT and LR induces a learning sub-problem where the agent must learn to act as well as possible without the teacher’s help. [sent-232, score-0.243]
96 A1 trials are necessary because the agent may need to explore all the ⊥ or optimistic mistakes within the ZTE, and each episode might contain only one of the A1 suboptimal steps. [sent-235, score-0.451]
97 6 Related Work and Conclusions Our teaching protocol extends early apprenticeship learning work for linear MDPs [1], which showed a polynomial number of upfront traces followed by greedy (not explicitly exploring) trajectories could achieve good behavior. [sent-243, score-0.472]
98 Our protocol is similar to a recent “practice/critique” interaction [13] where a teacher observed an agent and then labeled individual actions as “good” or “bad”, but the teacher did not provide demonstrations in that work. [sent-244, score-1.343]
99 Many such approaches give all the teacher data at the beginning, while our teaching protocol has the teacher only step in selectively, and our theoretical results ensure the teacher will only step in when its advice will have a significant effect. [sent-249, score-1.557]
100 We have shown how to use an extension of the KWIK-MB [6] (now KWIK-MBP) framework as the basis for model-based RL agents in the apprenticeship paradigm. [sent-250, score-0.228]
wordName wordTfidf (topN-words)
[('teacher', 0.489), ('mbp', 0.398), ('ua', 0.346), ('kwik', 0.287), ('va', 0.276), ('agent', 0.221), ('apprenticeship', 0.179), ('traces', 0.169), ('trace', 0.127), ('vt', 0.122), ('wumpus', 0.118), ('student', 0.101), ('episodes', 0.097), ('lr', 0.097), ('autonomous', 0.093), ('lt', 0.081), ('learnable', 0.074), ('mistakes', 0.07), ('learners', 0.068), ('exploration', 0.065), ('reward', 0.063), ('berries', 0.061), ('episode', 0.059), ('mistake', 0.059), ('protocol', 0.057), ('pessimistic', 0.057), ('communication', 0.057), ('ti', 0.053), ('policy', 0.052), ('learner', 0.051), ('autonomously', 0.049), ('zte', 0.049), ('agents', 0.049), ('reinforcement', 0.047), ('undiscounted', 0.045), ('pr', 0.043), ('demonstrations', 0.041), ('avg', 0.041), ('mdp', 0.037), ('ut', 0.037), ('polynomial', 0.034), ('teaching', 0.033), ('utility', 0.032), ('dbn', 0.031), ('rmax', 0.03), ('explore', 0.03), ('trap', 0.03), ('polynomially', 0.029), ('actions', 0.027), ('bound', 0.027), ('mdps', 0.027), ('demonstration', 0.027), ('yt', 0.027), ('strips', 0.026), ('trials', 0.026), ('domain', 0.025), ('optimistic', 0.024), ('knows', 0.024), ('walsh', 0.024), ('teachers', 0.024), ('hunt', 0.024), ('smax', 0.024), ('world', 0.023), ('lock', 0.022), ('learn', 0.022), ('exploring', 0.022), ('blocks', 0.021), ('lihong', 0.021), ('suboptimal', 0.021), ('predictions', 0.021), ('aptitude', 0.02), ('cpts', 0.02), ('killing', 0.02), ('melds', 0.02), ('predictedvalues', 0.02), ('putdown', 0.02), ('unlimited', 0.02), ('zone', 0.02), ('interaction', 0.019), ('learnability', 0.018), ('steps', 0.018), ('blending', 0.017), ('pieter', 0.017), ('littman', 0.017), ('interpretations', 0.017), ('experience', 0.017), ('areas', 0.017), ('lemma', 0.016), ('imitation', 0.016), ('disjunction', 0.016), ('criteria', 0.016), ('michael', 0.016), ('learns', 0.015), ('predictor', 0.015), ('environment', 0.015), ('state', 0.015), ('conjunctive', 0.015), ('spends', 0.015), ('literals', 0.015), ('abbeel', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999946 48 nips-2011-Blending Autonomous Exploration and Apprenticeship Learning
Author: Thomas J. Walsh, Daniel K. Hewlett, Clayton T. Morrison
Abstract: We present theoretical and empirical results for a framework that combines the benefits of apprenticeship and autonomous reinforcement learning. Our approach modifies an existing apprenticeship learning framework that relies on teacher demonstrations and does not necessarily explore the environment. The first change is replacing previously used Mistake Bound model learners with a recently proposed framework that melds the KWIK and Mistake Bound supervised learning protocols. The second change is introducing a communication of expected utility from the student to the teacher. The resulting system only uses teacher traces when the agent needs to learn concepts it cannot efficiently learn on its own. 1
2 0.24527691 41 nips-2011-Autonomous Learning of Action Models for Planning
Author: Neville Mehta, Prasad Tadepalli, Alan Fern
Abstract: This paper introduces two new frameworks for learning action models for planning. In the mistake-bounded planning framework, the learner has access to a planner for the given model representation, a simulator, and a planning problem generator, and aims to learn a model with at most a polynomial number of faulty plans. In the planned exploration framework, the learner does not have access to a problem generator and must instead design its own problems, plan for them, and converge with at most a polynomial number of planning attempts. The paper reduces learning in these frameworks to concept learning with one-sided error and provides algorithms for successful learning in both frameworks. A specific family of hypothesis spaces is shown to be efficiently learnable in both the frameworks. 1
3 0.11989228 79 nips-2011-Efficient Offline Communication Policies for Factored Multiagent POMDPs
Author: João V. Messias, Matthijs Spaan, Pedro U. Lima
Abstract: Factored Decentralized Partially Observable Markov Decision Processes (DecPOMDPs) form a powerful framework for multiagent planning under uncertainty, but optimal solutions require a rigid history-based policy representation. In this paper we allow inter-agent communication which turns the problem in a centralized Multiagent POMDP (MPOMDP). We map belief distributions over state factors to an agent’s local actions by exploiting structure in the joint MPOMDP policy. The key point is that when sparse dependencies between the agents’ decisions exist, often the belief over its local state factors is sufficient for an agent to unequivocally identify the optimal action, and communication can be avoided. We formalize these notions by casting the problem into convex optimization form, and present experimental results illustrating the savings in communication that we can obtain.
4 0.10924203 122 nips-2011-How Do Humans Teach: On Curriculum Learning and Teaching Dimension
Author: Faisal Khan, Bilge Mutlu, Xiaojin Zhu
Abstract: We study the empirical strategies that humans follow as they teach a target concept with a simple 1D threshold to a robot.1 Previous studies of computational teaching, particularly the teaching dimension model and the curriculum learning principle, offer contradictory predictions on what optimal strategy the teacher should follow in this teaching task. We show through behavioral studies that humans employ three distinct teaching strategies, one of which is consistent with the curriculum learning principle, and propose a novel theoretical framework as a potential explanation for this strategy. This framework, which assumes a teaching goal of minimizing the learner’s expected generalization error at each iteration, extends the standard teaching dimension model and offers a theoretical justification for curriculum learning. 1
5 0.1063626 52 nips-2011-Clustering via Dirichlet Process Mixture Models for Portable Skill Discovery
Author: Scott Niekum, Andrew G. Barto
Abstract: Skill discovery algorithms in reinforcement learning typically identify single states or regions in state space that correspond to task-specific subgoals. However, such methods do not directly address the question of how many distinct skills are appropriate for solving the tasks that the agent faces. This can be highly inefficient when many identified subgoals correspond to the same underlying skill, but are all used individually as skill goals. Furthermore, skills created in this manner are often only transferable to tasks that share identical state spaces, since corresponding subgoals across tasks are not merged into a single skill goal. We show that these problems can be overcome by clustering subgoal data defined in an agent-space and using the resulting clusters as templates for skill termination conditions. Clustering via a Dirichlet process mixture model is used to discover a minimal, sufficient collection of portable skills. 1
6 0.086613767 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning
7 0.079538673 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
8 0.074331366 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions
9 0.071110711 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
10 0.066223025 215 nips-2011-Policy Gradient Coagent Networks
11 0.065497376 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning
12 0.06104489 190 nips-2011-Nonlinear Inverse Reinforcement Learning with Gaussian Processes
13 0.058999896 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
14 0.058631591 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
15 0.055233896 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
16 0.054941464 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning
17 0.054924332 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time
18 0.051307116 21 nips-2011-Active Learning with a Drifting Distribution
19 0.051118445 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
20 0.048431106 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
topicId topicWeight
[(0, 0.105), (1, -0.121), (2, 0.042), (3, 0.092), (4, -0.077), (5, 0.023), (6, -0.002), (7, -0.037), (8, -0.029), (9, -0.035), (10, 0.026), (11, 0.004), (12, -0.012), (13, -0.011), (14, 0.018), (15, -0.111), (16, 0.122), (17, 0.064), (18, 0.076), (19, -0.104), (20, 0.114), (21, 0.042), (22, -0.079), (23, -0.004), (24, 0.08), (25, 0.076), (26, 0.003), (27, 0.037), (28, -0.045), (29, -0.091), (30, -0.085), (31, -0.03), (32, -0.157), (33, -0.077), (34, 0.048), (35, -0.0), (36, 0.06), (37, 0.063), (38, 0.013), (39, -0.095), (40, 0.14), (41, -0.036), (42, 0.213), (43, 0.198), (44, 0.107), (45, -0.055), (46, 0.156), (47, -0.091), (48, -0.054), (49, 0.008)]
simIndex simValue paperId paperTitle
same-paper 1 0.96647221 48 nips-2011-Blending Autonomous Exploration and Apprenticeship Learning
Author: Thomas J. Walsh, Daniel K. Hewlett, Clayton T. Morrison
Abstract: We present theoretical and empirical results for a framework that combines the benefits of apprenticeship and autonomous reinforcement learning. Our approach modifies an existing apprenticeship learning framework that relies on teacher demonstrations and does not necessarily explore the environment. The first change is replacing previously used Mistake Bound model learners with a recently proposed framework that melds the KWIK and Mistake Bound supervised learning protocols. The second change is introducing a communication of expected utility from the student to the teacher. The resulting system only uses teacher traces when the agent needs to learn concepts it cannot efficiently learn on its own. 1
2 0.81694275 41 nips-2011-Autonomous Learning of Action Models for Planning
Author: Neville Mehta, Prasad Tadepalli, Alan Fern
Abstract: This paper introduces two new frameworks for learning action models for planning. In the mistake-bounded planning framework, the learner has access to a planner for the given model representation, a simulator, and a planning problem generator, and aims to learn a model with at most a polynomial number of faulty plans. In the planned exploration framework, the learner does not have access to a problem generator and must instead design its own problems, plan for them, and converge with at most a polynomial number of planning attempts. The paper reduces learning in these frameworks to concept learning with one-sided error and provides algorithms for successful learning in both frameworks. A specific family of hypothesis spaces is shown to be efficiently learnable in both the frameworks. 1
3 0.64473337 79 nips-2011-Efficient Offline Communication Policies for Factored Multiagent POMDPs
Author: João V. Messias, Matthijs Spaan, Pedro U. Lima
Abstract: Factored Decentralized Partially Observable Markov Decision Processes (DecPOMDPs) form a powerful framework for multiagent planning under uncertainty, but optimal solutions require a rigid history-based policy representation. In this paper we allow inter-agent communication which turns the problem in a centralized Multiagent POMDP (MPOMDP). We map belief distributions over state factors to an agent’s local actions by exploiting structure in the joint MPOMDP policy. The key point is that when sparse dependencies between the agents’ decisions exist, often the belief over its local state factors is sufficient for an agent to unequivocally identify the optimal action, and communication can be avoided. We formalize these notions by casting the problem into convex optimization form, and present experimental results illustrating the savings in communication that we can obtain.
4 0.55596769 52 nips-2011-Clustering via Dirichlet Process Mixture Models for Portable Skill Discovery
Author: Scott Niekum, Andrew G. Barto
Abstract: Skill discovery algorithms in reinforcement learning typically identify single states or regions in state space that correspond to task-specific subgoals. However, such methods do not directly address the question of how many distinct skills are appropriate for solving the tasks that the agent faces. This can be highly inefficient when many identified subgoals correspond to the same underlying skill, but are all used individually as skill goals. Furthermore, skills created in this manner are often only transferable to tasks that share identical state spaces, since corresponding subgoals across tasks are not merged into a single skill goal. We show that these problems can be overcome by clustering subgoal data defined in an agent-space and using the resulting clusters as templates for skill termination conditions. Clustering via a Dirichlet process mixture model is used to discover a minimal, sufficient collection of portable skills. 1
5 0.54254764 122 nips-2011-How Do Humans Teach: On Curriculum Learning and Teaching Dimension
Author: Faisal Khan, Bilge Mutlu, Xiaojin Zhu
Abstract: We study the empirical strategies that humans follow as they teach a target concept with a simple 1D threshold to a robot.1 Previous studies of computational teaching, particularly the teaching dimension model and the curriculum learning principle, offer contradictory predictions on what optimal strategy the teacher should follow in this teaching task. We show through behavioral studies that humans employ three distinct teaching strategies, one of which is consistent with the curriculum learning principle, and propose a novel theoretical framework as a potential explanation for this strategy. This framework, which assumes a teaching goal of minimizing the learner’s expected generalization error at each iteration, extends the standard teaching dimension model and offers a theoretical justification for curriculum learning. 1
6 0.42431608 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions
7 0.39338046 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search
8 0.35307705 237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization
9 0.33529019 3 nips-2011-A Collaborative Mechanism for Crowdsourcing Prediction Problems
10 0.32520157 38 nips-2011-Anatomically Constrained Decoding of Finger Flexion from Electrocorticographic Signals
11 0.31413403 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
12 0.31255645 11 nips-2011-A Reinforcement Learning Theory for Homeostatic Regulation
13 0.29979041 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
14 0.29779452 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning
15 0.27992237 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations
16 0.27404839 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
17 0.26686588 90 nips-2011-Evaluating the inverse decision-making approach to preference learning
18 0.24937826 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning
19 0.244432 21 nips-2011-Active Learning with a Drifting Distribution
20 0.24353059 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
topicId topicWeight
[(0, 0.017), (4, 0.033), (20, 0.018), (26, 0.015), (31, 0.094), (33, 0.019), (35, 0.023), (37, 0.361), (43, 0.044), (45, 0.075), (57, 0.03), (74, 0.041), (83, 0.045), (86, 0.027), (99, 0.045)]
simIndex simValue paperId paperTitle
same-paper 1 0.807998 48 nips-2011-Blending Autonomous Exploration and Apprenticeship Learning
Author: Thomas J. Walsh, Daniel K. Hewlett, Clayton T. Morrison
Abstract: We present theoretical and empirical results for a framework that combines the benefits of apprenticeship and autonomous reinforcement learning. Our approach modifies an existing apprenticeship learning framework that relies on teacher demonstrations and does not necessarily explore the environment. The first change is replacing previously used Mistake Bound model learners with a recently proposed framework that melds the KWIK and Mistake Bound supervised learning protocols. The second change is introducing a communication of expected utility from the student to the teacher. The resulting system only uses teacher traces when the agent needs to learn concepts it cannot efficiently learn on its own. 1
2 0.69545352 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions
Author: Zhan Lim, Lee Sun, Daniel J. Hsu
Abstract: POMDP planning faces two major computational challenges: large state spaces and long planning horizons. The recently introduced Monte Carlo Value Iteration (MCVI) can tackle POMDPs with very large discrete state spaces or continuous state spaces, but its performance degrades when faced with long planning horizons. This paper presents Macro-MCVI, which extends MCVI by exploiting macro-actions for temporal abstraction. We provide sufficient conditions for Macro-MCVI to inherit the good theoretical properties of MCVI. Macro-MCVI does not require explicit construction of probabilistic models for macro-actions and is thus easy to apply in practice. Experiments show that Macro-MCVI substantially improves the performance of MCVI with suitable macro-actions. 1
3 0.46396247 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning
Author: Joni K. Pajarinen, Jaakko Peltonen
Abstract: Applications such as robot control and wireless communication require planning under uncertainty. Partially observable Markov decision processes (POMDPs) plan policies for single agents under uncertainty and their decentralized versions (DEC-POMDPs) find a policy for multiple agents. The policy in infinite-horizon POMDP and DEC-POMDP problems has been represented as finite state controllers (FSCs). We introduce a novel class of periodic FSCs, composed of layers connected only to the previous and next layer. Our periodic FSC method finds a deterministic finite-horizon policy and converts it to an initial periodic infinitehorizon policy. This policy is optimized by a new infinite-horizon algorithm to yield deterministic periodic policies, and by a new expectation maximization algorithm to yield stochastic periodic policies. Our method yields better results than earlier planning methods and can compute larger solutions than with regular FSCs.
4 0.43776485 79 nips-2011-Efficient Offline Communication Policies for Factored Multiagent POMDPs
Author: João V. Messias, Matthijs Spaan, Pedro U. Lima
Abstract: Factored Decentralized Partially Observable Markov Decision Processes (DecPOMDPs) form a powerful framework for multiagent planning under uncertainty, but optimal solutions require a rigid history-based policy representation. In this paper we allow inter-agent communication which turns the problem in a centralized Multiagent POMDP (MPOMDP). We map belief distributions over state factors to an agent’s local actions by exploiting structure in the joint MPOMDP policy. The key point is that when sparse dependencies between the agents’ decisions exist, often the belief over its local state factors is sufficient for an agent to unequivocally identify the optimal action, and communication can be avoided. We formalize these notions by casting the problem into convex optimization form, and present experimental results illustrating the savings in communication that we can obtain.
5 0.42639068 41 nips-2011-Autonomous Learning of Action Models for Planning
Author: Neville Mehta, Prasad Tadepalli, Alan Fern
Abstract: This paper introduces two new frameworks for learning action models for planning. In the mistake-bounded planning framework, the learner has access to a planner for the given model representation, a simulator, and a planning problem generator, and aims to learn a model with at most a polynomial number of faulty plans. In the planned exploration framework, the learner does not have access to a problem generator and must instead design its own problems, plan for them, and converge with at most a polynomial number of planning attempts. The paper reduces learning in these frameworks to concept learning with one-sided error and provides algorithms for successful learning in both frameworks. A specific family of hypothesis spaces is shown to be efficiently learnable in both the frameworks. 1
6 0.40208393 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
7 0.39000419 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
8 0.38979772 75 nips-2011-Dynamical segmentation of single trials from population neural data
9 0.38871124 249 nips-2011-Sequence learning with hidden units in spiking neural networks
10 0.38718119 229 nips-2011-Query-Aware MCMC
11 0.38716206 180 nips-2011-Multiple Instance Filtering
12 0.38554379 122 nips-2011-How Do Humans Teach: On Curriculum Learning and Teaching Dimension
13 0.38526857 102 nips-2011-Generalised Coupled Tensor Factorisation
14 0.38475388 219 nips-2011-Predicting response time and error rates in visual search
15 0.38467532 66 nips-2011-Crowdclustering
16 0.38462994 301 nips-2011-Variational Gaussian Process Dynamical Systems
17 0.38399643 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
18 0.38370478 258 nips-2011-Sparse Bayesian Multi-Task Learning
19 0.3832342 241 nips-2011-Scalable Training of Mixture Models via Coresets
20 0.38315138 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning