nips nips2011 nips2011-48 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Thomas J. Walsh, Daniel K. Hewlett, Clayton T. Morrison
Abstract: We present theoretical and empirical results for a framework that combines the benefits of apprenticeship and autonomous reinforcement learning. Our approach modifies an existing apprenticeship learning framework that relies on teacher demonstrations and does not necessarily explore the environment. The first change is replacing previously used Mistake Bound model learners with a recently proposed framework that melds the KWIK and Mistake Bound supervised learning protocols. The second change is introducing a communication of expected utility from the student to the teacher. The resulting system only uses teacher traces when the agent needs to learn concepts it cannot efficiently learn on its own. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We present theoretical and empirical results for a framework that combines the benefits of apprenticeship and autonomous reinforcement learning. [sent-6, score-0.319]
2 Our approach modifies an existing apprenticeship learning framework that relies on teacher demonstrations and does not necessarily explore the environment. [sent-7, score-0.739]
3 The second change is introducing a communication of expected utility from the student to the teacher. [sent-9, score-0.19]
4 The resulting system only uses teacher traces when the agent needs to learn concepts it cannot efficiently learn on its own. [sent-10, score-0.923]
5 1 Introduction As problem domains become more complex, human guidance becomes increasingly necessary to improve agent performance. [sent-11, score-0.221]
6 For instance, apprenticeship learning, where teachers demonstrate behaviors for agents to follow, has been used to train agents to control complicated systems such as helicopters [1]. [sent-12, score-0.301]
7 However, most work on this topic burdens the teacher with demonstrating even the simplest nuances of a task. [sent-13, score-0.489]
8 By contrast, in autonomous reinforcement learning [2] a number of domain classes can be efficiently learned by an actively exploring agent, although this set of classes is provably smaller than the set learnable with the help of a teacher [3]. [sent-14, score-0.75]
9 Either the agent learns autonomously and eschews the larger learning capacity afforded by teacher interaction, or it overburdens the teacher by not exploring simple concepts it could garner on its own. [sent-16, score-1.341]
10 We show how to build a provably efficient learning system that balances teacher demonstrations and autonomous exploration. [sent-18, score-0.623]
11 Specifically, our protocol and algorithms cause a teacher to only step in when its advice will be significantly more helpful than autonomous exploration by the agent. [sent-19, score-0.704]
12 We extend a previously proposed apprenticeship learning protocol [3] where a learning agent and teacher take turns running trajectories. [sent-20, score-0.946]
13 This version of apprenticeship learning is fundamentally different from Inverse Reinforcement Learning [4] and imitation learning [5] because our agents are allowed to enact better policies than their teachers and observe reward signals. [sent-21, score-0.331]
14 In this setting, the number of times the teacher outperforms the student was proven to be related to the learnability of the domain class in a mistake bound predictor (MBP) framework. [sent-22, score-0.734]
15 Our work modifies previous apprenticeship learning efforts in two ways. [sent-23, score-0.179]
16 First, we will show that replacing the MBP framework with a different learning architecture called KWIK-MBP (based on a similar recently proposed protocol [6]) indicates areas where the agent should autonomously explore, and melds autonomous and apprenticeship learning. [sent-24, score-0.636]
17 However, this change alone is not sufficient to keep the teacher from intervening when an agent is capable of learning on its own. [sent-25, score-0.71]
18 Hence, we introduce a communication of the agent’s expected utility, which provides enough information for the teacher to decide whether or not to provide a trace (a property not shared by any of the previous efforts). [sent-26, score-0.673]
19 We then discuss how to relax the communication requirement when the teacher observes the student for many episodes. [sent-28, score-0.647]
20 This gives us the first apprenticeship learning framework where a teacher only shows demonstrations when they are needed for efficient learning, and gracefully blends autonomous exploration and apprenticeship learning. [sent-29, score-1.046]
21 2 Background The main focus of this paper is blending KWIK autonomous exploration strategies [7] and apprenticeship learning techniques [3], utilizing a framework for measuring mistakes and uncertainty based on KWIK-MB [6]. [sent-30, score-0.424]
22 Intuitively, KWIK caps the number of times the agent will admit uncertainty in its predictions. [sent-40, score-0.221]
23 Prior work [7] showed that if the transition and reward functions (T and R) of an MDP are KWIK learnable, then a PAC-MDP agent (which takes only a polynomial number of suboptimal steps with high probability) can be constructed for autonomous exploration. [sent-41, score-0.45]
24 Specifically, KWIK-learners LT and LR are built for T and R and the agent replaces any ⊥ predictions with transitions to a trap state with reward Rmax , causing the agent to explore these uncertain regions. [sent-43, score-0.601]
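To make the construction in the preceding sentence concrete, here is a minimal Python sketch of substituting ⊥ predictions with an Rmax trap state; the learner interface, constant names, and return types are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of optimism under uncertainty:
# wherever a KWIK learner answers "I don't know" (None stands in for ⊥),
# the planner is handed a transition to an absorbing trap state whose
# reward is R_max, which makes the unknown (s, a) pair look maximally
# attractive and so drives the agent to explore it.

R_MAX = 1.0             # assumed upper bound on the one-step reward
TRAP_STATE = "s_Rmax"   # absorbing state that self-loops with reward R_MAX

def optimistic_model(state, action, transition_learner, reward_learner):
    """Return (next-state distribution, reward) as seen by the planner."""
    t_pred = transition_learner.predict(state, action)  # dict {s': prob} or None (⊥)
    r_pred = reward_learner.predict(state, action)      # float or None (⊥)
    if t_pred is None or r_pred is None:
        return {TRAP_STATE: 1.0}, R_MAX   # unknown (s, a): make it worth exploring
    return t_pred, r_pred
```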
25 While the class of functions that is KWIK learnable includes tabular and factored MDPs, it does not cover many larger dynamics classes (such as STRIPS rules with conjunctions for pre-conditions) that are efficiently learnable in the apprenticeship setting. [sent-45, score-0.327]
26 2 Apprenticeship Learning with Mistake Bound Predictor We now describe an existing apprenticeship learning framework [3], which we will be modifying throughout this paper. [sent-47, score-0.179]
27 In that protocol, an agent is presented with a start state s0 and is asked to take actions according to its current policy πA , until a horizon H or a terminal state is reached. [sent-48, score-0.33]
28 After each of these episodes, a teacher is allowed to (but may choose not to) demonstrate their own policy πT starting from s0 . [sent-49, score-0.541]
29 The learning agent is able to fully observe each transition and reward received both in its own trajectories as well as those of the teacher, who may be able to provide highly informative samples. [sent-50, score-0.284]
30 For example, in an environment with n bits representing a combination lock that can only be opened with a single setting of the bits, the teacher can demonstrate the combination in a single trace, while an autonomous agent could spend 2n steps trying to open it. [sent-51, score-0.858]
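A toy sketch of the combination-lock contrast described above; the environment encoding and function names are our own illustration, chosen only to show the 2^n-versus-1 gap in attempts.

```python
import itertools

# Illustrative n-bit combination lock: exactly one setting of the n bits opens
# the lock, so an autonomous learner that must rule settings out one by one can
# need on the order of 2**n attempts, while a single teacher trace that
# demonstrates the correct setting opens it on the first try.

def autonomous_attempts(n, secret):
    """Count settings tried until the lock opens, enumerating exhaustively."""
    for count, setting in enumerate(itertools.product((0, 1), repeat=n), start=1):
        if setting == secret:
            return count
    return 2 ** n

def attempts_after_teacher_trace():
    """The demonstrated combination opens the lock immediately."""
    return 1

if __name__ == "__main__":
    n = 10
    secret = (1,) * n                      # worst case for this enumeration order
    print(autonomous_attempts(n, secret))  # 1024 attempts (2**10)
    print(attempts_after_teacher_trace())  # 1 attempt
```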
31 If ||Eh∗[xt] − yt|| > ε, then the agent has made a mistake. [sent-57, score-0.248]
32 An agent using MBP learners LT and LR for the MDP model components will be PAC-MDP-Trace. [sent-59, score-0.289]
33 In regions where the data is incomplete (such as our combination lock), the MBP learners default to predicting “false”, leading an agent to think its action will not work in those situations. [sent-63, score-0.289]
34 Such interpretations would be catastrophic in the autonomous case (where the agent would fail to explore such areas), but are permissible in apprenticeship learning where teacher traces will provide the missing data. [sent-64, score-1.198]
35 Notice that under a criterion where the number of teacher traces is to be minimized, MBP learning may overburden the teacher. [sent-65, score-0.674]
36 For example, in a simple flat MDP, an MBP-Agent picks actions that maximize utility in the part of the state space that has been exposed by the teacher, never exploring, so the number of teacher traces scales with |S||A|. [sent-66, score-0.732]
37 But a flat MDP is autonomously (KWIK) learnable, so no traces should be required. [sent-67, score-0.218]
38 Ideally an agent would explore the state space where it can learn efficiently, and only rely on the teacher for difficult to learn concepts (like conjunctions). [sent-68, score-0.799]
39 3 Teaching by Demonstration with Mixed Interpretations We now introduce a different criterion, with the goal of minimizing teacher traces while not forcing the agent to explore for exponentially many steps. [sent-69, score-0.925]
40 A good TI bound minimizes the teacher traces needed to achieve good behavior, but only requires the suboptimal exploration steps to be polynomially bounded, not minimized. [sent-72, score-0.818]
41 This reflects our judgement that teacher interactions are far more costly than autonomous agent steps, so as long as the latter are reasonably constrained, we should seek to minimize the former. [sent-73, score-0.803]
42 A PAC-MDP-Trace bound quantifies (with probability 1 − δ) the worst-case number of episodes where the student performs worse than the teacher, specifically where VA(s0) < VT(s0) − ε. [sent-77, score-0.225]
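For orientation, the two evaluation criteria discussed in these sentences can be restated compactly; the display below is a shorthand paraphrase in our own notation, not the paper's formal definitions.

```latex
% Shorthand paraphrase of the two evaluation criteria described in the text.
\begin{align*}
\textbf{PAC-MDP-Trace:}\quad &
  \text{w.p. } \ge 1-\delta,\;\;
  \#\bigl\{\text{episodes } t : V_{A_t}(s_0) < V_T(s_0) - \epsilon \bigr\}
  \;\le\; \mathrm{poly}\!\left(\tfrac{1}{\epsilon},\tfrac{1}{\delta},\tfrac{1}{1-\gamma},\ldots\right), \\[4pt]
\textbf{TI bound:}\quad &
  \#\{\text{teacher traces}\} \text{ is minimized, while }
  \#\{\text{suboptimal autonomous steps}\} \;\le\; \mathrm{poly}(\cdot)
  \text{ (bounded, not minimized).}
\end{align*}
```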
43 This would mean the domain was learnable with at most B1 teacher traces. [sent-79, score-0.588]
44 But this is a contradiction because no more traces are needed to keep the autonomous exploration steps polynomial. [sent-80, score-0.345]
45 1 The KWIK-MBP Protocol We would like to describe a supervised learning framework (like KWIK or MBP) that can quantify the number of changes made to a model through exploration and teacher demonstrations. [sent-82, score-0.554]
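A minimal sketch of what a KWIK-MBP supervised learner's interface could look like under the protocol sketched here: the learner either answers ⊥ (charged against a KWIK-style budget) or commits to a prediction that may later turn out to be a mistake (charged against a mistake-bound-style budget). Class and method names are assumptions for illustration.

```python
class KWIKMBPLearner:
    """Illustrative interface for a KWIK-MBP supervised learner.

    On each query the learner either admits uncertainty (returns None, i.e. ⊥,
    counted against a KWIK-style budget) or commits to a prediction, which may
    later be revealed to be a mistake (counted against a mistake-bound budget).
    """

    def __init__(self, kwik_budget, mistake_budget):
        self.kwik_budget = kwik_budget        # max number of ⊥ answers
        self.mistake_budget = mistake_budget  # max number of wrong predictions
        self.bottom_count = 0
        self.mistake_count = 0

    def predict(self, x):
        raise NotImplementedError  # a concrete learner fills this in

    def observe(self, x, prediction, label, epsilon):
        """Book-keeping once the true label arrives."""
        if prediction is None:                    # the learner answered ⊥
            self.bottom_count += 1
        elif abs(prediction - label) > epsilon:   # a committed prediction was wrong
            self.mistake_count += 1
        # a concrete learner would also update its hypothesis here
```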
46 Algorithm 1 (KWIK-MBP-Agent with Value Communication), line 1: The agent A knows ε, δ, S, A, H and a planner P. [sent-88, score-0.245]
47 Line 2: The teacher T has policy πT with expected value VT; line 3: initialize KWIK-MBP learners LT and LR to ensure ε/k value accuracy. [sent-89, score-0.609]
48 Line 7: A communicates its expected utility UA(s0) on this episode to T; line 8: if VT(s0) − (k−1)ε/k > UA(s0), then (line 9) T provides a trace τ starting from s0. [sent-94, score-0.218]
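A compact sketch of the episode loop that the recovered Algorithm 1 fragment describes: the agent plans, reports its expected utility UA(s0), and the teacher demonstrates only when VT(s0) − (k−1)ε/k exceeds that report. The planner, learner, environment, and teacher interfaces are hypothetical; only the line-8 comparison mirrors the algorithm text.

```python
def kwik_mbp_agent_episode(env, planner, L_T, L_R, teacher, s0, H, epsilon, k):
    """One episode of a KWIK-MBP agent with value communication (sketch only).

    The planner/learner/teacher/environment interfaces used here are
    illustrative assumptions; the communication test mirrors Algorithm 1, line 8.
    """
    # Plan in the mixed-interpretation model built from L_T and L_R.
    policy, U_A = planner.solve(L_T, L_R, s0, horizon=H)

    # Line 7: the agent reports its expected utility for this episode.
    # Line 8: the teacher compares it against its own value, with slack (k-1)*eps/k.
    if teacher.V_T(s0) - (k - 1) * epsilon / k > U_A:
        # Line 9: the teacher demonstrates a trace from s0; the agent trains its
        # learners on every observed transition and reward.
        for (s, a, r, s_next) in teacher.demonstrate(s0, H):
            L_T.update(s, a, s_next)
            L_R.update(s, a, r)
    else:
        # Otherwise the agent runs its own policy and learns from its experience.
        s = s0
        for _ in range(H):
            a = policy(s)
            s_next, r, done = env.step(s, a)
            L_T.update(s, a, s_next)
            L_R.update(s, a, r)
            s = s_next
            if done:
                break
```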
49 This is analogous to our TI criterion (traces minimized, exploration bounded), so we now examine a KWIK-MBP learner in the apprenticeship setting. [sent-103, score-0.311]
50 2 Mixing Optimism and Pessimism Algorithm 1 (KWIK-MBP-Agent) shows an apprenticeship learning agent built over KWIK-MBP learners LT and LR . [sent-105, score-0.468]
51 This has the effect of drawing the agent to explore explicitly uncertain regions (⊥) and to either explore on its own or rely on the teacher for areas where a mistake might be made. [sent-108, score-0.846]
52 For instance, in the experiments in Figure 2 (left), discussed in depth later, a KWIK-MBP agent only requires traces for learning the pre-conditions in a noisy blocks world but uses autonomous exploration to discover the noise probabilities. [sent-109, score-0.592]
53 4 Teaching by Demonstration with Explicit Communication Thus far we have not discussed communication from the student to the teacher in KWIK-MBP-Agent (line 7). [sent-110, score-0.647]
54 Suppose there was no communication in Algorithm 1 and the teacher provided a trace when πA was suboptimal. [sent-113, score-0.673]
55 This optimistic interpretation means the agent will expect success, and can learn autonomously. [sent-117, score-0.267]
56 However, the teacher will provide a trace to the agent since it sees it performing suboptimally during exploration. [sent-118, score-0.837]
57 The protocol in Algorithm 1 captures this intuition by providing a channel (line 7) where the student communicates its expected utility UA. [sent-121, score-0.19]
58 The teacher then only shows a trace to a pessimistic agent (line 8), but will “stand back” and let an over-confident student learn from its own mistakes. [sent-122, score-1.017]
59 The number of traces is related only to the MBP portion of the KWIK-MBP bound, and more specifically to the number of pessimistic mistakes (Definition 5). [sent-132, score-0.226]
60 Algorithm 1 with KWIK-MBP learners will have a TI bound that is polynomial in 1/ε, 1/δ, 1/(1−γ), and P, where P is the number of pessimistic mistakes (P ≤ M) made by LT and LR. [sent-136, score-0.256]
61 That original lemma categorized the three possible outcomes of an episode in an apprenticeship learning setting where the teacher always gives a trace and with LT and LR built to learn V within ε/2. [sent-139, score-0.892]
62 We need to extend the lemma to cover our change in protocol (the teacher may not step in on every episode) and in evaluation criteria (TI bound instead of PAC-MDP-Trace). [sent-144, score-0.605]
63 Specifically, we need to show: (i) The number of steps between traces where VA < VT − ε is polynomially bounded. [sent-145, score-0.216]
64 (ii) Only a polynomial number of traces are given, and they are all guaranteed to improve some parameter in the agent’s model with high probability. [sent-146, score-0.203]
65 (iii) Only pessimistic mistakes (Definition 5) cause a teacher intervention. [sent-147, score-0.616]
66 If UA ≥ VT − (k−1)ε/k, then no trace is given, but the agent’s policy is near optimal, so property (i) is not violated. [sent-157, score-0.179]
67 If UA < VT − (k−1)ε/k, then a trace is given, even in this exploit case, because the teacher does not know VA and cannot distinguish this case from the “explain” case above. [sent-158, score-0.616]
68 However, this trace will still be helpful, because UT ≤ UA, so at least UT < VT − ε/k (satisfying iii), and again by the simulation lemma, the trace will help us learn a parameter and there are a limited number of such mistakes, so (ii) holds. [sent-159, score-0.276]
69 In that case, the agent’s own experience will help it learn a parameter, but in terms of traces we have the following cases: UA ≥ VT − (k−1)ε/k and VA > UA + ε/k. [sent-161, score-0.208]
70 No trace is given here, but this is the classical exploration case (UA is optimistic, as in KWIK learning). [sent-164, score-0.192]
71 In either case, a trace will be provided, but UT ≤ UA, so at least UT < VT − ε/k and the trace will be helpful (satisfying property (ii)). [sent-167, score-0.254]
72 Pessimistic mistakes are causing the trace (property iii) since πT is undervalued. [sent-168, score-0.197]
73 The result also generalizes earlier apprenticeship learning results on ε/2-accurate learners [3] to ε/k-accuracy, while ensuring a more practical and stronger bound (TI instead of PAC-MDP-Trace). [sent-170, score-0.274]
74 For example, putDown(X, To) cannot execute unless the agent is holding X and To is clear and a block. [sent-181, score-0.221]
75 This is an interesting case because the effect probabilities can be learned autonomously while the conjunctive pre-conditions (of sizes 3 and 4) require teacher input (like our combination lock example). [sent-186, score-0.575]
76 Figure 2, column 1, shows KWIK, MBP, and KWIK-MBP agents as trained by a teacher who uses unreliable actions half the time. [sent-187, score-0.565]
77 The KWIK learner never receives traces (since its expected utility, shown in 1a, is always high), but spends an exponential (in the number of literals) time exploring the potential pre-conditions of actions (1b). [sent-188, score-0.284]
78 In contrast, the MBP and KWIK-MBP agents use the first trace to learn the pre-conditions. [sent-189, score-0.198]
79 The proportion of trials (out of 30) in which the MBP and KWIK-MBP learners received teacher traces across episodes is shown in the bar graphs 1c and 1d of Fig. 2. [sent-190, score-0.849]
80 The MBP learner continues to get traces for several episodes afterwards, using them to help learn the probabilities well after the pre-conditions are learned. [sent-192, score-0.339]
81 KWIK-MBP actually learns the probabilities faster than MBP because it targets areas it does not know about rather than relying on potentially redundant teacher samples. [sent-195, score-0.521]
82 The reason for this is that sometimes the learner may be unlucky and construct an inaccurate value estimate and the teacher then steps in and provides a trace. [sent-197, score-0.558]
83 Because of this, in our KWIK-MBP implementation, we combined a KWIK linear regression learner for the effect probabilities with an MBP learner for the pre-conditions. [Figure 2 caption fragment: (a) expected utility predictions UA(s0), (b) average undiscounted cumulative reward, and (c, d) the proportion of trials where MBP and KWIK-MBP received teacher traces.] [sent-210, score-0.674]
84 We constructed an “optimal hunting” teacher that finds the best combination of locations to shoot the wumpus from/at, but ignores the berries. [sent-214, score-0.607]
86 We concentrate on the ability of our algorithm to find a better policy than the teacher (i.e., learning to pick berries), while staying close enough to the teacher’s traces that it can hunt the wumpus effectively. [sent-215, score-0.541] [sent-217, score-0.311]
87 The KWIK learner starts with high UA that gradually descends (in 2a), but without traces the agent spends most of its time exploring fruitlessly (very slowly inclining slope of 2b). [sent-220, score-0.478]
88 The MBP agent learns to hunt from the teacher and quickly achieves good behavior, but rarely learns to pick berries (only gaining experience of the berries’ reward if it ends up in a completely unknown state and picks berries at random many times). [sent-221, score-1.042]
89 The KWIK-MBP learner starts with high expected utility and explores the structure of just the reward function, discovering berries but not the proper location combinations for killing the wumpus. [sent-222, score-0.227]
90 Once this crosses the teacher’s threshold, the teacher steps in with a number of traces showing the best way to hunt the wumpus—this is seen in plot 2d with the small bump in the proportion of trials with traces, starting at episode 2 and declining roughly linearly until episode 10. [sent-224, score-0.844]
91 The KWIK-MBP student is then able to fill in the CPTs with information from the teacher and reach an optimal policy that kills the wumpus and picks berries, avoiding both the over- and under-exploration of the KWIK and MBP agents. [sent-225, score-0.76]
92 5 Inferring Student Aptitude We now describe a method for a teacher to infer the student’s aptitude by using long periods without teacher interventions as observation phases. [sent-227, score-0.998]
93 This interaction protocol is an extension of Algorithm 1, but instead of using direct communication, the teacher will allow the student to run some number of trajectories m from a fixed start state and then decide whether to show a trace or not. [sent-228, score-0.808]
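A teacher-side sketch of the observation-phase variant described in the preceding sentence: the teacher watches m student episodes from s0, forms an empirical estimate of the student's value, and demonstrates only if that estimate falls well below VT(s0). The interfaces and the simple averaging test are assumptions for illustration, not the paper's exact decision rule.

```python
def observe_then_teach(env, student, teacher, s0, H, m, epsilon):
    """Teacher-side sketch of the inferred-aptitude protocol (illustrative only).

    The teacher lets the student run m episodes from s0, estimates the student's
    value empirically, and only steps in with a demonstration when that estimate
    is more than epsilon below its own value V_T(s0).
    """
    returns = []
    for _ in range(m):
        total, s = 0.0, s0
        for _ in range(H):
            a = student.act(s)
            s_next, r, done = env.step(s, a)
            student.learn(s, a, r, s_next)   # the student keeps learning while observed
            total += r
            s = s_next
            if done:
                break
        returns.append(total)

    estimated_student_value = sum(returns) / len(returns)
    if estimated_student_value < teacher.V_T(s0) - epsilon:
        return teacher.demonstrate(s0, H)    # provide a trace
    return None                              # stand back; no trace needed
```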
94 The ZTE (inspired by the zone of proximal development [11]) is the area of an MDP that an agent with background knowledge B and model learners LT and LR can act in with a polynomial number of suboptimal steps as judged only within that area. [sent-231, score-0.382]
95 Combining the ZTE, B, LT and LR induces a learning sub-problem where the agent must learn to act as well as possible without the teacher’s help. [sent-232, score-0.243]
96 A1 trials are necessary because the agent may need to explore all the ⊥ or optimistic mistakes within the ZTE, and each episode might contain only one of the A1 suboptimal steps. [sent-235, score-0.451]
97 6 Related Work and Conclusions Our teaching protocol extends early apprenticeship learning work for linear MDPs [1], which showed a polynomial number of upfront traces followed by greedy (not explicitly exploring) trajectories could achieve good behavior. [sent-243, score-0.472]
98 Our protocol is similar to a recent “practice/critique” interaction [13] where a teacher observed an agent and then labeled individual actions as “good” or “bad”, but the teacher did not provide demonstrations in that work. [sent-244, score-1.343]
99 Many such approaches give all the teacher data at the beginning, while our teaching protocol has the teacher only step in selectively, and our theoretical results ensure the teacher will only step in when its advice will have a significant effect. [sent-249, score-1.557]
100 We have shown how to use an extension of the KWIK-MB [6] (now KWIK-MBP) framework as the basis for model-based RL agents in the apprenticeship paradigm. [sent-250, score-0.228]
wordName wordTfidf (topN-words)
[('teacher', 0.489), ('mbp', 0.398), ('ua', 0.346), ('kwik', 0.287), ('va', 0.276), ('agent', 0.221), ('apprenticeship', 0.179), ('traces', 0.169), ('trace', 0.127), ('vt', 0.122), ('wumpus', 0.118), ('student', 0.101), ('episodes', 0.097), ('lr', 0.097), ('autonomous', 0.093), ('lt', 0.081), ('learnable', 0.074), ('mistakes', 0.07), ('learners', 0.068), ('exploration', 0.065), ('reward', 0.063), ('berries', 0.061), ('episode', 0.059), ('mistake', 0.059), ('protocol', 0.057), ('pessimistic', 0.057), ('communication', 0.057), ('ti', 0.053), ('policy', 0.052), ('learner', 0.051), ('autonomously', 0.049), ('zte', 0.049), ('agents', 0.049), ('reinforcement', 0.047), ('undiscounted', 0.045), ('pr', 0.043), ('demonstrations', 0.041), ('avg', 0.041), ('mdp', 0.037), ('ut', 0.037), ('polynomial', 0.034), ('teaching', 0.033), ('utility', 0.032), ('dbn', 0.031), ('rmax', 0.03), ('explore', 0.03), ('trap', 0.03), ('polynomially', 0.029), ('actions', 0.027), ('bound', 0.027), ('mdps', 0.027), ('demonstration', 0.027), ('yt', 0.027), ('strips', 0.026), ('trials', 0.026), ('domain', 0.025), ('optimistic', 0.024), ('knows', 0.024), ('walsh', 0.024), ('teachers', 0.024), ('hunt', 0.024), ('smax', 0.024), ('world', 0.023), ('lock', 0.022), ('learn', 0.022), ('exploring', 0.022), ('blocks', 0.021), ('lihong', 0.021), ('suboptimal', 0.021), ('predictions', 0.021), ('aptitude', 0.02), ('cpts', 0.02), ('killing', 0.02), ('melds', 0.02), ('predictedvalues', 0.02), ('putdown', 0.02), ('unlimited', 0.02), ('zone', 0.02), ('interaction', 0.019), ('learnability', 0.018), ('steps', 0.018), ('blending', 0.017), ('pieter', 0.017), ('littman', 0.017), ('interpretations', 0.017), ('experience', 0.017), ('areas', 0.017), ('lemma', 0.016), ('imitation', 0.016), ('disjunction', 0.016), ('criteria', 0.016), ('michael', 0.016), ('learns', 0.015), ('predictor', 0.015), ('environment', 0.015), ('state', 0.015), ('conjunctive', 0.015), ('spends', 0.015), ('literals', 0.015), ('abbeel', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999946 48 nips-2011-Blending Autonomous Exploration and Apprenticeship Learning
Author: Thomas J. Walsh, Daniel K. Hewlett, Clayton T. Morrison
Abstract: We present theoretical and empirical results for a framework that combines the benefits of apprenticeship and autonomous reinforcement learning. Our approach modifies an existing apprenticeship learning framework that relies on teacher demonstrations and does not necessarily explore the environment. The first change is replacing previously used Mistake Bound model learners with a recently proposed framework that melds the KWIK and Mistake Bound supervised learning protocols. The second change is introducing a communication of expected utility from the student to the teacher. The resulting system only uses teacher traces when the agent needs to learn concepts it cannot efficiently learn on its own. 1
2 0.24527691 41 nips-2011-Autonomous Learning of Action Models for Planning
Author: Neville Mehta, Prasad Tadepalli, Alan Fern
Abstract: This paper introduces two new frameworks for learning action models for planning. In the mistake-bounded planning framework, the learner has access to a planner for the given model representation, a simulator, and a planning problem generator, and aims to learn a model with at most a polynomial number of faulty plans. In the planned exploration framework, the learner does not have access to a problem generator and must instead design its own problems, plan for them, and converge with at most a polynomial number of planning attempts. The paper reduces learning in these frameworks to concept learning with one-sided error and provides algorithms for successful learning in both frameworks. A specific family of hypothesis spaces is shown to be efficiently learnable in both the frameworks. 1
3 0.11989228 79 nips-2011-Efficient Offline Communication Policies for Factored Multiagent POMDPs
Author: João V. Messias, Matthijs Spaan, Pedro U. Lima
Abstract: Factored Decentralized Partially Observable Markov Decision Processes (DecPOMDPs) form a powerful framework for multiagent planning under uncertainty, but optimal solutions require a rigid history-based policy representation. In this paper we allow inter-agent communication which turns the problem in a centralized Multiagent POMDP (MPOMDP). We map belief distributions over state factors to an agent’s local actions by exploiting structure in the joint MPOMDP policy. The key point is that when sparse dependencies between the agents’ decisions exist, often the belief over its local state factors is sufficient for an agent to unequivocally identify the optimal action, and communication can be avoided. We formalize these notions by casting the problem into convex optimization form, and present experimental results illustrating the savings in communication that we can obtain.
4 0.10924203 122 nips-2011-How Do Humans Teach: On Curriculum Learning and Teaching Dimension
Author: Faisal Khan, Bilge Mutlu, Xiaojin Zhu
Abstract: We study the empirical strategies that humans follow as they teach a target concept with a simple 1D threshold to a robot.1 Previous studies of computational teaching, particularly the teaching dimension model and the curriculum learning principle, offer contradictory predictions on what optimal strategy the teacher should follow in this teaching task. We show through behavioral studies that humans employ three distinct teaching strategies, one of which is consistent with the curriculum learning principle, and propose a novel theoretical framework as a potential explanation for this strategy. This framework, which assumes a teaching goal of minimizing the learner’s expected generalization error at each iteration, extends the standard teaching dimension model and offers a theoretical justification for curriculum learning. 1
5 0.1063626 52 nips-2011-Clustering via Dirichlet Process Mixture Models for Portable Skill Discovery
Author: Scott Niekum, Andrew G. Barto
Abstract: Skill discovery algorithms in reinforcement learning typically identify single states or regions in state space that correspond to task-specific subgoals. However, such methods do not directly address the question of how many distinct skills are appropriate for solving the tasks that the agent faces. This can be highly inefficient when many identified subgoals correspond to the same underlying skill, but are all used individually as skill goals. Furthermore, skills created in this manner are often only transferable to tasks that share identical state spaces, since corresponding subgoals across tasks are not merged into a single skill goal. We show that these problems can be overcome by clustering subgoal data defined in an agent-space and using the resulting clusters as templates for skill termination conditions. Clustering via a Dirichlet process mixture model is used to discover a minimal, sufficient collection of portable skills. 1
6 0.086613767 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning
7 0.079538673 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
8 0.074331366 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions
9 0.071110711 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
10 0.066223025 215 nips-2011-Policy Gradient Coagent Networks
11 0.065497376 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning
12 0.06104489 190 nips-2011-Nonlinear Inverse Reinforcement Learning with Gaussian Processes
13 0.058999896 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
14 0.058631591 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
15 0.055233896 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
16 0.054941464 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning
17 0.054924332 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time
18 0.051307116 21 nips-2011-Active Learning with a Drifting Distribution
19 0.051118445 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
20 0.048431106 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
topicId topicWeight
[(0, 0.105), (1, -0.121), (2, 0.042), (3, 0.092), (4, -0.077), (5, 0.023), (6, -0.002), (7, -0.037), (8, -0.029), (9, -0.035), (10, 0.026), (11, 0.004), (12, -0.012), (13, -0.011), (14, 0.018), (15, -0.111), (16, 0.122), (17, 0.064), (18, 0.076), (19, -0.104), (20, 0.114), (21, 0.042), (22, -0.079), (23, -0.004), (24, 0.08), (25, 0.076), (26, 0.003), (27, 0.037), (28, -0.045), (29, -0.091), (30, -0.085), (31, -0.03), (32, -0.157), (33, -0.077), (34, 0.048), (35, -0.0), (36, 0.06), (37, 0.063), (38, 0.013), (39, -0.095), (40, 0.14), (41, -0.036), (42, 0.213), (43, 0.198), (44, 0.107), (45, -0.055), (46, 0.156), (47, -0.091), (48, -0.054), (49, 0.008)]
simIndex simValue paperId paperTitle
same-paper 1 0.96647221 48 nips-2011-Blending Autonomous Exploration and Apprenticeship Learning
Author: Thomas J. Walsh, Daniel K. Hewlett, Clayton T. Morrison
Abstract: We present theoretical and empirical results for a framework that combines the benefits of apprenticeship and autonomous reinforcement learning. Our approach modifies an existing apprenticeship learning framework that relies on teacher demonstrations and does not necessarily explore the environment. The first change is replacing previously used Mistake Bound model learners with a recently proposed framework that melds the KWIK and Mistake Bound supervised learning protocols. The second change is introducing a communication of expected utility from the student to the teacher. The resulting system only uses teacher traces when the agent needs to learn concepts it cannot efficiently learn on its own. 1
2 0.81694275 41 nips-2011-Autonomous Learning of Action Models for Planning
Author: Neville Mehta, Prasad Tadepalli, Alan Fern
Abstract: This paper introduces two new frameworks for learning action models for planning. In the mistake-bounded planning framework, the learner has access to a planner for the given model representation, a simulator, and a planning problem generator, and aims to learn a model with at most a polynomial number of faulty plans. In the planned exploration framework, the learner does not have access to a problem generator and must instead design its own problems, plan for them, and converge with at most a polynomial number of planning attempts. The paper reduces learning in these frameworks to concept learning with one-sided error and provides algorithms for successful learning in both frameworks. A specific family of hypothesis spaces is shown to be efficiently learnable in both the frameworks. 1
3 0.64473337 79 nips-2011-Efficient Offline Communication Policies for Factored Multiagent POMDPs
Author: João V. Messias, Matthijs Spaan, Pedro U. Lima
Abstract: Factored Decentralized Partially Observable Markov Decision Processes (DecPOMDPs) form a powerful framework for multiagent planning under uncertainty, but optimal solutions require a rigid history-based policy representation. In this paper we allow inter-agent communication which turns the problem in a centralized Multiagent POMDP (MPOMDP). We map belief distributions over state factors to an agent’s local actions by exploiting structure in the joint MPOMDP policy. The key point is that when sparse dependencies between the agents’ decisions exist, often the belief over its local state factors is sufficient for an agent to unequivocally identify the optimal action, and communication can be avoided. We formalize these notions by casting the problem into convex optimization form, and present experimental results illustrating the savings in communication that we can obtain.
4 0.55596769 52 nips-2011-Clustering via Dirichlet Process Mixture Models for Portable Skill Discovery
Author: Scott Niekum, Andrew G. Barto
Abstract: Skill discovery algorithms in reinforcement learning typically identify single states or regions in state space that correspond to task-specific subgoals. However, such methods do not directly address the question of how many distinct skills are appropriate for solving the tasks that the agent faces. This can be highly inefficient when many identified subgoals correspond to the same underlying skill, but are all used individually as skill goals. Furthermore, skills created in this manner are often only transferable to tasks that share identical state spaces, since corresponding subgoals across tasks are not merged into a single skill goal. We show that these problems can be overcome by clustering subgoal data defined in an agent-space and using the resulting clusters as templates for skill termination conditions. Clustering via a Dirichlet process mixture model is used to discover a minimal, sufficient collection of portable skills. 1
5 0.54254764 122 nips-2011-How Do Humans Teach: On Curriculum Learning and Teaching Dimension
Author: Faisal Khan, Bilge Mutlu, Xiaojin Zhu
Abstract: We study the empirical strategies that humans follow as they teach a target concept with a simple 1D threshold to a robot.1 Previous studies of computational teaching, particularly the teaching dimension model and the curriculum learning principle, offer contradictory predictions on what optimal strategy the teacher should follow in this teaching task. We show through behavioral studies that humans employ three distinct teaching strategies, one of which is consistent with the curriculum learning principle, and propose a novel theoretical framework as a potential explanation for this strategy. This framework, which assumes a teaching goal of minimizing the learner’s expected generalization error at each iteration, extends the standard teaching dimension model and offers a theoretical justification for curriculum learning. 1
6 0.42431608 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions
7 0.39338046 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search
8 0.35307705 237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization
9 0.33529019 3 nips-2011-A Collaborative Mechanism for Crowdsourcing Prediction Problems
10 0.32520157 38 nips-2011-Anatomically Constrained Decoding of Finger Flexion from Electrocorticographic Signals
11 0.31413403 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
12 0.31255645 11 nips-2011-A Reinforcement Learning Theory for Homeostatic Regulation
13 0.29979041 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
14 0.29779452 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning
15 0.27992237 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations
16 0.27404839 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
17 0.26686588 90 nips-2011-Evaluating the inverse decision-making approach to preference learning
18 0.24937826 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning
19 0.244432 21 nips-2011-Active Learning with a Drifting Distribution
20 0.24353059 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
topicId topicWeight
[(0, 0.017), (4, 0.033), (20, 0.018), (26, 0.015), (31, 0.094), (33, 0.019), (35, 0.023), (37, 0.361), (43, 0.044), (45, 0.075), (57, 0.03), (74, 0.041), (83, 0.045), (86, 0.027), (99, 0.045)]
simIndex simValue paperId paperTitle
same-paper 1 0.807998 48 nips-2011-Blending Autonomous Exploration and Apprenticeship Learning
Author: Thomas J. Walsh, Daniel K. Hewlett, Clayton T. Morrison
Abstract: We present theoretical and empirical results for a framework that combines the benefits of apprenticeship and autonomous reinforcement learning. Our approach modifies an existing apprenticeship learning framework that relies on teacher demonstrations and does not necessarily explore the environment. The first change is replacing previously used Mistake Bound model learners with a recently proposed framework that melds the KWIK and Mistake Bound supervised learning protocols. The second change is introducing a communication of expected utility from the student to the teacher. The resulting system only uses teacher traces when the agent needs to learn concepts it cannot efficiently learn on its own. 1
2 0.69545352 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions
Author: Zhan Lim, Lee Sun, Daniel J. Hsu
Abstract: POMDP planning faces two major computational challenges: large state spaces and long planning horizons. The recently introduced Monte Carlo Value Iteration (MCVI) can tackle POMDPs with very large discrete state spaces or continuous state spaces, but its performance degrades when faced with long planning horizons. This paper presents Macro-MCVI, which extends MCVI by exploiting macro-actions for temporal abstraction. We provide sufficient conditions for Macro-MCVI to inherit the good theoretical properties of MCVI. Macro-MCVI does not require explicit construction of probabilistic models for macro-actions and is thus easy to apply in practice. Experiments show that Macro-MCVI substantially improves the performance of MCVI with suitable macro-actions. 1
3 0.46396247 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning
Author: Joni K. Pajarinen, Jaakko Peltonen
Abstract: Applications such as robot control and wireless communication require planning under uncertainty. Partially observable Markov decision processes (POMDPs) plan policies for single agents under uncertainty and their decentralized versions (DEC-POMDPs) find a policy for multiple agents. The policy in infinite-horizon POMDP and DEC-POMDP problems has been represented as finite state controllers (FSCs). We introduce a novel class of periodic FSCs, composed of layers connected only to the previous and next layer. Our periodic FSC method finds a deterministic finite-horizon policy and converts it to an initial periodic infinitehorizon policy. This policy is optimized by a new infinite-horizon algorithm to yield deterministic periodic policies, and by a new expectation maximization algorithm to yield stochastic periodic policies. Our method yields better results than earlier planning methods and can compute larger solutions than with regular FSCs.
4 0.43776485 79 nips-2011-Efficient Offline Communication Policies for Factored Multiagent POMDPs
Author: João V. Messias, Matthijs Spaan, Pedro U. Lima
Abstract: Factored Decentralized Partially Observable Markov Decision Processes (DecPOMDPs) form a powerful framework for multiagent planning under uncertainty, but optimal solutions require a rigid history-based policy representation. In this paper we allow inter-agent communication which turns the problem in a centralized Multiagent POMDP (MPOMDP). We map belief distributions over state factors to an agent’s local actions by exploiting structure in the joint MPOMDP policy. The key point is that when sparse dependencies between the agents’ decisions exist, often the belief over its local state factors is sufficient for an agent to unequivocally identify the optimal action, and communication can be avoided. We formalize these notions by casting the problem into convex optimization form, and present experimental results illustrating the savings in communication that we can obtain.
5 0.42639068 41 nips-2011-Autonomous Learning of Action Models for Planning
Author: Neville Mehta, Prasad Tadepalli, Alan Fern
Abstract: This paper introduces two new frameworks for learning action models for planning. In the mistake-bounded planning framework, the learner has access to a planner for the given model representation, a simulator, and a planning problem generator, and aims to learn a model with at most a polynomial number of faulty plans. In the planned exploration framework, the learner does not have access to a problem generator and must instead design its own problems, plan for them, and converge with at most a polynomial number of planning attempts. The paper reduces learning in these frameworks to concept learning with one-sided error and provides algorithms for successful learning in both frameworks. A specific family of hypothesis spaces is shown to be efficiently learnable in both the frameworks. 1
6 0.40208393 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
7 0.39000419 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
8 0.38979772 75 nips-2011-Dynamical segmentation of single trials from population neural data
9 0.38871124 249 nips-2011-Sequence learning with hidden units in spiking neural networks
10 0.38718119 229 nips-2011-Query-Aware MCMC
11 0.38716206 180 nips-2011-Multiple Instance Filtering
12 0.38554379 122 nips-2011-How Do Humans Teach: On Curriculum Learning and Teaching Dimension
13 0.38526857 102 nips-2011-Generalised Coupled Tensor Factorisation
14 0.38475388 219 nips-2011-Predicting response time and error rates in visual search
15 0.38467532 66 nips-2011-Crowdclustering
16 0.38462994 301 nips-2011-Variational Gaussian Process Dynamical Systems
17 0.38399643 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
18 0.38370478 258 nips-2011-Sparse Bayesian Multi-Task Learning
19 0.3832342 241 nips-2011-Scalable Training of Mixture Models via Coresets
20 0.38315138 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning