nips nips2011 nips2011-41 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Neville Mehta, Prasad Tadepalli, Alan Fern
Abstract: This paper introduces two new frameworks for learning action models for planning. In the mistake-bounded planning framework, the learner has access to a planner for the given model representation, a simulator, and a planning problem generator, and aims to learn a model with at most a polynomial number of faulty plans. In the planned exploration framework, the learner does not have access to a problem generator and must instead design its own problems, plan for them, and converge with at most a polynomial number of planning attempts. The paper reduces learning in these frameworks to concept learning with one-sided error and provides algorithms for successful learning in both frameworks. A specific family of hypothesis spaces is shown to be efficiently learnable in both the frameworks. 1
Reference: text
sentIndex sentText sentNum sentScore
1 In the mistake-bounded planning framework, the learner has access to a planner for the given model representation, a simulator, and a planning problem generator, and aims to learn a model with at most a polynomial number of faulty plans. [sent-5, score-1.288]
2 In the planned exploration framework, the learner does not have access to a problem generator and must instead design its own problems, plan for them, and converge with at most a polynomial number of planning attempts. [sent-6, score-1.309]
3 A specific family of hypothesis spaces is shown to be efficiently learnable in both the frameworks. [sent-8, score-0.397]
4 Moreover, model learning, planning, and plan execution must be interleaved, because agents need to plan long before perfect models are learned. [sent-11, score-0.536]
5 This paper formulates and analyzes the learning of deterministic action models used in planning for goal achievement. [sent-12, score-0.527]
6 It has been shown that deterministic STRIPS actions with a constant number of preconditions can be learned from raw experience with at most a polynomial number of plan prediction mistakes [8]. [sent-13, score-0.533]
7 In spite of this positive result, compact action models for fully observable, deterministic domains are not always efficiently learnable. [sent-14, score-0.368]
8 For example, action models represented as arbitrary Boolean functions are not efficiently learnable under standard cryptographic assumptions such as the hardness of factoring [2]. [sent-15, score-0.382]
9 Learning action models for planning is different from learning an arbitrary function from states and actions to next states, because one can ignore modeling the effects of some actions in certain contexts. [sent-16, score-0.58]
10 To capture this intuition, we introduce the concept of an adequate model, that is, a model that is sound and sufficiently complete for planning for a given class of goals. [sent-18, score-0.709]
11 We define two distinct frameworks for learning adequate models for planning and then characterize sufficient conditions for success in these frameworks. [sent-20, score-0.571]
12 In the mistake-bounded planning (MBP) framework, the goal is to continually solve user-generated planning problems while learning action models and guarantee at most a polynomial number of faulty plans or mistakes. [sent-21, score-1.101]
13 We assume that in addition to the problem generator, the learner has access to a sound and complete planner and a simulator (or the real world). [sent-22, score-0.747]
14 We also introduce a more demanding planned exploration (PLEX) framework, where the learner needs to generate its own problems to refine its action model. [sent-23, score-0.604]
15 This requirement translates to an experiment-design problem, where the learner needs to design problems in a goal language to refine the action models. [sent-24, score-0.632]
16 The MBP and PLEX frameworks can be reduced to over-general query learning, a form of concept learning with strictly one-sided error in which the learner is only allowed to make false-positive mistakes [7]. [sent-25, score-0.595]
17 We view an action model as a set of state-action-state transitions and ensure that the learner always maintains a hypothesis which includes all transitions in some adequate model. [sent-28, score-0.93]
18 Thus, a sound plan is always in the learner’s search space, while it may not always be generated. [sent-29, score-0.384]
19 As the learner gains more experience in generating plans, executing them on the simulator, and receiving observations, the hypothesis is incrementally refined until an adequate model is discovered. [sent-30, score-0.686]
20 To ground our analysis, we show that a general family of hypothesis spaces is learnable in polynomial time in the two frameworks given appropriate goal languages. [sent-31, score-0.677]
21 Given a set of negative examples Z consistent with a target hypothesis h, the version space of action models is the subset of all hypotheses in H that are consistent with Z and is denoted as M(Z). [sent-45, score-0.514]
22 H is well-structured if, for any negative example set Z which has a consistent target hypothesis in H, the version space M(Z) contains a most general hypothesis mgh(Z). [sent-48, score-0.416]
23 The learner outputs a query in the form of a hypothesis h ∈ H, where h must be at least as general as c. [sent-66, score-0.625]
24 A hypothesis space is OGQ-learnable if there exists a learning algorithm for the OGQ framework that identifies the target c with a number of queries and a total running time that are polynomial in the size of c and the size of the largest counterexample. [sent-71, score-0.509]
25 H is learnable in the OGQ framework if and only if H is efficiently well-structured and its height is a polynomial function. [sent-73, score-0.436]
26 (If) If H is efficiently well-structured, then the OGQ learner can always output the mgh, guaranteed to be more general than the target concept, in polynomial time. [sent-75, score-0.529]
27 Because the maximum number of hypothesis refinements is bounded by the polynomial height of H, it is learnable in the OGQ framework. [sent-76, score-0.564]
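A minimal sketch of the generic OGQ learner implied by this direction of the argument, assuming a polynomial-time routine mgh(Z) for the most general consistent hypothesis and a teacher callable that returns a counterexample or None; both names are placeholders rather than anything defined in the paper.

```python
# A minimal sketch of the generic OGQ learner. `mgh` and `teacher` are assumed
# callables (not names from the paper): mgh(Z) returns the most general
# hypothesis in H consistent with the negative examples Z, and teacher(h)
# returns a counterexample covered by h but not by the target, or None once
# h equals the target concept.

def ogq_learner(mgh, teacher):
    Z = set()                      # counterexamples (negative examples) so far
    h = mgh(frozenset(Z))          # initial query: the most general hypothesis
    while True:
        z = teacher(h)             # every counterexample is a false positive
        if z is None:
            return h               # h has converged to the target concept
        Z.add(z)
        h = mgh(frozenset(Z))      # refine: next most general consistent hypothesis
```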
28 At some point, these counterexamples will force the learner to choose between h1 and h2, because their union is not in the hypothesis space. [sent-79, score-0.566]
29 Once the learner makes its choice, the teacher can choose the other hypothesis as its target concept c, resulting in the learner’s hypothesis not being more general than c. [sent-80, score-0.896]
30 If the teacher picks mgh(Z ∪{z}) as the target concept and only provides counterexamples from Z ∪ {z}, then the learner cannot have polynomial running time. [sent-82, score-0.78]
31 Finally, the teacher can always provide counterexamples that force the learner to take the longest path in H’s generalization graph. [sent-83, score-0.553]
32 An inclusion mistake is made when the learner predicts x ∈ c although x ∉ c; an exclusion mistake is made when the learner predicts x ∉ c although x ∈ c. [sent-87, score-0.995]
33 The teacher presents the true label to the learner if a mistake is made, and then presents the next instance to the learner, and so on. [sent-88, score-0.571]
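The interaction just described can be summarized by a small driver loop; learner_predict, learner_update, and target below are assumed interfaces, introduced only for illustration.

```python
# Illustrative driver for the mistake-bound protocol described above. The
# callables `learner_predict` and `learner_update` and the set `target` are
# assumed interfaces for a concrete learner and concept.

def run_mb_protocol(instances, learner_predict, learner_update, target):
    mistakes = 0
    for x in instances:
        guess = learner_predict(x)       # learner predicts membership in c
        truth = x in target
        if guess != truth:               # inclusion or exclusion mistake
            mistakes += 1
            learner_update(x, truth)     # teacher presents the true label
    return mistakes
```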
34 In the following analysis, we let the name of a framework denote the set of hypothesis spaces learnable in that framework. [sent-92, score-0.417]
35 We can construct an OGMB learner from the OGQ learner as follows. [sent-96, score-0.616]
36 When the OGQ learner makes a query h, we use h to make predictions for the OGMB learner. [sent-97, score-0.391]
37 Because the OGQ learner makes only a polynomial number of queries and takes polynomial time for query generation, the simulated OGMB learner makes only a polynomial number of mistakes and runs in at most polynomial time per instance. [sent-101, score-1.528]
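The simulation can be pictured as a thin wrapper around the OGQ learner; query(), counterexample(), and covers() are assumed method names, not the paper's notation.

```python
# A sketch of the simulation: an OGMB learner built on top of an OGQ learner.
# The methods `query`, `counterexample`, and `covers` are assumed interfaces.

class OGMBFromOGQ:
    def __init__(self, ogq):
        self.ogq = ogq
        self.h = ogq.query()                 # current over-general hypothesis

    def predict(self, x):
        # Over-generality of h guarantees any mistake is an inclusion mistake.
        return self.h.covers(x)

    def observe_mistake(self, x):
        # x was predicted positive but is actually negative: forward it to the
        # OGQ learner as a counterexample and re-query.
        self.ogq.counterexample(x)
        self.h = self.ogq.query()
```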
38 The converse does not hold in general because the queries of the OGQ learner are restricted to be “proper”, that is, they must belong to the given hypothesis space. [sent-102, score-0.545]
39 While the OGMB learner can maintain the version space of all consistent hypotheses of a polynomially-sized hypothesis space, the OGQ learner can only query with a single hypothesis and there may not be any hypothesis that is guaranteed to be more general than the target concept. [sent-103, score-1.35]
40 If the learner is allowed to ask queries outside H, such as queries of the form h1 ∪ · · · ∪ hn for all hi in the version space, then over-general learning is possible. [sent-104, score-0.386]
41 In general, if the learner is allowed to ask about any polynomially-sized, polynomial-time computable hypothesis, then it is as powerful as OGMB, because it can encode the computation of the OGMB learner inside a polynomial-sized circuit and query with that as the hypothesis. [sent-105, score-0.699]
42 The Knows-What-It-Knows (KWIK) learning framework [4] is similar to the OGMB framework with one key difference: it does not allow the learner to make any prediction when it does not know the correct answer. [sent-109, score-0.4]
43 The set of hypothesis spaces learnable in the mistake-bound (MB) framework is a strict subset of that learnable in the probably-approximately-correct (PAC) framework [5], leading to the following result. [sent-112, score-0.635]
44 OGMB ⊊ MB: Every hypothesis space that is OGMB-learnable is MB-learnable because the OGMB learner is additionally constrained to not make an exclusion mistake. [sent-116, score-0.549]
45 A single exclusion mistake is sufficient for an MB learner to learn this hypothesis space. [sent-119, score-0.685]
46 As there is exactly one positive example, this could force the OGMB learner to make an exponential number of mistakes (similar to guessing an unknown password). [sent-121, score-0.378]
47 KWIK ⊊ OGMB: If a concept class is KWIK-learnable, it is also OGMB-learnable: when the KWIK learner does not know the true label, the OGMB learner simply predicts that the instance is positive and gets corrected if it is wrong. [sent-122, score-0.688]
48 A single inclusion mistake is sufficient for the OGMB learner to learn this hypothesis space. [sent-126, score-0.658]
49 On the other hand, the teacher can supply the KWIK learner with an exponential number of positive examples, because the KWIK learner cannot ever know that the target does not include all possible instances; this implies that the number of abstentions is not polynomially bounded. [sent-127, score-0.836]
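The forward direction of this comparison amounts to a two-function adapter; predict() and observe() are assumed names for the KWIK learner's interface.

```python
# A sketch of the forward direction (KWIK-learnable implies OGMB-learnable).
# `predict` returning None denotes an abstention; `observe` is the assumed
# name for passing the revealed label back to the KWIK learner.

def ogmb_predict(kwik, x):
    y = kwik.predict(x)              # True, False, or None ("I don't know")
    return True if y is None else y  # abstention becomes a positive prediction

def ogmb_correct(kwik, x, true_label):
    kwik.observe(x, true_label)      # the correction resolves the abstention
```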
50 S = Dⁿ represents the state space and T ⊂ S × A × S is the transition relation, where (s, a, s′) ∈ T signifies that taking action a in state s results in state s′. [sent-134, score-0.404]
51 As we only consider learning deterministic action models, the transition relation is in fact a function, although the learner’s hypothesis space may include nondeterministic models. [sent-135, score-0.48]
52 A planning problem is a pair (s0 , g), where s0 ∈ S and the goal condition g is an expression chosen from a goal language G and represents a set of states in which it evaluates to true. [sent-140, score-0.491]
53 Given a planning problem (s0 , g), a plan is a sequence of states and actions s0 , a1 , . [sent-142, score-0.625]
54 The plan is sound with respect to (M, g) if (si−1 , ai , si ) ∈ M for 1 ≤ i ≤ p. [sent-146, score-0.384]
55 A planner for the hypothesis space and goal language pair (H, G) is an algorithm that takes M ∈ H and (s0 , g ∈ G) as inputs and outputs a plan or signals failure. [sent-149, score-0.775]
56 It is sound with respect to (H, G) if, given any M and (s0 , g), it produces a sound plan with respect to (M, g) or signals failure. [sent-150, score-0.544]
57 It is complete with respect to (H, G) if, given any M and (s0 , g), it produces a sound plan whenever one exists with respect to (M, g). [sent-151, score-0.413]
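The definitions above can be made concrete with a few illustrative types; these are assumptions for illustration, and the soundness check covers only the transition condition stated here.

```python
# Illustrative types for the definitions above (assumed, not from the paper).
# The soundness check tests only the stated transition condition; checking
# that the final state satisfies the goal g is elided.

from typing import FrozenSet, Hashable, List, Tuple

State = Tuple[Hashable, ...]             # a point in S = D^n
Transition = Tuple[State, str, State]    # (s, a, s') in S x A x S
Model = FrozenSet[Transition]            # an action model M, a set of transitions
Plan = List[Tuple[str, State]]           # actions paired with predicted next states

def is_sound(plan: Plan, s0: State, model: Model) -> bool:
    s = s0
    for a, s_next in plan:
        if (s, a, s_next) not in model:  # every step must be a transition of M
            return False
        s = s_next
    return True
```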
58 We generalize the definition of soundness from its standard usage in the literature in order to apply to nondeterministic action models, where the nondeterminism is “angelic” — the planner can control the outcome of actions when multiple outcomes are possible according to its model [6]. [sent-152, score-0.429]
59 One way to implement such a planner is to do forward search through all possible action and outcome sequences and return an action sequence if it leads to a goal under some outcome choices. [sent-153, score-0.572]
60 Our analysis is agnostic to plan quality or plan length and applies equally well to suboptimal planners. [sent-154, score-0.512]
61 This is motivated by the fact that optimal planning is hard for most domains, but suboptimal planning such as hierarchical planning can be quite efficient and practical. [sent-155, score-0.915]
62 A planning mistake occurs either when the planner signals failure although a sound plan exists with respect to the transition function T, or when the plan output by the planner is not sound with respect to T. [sent-158, score-1.696]
63 Let P be a planning domain and G be a goal language. [sent-161, score-0.378]
64 An action model M is adequate for G in P if M ⊆ T and the existence of a sound plan with respect to (T, g ∈ G) implies the existence of a sound plan with respect to (M, g). [sent-162, score-1.156]
65 H is adequate for G if ∃M ∈ H such that M is adequate for G. [sent-163, score-0.408]
66 However, the model is sufficient to produce a sound plan with respect to (T, g) for every goal g in the desired language. [sent-165, score-0.422]
67 Given a goal language G, a problem generator generates an arbitrary problem (s0 , g ∈ G) and sets the state of the simulator to s0 . [sent-172, score-0.401]
68 It actualizes the teacher of the OGQ framework by combining a problem generator, a planner, and a simulator, and interfaces with the OGQ learner to learn action models as hypotheses over the space of possible state transitions for each action. [sent-174, score-0.806]
69 Further, A must respond in time polynomial in the domain parameters and the length of the longest plan generated by the planner, assuming that a call to the planner, simulator, or problem generator takes O(1) time. [sent-180, score-0.656]
70 The goal language is picked such that the hypothesis space is adequate for it. [sent-181, score-0.489]
71 PROBLEMGENERATOR provides a planning problem and initializes the current state of SIMULATOR. [sent-188, score-0.438]
72 Otherwise, the plan is executed through SIMULATOR until an observed transition conflicts with the predicted transition. [sent-191, score-0.522]
73 If such a transition is found, it is supplied to OGQ-LEARNER and M is updated with the next query; otherwise, the plan is output. [sent-192, score-0.702]
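A hedged reconstruction of the MBP learning loop of Algorithm 1, under assumed interfaces for the OGQ learner, problem generator, planner, and simulator; only the control flow follows the surrounding description.

```python
# A hedged reconstruction of the MBP learning loop of Algorithm 1. All
# interfaces (query/counterexample, next, plan, execute) are assumptions
# introduced for illustration; only the control flow follows the text.

def mbp_loop(ogq, problem_generator, planner, simulator, G):
    M = ogq.query()                                 # initial query / model
    while True:
        s, g = problem_generator.next(G)            # also sets the simulator state to s
        plan = planner.plan(M, s, g)                # plan with the current model
        if plan is None:                            # planner signals failure
            continue
        mistake = False
        for a, s_pred in plan:                      # execute the plan step by step
            s_obs = simulator.execute(a)            # observed next state
            if s_obs != s_pred:                     # conflict with the prediction
                ogq.counterexample((s, a, s_pred))  # faulty transition refines the model
                M = ogq.query()
                mistake = True
                print("mistake")
                break
            s = s_obs
        if not mistake:
            print("plan", plan)
```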
74 As the number of planning mistakes is polynomial and every response of Algorithm 1 is polynomial in the runtime of OGQ-LEARNER and the length of the longest plan, H is learnable in the MBP framework. [sent-194, score-0.973]
75 It also clarifies the notion of an adequate model, which can be much simpler than the true transition model, and the influence of the goal language on the complexity of learning action models. [sent-197, score-0.569]
76 In the planned exploration (PLEX) framework, the agent seeks to learn an action model for the domain without an external problem generator by generating planning problems for itself. [sent-199, score-0.763]
77 The key issue here is to generate a reasonably small number of planning problems such that solving them would identify a deterministic action model. [sent-200, score-0.489]
78 Learning a model in the PLEX framework involves knowing where it is deficient and then planning to reach states that are informative, which entails formulating planning problems in a goal language. [sent-201, score-0.731]
79 This framework provides a polynomial sample-convergence guarantee, which is stronger than the polynomial mistake bound of the MBP framework. [sent-202, score-0.542]
80 Without a problem generator that can change the simulator’s state, it is impossible for the simulator to transition freely between strongly connected components (SCCs) of the transition graph. [sent-203, score-0.38]
81 Let P be a planning domain whose transition graph is a union of SCCs. [sent-207, score-0.442]
82 Further, A must run in polynomial time in the domain parameters and the length of the longest plan output by the planner, assuming that every call to the planner and the simulator takes O(1) time. [sent-209, score-0.872]
83 A key step in planned exploration is designing appropriate planning problems. [sent-210, score-0.417]
84 Given the goal language G and a model M, the problem of experiment design is to find a set of goals G′ ⊆ G such that the sets of states that satisfy the goals in G′ collectively cover all informative states I(M). [sent-219, score-0.471]
85 If none of the goals in G′ can be successfully planned for, then no informative states for that action are reachable. [sent-221, score-0.424]
86 (H, G) is learnable in the PLEX framework if it permits efficient experiment design, and H is adequate for G and is OGQ-learnable. [sent-231, score-0.521]
87 If PLANNER signals failure on all of the goals, then none of the informative states are reachable and M cannot be refined further. [sent-237, score-0.675]
88 If H is OGQ-learnable, then OGQ-LEARNER will only be called a polynomial number of times and it can output a new query in polynomial time. [sent-243, score-0.443]
89 As the number of failures per successful plan is bounded by a polynomial in the width w of (H, G), the total number of calls to PLANNER is polynomial. [sent-244, score-0.436]
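A hedged sketch of the planned-exploration loop described above; design_experiments stands in for the experiment-design step, and every interface is an assumption made for illustration.

```python
# A hedged sketch of the planned-exploration loop. `design_experiments(M)`
# stands in for the experiment-design step that maps the current model to
# goals covering its informative states; all interfaces are assumptions.

def plex_loop(ogq, planner, simulator, design_experiments, s):
    M = ogq.query()
    while True:
        plan = None
        for g in design_experiments(M):             # goals covering I(M)
            plan = planner.plan(M, s, g)
            if plan is not None:
                break                               # a reachable informative goal
        if plan is None:
            return M                                # no informative state reachable
        for a, s_pred in plan:                      # execute and watch for conflicts
            s_obs = simulator.execute(a)
            conflict = (s_obs != s_pred)
            if conflict:
                ogq.counterexample((s, a, s_pred))  # refine on the observed conflict
                M = ogq.query()
            s = s_obs                               # the simulator is now in s_obs
            if conflict:
                break
```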
90 Power(U) is efficiently well-structured, because it is closed under union by definition and the new mgh can be computed by removing any basis hypotheses that are not consistent with the counterexample; this takes polynomial time as U is of polynomial size. [sent-257, score-0.642]
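A minimal sketch of this mgh computation for Power(U), assuming a membership test covers(h, z) for basis hypotheses; the names are illustrative only.

```python
# A minimal sketch of the mgh computation for Power(U). The most general
# hypothesis is the union of all basis hypotheses in U consistent with the
# negative examples seen so far, so refinement just drops the basis hypotheses
# that cover the new counterexample. `covers(h, z)` is an assumed membership test.

def mgh_power(U, covers, negatives):
    # basis hypotheses covering no negative example; their union is mgh(Z)
    return [h for h in U if not any(covers(h, z) for z in negatives)]

def refine_power(basis, covers, z):
    # polynomial in |U|: discard the basis hypotheses inconsistent with z
    return [h for h in basis if not covers(h, z)]
```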
91 Any informative state for a hypothesis in Power(U) is an informative state for some hypothesis in Pairs(U), and vice versa. [sent-270, score-0.588]
92 For any goal language G, (Power(U), G) is learnable in the PLEX framework if (Pairs(U), G) permits efficient experiment design and Power(U) is adequate for G. [sent-275, score-0.661]
93 Let an action production r be defined as “act : pre → post”, where act(r) is an action and the precondition pre(r) and postcondition post(r) are conjunctions of “variable = value” literals. [sent-281, score-0.525]
94 Let k-SAP be the hypothesis space of models represented by a set of action productions (SAP) with no more than k variables per production. [sent-290, score-0.384]
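For illustration, an action production and its application to a state could be encoded as below; the class and method names are assumptions, not the paper's notation.

```python
# An illustrative encoding of an action production "act : pre -> post" (a
# k-SAP basis element); names and structure are assumptions for illustration.

from dataclasses import dataclass
from typing import Dict, Hashable

@dataclass
class ActionProduction:
    act: str
    pre: Dict[str, Hashable]     # conjunction of "variable = value" literals
    post: Dict[str, Hashable]    # conjunction of "variable = value" literals

    def triggered(self, state: Dict[str, Hashable], action: str) -> bool:
        # the production applies when the action matches and every
        # precondition literal holds in the state
        return action == self.act and all(
            state.get(v) == val for v, val in self.pre.items())

    def apply(self, state: Dict[str, Hashable]) -> Dict[str, Hashable]:
        # predicted next state: overwrite exactly the postcondition variables
        return {**state, **self.post}
```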
95 As U is of polynomial size, k-SAP is an instance of the family of basis action models. [sent-292, score-0.39]
96 (k-SAP, Conj) is learnable in the PLEX framework if k-SAP is adequate for Conj. [sent-296, score-0.422]
97 It also provides results on learning a family of hypothesis spaces that is, in some ways, more general than standard action modeling languages. [sent-298, score-0.409]
98 While STRIPS-like languages have served planning research well by creating a common, useful platform, they are not designed from the point of view of learnability or planning efficiency. [sent-300, score-0.61]
99 This suggests an approach in which the learner considers increasingly complex models as dictated by its planning needs. [sent-302, score-0.613]
100 For example, the model learner might start with small values of k in k-SAP and then incrementally increase k until a value is found that is adequate for the goals encountered. [sent-303, score-0.567]
wordName wordTfidf (topN-words)
[('learner', 0.308), ('planning', 0.305), ('ogmb', 0.265), ('ogq', 0.265), ('plan', 0.256), ('adequate', 0.204), ('action', 0.184), ('polynomial', 0.18), ('hypothesis', 0.174), ('learnable', 0.172), ('mbp', 0.168), ('planner', 0.166), ('earner', 0.162), ('mgh', 0.162), ('plex', 0.155), ('simulator', 0.145), ('mistake', 0.136), ('sound', 0.128), ('teacher', 0.127), ('lanner', 0.103), ('generator', 0.095), ('kwik', 0.095), ('query', 0.083), ('planned', 0.078), ('permits', 0.075), ('tates', 0.074), ('language', 0.073), ('concept', 0.072), ('mistakes', 0.07), ('transition', 0.07), ('informative', 0.07), ('production', 0.068), ('exclusion', 0.067), ('longest', 0.066), ('plans', 0.065), ('strips', 0.065), ('frameworks', 0.062), ('hypotheses', 0.061), ('esign', 0.059), ('imulator', 0.059), ('nfoaction', 0.059), ('goals', 0.055), ('cover', 0.053), ('counterexamples', 0.052), ('nondeterministic', 0.052), ('state', 0.05), ('post', 0.049), ('framework', 0.046), ('triggered', 0.045), ('mb', 0.045), ('conj', 0.044), ('xperiment', 0.044), ('boolean', 0.042), ('target', 0.041), ('inclusion', 0.04), ('queries', 0.039), ('propositional', 0.039), ('height', 0.038), ('goal', 0.038), ('states', 0.037), ('power', 0.037), ('nition', 0.036), ('outputs', 0.036), ('domain', 0.035), ('exploration', 0.034), ('executed', 0.034), ('pre', 0.034), ('literals', 0.033), ('union', 0.032), ('signals', 0.032), ('agent', 0.032), ('transitions', 0.03), ('angelic', 0.029), ('chema', 0.029), ('enerator', 0.029), ('roblem', 0.029), ('scc', 0.029), ('sccs', 0.029), ('design', 0.029), ('conjunctions', 0.029), ('autonomous', 0.029), ('exists', 0.029), ('supplied', 0.028), ('consistent', 0.027), ('actions', 0.027), ('ever', 0.026), ('family', 0.026), ('abstentions', 0.026), ('cryptographic', 0.026), ('earning', 0.026), ('precondition', 0.026), ('print', 0.026), ('productions', 0.026), ('superset', 0.026), ('spaces', 0.025), ('must', 0.024), ('failure', 0.024), ('experiment', 0.024), ('faulty', 0.024), ('icts', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 41 nips-2011-Autonomous Learning of Action Models for Planning
Author: Neville Mehta, Prasad Tadepalli, Alan Fern
Abstract: This paper introduces two new frameworks for learning action models for planning. In the mistake-bounded planning framework, the learner has access to a planner for the given model representation, a simulator, and a planning problem generator, and aims to learn a model with at most a polynomial number of faulty plans. In the planned exploration framework, the learner does not have access to a problem generator and must instead design its own problems, plan for them, and converge with at most a polynomial number of planning attempts. The paper reduces learning in these frameworks to concept learning with one-sided error and provides algorithms for successful learning in both frameworks. A specific family of hypothesis spaces is shown to be efficiently learnable in both the frameworks. 1
2 0.24527691 48 nips-2011-Blending Autonomous Exploration and Apprenticeship Learning
Author: Thomas J. Walsh, Daniel K. Hewlett, Clayton T. Morrison
Abstract: We present theoretical and empirical results for a framework that combines the benefits of apprenticeship and autonomous reinforcement learning. Our approach modifies an existing apprenticeship learning framework that relies on teacher demonstrations and does not necessarily explore the environment. The first change is replacing previously used Mistake Bound model learners with a recently proposed framework that melds the KWIK and Mistake Bound supervised learning protocols. The second change is introducing a communication of expected utility from the student to the teacher. The resulting system only uses teacher traces when the agent needs to learn concepts it cannot efficiently learn on its own. 1
3 0.1186147 29 nips-2011-Algorithms and hardness results for parallel large margin learning
Author: Phil Long, Rocco Servedio
Abstract: We study the fundamental problem of learning an unknown large-margin halfspace in the context of parallel computation. Our main positive result is a parallel algorithm for learning a large-margin halfspace that is based on interior point methods from convex optimization and fast parallel algorithms for matrix computations. We show that this algorithm learns an unknown γ-margin halfspace over n dimensions using poly(n, 1/γ) processors ˜ and runs in time O(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have Ω(1/γ 2 ) runtime dependence on γ. Our main negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We give an information-theoretic proof that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized: the ability to call the weak learner multiple times in parallel within a single boosting stage does not reduce the overall number of successive stages of boosting that are required. 1
4 0.11164048 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions
Author: Zhan Lim, Lee Sun, Daniel J. Hsu
Abstract: POMDP planning faces two major computational challenges: large state spaces and long planning horizons. The recently introduced Monte Carlo Value Iteration (MCVI) can tackle POMDPs with very large discrete state spaces or continuous state spaces, but its performance degrades when faced with long planning horizons. This paper presents Macro-MCVI, which extends MCVI by exploiting macro-actions for temporal abstraction. We provide sufficient conditions for Macro-MCVI to inherit the good theoretical properties of MCVI. Macro-MCVI does not require explicit construction of probabilistic models for macro-actions and is thus easy to apply in practice. Experiments show that Macro-MCVI substantially improves the performance of MCVI with suitable macro-actions. 1
5 0.090060495 153 nips-2011-Learning large-margin halfspaces with more malicious noise
Author: Phil Long, Rocco Servedio
Abstract: We describe a simple algorithm that runs in time poly(n, 1/γ, 1/ε) and learns an unknown n-dimensional γ-margin halfspace to accuracy 1 − ε in the presence of malicious noise, when the noise rate is allowed to be as high as Θ(εγ log(1/γ)). Previous efficient algorithms could only learn to accuracy ε in the presence of malicious noise of rate at most Θ(εγ). Our algorithm does not work by optimizing a convex loss function. We show that no algorithm for learning γ-margin halfspaces that minimizes a convex proxy for misclassification error can tolerate malicious noise at a rate greater than Θ(εγ); this may partially explain why previous algorithms could not achieve the higher noise tolerance of our new algorithm. 1
6 0.088709377 79 nips-2011-Efficient Offline Communication Policies for Factored Multiagent POMDPs
7 0.088371493 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search
8 0.073285244 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
9 0.066729002 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations
10 0.066558048 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
11 0.064089604 229 nips-2011-Query-Aware MCMC
12 0.06325344 87 nips-2011-Energetically Optimal Action Potentials
13 0.059561748 122 nips-2011-How Do Humans Teach: On Curriculum Learning and Teaching Dimension
14 0.056609925 12 nips-2011-A Two-Stage Weighting Framework for Multi-Source Domain Adaptation
15 0.051314093 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
16 0.050779667 154 nips-2011-Learning person-object interactions for action recognition in still images
17 0.048808444 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning
18 0.048382208 21 nips-2011-Active Learning with a Drifting Distribution
19 0.048330788 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
20 0.047416843 202 nips-2011-On the Universality of Online Mirror Descent
topicId topicWeight
[(0, 0.128), (1, -0.1), (2, 0.009), (3, 0.074), (4, -0.023), (5, 0.03), (6, -0.039), (7, -0.084), (8, -0.097), (9, -0.047), (10, -0.003), (11, 0.005), (12, -0.035), (13, -0.012), (14, 0.084), (15, -0.112), (16, 0.155), (17, 0.007), (18, -0.005), (19, -0.093), (20, 0.088), (21, 0.056), (22, -0.073), (23, 0.023), (24, 0.085), (25, 0.087), (26, 0.025), (27, 0.088), (28, -0.036), (29, -0.125), (30, -0.036), (31, -0.11), (32, -0.205), (33, -0.072), (34, 0.025), (35, -0.008), (36, 0.123), (37, 0.056), (38, -0.012), (39, -0.09), (40, 0.071), (41, 0.038), (42, 0.167), (43, 0.141), (44, 0.022), (45, -0.095), (46, 0.1), (47, -0.13), (48, -0.009), (49, 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.9611091 41 nips-2011-Autonomous Learning of Action Models for Planning
Author: Neville Mehta, Prasad Tadepalli, Alan Fern
Abstract: This paper introduces two new frameworks for learning action models for planning. In the mistake-bounded planning framework, the learner has access to a planner for the given model representation, a simulator, and a planning problem generator, and aims to learn a model with at most a polynomial number of faulty plans. In the planned exploration framework, the learner does not have access to a problem generator and must instead design its own problems, plan for them, and converge with at most a polynomial number of planning attempts. The paper reduces learning in these frameworks to concept learning with one-sided error and provides algorithms for successful learning in both frameworks. A specific family of hypothesis spaces is shown to be efficiently learnable in both the frameworks. 1
2 0.86690009 48 nips-2011-Blending Autonomous Exploration and Apprenticeship Learning
Author: Thomas J. Walsh, Daniel K. Hewlett, Clayton T. Morrison
Abstract: We present theoretical and empirical results for a framework that combines the benefits of apprenticeship and autonomous reinforcement learning. Our approach modifies an existing apprenticeship learning framework that relies on teacher demonstrations and does not necessarily explore the environment. The first change is replacing previously used Mistake Bound model learners with a recently proposed framework that melds the KWIK and Mistake Bound supervised learning protocols. The second change is introducing a communication of expected utility from the student to the teacher. The resulting system only uses teacher traces when the agent needs to learn concepts it cannot efficiently learn on its own. 1
3 0.67556679 79 nips-2011-Efficient Offline Communication Policies for Factored Multiagent POMDPs
Author: João V. Messias, Matthijs Spaan, Pedro U. Lima
Abstract: Factored Decentralized Partially Observable Markov Decision Processes (DecPOMDPs) form a powerful framework for multiagent planning under uncertainty, but optimal solutions require a rigid history-based policy representation. In this paper we allow inter-agent communication which turns the problem in a centralized Multiagent POMDP (MPOMDP). We map belief distributions over state factors to an agent’s local actions by exploiting structure in the joint MPOMDP policy. The key point is that when sparse dependencies between the agents’ decisions exist, often the belief over its local state factors is sufficient for an agent to unequivocally identify the optimal action, and communication can be avoided. We formalize these notions by casting the problem into convex optimization form, and present experimental results illustrating the savings in communication that we can obtain.
4 0.56485516 122 nips-2011-How Do Humans Teach: On Curriculum Learning and Teaching Dimension
Author: Faisal Khan, Bilge Mutlu, Xiaojin Zhu
Abstract: We study the empirical strategies that humans follow as they teach a target concept with a simple 1D threshold to a robot.1 Previous studies of computational teaching, particularly the teaching dimension model and the curriculum learning principle, offer contradictory predictions on what optimal strategy the teacher should follow in this teaching task. We show through behavioral studies that humans employ three distinct teaching strategies, one of which is consistent with the curriculum learning principle, and propose a novel theoretical framework as a potential explanation for this strategy. This framework, which assumes a teaching goal of minimizing the learner’s expected generalization error at each iteration, extends the standard teaching dimension model and offers a theoretical justification for curriculum learning. 1
5 0.50988966 52 nips-2011-Clustering via Dirichlet Process Mixture Models for Portable Skill Discovery
Author: Scott Niekum, Andrew G. Barto
Abstract: Skill discovery algorithms in reinforcement learning typically identify single states or regions in state space that correspond to task-specific subgoals. However, such methods do not directly address the question of how many distinct skills are appropriate for solving the tasks that the agent faces. This can be highly inefficient when many identified subgoals correspond to the same underlying skill, but are all used individually as skill goals. Furthermore, skills created in this manner are often only transferable to tasks that share identical state spaces, since corresponding subgoals across tasks are not merged into a single skill goal. We show that these problems can be overcome by clustering subgoal data defined in an agent-space and using the resulting clusters as templates for skill termination conditions. Clustering via a Dirichlet process mixture model is used to discover a minimal, sufficient collection of portable skills. 1
6 0.43155512 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search
7 0.40313029 153 nips-2011-Learning large-margin halfspaces with more malicious noise
8 0.39957964 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions
9 0.39647752 3 nips-2011-A Collaborative Mechanism for Crowdsourcing Prediction Problems
10 0.36636633 29 nips-2011-Algorithms and hardness results for parallel large margin learning
11 0.34477448 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations
12 0.32100144 256 nips-2011-Solving Decision Problems with Limited Information
13 0.31350511 237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization
14 0.29917189 87 nips-2011-Energetically Optimal Action Potentials
15 0.29416791 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
16 0.28046104 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
17 0.27234912 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
18 0.27085882 21 nips-2011-Active Learning with a Drifting Distribution
19 0.26598632 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
20 0.26255715 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning
topicId topicWeight
[(0, 0.025), (4, 0.045), (20, 0.021), (26, 0.035), (31, 0.095), (33, 0.019), (37, 0.048), (43, 0.048), (45, 0.085), (57, 0.021), (74, 0.039), (83, 0.021), (86, 0.371), (99, 0.041)]
simIndex simValue paperId paperTitle
same-paper 1 0.78671306 41 nips-2011-Autonomous Learning of Action Models for Planning
Author: Neville Mehta, Prasad Tadepalli, Alan Fern
Abstract: This paper introduces two new frameworks for learning action models for planning. In the mistake-bounded planning framework, the learner has access to a planner for the given model representation, a simulator, and a planning problem generator, and aims to learn a model with at most a polynomial number of faulty plans. In the planned exploration framework, the learner does not have access to a problem generator and must instead design its own problems, plan for them, and converge with at most a polynomial number of planning attempts. The paper reduces learning in these frameworks to concept learning with one-sided error and provides algorithms for successful learning in both frameworks. A specific family of hypothesis spaces is shown to be efficiently learnable in both the frameworks. 1
2 0.6055572 72 nips-2011-Distributed Delayed Stochastic Optimization
Author: Alekh Agarwal, John C. Duchi
Abstract: We analyze the convergence of gradient-based optimization algorithms whose updates depend on delayed stochastic gradient information. The main application of our results is to the development of distributed minimization algorithms where a master node performs parameter updates while worker nodes compute stochastic gradients based on local information in parallel, which may give rise to delays due to asynchrony. Our main contribution is to show that for smooth stochastic problems, the delays are asymptotically negligible. In application to distributed optimization, we show n-node architectures whose optimization error in stochastic problems—in √ spite of asynchronous delays—scales asymptotically as O(1/ nT ), which is known to be optimal even in the absence of delays. 1
3 0.56600535 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
Author: Armen Allahverdyan, Aram Galstyan
Abstract: We present an asymptotic analysis of Viterbi Training (VT) and contrast it with a more conventional Maximum Likelihood (ML) approach to parameter estimation in Hidden Markov Models. While ML estimator works by (locally) maximizing the likelihood of the observed data, VT seeks to maximize the probability of the most likely hidden state sequence. We develop an analytical framework based on a generating function formalism and illustrate it on an exactly solvable model of HMM with one unambiguous symbol. For this particular model the ML objective function is continuously degenerate. VT objective, in contrast, is shown to have only finite degeneracy. Furthermore, VT converges faster and results in sparser (simpler) models, thus realizing an automatic Occam’s razor for HMM learning. For more general scenario VT can be worse compared to ML but still capable of correctly recovering most of the parameters. 1
4 0.43473506 48 nips-2011-Blending Autonomous Exploration and Apprenticeship Learning
Author: Thomas J. Walsh, Daniel K. Hewlett, Clayton T. Morrison
Abstract: We present theoretical and empirical results for a framework that combines the benefits of apprenticeship and autonomous reinforcement learning. Our approach modifies an existing apprenticeship learning framework that relies on teacher demonstrations and does not necessarily explore the environment. The first change is replacing previously used Mistake Bound model learners with a recently proposed framework that melds the KWIK and Mistake Bound supervised learning protocols. The second change is introducing a communication of expected utility from the student to the teacher. The resulting system only uses teacher traces when the agent needs to learn concepts it cannot efficiently learn on its own. 1
5 0.41073918 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions
Author: Zhan Lim, Lee Sun, Daniel J. Hsu
Abstract: POMDP planning faces two major computational challenges: large state spaces and long planning horizons. The recently introduced Monte Carlo Value Iteration (MCVI) can tackle POMDPs with very large discrete state spaces or continuous state spaces, but its performance degrades when faced with long planning horizons. This paper presents Macro-MCVI, which extends MCVI by exploiting macro-actions for temporal abstraction. We provide sufficient conditions for Macro-MCVI to inherit the good theoretical properties of MCVI. Macro-MCVI does not require explicit construction of probabilistic models for macro-actions and is thus easy to apply in practice. Experiments show that Macro-MCVI substantially improves the performance of MCVI with suitable macro-actions. 1
6 0.39323086 229 nips-2011-Query-Aware MCMC
7 0.39028972 180 nips-2011-Multiple Instance Filtering
8 0.38890421 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
9 0.38747507 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines
10 0.38726428 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning
11 0.38721409 241 nips-2011-Scalable Training of Mixture Models via Coresets
12 0.38631609 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
13 0.38559645 231 nips-2011-Randomized Algorithms for Comparison-based Search
14 0.3855432 197 nips-2011-On Tracking The Partition Function
15 0.38515982 258 nips-2011-Sparse Bayesian Multi-Task Learning
16 0.38487563 66 nips-2011-Crowdclustering
17 0.38439685 283 nips-2011-The Fixed Points of Off-Policy TD
18 0.38408443 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
19 0.38357383 246 nips-2011-Selective Prediction of Financial Trends with Hidden Markov Models
20 0.38343951 75 nips-2011-Dynamical segmentation of single trials from population neural data