nips nips2009 nips2009-218 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: George Konidaris, Andre S. Barreto
Abstract: We introduce a skill discovery method for reinforcement learning in continuous domains that constructs chains of skills leading to an end-of-task reward. We demonstrate experimentally that it creates appropriate skills and achieves performance benefits in a challenging continuous domain. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We introduce a skill discovery method for reinforcement learning in continuous domains that constructs chains of skills leading to an end-of-task reward. [sent-5, score-0.904]
2 1 Introduction Much recent research in reinforcement learning (RL) has focused on hierarchical RL methods [1] and in particular the options framework [2], which adds to the RL framework principled methods for planning and learning using high-level skills (called options). [sent-7, score-0.468]
3 An important research goal is the development of methods by which an agent can discover useful new skills autonomously, and thereby construct its own high-level skill hierarchies. [sent-8, score-0.857]
4 Although several methods exist for creating new options in discrete domains, none are immediately extensible to, or have been successfully applied in, continuous domains. [sent-9, score-0.473]
5 We introduce skill chaining, a skill discovery method for agents in continuous domains. [sent-10, score-1.161]
6 Skill chaining produces chains of skills, each leading to one of a list of designated target events, where the list can simply contain the end-of-episode event or more sophisticated heuristic events (e. [sent-11, score-0.476]
7 The goal of each skill in the chain is to enable the agent to reach a state where its successor skill can be successfully executed. [sent-14, score-1.371]
8 We demonstrate experimentally that skill chaining creates appropriate skills and achieves performance improvements in the Pinball domain. [sent-15, score-0.856]
9 The options framework adds methods for planning and learning with options as temporally-extended actions to the standard RL setting, which is based on the Markov decision process (MDP) formalism [4]. [sent-17, score-0.728]
10 Options can be added to an agent’s action repertoire alongside its primitive actions, and the agent chooses when to execute them in the same way it chooses when to execute primitive actions. [sent-18, score-0.479]
11 Methods for creating new options must determine when to create an option and how to define its termination condition (skill discovery), how to expand its initiation set, and how to learn its policy. [sent-19, score-1.291]
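As a concrete illustration of these three components, the sketch below bundles an initiation-set test, a policy, and a termination condition into one container. It is not code from the paper; the class and method names are illustrative assumptions.

```python
# Minimal sketch of an option as an (initiation set, policy, termination) triple.
# Illustrative only; names and signatures are assumptions, not the paper's code.
from dataclasses import dataclass
from typing import Callable

import numpy as np


@dataclass
class Option:
    initiation: Callable[[np.ndarray], bool]    # I(s): may the option start in s?
    policy: Callable[[np.ndarray], int]         # pi(s): primitive action to take
    termination: Callable[[np.ndarray], float]  # beta(s): probability of stopping

    def can_initiate(self, state: np.ndarray) -> bool:
        return self.initiation(state)

    def act(self, state: np.ndarray) -> int:
        return self.policy(state)

    def should_terminate(self, state: np.ndarray, rng: np.random.Generator) -> bool:
        # Sample the (possibly stochastic) termination condition.
        return rng.random() < self.termination(state)
```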
12 Given an option reward function, policy learning can be viewed as just another RL problem. [sent-20, score-0.61]
13 Creation and termination are typically performed by the identification of option goal states, with an option created to reach one of its goal states and terminate when it does so. [sent-21, score-1.182]
14 The initiation set is then the set of states from which a goal state can be reached. [sent-22, score-0.473]
15 In previous research, option goal states have been selected by a variety of methods, the most common relying on computing visit or reward statistics over individual states to identify useful subgoals [5, 6, 7, 8]. [sent-23, score-0.707]
16 In domains with factored state spaces, the agent may create options to change infrequently changing variables [12, 13]. [sent-27, score-0.781]
17 Finally, some methods extract options by exploiting commonalities in collections of policies over a single state space [14, 15, 16, 17]. [sent-28, score-0.457]
18 These properties are unlikely to easily generalize to continuous spaces, where an agent may never see the same state twice. [sent-30, score-0.341]
19 We know of very little work on skill acquisition in continuous domains where the skills or action hierarchy are not designed in advance. [sent-31, score-0.782]
20 3 Skill Discovery in Continuous Domains In discrete domains, the primary reason for creating an option to reach a goal state is to make that state prominent in learning: a state that may once have been difficult to reach can now be reached using a single decision (to invoke the option). [sent-43, score-0.909]
21 This effectively modifies the connectivity of the MDP by connecting the option’s goal states to every state in its initiation set. [sent-44, score-0.473]
22 Another reason for creating options is transfer: if options are learned in an appropriate space they can be used in later tasks to speed up learning. [sent-45, score-0.792]
23 If the agent faces a sequence of tasks having the same state space, then options learned in it are portable [14, 15, 16, 17]; if it faces a sequence of tasks having different but related state spaces, then the options must be learned using features common to all the tasks [22]. [sent-46, score-1.175]
24 Creating new options that each have their own function approximator concentrated on a subset of the state space may result in better overall policies by freeing the primary value function from having to simultaneously represent the complexities of the individual option value functions. [sent-49, score-0.967]
25 Thus, skill discovery offers an additional representational benefit in continuous domains. [sent-50, score-0.573]
26 Most existing skill discovery methods identify a single state as an option target. [sent-53, score-1.031]
27 In continuous domains, where the agent may never see the same state twice, this must be generalized to a target region. [sent-54, score-0.422]
28 For example, many of the above methods generate target regions that are difficult to reach—a too-small neighborhood may make the target nearly impossible to reach; conversely, a too-large neighborhood may include regions that are not difficult to reach at all. [sent-56, score-0.34]
29 While in discrete domains it is common for an option’s initiation set to expand arbitrarily as the agent learns a policy for successfully executing the option, this is not desirable in continuous domains. [sent-59, score-0.839]
30 In discrete domains without function approximation a policy to reach a subgoal can always be represented exactly; in continuous domains (or even discrete domains with function approximation), it may only be possible to represent such a policy locally. [sent-60, score-0.585]
31 An option, in both discrete and continuous domains, should be able to consistently solve a simpler problem than the overall task using a simpler policy. [sent-63, score-0.729]
32 Thus, in a discrete domain it is perfectly feasible to create a new value table for each learned option of the same dimension as the task value table. [sent-65, score-0.612]
33 Therefore, “lightweight” options that use fewer features than needed to solve the overall problem are desirable in high-dimensional domains, or when we may wish to create many skills. [sent-67, score-0.415]
34 In the following section we develop an algorithm for skill discovery in continuous domains by addressing these challenges. [sent-73, score-0.646]
35 4 Skill Chaining Since a useful option lies on a solution path, it seems natural to first create an option to reach the task’s goal. [sent-74, score-1.07]
36 The high likelihood that the option can only do so from a local neighborhood about this region suggests a follow-on step: create an option to reach the states where the first option can be successfully executed. [sent-75, score-1.574]
37 This section describes skill chaining, a method that formalizes this intuition to create chains of options to reach a given target event by repeatedly creating options to reach options created earlier in the chain. [sent-76, score-1.989]
38 First, we describe how to create an option given a target event. [sent-77, score-0.588]
39 4.1 Creating an Option to Trigger a Target Event Given an episodic task defined over state space S with reward function R, assume we are given a goal trigger function T defined over S that evaluates to 1 on states in the goal event and 0 otherwise. [sent-79, score-0.406]
40 To create an option oT whose goal is to trigger T, i.e., to reach a state on which T evaluates to 1, we must define oT's termination condition, reward function, and initiation set. [sent-82, score-0.6]
41 We set oT ’s reward function to R plus an option completion reward for triggering T . [sent-84, score-0.645]
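A minimal sketch of this pseudo-reward construction, assuming hypothetical `task_reward` and `trigger` interfaces and an arbitrary completion-bonus value:

```python
def make_option_reward(task_reward, trigger, completion_bonus=1.0):
    """Build oT's reward: the task reward R plus a bonus when T fires.

    `task_reward(s, a, s_next)` and `trigger(s_next) -> 0/1` are assumed
    interfaces; the bonus magnitude is an illustrative choice, not the paper's.
    """
    def option_reward(s, a, s_next):
        r = task_reward(s, a, s_next)
        if trigger(s_next):           # target event T reached
            r += completion_bonus     # option completion reward
        return r
    return option_reward
```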
42 Obtaining oT ’s initiation set is more difficult because it should consist of the states from which executing oT succeeds in triggering T . [sent-86, score-0.474]
43 We can treat this as a standard classification problem, using as positive training examples states in which oT has been executed and triggered T , and as negative training examples states in which it has been executed and failed to trigger T . [sent-87, score-0.34]
44 A classifier suited to a potentially non-stationary classification problem with continuous features can be used to learn oT ’s initiation set. [sent-88, score-0.386]
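One classifier that fits these requirements is an online logistic regression updated by stochastic gradient descent, which can track a drifting decision boundary over continuous features; the sketch below is an assumed implementation, not the one used in the paper.

```python
import numpy as np

class OnlineLogistic:
    """Online logistic regression for a possibly non-stationary binary problem."""

    def __init__(self, n_features: int, lr: float = 0.1):
        self.w = np.zeros(n_features + 1)   # weights plus a bias term
        self.lr = lr

    def _augment(self, x: np.ndarray) -> np.ndarray:
        return np.append(x, 1.0)            # constant feature for the bias

    def predict_proba(self, x: np.ndarray) -> float:
        z = self.w @ self._augment(x)
        return 1.0 / (1.0 + np.exp(-z))

    def update(self, x: np.ndarray, label: int) -> None:
        # One SGD step on the log-loss; recent examples dominate, so the
        # classifier can follow a drifting initiation set.
        err = label - self.predict_proba(x)
        self.w += self.lr * err * self._augment(x)

    def predict(self, x: np.ndarray) -> bool:
        return self.predict_proba(x) >= 0.5
```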
45 4.2 Creating Skill Chains Given an initial target event with trigger function T0, which for the purposes of this discussion we consider to be the indicator function of the task's goal region, the agent creates a chain of skills as follows. [sent-90, score-0.732]
46 First, the agent creates option oT0 to trigger T0, learns a good policy for this option, and obtains a good estimate, ÎT0, of its initiation set. [sent-91, score-1.256]
47 We then add event T1 = ÎT0 to the list of target events, so that when the agent first enters ÎT0, it creates a new option oT1 whose goal is to trigger T1. [sent-92, score-1.039]
48 That is, the new option's termination function is set to the indicator function ÎT0, and its reward function becomes the task's reward function plus an option completion reward for triggering T1. [sent-93, score-0.764]
49 Repeating this procedure results in a chain of skills leading from any state in which the agent may start to the task’s goal region as depicted in Figure 1. [sent-94, score-0.487]
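The chaining procedure can be summarized at the episode level by the schematic loop below; `env`, `agent`, and the two callbacks (`create_option`, `initiation_well_learned`) are assumed interfaces standing in for the components described above, not the paper's implementation.

```python
def skill_chaining(env, agent, initial_trigger, n_episodes,
                   create_option, initiation_well_learned):
    """Schematic chaining loop; all arguments are caller-supplied and only
    assumed to expose the interfaces used below."""
    target_events = [initial_trigger]   # starts with the end-of-task event
    options = []

    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            state, done = agent.step(env, state, options)

            # When a listed target event first fires from a state not already
            # covered by an option targeting it, create a new option whose
            # goal is to trigger that event.
            for trigger in list(target_events):
                covered = any(o.target is trigger and o.can_initiate(state)
                              for o in options)
                if trigger(state) and not covered:
                    options.append(create_option(trigger))

        # Once an option's initiation set is learned well enough, its
        # estimated initiation set becomes a new target event in the chain.
        for o in options:
            if initiation_well_learned(o) and o.initiation_event not in target_events:
                target_events.append(o.initiation_event)

    return options
```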
50 Figure 1: An agent creates options using skill chaining. [sent-98, score-1.114]
51 (a) First, the agent encounters a target event and creates an option to reach it. [sent-99, score-1.023]
52 (b) Entering the initiation set of this first option triggers the creation of a second option whose target is the initiation set of the first option. [sent-100, score-1.707]
53 (c) Finally, after many trajectories the agent has created a chain of options to reach the original target. [sent-101, score-0.77]
54 Note that although the options lie on a chain, the decision to execute each option is part of the agent’s overall learning problem. [sent-102, score-0.852]
55 Thus, they may not necessarily be executed sequentially; in particular, if an agent has learned a better policy for some parts of the chain, it may learn to skip some options. [sent-103, score-0.425]
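Concretely, top-level action selection can be as simple as epsilon-greedy over the primitive actions plus whichever options can initiate in the current state, so an agent whose flat policy is already good in some region simply stops selecting the option there. The sketch below assumes a scalar `q_values(state, action)` interface and option objects exposing `can_initiate`.

```python
import numpy as np

def select_action(state, q_values, primitive_actions, options, epsilon, rng):
    """Epsilon-greedy choice over primitives plus currently applicable options."""
    available = list(primitive_actions) + [o for o in options if o.can_initiate(state)]
    if rng.random() < epsilon:
        # Explore: pick uniformly among everything executable in this state.
        return available[rng.integers(len(available))]
    # Exploit: pick the primitive action or option with the highest value.
    scores = [q_values(state, a) for a in available]
    return available[int(np.argmax(scores))]
```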
56 More than one option may be created to reach a target event if that event remains on the target event list after the first option is created to reach it. [sent-106, score-1.549]
57 Each “child” option then creates its own chain, resulting in a skill tree, depicted in Figure 2. [sent-107, score-0.99]
58 Figure 2: (a) A skill chaining agent in an environment with multiple start states and two initial target events. [sent-114, score-1.068]
59 (b) When the agent initially encounters target events it creates options to trigger them. [sent-115, score-0.877]
60 (c) The initiation sets of these options then become target events, later triggering the creation of new options so that the agent eventually creates a skill tree covering all solution trajectories. [sent-116, score-1.991]
61 To control the branching factor of this tree, we need to place three further conditions on option creation. [sent-117, score-0.502]
62 First, we do not create a new option when a target event is triggered from a state already in the initiation set of an option targeting that event. [sent-118, score-1.554]
63 Second, we require that the initiation set of an option does not overlap that of its siblings or parents. [sent-119, score-0.789]
64 (Note that although these conditions seem computationally expensive, they can be implemented using at most one execution of each initiation set classifier per visited state—which is required for action selection anyway). [sent-120, score-0.383]
65 Finally, we may find it necessary to set a limit on the branching factor of the tree by removing a target event once it has some number of options targeting it. [sent-121, score-0.56]
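These gating conditions translate into a predicate checked when a target event fires. The sketch below is one possible reading, with an assumed cap on options per event, and with the non-overlap condition deferred to initiation-set learning.

```python
def should_create_option(state, event, options, max_children=3):
    """Gate new-option creation for `event` triggered from `state`.
    An illustrative reading of the three conditions; not the paper's code."""
    siblings = [o for o in options if o.target is event]

    # Condition 1: skip if the triggering state is already inside the
    # initiation set of an option targeting this event.
    if any(o.can_initiate(state) for o in siblings):
        return False

    # Condition 3: limit the branching factor by capping the number of
    # options per target event (the cap value here is an assumption).
    if len(siblings) >= max_children:
        return False

    # Condition 2 (no overlap with siblings or parents) is enforced while the
    # new option's initiation set is learned, e.g. by treating states inside
    # siblings' or parents' initiation sets as negative examples.
    return True
```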
66 Other possibilities include heuristic target events (e.g., physically meaningful events for a robot), or more general skill discovery techniques that can identify regions of interest before the goal is reached. [sent-126, score-0.637]
67 Collisions with obstacles are fully elastic and cause the ball to bounce, so rather than merely avoiding obstacles the agent may choose to use them to efficiently reach the hole. [sent-131, score-0.366]
68 Off-policy updates to an option for states outside its initiation set were ignored (because its policy does not need to be defined in those states), as were updates from unsuccessful on-policy trajectories (because their start states were then removed from the initiation set). [sent-144, score-1.39]
69 To initialize the option’s policy before attempting to learn its initiation set, a newly created option was first allowed a “gestation period” of 10 episodes where it could not be executed and its policy was updated using only off-policy learning. [sent-145, score-1.107]
70 After its gestation period, the option was added to the agent’s action repertoire. [sent-146, score-0.55]
71 For a new option o, this requires expanding the overall action-value function Q to include o and assigning appropriate initial values to Q(s, o). [sent-147, score-0.49]
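With linear function approximation, expanding Q can amount to allocating a fresh weight vector for the new option and initializing it to a chosen constant value; the scheme below is an assumption for illustration, not necessarily the initialization used in the paper.

```python
import numpy as np

class LinearQ:
    """Action-value function with one weight vector per primitive action or option."""

    def __init__(self, n_features: int):
        self.n_features = n_features
        self.weights = {}                  # maps action/option id -> weight vector

    def add_action(self, action_id, init_value: float = 0.0):
        # Allocate weights for a newly added option; setting only the bias
        # weight gives Q(s, o) = init_value everywhere initially.
        w = np.zeros(self.n_features)
        w[0] = init_value                  # assumes feature 0 is a constant feature
        self.weights[action_id] = w

    def value(self, features: np.ndarray, action_id) -> float:
        return float(self.weights[action_id] @ features)
```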
72 Each option’s initiation set was learned by a logistic regression classifier, initialized to be true everywhere, using 2nd order polynomial features, learning rate η = 0. [sent-153, score-0.37]
73 When the agent executed the option, states on trajectories that reached its goal within 250 steps were used as positive examples, and the start states of trajectories that did not were used as negative examples. [sent-155, score-0.562]
74 We considered an option’s initiation set learned well enough to be added to the list of target events when its weights changed on average less than 0. [sent-156, score-0.53]
75 Since the Pinball domain has such strong discontinuities, to avoid over-generalization after this learning period we additionally constrained the initiation set to contain only points within a Euclidean distance of 0. [sent-158, score-0.362]
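A sketch of how the learned classifier and the distance constraint might be combined into one membership test: a state is in the initiation set only if the classifier accepts it and it lies within a fixed Euclidean distance of some previously observed positive example. The featurizer, the stored positives, and the radius are assumed placeholders.

```python
import numpy as np

def in_initiation_set(state, featurize, classifier, positive_states, radius):
    """Membership test combining the learned classifier with a locality constraint."""
    if not classifier.predict(featurize(state)):
        return False
    if len(positive_states) == 0:
        return False
    # Require the state to be close (in Euclidean distance) to at least one
    # state from which the option actually succeeded, to avoid
    # over-generalization across the domain's discontinuities.
    dists = np.linalg.norm(np.asarray(positive_states) - np.asarray(state), axis=1)
    return bool(np.min(dists) <= radius)
```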
76 (b) A good example solution to the first Pinball instance, showing the acquired options executed along the sample trajectory in different colors. [sent-163, score-0.483]
77 Figure 4(a) shows that the skill chaining agents performed significantly better than flat agents by 50 episodes, and went on to obtain consistently good solutions by 250 episodes, whereas the flat agents did much worse and were less consistent. [sent-165, score-1.023]
78 Agents that started with given options did very well initially—with an initial episode return far greater than the average solution eventually learned by agents without options—and proceeded quickly to the same quality of solution as the agents that discovered their options. [sent-166, score-0.69]
79 This shows that the options themselves, and not the process of acquiring them, were responsible for the increase in performance. [sent-167, score-0.341]
80 Figure 4(b) shows a sample solution trajectory from an agent performing skill chaining in the first Pinball instance, with the options executed shown in different colors. [sent-168, score-1.382]
81 The figure illustrates that this agent discovered options corresponding to simple, efficient policies covering segments of the sample trajectory. [sent-169, score-0.656]
82 It also illustrates that in some places (in this case, the beginning of the trajectory) the agent learned to bypass a learned option—the black portions of the trajectory show where the agent employed primitive actions rather than a learned option. [sent-170, score-0.718]
83 In this particular case, the presence of other options freed the overall policy (using a more complex function approximator) to represent the remaining trajectory segment better than could an option (with its less complex function approximator). [sent-172, score-0.957]
84 Figure 5 shows the initiation sets and three sample trajectories from the options used in the trajectory shown in Figure 4(b). [sent-173, score-0.785]
85 These learned initiation sets show that the discovered option policies are only locally valid, even though they are represented using Fourier basis functions, which have global support. [sent-174, score-0.926]
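For reference, the Fourier basis maps a state scaled to [0, 1]^d onto cosine features indexed by integer coefficient vectors, each with global support; a minimal construction is sketched below (the order and state scaling are illustrative, not the paper's exact settings).

```python
import itertools
import numpy as np

def fourier_basis(order: int, dim: int):
    """Return a featurizer for the order-`order` Fourier basis on [0, 1]^dim.

    Each feature is cos(pi * c . x) for an integer coefficient vector c with
    entries in {0, ..., order}; every such cosine has global support, so any
    locality in the learned options comes from their initiation sets rather
    than from the basis itself.
    """
    coeffs = np.array(list(itertools.product(range(order + 1), repeat=dim)))

    def featurize(x: np.ndarray) -> np.ndarray:
        x = np.asarray(x, dtype=float)   # assumed already scaled to [0, 1]^dim
        return np.cos(np.pi * coeffs @ x)

    return featurize

# Example: a 4-dimensional Pinball state (x, y, x_dot, y_dot) with order 3
# yields (3 + 1) ** 4 = 256 features.
phi = fourier_basis(order=3, dim=4)
features = phi(np.array([0.2, 0.8, 0.5, 0.5]))
```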
86 6 Figure 5: Initiation sets and sample policy trajectories for the options used in Figure 4(b). [sent-175, score-0.482]
87 Figures 6, 7 and 8 show similar results for the second Pinball instance, although Figure 6 shows a slight and transient initial penalty for skill chaining agents, before they go on to obtain far better and more consistent solutions than flat agents. [sent-177, score-0.68]
88 The example trajectory in Figure 7 and initiation sets in Figure 8 show portions of a successfully formed skill tree. [sent-178, score-0.859]
89 Figure 6: Performance (return versus episodes) in the second Pinball instance (averaged over 100 runs) for agents employing skill chaining, agents with given options, and agents without options. [sent-179, score-0.828]
90 Figure 7: Good solutions to the second Pinball experimental domain, showing the acquired options executed along the sample trajectory in different colors. [sent-180, score-0.463]
91 Figure 8: Initiation sets and sample trajectories for the options used in Figure 7. [sent-182, score-0.394]
92 We expect that methods that identify regions likely to lie on the solution trajectory before a solution is found will result in the kinds of early performance gains sometimes seen in discrete skill discovery methods (e. [sent-185, score-0.669]
93 The primary benefit of skill chaining is that it reduces the burden of representing the task’s value function, allowing each option to focus on representing its own local value function and thereby achieving a better overall solution. [sent-188, score-1.136]
94 This implies that skill acquisition is best suited to high-dimensional problems where a single value function cannot be well represented using a feasible number of basis functions in reasonable time. [sent-189, score-0.512]
95 In tasks where a good solution can be well represented using a low-order function approximator, we do not expect to see any benefits when using skill chaining. [sent-190, score-0.507]
96 We expect that such methods will prove most effective for extended control problems when combined with skill acquisition, where they can tailor a separate representation for each option rather than for the entire problem. [sent-192, score-0.945]
97 In this paper we used “lightweight” function approximators to represent option value functions. [sent-193, score-0.453]
98 One such approach is abstraction selection [25, 26], where an agent uses sample trajectories (as obtained during gestation) to select an appropriate abstraction for a new option from a library of candidate abstractions, potentially resulting in a much easier learning problem. [sent-195, score-0.844]
99 We conjecture that the ability to discover new skills, and for each skill to employ its own abstraction, will prove a key advantage of hierarchical reinforcement learning as we try to scale up to extended control problems in high-dimensional spaces. [sent-196, score-0.602]
100 Sensorimotor abstraction selection for efficient, autonomous robot skill acquisition. [sent-352, score-0.541]
wordName wordTfidf (topN-words)
[('skill', 0.468), ('option', 0.453), ('options', 0.341), ('initiation', 0.336), ('agent', 0.236), ('chaining', 0.195), ('pinball', 0.152), ('skills', 0.124), ('agents', 0.12), ('reinforcement', 0.11), ('ot', 0.093), ('reach', 0.09), ('policy', 0.088), ('target', 0.081), ('event', 0.077), ('trigger', 0.074), ('domains', 0.073), ('reward', 0.069), ('creates', 0.069), ('executed', 0.067), ('konidaris', 0.062), ('simsek', 0.062), ('policies', 0.061), ('primitive', 0.06), ('events', 0.059), ('creating', 0.057), ('state', 0.055), ('discovery', 0.055), ('trajectory', 0.055), ('amherst', 0.054), ('create', 0.054), ('triggering', 0.054), ('states', 0.053), ('trajectories', 0.053), ('abstraction', 0.051), ('episodes', 0.05), ('termination', 0.05), ('gestation', 0.05), ('subgoals', 0.05), ('continuous', 0.05), ('barto', 0.048), ('action', 0.047), ('rl', 0.044), ('execute', 0.038), ('approximator', 0.037), ('controllers', 0.037), ('learned', 0.034), ('executing', 0.031), ('actions', 0.029), ('goal', 0.029), ('fourier', 0.029), ('creation', 0.028), ('triggered', 0.026), ('domain', 0.026), ('robotics', 0.026), ('massachusetts', 0.026), ('regions', 0.026), ('chain', 0.025), ('discrete', 0.025), ('branching', 0.025), ('menache', 0.025), ('mugan', 0.025), ('subtasks', 0.025), ('created', 0.025), ('control', 0.024), ('chains', 0.024), ('basis', 0.024), ('intrinsically', 0.023), ('factored', 0.022), ('robot', 0.022), ('portable', 0.022), ('sequencing', 0.022), ('lightweight', 0.022), ('autonomously', 0.022), ('motion', 0.021), ('overall', 0.02), ('task', 0.02), ('list', 0.02), ('mdp', 0.02), ('solution', 0.02), ('triggers', 0.02), ('abstractions', 0.02), ('discontinuities', 0.02), ('obstacles', 0.02), ('proceedings', 0.02), ('acquisition', 0.02), ('tasks', 0.019), ('novelty', 0.019), ('targeting', 0.019), ('neumann', 0.019), ('discovered', 0.018), ('start', 0.018), ('neighborhood', 0.018), ('international', 0.018), ('mannor', 0.018), ('bene', 0.018), ('tree', 0.017), ('planning', 0.017), ('encounters', 0.017), ('initial', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 218 nips-2009-Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining
Author: George Konidaris, Andre S. Barreto
Abstract: We introduce a skill discovery method for reinforcement learning in continuous domains that constructs chains of skills leading to an end-of-task reward. We demonstrate experimentally that it creates appropriate skills and achieves performance benefits in a challenging continuous domain. 1
2 0.21580371 53 nips-2009-Complexity of Decentralized Control: Special Cases
Author: Martin Allen, Shlomo Zilberstein
Abstract: The worst-case complexity of general decentralized POMDPs, which are equivalent to partially observable stochastic games (POSGs) is very high, both for the cooperative and competitive cases. Some reductions in complexity have been achieved by exploiting independence relations in some models. We show that these results are somewhat limited: when these independence assumptions are relaxed in very small ways, complexity returns to that of the general case. 1
3 0.2037524 242 nips-2009-The Infinite Partially Observable Markov Decision Process
Author: Finale Doshi-velez
Abstract: The Partially Observable Markov Decision Process (POMDP) framework has proven useful in planning domains where agents must balance actions that provide knowledge and actions that provide reward. Unfortunately, most POMDPs are complex structures with a large number of parameters. In many real-world problems, both the structure and the parameters are difficult to specify from domain knowledge alone. Recent work in Bayesian reinforcement learning has made headway in learning POMDP models; however, this work has largely focused on learning the parameters of the POMDP model. We define an infinite POMDP (iPOMDP) model that does not require knowledge of the size of the state space; instead, it assumes that the number of visited states will grow as the agent explores its world and only models visited states explicitly. We demonstrate the iPOMDP on several standard problems. 1
4 0.17761028 107 nips-2009-Help or Hinder: Bayesian Models of Social Goal Inference
Author: Tomer Ullman, Chris Baker, Owen Macindoe, Owain Evans, Noah Goodman, Joshua B. Tenenbaum
Abstract: Everyday social interactions are heavily influenced by our snap judgments about others’ goals. Even young infants can infer the goals of intentional agents from observing how they interact with objects and other agents in their environment: e.g., that one agent is ‘helping’ or ‘hindering’ another’s attempt to get up a hill or open a box. We propose a model for how people can infer these social goals from actions, based on inverse planning in multiagent Markov decision problems (MDPs). The model infers the goal most likely to be driving an agent’s behavior by assuming the agent acts approximately rationally given environmental constraints and its model of other agents present. We also present behavioral evidence in support of this model over a simpler, perceptual cue-based alternative. 1
5 0.13411351 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference
Author: Khashayar Rohanimanesh, Sameer Singh, Andrew McCallum, Michael J. Black
Abstract: Large, relational factor graphs with structure defined by first-order logic or other languages give rise to notoriously difficult inference problems. Because unrolling the structure necessary to represent distributions over all hypotheses has exponential blow-up, solutions are often derived from MCMC. However, because of limitations in the design and parameterization of the jump function, these samplingbased methods suffer from local minima—the system must transition through lower-scoring configurations before arriving at a better MAP solution. This paper presents a new method of explicitly selecting fruitful downward jumps by leveraging reinforcement learning (RL). Rather than setting parameters to maximize the likelihood of the training data, parameters of the factor graph are treated as a log-linear function approximator and learned with methods of temporal difference (TD); MAP inference is performed by executing the resulting policy on held out test data. Our method allows efficient gradient updates since only factors in the neighborhood of variables affected by an action need to be computed—we bypass the need to compute marginals entirely. Our method yields dramatic empirical success, producing new state-of-the-art results on a complex joint model of ontology alignment, with a 48% reduction in error over state-of-the-art in that domain. 1
6 0.097841553 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability
7 0.095473662 113 nips-2009-Improving Existing Fault Recovery Policies
8 0.088983938 134 nips-2009-Learning to Explore and Exploit in POMDPs
9 0.06513118 221 nips-2009-Solving Stochastic Games
10 0.064636603 52 nips-2009-Code-specific policy gradient rules for spiking neurons
11 0.058421358 150 nips-2009-Maximum likelihood trajectories for continuous-time Markov chains
12 0.05829775 12 nips-2009-A Generalized Natural Actor-Critic Algorithm
13 0.05786838 209 nips-2009-Robust Value Function Approximation Using Bilinear Programming
14 0.053582814 60 nips-2009-Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation
15 0.052238014 171 nips-2009-Nonparametric Bayesian Models for Unsupervised Event Coreference Resolution
16 0.034041695 115 nips-2009-Individuation, Identification and Object Discovery
17 0.033404786 54 nips-2009-Compositionality of optimal control laws
18 0.032941759 159 nips-2009-Multi-Step Dyna Planning for Policy Evaluation and Control
19 0.032572042 14 nips-2009-A Parameter-free Hedging Algorithm
20 0.032482173 137 nips-2009-Learning transport operators for image manifolds
topicId topicWeight
[(0, -0.106), (1, -0.005), (2, 0.167), (3, -0.212), (4, -0.259), (5, 0.014), (6, 0.032), (7, -0.004), (8, -0.011), (9, 0.024), (10, 0.128), (11, 0.097), (12, 0.066), (13, 0.015), (14, 0.04), (15, 0.028), (16, -0.033), (17, 0.052), (18, 0.033), (19, 0.024), (20, -0.021), (21, 0.043), (22, -0.049), (23, -0.017), (24, 0.09), (25, -0.001), (26, 0.046), (27, -0.032), (28, 0.013), (29, -0.018), (30, 0.03), (31, -0.011), (32, -0.01), (33, 0.013), (34, 0.105), (35, -0.052), (36, -0.037), (37, 0.019), (38, 0.021), (39, -0.018), (40, -0.049), (41, 0.002), (42, -0.025), (43, 0.007), (44, 0.044), (45, 0.016), (46, 0.038), (47, 0.036), (48, -0.016), (49, -0.056)]
simIndex simValue paperId paperTitle
same-paper 1 0.96531594 218 nips-2009-Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining
Author: George Konidaris, Andre S. Barreto
Abstract: We introduce a skill discovery method for reinforcement learning in continuous domains that constructs chains of skills leading to an end-of-task reward. We demonstrate experimentally that it creates appropriate skills and achieves performance benefits in a challenging continuous domain. 1
2 0.89108366 53 nips-2009-Complexity of Decentralized Control: Special Cases
Author: Martin Allen, Shlomo Zilberstein
Abstract: The worst-case complexity of general decentralized POMDPs, which are equivalent to partially observable stochastic games (POSGs) is very high, both for the cooperative and competitive cases. Some reductions in complexity have been achieved by exploiting independence relations in some models. We show that these results are somewhat limited: when these independence assumptions are relaxed in very small ways, complexity returns to that of the general case. 1
3 0.86876523 242 nips-2009-The Infinite Partially Observable Markov Decision Process
Author: Finale Doshi-velez
Abstract: The Partially Observable Markov Decision Process (POMDP) framework has proven useful in planning domains where agents must balance actions that provide knowledge and actions that provide reward. Unfortunately, most POMDPs are complex structures with a large number of parameters. In many real-world problems, both the structure and the parameters are difficult to specify from domain knowledge alone. Recent work in Bayesian reinforcement learning has made headway in learning POMDP models; however, this work has largely focused on learning the parameters of the POMDP model. We define an infinite POMDP (iPOMDP) model that does not require knowledge of the size of the state space; instead, it assumes that the number of visited states will grow as the agent explores its world and only models visited states explicitly. We demonstrate the iPOMDP on several standard problems. 1
4 0.85096496 107 nips-2009-Help or Hinder: Bayesian Models of Social Goal Inference
Author: Tomer Ullman, Chris Baker, Owen Macindoe, Owain Evans, Noah Goodman, Joshua B. Tenenbaum
Abstract: Everyday social interactions are heavily influenced by our snap judgments about others’ goals. Even young infants can infer the goals of intentional agents from observing how they interact with objects and other agents in their environment: e.g., that one agent is ‘helping’ or ‘hindering’ another’s attempt to get up a hill or open a box. We propose a model for how people can infer these social goals from actions, based on inverse planning in multiagent Markov decision problems (MDPs). The model infers the goal most likely to be driving an agent’s behavior by assuming the agent acts approximately rationally given environmental constraints and its model of other agents present. We also present behavioral evidence in support of this model over a simpler, perceptual cue-based alternative. 1
5 0.749035 134 nips-2009-Learning to Explore and Exploit in POMDPs
Author: Chenghui Cai, Xuejun Liao, Lawrence Carin
Abstract: A fundamental objective in reinforcement learning is the maintenance of a proper balance between exploration and exploitation. This problem becomes more challenging when the agent can only partially observe the states of its environment. In this paper we propose a dual-policy method for jointly learning the agent behavior and the balance between exploration and exploitation, in partially observable environments. The method subsumes traditional exploration, in which the agent takes actions to gather information about the environment, and active learning, in which the agent queries an oracle for optimal actions (with an associated cost for employing the oracle). The form of the employed exploration is dictated by the specific problem. Theoretical guarantees are provided concerning the optimality of the balancing of exploration and exploitation. The effectiveness of the method is demonstrated by experimental results on benchmark problems.
6 0.54806632 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference
7 0.52940923 113 nips-2009-Improving Existing Fault Recovery Policies
8 0.50385779 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability
9 0.45144424 159 nips-2009-Multi-Step Dyna Planning for Policy Evaluation and Control
10 0.40952557 150 nips-2009-Maximum likelihood trajectories for continuous-time Markov chains
11 0.39012897 12 nips-2009-A Generalized Natural Actor-Critic Algorithm
12 0.33550757 221 nips-2009-Solving Stochastic Games
13 0.32676035 52 nips-2009-Code-specific policy gradient rules for spiking neurons
14 0.31253067 171 nips-2009-Nonparametric Bayesian Models for Unsupervised Event Coreference Resolution
15 0.29001242 60 nips-2009-Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation
16 0.28894469 39 nips-2009-Bayesian Belief Polarization
17 0.26511651 69 nips-2009-Discrete MDL Predicts in Total Variation
18 0.25405046 14 nips-2009-A Parameter-free Hedging Algorithm
19 0.24402499 209 nips-2009-Robust Value Function Approximation Using Bilinear Programming
20 0.24096373 89 nips-2009-FACTORIE: Probabilistic Programming via Imperatively Defined Factor Graphs
topicId topicWeight
[(24, 0.041), (25, 0.05), (30, 0.269), (35, 0.058), (36, 0.07), (39, 0.05), (51, 0.01), (55, 0.01), (57, 0.015), (58, 0.044), (61, 0.095), (71, 0.11), (81, 0.013), (82, 0.011), (86, 0.059)]
simIndex simValue paperId paperTitle
same-paper 1 0.82314181 218 nips-2009-Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining
Author: George Konidaris, Andre S. Barreto
Abstract: We introduce a skill discovery method for reinforcement learning in continuous domains that constructs chains of skills leading to an end-of-task reward. We demonstrate experimentally that it creates appropriate skills and achieves performance benefits in a challenging continuous domain. 1
2 0.57953203 242 nips-2009-The Infinite Partially Observable Markov Decision Process
Author: Finale Doshi-velez
Abstract: The Partially Observable Markov Decision Process (POMDP) framework has proven useful in planning domains where agents must balance actions that provide knowledge and actions that provide reward. Unfortunately, most POMDPs are complex structures with a large number of parameters. In many real-world problems, both the structure and the parameters are difficult to specify from domain knowledge alone. Recent work in Bayesian reinforcement learning has made headway in learning POMDP models; however, this work has largely focused on learning the parameters of the POMDP model. We define an infinite POMDP (iPOMDP) model that does not require knowledge of the size of the state space; instead, it assumes that the number of visited states will grow as the agent explores its world and only models visited states explicitly. We demonstrate the iPOMDP on several standard problems. 1
3 0.57316929 169 nips-2009-Nonlinear Learning using Local Coordinate Coding
Author: Kai Yu, Tong Zhang, Yihong Gong
Abstract: This paper introduces a new method for semi-supervised learning on high dimensional nonlinear manifolds, which includes a phase of unsupervised basis learning and a phase of supervised function learning. The learned bases provide a set of anchor points to form a local coordinate system, such that each data point x on the manifold can be locally approximated by a linear combination of its nearby anchor points, and the linear weights become its local coordinate coding. We show that a high dimensional nonlinear function can be approximated by a global linear function with respect to this coding scheme, and the approximation quality is ensured by the locality of such coding. The method turns a difficult nonlinear learning problem into a simple global linear learning problem, which overcomes some drawbacks of traditional local learning methods. 1
4 0.56207162 66 nips-2009-Differential Use of Implicit Negative Evidence in Generative and Discriminative Language Learning
Author: Anne Hsu, Thomas L. Griffiths
Abstract: A classic debate in cognitive science revolves around understanding how children learn complex linguistic rules, such as those governing restrictions on verb alternations, without negative evidence. Traditionally, formal learnability arguments have been used to claim that such learning is impossible without the aid of innate language-specific knowledge. However, recently, researchers have shown that statistical models are capable of learning complex rules from only positive evidence. These two kinds of learnability analyses differ in their assumptions about the distribution from which linguistic input is generated. The former analyses assume that learners seek to identify grammatical sentences in a way that is robust to the distribution from which the sentences are generated, analogous to discriminative approaches in machine learning. The latter assume that learners are trying to estimate a generative model, with sentences being sampled from that model. We show that these two learning approaches differ in their use of implicit negative evidence – the absence of a sentence – when learning verb alternations, and demonstrate that human learners can produce results consistent with the predictions of both approaches, depending on how the learning problem is presented. 1
5 0.5462082 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference
Author: Khashayar Rohanimanesh, Sameer Singh, Andrew McCallum, Michael J. Black
Abstract: Large, relational factor graphs with structure defined by first-order logic or other languages give rise to notoriously difficult inference problems. Because unrolling the structure necessary to represent distributions over all hypotheses has exponential blow-up, solutions are often derived from MCMC. However, because of limitations in the design and parameterization of the jump function, these samplingbased methods suffer from local minima—the system must transition through lower-scoring configurations before arriving at a better MAP solution. This paper presents a new method of explicitly selecting fruitful downward jumps by leveraging reinforcement learning (RL). Rather than setting parameters to maximize the likelihood of the training data, parameters of the factor graph are treated as a log-linear function approximator and learned with methods of temporal difference (TD); MAP inference is performed by executing the resulting policy on held out test data. Our method allows efficient gradient updates since only factors in the neighborhood of variables affected by an action need to be computed—we bypass the need to compute marginals entirely. Our method yields dramatic empirical success, producing new state-of-the-art results on a complex joint model of ontology alignment, with a 48% reduction in error over state-of-the-art in that domain. 1
6 0.53964537 60 nips-2009-Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation
7 0.52918118 113 nips-2009-Improving Existing Fault Recovery Policies
8 0.52898997 150 nips-2009-Maximum likelihood trajectories for continuous-time Markov chains
9 0.52719241 56 nips-2009-Conditional Neural Fields
10 0.52699697 107 nips-2009-Help or Hinder: Bayesian Models of Social Goal Inference
11 0.52625394 215 nips-2009-Sensitivity analysis in HMMs with application to likelihood maximization
12 0.52052069 130 nips-2009-Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization
13 0.51733363 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs
14 0.5169785 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties
15 0.51691782 205 nips-2009-Rethinking LDA: Why Priors Matter
16 0.51656377 204 nips-2009-Replicated Softmax: an Undirected Topic Model
17 0.51632392 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability
18 0.51446134 159 nips-2009-Multi-Step Dyna Planning for Policy Evaluation and Control
19 0.51272941 134 nips-2009-Learning to Explore and Exploit in POMDPs
20 0.51122302 53 nips-2009-Complexity of Decentralized Control: Special Cases