nips nips2010 nips2010-184 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Finale Doshi-velez, David Wingate, Nicholas Roy, Joshua B. Tenenbaum
Abstract: We consider reinforcement learning in partially observable domains where the agent can query an expert for demonstrations. Our nonparametric Bayesian approach combines model knowledge, inferred from expert information and independent exploration, with policy knowledge inferred from expert trajectories. We introduce priors that bias the agent towards models with both simple representations and simple policies, resulting in improved policy and model learning. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: We consider reinforcement learning in partially observable domains where the agent can query an expert for demonstrations. [sent-3, score-0.835]
2 Our nonparametric Bayesian approach combines model knowledge, inferred from expert information and independent exploration, with policy knowledge inferred from expert trajectories. [sent-4, score-1.309]
3 We introduce priors that bias the agent towards models with both simple representations and simple policies, resulting in improved policy and model learning. [sent-5, score-0.834]
4 1 Introduction. We address the reinforcement learning (RL) problem of finding a good policy in an unknown, stochastic, and partially observable domain, given both data from independent exploration and expert demonstrations. [sent-6, score-1.145]
5 In contrast, imitation and inverse reinforcement learning [5, 6] use expert trajectories to learn reward models. [sent-9, score-0.671]
6 We consider cases where we have data from both independent exploration and expert trajectories. [sent-11, score-0.463]
7 Data from independent exploration gives direct information about the dynamics, while expert demonstrations show the outputs of good policies and thus provide indirect information about the underlying model. [sent-12, score-0.677]
8 Because dynamics and policies are linked through a complex, nonlinear function, leveraging information about both these aspects at once is challenging. [sent-14, score-0.275]
9 In previous work [7, 8, 9, 10], the model prior p(M ) was defined as a distribution directly on the dynamics and rewards models, making it difficult to incorporate expert trajectories. [sent-17, score-0.678]
10 Our main contribution is a new approach to defining this prior: our prior uses the assumption that the expert knew something about the world model when computing his optimal policy. [sent-18, score-0.682]
11 In this domain, one of our key technical contributions is the insight that the Bayesian approach used for building models of transition dynamics can also be used as policy priors, if we exchange the typical role of actions and observations. [sent-21, score-0.671]
12 For example, algorithms for learning partially observable Markov decision processes (POMDPs) build models that output observations and take in actions as exogenous variables. [sent-22, score-0.361]
13 By using nonparametric priors [12], our agent can scale the sophistication of its policies and world models based on the data. [sent-24, score-0.783]
14 First, our choices for the policy prior and the world model prior can be viewed as a joint prior that introduces a bias toward world models which are both simple and easy to control. [sent-26, score-1.164]
15 This bias is especially beneficial in the case of direct policy search, where it is easier to search directly for good controllers than it is to first construct a complete POMDP model and then plan with it. [sent-27, score-0.677]
16 Our method can also be used with approximately optimal expert data; in these cases the expert data bias which models are likely rather than imposing hard constraints on the model. [sent-28, score-0.895]
17 The state transition function T (s′ |s, a) defines the distribution over next-states s′ to which the agent may transition after taking action a from state s. [sent-33, score-0.423]
18 Bayesian nonparametric approaches are well-suited for partially observable environments because they can also infer the dimensionality of the underlying state space. [sent-40, score-0.294]
19 Other approaches [15, 16, 9] try to balance the off-line computation of a good policy (the computational complexity) and the cost of getting data online (the sample complexity). [sent-45, score-0.471]
20 Finite State Controllers. Another possibility for choosing actions, including in our partially observable reinforcement learning setting, is to consider a parametric family of policies and attempt to estimate the optimal policy parameters from data. [sent-46, score-0.533]
21 This is the approach underlying, for example, much work on policy gradients. [sent-47, score-0.441]
22 The node transition function β(n′ |n, o) defines the distribution over next-nodes n′ to which the controller may transition after receiving observation o in node n. [sent-51, score-0.341]
23 The policy function π(a|n) is a distribution over actions that the finite state controller may output in node n. [sent-52, score-0.716]
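To make the finite state controller definition above concrete, here is a minimal Python sketch (a hedged illustration; the class name, array layout, and example numbers are all assumptions, not from the paper): the controller emits actions from π(a|n) and moves between nodes with β(n′|n, o) after each observation.

    import numpy as np

    class FiniteStateController:
        """Stochastic finite state controller: pi(a|n) emits actions,
        beta(n'|n, o) selects the next node after each observation."""

        def __init__(self, pi, beta, start_node=0):
            self.pi = pi        # pi[n, a]: P(action a | node n)
            self.beta = beta    # beta[n, o, m]: P(next node m | node n, obs o)
            self.node = start_node

        def act(self, rng):
            # Draw an action from the current node's action distribution.
            return rng.choice(self.pi.shape[1], p=self.pi[self.node])

        def update(self, obs, rng):
            # Move to a new node given the observation just received.
            self.node = rng.choice(self.beta.shape[2], p=self.beta[self.node, obs])

    # Example: 2 nodes, 3 actions, 2 observations.
    rng = np.random.default_rng(0)
    pi = np.array([[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]])
    beta = np.array([[[0.9, 0.1], [0.2, 0.8]],
                     [[0.5, 0.5], [0.1, 0.9]]])
    fsc = FiniteStateController(pi, beta)
    a = fsc.act(rng)
    fsc.update(obs=1, rng=rng)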
24 3 Nonparametric Bayesian Policy Priors. We now describe our framework for combining world models and expert data. [sent-54, score-0.626]
25 Recall that our key assumption is that the expert used knowledge about the underlying world to derive his policy. [sent-55, score-0.586]
26 Figure 1: Two graphical models of expert data generation. [sent-57, score-0.471]
27 Left: the prior only addresses world dynamics and rewards. [sent-58, score-0.323]
28 Right: the prior addresses both world dynamics and controllable policies. [sent-59, score-0.353]
29 Combined with the world model M , the expert’s policy πe and agent’s policy πa produce the expert’s and agent’s data De and Da . [sent-62, score-1.06]
30 The data consist of a sequence of histories, where a history ht is a sequence of actions a1, ..., at, observations o1, ..., ot, and rewards r1, ..., rt. [sent-63, score-0.267]
31 The agent has access to all histories, but the true world model and optimal policy are hidden. [sent-64, score-0.779]
32 Both graphical models assume that a particular world M is sampled from a prior over POMDPs, gM (M ). [sent-65, score-0.366]
33 In what would be the standard application of Bayesian RL with expert data (Fig. 1(a)), the prior gM (M ) fully encapsulates our initial belief over world models. [sent-66, score-0.408] [sent-67, score-0.297]
35 An expert, who knows the true world model M , executes a planning algorithm plan(M ) to construct an optimal policy πe . [sent-68, score-0.716]
36 The expert then executes the policy to generate expert data De , distributed according to p(De |M, πe ), where πe = plan(M ). [sent-69, score-1.281]
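A minimal sketch of this generative story (Fig. 1), with sample_world_prior, plan, and simulate as placeholder names standing in for gM, the expert's planner, and an environment rollout (an illustration, not the authors' code):

    def generate_expert_data(sample_world_prior, plan, simulate, num_episodes):
        """Generative story of Fig. 1: draw a world M ~ g_M(M), let the expert
        compute pi_e = plan(M), then roll pi_e out in M to produce histories D_e."""
        M = sample_world_prior()                                 # M ~ g_M(M)
        pi_e = plan(M)                                           # expert's (near-)optimal policy
        D_e = [simulate(M, pi_e) for _ in range(num_episodes)]   # D_e ~ p(. | M, pi_e)
        return M, pi_e, D_e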
37 The graphical model of Fig. 1(a) does not easily allow us to encode a prior bias toward more controllable world models. [sent-71, score-0.39]
38 In particular, suppose we choose a distribution of the form p(πe |M ) ∝ fM (πe ) gπ (πe ) (1), where we interpret gπ (πe ) as a prior over policies and fM (πe ) as a likelihood of a policy given a model. [sent-74, score-0.763]
39 For example, if the policy class is the set of finite state controllers as discussed in Sec. 2, the policy prior gπ (πe ) might encode preferences for a smaller number of nodes used by the policy, while gM (M ) might encode preferences for a smaller number of visited states in the world. [sent-77, score-0.555] [sent-78, score-0.802]
41 The function fM (πe ) can also be made more general to encode how likely it is that the expert uses the policy πe given world model M . [sent-79, score-1.074]
42 Similarly, the conditional distribution over policies given data De and Da is p(πe |De , Da ) ∝ gπ (πe ) p(De |πe ) Σ_M fM (πe ) p(De , Da |M ) gM (M ). We next describe three inference approaches for using these equations. [sent-83, score-0.3]
43 If fM (πe ) = δ(plan(M )) and we believe that all policies are equally likely (graphical model 1(a)), then we can leverage the expert’s data simply by considering how well the policy plan(M ) of a particular world model M matches the expert’s actions. [sent-86, score-1.151]
44 The uniform policy prior implied by standard Bayesian RL does not allow us to encode prior biases about the policy. [sent-94, score-0.68]
45 Eq. 6 becomes E[q(a)] = Σ_M q(a|M ) p(De , Da |M ) gM (M ) gπ (plan(M )) p(De |plan(M )) (7), where we still assume that the expert uses an optimal policy, that is, fM (πe ) = δ(plan(M )). [sent-97, score-0.408]
46 It also assumes that the expert used the optimal policy, whereas a more realistic assumption might be that the expert uses a near-optimal policy. [sent-100, score-0.816]
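One way to read the expression above is as a self-normalized importance-sampling estimate when worlds are drawn from the prior gM (so the gM term cancels). The sketch below is an illustration under that reading; plan, q, and the log-likelihood/prior callables are placeholders, not the authors' code:

    import numpy as np

    def expected_action_probs(worlds, plan, q, log_data_lik,
                              log_policy_prior, log_expert_lik):
        """Approximate E[q(a)] with worlds M_i ~ g_M: weight each world by
        p(De, Da | M) * g_pi(plan(M)) * p(De | plan(M)), then average the
        planner's action distributions q(a | M_i)."""
        log_w, q_list = [], []
        for M in worlds:
            pi_e = plan(M)
            log_w.append(log_data_lik(M)            # log p(De, Da | M)
                         + log_policy_prior(pi_e)   # log g_pi(plan(M))
                         + log_expert_lik(pi_e))    # log p(De | plan(M))
            q_list.append(q(M))                     # q(a | M), a vector over actions
        log_w = np.array(log_w)
        w = np.exp(log_w - log_w.max())
        w /= w.sum()                                # self-normalized weights
        return np.average(np.array(q_list), axis=0, weights=w)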
47 While the model-based inference for policy priors is correct, using importance weights often suffers when the proposal distribution is not near the true posterior. [sent-105, score-0.612]
48 In particular, sampling world models and policies—both very high dimensional objects—from distributions that ignore large parts of the evidence means that large numbers of samples may be needed to get accurate estimates. [sent-106, score-0.283]
49 We now describe an inference approach that alternates sampling models and policies; it both avoids importance sampling and can be applied more broadly. We omit the belief over world states b(s) from the equations that follow for clarity; all references to q(a|M ) are q(a|bM (s), M ). [sent-107, score-0.638]
50 The inference proceeds in two alternating stages: first, we sample a new policy given the current sampled model; then, we sample a new model given the sampled policy. [sent-110, score-0.514]
51 For this, we use the iPOMDP [12] as a conjugate prior over policies encoded as finite state controllers. [sent-114, score-0.385]
52 While applying MH can suffer from the same issues as the importance sampling in the model-based approach, Gibbs sampling new policies removes one set of proposal distributions from the inference, resulting in better estimates with fewer samples. [sent-123, score-0.319]
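A skeleton of such an alternating sampler might look as follows; the policy Gibbs step (e.g. a beam sampler under the iPOMDP policy prior), the world proposal, and the joint log-density are supplied as placeholder callables, and a symmetric world proposal is assumed for the simple accept ratio:

    import math

    def alternating_sampler(init_world, sample_policy_given, propose_world,
                            log_joint, num_sweeps, rng):
        """Alternate (1) Gibbs-sampling a policy given the current world and the
        expert data, and (2) a Metropolis-Hastings update of the world given the
        data and the current policy."""
        M = init_world
        samples = []
        for _ in range(num_sweeps):
            pi = sample_policy_given(M, rng)     # Gibbs step: pi ~ p(pi | M, De)
            M_new = propose_world(M, rng)        # MH proposal for the world
            log_a = log_joint(M_new, pi) - log_joint(M, pi)
            if math.log(rng.random() + 1e-300) < log_a:
                M = M_new                        # accept the proposed world
            samples.append((M, pi))
        return samples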
53 3.1 Priors over State Controller Policies. We now turn to the definition of the policy prior p(πe ). [sent-125, score-0.537]
54 In theory, any policy prior can be used, but there are some practical considerations. [sent-126, score-0.537]
55 Mathematically, the policy prior serves as a regularizer to avoid overfitting the expert data, so it should encode a preference toward simple policies. [sent-127, score-0.992]
56 In discrete domains, one choice for the policy prior (as well as the model prior) is the iPOMDP [12]. [sent-129, score-0.537]
57 The iPOMDP posits that there are an infinite number of states s but a few popular states are visited most of the time; the beam sampler [17] can efficiently draw samples of state transition, observation, and reward models for visited states. [sent-131, score-0.499]
58 To use the iPOMDP as a policy prior, we simply reverse the roles of actions and observations, treating the observations as inputs and the actions as outputs. [sent-133, score-0.69]
59 Now, the iPOMDP posits that there is a state controller with an infinite number of nodes n, but probable policies use only a small subset of the nodes a majority of the time. [sent-134, score-0.27]
60 We perform joint inference over the node transition and policy parameters β and π as well as the visited nodes n. [sent-135, score-0.615]
61 As in the model prior application, using the iPOMDP as a policy prior biases the agent towards simpler policies (those that visit fewer nodes) but allows the number of nodes to grow with new expert experience. [sent-138, score-1.303]
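The role reversal is largely a relabelling of the same histories. A small sketch of that bookkeeping, assuming each history step is stored as an (action, observation, reward) triple (an assumption for illustration):

    def model_view(history):
        """World-model prior: actions are exogenous inputs, and the
        observations (and rewards) are the outputs to be explained."""
        return [(a, (o, r)) for (a, o, r) in history]

    def policy_view(history):
        """Policy prior: roles are swapped, so observations are the inputs
        and the expert's actions are the outputs the controller must explain."""
        return [(o, a) for (a, o, r) in history]

    # The same iPOMDP machinery (e.g. the beam sampler) can then be applied to
    # either view: hidden states for the model, hidden controller nodes for the policy.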
62 There are some mild conditions on the world and policy priors to ensure consistency: since the policy prior and model prior are specified independently, we require that there exist models for which both the policy prior and model prior are non-zero in the limit of data. [sent-141, score-2.052]
63 Formally, we also require that the expert provide optimal trajectories; in practice, we see that this assumption can be relaxed. [sent-142, score-0.408]
64 The inference approaches of Sec. 3 output samples of models or policies to be used for planning. [sent-148, score-0.295]
65 During the testing phase, the internal belief state of the models (in the model-based approaches) or the internal node state of the policies (in the policy-based approaches) is updated after each action-observation pair. [sent-152, score-0.437]
66 Actions are chosen by first selecting, depending on the approach, a model or policy based on their weights, and then performing its most preferred action. [sent-154, score-0.441]
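A rough sketch of this test-time loop; the agent interface (reset, preferred_action, update) and the environment step signature are assumptions for illustration:

    import numpy as np

    def run_test_episode(candidates, weights, env, max_steps, rng):
        """Pick one sampled model-plus-planner or policy in proportion to its
        weight, then act with it, updating its internal state (belief or
        controller node) after every action-observation pair."""
        agent = candidates[rng.choice(len(candidates), p=weights)]
        agent.reset()                            # reset belief / start node
        total_reward = 0.0
        for _ in range(max_steps):
            a = agent.preferred_action()         # most preferred action
            obs, reward, done = env.step(a)
            agent.update(a, obs)                 # belief or node update
            total_reward += reward
            if done:
                break
        return total_reward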
67 4 Experiments. We first describe a pair of demonstrations that show two important properties of using policy priors: (1) that policy priors can be useful even in the absence of expert data, and (2) that our approach works even when the expert trajectories are not optimal. [sent-156, score-1.921]
68 We then compare policy priors with the basic iPOMDP [12] and a finite-state model learner trained with EM on several standard problems. [sent-157, score-0.597]
69 The agent was provided with an expert trajectory with some fixed probability. [sent-160, score-0.568]
70 No expert trajectories were provided in the last quarter of the iterations. [sent-162, score-0.482]
71 Models and policies were updated every 100 iterations, and each episode was capped at 50 iterations (though it could be shorter, if the task was achieved in fewer iterations). [sent-164, score-0.29]
72 Following each update, we ran 50 test episodes (not included in the agent’s experience) with the new models and policies to empirically evaluate the current value of the agent’s policy. [sent-165, score-0.287]
73 One iteration of bounded policy iteration [19] was performed per sampled model. [sent-168, score-0.47]
74 Policy Priors with No Expert Data. The combined policy and model prior can be used to encode a prior bias towards models with simpler control policies. [sent-171, score-0.786]
75 This bias can be useful even without expert data (cf. Eq. 6): the left pane of Fig. 2 shows the performance of the policy prior-biased approaches and the standard iPOMDP on a gridworld problem in which observations correspond to both the adjacent walls (relevant for planning) and the color of the square (not relevant for planning). [sent-173, score-0.408] [sent-174, score-0.595]
77 The optimal policy for this gridworld was simple: go east until the agent hits a wall, then go south. [sent-176, score-0.69]
78 Without expert data, Approach 1 cannot do better than iPOMDP. [sent-178, score-0.408]
79 By biasing the agent towards worlds that admit simpler policies, the model-based inference with policy priors (Approach 2) creates a faster learner. [sent-179, score-0.824]
80 Policy Priors with Imperfect Experts. While we focused on optimal expert data, in practice policy priors can be applied even if the expert is imperfect. [sent-180, score-1.384]
81 We generated expert data by first deriving 16 motor primitives for the action space using a clustering technique on a near-optimal trajectory produced by a rapidly-exploring random tree (RRT). [sent-185, score-0.498]
82 Trajectories from this controller were treated as expert data for our policy prior model. [sent-187, score-1.028]
83 In the beach problem, the agent needed to track a beach ball on a 2D grid. [sent-191, score-0.288]
84 We compared our inference approaches with two approaches that did not leverage the expert data: expectation-maximization (EM) used to learn a finite world model of the correct size and the infinite POMDP [12], which placed the same nonparametric prior over world models as we did. [sent-193, score-1.077]
85 Fig. 3 shows the learning curves for our policy priors approaches (problems ordered by state space size); the cumulative rewards and final values are shown in Table 1. [sent-200, score-0.833]
86 As expected, approaches that leverage expert trajectories generally perform better than those that ignore the near-optimality of the expert data. [sent-201, score-0.941]
87 Here, even though the inferred state spaces could grow large, policies remained relatively simple. [sent-203, score-0.312]
88 The optimization used in the policy-based approach—recall we use the stochastic search to find a probable policy—was also key to producing reasonable policies with limited computation. [sent-204, score-0.289]
89 (Table 1 header residue: domains tiger, network, shuttle, follow, gridworld, hallway, beach, rocksample, tag, and image; columns report cumulative reward for the iPOMDP and the compared approaches.) [sent-205, score-0.362]
90 Both [1, 4] describe how expert data augment learning. [sent-313, score-0.408]
91 In contrast, [4] lets the agent query an expert for optimal actions. [sent-317, score-0.594]
92 While policy information may be much easier to specify, incorporating the result of a single query into the prior over models is challenging; the particle-filtering approach of [4] can be brittle as model spaces grow large. [sent-318, score-0.656]
93 Our policy priors approach uses entire trajectories; by learning policies rather than single actions, we can generalize better and evaluate models more holistically. [sent-319, score-0.834]
94 Targeted criteria for asking for expert trajectories, especially criteria with performance guarantees such as those of [4], would be an interesting extension to our approach. [sent-321, score-0.408]
95 6 Conclusion. We addressed a key gap in the learning-by-demonstration literature: learning from both expert and agent data in a partially observable setting. [sent-322, score-0.717]
96 Prior work used expert data in MDP and imitation-learning cases, but less work exists for the general POMDP case. [sent-323, score-0.408]
97 Our Bayesian approach combined priors over the world models and policies, connecting information about world dynamics and expert trajectories. [sent-324, score-0.98]
98 Taken together, these priors are a new way to think about specifying priors over models: instead of simply putting a prior over the dynamics, our prior provides a bias towards models with simple dynamics and simple optimal policies. [sent-325, score-0.601]
99 We show that with our approach expert data never reduces performance, and that our extra bias towards controllability improves performance even without expert data. [sent-326, score-0.882]
100 Our policy priors over nonparametric finite state controllers were relatively simple; designing richer classes of priors to address more problems is an interesting direction for future work. [sent-327, score-0.861]
wordName wordTfidf (topN-words)
[('policy', 0.441), ('expert', 0.408), ('fm', 0.288), ('ipomdp', 0.287), ('da', 0.227), ('policies', 0.226), ('world', 0.178), ('agent', 0.16), ('gm', 0.145), ('priors', 0.127), ('rewards', 0.125), ('plan', 0.12), ('actions', 0.107), ('pomdp', 0.107), ('observable', 0.099), ('prior', 0.096), ('reinforcement', 0.092), ('gridworld', 0.089), ('rl', 0.083), ('controller', 0.083), ('reward', 0.074), ('trajectories', 0.074), ('planning', 0.073), ('action', 0.069), ('experience', 0.066), ('beach', 0.064), ('state', 0.063), ('snake', 0.059), ('bayesian', 0.058), ('pomdps', 0.055), ('exploration', 0.055), ('states', 0.055), ('nonparametric', 0.052), ('controllers', 0.051), ('finale', 0.051), ('hallway', 0.051), ('shuttle', 0.051), ('partially', 0.05), ('dynamics', 0.049), ('mi', 0.048), ('cumulative', 0.047), ('encode', 0.047), ('rocksample', 0.044), ('joelle', 0.044), ('inference', 0.044), ('visited', 0.043), ('iterations', 0.043), ('beam', 0.041), ('models', 0.04), ('bias', 0.039), ('pineau', 0.038), ('planner', 0.038), ('probable', 0.037), ('sampling', 0.036), ('observations', 0.035), ('histories', 0.035), ('transition', 0.034), ('fsc', 0.034), ('multicolored', 0.034), ('tiger', 0.033), ('nodes', 0.031), ('draw', 0.031), ('tag', 0.03), ('brahim', 0.03), ('brittle', 0.03), ('controllable', 0.03), ('exogenous', 0.03), ('approaches', 0.03), ('em', 0.029), ('samples', 0.029), ('sampled', 0.029), ('learner', 0.029), ('posterior', 0.028), ('stephane', 0.027), ('towards', 0.027), ('query', 0.026), ('search', 0.026), ('poupart', 0.025), ('posits', 0.025), ('worlds', 0.025), ('forward', 0.025), ('executes', 0.024), ('nite', 0.024), ('nicholas', 0.023), ('imitation', 0.023), ('belief', 0.023), ('grow', 0.023), ('graphical', 0.023), ('node', 0.022), ('mh', 0.022), ('demonstrations', 0.022), ('beal', 0.022), ('ross', 0.021), ('indirect', 0.021), ('episodes', 0.021), ('primitives', 0.021), ('fewer', 0.021), ('leverage', 0.021), ('bayes', 0.021), ('preferences', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning
Author: Finale Doshi-velez, David Wingate, Nicholas Roy, Joshua B. Tenenbaum
Abstract: We consider reinforcement learning in partially observable domains where the agent can query an expert for demonstrations. Our nonparametric Bayesian approach combines model knowledge, inferred from expert information and independent exploration, with policy knowledge inferred from expert trajectories. We introduce priors that bias the agent towards models with both simple representations and simple policies, resulting in improved policy and model learning. 1
2 0.42054644 14 nips-2010-A Reduction from Apprenticeship Learning to Classification
Author: Umar Syed, Robert E. Schapire
Abstract: We provide new theoretical results for apprenticeship learning, a variant of reinforcement learning in which the true reward function is unknown, and the goal is to perform well relative to an observed expert. We study a common approach to learning from expert demonstrations: using a classification algorithm to learn to imitate the expert’s behavior. Although this straightforward learning strategy is widely-used in practice, it has been subject to very little formal analysis. We prove that, if the learned classifier has error rate ǫ, the difference between the value of the apprentice’s policy and the expert’s policy is O(√ǫ). Further, we prove that this difference is only O(ǫ) when the expert’s policy is close to optimal. This latter result has an important practical consequence: Not only does imitating a near-optimal expert result in a better policy, but far fewer demonstrations are required to successfully imitate such an expert. This suggests an opportunity for substantial savings whenever the expert is known to be good, but demonstrations are expensive or difficult to obtain. 1
3 0.35265294 152 nips-2010-Learning from Logged Implicit Exploration Data
Author: Alex Strehl, John Langford, Lihong Li, Sham M. Kakade
Abstract: We provide a sound and consistent foundation for the use of nonrandom exploration data in “contextual bandit” or “partially labeled” settings where only the value of a chosen action is learned. The primary challenge in a variety of settings is that the exploration policy, in which “offline” data is logged, is not explicitly known. Prior solutions here require either control of the actions during the learning process, recorded random exploration, or actions chosen obliviously in a repeated manner. The techniques reported here lift these restrictions, allowing the learning of a policy for choosing actions given features from historical data where no randomization occurred or was logged. We empirically verify our solution on two reasonably sized sets of real-world data obtained from Yahoo!.
4 0.35056046 189 nips-2010-On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient
Author: Tang Jie, Pieter Abbeel
Abstract: Likelihood ratio policy gradient methods have been some of the most successful reinforcement learning algorithms, especially for learning on physical systems. We describe how the likelihood ratio policy gradient can be derived from an importance sampling perspective. This derivation highlights how likelihood ratio methods under-use past experience by (i) using the past experience to estimate only the gradient of the expected return U (θ) at the current policy parameterization θ, rather than to obtain a more complete estimate of U (θ), and (ii) using past experience under the current policy only rather than using all past experience to improve the estimates. We present a new policy search method, which leverages both of these observations as well as generalized baselines—a new technique which generalizes commonly used baseline techniques for policy gradient methods. Our algorithm outperforms standard likelihood ratio policy gradient algorithms on several testbeds. 1
5 0.32786447 43 nips-2010-Bootstrapping Apprenticeship Learning
Author: Abdeslam Boularias, Brahim Chaib-draa
Abstract: We consider the problem of apprenticeship learning where the examples, demonstrated by an expert, cover only a small part of a large state space. Inverse Reinforcement Learning (IRL) provides an efficient tool for generalizing the demonstration, based on the assumption that the expert is maximizing a utility function that is a linear combination of state-action features. Most IRL algorithms use a simple Monte Carlo estimation to approximate the expected feature counts under the expert’s policy. In this paper, we show that the quality of the learned policies is highly sensitive to the error in estimating the feature counts. To reduce this error, we introduce a novel approach for bootstrapping the demonstration by assuming that: (i), the expert is (near-)optimal, and (ii), the dynamics of the system is known. Empirical results on gridworlds and car racing problems show that our approach is able to learn good policies from a small number of demonstrations. 1
6 0.30144405 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks
7 0.28807643 208 nips-2010-Policy gradients in linearly-solvable MDPs
8 0.25278929 160 nips-2010-Linear Complementarity for Regularized Policy Evaluation and Improvement
9 0.24672708 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
10 0.22848187 93 nips-2010-Feature Construction for Inverse Reinforcement Learning
11 0.22462733 78 nips-2010-Error Propagation for Approximate Policy and Value Iteration
12 0.20230143 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning
13 0.19525822 229 nips-2010-Reward Design via Online Gradient Ascent
14 0.19219503 4 nips-2010-A Computational Decision Theory for Interactive Assistants
15 0.18522991 168 nips-2010-Monte-Carlo Planning in Large POMDPs
16 0.18499069 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains
17 0.18056253 134 nips-2010-LSTD with Random Projections
18 0.16436574 11 nips-2010-A POMDP Extension with Belief-dependent Rewards
19 0.1503675 38 nips-2010-Batch Bayesian Optimization via Simulation Matching
20 0.1379499 64 nips-2010-Distributionally Robust Markov Decision Processes
topicId topicWeight
[(0, 0.258), (1, -0.554), (2, -0.105), (3, -0.067), (4, -0.053), (5, -0.024), (6, 0.049), (7, -0.008), (8, 0.035), (9, -0.057), (10, -0.044), (11, 0.041), (12, -0.025), (13, -0.021), (14, 0.008), (15, -0.035), (16, 0.002), (17, 0.056), (18, -0.023), (19, -0.005), (20, -0.011), (21, -0.041), (22, 0.024), (23, -0.031), (24, 0.088), (25, 0.021), (26, -0.038), (27, 0.031), (28, -0.002), (29, 0.0), (30, -0.013), (31, 0.04), (32, 0.063), (33, -0.015), (34, 0.011), (35, 0.007), (36, -0.041), (37, 0.044), (38, 0.06), (39, -0.022), (40, -0.034), (41, 0.064), (42, 0.05), (43, 0.02), (44, -0.061), (45, 0.035), (46, 0.011), (47, 0.058), (48, 0.042), (49, -0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.97199029 184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning
Author: Finale Doshi-velez, David Wingate, Nicholas Roy, Joshua B. Tenenbaum
Abstract: We consider reinforcement learning in partially observable domains where the agent can query an expert for demonstrations. Our nonparametric Bayesian approach combines model knowledge, inferred from expert information and independent exploration, with policy knowledge inferred from expert trajectories. We introduce priors that bias the agent towards models with both simple representations and simple policies, resulting in improved policy and model learning. 1
2 0.91377145 14 nips-2010-A Reduction from Apprenticeship Learning to Classification
Author: Umar Syed, Robert E. Schapire
Abstract: We provide new theoretical results for apprenticeship learning, a variant of reinforcement learning in which the true reward function is unknown, and the goal is to perform well relative to an observed expert. We study a common approach to learning from expert demonstrations: using a classification algorithm to learn to imitate the expert’s behavior. Although this straightforward learning strategy is widely-used in practice, it has been subject to very little formal analysis. We prove that, if the learned classifier has error rate ǫ, the difference between the value of the apprentice’s policy and the expert’s policy is O(√ǫ). Further, we prove that this difference is only O(ǫ) when the expert’s policy is close to optimal. This latter result has an important practical consequence: Not only does imitating a near-optimal expert result in a better policy, but far fewer demonstrations are required to successfully imitate such an expert. This suggests an opportunity for substantial savings whenever the expert is known to be good, but demonstrations are expensive or difficult to obtain. 1
3 0.89032978 43 nips-2010-Bootstrapping Apprenticeship Learning
Author: Abdeslam Boularias, Brahim Chaib-draa
Abstract: We consider the problem of apprenticeship learning where the examples, demonstrated by an expert, cover only a small part of a large state space. Inverse Reinforcement Learning (IRL) provides an efficient tool for generalizing the demonstration, based on the assumption that the expert is maximizing a utility function that is a linear combination of state-action features. Most IRL algorithms use a simple Monte Carlo estimation to approximate the expected feature counts under the expert’s policy. In this paper, we show that the quality of the learned policies is highly sensitive to the error in estimating the feature counts. To reduce this error, we introduce a novel approach for bootstrapping the demonstration by assuming that: (i), the expert is (near-)optimal, and (ii), the dynamics of the system is known. Empirical results on gridworlds and car racing problems show that our approach is able to learn good policies from a small number of demonstrations. 1
4 0.83316964 152 nips-2010-Learning from Logged Implicit Exploration Data
Author: Alex Strehl, John Langford, Lihong Li, Sham M. Kakade
Abstract: We provide a sound and consistent foundation for the use of nonrandom exploration data in “contextual bandit” or “partially labeled” settings where only the value of a chosen action is learned. The primary challenge in a variety of settings is that the exploration policy, in which “offline” data is logged, is not explicitly known. Prior solutions here require either control of the actions during the learning process, recorded random exploration, or actions chosen obliviously in a repeated manner. The techniques reported here lift these restrictions, allowing the learning of a policy for choosing actions given features from historical data where no randomization occurred or was logged. We empirically verify our solution on two reasonably sized sets of real-world data obtained from Yahoo!.
5 0.82141542 189 nips-2010-On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient
Author: Tang Jie, Pieter Abbeel
Abstract: Likelihood ratio policy gradient methods have been some of the most successful reinforcement learning algorithms, especially for learning on physical systems. We describe how the likelihood ratio policy gradient can be derived from an importance sampling perspective. This derivation highlights how likelihood ratio methods under-use past experience by (i) using the past experience to estimate only the gradient of the expected return U (θ) at the current policy parameterization θ, rather than to obtain a more complete estimate of U (θ), and (ii) using past experience under the current policy only rather than using all past experience to improve the estimates. We present a new policy search method, which leverages both of these observations as well as generalized baselines—a new technique which generalizes commonly used baseline techniques for policy gradient methods. Our algorithm outperforms standard likelihood ratio policy gradient algorithms on several testbeds. 1
6 0.79825401 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks
7 0.77547312 50 nips-2010-Constructing Skill Trees for Reinforcement Learning Agents from Demonstration Trajectories
8 0.7711283 160 nips-2010-Linear Complementarity for Regularized Policy Evaluation and Improvement
9 0.73730069 208 nips-2010-Policy gradients in linearly-solvable MDPs
10 0.7260552 93 nips-2010-Feature Construction for Inverse Reinforcement Learning
11 0.72555572 4 nips-2010-A Computational Decision Theory for Interactive Assistants
12 0.6827814 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains
13 0.68127167 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning
14 0.67345685 78 nips-2010-Error Propagation for Approximate Policy and Value Iteration
15 0.65593833 168 nips-2010-Monte-Carlo Planning in Large POMDPs
16 0.65540242 229 nips-2010-Reward Design via Online Gradient Ascent
17 0.59580761 38 nips-2010-Batch Bayesian Optimization via Simulation Matching
18 0.58206511 64 nips-2010-Distributionally Robust Markov Decision Processes
19 0.55068195 212 nips-2010-Predictive State Temporal Difference Learning
20 0.54869586 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
topicId topicWeight
[(0, 0.015), (13, 0.025), (17, 0.012), (27, 0.06), (30, 0.042), (39, 0.023), (45, 0.266), (50, 0.05), (52, 0.021), (60, 0.043), (77, 0.046), (88, 0.277), (90, 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.83346367 184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning
Author: Finale Doshi-velez, David Wingate, Nicholas Roy, Joshua B. Tenenbaum
Abstract: We consider reinforcement learning in partially observable domains where the agent can query an expert for demonstrations. Our nonparametric Bayesian approach combines model knowledge, inferred from expert information and independent exploration, with policy knowledge inferred from expert trajectories. We introduce priors that bias the agent towards models with both simple representations and simple policies, resulting in improved policy and model learning. 1
2 0.83219439 55 nips-2010-Cross Species Expression Analysis using a Dirichlet Process Mixture Model with Latent Matchings
Author: Ziv Bar-joseph, Hai-son P. Le
Abstract: Recent studies compare gene expression data across species to identify core and species specific genes in biological systems. To perform such comparisons researchers need to match genes across species. This is a challenging task since the correct matches (orthologs) are not known for most genes. Previous work in this area used deterministic matchings or reduced multidimensional expression data to binary representation. Here we develop a new method that can utilize soft matches (given as priors) to infer both, unique and similar expression patterns across species and a matching for the genes in both species. Our method uses a Dirichlet process mixture model which includes a latent data matching variable. We present learning and inference algorithms based on variational methods for this model. Applying our method to immune response data we show that it can accurately identify common and unique response patterns by improving the matchings between human and mouse genes. 1
3 0.78826261 224 nips-2010-Regularized estimation of image statistics by Score Matching
Author: Diederik P. Kingma, Yann L. Cun
Abstract: Score Matching is a recently-proposed criterion for training high-dimensional density models for which maximum likelihood training is intractable. It has been applied to learning natural image statistics but has so-far been limited to simple models due to the difficulty of differentiating the loss with respect to the model parameters. We show how this differentiation can be automated with an extended version of the double-backpropagation algorithm. In addition, we introduce a regularization term for the Score Matching loss that enables its use for a broader range of problem by suppressing instabilities that occur with finite training sample sizes and quantized input values. Results are reported for image denoising and super-resolution.
4 0.71470016 164 nips-2010-MAP Estimation for Graphical Models by Likelihood Maximization
Author: Akshat Kumar, Shlomo Zilberstein
Abstract: Computing a maximum a posteriori (MAP) assignment in graphical models is a crucial inference problem for many practical applications. Several provably convergent approaches have been successfully developed using linear programming (LP) relaxation of the MAP problem. We present an alternative approach, which transforms the MAP problem into that of inference in a mixture of simple Bayes nets. We then derive the Expectation Maximization (EM) algorithm for this mixture that also monotonically increases a lower bound on the MAP assignment until convergence. The update equations for the EM algorithm are remarkably simple, both conceptually and computationally, and can be implemented using a graph-based message passing paradigm similar to max-product computation. Experiments on the real-world protein design dataset show that EM’s convergence rate is significantly higher than the previous LP relaxation based approach MPLP. EM also achieves a solution quality within 95% of optimal for most instances. 1
5 0.71425533 239 nips-2010-Sidestepping Intractable Inference with Structured Ensemble Cascades
Author: David Weiss, Benjamin Sapp, Ben Taskar
Abstract: For many structured prediction problems, complex models often require adopting approximate inference techniques such as variational methods or sampling, which generally provide no satisfactory accuracy guarantees. In this work, we propose sidestepping intractable inference altogether by learning ensembles of tractable sub-models as part of a structured prediction cascade. We focus in particular on problems with high-treewidth and large state-spaces, which occur in many computer vision tasks. Unlike other variational methods, our ensembles do not enforce agreement between sub-models, but filter the space of possible outputs by simply adding and thresholding the max-marginals of each constituent model. Our framework jointly estimates parameters for all models in the ensemble for each level of the cascade by minimizing a novel, convex loss function, yet requires only a linear increase in computation over learning or inference in a single tractable sub-model. We provide a generalization bound on the filtering loss of the ensemble as a theoretical justification of our approach, and we evaluate our method on both synthetic data and the task of estimating articulated human pose from challenging videos. We find that our approach significantly outperforms loopy belief propagation on the synthetic data and a state-of-the-art model on the pose estimation/tracking problem. 1
6 0.71420723 287 nips-2010-Worst-Case Linear Discriminant Analysis
7 0.71289814 63 nips-2010-Distributed Dual Averaging In Networks
8 0.71226567 103 nips-2010-Generating more realistic images using gated MRF's
9 0.71198553 172 nips-2010-Multi-Stage Dantzig Selector
10 0.71195799 1 nips-2010-(RF)^2 -- Random Forest Random Field
11 0.71183485 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning
12 0.71147776 218 nips-2010-Probabilistic latent variable models for distinguishing between cause and effect
13 0.71143663 282 nips-2010-Variable margin losses for classifier design
14 0.71066219 29 nips-2010-An Approximate Inference Approach to Temporal Optimization in Optimal Control
15 0.71064502 158 nips-2010-Learning via Gaussian Herding
16 0.71061569 105 nips-2010-Getting lost in space: Large sample analysis of the resistance distance
17 0.71033049 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
18 0.71032995 225 nips-2010-Relaxed Clipping: A Global Training Method for Robust Regression and Classification
19 0.70992702 185 nips-2010-Nonparametric Density Estimation for Stochastic Optimization with an Observable State Variable
20 0.70962989 174 nips-2010-Multi-label Multiple Kernel Learning by Stochastic Approximation: Application to Visual Object Recognition