nips nips2013 nips2013-165 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Beomjoon Kim, Amir massoud Farahmand, Joelle Pineau, Doina Precup
Abstract: We propose a Learning from Demonstration (LfD) algorithm which leverages expert data, even if they are very few or inaccurate. We achieve this by using both expert data, as well as reinforcement signals gathered through trial-and-error interactions with the environment. The key idea of our approach, Approximate Policy Iteration with Demonstration (APID), is that expert’s suggestions are used to define linear constraints which guide the optimization performed by Approximate Policy Iteration. We prove an upper bound on the Bellman error of the estimate computed by APID at each iteration. Moreover, we show empirically that APID outperforms pure Approximate Policy Iteration, a state-of-the-art LfD algorithm, and supervised learning in a variety of scenarios, including when very few and/or suboptimal demonstrations are available. Our experiments include simulations as well as a real robot path-finding task. 1
Reference: text
sentIndex sentText sentNum sentScore
1 We achieve this by using both expert data, as well as reinforcement signals gathered through trial-and-error interactions with the environment. [sent-2, score-0.4]
2 Moreover, we show empirically that APID outperforms pure Approximate Policy Iteration, a state-of-the-art LfD algorithm, and supervised learning in a variety of scenarios, including when very few and/or suboptimal demonstrations are available. [sent-5, score-0.283]
3 Our experiments include simulations as well as a real robot path-finding task. [sent-6, score-0.17]
4 1 Introduction Learning from Demonstration (LfD) is a practical framework for learning complex behaviour policies from demonstration trajectories produced by an expert. [sent-7, score-0.17]
5 In most conventional approaches to LfD, the agent observes mappings between states and actions in the expert trajectories, and uses supervised learning to estimate a function that can approximately reproduce this mapping. [sent-8, score-0.522]
6 The learned mapping (i.e., policy) should also generalize well to regions of the state space that are not observed in the demonstration data. [sent-11, score-0.105]
7 Many of the recent methods focus on incrementally querying the expert in appropriate regions of the state space to improve the learned policy, or to reduce uncertainty [1, 2, 3]. [sent-12, score-0.376]
8 Key assumptions of most of these works are that (1) the expert exhibits optimal behaviour, (2) the expert demonstrations are abundant, and (3) the expert stays with the learning agent throughout the training. [sent-13, score-1.31]
9 We present a framework that leverages insights and techniques from the reinforcement learning (RL) literature to overcome these limitations of the conventional LfD methods. [sent-15, score-0.099]
10 A combination of both expert and interaction data (i.e., mixing LfD and RL), however, offers a tantalizing way to effectively address challenging real-world policy learning problems under realistic assumptions. [sent-19, score-0.343] [sent-21, score-0.238]
12 The method is formulated as a coupled-constraint convex optimization, in which expert demonstrations define a set of linear constraints in API. [sent-23, score-0.539]
13 The optimization is formulated in a way that permits mistakes in the demonstrations provided by the expert, and also accommodates variable availability of demonstrations. [sent-24, score-0.424]
14 We evaluate our algorithm in a simulated environment under various scenarios, such as varying the quality and quantity of expert demonstrations. [sent-28, score-0.39]
15 We also evaluate the algorithm’s practicality in a real robot path finding task, where there are few demonstrations, and exploration data is expensive due to limited time. [sent-30, score-0.238]
16 Our method also significantly outperformed Dataset Aggregation (DAgger), a state-of-the-art LfD algorithm, in cases where the expert demonstrations are few or suboptimal. [sent-32, score-0.571]
17 2 Proposed Algorithm We consider a continuous-state, finite-action discounted MDP (X , A, P, R, γ), where X is a measurable state space, A is a finite set of actions, P : X × A → M(X ) is the transition model, R : X × A → M(R) is the reward model, and γ ∈ [0, 1) is a discount factor. [sent-33, score-0.078]
18 As usual, V π and Qπ denote the value and action-value function for π, and V ∗ and Q∗ denote the corresponding value functions for the optimal policy π ∗ [5]. [sent-36, score-0.258]
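For reference, the value functions and Bellman operator used throughout take the standard form; written out (textbook definitions, not a quotation from this paper):
\[
Q^{\pi}(x,a) = \mathbb{E}\Big[\textstyle\sum_{t \ge 0} \gamma^{t} R_t \,\Big|\, X_0 = x,\, A_0 = a,\, \pi\Big],
\qquad
(T^{\pi} Q)(x,a) = r(x,a) + \gamma \int Q\big(x', \pi(x')\big)\, P(dx' \mid x, a),
\]
so that \(Q^{\pi}\) is the fixed point \(Q^{\pi} = T^{\pi} Q^{\pi}\), and a policy is greedy with respect to \(Q\) if \(\pi(x) \in \operatorname*{argmax}_{a \in A} Q(x,a)\).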
19 A standard API algorithm starts with an initial policy π0 . [sent-38, score-0.254]
20 At the (k + 1)th iteration, given a policy $\pi_k$, the algorithm approximately evaluates $\pi_k$ to find $\hat{Q}_k$, usually as an approximate fixed point of the Bellman operator $T^{\pi_k}$: $\hat{Q}_k \approx T^{\pi_k} \hat{Q}_k$. [sent-39, score-0.254]
21 Then, a new policy $\pi_{k+1}$ is computed, which is greedy with respect to $\hat{Q}_k$. [sent-41, score-0.238]
22 There are several variants of API that mostly differ on how the approximate policy evaluation is performed. [sent-42, score-0.254]
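In code, the generic API loop described here (the loop into which APID's evaluation step plugs) looks roughly as follows; this is an illustrative Python skeleton with placeholder function names, not the authors' implementation:

```python
def approximate_policy_iteration(evaluate_policy, improve_policy, pi0, num_iterations):
    """Generic API skeleton.

    evaluate_policy(pi) -> Q_hat : approximate policy evaluation, e.g. an
        approximate fixed point of the Bellman operator T^pi (this is the
        step APID replaces with its constrained optimization).
    improve_policy(Q_hat) -> pi  : policy improvement, greedy w.r.t. Q_hat.
    """
    pi = pi0
    for _ in range(num_iterations):
        Q_hat = evaluate_policy(pi)   # approximate policy evaluation
        pi = improve_policy(Q_hat)    # greedy policy improvement
    return pi


def greedy_policy_from(Q_hat, actions):
    """Greedy improvement step for a finite action set."""
    return lambda x: max(actions, key=lambda a: Q_hat(x, a))
```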
23 This is precisely our case, since we have expert demonstrations. [sent-44, score-0.343]
24 To develop the algorithm, we start with regularized Bellman error minimization, which is a common flavour of policy evaluation used in API. [sent-45, score-0.274]
25 Suppose that we want to evaluate a policy $\pi$ given a batch of data $D_{RL} = \{(X_i, A_i)\}_{i=1}^{n}$ containing $n$ examples, and that the exact Bellman operator $T^{\pi}$ is known. [sent-46, score-0.238]
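In that idealized setting, the regularized Bellman error minimization being referred to has the generic form below, where $J(Q)$ is the regularizer (e.g. an RKHS norm) and $\lambda > 0$ its coefficient; the exact empirical weighting is an assumption of this sketch:
\[
\hat{Q} \;\leftarrow\; \operatorname*{argmin}_{Q \in \mathcal{F}^{|A|}}
\;\frac{1}{n} \sum_{i=1}^{n} \Big( Q(X_i, A_i) - (T^{\pi} Q)(X_i, A_i) \Big)^{2}
\;+\; \lambda J^{2}(Q).
\]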
26 In addition to $D_{RL}$, we have a set of expert examples $D_E = \{(X_i, \pi_E(X_i))\}_{i=1}^{m}$, which we would like to take into account in the optimization process. [sent-55, score-0.343]
27 The intuition behind our algorithm is that we want to use the expert examples to “shape” the value function where they are available, while using the RL data to improve the policy everywhere else. [sent-56, score-0.581]
28 Hence, even if we have few demonstration examples, we can still obtain good generalization everywhere due to the RL data. [sent-57, score-0.088]
29 To incorporate the expert examples in the algorithm, one might require that at the states $X_i$ from $D_E$ the demonstrated action $\pi_E(X_i)$ be optimal, which can be expressed as a large-margin constraint. (Footnote 1: For a space $\Omega$ with $\sigma$-algebra $\sigma_\Omega$, $M(\Omega)$ denotes the set of all probability measures over $\sigma_\Omega$.) [sent-58, score-0.383]
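A plausible reconstruction of that large-margin constraint, in the notation above (the unit margin follows the usual large-margin convention and is an assumption of this sketch):
\[
Q\big(X_i, \pi_E(X_i)\big) \;-\; \max_{a \in A \setminus \{\pi_E(X_i)\}} Q(X_i, a) \;\ge\; 1,
\qquad i = 1, \dots, m.
\]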
30 However, this might not always be feasible, or desirable (if the expert itself is not optimal), so we add slack variables $\xi_i \ge 0$ to allow occasional violations of the constraints (similar to soft vs. hard margins in SVMs). [sent-62, score-0.367]
31 The policy evaluation step can then be written as the following constrained optimization problem: $\hat{Q} \leftarrow \operatorname*{argmin}_{Q \in \mathcal{F}^{|A|},\, \xi \in \mathbb{R}^{m}_{+}} \| Q - T^{\pi} Q \|^{2}$ (plus the regularizer and a penalty on the slacks), subject to the large-margin constraints above relaxed to $1 - \xi_i$. [sent-64, score-0.267]
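A minimal sketch of how such a policy-evaluation step could be set up, assuming a linear parameterization $Q(x,a) = \phi(x,a)^\top w$, a one-sample empirical Bellman operator, and a ridge penalty standing in for $\lambda J^2(Q)$. The word list above mentions CVX; this sketch uses the Python package cvxpy instead, and every name, weighting, and default here is an illustrative assumption rather than the paper's implementation:

```python
import cvxpy as cp


def apid_policy_evaluation(Phi_rl, R_rl, Phi_rl_next, Phi_e, Phi_e_other,
                           gamma=0.99, lam=1e-2, alpha=1.0):
    """One APID-style policy-evaluation step (sketch, linear Q(x,a) = phi(x,a)^T w).

    Phi_rl      : (n, d) features of (X_i, A_i) from D_RL.
    R_rl        : (n,)   observed rewards for those pairs.
    Phi_rl_next : (n, d) features of (X_i', pi(X_i')) -- one-sample empirical T^pi target.
    Phi_e       : (m, d) features of the expert pairs (X_i, pi_E(X_i)) from D_E.
    Phi_e_other : (m, K, d) features of the K non-expert actions at each expert state.
    """
    n = Phi_rl.shape[0]
    m, K, _ = Phi_e_other.shape

    w = cp.Variable(Phi_rl.shape[1])
    xi = cp.Variable(m, nonneg=True)            # slacks: tolerate imperfect demonstrations

    # Empirical Bellman residual ||Q - T^pi Q||^2 on D_RL.
    residual = Phi_rl @ w - (R_rl + gamma * (Phi_rl_next @ w))
    objective = (cp.sum_squares(residual) / n
                 + lam * cp.sum_squares(w)      # ridge penalty standing in for lambda * J^2(Q)
                 + alpha / m * cp.sum(xi))      # penalize constraint violations

    # Large-margin constraints: the expert action beats every other action by 1 - xi_i.
    constraints = [Phi_e[i] @ w - Phi_e_other[i, k] @ w >= 1 - xi[i]
                   for i in range(m) for k in range(K)]

    cp.Problem(cp.Minimize(objective), constraints).solve()
    return w.value
```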
32 The approach presented so far tackles the policy evaluation step of the API algorithm. [sent-97, score-0.238]
33 As usual in API, we alternate this step with the policy improvement step (i. [sent-98, score-0.238]
34 These datasets might be regenerated at each iteration, or they might be reused, depending on the availability of the expert and the environment. [sent-103, score-0.423]
35 In practice, if the expert data is rare, $D_E$ will be a single fixed batch, but $D_{RL}$ could be increased, e.g., by running the most current policy (possibly with some exploration) to collect more data. [sent-104, score-0.343] [sent-106, score-0.238]
37 Note that the values of the regularization coefficients, as well as α, should ideally change from iteration to iteration as a function of the number of samples and of the value function $Q^{\pi_k}$. [sent-108, score-0.114]
38 Such an upper bound allows us to use error propagation results [16, 17] to provide a performance guarantee on the value of the resulting policy $\pi_K$ (the policy obtained after K iterations of the algorithm) compared to the optimal value function $V^{*}$. [sent-112, score-0.558]
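As a point of reference, the classical sup-norm propagation result for approximate policy iteration has the following form (a textbook statement, not the paper's norm-based guarantee built on [16, 17]):
\[
\limsup_{K \to \infty} \big\| Q^{*} - Q^{\pi_K} \big\|_{\infty}
\;\le\; \frac{2\gamma}{(1-\gamma)^{2}} \, \sup_{k} \big\| \hat{Q}_k - Q^{\pi_k} \big\|_{\infty}.
\]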
39 $X_i \sim \nu_E \in M(X)$, $i = 1, \dots, m$, from an expert distribution $\nu_E$. [sent-126, score-0.343]
40 Assumption A3 (Function Approximation Property) For any policy $\pi$, $Q^{\pi} \in \mathcal{F}^{|A|}$. [sent-134, score-0.238]
41 Assumption A4 (Expansion of Smoothness) For all $Q \in \mathcal{F}^{|A|}$, there exist constants $0 \le L_R, L_P < \infty$, depending only on the MDP and $\mathcal{F}^{|A|}$, such that for any policy $\pi$, $J(T^{\pi} Q) \le L_R + \gamma L_P J(Q)$. [sent-135, score-0.238]
42 For any fixed policy π, let Q be the solution to the optimization problem (2) with the choice of α > 0 and λ > 0. [sent-154, score-0.238]
43 4 Experiments We evaluate APID on a simulated domain, as well as a real robot path-finding task. [sent-179, score-0.187]
44 In the simulated environment, we compare APID against other benchmarks under varying availability and optimality of the expert demonstrations. [sent-180, score-0.392]
45 In the real robot task, we evaluate the practicality of deploying APID on a live system, especially when DRL and DE are both expensive to obtain. [sent-181, score-0.204]
46 4.1 Car Brake Control Simulation In the vehicle brake control simulation [21], the agent’s goal is to reach a target velocity, then maintain that target. [sent-183, score-0.135]
47 It can either press the acceleration pedal or the brake pedal, but not both simultaneously. [sent-184, score-0.185]
48 A state is represented by four continuous-valued features: target and current velocities, current positions of brake pedal and acceleration pedal. [sent-185, score-0.202]
49 Given a state, the learned policy has to output one of five actions: acceleration up, acceleration down, brake up, brake down, do nothing. [sent-186, score-0.49]
50 The initial velocity is set to 2m/s, and the target velocity is set to 7m/s. [sent-188, score-0.116]
51 The expert was implemented using the dynamics between the pedal pressure and output velocity, from which we calculate the optimal velocity at each state. [sent-189, score-0.491]
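To make the setup concrete, a minimal encoding of this environment's interface might look as follows (a sketch only; the field names and the choice of a dataclass are assumptions, not the simulator from [21]):

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ACCEL_UP = 0
    ACCEL_DOWN = 1
    BRAKE_UP = 2
    BRAKE_DOWN = 3
    DO_NOTHING = 4


@dataclass
class CarState:
    target_velocity: float   # m/s; 7.0 in the reported experiments
    current_velocity: float  # m/s; starts at 2.0
    brake_pedal: float       # current brake pedal position
    accel_pedal: float       # current acceleration pedal position

    def features(self):
        """The four continuous-valued features fed to the Q-function."""
        return [self.target_velocity, self.current_velocity,
                self.brake_pedal, self.accel_pedal]
```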
52 Each iteration adds 500 new RL data points to APID and LSPI, while the expert data stays the same. [sent-194, score-0.4]
53 Depending on the availability of expert data, we compare APID either with standard supervised LfD or with DAgger [1], a state-of-the-art LfD algorithm that has strong theoretical guarantees and good empirical performance when the expert is optimal. [sent-203, score-0.786]
54 DAgger is designed to query for more demonstrations at each iteration; then, it aggregates all demonstrations and trains a new policy. [sent-204, score-0.392]
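For comparison, the DAgger loop being described can be sketched as follows (an illustrative Python skeleton with placeholder functions, not the baseline's actual code; the usual stochastic mixing of expert and learner during rollouts is omitted for brevity):

```python
def dagger(initial_demos, query_expert, rollout, train_classifier, num_iterations):
    """Simplified Dataset Aggregation (DAgger) loop [1].

    initial_demos    : list of (state, expert_action) pairs.
    query_expert(s)  : returns the expert's action at state s.
    rollout(policy)  : runs the current policy and returns the visited states.
    train_classifier : supervised learner mapping a dataset to a policy.
    """
    dataset = list(initial_demos)
    policy = train_classifier(dataset)
    for _ in range(num_iterations):
        visited = rollout(policy)                              # run the current policy
        dataset.extend((s, query_expert(s)) for s in visited)  # aggregate new expert labels
        policy = train_classifier(dataset)                     # train a fresh policy
    return policy
```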
55 We first consider the case with little but optimal expert data, with task horizon 1000. [sent-207, score-0.401]
56 This is due to the fact that the expert constraints impose a particular shape on the value function, as noted in Section 2. [sent-212, score-0.343]
57 The supervised LfD performs the worst, as the amount of expert data is insufficient. [sent-213, score-0.411]
58 Results for the case in which the agent has more but sub-optimal expert data are shown in Figure 1b. [sent-214, score-0.389]
59 With probability 0.5, the expert gives a random action rather than the optimal action. [sent-216, score-0.363]
60 Compared to supervised LfD, APID and LSPI are both able to overcome sub-optimality in the expert’s behaviour to achieve good performance, by leveraging the RL data. [sent-217, score-0.13]
61 Next, we consider the case of abundant demonstrations from a sub-optimal expert, who gives random actions with probability 0. [sent-218, score-0.288]
62 The task horizon is reduced to 100, due to the number of demonstrations required by DAgger. [sent-220, score-0.234]
63 As can be seen in Figure 2a, the sub-optimal demonstrations cause DAgger to diverge, because it changes the policy at each iteration, based on the newly aggregated sub-optimal demonstrations. [sent-221, score-0.434]
64 APID, on the other hand, is able to learn a better policy by leveraging DRL . [sent-222, score-0.258]
65 This result illustrates well APID’s robustness to sub-optimal expert demonstrations. [sent-224, score-0.343]
66 Figure 2b shows the result for the case of optimal and abundant demonstrations. [sent-225, score-0.082]
67 4.2 Real Robot Path Finding We now evaluate the practicality of APID on a real robot path-finding task and compare it with LSPI and supervised LfD, using only one demonstrated trajectory. [sent-228, score-0.29]
68 We do not assume that the expert is optimal (and abundant), and therefore do not include DAgger, which was shown to perform poorly for this case. [sent-229, score-0.363]
69 Each iteration (X-axis) adds 100 new expert data points to APID and DAgger. [sent-232, score-0.4]
70 The robot also has three bumpers to detect a collision from the front, left, and right. [sent-238, score-0.17]
71 Figures 3a and 3b show a picture of the robot and its environment. [sent-239, score-0.17]
72 In order to reach the goal, the robot needs to turn left to avoid the first box and a wall on the right, while not turning too much, to avoid the couch. [sent-240, score-0.3]
73 Next, the robot must turn right to avoid a second box, but it must not turn too much or too soon, to avoid colliding with the wall or the first box. [sent-241, score-0.276]
74 Then, the robot needs to get into the hallway, turn right, and move forward to reach the goal position; the goal position is 6m forward and 1. [sent-242, score-0.273]
75 The robot has three discrete actions: turn left, turn right, and move forward. [sent-245, score-0.214]
76 The reward is minus the distance to the goal, but if the robot’s front bumper is pressed and the robot moves forward, it receives a penalty equal to 2 times the current distance to the goal. [sent-246, score-0.284]
77 If the robot’s left bumper is pressed and the robot does not turn right, and vice-versa, it also receives a penalty of 2 times the current distance to the goal. [sent-247, score-0.248]
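A direct transcription of this reward specification as code (a sketch: the action names mirror the description above, while the choice of adding the penalty to the base term rather than replacing it is an assumption):

```python
def path_finding_reward(distance_to_goal, action, front_bumper, left_bumper, right_bumper):
    """Reward for the robot path-finding task, as specified above.

    distance_to_goal : current distance to the goal position (metres).
    action           : one of 'forward', 'turn_left', 'turn_right'.
    *_bumper         : True if the corresponding bumper is pressed.
    """
    r = -distance_to_goal                  # base reward: minus the distance to the goal
    if front_bumper and action == 'forward':
        r -= 2.0 * distance_to_goal        # driving into a frontal obstacle
    if left_bumper and action != 'turn_right':
        r -= 2.0 * distance_to_goal        # ignoring a collision on the left
    if right_bumper and action != 'turn_left':
        r -= 2.0 * distance_to_goal        # ignoring a collision on the right
    return r
```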
78 6 minutes of exploration using an ε-greedy exploration policy (with ε decreasing over iterations). [sent-252, score-0.306]
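The ε-greedy scheme mentioned here is the usual one; a minimal sketch (the decay schedule is an assumption, not the paper's):

```python
import random


def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon pick a random action, otherwise act greedily w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q(state, a))


def epsilon_at_iteration(k, eps0=0.5, decay=0.8):
    """Example decay schedule across API iterations (assumed)."""
    return eps0 * decay ** k
```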
79 To evaluate the performance of each algorithm, we ran each iteration’s policy for a task horizon of 100 (≈ 1min); we repeated each iteration 5 times, to compute the mean and standard deviation. [sent-256, score-0.333]
80 As shown in Figure 3c, APID outperformed both LSPI and supervised LfD; in fact, these two methods could not reach the goal. [sent-257, score-0.127]
81 The supervised LfD kept running into the couch, as the state distributions of the expert and the robot differed, as noted in [1]. [sent-258, score-0.598]
82 LSPI had a problem due to exploring unnecessary states; specifically, the ε-greedy policy of LSPI explored regions of the state space that were not relevant to learning the optimal plan, such as areas far to the left of the initial position. [sent-259, score-0.291]
83 The ε-greedy policy of APID, on the other hand, was able to leverage the expert data to efficiently explore the most relevant states and avoid unnecessary collisions. [sent-260, score-0.619]
84 For example, it learned to avoid the first box in the first iteration, then explored states near the couch, where supervised LfD failed. [sent-261, score-0.125]
85 Table 1 gives the time it took for the robot to reach the goal (within 1. [sent-262, score-0.216]
86 (c) Distance to the goal for LSPI, APID and supervised LfD with random forest. [sent-276, score-0.087]
87 5 Discussion We proposed an algorithm that learns from limited demonstrations by leveraging a state-of-the-art API method. [sent-277, score-0.216]
88 Existing LfD methods relax the i.i.d. data assumptions by changing the policy slowly [23], by reducing the problem to online learning [1], by querying the expert [2], or by obtaining corrections from the expert [3]. [sent-282, score-0.959]
89 These methods all assume that the expert is optimal or close-to-optimal, and demonstration data is abundant. [sent-283, score-0.451]
90 The TAMER system [24] uses rewards provided by the expert (and possibly blended with MDP rewards), instead of assuming that an action choice is provided. [sent-284, score-0.379]
91 We considered four scenarios: very few but optimal demonstrations, a reasonable number of sub-optimal demonstrations, abundant sub-optimal demonstrations, and abundant optimal demonstrations. [sent-288, score-0.164]
92 In the real robot path-finding task, our method again outperformed LSPI and supervised LfD. [sent-291, score-0.27]
93 LSPI suffered from inefficient exploration, and supervised LfD was affected by the violation of the i.i.d. assumption on the data. [sent-292, score-0.087]
94 We note that APID accelerated API by utilizing demonstration data. [sent-296, score-0.088]
95 In contrast, APID leverages the expert data to shape the policy throughout the planning. [sent-300, score-0.604]
96 However, their approach is less general, as it assumes a particular noise model in the expert data, whereas APID is able to handle demonstrations that are sub-optimal non-uniformly along the trajectory. [sent-302, score-0.539]
97 Approximate policy iteration: A survey and some new methods. [sent-344, score-0.238]
98 Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. [sent-390, score-0.316]
99 Error propagation for approximate policy and value iteration. [sent-407, score-0.254]
100 RTMBA: A real-time model-based reinforcement learning architecture for robot control. [sent-428, score-0.227]
wordName wordTfidf (topN-words)
[('apid', 0.534), ('expert', 0.343), ('lfd', 0.327), ('lspi', 0.288), ('policy', 0.238), ('drl', 0.237), ('demonstrations', 0.196), ('robot', 0.17), ('rl', 0.161), ('dagger', 0.157), ('bellman', 0.141), ('api', 0.113), ('brake', 0.089), ('demonstration', 0.088), ('xi', 0.084), ('supervised', 0.068), ('abundant', 0.062), ('hq', 0.06), ('pedal', 0.059), ('quebec', 0.059), ('iteration', 0.057), ('reinforcement', 0.057), ('farahmand', 0.052), ('mcgill', 0.052), ('velocity', 0.05), ('agent', 0.046), ('ai', 0.043), ('behaviour', 0.042), ('rkhs', 0.041), ('montreal', 0.04), ('reward', 0.04), ('szepesv', 0.039), ('acceleration', 0.037), ('cvx', 0.036), ('kinect', 0.036), ('rewards', 0.036), ('mdp', 0.035), ('practicality', 0.034), ('exploration', 0.034), ('qk', 0.033), ('availability', 0.032), ('outperformed', 0.032), ('canada', 0.03), ('actions', 0.03), ('environment', 0.03), ('bumper', 0.03), ('couch', 0.03), ('irobot', 0.03), ('argmin', 0.029), ('reach', 0.027), ('regularizer', 0.027), ('lr', 0.026), ('pressed', 0.026), ('ri', 0.025), ('iterations', 0.024), ('might', 0.024), ('leverages', 0.023), ('school', 0.023), ('avoid', 0.022), ('turn', 0.022), ('imitation', 0.022), ('policies', 0.021), ('measurable', 0.021), ('optimal', 0.02), ('leveraging', 0.02), ('lp', 0.02), ('rmax', 0.02), ('error', 0.02), ('horizon', 0.02), ('scenarios', 0.019), ('conventional', 0.019), ('box', 0.019), ('margin', 0.019), ('zi', 0.019), ('trajectories', 0.019), ('dynamics', 0.019), ('assumptions', 0.019), ('goal', 0.019), ('ghavamzadeh', 0.019), ('violation', 0.019), ('suboptimal', 0.019), ('upper', 0.018), ('assumption', 0.018), ('task', 0.018), ('wall', 0.018), ('maxa', 0.018), ('ross', 0.018), ('front', 0.018), ('coef', 0.017), ('aggregation', 0.017), ('munos', 0.017), ('state', 0.017), ('simulated', 0.017), ('primitives', 0.017), ('approximate', 0.016), ('querying', 0.016), ('position', 0.016), ('initial', 0.016), ('states', 0.016), ('regularized', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 165 nips-2013-Learning from Limited Demonstrations
Author: Beomjoon Kim, Amir massoud Farahmand, Joelle Pineau, Doina Precup
Abstract: We propose a Learning from Demonstration (LfD) algorithm which leverages expert data, even if they are very few or inaccurate. We achieve this by using both expert data, as well as reinforcement signals gathered through trial-and-error interactions with the environment. The key idea of our approach, Approximate Policy Iteration with Demonstration (APID), is that expert’s suggestions are used to define linear constraints which guide the optimization performed by Approximate Policy Iteration. We prove an upper bound on the Bellman error of the estimate computed by APID at each iteration. Moreover, we show empirically that APID outperforms pure Approximate Policy Iteration, a state-of-the-art LfD algorithm, and supervised learning in a variety of scenarios, including when very few and/or suboptimal demonstrations are available. Our experiments include simulations as well as a real robot path-finding task. 1
2 0.20629725 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
Author: Paul Wagner
Abstract: Approximate dynamic programming approaches to the reinforcement learning problem are often categorized into greedy value function methods and value-based policy gradient methods. As our first main result, we show that an important subset of the latter methodology is, in fact, a limiting special case of a general formulation of the former methodology; optimistic policy iteration encompasses not only most of the greedy value function methods but also natural actor-critic methods, and permits one to directly interpolate between them. The resulting continuum adjusts the strength of the Markov assumption in policy improvement and, as such, can be seen as dual in spirit to the continuum in TD(λ)-style algorithms in policy evaluation. As our second main result, we show for a substantial subset of softgreedy value function approaches that, while having the potential to avoid policy oscillation and policy chattering, this subset can never converge toward an optimal policy, except in a certain pathological case. Consequently, in the context of approximations (either in state estimation or in value function representation), the majority of greedy value function methods seem to be deemed to suffer either from the risk of oscillation/chattering or from the presence of systematic sub-optimality. 1
3 0.19703491 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods
Author: Matteo Pirotta, Marcello Restelli, Luca Bascetta
Abstract: In the last decade, policy gradient methods have significantly grown in popularity in the reinforcement–learning field. In particular, they have been largely employed in motor control and robotic applications, thanks to their ability to cope with continuous state and action domains and partial observable problems. Policy gradient researches have been mainly focused on the identification of effective gradient directions and the proposal of efficient estimation algorithms. Nonetheless, the performance of policy gradient methods is determined not only by the gradient direction, since convergence properties are strongly influenced by the choice of the step size: small values imply slow convergence rate, while large values may lead to oscillations or even divergence of the policy parameters. Step–size value is usually chosen by hand tuning and still little attention has been paid to its automatic selection. In this paper, we propose to determine the learning rate by maximizing a lower bound to the expected performance gain. Focusing on Gaussian policies, we derive a lower bound that is second–order polynomial of the step size, and we show how a simplified version of such lower bound can be maximized when the gradient is estimated from trajectory samples. The properties of the proposed approach are empirically evaluated in a linear–quadratic regulator problem. 1
4 0.18623094 348 nips-2013-Variational Policy Search via Trajectory Optimization
Author: Sergey Levine, Vladlen Koltun
Abstract: In order to learn effective control policies for dynamical systems, policy search methods must be able to discover successful executions of the desired task. While random exploration can work well in simple domains, complex and highdimensional tasks present a serious challenge, particularly when combined with high-dimensional policies that make parameter-space exploration infeasible. We present a method that uses trajectory optimization as a powerful exploration strategy that guides the policy search. A variational decomposition of a maximum likelihood policy objective allows us to use standard trajectory optimization algorithms such as differential dynamic programming, interleaved with standard supervised learning for the policy itself. We demonstrate that the resulting algorithm can outperform prior methods on two challenging locomotion tasks. 1
5 0.1527805 55 nips-2013-Bellman Error Based Feature Generation using Random Projections on Sparse Spaces
Author: Mahdi Milani Fard, Yuri Grinberg, Amir massoud Farahmand, Joelle Pineau, Doina Precup
Abstract: This paper addresses the problem of automatic generation of features for value function approximation in reinforcement learning. Bellman Error Basis Functions (BEBFs) have been shown to improve policy evaluation, with a convergence rate similar to that of value iteration. We propose a simple, fast and robust algorithm based on random projections, which generates BEBFs for sparse feature spaces. We provide a finite sample analysis of the proposed method, and prove that projections logarithmic in the dimension of the original space guarantee a contraction in the error. Empirical results demonstrate the strength of this method in domains in which choosing a good state representation is challenging. 1
6 0.14700136 162 nips-2013-Learning Trajectory Preferences for Manipulators via Iterative Improvement
7 0.12761277 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs
8 0.12381994 79 nips-2013-DESPOT: Online POMDP Planning with Regularization
9 0.12331489 241 nips-2013-Optimizing Instructional Policies
10 0.12290134 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
11 0.12052383 257 nips-2013-Projected Natural Actor-Critic
12 0.12030232 255 nips-2013-Probabilistic Movement Primitives
13 0.11757749 347 nips-2013-Variational Planning for Graph-based MDPs
14 0.11517124 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs
15 0.11380485 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion
16 0.11044189 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
17 0.1027803 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
18 0.099731334 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning
19 0.093354359 325 nips-2013-The Pareto Regret Frontier
20 0.091273211 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
topicId topicWeight
[(0, 0.163), (1, -0.232), (2, -0.102), (3, 0.107), (4, -0.006), (5, 0.015), (6, -0.058), (7, 0.051), (8, -0.013), (9, 0.022), (10, 0.009), (11, -0.041), (12, -0.018), (13, -0.007), (14, 0.018), (15, 0.006), (16, -0.0), (17, -0.026), (18, -0.01), (19, 0.007), (20, -0.056), (21, -0.062), (22, 0.01), (23, 0.043), (24, -0.049), (25, -0.033), (26, 0.042), (27, 0.05), (28, 0.044), (29, -0.049), (30, 0.003), (31, 0.029), (32, 0.08), (33, -0.052), (34, -0.047), (35, 0.033), (36, 0.005), (37, -0.078), (38, -0.0), (39, -0.081), (40, -0.033), (41, 0.06), (42, 0.036), (43, -0.07), (44, 0.036), (45, -0.077), (46, -0.118), (47, -0.049), (48, 0.084), (49, -0.096)]
simIndex simValue paperId paperTitle
same-paper 1 0.91781092 165 nips-2013-Learning from Limited Demonstrations
Author: Beomjoon Kim, Amir massoud Farahmand, Joelle Pineau, Doina Precup
Abstract: We propose a Learning from Demonstration (LfD) algorithm which leverages expert data, even if they are very few or inaccurate. We achieve this by using both expert data, as well as reinforcement signals gathered through trial-and-error interactions with the environment. The key idea of our approach, Approximate Policy Iteration with Demonstration (APID), is that expert’s suggestions are used to define linear constraints which guide the optimization performed by Approximate Policy Iteration. We prove an upper bound on the Bellman error of the estimate computed by APID at each iteration. Moreover, we show empirically that APID outperforms pure Approximate Policy Iteration, a state-of-the-art LfD algorithm, and supervised learning in a variety of scenarios, including when very few and/or suboptimal demonstrations are available. Our experiments include simulations as well as a real robot path-finding task. 1
2 0.7129733 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods
Author: Matteo Pirotta, Marcello Restelli, Luca Bascetta
Abstract: In the last decade, policy gradient methods have significantly grown in popularity in the reinforcement–learning field. In particular, they have been largely employed in motor control and robotic applications, thanks to their ability to cope with continuous state and action domains and partial observable problems. Policy gradient researches have been mainly focused on the identification of effective gradient directions and the proposal of efficient estimation algorithms. Nonetheless, the performance of policy gradient methods is determined not only by the gradient direction, since convergence properties are strongly influenced by the choice of the step size: small values imply slow convergence rate, while large values may lead to oscillations or even divergence of the policy parameters. Step–size value is usually chosen by hand tuning and still little attention has been paid to its automatic selection. In this paper, we propose to determine the learning rate by maximizing a lower bound to the expected performance gain. Focusing on Gaussian policies, we derive a lower bound that is second–order polynomial of the step size, and we show how a simplified version of such lower bound can be maximized when the gradient is estimated from trajectory samples. The properties of the proposed approach are empirically evaluated in a linear–quadratic regulator problem. 1
3 0.69283581 38 nips-2013-Approximate Dynamic Programming Finally Performs Well in the Game of Tetris
Author: Victor Gabillon, Mohammad Ghavamzadeh, Bruno Scherrer
Abstract: Tetris is a video game that has been widely used as a benchmark for various optimization techniques including approximate dynamic programming (ADP) algorithms. A look at the literature of this game shows that while ADP algorithms that have been (almost) entirely based on approximating the value function (value function based) have performed poorly in Tetris, the methods that search directly in the space of policies by learning the policy parameters using an optimization black box, such as the cross entropy (CE) method, have achieved the best reported results. This makes us conjecture that Tetris is a game in which good policies are easier to represent, and thus, learn than their corresponding value functions. So, in order to obtain a good performance with ADP, we should use ADP algorithms that search in a policy space, instead of the more traditional ones that search in a value function space. In this paper, we put our conjecture to test by applying such an ADP algorithm, called classification-based modified policy iteration (CBMPI), to the game of Tetris. Our experimental results show that for the first time an ADP algorithm, namely CBMPI, obtains the best results reported in the literature for Tetris in both small 10 × 10 and large 10 × 20 boards. Although the CBMPI’s results are similar to those of the CE method in the large board, CBMPI uses considerably fewer (almost 1/6) samples (calls to the generative model) than CE. 1
4 0.68744409 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
Author: Paul Wagner
Abstract: Approximate dynamic programming approaches to the reinforcement learning problem are often categorized into greedy value function methods and value-based policy gradient methods. As our first main result, we show that an important subset of the latter methodology is, in fact, a limiting special case of a general formulation of the former methodology; optimistic policy iteration encompasses not only most of the greedy value function methods but also natural actor-critic methods, and permits one to directly interpolate between them. The resulting continuum adjusts the strength of the Markov assumption in policy improvement and, as such, can be seen as dual in spirit to the continuum in TD(λ)-style algorithms in policy evaluation. As our second main result, we show for a substantial subset of softgreedy value function approaches that, while having the potential to avoid policy oscillation and policy chattering, this subset can never converge toward an optimal policy, except in a certain pathological case. Consequently, in the context of approximations (either in state estimation or in value function representation), the majority of greedy value function methods seem to be deemed to suffer either from the risk of oscillation/chattering or from the presence of systematic sub-optimality. 1
5 0.68504524 257 nips-2013-Projected Natural Actor-Critic
Author: Philip S. Thomas, William C. Dabney, Stephen Giguere, Sridhar Mahadevan
Abstract: Natural actor-critics form a popular class of policy search algorithms for finding locally optimal policies for Markov decision processes. In this paper we address a drawback of natural actor-critics that limits their real-world applicability—their lack of safety guarantees. We present a principled algorithm for performing natural gradient descent over a constrained domain. In the context of reinforcement learning, this allows for natural actor-critic algorithms that are guaranteed to remain within a known safe region of policy space. While deriving our class of constrained natural actor-critic algorithms, which we call Projected Natural ActorCritics (PNACs), we also elucidate the relationship between natural gradient descent and mirror descent. 1
6 0.67278129 255 nips-2013-Probabilistic Movement Primitives
7 0.66441643 348 nips-2013-Variational Policy Search via Trajectory Optimization
8 0.63776523 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs
9 0.63156259 162 nips-2013-Learning Trajectory Preferences for Manipulators via Iterative Improvement
10 0.62177277 79 nips-2013-DESPOT: Online POMDP Planning with Regularization
11 0.60014188 241 nips-2013-Optimizing Instructional Policies
12 0.5819577 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
13 0.57670069 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning
14 0.52823162 55 nips-2013-Bellman Error Based Feature Generation using Random Projections on Sparse Spaces
15 0.49383059 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs
16 0.4926483 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search
17 0.48154318 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion
18 0.46108314 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
19 0.45379558 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration
20 0.42067108 347 nips-2013-Variational Planning for Graph-based MDPs
topicId topicWeight
[(2, 0.016), (16, 0.026), (33, 0.094), (34, 0.12), (41, 0.027), (49, 0.022), (56, 0.109), (70, 0.043), (77, 0.335), (85, 0.037), (89, 0.019), (93, 0.031), (95, 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.70914048 165 nips-2013-Learning from Limited Demonstrations
Author: Beomjoon Kim, Amir massoud Farahmand, Joelle Pineau, Doina Precup
Abstract: We propose a Learning from Demonstration (LfD) algorithm which leverages expert data, even if they are very few or inaccurate. We achieve this by using both expert data, as well as reinforcement signals gathered through trial-and-error interactions with the environment. The key idea of our approach, Approximate Policy Iteration with Demonstration (APID), is that expert’s suggestions are used to define linear constraints which guide the optimization performed by Approximate Policy Iteration. We prove an upper bound on the Bellman error of the estimate computed by APID at each iteration. Moreover, we show empirically that APID outperforms pure Approximate Policy Iteration, a state-of-the-art LfD algorithm, and supervised learning in a variety of scenarios, including when very few and/or suboptimal demonstrations are available. Our experiments include simulations as well as a real robot path-finding task. 1
2 0.66732609 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models
Author: Dae Il Kim, Prem Gopalan, David Blei, Erik Sudderth
Abstract: Stochastic block models characterize observed network relationships via latent community memberships. In large social networks, we expect entities to participate in multiple communities, and the number of communities to grow with the network size. We introduce a new model for these phenomena, the hierarchical Dirichlet process relational model, which allows nodes to have mixed membership in an unbounded set of communities. To allow scalable learning, we derive an online stochastic variational inference algorithm. Focusing on assortative models of undirected networks, we also propose an efficient structured mean field variational bound, and online methods for automatically pruning unused communities. Compared to state-of-the-art online learning methods for parametric relational models, we show significantly improved perplexity and link prediction accuracy for sparse networks with tens of thousands of nodes. We also showcase an analysis of LittleSis, a large network of who-knows-who at the heights of business and government. 1
3 0.64228648 326 nips-2013-The Power of Asymmetry in Binary Hashing
Author: Behnam Neyshabur, Nati Srebro, Ruslan Salakhutdinov, Yury Makarychev, Payman Yadollahpour
Abstract: When approximating binary similarity using the hamming distance between short binary hashes, we show that even if the similarity is symmetric, we can have shorter and more accurate hashes by using two distinct code maps. I.e. by approximating the similarity between x and x as the hamming distance between f (x) and g(x ), for two distinct binary codes f, g, rather than as the hamming distance between f (x) and f (x ). 1
4 0.63751656 124 nips-2013-Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting
Author: Shunan Zhang, Angela J. Yu
Abstract: How humans achieve long-term goals in an uncertain environment, via repeated trials and noisy observations, is an important problem in cognitive science. We investigate this behavior in the context of a multi-armed bandit task. We compare human behavior to a variety of models that vary in their representational and computational complexity. Our result shows that subjects’ choices, on a trial-totrial basis, are best captured by a “forgetful” Bayesian iterative learning model [21] in combination with a partially myopic decision policy known as Knowledge Gradient [7]. This model accounts for subjects’ trial-by-trial choice better than a number of other previously proposed models, including optimal Bayesian learning and risk minimization, ε-greedy and win-stay-lose-shift. It has the added benefit of being closest in performance to the optimal Bayesian model than all the other heuristic models that have the same computational complexity (all are significantly less complex than the optimal model). These results constitute an advancement in the theoretical understanding of how humans negotiate the tension between exploration and exploitation in a noisy, imperfectly known environment. 1
5 0.50542617 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
Author: Xiaoxiao Guo, Satinder Singh, Richard L. Lewis
Abstract: We consider how to transfer knowledge from previous tasks (MDPs) to a current task in long-lived and bounded agents that must solve a sequence of tasks over a finite lifetime. A novel aspect of our transfer approach is that we reuse reward functions. While this may seem counterintuitive, we build on the insight of recent work on the optimal rewards problem that guiding an agent’s behavior with reward functions other than the task-specifying reward function can help overcome computational bounds of the agent. Specifically, we use good guidance reward functions learned on previous tasks in the sequence to incrementally train a reward mapping function that maps task-specifying reward functions into good initial guidance reward functions for subsequent tasks. We demonstrate that our approach can substantially improve the agent’s performance relative to other approaches, including an approach that transfers policies. 1
6 0.50356692 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
7 0.50288957 79 nips-2013-DESPOT: Online POMDP Planning with Regularization
8 0.50073546 249 nips-2013-Polar Operators for Structured Sparse Estimation
9 0.50046122 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
10 0.49853414 348 nips-2013-Variational Policy Search via Trajectory Optimization
11 0.4984729 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations
12 0.49787712 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
13 0.4974944 184 nips-2013-Marginals-to-Models Reducibility
14 0.49637389 357 nips-2013-k-Prototype Learning for 3D Rigid Structures
15 0.4963018 77 nips-2013-Correlations strike back (again): the case of associative memory retrieval
16 0.49605739 101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models
17 0.49553552 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning
18 0.49457172 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs
19 0.49456707 55 nips-2013-Bellman Error Based Feature Generation using Random Projections on Sparse Spaces
20 0.4944936 102 nips-2013-Efficient Algorithm for Privately Releasing Smooth Queries