nips nips2011 nips2011-163 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jaedeug Choi, Kee-eung Kim
Abstract: The difficulty in inverse reinforcement learning (IRL) arises in choosing the best reward function since there are typically an infinite number of reward functions that yield the given behaviour data as optimal. Using a Bayesian framework, we address this challenge by using the maximum a posteriori (MAP) estimation for the reward function, and show that most of the previous IRL algorithms can be modeled into our framework. We also present a gradient method for the MAP estimation based on the (sub)differentiability of the posterior distribution. We show the effectiveness of our approach by comparing the performance of the proposed method to those of the previous algorithms. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract The difficulty in inverse reinforcement learning (IRL) arises in choosing the best reward function since there are typically an infinite number of reward functions that yield the given behaviour data as optimal. [sent-7, score-1.306]
2 Using a Bayesian framework, we address this challenge by using the maximum a posteriori (MAP) estimation for the reward function, and show that most of the previous IRL algorithms can be modeled into our framework. [sent-8, score-0.505]
3 We also present a gradient method for the MAP estimation based on the (sub)differentiability of the posterior distribution. [sent-9, score-0.154]
4 1 Introduction The objective of inverse reinforcement learning (IRL) is to determine the decision-making agent's underlying reward function from its behaviour data and the model of the environment [1]. [sent-11, score-0.801]
5 In animal and human behaviour studies [2], the agent's behaviour can be understood in terms of the reward function, since the reward function reflects the agent's objectives and preferences. [sent-13, score-1.354]
6 In robotics [3], IRL provides a framework for making robots learn to imitate the demonstrator’s behaviour using the inferred reward function. [sent-14, score-0.677]
7 In other areas related to reinforcement learning, such as neuroscience [4] and economics [5], IRL addresses the non-trivial problem of finding an appropriate reward function when building a computational model for decision making. [sent-15, score-0.582]
8 In IRL, we generally assume that the agent is an expert in the problem domain and hence it behaves optimally in the environment. [sent-16, score-0.225]
9 Using the Markov decision process (MDP) formalism, the IRL problem is defined as finding the reward function that the expert is optimizing given the behaviour data of state-action histories and the environment model of state transition probabilities. [sent-17, score-0.909]
10 In the last decade, a number of studies have addressed IRL in a direct (reward learning) and indirect (policy learning by inferring the reward function, i.e., apprenticeship learning) manner. [sent-18, score-0.523]
11 Ng and Russell [6] proposed a sufficient and necessary condition on the reward functions that guarantees the optimality of the expert’s policy and formulated a linear programming (LP) problem to find the reward function from the behaviour data. [sent-21, score-1.603]
12 Extending their work, Abbeel and Ng [7] presented an algorithm for finding the expert’s policy from its behaviour data with a performance guarantee on the learned policy. [sent-22, score-0.514]
13 Ratliff et al. [8] applied structured max-margin optimization to IRL and proposed a method for finding the reward function that maximizes the margin between the expert's policy and all other policies. [sent-24, score-0.852]
14 Neu and Szepesvári [9] provided an algorithm for finding the policy that minimizes the deviation from the behaviour. [sent-25, score-0.321]
15 Their algorithm unifies the direct method that minimizes a loss function of the deviation and the indirect method that finds an optimal policy from the learned reward function using IRL. [sent-26, score-0.891]
16 Syed and Schapire [10] proposed a method to find a policy that improves the expert’s policy using a game-theoretic framework. [sent-27, score-0.642]
17 Ziebart et al. [11] adopted the principle of maximum entropy for learning the policy whose feature expectations are constrained to match those of the expert's behaviour. [sent-29, score-0.321]
18 IRL is an inherently ill-posed problem since there may be an infinite number of reward functions that yield the expert’s policy as optimal. [sent-31, score-0.826]
19 Previous approaches summarized above employ various preferences on the reward function to address the non-uniqueness. [sent-32, score-0.505]
20 For example, Ng and Russell [6] search for the reward function that maximizes the difference in the values of the expert’s policy and the second best policy. [sent-33, score-0.852]
21 More recently, Ramachandran and Amir [13] presented a Bayesian approach formulating the reward preference as the prior and the behaviour compatibility as the likelihood, and proposed a Markov chain Monte Carlo (MCMC) algorithm to find the posterior mean of the reward function. [sent-34, score-1.468]
22 This is achieved by searching for the maximum-a-posteriori (MAP) reward function, in contrast to computing the posterior mean. [sent-36, score-0.609]
23 We show that the posterior mean can be problematic for the reward inference since the loss function is integrated over the entire reward space, even including those inconsistent with the behaviour data. [sent-37, score-1.369]
24 Hence, the inferred reward function can induce a policy much different from the expert’s policy. [sent-38, score-0.826]
25 The MAP estimate, however, is more robust in the sense that the objective function (the posterior probability in our case) is evaluated on a single reward function. [sent-39, score-0.609]
26 In order to find the MAP reward function, we present a gradient method using the differentiability result of the posterior, and show the effectiveness of our approach through experiments. [sent-40, score-0.587]
27 Using matrix notations, the transition function is denoted as an |S||A| × |S| matrix T , and the reward function is denoted as an |S||A|-dimensional vector R. [sent-43, score-0.505]
28 The value of policy π is the expected discounted return of executing the policy, defined as V^π = E[ Σ_{t=0}^{∞} γ^t R(s_t, a_t) | α, π ], where the initial state s_0 is determined according to the initial state distribution α and action a_t is chosen by policy π in state s_t. [sent-45, score-1.141]
29 The value function of policy π for each state s is computed by V^π(s) = R(s, π(s)) + γ Σ_{s′∈S} T(s, π(s), s′) V^π(s′), so that the value of policy π is calculated by V^π = Σ_s α(s) V^π(s). [sent-46, score-0.683]
30 An optimal policy π∗ maximizes the value function for all states, and thus should satisfy the Bellman optimality equation: π is an optimal policy if and only if, for all s ∈ S, π(s) ∈ argmax_{a∈A} Q^π(s, a). [sent-49, score-0.82]
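To make the matrix form above concrete, here is a minimal numpy sketch of policy evaluation and the Bellman optimality check; the array layout (row s·|A|+a for the state-action pair (s, a)) and all function names are our own illustrative choices, not notation from the paper.

    import numpy as np

    def evaluate_policy(T, R, policy, gamma, n_states, n_actions):
        # T: |S||A| x |S| transition matrix, R: |S||A| reward vector,
        # policy: deterministic policy given as an array of action indices (length |S|).
        idx = np.arange(n_states) * n_actions + policy   # rows (s, pi(s))
        T_pi, R_pi = T[idx], R[idx]
        # V^pi = (I - gamma T^pi)^{-1} R^pi
        V = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R_pi)
        # Q^pi(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') V^pi(s')
        Q = (R + gamma * T @ V).reshape(n_states, n_actions)
        return V, Q

    def satisfies_bellman_optimality(V, Q, tol=1e-8):
        # pi is optimal iff V^pi(s) = max_a Q^pi(s, a) for all s.
        return bool(np.all(V >= Q.max(axis=1) - tol))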
31 When the state space is large, the reward function is often linearly parameterized: R(s, a) = Σ_{i=1}^{d} w_i φ_i(s, a) with the basis functions φ_i : S × A → R and the weight vector w = [w_1, w_2, ..., w_d]. [sent-51, score-0.589]
32 Each basis function φ_i has a corresponding basis value V_i^π of policy π: V_i^π = E[ Σ_{t=0}^{∞} γ^t φ_i(s_t, a_t) | α, π ]. [sent-52, score-0.371]
33 We also assume that the expert's behaviour is given as the set X of M trajectories executed by the expert's policy π_E, where the m-th trajectory is an H-step sequence of state-action pairs: {(s_1^m, a_1^m), (s_2^m, a_2^m), ..., (s_H^m, a_H^m)}. [sent-53, score-0.555]
34 Given the set of trajectories, the value and the basis values of the expert's policy π_E can be empirically estimated by V̂_E = (1/M) Σ_{m=1}^{M} Σ_{h=1}^{H} γ^{h−1} R(s_h^m, a_h^m) and V̂_{i,E} = (1/M) Σ_{m=1}^{M} Σ_{h=1}^{H} γ^{h−1} φ_i(s_h^m, a_h^m). [sent-54, score-0.346]
35 In addition, we can empirically estimate the expert's policy π̂_E and its state visitation frequency µ̂_E from the trajectories: π̂_E(s, a) = [Σ_{m=1}^{M} Σ_{h=1}^{H} 1(s_h^m = s ∧ a_h^m = a)] / [Σ_{m=1}^{M} Σ_{h=1}^{H} 1(s_h^m = s)], and µ̂_E(s) = (1/(MH)) Σ_{m=1}^{M} Σ_{h=1}^{H} 1(s_h^m = s). [sent-55, score-0.362]
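A sketch of the empirical estimates above, assuming each trajectory is a length-H list of (state, action) index pairs and phi is a |S||A| x d feature matrix; the loop uses gamma**h with h starting at 0, which matches the γ^{h−1} weighting with h starting at 1. All names are illustrative, not the paper's.

    import numpy as np

    def empirical_estimates(trajectories, R, phi, gamma, n_states, n_actions):
        M, H = len(trajectories), len(trajectories[0])   # M trajectories of H steps each
        V_E, Vi_E = 0.0, np.zeros(phi.shape[1])
        visits_sa = np.zeros((n_states, n_actions))
        for traj in trajectories:
            for h, (s, a) in enumerate(traj):
                sa = s * n_actions + a
                V_E += gamma**h * R[sa] / M              # empirical value of pi_E
                Vi_E += gamma**h * phi[sa] / M           # empirical basis values
                visits_sa[s, a] += 1
        visits_s = np.maximum(visits_sa.sum(axis=1, keepdims=True), 1)
        pi_E = visits_sa / visits_s                      # empirical expert policy
        mu_E = visits_sa.sum(axis=1) / (M * H)           # state visitation frequency
        return V_E, Vi_E, pi_E, mu_E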
36 In the rest of the paper, we use the notation f(R) or f(x; R) for function f in order to be explicit that f is computed using reward function R. [sent-56, score-0.505]
37 For example, the value function V π (s; R) denotes the value of policy π for state s using reward function R. [sent-57, score-0.867]
38 2 Reward Optimality Condition Ng and Russell [6] presented a necessary and sufficient condition for reward function R of an MDP to guarantee the optimality of policy π: Q_a^π(R) ≤ V^π(R) for all a ∈ A. [sent-59, score-0.926]
39 We refer to Equation (2) as the reward optimality condition w.r.t. policy π. [sent-62, score-0.605]
40 Since the linear inequalities define the region of the reward functions that yield policy π as optimal, we refer to the region bounded by Equation (2) as the reward optimality region w.r.t. policy π. [sent-66, score-1.545]
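For a given R, membership in this region can be checked directly from Equation (2), reusing the evaluate_policy sketch shown earlier; this is only an illustrative check under the same assumed array layout.

    import numpy as np

    def in_reward_optimality_region(T, R, policy, gamma, n_states, n_actions, tol=1e-9):
        # Equation (2): pi is optimal under R iff Q^pi(s, a; R) <= V^pi(s; R) for all s, a.
        V, Q = evaluate_policy(T, R, policy, gamma, n_states, n_actions)
        return bool(np.all(Q <= V[:, None] + tol))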
41 Note that there exist infinitely many reward functions in the reward optimality region, even including constant reward functions (e.g., R = 0). [sent-70, score-1.653]
42 In other words, even when we are presented with the expert’s policy, there are infinitely many reward functions to choose from, including the degenerate ones. [sent-73, score-0.505]
43 To resolve this non-uniqueness in solutions, IRL algorithms in the literature employ various preferences on reward functions. [sent-74, score-0.505]
44 3 Bayesian framework for IRL (BIRL) Ramachandran and Amir [13] proposed a Bayesian framework for IRL by encoding the reward function preference as the prior and the optimality confidence of the behaviour data as the likelihood. [sent-76, score-0.855]
45 For example, the uniform prior can be used if we have no knowledge about the reward function other than its range, and a Gaussian or a Laplacian prior can be used if we prefer rewards to be close to some specific values. [sent-83, score-0.638]
46 The posterior over the reward function is then formulated by combining the prior and the likelihood, using Bayes' theorem: P(R|X) ∝ P(X|R)P(R). [sent-94, score-0.655]
47 BIRL uses a Markov chain Monte Carlo (MCMC) algorithm to compute the posterior mean of the reward function. [sent-95, score-0.661]
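As a sketch of this posterior, the following assumes an independent Gaussian prior on each reward entry and the Boltzmann-style likelihood of [13] with a confidence parameter here called eta; both are examples of the prior/likelihood pairs the framework admits, and the value-iteration loop and all names are our own.

    import numpy as np

    def log_posterior(R, trajectories, T, gamma, n_states, n_actions,
                      eta=1.0, prior_mean=0.0, prior_std=1.0):
        # Log of a Gaussian prior over the reward vector (up to an additive constant).
        log_prior = -0.5 * np.sum(((R - prior_mean) / prior_std) ** 2)
        # Optimal Q-values for the current reward, by value iteration.
        Q = np.zeros((n_states, n_actions))
        for _ in range(2000):
            Q_new = (R + gamma * T @ Q.max(axis=1)).reshape(n_states, n_actions)
            converged = np.max(np.abs(Q_new - Q)) < 1e-10
            Q = Q_new
            if converged:
                break
        # Boltzmann likelihood: P(a | s, R) = exp(eta Q*(s, a)) / sum_a' exp(eta Q*(s, a')).
        logZ = np.log(np.exp(eta * Q).sum(axis=1))   # assumes moderate eta*Q; use log-sum-exp otherwise
        log_lik = sum(eta * Q[s, a] - logZ[s]
                      for traj in trajectories for (s, a) in traj)
        return log_lik + log_prior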
48 3 MAP Inference in Bayesian IRL In the Bayesian approach to IRL, the reward function can be determined using different estimates, such as the posterior mean, median, or maximum-a-posteriori (MAP). [sent-96, score-0.609]
49 The posterior mean is commonly used since it can be shown to be optimal under the mean square error function. [sent-97, score-0.18]
50 However, the problem with the posterior mean in Bayesian IRL is that the error is integrated over the entire space of reward functions, even including infinitely many rewards that induce policies inconsistent with the behaviour data. [sent-98, score-0.902]
51 This can yield a posterior mean reward function whose optimal policy is again inconsistent with the data. [sent-99, score-1.018]
52 On the other hand, the MAP does not involve an objective function that is integrated over the reward function space; it is simply a point that maximizes the posterior probability. [sent-100, score-0.656]
53 Hence, it is more robust to infinitely many inconsistent reward functions. [sent-101, score-0.542]
54 We present a simple example that compares the posterior mean and the MAP reward function estimation. [sent-102, score-0.634]
55 1, 0, 0, 0, 1], hence the optimal policy chooses a1 in every state. [sent-111, score-0.347]
56 Suppose that we already know R(s2 ), R(s3 ), and R(s4 ) which are all 0, and estimate R(s1 ) and R(s5 ) from the behaviour data X which contains optimal actions for all the states. [sent-112, score-0.218]
57 The reward optimality region can also be computed using Equation (2). [sent-115, score-0.643]
58 Figure 1(b) presents the posterior distribution of the reward function. [sent-116, score-0.609]
59 The true reward, the MAP reward, and the posterior mean reward are marked with the black star, the blue circle, and the red cross, respectively. [sent-117, score-0.652]
60 The black solid line is the boundary of the reward optimality region. [sent-118, score-0.605]
61 Although the prior mean is set to the true reward, the posterior mean is outside the reward optimality region. [sent-119, score-0.823]
62 An optimal policy for the posterior mean reward function chooses action a2 rather than action a1 in state s1 , while an optimal policy for the MAP reward function is identical to the true one. [sent-120, score-1.964]
63 An optimal policy for the posterior mean reward function chooses action a2 in states s1 and s2 , while an optimal policy for the MAP reward function is again identical to the true one. [sent-122, score-1.887]
64 In the rest of this section, we additionally show that most of the IRL algorithms in the literature can be cast as searching for the MAP reward function in Bayesian IRL. [sent-123, score-0.505]
65 The main insight comes from the fact that these algorithms try to optimize an objective function consisting of a regularization term for the preference on the reward function and an assessment term for the compatibility of the reward function with the behaviour data. [sent-124, score-1.285]
66 The objective function is naturally formulated as the posterior in a Bayesian framework by encoding the regularization into the prior and the data compatibility into the likelihood. [sent-125, score-0.202]
67 Note that this framework can exploit the insights behind apprenticeship learning algorithms even if they do not explicitly learn a reward function (e. [sent-131, score-0.578]
68 The likelihood is defined for measuring the compatibility of the reward function R with the behaviour data X . [sent-136, score-0.757]
69–70 For example, the empirical value of X is compared with V∗ [6, 8], X is directly compared to the learned policy (e.g. the greedy policy from Q∗) [9], or the probability of following the trajectories in X is computed using Q∗ [13]. [sent-141, score-0.342] [sent-143, score-0.364]
71 We have obtained the same result based on the reward optimality region, and additionally identified the condition under which V∗(R) and Q∗(R) are non-differentiable (refer to the proof for details). [sent-156, score-0.605]
72 Assuming a differentiable prior, we can compute the gradient of the posterior using the result in Theorem 3 and the chain rule. [sent-160, score-0.217]
73 If the posterior is convex, we will find the MAP reward function. [sent-161, score-0.609]
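A sketch of the resulting update loop: gradient ascent on the log-posterior, here with a finite-difference gradient as a stand-in for the analytic (sub)gradient given by Theorem 3, so this illustrates only the outer loop, not the proposed algorithm itself; the step size, iteration count, and reward range are arbitrary assumptions.

    import numpy as np

    def map_estimate(log_posterior_fn, R_init, step=0.05, n_iters=200, eps=1e-5, r_max=1.0):
        R = R_init.copy()
        for _ in range(n_iters):
            base = log_posterior_fn(R)
            grad = np.zeros_like(R)
            for i in range(len(R)):                      # finite-difference stand-in for the analytic gradient
                R_pert = R.copy()
                R_pert[i] += eps
                grad[i] = (log_posterior_fn(R_pert) - base) / eps
            R = np.clip(R + step * grad, -r_max, r_max)  # keep rewards in an assumed range [-Rmax, Rmax]
        return R

Here log_posterior_fn could be a closure over the log_posterior sketch above with the trajectories and model fixed.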
74 In the next section, we will experimentally show that the locally optimal solutions are nonetheless better than the posterior mean in practice. [sent-163, score-0.155]
75 This is due to the property that they are generally found within the reward optimality region w.r.t. the expert's policy. [sent-164, score-0.643]
76 Since computing ∇R P (R|X ) involves computing an optimal policy for the current reward function and a matrix inversion, caching these results helps reduce repetitive computation. [sent-169, score-0.874]
77 The idea is to compute the reward optimality region for checking whether we can reuse the cached result. [sent-170, score-0.686]
78 If Rnew is inside the reward optimality region of an already visited reward function R′ , they share the same optimal policy and hence the same ∇R V π (R) or ∇R Qπ (R). [sent-171, score-1.495]
79 Given policy π, the reward optimality region is defined by H^π = I − (I^A − γT)(I − γT^π)^{−1}E^π, and we can reuse the cached result if H^π · Rnew ≤ 0. [sent-172, score-1.007]
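The matrix H^π can be assembled directly; here I^A denotes the |S||A| x |S| matrix that copies each state's value to all of its state-action rows, and the sketch (same assumed layout as before) only illustrates the cache-reuse test H^π · Rnew ≤ 0.

    import numpy as np

    def optimality_region_matrix(T, policy, gamma, n_states, n_actions):
        idx = np.arange(n_states) * n_actions + policy
        T_pi = T[idx]                                           # |S| x |S|
        E_pi = np.zeros((n_states, n_states * n_actions))
        E_pi[np.arange(n_states), idx] = 1.0                    # E^pi R picks out R(s, pi(s))
        I_A = np.repeat(np.eye(n_states), n_actions, axis=0)    # |S||A| x |S|
        inv = np.linalg.inv(np.eye(n_states) - gamma * T_pi)
        # H^pi = I - (I^A - gamma T)(I - gamma T^pi)^{-1} E^pi
        return np.eye(n_states * n_actions) - (I_A - gamma * T) @ inv @ E_pi

    def can_reuse_cached_gradient(H_pi, R_new, tol=1e-9):
        # R_new lies in the reward optimality region of pi iff H^pi R_new <= 0.
        return bool(np.all(H_pi @ R_new <= tol))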
80 The linearly parameterized reward function is determined by the weight vector w sampled i.i.d. [sent-179, score-0.523]
81 We evaluated the performance of the algorithms using the difference between V∗ (the value of the expert's policy) and V^L (the value, measured on the true weight w∗, of the optimal policy induced by the learned weight w^L), and the difference between w∗ and w^L using the L2 norm. [sent-189, score-0.422]
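A small sketch of these two measures, assuming a hypothetical helper value_of_optimal_policy(w_policy, w_eval) that returns the value, under reward weights w_eval, of the optimal policy computed for reward weights w_policy.

    import numpy as np

    def evaluation_metrics(w_true, w_learned, value_of_optimal_policy):
        V_star = value_of_optimal_policy(w_true, w_true)      # value of the expert's policy
        V_L = value_of_optimal_policy(w_learned, w_true)      # learned policy evaluated on the true weight
        return V_star - V_L, np.linalg.norm(w_true - w_learned)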
82 Most of the algorithms found the weight that induces an optimal policy whose performance is as good as that of the expert's policy (i.e., V∗ − V^L ≈ 0). [sent-324, score-0.686]
83 The poor performance of NatPM may be due to the ineffectiveness of the pseudo-metric in high-dimensional reward spaces, since PlainPM was able to produce good performance. [sent-328, score-0.505]
84 BIRL takes much more CPU time to converge than MAP-B since the former requires a much larger number of iterations to converge and, in addition, each iteration requires solving an MDP with a sampled reward function. [sent-337, score-0.505]
85 The CPU time gap gets larger as we increase the dimension of the reward function. [sent-338, score-0.505]
86 The velocities in the vertical and horizontal directions are represented as 0, 1, or 2, and the net velocity is computed as the squared sum of directional velocities. [sent-346, score-0.152]
87 If the control fails, the velocity is maintained, and if the car attempts to move outside the racetrack, it remains in the previous location with velocity 0. [sent-352, score-0.262]
88 The basis functions are 0-1 indicator functions for the goal locations, off-track locations, and 3 net velocity values (zero, low, high) while the car is on track. [sent-353, score-0.206]
89–90 [Table residue: per-policy statistics comparing the fast expert, BIRL, and MAP-B in terms of goal and off-track visitation, on-track velocity (zero, low, high), and steps to the goal.] [sent-358, score-0.191] [sent-387, score-0.297]
91 The slow expert prefers low velocity and avoids the off-track locations, w = [1, −0. [sent-418, score-0.321]
92 The fast expert prefers high velocity, w = [1, 0, 0, 0, 0. [sent-421, score-0.215]
93 We compared the posterior mean and the MAP using the prior P(w1) = N(1, 1) and P(w2) = P(w3) = P(w4) = P(w5) = N(0, 1), assuming that we do not know the experts' preferences for the locations or the velocity, but we know that the experts' ultimate goal is to reach one of the goal locations. [sent-423, score-0.241]
94 We used BIRL for the posterior mean and MAP-B for the MAP estimation, hence using the identical prior and likelihood. [sent-424, score-0.175]
95 We omit the results regarding the slow expert since both BIRL and MAP-B successfully found a weight similar to the true one, which induced the slow expert's policy as optimal. [sent-426, score-0.548]
96 6 Conclusion We have argued that, when using a Bayesian framework for learning reward functions in IRL, the MAP estimate is preferable to the posterior mean. [sent-431, score-0.609]
97 We have also shown that the MAP estimation approach subsumes non-Bayesian IRL algorithms in the literature, and allows us to incorporate various types of a priori knowledge about the reward functions and measures of compatibility with the behaviour data. [sent-433, score-0.729]
98 We proved that the generalized posterior is differentiable almost everywhere, and proposed a gradient method to find a locally optimal solution to the MAP estimation. [sent-434, score-0.216]
99 We provided the theoretical result equivalent to the previous work on gradient methods for non-Bayesian IRL, but used a different proof based on the reward optimality region. [sent-435, score-0.655]
100 Apprenticeship learning using inverse reinforcement learning and gradient methods. [sent-490, score-0.174]
wordName wordTfidf (topN-words)
[('irl', 0.505), ('reward', 0.505), ('policy', 0.321), ('birl', 0.319), ('expert', 0.191), ('behaviour', 0.172), ('rnew', 0.128), ('map', 0.107), ('velocity', 0.106), ('posterior', 0.104), ('optimality', 0.1), ('wl', 0.088), ('mmp', 0.077), ('reinforcement', 0.077), ('apprenticeship', 0.073), ('neu', 0.07), ('plainpm', 0.064), ('cpu', 0.062), ('sm', 0.061), ('mdp', 0.052), ('compatibility', 0.052), ('korea', 0.052), ('maxent', 0.052), ('gradient', 0.05), ('car', 0.05), ('natpm', 0.048), ('inverse', 0.047), ('bayesian', 0.046), ('prior', 0.046), ('trajectories', 0.043), ('ng', 0.041), ('state', 0.041), ('szepesv', 0.04), ('russell', 0.039), ('ratliff', 0.039), ('ramachandran', 0.039), ('fv', 0.039), ('region', 0.038), ('inconsistent', 0.037), ('action', 0.036), ('fe', 0.036), ('gridworld', 0.036), ('differentiable', 0.036), ('race', 0.034), ('agent', 0.034), ('locations', 0.034), ('sec', 0.033), ('preference', 0.032), ('argmaxr', 0.032), ('computerewardoptrgn', 0.032), ('fg', 0.032), ('pmaxent', 0.032), ('racetrack', 0.032), ('rmap', 0.032), ('solvemdp', 0.032), ('vl', 0.032), ('differentiability', 0.032), ('dim', 0.031), ('rmax', 0.03), ('discount', 0.029), ('likelihood', 0.028), ('nitely', 0.028), ('mwal', 0.028), ('ziebart', 0.028), ('chain', 0.027), ('optimal', 0.026), ('sh', 0.026), ('maximizes', 0.026), ('syed', 0.026), ('net', 0.025), ('mean', 0.025), ('basis', 0.025), ('cached', 0.024), ('abbeel', 0.024), ('prefers', 0.024), ('bagnell', 0.023), ('uniform', 0.022), ('caching', 0.022), ('fj', 0.022), ('integrated', 0.021), ('directional', 0.021), ('amir', 0.021), ('element', 0.021), ('learned', 0.021), ('mcmc', 0.021), ('actions', 0.02), ('moves', 0.019), ('choi', 0.019), ('executing', 0.019), ('ra', 0.019), ('reuse', 0.019), ('nding', 0.019), ('policies', 0.019), ('rewards', 0.019), ('executed', 0.019), ('assessment', 0.019), ('weight', 0.018), ('defense', 0.018), ('indirect', 0.018), ('true', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning
Author: Jaedeug Choi, Kee-eung Kim
Abstract: The difficulty in inverse reinforcement learning (IRL) arises in choosing the best reward function since there are typically an infinite number of reward functions that yield the given behaviour data as optimal. Using a Bayesian framework, we address this challenge by using the maximum a posteriori (MAP) estimation for the reward function, and show that most of the previous IRL algorithms can be modeled into our framework. We also present a gradient method for the MAP estimation based on the (sub)differentiability of the posterior distribution. We show the effectiveness of our approach by comparing the performance of the proposed method to those of the previous algorithms. 1
2 0.37749866 190 nips-2011-Nonlinear Inverse Reinforcement Learning with Gaussian Processes
Author: Sergey Levine, Zoran Popovic, Vladlen Koltun
Abstract: We present a probabilistic algorithm for nonlinear inverse reinforcement learning. The goal of inverse reinforcement learning is to learn the reward function in a Markov decision process from expert demonstrations. While most prior inverse reinforcement learning algorithms represent the reward as a linear combination of a set of features, we use Gaussian processes to learn the reward as a nonlinear function, while also determining the relevance of each feature to the expert’s policy. Our probabilistic algorithm allows complex behaviors to be captured from suboptimal stochastic demonstrations, while automatically balancing the simplicity of the learned reward structure against its consistency with the observed actions. 1
3 0.27795753 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
Author: Paul Wagner
Abstract: A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view to this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm that can interpolate between the aforementioned approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance in the Tetris benchmark problem of all attempted approximate dynamic programming methods. We report empirical evidence against such a connection and in favor of an alternative explanation. Finally, we report scores in the Tetris problem that improve on existing dynamic programming based results. 1
4 0.25070056 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
Author: Dylan A. Simon, Nathaniel D. Daw
Abstract: There is much evidence that humans and other animals utilize a combination of model-based and model-free RL methods. Although it has been proposed that these systems may dominate according to their relative statistical efficiency in different circumstances, there is little specific evidence — especially in humans — as to the details of this trade-off. Accordingly, we examine the relative performance of different RL approaches under situations in which the statistics of reward are differentially noisy and volatile. Using theory and simulation, we show that model-free TD learning is relatively most disadvantaged in cases of high volatility and low noise. We present data from a decision-making experiment manipulating these parameters, showing that humans shift learning strategies in accord with these predictions. The statistical circumstances favoring model-based RL are also those that promote a high learning rate, which helps explain why, in psychology, the distinction between these strategies is traditionally conceived in terms of rulebased vs. incremental learning. 1
5 0.21537831 215 nips-2011-Policy Gradient Coagent Networks
Author: Philip S. Thomas
Abstract: We present a novel class of actor-critic algorithms for actors consisting of sets of interacting modules. We present, analyze theoretically, and empirically evaluate an update rule for each module, which requires only local information: the module’s input, output, and the TD error broadcast by a critic. Such updates are necessary when computation of compatible features becomes prohibitively difficult and are also desirable to increase the biological plausibility of reinforcement learning methods. 1
6 0.19040303 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions
7 0.17351264 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning
8 0.14979292 10 nips-2011-A Non-Parametric Approach to Dynamic Programming
9 0.14078312 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
10 0.13689879 11 nips-2011-A Reinforcement Learning Theory for Homeostatic Regulation
11 0.12371741 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations
12 0.12296474 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning
13 0.10341699 56 nips-2011-Committing Bandits
14 0.10219061 283 nips-2011-The Fixed Points of Off-Policy TD
15 0.098365627 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
16 0.09172634 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits
17 0.090929106 50 nips-2011-Budgeted Optimization with Concurrent Stochastic-Duration Experiments
18 0.086613767 48 nips-2011-Blending Autonomous Exploration and Apprenticeship Learning
19 0.082262807 32 nips-2011-An Empirical Evaluation of Thompson Sampling
20 0.081858143 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation
topicId topicWeight
[(0, 0.164), (1, -0.231), (2, 0.135), (3, 0.286), (4, -0.292), (5, 0.038), (6, 0.026), (7, 0.044), (8, 0.009), (9, 0.052), (10, -0.033), (11, 0.04), (12, -0.027), (13, -0.03), (14, -0.075), (15, -0.058), (16, 0.077), (17, 0.143), (18, 0.052), (19, -0.042), (20, -0.024), (21, -0.029), (22, -0.022), (23, 0.029), (24, 0.04), (25, -0.041), (26, 0.089), (27, -0.053), (28, 0.11), (29, 0.048), (30, 0.119), (31, -0.063), (32, 0.022), (33, -0.075), (34, 0.032), (35, -0.054), (36, -0.065), (37, 0.001), (38, 0.057), (39, -0.008), (40, -0.009), (41, -0.258), (42, -0.116), (43, -0.094), (44, 0.017), (45, -0.09), (46, 0.119), (47, 0.074), (48, -0.016), (49, -0.106)]
simIndex simValue paperId paperTitle
same-paper 1 0.97444272 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning
Author: Jaedeug Choi, Kee-eung Kim
Abstract: The difficulty in inverse reinforcement learning (IRL) arises in choosing the best reward function since there are typically an infinite number of reward functions that yield the given behaviour data as optimal. Using a Bayesian framework, we address this challenge by using the maximum a posteriori (MAP) estimation for the reward function, and show that most of the previous IRL algorithms can be modeled into our framework. We also present a gradient method for the MAP estimation based on the (sub)differentiability of the posterior distribution. We show the effectiveness of our approach by comparing the performance of the proposed method to those of the previous algorithms. 1
2 0.85211831 190 nips-2011-Nonlinear Inverse Reinforcement Learning with Gaussian Processes
Author: Sergey Levine, Zoran Popovic, Vladlen Koltun
Abstract: We present a probabilistic algorithm for nonlinear inverse reinforcement learning. The goal of inverse reinforcement learning is to learn the reward function in a Markov decision process from expert demonstrations. While most prior inverse reinforcement learning algorithms represent the reward as a linear combination of a set of features, we use Gaussian processes to learn the reward as a nonlinear function, while also determining the relevance of each feature to the expert’s policy. Our probabilistic algorithm allows complex behaviors to be captured from suboptimal stochastic demonstrations, while automatically balancing the simplicity of the learned reward structure against its consistency with the observed actions. 1
3 0.74722809 11 nips-2011-A Reinforcement Learning Theory for Homeostatic Regulation
Author: Mehdi Keramati, Boris S. Gutkin
Abstract: Reinforcement learning models address animal’s behavioral adaptation to its changing “external” environment, and are based on the assumption that Pavlovian, habitual and goal-directed responses seek to maximize reward acquisition. Negative-feedback models of homeostatic regulation, on the other hand, are concerned with behavioral adaptation in response to the “internal” state of the animal, and assume that animals’ behavioral objective is to minimize deviations of some key physiological variables from their hypothetical setpoints. Building upon the drive-reduction theory of reward, we propose a new analytical framework that integrates learning and regulatory systems, such that the two seemingly unrelated objectives of reward maximization and physiological-stability prove to be identical. The proposed theory shows behavioral adaptation to both internal and external states in a disciplined way. We further show that the proposed framework allows for a unified explanation of some behavioral pattern like motivational sensitivity of different associative learning mechanism, anticipatory responses, interaction among competing motivational systems, and risk aversion.
4 0.66471583 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
Author: Paul Wagner
Abstract: A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view to this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm that can interpolate between the aforementioned approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance in the Tetris benchmark problem of all attempted approximate dynamic programming methods. We report empirical evidence against such a connection and in favor of an alternative explanation. Finally, we report scores in the Tetris problem that improve on existing dynamic programming based results. 1
5 0.65971589 10 nips-2011-A Non-Parametric Approach to Dynamic Programming
Author: Oliver B. Kroemer, Jan R. Peters
Abstract: In this paper, we consider the problem of policy evaluation for continuousstate systems. We present a non-parametric approach to policy evaluation, which uses kernel density estimation to represent the system. The true form of the value function for this model can be determined, and can be computed using Galerkin’s method. Furthermore, we also present a unified view of several well-known policy evaluation methods. In particular, we show that the same Galerkin method can be used to derive Least-Squares Temporal Difference learning, Kernelized Temporal Difference learning, and a discrete-state Dynamic Programming solution, as well as our proposed method. In a numerical evaluation of these algorithms, the proposed approach performed better than the other methods. 1
6 0.63650227 215 nips-2011-Policy Gradient Coagent Networks
7 0.62997073 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions
8 0.58647251 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
9 0.52173883 237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization
10 0.50877947 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation
11 0.50391924 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning
12 0.47653109 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning
13 0.4745599 50 nips-2011-Budgeted Optimization with Concurrent Stochastic-Duration Experiments
14 0.45862079 65 nips-2011-Convergent Fitted Value Iteration with Linear Function Approximation
15 0.40869868 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
16 0.40670675 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations
17 0.32663888 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
18 0.32544038 48 nips-2011-Blending Autonomous Exploration and Apprenticeship Learning
19 0.30160984 32 nips-2011-An Empirical Evaluation of Thompson Sampling
20 0.30100015 40 nips-2011-Automated Refinement of Bayes Networks' Parameters based on Test Ordering Constraints
topicId topicWeight
[(0, 0.024), (4, 0.029), (20, 0.045), (22, 0.188), (26, 0.034), (31, 0.13), (33, 0.018), (37, 0.02), (43, 0.051), (45, 0.147), (57, 0.05), (74, 0.04), (83, 0.032), (99, 0.073)]
simIndex simValue paperId paperTitle
1 0.87345356 108 nips-2011-Greedy Algorithms for Structurally Constrained High Dimensional Problems
Author: Ambuj Tewari, Pradeep K. Ravikumar, Inderjit S. Dhillon
Abstract: A hallmark of modern machine learning is its ability to deal with high dimensional problems by exploiting structural assumptions that limit the degrees of freedom in the underlying model. A deep understanding of the capabilities and limits of high dimensional learning methods under specific assumptions such as sparsity, group sparsity, and low rank has been attsined. Efforts [1,2] are now underway to distill this valuable experience by proposing general unified frameworks that can achieve the twio goals of summarizing previous analyses and enabling their application to notions of structure hitherto unexplored. Inspired by these developments, we propose and analyze a general computational scheme based on a greedy strategy to solve convex optimization problems that arise when dealing with structurally constrained high-dimensional problems. Our framework not only unifies existing greedy algorithms by recovering them as special cases but also yields novel ones. Finally, we extend our results to infinite dimensional settings by using interesting connections between smoothness of norms and behavior of martingales in Banach spaces.
same-paper 2 0.84840798 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning
Author: Jaedeug Choi, Kee-eung Kim
Abstract: The difficulty in inverse reinforcement learning (IRL) arises in choosing the best reward function since there are typically an infinite number of reward functions that yield the given behaviour data as optimal. Using a Bayesian framework, we address this challenge by using the maximum a posteriori (MAP) estimation for the reward function, and show that most of the previous IRL algorithms can be modeled into our framework. We also present a gradient method for the MAP estimation based on the (sub)differentiability of the posterior distribution. We show the effectiveness of our approach by comparing the performance of the proposed method to those of the previous algorithms. 1
3 0.76408708 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning
Author: George Konidaris, Scott Niekum, Philip S. Thomas
Abstract: We show that the λ-return target used in the TD(λ) family of algorithms is the maximum likelihood estimator for a specific model of how the variance of an nstep return estimate increases with n. We introduce the γ-return estimator, an alternative target based on a more accurate model of variance, which defines the TDγ family of complex-backup temporal difference learning algorithms. We derive TDγ , the γ-return equivalent of the original TD(λ) algorithm, which eliminates the λ parameter but can only perform updates at the end of an episode and requires time and space proportional to the episode length. We then derive a second algorithm, TDγ (C), with a capacity parameter C. TDγ (C) requires C times more time and memory than TD(λ) and is incremental and online. We show that TDγ outperforms TD(λ) for any setting of λ on 4 out of 5 benchmark domains, and that TDγ (C) performs as well as or better than TDγ for intermediate settings of C. 1
4 0.75991905 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
Author: Alex K. Susemihl, Ron Meir, Manfred Opper
Abstract: Bayesian filtering of stochastic stimuli has received a great deal of attention recently. It has been applied to describe the way in which biological systems dynamically represent and make decisions about the environment. There have been no exact results for the error in the biologically plausible setting of inference on point process, however. We present an exact analysis of the evolution of the meansquared error in a state estimation task using Gaussian-tuned point processes as sensors. This allows us to study the dynamics of the error of an optimal Bayesian decoder, providing insights into the limits obtainable in this task. This is done for Markovian and a class of non-Markovian Gaussian processes. We find that there is an optimal tuning width for which the error is minimized. This leads to a characterization of the optimal encoding for the setting as a function of the statistics of the stimulus, providing a mathematically sound primer for an ecological theory of sensory processing. 1
5 0.75953716 180 nips-2011-Multiple Instance Filtering
Author: Kamil A. Wnuk, Stefano Soatto
Abstract: We propose a robust filtering approach based on semi-supervised and multiple instance learning (MIL). We assume that the posterior density would be unimodal if not for the effect of outliers that we do not wish to explicitly model. Therefore, we seek for a point estimate at the outset, rather than a generic approximation of the entire posterior. Our approach can be thought of as a combination of standard finite-dimensional filtering (Extended Kalman Filter, or Unscented Filter) with multiple instance learning, whereby the initial condition comes with a putative set of inlier measurements. We show how both the state (regression) and the inlier set (classification) can be estimated iteratively and causally by processing only the current measurement. We illustrate our approach on visual tracking problems whereby the object of interest (target) moves and evolves as a result of occlusions and deformations, and partial knowledge of the target is given in the form of a bounding box (training set). 1
6 0.75850534 32 nips-2011-An Empirical Evaluation of Thompson Sampling
7 0.75593436 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
8 0.75493801 283 nips-2011-The Fixed Points of Off-Policy TD
9 0.75284785 241 nips-2011-Scalable Training of Mixture Models via Coresets
10 0.75263637 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
11 0.75147206 271 nips-2011-Statistical Tests for Optimization Efficiency
12 0.75067466 246 nips-2011-Selective Prediction of Financial Trends with Hidden Markov Models
13 0.74862349 229 nips-2011-Query-Aware MCMC
14 0.74784148 10 nips-2011-A Non-Parametric Approach to Dynamic Programming
15 0.74643517 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
16 0.74572885 197 nips-2011-On Tracking The Partition Function
17 0.74550694 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
18 0.74304658 17 nips-2011-Accelerated Adaptive Markov Chain for Partition Function Computation
19 0.74146622 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning
20 0.74136406 66 nips-2011-Crowdclustering