nips nips2011 nips2011-190 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sergey Levine, Zoran Popovic, Vladlen Koltun
Abstract: We present a probabilistic algorithm for nonlinear inverse reinforcement learning. The goal of inverse reinforcement learning is to learn the reward function in a Markov decision process from expert demonstrations. While most prior inverse reinforcement learning algorithms represent the reward as a linear combination of a set of features, we use Gaussian processes to learn the reward as a nonlinear function, while also determining the relevance of each feature to the expert’s policy. Our probabilistic algorithm allows complex behaviors to be captured from suboptimal stochastic demonstrations, while automatically balancing the simplicity of the learned reward structure against its consistency with the observed actions. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We present a probabilistic algorithm for nonlinear inverse reinforcement learning. [sent-7, score-0.19]
2 The goal of inverse reinforcement learning is to learn the reward function in a Markov decision process from expert demonstrations. [sent-8, score-0.726]
3 While most prior inverse reinforcement learning algorithms represent the reward as a linear combination of a set of features, we use Gaussian processes to learn the reward as a nonlinear function, while also determining the relevance of each feature to the expert’s policy. [sent-9, score-1.102]
4 Our probabilistic algorithm allows complex behaviors to be captured from suboptimal stochastic demonstrations, while automatically balancing the simplicity of the learned reward structure against its consistency with the observed actions. [sent-10, score-0.502]
5 1 Introduction Inverse reinforcement learning (IRL) methods learn a reward function in a Markov decision process (MDP) from expert demonstrations, allowing the expert’s policy to be generalized to unobserved situations [7]. [sent-11, score-0.774]
6 Each task is consistent with many reward functions, but not all rewards provide a compact, portable representation of the task, so the central challenge in IRL is to find a reward with meaningful structure [7]. [sent-12, score-0.869]
7 Many prior methods impose structure by describing the reward as a linear combination of hand selected features [1, 12]. [sent-13, score-0.507]
8 In this paper, we extend the Gaussian process model to learn highly nonlinear reward functions that still compactly capture the demonstrated behavior. [sent-14, score-0.461]
9 This allows GPIRL to balance the simplicity of the learned reward function against its consistency with the expert’s actions, without assuming the expert to be optimal. [sent-17, score-0.592]
10 The learned GP kernel hyperparameters capture the structure of the reward, including the relevance of each feature. [sent-18, score-0.174]
11 Once learned, the GP can recover the reward for the current state space, and can predict the reward for any unseen state space within the domain of the features. [sent-19, score-0.822]
12 While several margin-based methods learn nonlinear reward functions through feature construction [13, 14, 5], such methods assume optimal expert behavior. [sent-21, score-0.636]
13 The optimal policy π maximizes the expected discounted sum of rewards E[Σ_{t=0}^∞ γ^t r_{s_t} | π]. [sent-24, score-0.245]
14 In inverse reinforcement learning, the algorithm is presented with M \ r, as well as expert demonstrations, denoted D = {ζ_1, . . . , ζ_N}. [sent-25, score-0.304]
15 The algorithm is also presented with features of the form f : S → R that can be used to represent the unknown reward r. [sent-32, score-0.463]
16 IRL aims to find a reward function r under which the optimal policy matches the expert’s demonstrations. [sent-33, score-0.489]
17 To this end, we could assume that the examples D are drawn from the optimal policy π . [sent-34, score-0.136]
18 One approach to learning from a suboptimal expert is to use a probabilistic model of the expert’s behavior. [sent-36, score-0.197]
19 We employ the maximum entropy IRL (MaxEnt) model [17], which is closely related to linearly-solvable MDPs [3], and has been used extensively to learn from human demonstrations [16, 17]. [sent-37, score-0.197]
20 This model is convenient for IRL, because its likelihood is differentiable [17], and a complete stochastic policy uniquely determines the reward function [3]. [sent-39, score-0.532]
21 Intuitively, such a stochastic policy is more deterministic when the stakes are high, and more random when all choices have similar value. [sent-40, score-0.125]
22 Under this policy, the probability of an action a in state s can be shown to be proportional to the exponential of the expected total reward after taking the action, denoted P(a|s) ∝ exp(Q^r_{sa}), where Q^r = r + γ T V^r. [sent-41, score-0.472]
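As an illustration of the stochastic model just described, the following sketch (not the authors' code; the function name, array shapes, and the treatment of the reward as a per-state vector are assumptions) computes the MaxEnt policy by soft value iteration, using V_s = logsumexp_a Q^r_{sa} and P(a|s) = exp(Q^r_{sa} − V_s):

```python
import numpy as np
from scipy.special import logsumexp

def maxent_policy(r, T, gamma, n_iters=200):
    """r: (S,) state rewards; T: (A, S, S) transition probabilities; returns P(a|s) as an (A, S) array."""
    V = np.zeros(r.shape[0])
    for _ in range(n_iters):
        Q = r[None, :] + gamma * T @ V   # expected total reward of each action in each state
        V = logsumexp(Q, axis=0)         # soft maximum over actions
    Q = r[None, :] + gamma * T @ V
    return np.exp(Q - V[None, :])        # P(a|s) ∝ exp(Q^r_{sa})
```

The soft maximum makes the policy nearly deterministic when one action dominates and nearly uniform when all actions have similar value, matching the intuition above.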
23 3 The Gaussian Process Inverse Reinforcement Learning Algorithm GPIRL represents the reward as a nonlinear function of feature values. [sent-49, score-0.444]
24 Since the reward is not known, we use Equation 1 to specify a distribution over GP outputs, and learn both the output values and the kernel function. [sent-52, score-0.483]
25 We also learn the kernel hyperparameters θ in order to recover the structure of the reward. [sent-57, score-0.135]
26 When Λ is learned, less relevant features receive low weights, and more relevant features receive high weights. [sent-62, score-0.16]
27 States distinguished by highly-weighted features can take on different reward values, while those that have similar values for all highly-weighted features take on similar rewards. [sent-63, score-0.543]
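To illustrate how the learned weights Λ control feature relevance, the sketch below implements a generic ARD-style squared-exponential kernel. It is not claimed to be Equation 5 verbatim (the paper's kernel also includes a regularization term), and the names beta and lam are assumptions for the example:

```python
import numpy as np

def ard_kernel(X1, X2, beta, lam):
    """X1: (n, d), X2: (m, d); beta: signal variance; lam: (d,) per-feature relevance weights."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2 * lam).sum(axis=-1)  # weighted squared distances
    return beta * np.exp(-0.5 * sq)
```

With lam[k] near zero, feature k no longer affects the covariance, so states differing only in that feature receive similar rewards.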
28 K_{r,u} is the covariance of the rewards at all states with the inducing point values u, located respectively at X_r and X_u [11]. [sent-65, score-0.206]
29 Instead, we can consider this problem as analogous to sparse approximation for GP regression [8], where a small set of inducing points u acts as the support for the full set of training points r. [sent-67, score-0.126]
30 This approximation is particularly appropriate in our case, because if the learned GP is used to predict a reward for a novel state space, the most likely reward would have the same form as the mean of the training conditional. [sent-70, score-0.85]
31 The resulting log likelihood is simply log P(D, u, θ | X_u) = log P(D | r = K_{r,u}^T K_{u,u}^{-1} u) + log P(u, θ | X_u) (4), where the first term is the IRL log likelihood and the second is the GP log likelihood. Once the likelihood is optimized, the reward r = K_{r,u}^T K_{u,u}^{-1} u can be used to recover the expert’s policy on the entire state space. [sent-72, score-0.775]
32 The GP can also predict the reward function for any novel state space in the domain of the features. [sent-73, score-0.411]
33 The most likely reward for a novel state space is the posterior mean K_{*,u} K_{u,u}^{-1} u, where K_{*,u} is the covariance of the new states and the inducing points. [sent-74, score-0.551]
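A minimal sketch of this sparse construction, reusing the ard_kernel sketch above: the reward on the training states or on a novel state space is the conditional mean K_{·,u} K_{u,u}^{-1} u given the inducing values u. The function name and the jitter term are assumptions, and the transpose convention follows numpy's row-per-point layout:

```python
import numpy as np

def reward_from_inducing(X_query, X_u, u, beta, lam, jitter=1e-6):
    """Conditional mean of the GP reward at X_query given inducing values u at inputs X_u."""
    # assumes ard_kernel from the earlier sketch is in scope
    K_uu = ard_kernel(X_u, X_u, beta, lam) + jitter * np.eye(X_u.shape[0])
    K_qu = ard_kernel(X_query, X_u, beta, lam)
    return K_qu @ np.linalg.solve(K_uu, u)   # K_{q,u} K_{u,u}^{-1} u
```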
34 Intuitively, this indicates that two or more inducing points are deterministically covarying, and therefore redundant. [sent-81, score-0.13]
35 While the regularized kernel prevents singular covariance matrices when many features become irrelevant, the log likelihood can still increase to infinity as Λ → 0 or β → 0: in both cases, −(1/2) log |K_{u,u}| → ∞ and, so long as u → 0, all other terms remain finite. [sent-84, score-0.219]
36 To prevent such degeneracies, we use a hyperparameter prior that discourages kernels under which two inducing points become deterministically covarying. [sent-85, score-0.248]
37 We can therefore prevent degeneracies with a prior term of the form −(1/2) Σ_{ij} [K_{u,u}^{-1}]_{ij}^2 = −(1/2) tr(K_{u,u}^{-2}), which discourages large partial correlations between inducing points. [sent-87, score-0.19]
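This prior term has a direct implementation; for a symmetric K_{u,u}, the sum of squared entries of its inverse equals tr(K_{u,u}^{-2}). A small sketch (function name is an assumption):

```python
import numpy as np

def inducing_prior_log(K_uu):
    """-1/2 * sum_ij [K_uu^{-1}]_ij^2, which equals -1/2 * tr(K_uu^{-2}) for symmetric K_uu."""
    K_inv = np.linalg.inv(K_uu)
    return -0.5 * np.sum(K_inv ** 2)
```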
38 However, unlike in GP regression, Xu and u are parameters of the algorithm rather than data, and since the inducing point positions are fixed in advance, it is possible to condition the prior on Xu . [sent-89, score-0.136]
39 5 Inducing Points and Large State Spaces A straightforward choice for the inducing points Xu is the feature values of all states in the state space S. [sent-93, score-0.19]
40 In principle, the minimum size of Xu corresponds to the complexity of the reward function. [sent-97, score-0.383]
41 For example, if the true reward has two constant regions, it can be represented by just two properly placed inducing points. [sent-98, score-0.475]
42 In practice, Xu must cover the space of feature values well enough to represent an unknown reward function, but we can nonetheless use many fewer points than there are states in S. [sent-99, score-0.453]
43 The stationary kernel in Equation 5 favors rewards that are smooth with respect to feature values. [sent-104, score-0.166]
44 For example, a reward function might have wide regions with uniform values, punctuated by regions of high-frequency variation, as is the case for piecewise constant rewards. [sent-106, score-0.419]
45 Instead, we can warp each coordinate x_{ik} of x_i by a function w_k(x_{ik}) to give high resolution to one region, and low resolution everywhere else. [sent-108, score-0.194]
46 Note that this procedure is not equivalent to merely fitting a sigmoid to the reward function, since the reward can still vary nonlinearly in the high-resolution regions around each sigmoid center m_k. [sent-112, score-0.901]
47 The accompanying supplement includes details about the priors placed on the warp parameters in our implementation, a complete derivation of w_k, and the derivatives of the warped kernel function. [sent-113, score-0.237]
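The exact warp w_k is derived in the paper's supplement and is not reproduced here; the sketch below shows only one plausible sigmoid-based warp with center m and scale ell (both names and the precise functional form are assumptions), composed with the ard_kernel sketch above to give a warped kernel:

```python
import numpy as np

def sigmoid_warp(X, m, ell):
    """Warp each feature coordinate through a logistic centered at m with scale ell (both (d,) arrays)."""
    return 1.0 / (1.0 + np.exp(-(X - m) / ell))

def warped_kernel(X1, X2, beta, lam, m, ell):
    # high resolution near m (steep part of the sigmoid), low resolution elsewhere
    return ard_kernel(sigmoid_warp(X1, m, ell), sigmoid_warp(X2, m, ell), beta, lam)
```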
48 We presented just one example of how an alternative kernel allows us to learn a reward with a particular structure. [sent-119, score-0.483]
49 7 Experiments We compared GPIRL with prior methods on several IRL tasks, using examples sampled from the stochastic MaxEnt policy (see Section 2) as well as human demonstrations. [sent-121, score-0.238]
50 Examples drawn from the stochastic policy can intuitively be viewed as noisy samples of an underlying optimal policy, while the human demonstrations contain the stochasticity inherent in human behavior. [sent-122, score-0.322]
51 We compare the algorithms using the “expected value difference” score, which is a measure of how suboptimal the learned policy is under the true reward. [sent-126, score-0.206]
52 To compute this score, we find the optimal deterministic policy under each learned reward, measure its expected sum of discounted rewards under the true reward function, and subtract this quantity from the expected sum of discounted rewards under the true policy. [sent-127, score-0.823]
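A sketch of the expected value difference score as just described: find the optimal deterministic policy under the learned reward, evaluate it under the true reward, and subtract from the value of a reference policy. Using the optimal policy under the true reward as that reference and averaging over a uniform start-state distribution are assumptions of this sketch, not details taken from the paper:

```python
import numpy as np

def value_iteration(r, T, gamma, n_iters=500):
    """Optimal deterministic policy (action index per state) for reward r: (S,), T: (A, S, S)."""
    V = np.zeros(r.shape[0])
    for _ in range(n_iters):
        V = (r[None, :] + gamma * T @ V).max(axis=0)
    return (r[None, :] + gamma * T @ V).argmax(axis=0)

def policy_value(policy, r, T, gamma):
    """Expected discounted return of a deterministic policy under reward r, uniform start states."""
    S = r.shape[0]
    T_pi = T[policy, np.arange(S), :]                  # (S, S) dynamics induced by the policy
    return np.linalg.solve(np.eye(S) - gamma * T_pi, r).mean()

def expected_value_difference(r_true, r_learned, T, gamma):
    pi_ref = value_iteration(r_true, T, gamma)         # reference policy (optimal under the true reward)
    pi_learned = value_iteration(r_learned, T, gamma)
    return policy_value(pi_ref, r_true, T, gamma) - policy_value(pi_learned, r_true, T, gamma)
```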
53 To determine how well each algorithm captured the structure of the reward function, we evaluated the learned reward on the environment on which it was learned, and on 4 additional random environments (denoted “transfer”). [sent-129, score-0.874]
54 Algorithms that do not express the reward function in terms of the correct features are expected to perform poorly on the transfer environments, even if they perform well on the training environment. [sent-130, score-0.564]
55 In the latter case, GPIRL used the warped kernel in Section 6 and FIRL, which requires discrete features, was not tested. [sent-133, score-0.162]
56 7.1 Objectworld Experiments The objectworld is an N × N grid of states with five actions per state, corresponding to steps in each direction and staying in place. [sent-136, score-0.123]
57 The true reward is positive in states that are both within 3 cells of outer color 1 and 2 cells of outer color 2, negative within 3 cells of outer color 1, and zero otherwise. [sent-145, score-0.642]
58 Inner colors and all other outer colors are distractors. [sent-146, score-0.157]
59 GPIRL learned accurate rewards that generalized well to new state spaces. [sent-150, score-0.167]
60 Because of the large number of irrelevant features and the nonlinearity of the reward, this example is particularly challenging for methods that learn linear reward functions. [sent-153, score-0.522]
61 With 16 or more examples, GPIRL consistently learned reward functions that performed as well as the true reward, as shown in Figure 1, and was able to sustain this performance as the number of distractors increased, as shown in Figure 2. [sent-154, score-0.481]
62 In the case of FIRL, this was likely due to the suboptimal expert examples. [sent-156, score-0.197]
63 In the case of MaxEnt, although the Laplace prior improved the results, the inability to represent nonlinear rewards limited the algorithm’s accuracy. [sent-157, score-0.166]
64 These issues are evident in Figure 3, which shows part of a reward function learned by each method. [sent-158, score-0.439]
65 When using continuous features, the performance of MaxEnt suffered even more from the increased nonlinearity of the reward function, while GPIRL maintained a similar level of accuracy. [sent-159, score-0.439]
66 Figure 3: Part of a reward function learned by each algorithm on an objectworld (panels: true reward, GPIRL, MaxEnt/Lp, FIRL; markers: outer color 1 objects, outer color 2 objects, other objects as distractors, and expert actions). [sent-160, score-0.769]
67 While GPIRL learned the correct reward function, MaxEnt was unable to represent the nonlinearities, and FIRL learned an overly complex reward under which the suboptimal expert would have been optimal. [sent-161, score-1.075]
68 While GPIRL achieved only modest improvement over prior methods on the training environment, the large improvement in the transfer tests indicates that the underlying reward structure was captured more accurately. [sent-163, score-0.49]
69 GPIRL learned a reward function that more accurately reflected the true policy the expert was attempting to emulate. [sent-165, score-0.698]
70 7.2 Highway Driving Behavior In addition to the objectworld environment, we evaluated the algorithms on more concrete behaviors in the context of a simple highway driving simulator, modeled on the experiment in [5] and similar evaluations in other work [1]. [sent-167, score-0.153]
71 Continuous features indicate the distance to the nearest vehicle of a specific class (car or motorcycle) or category (civilian or police) in front of the agent, either in the same lane, the lane to the right, the lane to the left, or any lane. [sent-171, score-0.193]
72 Another set of features gives the distance to the nearest such vehicle in a given lane behind the agent. [sent-172, score-0.15]
73 In this section, we present results from synthetic and human demonstrations of a policy that drives as fast as possible, but avoids driving more than double the speed of traffic within two car-lengths of a police vehicle. [sent-175, score-0.36]
74 Due to the connection between the police and speed features, the reward for this policy is nonlinear. [sent-176, score-0.588]
75 We also evaluated a second policy that instead avoids driving more than double the speed of traffic in the rightmost lane. [sent-177, score-0.166]
76 Figure 4 shows a comparison of GPIRL and prior algorithms on highways with varying numbers of 32-step synthetic demonstrations of the “police” task. [sent-179, score-0.19]
77 GPIRL only modestly outperformed prior methods on the training environments with discrete features, but achieved large improvement on the transfer experiment. [sent-180, score-0.167]
78 This indicates that, while prior algorithms learned a reasonable reward, this reward was not expressed in terms of the correct features, and did not generalize correctly. [sent-181, score-0.483]
79 With continuous features, the nonlinearity of the reward was further exacerbated, making it difficult for linear methods to represent it even on the training environment. [sent-182, score-0.439]
80 Figure 6: Highway reward functions learned from human demonstration (panels: true reward, GPIRL, MaxEnt/Lp, FIRL). [sent-184, score-0.478]
81 Road color indicates the reward at the highest speed, when the agent should be penalized for driving fast near police vehicles. [sent-185, score-0.534]
82 The reward learned by GPIRL most closely resembles the true one. [sent-186, score-0.439]
83 Although the human demonstrations were suboptimal, GPIRL was still able to learn a reward function that reflected the true policy more accurately than prior methods. [sent-187, score-0.73]
84 Furthermore, the similarity of GPIRL’s performance with the human and synthetic demonstrations suggests that its model of suboptimal expert behavior is a reasonable reflection of actual human suboptimality. [sent-188, score-0.394]
85 An example of rewards learned from human demonstrations is shown in Figure 6. [sent-189, score-0.297]
86 8 Discussion and Future Work We presented an algorithm for inverse reinforcement learning that represents nonlinear reward functions with Gaussian processes. [sent-193, score-0.573]
87 Using a probabilistic model of a stochastic expert with a GP prior on reward values, our method is able to recover both a reward function and the hyperparameters of a kernel function that describes the structure of the reward. [sent-194, score-1.078]
88 The learned GP can be used to predict a reward function consistent with the expert on any state space in the domain of the features. [sent-195, score-0.62]
89 In experiments with nonlinear reward functions, GPIRL consistently outperformed prior methods, especially when generalizing the learned reward to new state spaces. [sent-196, score-0.955]
90 When using the warped kernel function, a random restart procedure was needed to consistently find a good optimum. [sent-198, score-0.182]
91 When good features that form a linear basis for the reward are already known, prior methods such as MaxEnt would be expected to perform comparably to GPIRL. [sent-201, score-0.545]
92 When presented with a novel state space, GPIRL currently uses the mean posterior of the GP to estimate the reward function. [sent-203, score-0.428]
93 In principle, we could leverage the fact that GPs learn distributions over functions to account for the uncertainty about the reward in states that are different from any of the inducing points. [sent-204, score-0.545]
94 For example, such an approach could be used to learn a “conservative” policy that aims to achieve high rewards with some degree of certainty, avoiding regions where the reward distribution has high variance. [sent-205, score-0.629]
95 In an interactive training setting, such a method could also inform the expert about states that have high reward variance and require additional demonstrations. [sent-206, score-0.567]
96 More generally, by introducing Gaussian processes into inverse reinforcement learning, GPIRL can benefit from the wealth of prior work on Gaussian process regression. [sent-207, score-0.214]
97 For instance, we apply ideas from sparse GP approximation in the use of a small set of inducing points to learn the reward function in time linear in the number of states. [sent-208, score-0.531]
98 A substantial body of prior work discusses techniques for automatically choosing or optimizing these inducing points [8], and such methods could be incorporated into GPIRL to learn reward functions with even smaller active sets. [sent-209, score-0.575]
99 We also demonstrate how different kernels can be used to learn different types of reward structure, and further investigation into the kinds of kernel functions that are useful for IRL is another exciting avenue for future work. [sent-210, score-0.523]
100 Apprenticeship learning using inverse reinforcement learning and gradient methods. [sent-250, score-0.151]
wordName wordTfidf (topN-words)
[('gpirl', 0.666), ('reward', 0.383), ('maxent', 0.29), ('firl', 0.213), ('gp', 0.165), ('expert', 0.153), ('xu', 0.15), ('irl', 0.129), ('demonstrations', 0.119), ('policy', 0.106), ('xik', 0.097), ('reinforcement', 0.093), ('inducing', 0.092), ('rewards', 0.083), ('features', 0.08), ('police', 0.075), ('objectworld', 0.067), ('warped', 0.067), ('transfer', 0.063), ('kernel', 0.061), ('inverse', 0.058), ('learned', 0.056), ('wk', 0.053), ('outer', 0.053), ('colors', 0.052), ('highway', 0.05), ('sigmoid', 0.049), ('suboptimal', 0.044), ('prior', 0.044), ('lane', 0.043), ('learn', 0.039), ('human', 0.039), ('nonlinear', 0.039), ('expected', 0.038), ('qr', 0.037), ('continuous', 0.036), ('driving', 0.036), ('xjk', 0.035), ('kt', 0.035), ('hyperparameters', 0.035), ('discrete', 0.034), ('restart', 0.032), ('ratliff', 0.032), ('states', 0.031), ('examples', 0.03), ('apprenticeship', 0.03), ('bagnell', 0.029), ('traf', 0.029), ('state', 0.028), ('xj', 0.028), ('vehicle', 0.027), ('difference', 0.027), ('log', 0.027), ('civilian', 0.027), ('degeneracies', 0.027), ('discourages', 0.027), ('dvijotham', 0.027), ('highways', 0.027), ('popovi', 0.027), ('zoran', 0.027), ('environments', 0.026), ('environment', 0.026), ('actions', 0.025), ('hyperparameter', 0.024), ('gaussian', 0.024), ('likelihood', 0.024), ('speed', 0.024), ('vladlen', 0.023), ('levine', 0.023), ('warp', 0.023), ('color', 0.023), ('sa', 0.023), ('kernels', 0.023), ('relevance', 0.022), ('feature', 0.022), ('consistently', 0.022), ('imitation', 0.022), ('xi', 0.021), ('deterministically', 0.021), ('vehicles', 0.02), ('abbeel', 0.02), ('portable', 0.02), ('distractors', 0.02), ('rasmussen', 0.02), ('nonlinearity', 0.02), ('stochastic', 0.019), ('optima', 0.019), ('mk', 0.019), ('processes', 0.019), ('vr', 0.018), ('discounted', 0.018), ('regions', 0.018), ('car', 0.018), ('tr', 0.017), ('agent', 0.017), ('avenue', 0.017), ('supplement', 0.017), ('posterior', 0.017), ('points', 0.017), ('priors', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 190 nips-2011-Nonlinear Inverse Reinforcement Learning with Gaussian Processes
Author: Sergey Levine, Zoran Popovic, Vladlen Koltun
Abstract: We present a probabilistic algorithm for nonlinear inverse reinforcement learning. The goal of inverse reinforcement learning is to learn the reward function in a Markov decision process from expert demonstrations. While most prior inverse reinforcement learning algorithms represent the reward as a linear combination of a set of features, we use Gaussian processes to learn the reward as a nonlinear function, while also determining the relevance of each feature to the expert’s policy. Our probabilistic algorithm allows complex behaviors to be captured from suboptimal stochastic demonstrations, while automatically balancing the simplicity of the learned reward structure against its consistency with the observed actions. 1
2 0.37749866 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning
Author: Jaedeug Choi, Kee-eung Kim
Abstract: The difficulty in inverse reinforcement learning (IRL) arises in choosing the best reward function since there are typically an infinite number of reward functions that yield the given behaviour data as optimal. Using a Bayesian framework, we address this challenge by using the maximum a posteriori (MAP) estimation for the reward function, and show that most of the previous IRL algorithms can be modeled into our framework. We also present a gradient method for the MAP estimation based on the (sub)differentiability of the posterior distribution. We show the effectiveness of our approach by comparing the performance of the proposed method to those of the previous algorithms. 1
3 0.19678469 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
Author: Dylan A. Simon, Nathaniel D. Daw
Abstract: There is much evidence that humans and other animals utilize a combination of model-based and model-free RL methods. Although it has been proposed that these systems may dominate according to their relative statistical efficiency in different circumstances, there is little specific evidence — especially in humans — as to the details of this trade-off. Accordingly, we examine the relative performance of different RL approaches under situations in which the statistics of reward are differentially noisy and volatile. Using theory and simulation, we show that model-free TD learning is relatively most disadvantaged in cases of high volatility and low noise. We present data from a decision-making experiment manipulating these parameters, showing that humans shift learning strategies in accord with these predictions. The statistical circumstances favoring model-based RL are also those that promote a high learning rate, which helps explain why, in psychology, the distinction between these strategies is traditionally conceived in terms of rulebased vs. incremental learning. 1
4 0.1204536 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
Author: Paul Wagner
Abstract: A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view to this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm that can interpolate between the aforementioned approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance in the Tetris benchmark problem of all attempted approximate dynamic programming methods. We report empirical evidence against such a connection and in favor of an alternative explanation. Finally, we report scores in the Tetris problem that improve on existing dynamic programming based results. 1
5 0.1160742 26 nips-2011-Additive Gaussian Processes
Author: David K. Duvenaud, Hannes Nickisch, Carl E. Rasmussen
Abstract: We introduce a Gaussian process model of functions which are additive. An additive function is one which decomposes into a sum of low-dimensional functions, each depending on only a subset of the input variables. Additive GPs generalize both Generalized Additive Models, and the standard GP models which use squared-exponential kernels. Hyperparameter learning in this model can be seen as Bayesian Hierarchical Kernel Learning (HKL). We introduce an expressive but tractable parameterization of the kernel function, which allows efficient evaluation of all input interaction terms, whose number is exponential in the input dimension. The additional structure discoverable by this model results in increased interpretability, as well as state-of-the-art predictive power in regression tasks. 1
6 0.11535129 100 nips-2011-Gaussian Process Training with Input Noise
7 0.10680538 11 nips-2011-A Reinforcement Learning Theory for Homeostatic Regulation
8 0.10532475 215 nips-2011-Policy Gradient Coagent Networks
9 0.10386361 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations
10 0.10364238 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions
11 0.10090902 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
12 0.094115235 10 nips-2011-A Non-Parametric Approach to Dynamic Programming
13 0.08787401 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
14 0.078298844 30 nips-2011-Algorithms for Hyper-Parameter Optimization
15 0.076094866 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits
16 0.075316392 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning
17 0.074818492 291 nips-2011-Transfer from Multiple MDPs
18 0.071310192 61 nips-2011-Contextual Gaussian Process Bandit Optimization
19 0.068984196 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
20 0.066110857 56 nips-2011-Committing Bandits
topicId topicWeight
[(0, 0.16), (1, -0.14), (2, 0.088), (3, 0.193), (4, -0.189), (5, 0.014), (6, 0.05), (7, 0.04), (8, 0.073), (9, 0.082), (10, -0.128), (11, 0.023), (12, 0.007), (13, 0.03), (14, -0.042), (15, 0.033), (16, 0.061), (17, 0.186), (18, -0.014), (19, -0.122), (20, -0.019), (21, 0.051), (22, -0.062), (23, -0.039), (24, 0.068), (25, -0.043), (26, 0.111), (27, -0.065), (28, 0.093), (29, -0.008), (30, 0.077), (31, -0.064), (32, 0.072), (33, -0.026), (34, 0.055), (35, -0.075), (36, -0.053), (37, -0.02), (38, 0.061), (39, -0.038), (40, 0.02), (41, -0.252), (42, -0.139), (43, -0.111), (44, 0.008), (45, -0.088), (46, 0.137), (47, 0.054), (48, -0.045), (49, -0.116)]
simIndex simValue paperId paperTitle
same-paper 1 0.95107704 190 nips-2011-Nonlinear Inverse Reinforcement Learning with Gaussian Processes
Author: Sergey Levine, Zoran Popovic, Vladlen Koltun
Abstract: We present a probabilistic algorithm for nonlinear inverse reinforcement learning. The goal of inverse reinforcement learning is to learn the reward function in a Markov decision process from expert demonstrations. While most prior inverse reinforcement learning algorithms represent the reward as a linear combination of a set of features, we use Gaussian processes to learn the reward as a nonlinear function, while also determining the relevance of each feature to the expert’s policy. Our probabilistic algorithm allows complex behaviors to be captured from suboptimal stochastic demonstrations, while automatically balancing the simplicity of the learned reward structure against its consistency with the observed actions. 1
2 0.86971211 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning
Author: Jaedeug Choi, Kee-eung Kim
Abstract: The difficulty in inverse reinforcement learning (IRL) arises in choosing the best reward function since there are typically an infinite number of reward functions that yield the given behaviour data as optimal. Using a Bayesian framework, we address this challenge by using the maximum a posteriori (MAP) estimation for the reward function, and show that most of the previous IRL algorithms can be modeled into our framework. We also present a gradient method for the MAP estimation based on the (sub)differentiability of the posterior distribution. We show the effectiveness of our approach by comparing the performance of the proposed method to those of the previous algorithms. 1
3 0.72320139 11 nips-2011-A Reinforcement Learning Theory for Homeostatic Regulation
Author: Mehdi Keramati, Boris S. Gutkin
Abstract: Reinforcement learning models address animal’s behavioral adaptation to its changing “external” environment, and are based on the assumption that Pavlovian, habitual and goal-directed responses seek to maximize reward acquisition. Negative-feedback models of homeostatic regulation, on the other hand, are concerned with behavioral adaptation in response to the “internal” state of the animal, and assume that animals’ behavioral objective is to minimize deviations of some key physiological variables from their hypothetical setpoints. Building upon the drive-reduction theory of reward, we propose a new analytical framework that integrates learning and regulatory systems, such that the two seemingly unrelated objectives of reward maximization and physiological-stability prove to be identical. The proposed theory shows behavioral adaptation to both internal and external states in a disciplined way. We further show that the proposed framework allows for a unified explanation of some behavioral pattern like motivational sensitivity of different associative learning mechanism, anticipatory responses, interaction among competing motivational systems, and risk aversion.
4 0.55307478 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
Author: Dylan A. Simon, Nathaniel D. Daw
Abstract: There is much evidence that humans and other animals utilize a combination of model-based and model-free RL methods. Although it has been proposed that these systems may dominate according to their relative statistical efficiency in different circumstances, there is little specific evidence — especially in humans — as to the details of this trade-off. Accordingly, we examine the relative performance of different RL approaches under situations in which the statistics of reward are differentially noisy and volatile. Using theory and simulation, we show that model-free TD learning is relatively most disadvantaged in cases of high volatility and low noise. We present data from a decision-making experiment manipulating these parameters, showing that humans shift learning strategies in accord with these predictions. The statistical circumstances favoring model-based RL are also those that promote a high learning rate, which helps explain why, in psychology, the distinction between these strategies is traditionally conceived in terms of rulebased vs. incremental learning. 1
5 0.52093834 10 nips-2011-A Non-Parametric Approach to Dynamic Programming
Author: Oliver B. Kroemer, Jan R. Peters
Abstract: In this paper, we consider the problem of policy evaluation for continuousstate systems. We present a non-parametric approach to policy evaluation, which uses kernel density estimation to represent the system. The true form of the value function for this model can be determined, and can be computed using Galerkin’s method. Furthermore, we also present a unified view of several well-known policy evaluation methods. In particular, we show that the same Galerkin method can be used to derive Least-Squares Temporal Difference learning, Kernelized Temporal Difference learning, and a discrete-state Dynamic Programming solution, as well as our proposed method. In a numerical evaluation of these algorithms, the proposed approach performed better than the other methods. 1
6 0.44499281 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions
7 0.42527115 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
8 0.42426029 26 nips-2011-Additive Gaussian Processes
9 0.41497728 215 nips-2011-Policy Gradient Coagent Networks
10 0.4038724 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations
11 0.37995887 237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization
12 0.37733844 100 nips-2011-Gaussian Process Training with Input Noise
13 0.37622237 65 nips-2011-Convergent Fitted Value Iteration with Linear Function Approximation
14 0.34135759 30 nips-2011-Algorithms for Hyper-Parameter Optimization
15 0.33312654 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
16 0.31658533 61 nips-2011-Contextual Gaussian Process Bandit Optimization
17 0.31529984 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation
18 0.3112804 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning
19 0.3048158 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning
20 0.30417508 32 nips-2011-An Empirical Evaluation of Thompson Sampling
topicId topicWeight
[(0, 0.012), (4, 0.033), (20, 0.026), (22, 0.025), (26, 0.015), (31, 0.094), (33, 0.014), (43, 0.077), (45, 0.44), (57, 0.03), (60, 0.011), (74, 0.034), (83, 0.034), (99, 0.041)]
simIndex simValue paperId paperTitle
1 0.99657333 238 nips-2011-Relative Density-Ratio Estimation for Robust Distribution Comparison
Author: Makoto Yamada, Taiji Suzuki, Takafumi Kanamori, Hirotaka Hachiya, Masashi Sugiyama
Abstract: Divergence estimators based on direct approximation of density-ratios without going through separate approximation of numerator and denominator densities have been successfully applied to machine learning tasks that involve distribution comparison such as outlier detection, transfer learning, and two-sample homogeneity test. However, since density-ratio functions often possess high fluctuation, divergence estimation is still a challenging task in practice. In this paper, we propose to use relative divergences for distribution comparison, which involves approximation of relative density-ratios. Since relative density-ratios are always smoother than corresponding ordinary density-ratios, our proposed method is favorable in terms of the non-parametric convergence speed. Furthermore, we show that the proposed divergence estimator has asymptotic variance independent of the model complexity under a parametric setup, implying that the proposed estimator hardly overfits even with complex models. Through experiments, we demonstrate the usefulness of the proposed approach. 1
2 0.9942314 272 nips-2011-Stochastic convex optimization with bandit feedback
Author: Alekh Agarwal, Dean P. Foster, Daniel J. Hsu, Sham M. Kakade, Alexander Rakhlin
Abstract: This paper addresses the problem of minimizing a convex, Lipschitz function f over a convex, compact set X under a stochastic bandit feedback model. In this model, the algorithm is allowed to observe noisy realizations of the function value f (x) at any query point x ∈ X . We demonstrate √ a generalization of the ellipsoid algorithm that √ incurs O(poly(d) T ) regret. Since any algorithm has regret at least Ω( T ) on this problem, our algorithm is optimal in terms of the scaling with T . 1
3 0.994205 293 nips-2011-Understanding the Intrinsic Memorability of Images
Author: Phillip Isola, Devi Parikh, Antonio Torralba, Aude Oliva
Abstract: Artists, advertisers, and photographers are routinely presented with the task of creating an image that a viewer will remember. While it may seem like image memorability is purely subjective, recent work shows that it is not an inexplicable phenomenon: variation in memorability of images is consistent across subjects, suggesting that some images are intrinsically more memorable than others, independent of a subjects’ contexts and biases. In this paper, we used the publicly available memorability dataset of Isola et al. [13], and augmented the object and scene annotations with interpretable spatial, content, and aesthetic image properties. We used a feature-selection scheme with desirable explaining-away properties to determine a compact set of attributes that characterizes the memorability of any individual image. We find that images of enclosed spaces containing people with visible faces are memorable, while images of vistas and peaceful scenes are not. Contrary to popular belief, unusual or aesthetically pleasing scenes do not tend to be highly memorable. This work represents one of the first attempts at understanding intrinsic image memorability, and opens a new domain of investigation at the interface between human cognition and computer vision. 1
4 0.99410874 143 nips-2011-Learning Anchor Planes for Classification
Author: Ziming Zhang, Lubor Ladicky, Philip Torr, Amir Saffari
Abstract: Local Coordinate Coding (LCC) [18] is a method for modeling functions of data lying on non-linear manifolds. It provides a set of anchor points which form a local coordinate system, such that each data point on the manifold can be approximated by a linear combination of its anchor points, and the linear weights become the local coordinate coding. In this paper we propose encoding data using orthogonal anchor planes, rather than anchor points. Our method needs only a few orthogonal anchor planes for coding, and it can linearize any (α, β, p)-Lipschitz smooth nonlinear function with a fixed expected value of the upper-bound approximation error on any high dimensional data. In practice, the orthogonal coordinate system can be easily learned by minimizing this upper bound using singular value decomposition (SVD). We apply our method to model the coordinates locally in linear SVMs for classification tasks, and our experiment on MNIST shows that using only 50 anchor planes our method achieves 1.72% error rate, while LCC achieves 1.90% error rate using 4096 anchor points. 1
5 0.99158722 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization
Author: Mark Schmidt, Nicolas L. Roux, Francis R. Bach
Abstract: We consider the problem of optimizing the sum of a smooth convex function and a non-smooth convex function using proximal-gradient methods, where an error is present in the calculation of the gradient of the smooth term or in the proximity operator with respect to the non-smooth term. We show that both the basic proximal-gradient method and the accelerated proximal-gradient method achieve the same convergence rate as in the error-free case, provided that the errors decrease at appropriate rates. Using these rates, we perform as well as or better than a carefully chosen fixed error level on a set of structured sparsity problems. 1
6 0.99053073 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data
7 0.99035853 28 nips-2011-Agnostic Selective Classification
same-paper 8 0.98533785 190 nips-2011-Nonlinear Inverse Reinforcement Learning with Gaussian Processes
9 0.95837009 19 nips-2011-Active Classification based on Value of Classifier
10 0.94312555 284 nips-2011-The Impact of Unlabeled Patterns in Rademacher Complexity Theory for Kernel Classifiers
11 0.94100326 220 nips-2011-Prediction strategies without loss
12 0.93974584 182 nips-2011-Nearest Neighbor based Greedy Coordinate Descent
13 0.93535781 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
14 0.93419373 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition
15 0.93401057 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing
16 0.93121463 78 nips-2011-Efficient Methods for Overlapping Group Lasso
17 0.92989641 169 nips-2011-Maximum Margin Multi-Label Structured Prediction
18 0.92942578 105 nips-2011-Generalized Lasso based Approximation of Sparse Coding for Visual Recognition
19 0.92842293 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
20 0.9180038 172 nips-2011-Minimax Localization of Structural Information in Large Noisy Matrices