nips nips2012 nips2012-245 knowledge-graph by maker-knowledge-mining

245 nips-2012-Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions


Source: pdf

Author: Jaedeug Choi, Kee-eung Kim

Abstract: We present a nonparametric Bayesian approach to inverse reinforcement learning (IRL) for multiple reward functions. Most previous IRL algorithms assume that the behaviour data is obtained from an agent who is optimizing a single reward function, but this assumption is hard to guarantee in practice. Our approach is based on integrating the Dirichlet process mixture model into Bayesian IRL. We provide an efficient Metropolis-Hastings sampling algorithm utilizing the gradient of the posterior to estimate the underlying reward functions, and demonstrate that our approach outperforms previous ones via experiments on a number of problem domains. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We present a nonparametric Bayesian approach to inverse reinforcement learning (IRL) for multiple reward functions. [sent-7, score-0.767]

2 Most previous IRL algorithms assume that the behaviour data is obtained from an agent who is optimizing a single reward function, but this assumption is hard to guarantee in practice. [sent-8, score-0.952]

3 We provide an efficient Metropolis-Hastings sampling algorithm utilizing the gradient of the posterior to estimate the underlying reward functions, and demonstrate that our approach outperforms previous ones via experiments on a number of problem domains. [sent-10, score-0.667]

4 Introduction: Inverse reinforcement learning (IRL) aims to find the agent’s underlying reward function given the behaviour data and a model of the environment [1]. [sent-11, score-0.868]

5 IRL algorithms often assume that the behaviour data is from an agent who behaves optimally without mistakes with respect to a single reward function. [sent-12, score-0.952]

6 From the Markov decision process (MDP) perspective, IRL can be defined as the problem of finding the reward function given the trajectory data of an optimal policy, consisting of state-action histories. [sent-13, score-0.752]

7 In practice, the behaviour data is often gathered collectively from multiple agents whose reward functions are potentially different from each other. [sent-16, score-0.857]

8 The amount of data generated from a single agent may be severely limited, and hence we may suffer from the sparsity of data if we try to infer the reward function individually. [sent-17, score-0.821]

9 Moreover, even when we have enough data from a single agent, the reward function may change depending on the situation. [sent-18, score-0.578]

10 However, most of the previous IRL algorithms assume that the behaviour data is generated by a single agent optimizing a fixed reward function, although there are a few exceptions that address IRL for multiple reward functions. [sent-19, score-1.572]

11 In this work, the reward functions are individually estimated for each trajectory, which are assumed to share a common prior. [sent-21, score-0.619]

12 Other than the common prior assumption, there is no effort to group trajectories that are likely to be generated from the same or similar reward functions. [sent-22, score-0.783]

13 The behaviour data are clustered based on the inferred reward functions, where the reward functions are defined per cluster. [sent-25, score-1.41]

14 However, the number of clusters (hence the number of reward functions) has to be specified as a parameter in order to use the approach. [sent-26, score-0.616]

15 In this paper, we present a nonparametric Bayesian approach using the Dirichlet process mixture model in order to address the IRL problem with multiple reward functions. [sent-27, score-0.686]

16 We develop an efficient Metropolis-Hastings (MH) sampler utilizing the gradient of the reward function posterior to infer reward functions from the behaviour data. [sent-28, score-1.506]

17 In addition, after completing IRL on the behaviour data, we can efficiently estimate the reward function for a new trajectory by computing the mean of the reward function posterior given the pre-learned results. [sent-29, score-1.546]

18 We assume that the agent’s behavior data is generated by executing an optimal policy with some unknown reward function(s) R, given as the set X of M trajectories where the m-th trajectory is an H-step sequence of state-action pairs: X_m = {(s_{m,1}, a_{m,1}), (s_{m,2}, a_{m,2}), …, (s_{m,H}, a_{m,H})}. [sent-40, score-1.015]

19 Bayesian Inverse Reinforcement Learning (BIRL): Ramachandran and Amir [4] proposed a Bayesian approach to IRL with the assumption that the behaviour data is generated from a single reward function. [sent-45, score-0.788]

20 The prior encodes the reward function preference and the likelihood measures the compatibility of the reward function with the data. [sent-46, score-1.2]

21 Assuming that the reward function entries are independently distributed, the prior is defined as P(r) = ∏_{d=1}^{D} P(r_d). [sent-47, score-0.602]

22 We can use various distributions for the reward prior. [sent-48, score-0.578]

23 The posterior over the reward functions is then formulated by Bayes’ rule as P(r | X, η, µ, σ) ∝ P(X | r, η) P(r | µ, σ). (Here, D denotes the number of features.) [sent-55, score-0.666]
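
For concreteness, here is a rough sketch (not taken from the paper) of evaluating such a posterior for a linearly parameterized reward, assuming the standard BIRL softmax likelihood P(a | s, r) ∝ exp(η Q*(s, a; r)); the function names, array shapes, and the exact form of the likelihood relative to Eqn. (1) of the paper are assumptions.

```python
import numpy as np
from scipy.special import logsumexp

def q_star(R, P, gamma=0.95, iters=200):
    """Optimal Q-values for a tabular MDP: R is |S|x|A|, P is |A|x|S|x|S|."""
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = Q.max(axis=1)                                  # greedy state values
        Q = R + gamma * np.einsum('ast,t->sa', P, V)       # Bellman optimality backup
    return Q

def log_posterior(r, Phi, P, trajs, eta=1.0, mu=0.0, sigma=1.0, gamma=0.95):
    """Unnormalized log P(r | X): independent Gaussian prior on each reward entry
    plus a BIRL-style softmax likelihood over the demonstrated state-action pairs."""
    R = Phi @ r                                            # linear reward R(s,a) = sum_d r_d * phi_d(s,a)
    Q = q_star(R, P, gamma)
    lp = -0.5 * np.sum((r - mu) ** 2) / sigma ** 2         # log prior (up to a constant)
    for traj in trajs:                                     # trajs: list of lists of (s, a) index pairs
        for s, a in traj:
            lp += eta * Q[s, a] - logsumexp(eta * Q[s])    # log P(a | s, r) under the softmax model
    return lp
```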

24 Note that we can assign individual reward values to every state-action pair by using |S||A| indicator functions for features. [sent-56, score-0.619]

25–26 [Algorithm pseudocode, reconstructed from garbled text] Initialize c and {r_k}_{k=1}^K. For t = 1 to MaxIter: for m = 1 to M, sample c*_m ∼ P(c | c_{−m}, α); if c*_m ∉ c_{−m}, draw r_{c*_m} ∼ P(r | µ, σ); set (c_m, r_{c_m}) ← (c*_m, r_{c*_m}) with prob. min{1, P(X_m | r_{c*_m}, η) / P(X_m | r_{c_m}, η)}. Then, for k = 1 to K, draw ε ∼ N(0, 1), propose r* ← r_k + (τ²/2) ∇ log f(r_k) + τ ε, and set r_k ← r* with the corresponding acceptance probability. [sent-59, score-0.442] [sent-60, score-0.221]
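
The reward update above is a Langevin-style proposal. Below is a generic Metropolis-adjusted Langevin (MALA) step matching that form, offered as a sketch only: log_f and grad_log_f stand in for the unnormalized log reward posterior f and its gradient, which are not spelled out in this excerpt.

```python
import numpy as np

def mala_step(r, log_f, grad_log_f, tau, rng):
    """One Metropolis-adjusted Langevin update of a reward vector r.
    log_f / grad_log_f are placeholders for the unnormalized log posterior and its gradient."""
    drift = lambda x: x + 0.5 * tau ** 2 * grad_log_f(x)        # Langevin drift
    r_prop = drift(r) + tau * rng.standard_normal(r.shape)      # r* = r + (tau^2/2) grad log f(r) + tau * eps
    log_q = lambda a, b: -np.sum((a - drift(b)) ** 2) / (2 * tau ** 2)  # Gaussian proposal log-density (up to const)
    log_alpha = log_f(r_prop) - log_f(r) + log_q(r, r_prop) - log_q(r_prop, r)
    return r_prop if np.log(rng.uniform()) < log_alpha else r

# Example usage: rng = np.random.default_rng(0); r = mala_step(r, log_f, grad_log_f, tau=0.05, rng=rng)
```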

27 We can infer the reward function from the model by computing the posterior mean using a Markov chain Monte Carlo (MCMC) algorithm [4] or the maximum-a-posteriori (MAP) estimate using a gradient method [12]. [sent-63, score-0.678]

28 Nonparametric Bayesian IRL for Multiple Reward Functions: In this section, we present our approach to IRL for multiple reward functions. [sent-66, score-0.597]

29 We assume that each trajectory in the behaviour data is generated by an agent with a fixed reward function. [sent-67, score-1.131]

30 In other words, we assume that the reward function does not change within a trajectory. [sent-68, score-0.578]

31 However, the trajectories as a whole are assumed to be generated by one or more agents whose reward functions are distinct from each other. [sent-69, score-0.832]

32 We do not assume any information regarding which trajectory is generated by which agent, nor regarding the number of agents. [sent-70, score-0.366]

33 Hence, the goal is to infer an unknown number of reward functions from the unlabeled behaviour data. [sent-71, score-0.839]

34 A naive approach to this problem setting would be to solve M separate and independent IRL problems, treating each trajectory as the sole behaviour data and employing one of the well-known IRL algorithms designed for a single reward function. [sent-72, score-0.921]

35 We can then use an unsupervised learning method with the M reward functions as data points. [sent-73, score-0.619]

36 However, this approach would suffer from the sparsity of data, since each trajectory may not contain a sufficient amount of data to infer the reward function reliably, or the number of trajectories may not be enough for the unsupervised learning method to yield a meaningful result. [sent-74, score-0.925]

37 It clusters trajectories and assumes that all the trajectories in a cluster are generated by a single reward function. [sent-77, score-1.022]

38 However, it requires the number of clusters (i.e. the number of distinct reward functions) as a parameter. [sent-80, score-0.578]

39 First, we do not need to specify the number of distinct reward functions due to the nonparametric nature of our model. [sent-83, score-0.658]

40 Second, we can encode our preference or domain knowledge on the reward function into the prior since it is a Bayesian approach to IRL. [sent-84, score-0.622]

41 Third, we can acquire rich information from the behaviour data such as the distribution over the reward functions. [sent-85, score-0.765]

42 The DPM model for data {x_m}_{m=1}^M with a set of latent parameters {θ_m}_{m=1}^M can be defined as: G | α, G_0 ∼ DP(α, G_0); θ_m | G ∼ G; x_m | θ_m ∼ F(θ_m), where G is the prior used to draw each θ_m and F(θ_m) is the parameterized distribution for data x_m. [sent-89, score-0.354]

43–44 [Eqn. 3, partially recovered] c_m | p ∼ Discrete(p_1, …, p_K), φ_k ∼ G_0, x_m | c_m, φ ∼ F(φ_{c_m})  (3), where p = {p_k}_{k=1}^K is the mixing proportion for the latent classes, c_m ∈ {1, …, K} is the class assignment of x_m such that c_m = k when x_m is assigned to class k, φ_k is the parameter of the data distribution for class k, and φ = {φ_k}_{k=1}^K. [sent-96, score-0.551] [sent-99, score-0.453]
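
As an illustration only (not code from the paper), the same DPM can be simulated through its Chinese restaurant process view; base_sampler plays the role of G_0 and likelihood_sampler the role of F, and both are placeholders.

```python
import numpy as np

def sample_dpm(M, alpha, base_sampler, likelihood_sampler, rng):
    """Generate M observations from a DPM via the Chinese restaurant process:
    item m joins an existing cluster with prob. proportional to its size,
    or opens a new cluster with prob. proportional to alpha."""
    params, counts, assignments, data = [], [], [], []
    for m in range(M):
        weights = np.array(counts + [alpha], dtype=float)
        k = rng.choice(len(weights), p=weights / weights.sum())
        if k == len(params):                              # open a new cluster
            params.append(base_sampler(rng))              # theta_k ~ G0
            counts.append(0)
        counts[k] += 1
        assignments.append(k)
        data.append(likelihood_sampler(params[k], rng))   # x_m ~ F(theta_k)
    return data, assignments, params

# Example with Gaussian G0 and Gaussian F:
# rng = np.random.default_rng(0)
# x, c, theta = sample_dpm(50, 1.0, lambda g: g.normal(0, 3), lambda t, g: g.normal(t, 1.0), rng)
```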

45 DPM-BIRL for Multiple Reward Functions: We address IRL for multiple reward functions by extending BIRL with the DPM model. [sent-101, score-0.638]

46 We place a Dirichlet process prior on the reward functions r k . [sent-102, score-0.661]

47–48 The base distribution G_0 is defined as the reward function prior, i.e. the product of a normal distribution for each reward entry, ∏_{d=1}^{D} N(r_{k,d}; µ, σ). [sent-103, score-0.578] [sent-105, score-0.578]

49 The cluster assignment c_m = k indicates that the trajectory X_m belongs to cluster k, i.e. that the trajectory is generated by an agent with the reward function r_k. [sent-106, score-1.503]

50 The cluster assignment cm is drawn by the first two equations in Eqn. [sent-111, score-0.336]

51 The reward function r_k is drawn from ∏_{d=1}^{D} N(r_{k,d}; µ, σ). [sent-114, score-0.578]

52 The trajectory X_m is drawn from P(X_m | r_{c_m}, η) in Eqn. [sent-116, score-0.377]

53 The joint posterior of the cluster assignment c = {c_m}_{m=1}^M and the set of reward functions {r_k}_{k=1}^K is defined as: P(c, {r_k}_{k=1}^K | X, η, µ, σ, α) = P(c | α) ∏_{k=1}^{K} P(r_k | X_{c(k)}, η, µ, σ)  (4), where X_{c(k)} = {X_m | c_m = k, m = 1, …, M}. [sent-120, score-0.781]

54 First, note that we can safely assume that there are K distinct values of the c_m’s, so that c_m ∈ {1, …, K}. [sent-126, score-0.442]

55–56 The conditional distribution to sample c_m for the MH update can be defined as P(c_m | c_{−m}, {r_k}_{k=1}^K, X, η, α) ∝ P(X_m | r_{c_m}, η) P(c_m | c_{−m}, α), where P(c_m | c_{−m}, α) ∝ n_{−m,c_j} if c_m = c_j for some j, and ∝ α if c_m ≠ c_j for all j  (5). Here c_{−m} = {c_i | i ≠ m, i = 1, …, M} and P(X_m | r_{c_m}, η) is the likelihood defined in Eqn. [sent-130, score-0.97] [sent-133, score-0.221]

57 Note that if the sampled c_m ≠ c_j for all j, then X_m is assigned to a new cluster. [sent-138, score-0.283]

58 If c*_m ≠ c_j for all j, i.e. c*_m opens a new cluster, we draw a new reward function r_{c*_m} from the reward prior P(r | µ, σ). [sent-149, score-1.223]

59 We then set c_m = c*_m with acceptance probability min{1, P(X_m | r_{c*_m}, η) / P(X_m | r_{c_m}, η)}, since we are using a non-conjugate prior [13]. [sent-150, score-0.263]
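
A rough sketch of this assignment update is shown below, in the spirit of Neal’s MH scheme for non-conjugate DPM priors; loglik(m, r) stands for log P(X_m | r, η), reward_prior_sampler draws from P(r | µ, σ), cluster ids are assumed to be integers, and housekeeping (pruning emptied clusters, discarding rejected new ones) is omitted.

```python
import numpy as np

def resample_assignment(m, c, rewards, loglik, reward_prior_sampler, alpha, rng):
    """One MH update of the cluster label c[m] under a non-conjugate prior:
    propose a label from the CRP prior, accept with the likelihood ratio of trajectory m."""
    counts = {}
    for i, ci in enumerate(c):
        if i != m:
            counts[ci] = counts.get(ci, 0) + 1
    labels = list(counts) + ['new']
    weights = np.array([counts[k] for k in counts] + [alpha], dtype=float)
    proposal = labels[rng.choice(len(labels), p=weights / weights.sum())]
    if proposal == 'new':                                  # new cluster: draw r ~ P(r | mu, sigma)
        proposal = max(rewards) + 1
        rewards[proposal] = reward_prior_sampler(rng)
    log_alpha = loglik(m, rewards[proposal]) - loglik(m, rewards[c[m]])
    if np.log(rng.uniform()) < log_alpha:
        c[m] = proposal
    return c, rewards
```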

60 The second step updates the reward functions {r_k}_{k=1}^K. [sent-151, score-0.619]

61 Information Transfer to a New Trajectory: Suppose that we would like to infer the reward function of a new trajectory after we finish IRL on the behaviour data consisting of M trajectories. [sent-160, score-0.954]

62 The approach of [10] uses the weighted average of cluster reward functions, assuming that the new trajectory is generated from the same population as the behaviour data. [sent-164, score-1.052]

63 Note that we can relax this assumption and allow the new trajectory to be generated by a novel reward function, as a direct result of using the DPM model. [sent-165, score-0.757]

64 Here, n_k = |{m | c_m = k, m = 1, …, M}| is the number of trajectories assigned to cluster k and δ(x) is the Dirac delta function. [sent-169, score-0.244]

65 We can then re-draw samples of r using the approximated posterior and take the sample average as the inferred reward function. [sent-173, score-0.625]

66 We choose r = argmax_r P(X_new | r, η) P(r | µ, σ), which is the MAP estimate of the reward function for the new trajectory X_new only, ignoring the previous behaviour data X. [sent-180, score-0.921]
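
Below is a simplified reading of this transfer step, not the paper’s exact estimator (which re-draws samples from the approximated posterior): each learned cluster reward is weighted by n_k times its likelihood on the new trajectory, with mass proportional to α reserved for a candidate novel reward fitted to X_new alone. The names loglik and map_new_reward are placeholders.

```python
import numpy as np
from scipy.special import logsumexp

def transfer_reward(X_new, cluster_rewards, counts, alpha, loglik, map_new_reward):
    """Estimate the reward for a new trajectory from pre-learned clusters."""
    candidates = [np.asarray(r) for r in cluster_rewards] + [np.asarray(map_new_reward(X_new))]
    prior_mass = list(counts) + [alpha]                           # n_1, ..., n_K, alpha
    log_w = np.array([np.log(n) + loglik(X_new, r) for n, r in zip(prior_mass, candidates)])
    w = np.exp(log_w - logsumexp(log_w))                          # normalized posterior weights
    return sum(wi * ri for wi, ri in zip(w, candidates))          # weighted-mean reward estimate
```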

67–71 [Figure residue; only the caption, legend, and axis labels are recoverable] Figure 3: Results with increasing number of trajectories per agent in the gridworld problem. Legend: BIRL, EM-MLIRL(3), EM-MLIRL(6), EM-MLIRL(9), DPM-BIRL(U), DPM-BIRL(G); x-axis: # of trajectories per agent; one panel reports EVD for the new trajectory.

72 The experiments consisted of two tasks: The first task was finding multiple reward functions from the behaviour data with a number of trajectories. [sent-194, score-0.825]

73 The second task was inferring the reward function underlying a new trajectory, while exploiting the results learned in the first task. [sent-195, score-0.578]

74 The EVD thus measures the performance difference between the agent’s optimal policy and the optimal policy induced by the learned reward function. [sent-197, score-0.728]
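
A standard way to compute this metric is sketched below for a tabular MDP, assuming a uniform initial-state distribution and value iteration; the paper’s exact evaluation protocol may differ.

```python
import numpy as np

def greedy_policy(R, P, gamma=0.95, iters=500):
    """Optimal deterministic policy for a tabular MDP (R: |S|x|A|, P: |A|x|S|x|S|) via value iteration."""
    Q = np.zeros(R.shape)
    for _ in range(iters):
        Q = R + gamma * np.einsum('ast,t->sa', P, Q.max(axis=1))
    return Q.argmax(axis=1)

def policy_value(R, P, policy, gamma=0.95, iters=500):
    """Average state value of a deterministic policy, evaluated under reward R."""
    S = R.shape[0]
    V = np.zeros(S)
    for _ in range(iters):
        V = R[np.arange(S), policy] + gamma * P[policy, np.arange(S)] @ V
    return V.mean()

def evd(R_true, R_learned, P, gamma=0.95):
    """Expected value difference: value lost by acting optimally w.r.t. the learned reward,
    measured under the true reward."""
    pi_true = greedy_policy(R_true, P, gamma)
    pi_learned = greedy_policy(R_learned, P, gamma)
    return policy_value(R_true, P, pi_true, gamma) - policy_value(R_true, P, pi_learned, gamma)
```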

75 In the first task, we evaluated the EVD for the true and learned reward functions of each trajectory and computed the average EVD over the trajectories in the behaviour data. [sent-198, score-1.12]

76 In all the experiments, we assumed that the reward function was linearly parameterized such that R(s, a) = ∑_{d=1}^{D} r_d φ_d(s, a) with feature functions φ_d : S × A → ℝ, hence r = [r_1, …, r_D]. [sent-201, score-0.639]
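
A tiny sketch of this parameterization follows (illustrative shapes only); it also shows how |S||A| indicator features recover a tabular reward, as noted earlier in sentence 24.

```python
import numpy as np

def linear_reward(Phi, r):
    """R(s, a) = sum_d r_d * phi_d(s, a) for a feature tensor Phi of shape (|S|, |A|, D)."""
    return Phi @ r                       # result has shape (|S|, |A|)

def indicator_features(num_states, num_actions):
    """|S||A| indicator features: each weight r_d then equals the reward of one state-action pair."""
    return np.eye(num_states * num_actions).reshape(num_states, num_actions, -1)
```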

77 Random instances of IRL with three reward functions were generated as follows: each element of r was sampled to have a non-zero value with probability of 0. [sent-211, score-0.642]

78 We obtained the trajectories of 40 time steps and measured the performance as we increased the number of trajectories per reward function. [sent-213, score-0.92]

79 The left four panels in the figure present the results for the first task of learning multiple reward functions from the behaviour data. [sent-216, score-0.825]

80 However, as we increased the size of the data, both DPM-BIRL and EM-MLIRL achieved better EVD results than the baseline since they could utilize more information by grouping the trajectories to infer the reward functions. [sent-218, score-0.769]

81 The rightmost panel in the figure presents the results for the second task of inferring the reward function for a new trajectory. [sent-221, score-0.598]

82 DPM-BIRL clearly outperformed EM-MLIRL since it exploits the rich information from the reward function posterior. [sent-222, score-0.578]

83 4 compares the average CPU timing results of DPM-BIRL and EM-MLIRL with 10 trajectories per reward function. [sent-278, score-0.786]

84 Simulated-highway Problem: The second set of experiments was conducted on the Simulated-highway problem [15], where the agent drives on a three-lane road. [sent-283, score-0.272]

85 The agent can move one lane left or right and drive at speeds 2 through 3, but it fails to change the lane with probability of 0. [sent-286, score-0.383]

86 The reward function is defined using 6 binary feature functions: one indicating the agent’s collision with other cars, 3 indicating the agent’s current lane, and 2 indicating the agent’s current speed. [sent-290, score-0.66]
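
For illustration only, the sketch below assembles 6 binary features of this kind; the state encoding (lane/speed/collision fields) is hypothetical and not taken from the paper.

```python
from typing import NamedTuple
import numpy as np

class HighwayState(NamedTuple):   # hypothetical state encoding, not from the paper
    lane: int                     # e.g. 0, 1, 2
    speed: int                    # e.g. 2 or 3
    collision: bool

def highway_features(s: HighwayState) -> np.ndarray:
    """6 binary features: 1 collision indicator, 3 lane indicators, 2 speed indicators."""
    return np.array([
        float(s.collision),
        float(s.lane == 0), float(s.lane == 1), float(s.lane == 2),
        float(s.speed == 2), float(s.speed == 3),
    ])
```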

87 We prepared 3 trajectories of 40 time steps per driver agent for the first task and 20 trajectories of 40 time steps yielded by a driver randomly chosen among the three for the second task. [sent-295, score-0.585]

88 Mario’s goal is to reach the end of the level by traversing from left to right while collecting coins and avoiding or killing enemies. [sent-305, score-0.232]

89 We collected the behaviour data from 4 players: The expert player is good at both collecting coins and killing enemies. [sent-307, score-0.464]

90 The coin collector likes to collect coins but avoids killing enemies. [sent-308, score-0.327]

91 The speedy Gonzales avoids both collecting coins and killing enemies. [sent-332, score-0.264]

92 The behaviour data consisted of 3 trajectories per player. [sent-334, score-0.371]

93 Each column represents one trajectory, and the number denotes the cluster assignment c_m of trajectory X_m. [sent-338, score-0.648]

94 On the other hand, DPM-BIRL was incorrect on only one trajectory, assigning a coin collector’s trajectory to the expert player cluster. [sent-344, score-0.285]

95 3 presents the reward function entries (r_{k,d}) learned from DPM-BIRL and the average feature counts acquired by the players with the learned reward functions. [sent-346, score-1.22]

96 To compute each player’s feature counts, we executed an n-step lookahead policy yielded by each reward function r k on the simulator in 20 randomly chosen levels. [sent-348, score-0.69]

97 The reward function entries align well with each playing style. [sent-349, score-0.578]

98 For example, cluster 2 represents the coin collector: its reward function entry for killing an enemy is negative, while that for collecting a coin is positive. [sent-350, score-0.995]

99 Conclusion: We proposed a nonparametric Bayesian approach to IRL for multiple reward functions using the Dirichlet process mixture model, which extends the previous Bayesian approach to IRL that assumes a single reward function. [sent-353, score-1.305]

100 Owing to its nonparametric nature, our approach can learn an appropriate number of reward functions from the behavior data, and, being Bayesian, it facilitates incorporating domain knowledge on the reward function. [sent-354, score-1.258]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('reward', 0.578), ('irl', 0.392), ('cm', 0.221), ('evd', 0.209), ('agent', 0.187), ('behaviour', 0.187), ('xnew', 0.184), ('mario', 0.168), ('xm', 0.165), ('birl', 0.161), ('trajectories', 0.158), ('trajectory', 0.156), ('killing', 0.096), ('mlirl', 0.096), ('dpm', 0.086), ('mh', 0.086), ('reinforcement', 0.085), ('lane', 0.085), ('policy', 0.075), ('enemy', 0.071), ('cluster', 0.067), ('coin', 0.065), ('coins', 0.064), ('gridworld', 0.064), ('collector', 0.057), ('collecting', 0.053), ('dirichlet', 0.051), ('assignment', 0.048), ('em', 0.048), ('xc', 0.048), ('posterior', 0.047), ('inverse', 0.046), ('players', 0.045), ('korea', 0.045), ('cj', 0.043), ('choi', 0.042), ('functions', 0.041), ('bayesian', 0.041), ('nonparametric', 0.039), ('clusters', 0.038), ('simulator', 0.037), ('andrew', 0.036), ('player', 0.035), ('apprenticeship', 0.035), ('nmi', 0.035), ('driving', 0.035), ('infer', 0.033), ('agents', 0.032), ('mixture', 0.032), ('bros', 0.032), ('dimitrakakis', 0.032), ('emmlirl', 0.032), ('gonzales', 0.032), ('jaedeug', 0.032), ('killer', 0.032), ('ramachandran', 0.032), ('speedy', 0.032), ('expert', 0.029), ('anind', 0.028), ('dialogue', 0.028), ('driver', 0.028), ('screenshot', 0.028), ('nk', 0.028), ('move', 0.026), ('per', 0.026), ('ziebart', 0.026), ('likes', 0.026), ('prefers', 0.026), ('executing', 0.025), ('maas', 0.025), ('langevin', 0.025), ('clustering', 0.024), ('prior', 0.024), ('speed', 0.024), ('timing', 0.024), ('preferences', 0.023), ('generated', 0.023), ('utilizing', 0.022), ('stuart', 0.022), ('mdp', 0.021), ('rewards', 0.021), ('cpu', 0.021), ('rd', 0.02), ('gradient', 0.02), ('preference', 0.02), ('panel', 0.02), ('avoiding', 0.019), ('avoids', 0.019), ('assigned', 0.019), ('multiple', 0.019), ('defense', 0.019), ('proceedings', 0.019), ('counts', 0.019), ('moving', 0.018), ('transfer', 0.018), ('process', 0.018), ('cars', 0.018), ('environment', 0.018), ('brian', 0.018), ('acceptance', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 245 nips-2012-Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions

Author: Jaedeug Choi, Kee-eung Kim

Abstract: We present a nonparametric Bayesian approach to inverse reinforcement learning (IRL) for multiple reward functions. Most previous IRL algorithms assume that the behaviour data is obtained from an agent who is optimizing a single reward function, but this assumption is hard to guarantee in practice. Our approach is based on integrating the Dirichlet process mixture model into Bayesian IRL. We provide an efficient Metropolis-Hastings sampling algorithm utilizing the gradient of the posterior to estimate the underlying reward functions, and demonstrate that our approach outperforms previous ones via experiments on a number of problem domains. 1

2 0.41210631 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

Author: Edouard Klein, Matthieu Geist, Bilal Piot, Olivier Pietquin

Abstract: This paper addresses the inverse reinforcement learning (IRL) problem, that is, inferring a reward for which a demonstrated expert behavior is optimal. We introduce a new algorithm, SCIRL, whose principle is to use the so-called feature expectation of the expert as the parameterization of the score function of a multiclass classifier. This approach produces a reward function for which the expert policy is provably near-optimal. Contrary to most existing IRL algorithms, SCIRL does not require solving the direct RL problem. Moreover, with an appropriate heuristic, it can succeed with only trajectories sampled according to the expert behavior. This is illustrated on a car driving simulator. 1

3 0.26644257 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

Author: Jiarong Jiang, Adam Teichert, Jason Eisner, Hal Daume

Abstract: Users want inference to be both fast and accurate, but quality often comes at the cost of speed. The field has experimented with approximate inference algorithms that make different speed-accuracy tradeoffs (for particular problems and datasets). We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing [12]. Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is simply too large to explore naively. An attempt to counteract this by applying imitation learning algorithms also fails: the “teacher” follows a far better policy than anything in our learner’s policy space, free of the speed-accuracy tradeoff that arises when oracle information is unavailable, and thus largely insensitive to the known reward function. We propose a hybrid reinforcement/apprenticeship learning algorithm that learns to speed up an initial policy, trading off accuracy for speed according to various settings of a speed term in the loss function. 1

4 0.23683941 38 nips-2012-Algorithms for Learning Markov Field Policies

Author: Abdeslam Boularias, Jan R. Peters, Oliver B. Kroemer

Abstract: We use a graphical model for representing policies in Markov Decision Processes. This new representation can easily incorporate domain knowledge in the form of a state similarity graph that loosely indicates which states are supposed to have similar optimal actions. A bias is then introduced into the policy search process by sampling policies from a distribution that assigns high probabilities to policies that agree with the provided state similarity graph, i.e. smoother policies. This distribution corresponds to a Markov Random Field. We also present forward and inverse reinforcement learning algorithms for learning such policy distributions. We illustrate the advantage of the proposed approach on two problems: cart-balancing with swing-up, and teaching a robot to grasp unknown objects. 1

5 0.19063318 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

Author: Aaron Wilson, Alan Fern, Prasad Tadepalli

Abstract: We consider the problem of learning control policies via trajectory preference queries to an expert. In particular, the agent presents an expert with short runs of a pair of policies originating from the same state and the expert indicates which trajectory is preferred. The agent’s goal is to elicit a latent target policy from the expert with as few queries as possible. To tackle this problem we propose a novel Bayesian model of the querying process and introduce two methods that exploit this model to actively select expert queries. Experimental results on four benchmark problems indicate that our model can effectively learn policies from trajectory preference queries and that active query selection can be substantially more efficient than random selection. 1

6 0.18389845 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

7 0.15704088 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization

8 0.13728209 348 nips-2012-Tractable Objectives for Robust Policy Optimization

9 0.13716508 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

10 0.12898174 153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model

11 0.12583706 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

12 0.12335594 51 nips-2012-Bayesian Hierarchical Reinforcement Learning

13 0.1185627 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes

14 0.11571768 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

15 0.11380376 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

16 0.11083715 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

17 0.10971095 347 nips-2012-Towards a learning-theoretic analysis of spike-timing dependent plasticity

18 0.10400546 183 nips-2012-Learning Partially Observable Models Using Temporally Abstract Decision Trees

19 0.090730317 344 nips-2012-Timely Object Recognition

20 0.080071926 295 nips-2012-Risk-Aversion in Multi-armed Bandits


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.166), (1, -0.381), (2, -0.049), (3, -0.033), (4, -0.121), (5, 0.017), (6, 0.009), (7, -0.032), (8, 0.029), (9, -0.011), (10, 0.008), (11, -0.073), (12, -0.019), (13, 0.033), (14, 0.047), (15, -0.02), (16, -0.037), (17, 0.05), (18, -0.008), (19, -0.047), (20, -0.003), (21, -0.066), (22, -0.011), (23, 0.035), (24, -0.015), (25, 0.004), (26, -0.032), (27, -0.129), (28, 0.04), (29, -0.107), (30, -0.022), (31, -0.033), (32, 0.177), (33, -0.054), (34, -0.027), (35, -0.016), (36, 0.072), (37, -0.107), (38, 0.006), (39, -0.159), (40, -0.023), (41, 0.006), (42, 0.013), (43, 0.044), (44, 0.031), (45, 0.132), (46, 0.005), (47, -0.18), (48, 0.104), (49, -0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96706909 245 nips-2012-Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions

Author: Jaedeug Choi, Kee-eung Kim

Abstract: We present a nonparametric Bayesian approach to inverse reinforcement learning (IRL) for multiple reward functions. Most previous IRL algorithms assume that the behaviour data is obtained from an agent who is optimizing a single reward function, but this assumption is hard to guarantee in practice. Our approach is based on integrating the Dirichlet process mixture model into Bayesian IRL. We provide an efficient Metropolis-Hastings sampling algorithm utilizing the gradient of the posterior to estimate the underlying reward functions, and demonstrate that our approach outperforms previous ones via experiments on a number of problem domains. 1

2 0.81109273 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

Author: Edouard Klein, Matthieu Geist, Bilal Piot, Olivier Pietquin

Abstract: This paper addresses the inverse reinforcement learning (IRL) problem, that is, inferring a reward for which a demonstrated expert behavior is optimal. We introduce a new algorithm, SCIRL, whose principle is to use the so-called feature expectation of the expert as the parameterization of the score function of a multiclass classifier. This approach produces a reward function for which the expert policy is provably near-optimal. Contrary to most existing IRL algorithms, SCIRL does not require solving the direct RL problem. Moreover, with an appropriate heuristic, it can succeed with only trajectories sampled according to the expert behavior. This is illustrated on a car driving simulator. 1

3 0.73014027 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

Author: Jiarong Jiang, Adam Teichert, Jason Eisner, Hal Daume

Abstract: Users want inference to be both fast and accurate, but quality often comes at the cost of speed. The field has experimented with approximate inference algorithms that make different speed-accuracy tradeoffs (for particular problems and datasets). We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing [12]. Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is simply too large to explore naively. An attempt to counteract this by applying imitation learning algorithms also fails: the “teacher” follows a far better policy than anything in our learner’s policy space, free of the speed-accuracy tradeoff that arises when oracle information is unavailable, and thus largely insensitive to the known reward function. We propose a hybrid reinforcement/apprenticeship learning algorithm that learns to speed up an initial policy, trading off accuracy for speed according to various settings of a speed term in the loss function. 1

4 0.71298057 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

Author: Aaron Wilson, Alan Fern, Prasad Tadepalli

Abstract: We consider the problem of learning control policies via trajectory preference queries to an expert. In particular, the agent presents an expert with short runs of a pair of policies originating from the same state and the expert indicates which trajectory is preferred. The agent’s goal is to elicit a latent target policy from the expert with as few queries as possible. To tackle this problem we propose a novel Bayesian model of the querying process and introduce two methods that exploit this model to actively select expert queries. Experimental results on four benchmark problems indicate that our model can effectively learn policies from trajectory preference queries and that active query selection can be substantially more efficient than random selection. 1

5 0.65859568 51 nips-2012-Bayesian Hierarchical Reinforcement Learning

Author: Feng Cao, Soumya Ray

Abstract: We describe an approach to incorporating Bayesian priors in the MAXQ framework for hierarchical reinforcement learning (HRL). We define priors on the primitive environment model and on task pseudo-rewards. Since models for composite tasks can be complex, we use a mixed model-based/model-free learning approach to find an optimal hierarchical policy. We show empirically that (i) our approach results in improved convergence over non-Bayesian baselines, (ii) using both task hierarchies and Bayesian priors is better than either alone, (iii) taking advantage of the task hierarchy reduces the computational cost of Bayesian reinforcement learning and (iv) in this framework, task pseudo-rewards can be learned instead of being manually specified, leading to hierarchically optimal rather than recursively optimal policies. 1

6 0.64588922 38 nips-2012-Algorithms for Learning Markov Field Policies

7 0.62039578 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

8 0.54050404 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

9 0.51652068 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

10 0.51295847 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes

11 0.49790061 153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model

12 0.49666539 160 nips-2012-Imitation Learning by Coaching

13 0.48469079 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

14 0.48435166 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

15 0.44810212 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

16 0.44186977 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization

17 0.40631986 350 nips-2012-Trajectory-Based Short-Sighted Probabilistic Planning

18 0.40414032 183 nips-2012-Learning Partially Observable Models Using Temporally Abstract Decision Trees

19 0.40404174 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

20 0.36374322 347 nips-2012-Towards a learning-theoretic analysis of spike-timing dependent plasticity


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.046), (5, 0.26), (21, 0.019), (38, 0.086), (42, 0.024), (54, 0.1), (55, 0.014), (62, 0.014), (74, 0.082), (76, 0.156), (80, 0.067), (92, 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.77354085 245 nips-2012-Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions

Author: Jaedeug Choi, Kee-eung Kim

Abstract: We present a nonparametric Bayesian approach to inverse reinforcement learning (IRL) for multiple reward functions. Most previous IRL algorithms assume that the behaviour data is obtained from an agent who is optimizing a single reward function, but this assumption is hard to guarantee in practice. Our approach is based on integrating the Dirichlet process mixture model into Bayesian IRL. We provide an efficient Metropolis-Hastings sampling algorithm utilizing the gradient of the posterior to estimate the underlying reward functions, and demonstrate that our approach outperforms previous ones via experiments on a number of problem domains. 1

2 0.66098279 344 nips-2012-Timely Object Recognition

Author: Sergey Karayev, Tobias Baumgartner, Mario Fritz, Trevor Darrell

Abstract: In a large visual multi-class detection framework, the timeliness of results can be crucial. Our method for timely multi-class detection aims to give the best possible performance at any single point after a start time; it is terminated at a deadline time. Toward this goal, we formulate a dynamic, closed-loop policy that infers the contents of the image in order to decide which detector to deploy next. In contrast to previous work, our method significantly diverges from the predominant greedy strategies, and is able to learn to take actions with deferred values. We evaluate our method with a novel timeliness measure, computed as the area under an Average Precision vs. Time curve. Experiments are conducted on the PASCAL VOC object detection dataset. If execution is stopped when only half the detectors have been run, our method obtains 66% better AP than a random ordering, and 14% better performance than an intelligent baseline. On the timeliness measure, our method obtains at least 11% better performance. Our method is easily extensible, as it treats detectors and classifiers as black boxes and learns from execution traces using reinforcement learning. 1

3 0.64850879 287 nips-2012-Random function priors for exchangeable arrays with applications to graphs and relational data

Author: James Lloyd, Peter Orbanz, Zoubin Ghahramani, Daniel M. Roy

Abstract: A fundamental problem in the analysis of structured relational data like graphs, networks, databases, and matrices is to extract a summary of the common structure underlying relations between individual entities. Relational data are typically encoded in the form of arrays; invariance to the ordering of rows and columns corresponds to exchangeable arrays. Results in probability theory due to Aldous, Hoover and Kallenberg show that exchangeable arrays can be represented in terms of a random measurable function which constitutes the natural model parameter in a Bayesian model. We obtain a flexible yet simple Bayesian nonparametric model by placing a Gaussian process prior on the parameter function. Efficient inference utilises elliptical slice sampling combined with a random sparse approximation to the Gaussian process. We demonstrate applications of the model to network data and clarify its relation to models in the literature, several of which emerge as special cases. 1

4 0.64575356 177 nips-2012-Learning Invariant Representations of Molecules for Atomization Energy Prediction

Author: Grégoire Montavon, Katja Hansen, Siamac Fazli, Matthias Rupp, Franziska Biegler, Andreas Ziehe, Alexandre Tkatchenko, Anatole V. Lilienfeld, Klaus-Robert Müller

Abstract: The accurate prediction of molecular energetics in chemical compound space is a crucial ingredient for rational compound design. The inherently graph-like, non-vectorial nature of molecular data gives rise to a unique and difficult machine learning problem. In this paper, we adopt a learning-from-scratch approach where quantum-mechanical molecular energies are predicted directly from the raw molecular geometry. The study suggests a benefit from setting flexible priors and enforcing invariance stochastically rather than structurally. Our results improve the state-of-the-art by a factor of almost three, bringing statistical methods one step closer to chemical accuracy. 1

5 0.64288008 70 nips-2012-Clustering by Nonnegative Matrix Factorization Using Graph Random Walk

Author: Zhirong Yang, Tele Hao, Onur Dikmen, Xi Chen, Erkki Oja

Abstract: Nonnegative Matrix Factorization (NMF) is a promising relaxation technique for clustering analysis. However, conventional NMF methods that directly approximate the pairwise similarities using the least square error often yield mediocre performance for data in curved manifolds because they can capture only the immediate similarities between data samples. Here we propose a new NMF clustering method which replaces the approximated matrix with its smoothed version using random walk. Our method can thus accommodate farther relationships between data samples. Furthermore, we introduce a novel regularization in the proposed objective function in order to improve over spectral clustering. The new learning objective is optimized by a multiplicative Majorization-Minimization algorithm with a scalable implementation for learning the factorizing matrix. Extensive experimental results on real-world datasets show that our method has strong performance in terms of cluster purity. 1

6 0.64252317 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions

7 0.63896 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization

8 0.62316501 38 nips-2012-Algorithms for Learning Markov Field Policies

9 0.6206919 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

10 0.61811489 303 nips-2012-Searching for objects driven by context

11 0.61576253 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

12 0.61496651 210 nips-2012-Memorability of Image Regions

13 0.61474115 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

14 0.61315501 348 nips-2012-Tractable Objectives for Robust Policy Optimization

15 0.61302483 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

16 0.61119831 183 nips-2012-Learning Partially Observable Models Using Temporally Abstract Decision Trees

17 0.60955161 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

18 0.60895169 160 nips-2012-Imitation Learning by Coaching

19 0.60854959 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

20 0.6059835 274 nips-2012-Priors for Diversity in Generative Latent Variable Models