nips nips2012 nips2012-162 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Edouard Klein, Matthieu Geist, Bilal Piot, Olivier Pietquin
Abstract: This paper addresses the inverse reinforcement learning (IRL) problem, that is inferring a reward for which a demonstrated expert behavior is optimal. We introduce a new algorithm, SCIRL, whose principle is to use the so-called feature expectation of the expert as the parameterization of the score function of a multiclass classifier. This approach produces a reward function for which the expert policy is provably near-optimal. Contrary to most existing IRL algorithms, SCIRL does not require solving the direct RL problem. Moreover, with an appropriate heuristic, it can succeed with only trajectories sampled according to the expert behavior. This is illustrated on a car driving simulator. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract This paper addresses the inverse reinforcement learning (IRL) problem, that is inferring a reward for which a demonstrated expert behavior is optimal. [sent-8, score-0.904]
2 We introduce a new algorithm, SCIRL, whose principle is to use the so-called feature expectation of the expert as the parameterization of the score function of a multiclass classifier. [sent-9, score-0.726]
3 This approach produces a reward function for which the expert policy is provably near-optimal. [sent-10, score-1.151]
4 Moreover, with an appropriate heuristic, it can succeed with only trajectories sampled according to the expert behavior. [sent-12, score-0.524]
5 1 Introduction Inverse reinforcement learning (IRL) [14] consists in finding a reward function such that a demonstrated expert behavior is optimal. [sent-14, score-0.857]
6 5) search for a reward function such that the associated optimal policy induces a distribution over trajectories (or some measure of this distribution) which matches the one induced by the expert. [sent-16, score-0.841]
7 Often, this distribution is characterized by the so-called feature expectation (see Sec. [sent-17, score-0.21]
8 1): given a reward function linearly parameterized by some feature vector, it is the expected discounted cumulative feature vector for starting in a given state, applying a given action and following the related policy. [sent-19, score-0.75]
9 The expert behavior could be mimicked by a supervised learning algorithm generalizing the mapping from states to actions. [sent-21, score-0.435]
10 Here, we generally consider multi-class classifiers which compute from a training set the parameters of a linearly parameterized score function; the decision rule for a given state is the argument (the action) which maximizes the score function for this state (see Sec. [sent-22, score-0.46]
11 The basic idea of our SCIRL (Structured Classification-based IRL) algorithm is simply to take an estimate of the expert feature expectation as the parameterization of the score function (see Sec. [sent-25, score-0.755]
12 The computed parameter vector actually defines a reward function for which we show the expert policy to be near-optimal (Sec. [sent-28, score-1.173]
13 It requires estimating the expert feature expectation, but this is roughly a policy evaluation problem (for an observed policy, so it is less involved than repeated policy optimization problems), see Sec. [sent-32, score-1.33]
14 Moreover, up to the use of some heuristic, SCIRL may be trained solely from transitions sampled from the expert policy (no need to sample the whole dynamics). [sent-34, score-0.921]
15 A deterministic policy π ∈ A^S defines the behavior of an agent. [sent-39, score-0.387]
16 The quality of this control is quantified by the value function v^π_R ∈ R^S, associating to each state the cumulative discounted reward for starting in this state and following the policy π afterwards: v^π_R(s) = E[Σ_{t≥0} γ^t R(St) | S0 = s, π]. [sent-40, score-0.855]
17 An optimal policy π*_R (according to the reward function R) is a policy whose value function v*_R satisfies v*_R ≥ v^π_R, for any policy π and componentwise. [sent-41, score-1.535]
18 With a slight abuse of notation, we may write a the policy which associates the action a to each state s. [sent-43, score-0.507]
19 We also write ρπ for the stationary distribution of the policy π (satisfying ρπ Pπ = ρπ). [sent-49, score-0.387]
20 On the contrary, (approximate) inverse reinforcement learning [11] aims at estimating a reward function for which an observed policy is (nearly) optimal. [sent-52, score-0.939]
21 Let us call this policy the expert policy, denoted πE . [sent-53, score-0.8]
22 We may assume that it optimizes some unknown reward function RE. [sent-54, score-0.371]
23 The aim of IRL is to compute some reward R̂ such that the expert policy is (close to being) optimal, that is such that v^πE_R̂ ≈ v*_R̂. [sent-55, score-1.171]
24 Similarly to the direct problem, the state space may be too large for the reward function to admit a practical exact representation. [sent-57, score-0.441]
25 Therefore, we restrict our search for a good reward to linearly parameterized functions. [sent-58, score-0.469]
26 Let φ(s) = (φ1(s), ..., φp(s))ᵀ be a feature vector composed of p basis functions φi ∈ R^S; we define the parameterized reward functions as Rθ(s) = θᵀφ(s) = Σ_{i=1}^p θi φi(s). [sent-62, score-0.564]
27 Searching for a good reward thus reduces to searching for a good parameter vector θ ∈ R^p. [sent-63, score-0.373]
28 Parameterizing the reward this way implies a related parameterization of the action-value function: Q^π_θ(s, a) = θᵀ µ^π(s, a) with µ^π(s, a) = E[Σ_{t≥0} γ^t φ(St) | S0 = s, A0 = a, π]. (1) [sent-67, score-0.378]
29 Therefore, the action-value function shares the parameter vector of the reward function, with an associated feature vector µ^π called the feature expectation. [sent-68, score-0.599]
30 Notice that each component µ^π_i of this feature vector is actually the action-value function of the policy π assuming the reward is φi: µ^π_i(s, a) = Q^π_{φi}(s, a). [sent-70, score-0.862]
31 Therefore, any algorithm designed for estimating an action-value function may be used to estimate the feature expectation, such as Monte-Carlo rollouts or temporal difference learning [7]. [sent-71, score-0.255]
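As an illustration of the Monte-Carlo option, here is a minimal sketch of estimating µ^π(s, a) by rollouts. The simulator interface `sample_next_state(s, a)`, the `policy` callable and the feature map `phi` are assumed names, not part of the paper.

```python
import numpy as np

def mc_feature_expectation(sample_next_state, policy, phi, s0, a0,
                           gamma=0.9, horizon=200, n_rollouts=50):
    # Monte-Carlo estimate of mu^pi(s0, a0) = E[sum_t gamma^t phi(S_t) | S_0 = s0, A_0 = a0, pi].
    p = len(phi(s0))
    mu = np.zeros(p)
    for _ in range(n_rollouts):
        s, a, discount = s0, a0, 1.0
        acc = np.zeros(p)
        for _ in range(horizon):
            acc += discount * np.asarray(phi(s), dtype=float)  # accumulate discounted features
            s = sample_next_state(s, a)                        # assumed simulator interface
            a = policy(s)                                      # then follow pi
            discount *= gamma
        mu += acc
    return mu / n_rollouts
```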
32 We assume that the decision rule associates to an input the argument which maximizes a related score function, this score function being linearly parameterized and the associated parameters being learnt by the algorithm. [sent-78, score-0.402]
33 The linearly parameterized score function sw ∈ R^{X×Y} of parameter vector w ∈ R^d is defined as sw(x, y) = wᵀψ(x, y). [sent-83, score-0.278]
34 Using a training set {(xi , yi )1≤i≤N }, a linearly parameterized score function-based multi-class classification (MC2 for short) algorithm computes a parameter vector θc . [sent-85, score-0.238]
35 We do not consider a specific MC2 algorithm, as long as it classifies inputs by taking the argument maximizing a linearly parameterized score function. [sent-87, score-0.239]
36 For example, one may choose a multi-class support vector machine [6] (taking the kernel induced by the feature vector) or a structured large margin approach [18]. [sent-88, score-0.229]
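As another concrete (toy) example of an MC2 algorithm, not one used in the paper, a multiclass perceptron over a joint feature map ψ(x, y) already fits the definition; the sketch below is illustrative only.

```python
import numpy as np

def mc2_perceptron(psi, X, Y, labels, d, n_epochs=20):
    # Toy MC2 algorithm: learns w for the linear score s_w(x, y) = w . psi(x, y);
    # the decision rule is g_w(x) = argmax_y s_w(x, y).
    w = np.zeros(d)
    for _ in range(n_epochs):
        for x, y in zip(X, Y):
            y_hat = max(labels, key=lambda c: w @ psi(x, c))
            if y_hat != y:                         # standard multiclass perceptron update
                w += psi(x, y) - psi(x, y_hat)
    return w

def decision_rule(w, psi, labels):
    # g_w(x): greedy argmax over the score function
    return lambda x: max(labels, key=lambda c: w @ psi(x, c))
```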
37 Then, the decision rule gw(x) can be interpreted as a policy which is greedy with respect to the score function wᵀψ(x, y), which may itself be seen as an action-value function. [sent-95, score-0.631]
38 (1), if ψ(x, y) is the feature expectation of some policy π which produces labels of the training set, and if the classification error is small, then w will be the parameter vector of a reward function for which we may hope the policy π to be near optimal. [sent-97, score-1.418]
39 Let πE be the expert policy from which we would like to recover a reward function. [sent-99, score-1.151]
40 Assume that we have a training set D = {(si, ai = πE(si))_{1≤i≤N}} where states are sampled according to the expert stationary distribution ρE = ρπE (footnote 2). [sent-100, score-0.613]
41 Assume also that we have an estimate µ̂^πE of the expert feature expectation µ^πE defined in Eq. [sent-101, score-0.652]
42 1; however, recall that estimating µπE is simply a policy evaluation problem (estimating the action-value function of a given policy), as noted in Sec. [sent-105, score-0.428]
43 The proposed algorithm simply consists in choosing θᵀµ̂^πE(s, a) as the linearly parameterized score function, training the classifier on D, which produces a parameter vector θc, and outputting the reward function Rθc(s) = θcᵀφ(s). [sent-109, score-0.617]
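Putting the pieces together, a minimal sketch of this SCIRL pipeline could look as follows. It assumes an estimate `mu_hat(s, a)` returning a vector in R^p and reuses the toy `mc2_perceptron` above; everything here is illustrative, not the authors' code.

```python
def scirl(D_states, D_expert_actions, actions, mu_hat, phi):
    # Score features are the estimated expert feature expectation: psi(s, a) = mu_hat(s, a).
    p = len(phi(D_states[0]))
    theta_c = mc2_perceptron(lambda s, a: mu_hat(s, a),
                             D_states, D_expert_actions, actions, d=p)
    # The classifier's parameter vector directly defines the output reward R(s) = theta_c . phi(s).
    return lambda s: theta_c @ phi(s)
```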
44 We call this Structured Classification-based IRL because using the (estimated) expert feature expectation as the feature vector for the classifier somehow takes the MDP structure into account in the classification problem and allows outputting a reward function. [sent-112, score-1.16]
45 Although it may require estimating the expert feature expectation, this is just a policy evaluation problem, less difficult than the policy optimization involved in the direct problem. [sent-114, score-1.365]
46 2 For example, if the Markov chain induced by the expert policy is fast-mixing, sampling a trajectory will quickly lead to sampling states according to this distribution. [sent-117, score-0.879]
47 2 Analysis In this section, we show that the expert policy πE is close to being optimal according to the reward function Rθc, more precisely that Es∼ρE[v*_θc(s) − v^πE_θc(s)] is small. [sent-119, score-1.206]
48 As the proof only relies on the reward Rθc, we omit the related subscripts to keep the notations simple (e. [sent-140, score-0.372]
49 The policy πc (the decision rule of the classifier) is greedy with respect to Q̂^πE_θc = θcᵀµ̂^πE. [sent-154, score-0.465]
50 1 is that given the true expert feature expectation µ^πE and a perfect classifier (εc = 0), πE is the unique optimal policy for Rθc. [sent-169, score-1.033]
51 One may argue that this bound trivially holds for the null reward function (a reward often exhibited to show that IRL is an ill-posed problem), obtained if θc = 0. [sent-170, score-0.747]
52 With θc = 0, the decision rule would be a random policy and we would have εc = (|A| − 1)/|A|, the worst possible classification error. [sent-172, score-0.465]
53 Therefore, we advocate that the proposed approach somehow allows disambiguating the IRL problem (at least, it does not output trivial reward functions such as the null vector). [sent-174, score-0.41]
54 One should notice that there is a hidden dependency of the classification error εc on the estimated expert feature expectation µ̂^πE. [sent-176, score-0.671]
55 Nevertheless, provided a good representation for the reward function (that is, a good choice of basis functions φi) and a small estimation error, this should not be a practical problem. [sent-178, score-0.369]
56 Finally, even if our bound relies on the generalization errors εc and ε̄Q, the classifier will only use (µ̂^πE(si, a))_{1≤i≤N, a∈A} in the training phase, where si are the states from the set D. [sent-179, score-0.238]
57 It outputs θc, seen as a reward function, thus the estimated feature expectation µ̂^πE is no longer required. [sent-180, score-0.561]
58 1 A Practical Approach Estimating the Expert Feature Expectation SCIRL relies on an estimate µ̂^πE of the expert feature expectation. [sent-183, score-0.544]
59 An already made key observation is that each component of µ^πE is the action-value function of πE for the reward function φi: µ^πE_i(s, a) = Q^πE_{φi}(s, a) = [T^a_{φi} v^πE_{φi}](s). [sent-185, score-0.351]
60 For a fixed a ∈ A, let µ^πE_a ∈ R^{|S|×p} be the feature expectation matrix whose rows are the expert feature vectors, that is (µ^πE(s, a))ᵀ for any s ∈ S. [sent-189, score-0.725]
61 Moreover, the related computational cost is of the same order of magnitude as evaluating a single policy (as the costly part, computing (I − γPπE)^{−1}, is shared by all components). [sent-191, score-0.387]
62 If the model is unknown, any temporal difference learning algorithm can be used to estimate the expert feature expectation [7], such as LSTD (Least-Squares Temporal Differences) [4]. [sent-192, score-0.672]
63 Each component µ^πE_i of the expert feature expectation is parameterized by a vector ξi ∈ R^d: µ^πE_i(s, a) ≈ ξiᵀψ(s, a). [sent-194, score-0.716]
64 Assume that we have a training set {(si, ai, s'i, a'i = πE(s'i))_{1≤i≤M}} with actions ai not necessarily sampled according to policy πE (e. [sent-195, score-1.001]
65 Let Ψ (resp. Ψ̃) be the feature matrix whose rows are the feature vectors ψ(si, ai) (resp. [sent-199, score-0.304]
66 As for the exact case, the costly part (computing the inverse matrix) is shared by all feature expectation components, so the computational cost is reasonable (same order as LSTD). [sent-207, score-0.257]
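A sketch of this LSTD-style estimation is given below, assuming the transition features are stacked in matrices as described above. The small ridge term is an added numerical safeguard for illustration, not part of the paper.

```python
import numpy as np

def lstd_feature_expectation(Psi, Psi_next, Phi, gamma=0.9, ridge=1e-6):
    # Psi:      (M, d) rows psi(s_i, a_i)
    # Psi_next: (M, d) rows psi(s'_i, a'_i) with a'_i = pi_E(s'_i)
    # Phi:      (M, p) rows phi(s_i), the reward basis functions
    # Solves the p LSTD systems at once; the costly matrix factorization is shared.
    A = Psi.T @ (Psi - gamma * Psi_next) + ridge * np.eye(Psi.shape[1])
    Xi = np.linalg.solve(A, Psi.T @ Phi)      # Xi is (d, p): one column xi_i per component
    return lambda psi_sa: Xi.T @ psi_sa       # mu_hat(s, a) ~ Xi^T psi(s, a)
```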
67 Provided a simulator and the ability to sample according to the expert policy, the expert feature expectation may also be estimated using Monte-Carlo rollouts for a given state-action pair (as noted in Sec. [sent-208, score-1.185]
68 In order to have a small error ε̄Q, one may learn using transitions whose starting state is sampled according to ρE and whose actions are uniformly distributed. [sent-212, score-0.249]
69 However, it may happen that only transitions of the expert are available: T = {(si, ai = πE(si), s'i)_{1≤i≤N}}. [sent-213, score-0.804]
70 While the state-action couples (si, ai) may be used to feed the classifier, the transitions (si, ai, s'i) are not enough to provide an accurate estimate of the feature expectation. [sent-214, score-0.653]
71 We may adopt an optimistic point of view by assuming that applying a non-expert action just delays the effect of the expert action. [sent-222, score-0.464]
72 |s, πE (s)) for any action a and for which the reward feature expectation is the null vector, φ(sv ) = 0. [sent-225, score-0.617]
73 Applying this idea to the available estimate (recalling that the classifier only requires evaluating µ̂^πE on (si, a)_{1≤i≤N, a∈A}) provides the proposed heuristic: for 1 ≤ i ≤ N, µ̂^πE(si, a ≠ ai) = γ µ̂^πE(si, ai). [sent-227, score-0.229]
74 We may even push this idea further, to get a simpler estimate of the expert feature expectation (but with the weakest guarantees). [sent-228, score-0.672]
75 We may estimate µ̂^πE(si, ai) using the single rollout available in the training set and use the proposed heuristic for the other actions: ∀1 ≤ i ≤ N, µ̂^πE(si, ai) = Σ_{j=i}^{N} γ^{j−i} φ(sj) and µ̂^πE(si, a ≠ ai) = γ µ̂^πE(si, ai). (5) [sent-233, score-0.516]
76 To sum up, the expert feature expectation may be seen as a vector of action-value functions (for the same policy πE and different reward functions φi). [sent-234, score-1.403]
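A sketch of this expert-only heuristic (Eq. (5)) on a single expert trajectory (s_1, a_1, ..., s_N, a_N) is given below. The estimate is indexed by the trajectory position i rather than the state, since it is only needed on the visited state-action couples; all names are illustrative.

```python
import numpy as np

def expert_only_feature_expectation(states, expert_actions, phi, gamma=0.9):
    # Backward recursion gives mu_hat(s_i, a_i) = sum_{j >= i} gamma^(j-i) phi(s_j);
    # non-expert actions use the delay heuristic mu_hat(s_i, a != a_i) = gamma * mu_hat(s_i, a_i).
    N = len(states)
    mu_expert = [None] * N
    mu_expert[N - 1] = np.asarray(phi(states[N - 1]), dtype=float)
    for i in range(N - 2, -1, -1):
        mu_expert[i] = np.asarray(phi(states[i]), dtype=float) + gamma * mu_expert[i + 1]

    def mu_hat(i, a):
        return mu_expert[i] if a == expert_actions[i] else gamma * mu_expert[i]
    return mu_hat
```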
77 Depending on the available data, one may have to rely on some heuristic to assess the feature expectation for an unexperienced (non-expert) action. [sent-236, score-0.275]
78 Also, this expert feature expectation estimate is only required for training the classifier, so it is sufficient to estimate it on the state-action couples (si, a)_{1≤i≤N, a∈A}. [sent-237, score-0.734]
79 In any case, estimating µ^πE is not harder than estimating the action-value function of a given policy in the on-policy case, which is much easier than computing an optimal policy for an arbitrary reward function (as required by most existing IRL algorithms, see Sec. [sent-238, score-1.23]
80 Let L : S × A → R+ be a user-defined margin function satisfying L(s, πE(s)) ≤ L(s, a) (here, L(si, ai) = 0 and L(si, a ≠ ai) = 1). [sent-243, score-0.229]
81 The expert feature expectation is estimated using the scheme described in Eq. [sent-249, score-0.623]
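For completeness, here is a sketch of a structured large-margin MC2 trainer using such a margin function, via plain subgradient descent on the hinge loss max_a [w·ψ(x, a) + L(x, a)] − w·ψ(x, y). The learning rate, the number of epochs and the l2 term are illustrative choices, not taken from the paper.

```python
import numpy as np

def margin_mc2(psi, X, Y, labels, d, margin=None, lr=0.05, lam=1e-3, n_epochs=50):
    if margin is None:                               # default 0/1 margin, as in the text
        margin = lambda x, a, y: 0.0 if a == y else 1.0
    w = np.zeros(d)
    for _ in range(n_epochs):
        for x, y in zip(X, Y):
            # cost-augmented prediction
            a_star = max(labels, key=lambda a: w @ psi(x, a) + margin(x, a, y))
            g = lam * w                              # subgradient of the l2 regularizer
            if w @ psi(x, a_star) + margin(x, a_star, y) > w @ psi(x, y):
                g = g + psi(x, a_star) - psi(x, y)   # hinge-loss subgradient
            w -= lr * g
    return w
```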
82 A classic approach to IRL, initiated in [1], consists in finding a policy (through some reward function) such that its feature expectation (or more generally some measure of the underlying trajectories’ distribution) matches the one of the expert policy. [sent-252, score-1.361]
83 Notice that related algorithms are not always able to output a reward function, even if they may make use of IRL as an intermediate step. [sent-254, score-0.371]
84 In [13], a classification algorithm is also used to produce a reward function. [sent-259, score-0.351]
85 Estimating the subgradient requires sampling trajectories according to the policy that is optimal for the currently estimated reward. [sent-265, score-0.519]
86 Still, this requires sampling trajectories according to a non-expert policy and the direct problem remains at the core of the approach (even if solving it is avoided). [sent-267, score-0.528]
87 SCIRL does not require solving the direct problem, just estimating the feature expectation of the expert policy. [sent-268, score-0.718]
88 In other words, instead of solving multiple policy optimization problems, we only solve one policy evaluation problem. [sent-269, score-0.793]
89 Moreover, using heuristics which go beyond our analysis, SCIRL may rely solely on data provided by expert trajectories. [sent-273, score-0.433]
90 The goal is to drive a car on a busy three-lane highway with randomly generated traffic (driving off-road is allowed on both sides). [sent-277, score-0.302]
91 The expert optimizes a handcrafted reward RE which favours speed, punishes off-road driving, punishes collisions even more, and is neutral otherwise. [sent-279, score-0.834]
92 We also consider the optimal behavior according to a randomly sampled reward function as a baseline (using the same reward feature vector as SCIRL and PIRL; the associated parameter vector is randomly sampled). [sent-283, score-0.927]
93 The policy corresponding to the random reward is in blue, the policy outputted by the classifier is in yellow and the optimal policy according to SCIRL's reward is in red. [sent-287, score-1.946]
94 PIRL is an iterative algorithm, each iteration requiring the MDP to be solved for some reward function. [sent-298, score-0.351]
95 It is run for 70 iterations; all required objects (the feature expectation of a non-expert policy and an optimal policy according to some reward function at each iteration) are computed exactly using the model. [sent-299, score-1.282]
96 1 shows the performance of each approach as a function of the number of expert transitions used (except PIRL, which uses the model). [sent-302, score-0.49]
97 SCIRL works pretty well here: after only a hundred transitions it reaches the performance of PIRL, both being close to the expert value. [sent-305, score-0.508]
98 7 Conclusion We have introduced a new way to perform IRL by structuring a linearly parameterized score function-based multi-class classification algorithm with an estimate of the expert feature expectation. [sent-307, score-0.738]
99 This outputs a reward function for which we have shown the expert to be near optimal, provided a small classification error and a good expert feature expectation estimate. [sent-308, score-1.406]
100 How to practically estimate this quantity has been discussed and we have introduced a heuristic for the case where only transitions from the expert are available, along with a specific instantiation of the SCIRL algorithm. [sent-309, score-0.611]
wordName wordTfidf (topN-words)
[('expert', 0.413), ('policy', 0.387), ('scirl', 0.385), ('reward', 0.351), ('irl', 0.313), ('vr', 0.204), ('si', 0.194), ('pirl', 0.157), ('expectation', 0.108), ('feature', 0.102), ('ai', 0.1), ('reinforcement', 0.093), ('classi', 0.08), ('transitions', 0.077), ('score', 0.076), ('rs', 0.074), ('car', 0.073), ('parameterized', 0.071), ('vre', 0.07), ('er', 0.058), ('cf', 0.058), ('trajectories', 0.055), ('simulator', 0.054), ('apprenticeship', 0.051), ('driving', 0.047), ('linearly', 0.047), ('inverse', 0.047), ('mdp', 0.046), ('bellman', 0.045), ('heuristic', 0.045), ('decision', 0.044), ('rollouts', 0.043), ('tr', 0.042), ('actions', 0.042), ('estimating', 0.041), ('lstd', 0.04), ('gw', 0.038), ('state', 0.035), ('contrary', 0.035), ('direct', 0.035), ('edouard', 0.035), ('highway', 0.035), ('metz', 0.035), ('punish', 0.035), ('associates', 0.034), ('somehow', 0.034), ('rule', 0.034), ('according', 0.032), ('france', 0.031), ('action', 0.031), ('sw', 0.031), ('couples', 0.031), ('structured', 0.031), ('sv', 0.03), ('margin', 0.029), ('notice', 0.029), ('estimate', 0.029), ('outputting', 0.028), ('outputted', 0.028), ('matthieu', 0.028), ('andrew', 0.028), ('parameterization', 0.027), ('es', 0.027), ('practically', 0.027), ('fed', 0.027), ('notations', 0.026), ('induced', 0.025), ('inputs', 0.025), ('null', 0.025), ('unstructured', 0.024), ('sampled', 0.024), ('discounted', 0.024), ('cation', 0.024), ('stuart', 0.023), ('olivier', 0.023), ('rl', 0.023), ('rx', 0.023), ('eu', 0.023), ('associating', 0.023), ('optimal', 0.023), ('states', 0.022), ('vector', 0.022), ('training', 0.022), ('subgradient', 0.022), ('dark', 0.021), ('avoided', 0.021), ('traf', 0.021), ('subscripts', 0.021), ('argument', 0.02), ('instantiation', 0.02), ('basically', 0.02), ('may', 0.02), ('temporal', 0.02), ('aim', 0.02), ('solving', 0.019), ('error', 0.019), ('rd', 0.019), ('recalling', 0.018), ('hundred', 0.018), ('basis', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999994 162 nips-2012-Inverse Reinforcement Learning through Structured Classification
Author: Edouard Klein, Matthieu Geist, Bilal Piot, Olivier Pietquin
Abstract: This paper addresses the inverse reinforcement learning (IRL) problem, that is inferring a reward for which a demonstrated expert behavior is optimal. We introduce a new algorithm, SCIRL, whose principle is to use the so-called feature expectation of the expert as the parameterization of the score function of a multiclass classifier. This approach produces a reward function for which the expert policy is provably near-optimal. Contrary to most existing IRL algorithms, SCIRL does not require solving the direct RL problem. Moreover, with an appropriate heuristic, it can succeed with only trajectories sampled according to the expert behavior. This is illustrated on a car driving simulator. 1
2 0.41523373 38 nips-2012-Algorithms for Learning Markov Field Policies
Author: Abdeslam Boularias, Jan R. Peters, Oliver B. Kroemer
Abstract: We use a graphical model for representing policies in Markov Decision Processes. This new representation can easily incorporate domain knowledge in the form of a state similarity graph that loosely indicates which states are supposed to have similar optimal actions. A bias is then introduced into the policy search process by sampling policies from a distribution that assigns high probabilities to policies that agree with the provided state similarity graph, i.e. smoother policies. This distribution corresponds to a Markov Random Field. We also present forward and inverse reinforcement learning algorithms for learning such policy distributions. We illustrate the advantage of the proposed approach on two problems: cart-balancing with swing-up, and teaching a robot to grasp unknown objects. 1
3 0.41210631 245 nips-2012-Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions
Author: Jaedeug Choi, Kee-eung Kim
Abstract: We present a nonparametric Bayesian approach to inverse reinforcement learning (IRL) for multiple reward functions. Most previous IRL algorithms assume that the behaviour data is obtained from an agent who is optimizing a single reward function, but this assumption is hard to guarantee in practice. Our approach is based on integrating the Dirichlet process mixture model into Bayesian IRL. We provide an efficient Metropolis-Hastings sampling algorithm utilizing the gradient of the posterior to estimate the underlying reward functions, and demonstrate that our approach outperforms previous ones via experiments on a number of problem domains. 1
4 0.37687904 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries
Author: Aaron Wilson, Alan Fern, Prasad Tadepalli
Abstract: We consider the problem of learning control policies via trajectory preference queries to an expert. In particular, the agent presents an expert with short runs of a pair of policies originating from the same state and the expert indicates which trajectory is preferred. The agent’s goal is to elicit a latent target policy from the expert with as few queries as possible. To tackle this problem we propose a novel Bayesian model of the querying process and introduce two methods that exploit this model to actively select expert queries. Experimental results on four benchmark problems indicate that our model can effectively learn policies from trajectory preference queries and that active query selection can be substantially more efficient than random selection. 1
5 0.30317867 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed
Author: Jiarong Jiang, Adam Teichert, Jason Eisner, Hal Daume
Abstract: Users want inference to be both fast and accurate, but quality often comes at the cost of speed. The field has experimented with approximate inference algorithms that make different speed-accuracy tradeoffs (for particular problems and datasets). We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing [12]. Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is simply too large to explore naively. An attempt to counteract this by applying imitation learning algorithms also fails: the “teacher” follows a far better policy than anything in our learner’s policy space, free of the speed-accuracy tradeoff that arises when oracle information is unavailable, and thus largely insensitive to the known reward function. We propose a hybrid reinforcement/apprenticeship learning algorithm that learns to speed up an initial policy, trading off accuracy for speed according to various settings of a speed term in the loss function. 1
6 0.24069692 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes
7 0.22018558 348 nips-2012-Tractable Objectives for Robust Policy Optimization
8 0.20486352 160 nips-2012-Imitation Learning by Coaching
9 0.19839442 344 nips-2012-Timely Object Recognition
10 0.19400989 364 nips-2012-Weighted Likelihood Policy Search with Model Selection
11 0.16316524 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes
12 0.15804318 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress
13 0.14994761 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning
14 0.14237027 153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model
15 0.14009349 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning
16 0.13773534 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search
17 0.13420416 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning
18 0.12576041 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method
19 0.1242004 297 nips-2012-Robustness and risk-sensitivity in Markov decision processes
20 0.11753377 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization
topicId topicWeight
[(0, 0.209), (1, -0.546), (2, -0.06), (3, -0.096), (4, -0.039), (5, 0.038), (6, 0.012), (7, -0.012), (8, 0.009), (9, -0.038), (10, 0.008), (11, -0.039), (12, 0.003), (13, 0.08), (14, 0.085), (15, -0.034), (16, 0.026), (17, 0.038), (18, 0.061), (19, -0.004), (20, -0.032), (21, -0.063), (22, 0.016), (23, 0.02), (24, -0.03), (25, 0.006), (26, -0.056), (27, -0.079), (28, -0.021), (29, -0.118), (30, -0.037), (31, 0.005), (32, 0.107), (33, -0.026), (34, 0.006), (35, -0.075), (36, 0.042), (37, -0.05), (38, -0.029), (39, -0.065), (40, 0.018), (41, 0.018), (42, -0.032), (43, 0.039), (44, 0.072), (45, 0.147), (46, -0.056), (47, -0.096), (48, 0.12), (49, -0.037)]
simIndex simValue paperId paperTitle
same-paper 1 0.95510077 162 nips-2012-Inverse Reinforcement Learning through Structured Classification
Author: Edouard Klein, Matthieu Geist, Bilal Piot, Olivier Pietquin
Abstract: This paper addresses the inverse reinforcement learning (IRL) problem, that is inferring a reward for which a demonstrated expert behavior is optimal. We introduce a new algorithm, SCIRL, whose principle is to use the so-called feature expectation of the expert as the parameterization of the score function of a multiclass classifier. This approach produces a reward function for which the expert policy is provably near-optimal. Contrary to most existing IRL algorithms, SCIRL does not require solving the direct RL problem. Moreover, with an appropriate heuristic, it can succeed with only trajectories sampled according to the expert behavior. This is illustrated on a car driving simulator. 1
2 0.85648501 245 nips-2012-Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions
Author: Jaedeug Choi, Kee-eung Kim
Abstract: We present a nonparametric Bayesian approach to inverse reinforcement learning (IRL) for multiple reward functions. Most previous IRL algorithms assume that the behaviour data is obtained from an agent who is optimizing a single reward function, but this assumption is hard to guarantee in practice. Our approach is based on integrating the Dirichlet process mixture model into Bayesian IRL. We provide an efficient Metropolis-Hastings sampling algorithm utilizing the gradient of the posterior to estimate the underlying reward functions, and demonstrate that our approach outperforms previous ones via experiments on a number of problem domains. 1
3 0.8560639 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries
Author: Aaron Wilson, Alan Fern, Prasad Tadepalli
Abstract: We consider the problem of learning control policies via trajectory preference queries to an expert. In particular, the agent presents an expert with short runs of a pair of policies originating from the same state and the expert indicates which trajectory is preferred. The agent’s goal is to elicit a latent target policy from the expert with as few queries as possible. To tackle this problem we propose a novel Bayesian model of the querying process and introduce two methods that exploit this model to actively select expert queries. Experimental results on four benchmark problems indicate that our model can effectively learn policies from trajectory preference queries and that active query selection can be substantially more efficient than random selection. 1
4 0.8495006 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed
Author: Jiarong Jiang, Adam Teichert, Jason Eisner, Hal Daume
Abstract: Users want inference to be both fast and accurate, but quality often comes at the cost of speed. The field has experimented with approximate inference algorithms that make different speed-accuracy tradeoffs (for particular problems and datasets). We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing [12]. Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is simply too large to explore naively. An attempt to counteract this by applying imitation learning algorithms also fails: the “teacher” follows a far better policy than anything in our learner’s policy space, free of the speed-accuracy tradeoff that arises when oracle information is unavailable, and thus largely insensitive to the known reward function. We propose a hybrid reinforcement/apprenticeship learning algorithm that learns to speed up an initial policy, trading off accuracy for speed according to various settings of a speed term in the loss function. 1
5 0.8446157 38 nips-2012-Algorithms for Learning Markov Field Policies
Author: Abdeslam Boularias, Jan R. Peters, Oliver B. Kroemer
Abstract: We use a graphical model for representing policies in Markov Decision Processes. This new representation can easily incorporate domain knowledge in the form of a state similarity graph that loosely indicates which states are supposed to have similar optimal actions. A bias is then introduced into the policy search process by sampling policies from a distribution that assigns high probabilities to policies that agree with the provided state similarity graph, i.e. smoother policies. This distribution corresponds to a Markov Random Field. We also present forward and inverse reinforcement learning algorithms for learning such policy distributions. We illustrate the advantage of the proposed approach on two problems: cart-balancing with swing-up, and teaching a robot to grasp unknown objects. 1
6 0.71625865 160 nips-2012-Imitation Learning by Coaching
7 0.70774877 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes
8 0.69692755 364 nips-2012-Weighted Likelihood Policy Search with Model Selection
9 0.6375013 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes
10 0.61672777 51 nips-2012-Bayesian Hierarchical Reinforcement Learning
11 0.61531687 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning
12 0.59952277 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning
13 0.58025324 350 nips-2012-Trajectory-Based Short-Sighted Probabilistic Planning
14 0.57686371 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning
15 0.5689261 348 nips-2012-Tractable Objectives for Robust Policy Optimization
16 0.55304754 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress
17 0.50193268 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method
18 0.4923301 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search
19 0.47496855 344 nips-2012-Timely Object Recognition
20 0.41790944 153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model
topicId topicWeight
[(0, 0.027), (21, 0.024), (35, 0.153), (36, 0.011), (38, 0.138), (42, 0.057), (53, 0.01), (54, 0.111), (55, 0.019), (62, 0.011), (74, 0.098), (76, 0.113), (80, 0.086), (92, 0.037)]
simIndex simValue paperId paperTitle
same-paper 1 0.85726529 162 nips-2012-Inverse Reinforcement Learning through Structured Classification
Author: Edouard Klein, Matthieu Geist, Bilal Piot, Olivier Pietquin
Abstract: This paper addresses the inverse reinforcement learning (IRL) problem, that is inferring a reward for which a demonstrated expert behavior is optimal. We introduce a new algorithm, SCIRL, whose principle is to use the so-called feature expectation of the expert as the parameterization of the score function of a multiclass classifier. This approach produces a reward function for which the expert policy is provably near-optimal. Contrary to most existing IRL algorithms, SCIRL does not require solving the direct RL problem. Moreover, with an appropriate heuristic, it can succeed with only trajectories sampled according to the expert behavior. This is illustrated on a car driving simulator. 1
2 0.8277269 213 nips-2012-Minimization of Continuous Bethe Approximations: A Positive Variation
Author: Jason Pacheco, Erik B. Sudderth
Abstract: We develop convergent minimization algorithms for Bethe variational approximations which explicitly constrain marginal estimates to families of valid distributions. While existing message passing algorithms define fixed point iterations corresponding to stationary points of the Bethe free energy, their greedy dynamics do not distinguish between local minima and maxima, and can fail to converge. For continuous estimation problems, this instability is linked to the creation of invalid marginal estimates, such as Gaussians with negative variance. Conversely, our approach leverages multiplier methods with well-understood convergence properties, and uses bound projection methods to ensure that marginal approximations are valid at all iterations. We derive general algorithms for discrete and Gaussian pairwise Markov random fields, showing improvements over standard loopy belief propagation. We also apply our method to a hybrid model with both discrete and continuous variables, showing improvements over expectation propagation. 1
3 0.82587188 344 nips-2012-Timely Object Recognition
Author: Sergey Karayev, Tobias Baumgartner, Mario Fritz, Trevor Darrell
Abstract: In a large visual multi-class detection framework, the timeliness of results can be crucial. Our method for timely multi-class detection aims to give the best possible performance at any single point after a start time; it is terminated at a deadline time. Toward this goal, we formulate a dynamic, closed-loop policy that infers the contents of the image in order to decide which detector to deploy next. In contrast to previous work, our method significantly diverges from the predominant greedy strategies, and is able to learn to take actions with deferred values. We evaluate our method with a novel timeliness measure, computed as the area under an Average Precision vs. Time curve. Experiments are conducted on the PASCAL VOC object detection dataset. If execution is stopped when only half the detectors have been run, our method obtains 66% better AP than a random ordering, and 14% better performance than an intelligent baseline. On the timeliness measure, our method obtains at least 11% better performance. Our method is easily extensible, as it treats detectors and classifiers as black boxes and learns from execution traces using reinforcement learning. 1
4 0.81213659 287 nips-2012-Random function priors for exchangeable arrays with applications to graphs and relational data
Author: James Lloyd, Peter Orbanz, Zoubin Ghahramani, Daniel M. Roy
Abstract: A fundamental problem in the analysis of structured relational data like graphs, networks, databases, and matrices is to extract a summary of the common structure underlying relations between individual entities. Relational data are typically encoded in the form of arrays; invariance to the ordering of rows and columns corresponds to exchangeable arrays. Results in probability theory due to Aldous, Hoover and Kallenberg show that exchangeable arrays can be represented in terms of a random measurable function which constitutes the natural model parameter in a Bayesian model. We obtain a flexible yet simple Bayesian nonparametric model by placing a Gaussian process prior on the parameter function. Efficient inference utilises elliptical slice sampling combined with a random sparse approximation to the Gaussian process. We demonstrate applications of the model to network data and clarify its relation to models in the literature, several of which emerge as special cases. 1
5 0.81113476 177 nips-2012-Learning Invariant Representations of Molecules for Atomization Energy Prediction
Author: Grégoire Montavon, Katja Hansen, Siamac Fazli, Matthias Rupp, Franziska Biegler, Andreas Ziehe, Alexandre Tkatchenko, Anatole V. Lilienfeld, Klaus-Robert Müller
Abstract: The accurate prediction of molecular energetics in chemical compound space is a crucial ingredient for rational compound design. The inherently graph-like, non-vectorial nature of molecular data gives rise to a unique and difficult machine learning problem. In this paper, we adopt a learning-from-scratch approach where quantum-mechanical molecular energies are predicted directly from the raw molecular geometry. The study suggests a benefit from setting flexible priors and enforcing invariance stochastically rather than structurally. Our results improve the state-of-the-art by a factor of almost three, bringing statistical methods one step closer to chemical accuracy. 1
6 0.80745494 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines
7 0.80143094 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions
8 0.79673356 70 nips-2012-Clustering by Nonnegative Matrix Factorization Using Graph Random Walk
9 0.79604167 38 nips-2012-Algorithms for Learning Markov Field Policies
10 0.78820986 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed
11 0.78513205 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search
12 0.78102767 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration
13 0.77910393 160 nips-2012-Imitation Learning by Coaching
14 0.77567005 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes
15 0.77173108 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes
16 0.77082896 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning
17 0.76799434 348 nips-2012-Tractable Objectives for Robust Policy Optimization
18 0.76734632 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization
19 0.76546252 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries
20 0.7652691 303 nips-2012-Searching for objects driven by context