nips nips2012 nips2012-3 knowledge-graph by maker-knowledge-mining

3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries


Source: pdf

Author: Aaron Wilson, Alan Fern, Prasad Tadepalli

Abstract: We consider the problem of learning control policies via trajectory preference queries to an expert. In particular, the agent presents an expert with short runs of a pair of policies originating from the same state and the expert indicates which trajectory is preferred. The agent’s goal is to elicit a latent target policy from the expert with as few queries as possible. To tackle this problem we propose a novel Bayesian model of the querying process and introduce two methods that exploit this model to actively select expert queries. Experimental results on four benchmark problems indicate that our model can effectively learn policies from trajectory preference queries and that active query selection can be substantially more efficient than random selection. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 In particular, the agent presents an expert with short runs of a pair of policies originating from the same state and the expert indicates which trajectory is preferred. [sent-2, score-1.499]

2 The agent’s goal is to elicit a latent target policy from the expert with as few queries as possible. [sent-3, score-1.042]

3 To tackle this problem we propose a novel Bayesian model of the querying process and introduce two methods that exploit this model to actively select expert queries. [sent-4, score-0.493]

4 Experimental results on four benchmark problems indicate that our model can effectively learn policies from trajectory preference queries and that active query selection can be substantially more efficient than random selection. [sent-5, score-0.963]

5 Successful implementation requires expert knowledge of the target system and a means of communicating control knowledge to the agent. [sent-7, score-0.49]

6 One way the expert can communicate the desired behavior is to directly demonstrate it and have the agent learn from the demonstrations, e. [sent-8, score-0.597]

7 In these cases an expert may still recognize when an agent’s behavior matches a desired behavior, or is close to it, even if it is difficult to directly demonstrate it. [sent-12, score-0.411]

8 In such cases an expert may also be able to evaluate the relative quality of a pair of example trajectories with respect to the desired behavior and express a preference for one or the other. [sent-13, score-0.718]

9 Given this motivation, we study the problem of learning expert policies via trajectory preference queries to an expert. [sent-14, score-1.003]

10 A trajectory preference query (TPQ) is a pair of short state trajectories originating from a common state. [sent-15, score-0.854]

11 Given a TPQ the expert is asked to indicate which trajectory is most similar to the target behavior. [sent-16, score-0.677]

12 Our first contribution (Section 3) is to introduce a Bayesian model of the querying process along with an inference approach for sampling policies from the posterior given a set of TPQs and their expert responses. [sent-18, score-0.702]

13 Our second contribution (Section 4) is to describe two active query strategies that attempt to leverage the model in order to minimize the number of queries required. [sent-19, score-0.439]

14 In [5], action preferences were introduced into classification-based policy iteration. [sent-23, score-0.495]

15 Further the approach also relies on knowledge of the reward function, while our work derives all information about the target policy by actively querying an expert. [sent-32, score-0.597]

16 [1] consider the problem of learning a policy from expert queries. [sent-34, score-0.784]

17 However, their queries require the expert to express preferences over approximate state visitation densities and to possess knowledge of the expected performance of demonstrated policies. [sent-36, score-0.703]

18 We remove this requirement and only require short demonstrations; our expert assesses trajectory snippets not whole solutions. [sent-38, score-0.617]

19 2 Preliminaries: We explore policy learning from expert preferences in the framework of Markov Decision Processes (MDPs). [sent-40, score-0.826]

20 An MDP is a tuple (S, A, T, P0, R) with state space S, action space A, and state transition distribution T, which gives the probability T(s, a, s′) of transitioning to state s′ when action a is taken in state s. [sent-41, score-0.416]

21 Note that in this work, the agent will not be able to observe rewards and rather must gather all information about the quality of policies via interaction with an expert. [sent-44, score-0.35]

22 We consider agents that select actions using a policy πθ parameterized by θ, which is a stochastic mapping from states to actions Pπ (a|s, θ). [sent-45, score-0.509]

23 For example, in our experiments, we use a log-linear policy representation, where the parameters correspond to coefficients of features defined over state-action pairs. [sent-46, score-0.399]

24 It follows from the definitions above that the probability of generating a K-length trajectory ξ, given that the agent executes policy πθ starting from state s0, is P(ξ|θ, s0) = ∏_{t=1}^{K} T(s_{t−1}, a_{t−1}, s_t) · Pπ(a_{t−1}|s_{t−1}, θ). [sent-52, score-0.927]
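
To make this factorization concrete, here is a minimal sketch (not from the paper) that accumulates the trajectory log-probability; `policy_log_prob` and `transition_log_prob` are hypothetical callables standing in for Pπ and T.

```python
def trajectory_log_prob(states, actions, theta, policy_log_prob, transition_log_prob):
    """log P(xi | theta, s0) for a K-step trajectory, following the factorization
    P(xi | theta, s0) = prod_{t=1..K} T(s_{t-1}, a_{t-1}, s_t) * P_pi(a_{t-1} | s_{t-1}, theta).

    states  : sequence s_0, ..., s_K (length K+1)
    actions : sequence a_0, ..., a_{K-1} (length K)
    policy_log_prob(a, s, theta)      -> log P_pi(a | s, theta)   (hypothetical)
    transition_log_prob(s, a, s_next) -> log T(s, a, s_next)      (hypothetical)
    """
    logp = 0.0
    for t in range(1, len(states)):
        logp += policy_log_prob(actions[t - 1], states[t - 1], theta)
        logp += transition_log_prob(states[t - 1], actions[t - 1], states[t])
    return logp
```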

25 They are an intuitive means of communicating policy information. [sent-54, score-0.422]

26 Trajectories have the advantage that the expert need not share a language with the agent. [sent-55, score-0.385]

27 Instead the expert is only required to recognize differences in physical performances presented by the agent. [sent-56, score-0.385]

28 For purposes of generating trajectories we assume that our learner is provided with a strong simulator (or generative model) of the MDP dynamics, which takes as input a start state s, a policy π, and a value K, and outputs a sampled length K trajectory of π starting in s. [sent-57, score-1.02]

29 In this work, we evaluate policies in an episodic setting where an episode starts by drawing an initial state from P0 and then executing the policy for a finite horizon T . [sent-58, score-0.682]

30 The goal of the learner is to select a policy whose value is close to that of an expert’s policy. [sent-60, score-0.469]

31 In order to learn a policy, the agent presents trajectory preference queries (TPQs) to the expert and receives responses back. [sent-62, score-1.087]

32 Typically K will be much smaller than the horizon T , which is important from the perspective of expert usability. [sent-64, score-0.385]

33 Having been provided with a TPQ, the expert gives a response y indicating which trajectory is preferred. [sent-65, score-0.66]

34 Intuitively, the preferred trajectory is the one that is most similar to what the expert’s policy would have produced from the same starting state. [sent-67, score-0.609]

35 As detailed further in the next section, this is modeled by assuming that the expert has a (noisy) evaluation function f(·). [sent-68, score-0.412]

36 We assume that the expert’s evaluation function is a function of the observed trajectories and a latent target policy θ∗ . [sent-70, score-0.721]

37 3 Bayesian Model and Inference: In this section we first describe a Bayesian model of the expert response process, which will be used to: 1) infer policies based on expert responses to TPQs, and 2) guide the selection of TPQs. [sent-71, score-1.148]

38 Next, we describe a posterior sampling method for this model which is used for both policy inference and TPQ selection. [sent-72, score-0.505]

39 Expert Response Model: The model for the expert response y given a TPQ (ξi, ξj) decomposes as P(y|(ξi, ξj), θ*) P(θ*), where P(θ*) is a prior over the latent expert policy and P(y|(ξi, ξj), θ*) is a response distribution conditioned on the TPQ and the expert policy. [sent-74, score-1.352]

40 The conditional response distribution is represented in terms of an expert evaluation function f ∗ (ξi , ξj , θ∗ ), described in detail below, which translates a TPQ and a candidate expert policy θ∗ into a measure of preference for trajectory ξi over ξj . [sent-77, score-1.591]

41 Intuitively, f ∗ measures the degree to which the policy θ∗ agrees with ξi relative to ξj . [sent-78, score-0.422]

42 To translate the evaluation into an expert response we borrow from previous work [6]. [sent-79, score-0.5]

43 In particular, we assume the expert response is given by the indicator I(f*(ξi, ξj, θ*) > ε), where ε ∼ N(0, σr²). [sent-80, score-0.45]

44 This formulation allows the expert to err when demonstrated trajectories are difficult to distinguish as measured by the magnitude of the evaluation function f ∗ . [sent-85, score-0.596]
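
Under this response model, since ε is Gaussian, the probability of preferring ξi is Φ(f*(ξi, ξj, θ*)/σr), so near-zero evaluations yield near-random responses. Below is a minimal sketch of that likelihood and of a simulated response, assuming the value of f* is supplied by the caller; it is an illustration, not the paper's implementation.

```python
import numpy as np
from scipy.stats import norm

def prefer_xi_probability(f_value, sigma_r):
    """P(y = 'xi preferred' | f*, sigma_r) under y = I(f* > eps), eps ~ N(0, sigma_r^2).

    Since eps is Gaussian, P(f* > eps) = Phi(f* / sigma_r): evaluations with small
    magnitude give near-50/50 responses, which is how expert error is modeled.
    """
    return norm.cdf(f_value / sigma_r)

def sample_expert_response(f_value, sigma_r, rng=None):
    """Simulate one noisy expert response: 1 = prefer xi, 0 = prefer xj."""
    rng = rng or np.random.default_rng()
    return int(f_value > rng.normal(0.0, sigma_r))
```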

45 Intuitively the evaluation function must combine distances between the query trajectories and trajectories generated by the latent target policy. [sent-88, score-0.723]

46 We say that a latent policy and query trajectory are in agreement when they produce similar trajectories. [sent-89, score-0.832]

47 Given the trajectory comparison function, we now encode a dissimilarity measure between the latent target policy and an observed trajectory ξi . [sent-92, score-0.982]

48 To do this, let ξ* be a random variable ranging over K-length trajectories generated by the target policy θ* starting in the start state of ξi. [sent-93, score-0.786]

49 The dissimilarity measure is given by d(ξi, θ*) = E[f(ξi, ξ*)]. This function computes the expected dissimilarity between a query trajectory ξi and the K-length trajectories generated by the latent policy from the same initial state. [sent-94, score-1.231]
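
Because the expectation over ξ* has no closed form, it is natural to estimate d(ξi, θ*) with Monte Carlo rollouts of the latent policy from the same start state. The sketch below assumes hypothetical `simulate(s0, theta, K)` and trajectory comparison `f(xi_a, xi_b)` helpers; it only illustrates the estimator, not the paper's exact code.

```python
def dissimilarity(xi, s0, theta_star, simulate, f, n_rollouts=20):
    """Monte Carlo estimate of d(xi, theta*) = E[ f(xi, xi_star) ], where xi_star are
    K-length trajectories generated by the latent policy theta* from the same start state.

    simulate(s0, theta, K) -> sampled K-step trajectory  (hypothetical simulator)
    f(xi_a, xi_b)          -> trajectory dissimilarity   (hypothetical comparison fn)
    """
    K = len(xi) - 1                      # xi holds states s_0 ... s_K
    total = 0.0
    for _ in range(n_rollouts):
        total += f(xi, simulate(s0, theta_star, K))
    return total / n_rollouts
```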

50 We leave the definition of log(Pπ(a_k^i | s_k^i, θ*)) for the experimental results section, where we describe a specific kind of stochastic policy space. [sent-116, score-0.399]

51 Given this gradient calculation, we can apply HMC in order to sample policy parameter vectors from the posterior distribution. [sent-117, score-0.509]

52 This can be used for policy selection in a number of ways. [sent-118, score-0.46]

53 For example, a policy could be formed via Bayesian averaging. [sent-119, score-0.399]

54 In our experiments, we select a policy by generating a large set of samples and then selecting the sample that maximizes the energy function. [sent-120, score-0.524]
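
A sketch of that last selection step, assuming `hmc_samples` is a list of parameter vectors drawn by HMC and `log_posterior(theta, data)` is a hypothetical function returning the unnormalized log-posterior (the negative energy):

```python
import numpy as np

def select_map_policy(hmc_samples, log_posterior, data):
    """From a set of HMC draws, return the parameter vector with the highest
    unnormalized log-posterior (equivalently, the lowest energy).

    log_posterior(theta, data) -> unnormalized log P(theta | data)  (hypothetical)
    """
    scores = [log_posterior(theta, data) for theta in hmc_samples]
    return hmc_samples[int(np.argmax(scores))]
```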

55 This selection problem is difficult due to the high dimensional continuous space of TPQs, where each TPQ is defined by an initial state and two trajectories originating from the state. [sent-124, score-0.408]

56 avoiding states where the bicycle has crashed) or simply specify bounds on each dimension of the state space and generate states uniformly within the bounds. [sent-129, score-0.387]

57 If the sampled classifiers disagree on the class of the example, then the algorithm queries the expert for the class label. [sent-134, score-0.532]

58 In particular, we generate a sequence of potential initial TPQ states from P̂0 and, for each, draw two policies θi and θj from the current posterior distribution P(θ*|D). [sent-137, score-0.348]

59 If the policies “disagree” on the state, then a query is posed based on trajectories generated by the policies. [sent-138, score-0.594]

60 Disagreement on an initial state s0 is measured according to the expected difference between K length trajectories generated by θi and θj starting at s0 . [sent-139, score-0.371]

61 If this measure exceeds a threshold, then a TPQ is generated and given to the expert by running each policy for K steps from the initial state. [sent-141, score-0.849]

62 If no query is posed after a specified number of initial states, then the state and policy pair that generated the most disagreement are used to generate the TPQ. [sent-143, score-0.87]
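
Putting these pieces together, the query-by-disagreement procedure can be summarized roughly as below; `P0_sample`, `posterior_sample`, `simulate`, and `trajectory_distance` are hypothetical stand-ins for components described above, and the paper's exact disagreement measure may differ.

```python
import numpy as np

def select_tpq_by_disagreement(P0_sample, posterior_sample, simulate,
                               trajectory_distance, K, threshold,
                               max_candidates=50, n_rollouts=10):
    """Query-by-disagreement sketch: pose a TPQ at the first candidate start state
    where two policies drawn from the posterior disagree enough; if no candidate
    exceeds the threshold, fall back to the most-disagreeing pair seen."""
    best, best_score = None, -np.inf
    for _ in range(max_candidates):
        s0 = P0_sample()                                  # candidate initial state
        theta_i, theta_j = posterior_sample(), posterior_sample()
        # expected difference between K-step trajectories of the two policies
        score = np.mean([trajectory_distance(simulate(s0, theta_i, K),
                                             simulate(s0, theta_j, K))
                         for _ in range(n_rollouts)])
        if score > best_score:
            best, best_score = (s0, theta_i, theta_j), score
        if score > threshold:
            break
    s0, theta_i, theta_j = best
    # the TPQ shown to the expert: one K-step run of each policy from s0
    return simulate(s0, theta_i, K), simulate(s0, theta_j, K)
```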

63 This is important from a usability perspective, since making preference judgements between similar trajectories can be difficult and error prone for an expert. [sent-147, score-0.666]

64 In practice we observe that the QBD strategy often generates TPQs based on policy pairs that are from different modes of the distribution, which is an intuitively appealing property. [sent-148, score-0.429]

65 A set of candidate TPQs is generated by first drawing an initial state from P̂0, sampling a pair of policies from the posterior, and then running the policies for K steps from the initial state. [sent-154, score-0.582]

66 A truly Bayesian heuristic selection strategy should account for the overall change in belief about the latent target policy after adding a new data point. [sent-156, score-0.624]

67 By integrating over the entire latent policy space it accounts for the total impact of the query on the agent’s beliefs. [sent-159, score-0.622]
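
These excerpts do not give the exact formula for such a score, but one hedged reading of an "expected belief change" heuristic is: for each candidate TPQ, average over the two possible expert responses (weighted by their predictive probability) how much the response would re-weight a fixed set of posterior policy samples. The sketch below follows that reading; `response_likelihood` is a hypothetical stand-in for the model's P(y | TPQ, θ).

```python
import numpy as np

def expected_belief_change(candidate_tpq, policy_samples, response_likelihood):
    """Heuristic score for a candidate TPQ (a sketch only; not the paper's exact formula).

    response_likelihood(y, tpq, theta) -> P(y | tpq, theta) for y in {0, 1} (hypothetical).
    The score averages, over the predictive distribution of the expert's response,
    how much that response would re-weight a fixed set of posterior policy samples.
    """
    n = len(policy_samples)
    prior_w = np.full(n, 1.0 / n)                 # current (equal) sample weights
    score = 0.0
    for y in (0, 1):
        lik = np.array([response_likelihood(y, candidate_tpq, th) for th in policy_samples])
        p_y = float(lik.mean())                   # predictive probability of response y
        if p_y <= 0.0:
            continue
        post_w = prior_w * lik
        post_w /= post_w.sum()                    # re-weighted samples given response y
        score += p_y * np.abs(post_w - prior_w).sum()   # total-variation-style change
    return score
```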

68 Setup: If the posterior distribution focuses mass on the expert policy parameters, the expected value of the MAP parameters will converge to the expected value of the expert's policy. [sent-172, score-0.917]

69 Therefore, to examine the speed of convergence to the desired expert policy we report the performance of the MAP policy in the MDP task. [sent-173, score-1.209]

70 The expected return of the selected policy is estimated and reported. [sent-175, score-0.423]

71 We produce an automated expert capable of responding to the queries produced by our agent. [sent-177, score-0.532]

72 The expert knows a target policy, and compares, as described above, the query trajectories generated by the agent to the trajectories generated by the target policy. [sent-178, score-1.378]

73 The expert stochastically produces a response based on its evaluations. [sent-179, score-0.45]

74 The policy is executed by sampling an action from this distribution. [sent-184, score-0.474]

75 The gradient of this action selection policy can be derived straightforwardly and substituted into the gradient of the energy function required by our HMC procedure. [sent-185, score-0.615]
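
For a log-linear (softmax) policy, both the action sampling above and the gradient that enters the HMC energy have standard closed forms. The sketch below assumes a hypothetical feature map `phi(s, a)` and a finite action set; it illustrates the form, not the paper's exact parameterization.

```python
import numpy as np

def softmax_policy_probs(s, theta, actions, phi):
    """P_pi(a | s, theta) proportional to exp(theta . phi(s, a)) -- a log-linear policy.
    phi(s, a) -> feature vector (hypothetical feature map)."""
    logits = np.array([theta @ phi(s, a) for a in actions])
    logits -= logits.max()                        # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def sample_action(s, theta, actions, phi, rng=None):
    """Execute the policy by sampling an action from the softmax distribution."""
    rng = rng or np.random.default_rng()
    p = softmax_policy_probs(s, theta, actions, phi)
    return actions[rng.choice(len(actions), p=p)]

def grad_log_policy(s, a, theta, actions, phi):
    """grad_theta log P_pi(a | s, theta) = phi(s, a) - E_{a' ~ pi}[phi(s, a')],
    the term substituted into the gradient of the HMC energy function."""
    p = softmax_policy_probs(s, theta, actions, phi)
    expected_phi = sum(p_a * phi(s, a_prime) for p_a, a_prime in zip(p, actions))
    return phi(s, a) - expected_phi
```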

76 The random strategy draws an initial TPQ state from P0 and then generates a trajectory pair by executing two policies drawn i.i.d. [sent-190, score-0.549]

77 Our expert knows a policy for swinging the acrobot into a balanced handstand. [sent-203, score-0.987]

78 The acrobot is controlled by a 12 dimensional softmax policy selecting between positive, negative, and zero torque to be applied at the hip joint. [sent-205, score-0.626]

79 The mountain car domain simulates an underpowered vehicle which the agent must drive to the top of a steep hill. [sent-209, score-0.371]

80 The goal of the agent controlling the mountain car system is to utilize the hills surrounding the car to generate sufficient energy to escape a basin. [sent-211, score-0.44]

81 Our expert knows a policy for performing this escape. [sent-212, score-0.819]

82 The agent’s softmax control policy space has 16 dimensions and selects between positive and negative accelerations of the car. [sent-213, score-0.399]

83 In the cart-pole domain the agent attempts to balance a pole fixed to a movable cart while maintaining the cart's location in space. [sent-217, score-0.38]

84 The state space is composed of the cart velocity v, the change in cart velocity v̇, the angle of the pole ω, and the angular velocity of the pole ω̇. [sent-219, score-0.631]

85 The control policy has 12 dimensions and selects the magnitude of the change in velocity (positive or negative) applied to the base of the cart. [sent-220, score-0.451]

86 Agents in the bicycle balancing task must keep the bicycle balanced for 30000 steps. [sent-224, score-0.496]

87 The goal of the agent is to keep the bicycle from falling. [sent-229, score-0.407]

88 In both of these domains, examining the generated query trajectories shows that the Random strategy tended to produce difficult-to-distinguish trajectory data, and that later queries tended to resemble earlier queries. [sent-252, score-0.858]

89 This is due to “plateaus” in the policy space which produce nearly identical behaviors. [sent-253, score-0.399]

90 Figure 1: Results: We report the expected return of the MAP policy, sampled during Hybrid MCMC simulation of the posterior, as a function of the number of expert queries. [sent-255, score-0.409]

91 By comparison, the selection heuristics ensure that selected queries have high impact on the posterior distribution and exhibit high query diversity. [sent-258, score-0.487]

92 In Acrobot, all of the query selection procedures quickly converge to the target policy (only 25 queries are needed for Random to identify the target). [sent-262, score-0.883]

93 We believe this is due to the length of the query trajectories (300) and the importance of the initial state distribution. [sent-265, score-0.518]

94 Most bicycle configurations lead to out-of-control spirals from which no policy can return the bicycle to balance. [sent-266, score-0.841]

95 In these configurations, inputs from the agent have little impact on the observed state trajectory, making policies difficult to distinguish. [sent-267, score-0.637]

96 In these configurations poor balancing policies are easily distinguished from better policies and the better policies are not rare. [sent-269, score-0.521]

97 6 Summary: We examined the problem of learning a target policy via trajectory preference queries. [sent-274, score-0.788]

98 Two query selection methods were introduced, which heuristically select queries with an aim to efficiently identify the target. [sent-276, score-0.458]

99 Experiments in four RL benchmarks indicate that our model and inference approach is able to infer quality policies and that the query selection methods are generally more effective than random selection. [sent-277, score-0.419]

100 Preference-based policy iteration: Leveraging preference learning for reinforcement learning. [sent-302, score-0.528]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('policy', 0.399), ('expert', 0.385), ('tpq', 0.324), ('tpqs', 0.306), ('bicycle', 0.221), ('trajectory', 0.21), ('query', 0.194), ('agent', 0.186), ('trajectories', 0.184), ('policies', 0.164), ('queries', 0.147), ('acrobot', 0.143), ('hmc', 0.105), ('preference', 0.097), ('qbd', 0.09), ('pole', 0.088), ('posterior', 0.085), ('target', 0.082), ('cart', 0.079), ('state', 0.077), ('ebc', 0.072), ('mountain', 0.068), ('active', 0.068), ('car', 0.066), ('response', 0.065), ('selection', 0.061), ('demonstrations', 0.055), ('disagreement', 0.055), ('action', 0.054), ('dissimilarity', 0.052), ('velocity', 0.052), ('torque', 0.048), ('querying', 0.047), ('agents', 0.045), ('originating', 0.044), ('preferences', 0.042), ('selective', 0.042), ('initial', 0.042), ('simulator', 0.041), ('reward', 0.041), ('mdp', 0.039), ('decomposes', 0.038), ('learner', 0.037), ('handlebars', 0.036), ('hip', 0.036), ('knows', 0.035), ('responses', 0.034), ('oregon', 0.034), ('select', 0.033), ('states', 0.032), ('angle', 0.032), ('angular', 0.032), ('reinforcement', 0.032), ('ak', 0.031), ('eecs', 0.031), ('strategies', 0.03), ('generating', 0.03), ('strategy', 0.03), ('posed', 0.029), ('latent', 0.029), ('balancing', 0.029), ('energy', 0.029), ('receives', 0.028), ('actively', 0.028), ('visitation', 0.028), ('variational', 0.027), ('domain', 0.027), ('evaluation', 0.027), ('desired', 0.026), ('sk', 0.026), ('pair', 0.026), ('gradient', 0.025), ('generate', 0.025), ('executes', 0.025), ('gurations', 0.025), ('balanced', 0.025), ('expected', 0.024), ('simulates', 0.024), ('domains', 0.024), ('mcmc', 0.024), ('hybrid', 0.024), ('generated', 0.023), ('candidate', 0.023), ('agrees', 0.023), ('borrow', 0.023), ('communicating', 0.023), ('heuristic', 0.023), ('heuristically', 0.023), ('tended', 0.023), ('computes', 0.022), ('short', 0.022), ('returns', 0.022), ('seung', 0.022), ('straightforwardly', 0.022), ('benchmark', 0.022), ('cult', 0.022), ('purposes', 0.021), ('bayesian', 0.021), ('sampling', 0.021), ('length', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000006 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

Author: Aaron Wilson, Alan Fern, Prasad Tadepalli

Abstract: We consider the problem of learning control policies via trajectory preference queries to an expert. In particular, the agent presents an expert with short runs of a pair of policies originating from the same state and the expert indicates which trajectory is preferred. The agent’s goal is to elicit a latent target policy from the expert with as few queries as possible. To tackle this problem we propose a novel Bayesian model of the querying process and introduce two methods that exploit this model to actively select expert queries. Experimental results on four benchmark problems indicate that our model can effectively learn policies from trajectory preference queries and that active query selection can be substantially more efficient than random selection. 1

2 0.37687904 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

Author: Edouard Klein, Matthieu Geist, Bilal Piot, Olivier Pietquin

Abstract: This paper addresses the inverse reinforcement learning (IRL) problem, that is inferring a reward for which a demonstrated expert behavior is optimal. We introduce a new algorithm, SCIRL, whose principle is to use the so-called feature expectation of the expert as the parameterization of the score function of a multiclass classifier. This approach produces a reward function for which the expert policy is provably near-optimal. Contrary to most of existing IRL algorithms, SCIRL does not require solving the direct RL problem. Moreover, with an appropriate heuristic, it can succeed with only trajectories sampled according to the expert behavior. This is illustrated on a car driving simulator. 1

3 0.3077229 38 nips-2012-Algorithms for Learning Markov Field Policies

Author: Abdeslam Boularias, Jan R. Peters, Oliver B. Kroemer

Abstract: We use a graphical model for representing policies in Markov Decision Processes. This new representation can easily incorporate domain knowledge in the form of a state similarity graph that loosely indicates which states are supposed to have similar optimal actions. A bias is then introduced into the policy search process by sampling policies from a distribution that assigns high probabilities to policies that agree with the provided state similarity graph, i.e. smoother policies. This distribution corresponds to a Markov Random Field. We also present forward and inverse reinforcement learning algorithms for learning such policy distributions. We illustrate the advantage of the proposed approach on two problems: cart-balancing with swing-up, and teaching a robot to grasp unknown objects. 1

4 0.26629806 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

Author: Bruno Scherrer, Boris Lesner

Abstract: We consider infinite-horizon stationary γ-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error at each iteration, it is well-known that one can compute stationary policies that are 2γ/(1−γ)²-optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for computing non-stationary policies that can be up to 2γ/(1−γ)-optimal, which constitutes a significant improvement in the usual situation when γ is close to 1. Surprisingly, this shows that the problem of “computing near-optimal non-stationary policies” is much simpler than that of “computing near-optimal stationary policies”. 1

5 0.24984069 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

Author: Jiarong Jiang, Adam Teichert, Jason Eisner, Hal Daume

Abstract: Users want inference to be both fast and accurate, but quality often comes at the cost of speed. The field has experimented with approximate inference algorithms that make different speed-accuracy tradeoffs (for particular problems and datasets). We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing [12]. Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is simply too large to explore naively. An attempt to counteract this by applying imitation learning algorithms also fails: the “teacher” follows a far better policy than anything in our learner’s policy space, free of the speed-accuracy tradeoff that arises when oracle information is unavailable, and thus largely insensitive to the known reward function. We propose a hybrid reinforcement/apprenticeship learning algorithm that learns to speed up an initial policy, trading off accuracy for speed according to various settings of a speed term in the loss function. 1

6 0.21629518 348 nips-2012-Tractable Objectives for Robust Policy Optimization

7 0.2106773 160 nips-2012-Imitation Learning by Coaching

8 0.19063318 245 nips-2012-Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions

9 0.18876754 344 nips-2012-Timely Object Recognition

10 0.16788839 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

11 0.16537978 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

12 0.13989511 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes

13 0.13594228 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

14 0.13429007 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

15 0.12868324 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

16 0.12058304 296 nips-2012-Risk Aversion in Markov Decision Processes via Near Optimal Chernoff Bounds

17 0.11697298 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method

18 0.10559835 51 nips-2012-Bayesian Hierarchical Reinforcement Learning

19 0.098179415 153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model

20 0.096115313 183 nips-2012-Learning Partially Observable Models Using Temporally Abstract Decision Trees


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.208), (1, -0.469), (2, -0.065), (3, -0.055), (4, -0.094), (5, 0.001), (6, 0.012), (7, -0.004), (8, 0.037), (9, -0.048), (10, -0.041), (11, -0.02), (12, -0.003), (13, 0.059), (14, 0.043), (15, -0.014), (16, 0.047), (17, -0.023), (18, 0.047), (19, 0.053), (20, -0.074), (21, 0.025), (22, 0.04), (23, 0.018), (24, 0.024), (25, -0.001), (26, -0.059), (27, 0.0), (28, -0.009), (29, -0.116), (30, -0.052), (31, -0.03), (32, 0.027), (33, 0.016), (34, -0.017), (35, -0.074), (36, -0.018), (37, -0.063), (38, 0.024), (39, -0.006), (40, 0.033), (41, 0.036), (42, -0.061), (43, 0.076), (44, 0.064), (45, 0.075), (46, 0.017), (47, -0.004), (48, 0.092), (49, -0.072)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96415597 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

Author: Aaron Wilson, Alan Fern, Prasad Tadepalli

Abstract: We consider the problem of learning control policies via trajectory preference queries to an expert. In particular, the agent presents an expert with short runs of a pair of policies originating from the same state and the expert indicates which trajectory is preferred. The agent’s goal is to elicit a latent target policy from the expert with as few queries as possible. To tackle this problem we propose a novel Bayesian model of the querying process and introduce two methods that exploit this model to actively select expert queries. Experimental results on four benchmark problems indicate that our model can effectively learn policies from trajectory preference queries and that active query selection can be substantially more efficient than random selection. 1

2 0.87748611 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

Author: Edouard Klein, Matthieu Geist, Bilal Piot, Olivier Pietquin

Abstract: This paper addresses the inverse reinforcement learning (IRL) problem, that is inferring a reward for which a demonstrated expert behavior is optimal. We introduce a new algorithm, SCIRL, whose principle is to use the so-called feature expectation of the expert as the parameterization of the score function of a multiclass classifier. This approach produces a reward function for which the expert policy is provably near-optimal. Contrary to most of existing IRL algorithms, SCIRL does not require solving the direct RL problem. Moreover, with an appropriate heuristic, it can succeed with only trajectories sampled according to the expert behavior. This is illustrated on a car driving simulator. 1

3 0.8643502 38 nips-2012-Algorithms for Learning Markov Field Policies

Author: Abdeslam Boularias, Jan R. Peters, Oliver B. Kroemer

Abstract: We use a graphical model for representing policies in Markov Decision Processes. This new representation can easily incorporate domain knowledge in the form of a state similarity graph that loosely indicates which states are supposed to have similar optimal actions. A bias is then introduced into the policy search process by sampling policies from a distribution that assigns high probabilities to policies that agree with the provided state similarity graph, i.e. smoother policies. This distribution corresponds to a Markov Random Field. We also present forward and inverse reinforcement learning algorithms for learning such policy distributions. We illustrate the advantage of the proposed approach on two problems: cart-balancing with swing-up, and teaching a robot to grasp unknown objects. 1

4 0.84267443 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

Author: Jiarong Jiang, Adam Teichert, Jason Eisner, Hal Daume

Abstract: Users want inference to be both fast and accurate, but quality often comes at the cost of speed. The field has experimented with approximate inference algorithms that make different speed-accuracy tradeoffs (for particular problems and datasets). We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing [12]. Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is simply too large to explore naively. An attempt to counteract this by applying imitation learning algorithms also fails: the “teacher” follows a far better policy than anything in our learner’s policy space, free of the speed-accuracy tradeoff that arises when oracle information is unavailable, and thus largely insensitive to the known reward function. We propose a hybrid reinforcement/apprenticeship learning algorithm that learns to speed up an initial policy, trading off accuracy for speed according to various settings of a speed term in the loss function. 1

5 0.76867348 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

Author: Bruno Scherrer, Boris Lesner

Abstract: We consider infinite-horizon stationary γ-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error at each iteration, it is well-known that one can compute stationary policies that are 2γ/(1−γ)²-optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for computing non-stationary policies that can be up to 2γ/(1−γ)-optimal, which constitutes a significant improvement in the usual situation when γ is close to 1. Surprisingly, this shows that the problem of “computing near-optimal non-stationary policies” is much simpler than that of “computing near-optimal stationary policies”. 1

6 0.75427616 245 nips-2012-Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions

7 0.73738903 160 nips-2012-Imitation Learning by Coaching

8 0.73602766 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

9 0.66331577 350 nips-2012-Trajectory-Based Short-Sighted Probabilistic Planning

10 0.6456697 348 nips-2012-Tractable Objectives for Robust Policy Optimization

11 0.64540714 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes

12 0.64474285 51 nips-2012-Bayesian Hierarchical Reinforcement Learning

13 0.64082927 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

14 0.60538626 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

15 0.60377175 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method

16 0.59415764 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

17 0.58212233 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

18 0.52893126 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

19 0.52135962 344 nips-2012-Timely Object Recognition

20 0.4895151 296 nips-2012-Risk Aversion in Markov Decision Processes via Near Optimal Chernoff Bounds


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.019), (17, 0.01), (21, 0.015), (38, 0.076), (42, 0.021), (53, 0.018), (54, 0.06), (55, 0.015), (74, 0.423), (76, 0.147), (80, 0.074), (92, 0.044)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.95461565 202 nips-2012-Locally Uniform Comparison Image Descriptor

Author: Andrew Ziegler, Eric Christiansen, David Kriegman, Serge J. Belongie

Abstract: Keypoint matching between pairs of images using popular descriptors like SIFT or a faster variant called SURF is at the heart of many computer vision algorithms including recognition, mosaicing, and structure from motion. However, SIFT and SURF do not perform well for real-time or mobile applications. As an alternative very fast binary descriptors like BRIEF and related methods use pairwise comparisons of pixel intensities in an image patch. We present an analysis of BRIEF and related approaches revealing that they are hashing schemes on the ordinal correlation metric Kendall’s tau. Here, we introduce Locally Uniform Comparison Image Descriptor (LUCID), a simple description method based on linear time permutation distances between the ordering of RGB values of two image patches. LUCID is computable in linear time with respect to the number of pixels and does not require floating point computation. 1

2 0.95052648 360 nips-2012-Visual Recognition using Embedded Feature Selection for Curvature Self-Similarity

Author: Angela Eigenstetter, Bjorn Ommer

Abstract: Category-level object detection has a crucial need for informative object representations. This demand has led to feature descriptors of ever increasing dimensionality like co-occurrence statistics and self-similarity. In this paper we propose a new object representation based on curvature self-similarity that goes beyond the currently popular approximation of objects using straight lines. However, like all descriptors using second order statistics, ours also exhibits a high dimensionality. Although improving discriminability, the high dimensionality becomes a critical issue due to lack of generalization ability and curse of dimensionality. Given only a limited amount of training data, even sophisticated learning algorithms such as the popular kernel methods are not able to suppress noisy or superfluous dimensions of such high-dimensional data. Consequently, there is a natural need for feature selection when using present-day informative features and, particularly, curvature self-similarity. We therefore suggest an embedded feature selection method for SVMs that reduces complexity and improves generalization capability of object models. By successfully integrating the proposed curvature self-similarity representation together with the embedded feature selection in a widely used state-of-the-art object detection framework we show the general pertinence of the approach. 1

3 0.93765461 40 nips-2012-Analyzing 3D Objects in Cluttered Images

Author: Mohsen Hejrati, Deva Ramanan

Abstract: We present an approach to detecting and analyzing the 3D configuration of objects in real-world images with heavy occlusion and clutter. We focus on the application of finding and analyzing cars. We do so with a two-stage model; the first stage reasons about 2D shape and appearance variation due to within-class variation (station wagons look different than sedans) and changes in viewpoint. Rather than using a view-based model, we describe a compositional representation that models a large number of effective views and shapes using a small number of local view-based templates. We use this model to propose candidate detections and 2D estimates of shape. These estimates are then refined by our second stage, using an explicit 3D model of shape and viewpoint. We use a morphable model to capture 3D within-class variation, and use a weak-perspective camera model to capture viewpoint. We learn all model parameters from 2D annotations. We demonstrate state-of-the-art accuracy for detection, viewpoint estimation, and 3D shape reconstruction on challenging images from the PASCAL VOC 2011 dataset. 1

4 0.8965404 337 nips-2012-The Lovász ϑ function, SVMs and finding large dense subgraphs

Author: Vinay Jethava, Anders Martinsson, Chiranjib Bhattacharyya, Devdatt Dubhashi

Abstract: The Lovász ϑ function of a graph, a fundamental tool in combinatorial optimization and approximation algorithms, is computed by solving a SDP. In this paper we establish that the Lovász ϑ function is equivalent to a kernel learning problem related to one class SVM. This interesting connection opens up many opportunities bridging graph theoretic algorithms and machine learning. We show that there exist graphs, which we call SVM − ϑ graphs, on which the Lovász ϑ function can be approximated well by a one-class SVM. This leads to novel use of SVM techniques for solving algorithmic problems in large graphs e.g. identifying a planted clique of size Θ(√n) in a random graph G(n, 1/2). A classic approach for this problem involves computing the ϑ function, however it is not scalable due to SDP computation. We show that the random graph with a planted clique is an example of SVM − ϑ graph. As a consequence a SVM based approach easily identifies the clique in large graphs and is competitive with the state-of-the-art. We introduce the notion of common orthogonal labelling and show that it can be computed by solving a Multiple Kernel learning problem. It is further shown that such a labelling is extremely useful in identifying a large common dense subgraph in multiple graphs, which is known to be a computationally difficult problem. The proposed algorithm achieves an order of magnitude scalability compared to state of the art methods. 1

5 0.88108867 339 nips-2012-The Time-Marginalized Coalescent Prior for Hierarchical Clustering

Author: Levi Boyles, Max Welling

Abstract: We introduce a new prior for use in Nonparametric Bayesian Hierarchical Clustering. The prior is constructed by marginalizing out the time information of Kingman’s coalescent, providing a prior over tree structures which we call the Time-Marginalized Coalescent (TMC). This allows for models which factorize the tree structure and times, providing two benefits: more flexible priors may be constructed and more efficient Gibbs type inference can be used. We demonstrate this on an example model for density estimation and show the TMC achieves competitive experimental results. 1

same-paper 6 0.85566509 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

7 0.78367174 357 nips-2012-Unsupervised Template Learning for Fine-Grained Object Recognition

8 0.77650481 274 nips-2012-Priors for Diversity in Generative Latent Variable Models

9 0.77382863 185 nips-2012-Learning about Canonical Views from Internet Image Collections

10 0.7661733 201 nips-2012-Localizing 3D cuboids in single-view images

11 0.74569917 176 nips-2012-Learning Image Descriptors with the Boosting-Trick

12 0.7341783 8 nips-2012-A Generative Model for Parts-based Object Segmentation

13 0.73006749 210 nips-2012-Memorability of Image Regions

14 0.72356403 101 nips-2012-Discriminatively Trained Sparse Code Gradients for Contour Detection

15 0.70569485 303 nips-2012-Searching for objects driven by context

16 0.68909895 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves

17 0.68498534 146 nips-2012-Graphical Gaussian Vector for Image Categorization

18 0.68212086 106 nips-2012-Dynamical And-Or Graph Learning for Object Shape Modeling and Detection

19 0.680435 168 nips-2012-Kernel Latent SVM for Visual Recognition

20 0.66597003 81 nips-2012-Context-Sensitive Decision Forests for Object Detection