nips nips2012 nips2012-38 knowledge-graph by maker-knowledge-mining

38 nips-2012-Algorithms for Learning Markov Field Policies


Source: pdf

Author: Abdeslam Boularias, Jan R. Peters, Oliver B. Kroemer

Abstract: We use a graphical model for representing policies in Markov Decision Processes. This new representation can easily incorporate domain knowledge in the form of a state similarity graph that loosely indicates which states are supposed to have similar optimal actions. A bias is then introduced into the policy search process by sampling policies from a distribution that assigns high probabilities to policies that agree with the provided state similarity graph, i.e. smoother policies. This distribution corresponds to a Markov Random Field. We also present forward and inverse reinforcement learning algorithms for learning such policy distributions. We illustrate the advantage of the proposed approach on two problems: cart-balancing with swing-up, and teaching a robot to grasp unknown objects. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We use a graphical model for representing policies in Markov Decision Processes. [sent-4, score-0.334]

2 This new representation can easily incorporate domain knowledge in the form of a state similarity graph that loosely indicates which states are supposed to have similar optimal actions. [sent-5, score-0.298]

3 A bias is then introduced into the policy search process by sampling policies from a distribution that assigns high probabilities to policies that agree with the provided state similarity graph, i.e. smoother policies. [sent-6, score-1.11]

4 We also present forward and inverse reinforcement learning algorithms for learning such policy distributions. [sent-10, score-0.482]

5 In practice, significant domain knowledge is often necessary for finding a near-optimal policy in a reasonable amount of time. [sent-13, score-0.4]

6 For example, one needs a suitable set of basis functions, or features, to approximate the value functions in reinforcement learning and the reward functions in inverse reinforcement learning. [sent-14, score-0.417]

7 Designing value or reward features can itself be a challenging problem. [sent-15, score-0.247]

8 Many features are also nontrivial, such as the features related to the shape of an object, used for calculating grasp stability. [sent-18, score-0.213]

9 In this paper, we show how to overcome the difficult problem of designing precise value or reward features. [sent-19, score-0.213]

10 We start by specifying a graph that loosely indicates which pairs of states are supposed to have similar actions under an optimal policy. [sent-23, score-0.323]

11 In an object manipulation task for example, the states correspond to the points of contact between the robot hand and the object surface. [sent-24, score-0.223]

12 Kernels have been widely used before in reinforcement learning (Ormoneit & Sen, 1999); however, they were used for approximating the values of different policies in a search for an optimal policy. [sent-27, score-0.49]

13 Therefore, the value function of an optimal policy is assumed to have a low approximation error, measured by the Bellman error, using that kernel. [sent-31, score-0.406]

14 Subsequently, we derive a distribution on policies, wherein the probability of a policy is proportional to its estimated value, and inversely proportional to its Bellman error. [sent-32, score-0.424]

15 In other terms, the Bellman error is used as a surrogate function for measuring how close a policy is to an optimal one. [sent-33, score-0.434]

16 We show that this probability distribution is an MRF, and use a Markov chain Monte Carlo algorithm for sampling policies from it. [sent-34, score-0.334]

17 A deterministic policy π is a function that returns an action a = π(s) for each state s. [sent-41, score-0.504]

18 We denote by π_{t:H} a non-stationary policy (π_t, π_{t+1}, . . . , π_H). [sent-43, score-0.366]

19 The value of policy π_{t:H} is the expected sum of rewards received by following π_{t:H}, starting from a state s at time t: V_{π_{t:H}}(s) = Σ_{i=t}^{H} γ^{i−t} E_{s_i}[R(s_i) | s_t = s, T_{π_{t:i}}]. [sent-47, score-0.443]

20 An optimal policy π*_{t:H} is one satisfying π*_{t:H} ∈ arg max_{π_{t:H}} V_{π_{t:H}}(s), ∀s ∈ S. [sent-48, score-0.406]

21 Searching for an optimal policy is generally an iterative process with two phases: policy evaluation, and policy improvement. [sent-49, score-1.138]

22 We define the Bellman error of two consecutive policies π_t and π_{t+1} using the feature matrix F and the weights w_t, w_{t+1} ∈ R^n as BE(F, w_{t:t+1}, π_{t:t+1}) = ||F_{π_t} w_t − γ T_{π_t} F_{π_{t+1}} w_{t+1} − R||_1. [sent-54, score-1.108]

23 Similarly, we define the Bellman error of a distribution P on policies π_t and π_{t+1} as BE(F, w_{t:t+1}, P) = ||E_{π_{t:t+1} ∼ P}[F_{π_t} w_t − γ T_{π_t} F_{π_{t+1}} w_{t+1}] − R||_1. [sent-55, score-0.721]
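
As a concrete illustration, the quantity above can be computed directly from tabular objects; the following is a minimal NumPy sketch, assuming F_{π} denotes the feature rows selected by a policy's state-action pairs, T_{π} the state-transition matrix under π, and R the reward vector (all names are illustrative, not the authors' code).

```python
import numpy as np

def bellman_error(F_pi_t, F_pi_t1, T_pi_t, w_t, w_t1, R, gamma=0.95):
    """L1 Bellman error of two consecutive policies, as defined above:
    BE(F, w, pi) = || F_{pi_t} w_t - gamma * T_{pi_t} F_{pi_{t+1}} w_{t+1} - R ||_1.
    F_pi_*: (|S| x n) feature rows picked out by each policy,
    T_pi_t: (|S| x |S|) transition matrix under pi_t,
    w_*:    (n,) value-function weights, R: (|S|,) reward vector."""
    residual = F_pi_t @ w_t - gamma * (T_pi_t @ (F_pi_t1 @ w_t1)) - R
    return np.abs(residual).sum()
```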

24 (Structure penalty) Optimal policies of many real-world problems are structured and change smoothly over the state space. [sent-59, score-0.42]

25 Specifically, we restrain the policy search to a set of policies that have a low estimated Bellman error when their values are approximated using the provided features, knowing that the optimal policy has a low Bellman error. [sent-62, score-1.249]

26 Matrix K is the adjacency matrix of a graph that indicates which states and actions are similar under an optimal policy. [sent-66, score-0.309]

27 Let w_t, w_{t+1} ∈ R^{|S|} and ε ∈ R; if ||E_{π_{t:t+1} ∼ P}[K_{π_t} w_t − γ T_{π_t} K_{π_{t+1}} w_{t+1}] − R||_1 ≤ ε, then BE*(F, P) ≤ ε. [sent-69, score-0.718]

28 This result is obtained by setting F^T Π^T_{π_t} w_t and F^T Π^T_{π_{t+1}} w_{t+1} as the weight vectors of the values of policies π_t and π_{t+1}. [sent-70, score-0.693]

29 The condition above implies that the policy distribution P has a value function that can be approximated by using F . [sent-71, score-0.408]

30 Enforcing this condition results in a bias favoring policies with a low Bellman error. [sent-72, score-0.334]

31 We start by calculating a distribution over deterministic policies πH that will be executed at the last time-step H. [sent-75, score-0.405]

32 For t ∈ {H − 1, . . . , 0}, we calculate a distribution P(π_t | π_{t+1:H}) over deterministic policies π_t given policies π_{t+1:H} that we sample from P(π_{t+1:H}). [sent-79, score-0.704]

33 (Primal problem) Let ρ ∈ R be a lower bound on the entropy of a distribution P on deterministic policies π_t, conditioned on π_{t+1:H}. [sent-82, score-0.408]

34 The derivative of the Lagrangian is ∂L(P, τ, η, λ)/∂P(π_t | π_{t+1:H}) = Σ_{s∈S} V_{π_{t:H}}(s) + λ^T [K_{π_t} w_t − γ T_{π_t} K_{π_{t+1}} w_{t+1}] − τ log P(π_t | π_{t+1:H}) − η − 1. [sent-91, score-0.359]

35 By setting ∂L(P, τ, η, λ)/∂P(π_t | π_{t+1:H}) = 0 (Karush-Kuhn-Tucker condition), we get the solution P(π_t | π_{t+1:H}) ∝ exp( (1/τ) [ Σ_{s∈S} V_{π_{t:H}}(s) + λ^T (K_{π_t} w_t − γ T_{π_t} K_{π_{t+1}} w_{t+1}) ] ), where the first term is the expected sum of rewards and the temperature τ acts as an exploration factor. [sent-92, score-0.388]
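
To make the closed-form solution above concrete, here is a small sketch that scores candidate policies by the exponent of that distribution (expected sum of rewards plus the λ-weighted structure term, divided by the temperature τ) and samples one of them; the candidate representation and helper names are assumptions, not the authors' implementation.

```python
import numpy as np

def policy_log_weight(V_pi, K_pi_t, K_pi_t1, T_pi_t, w_t, w_t1, lam, tau=1.0, gamma=0.95):
    """Unnormalized log-probability of pi_t given pi_{t+1:H}:
    (1/tau) * ( sum_s V_{pi_{t:H}}(s) + lam^T [K_{pi_t} w_t - gamma T_{pi_t} K_{pi_{t+1}} w_{t+1}] )."""
    structure = K_pi_t @ w_t - gamma * (T_pi_t @ (K_pi_t1 @ w_t1))
    return (V_pi.sum() + lam @ structure) / tau

def sample_policy(candidates, shared, rng=np.random.default_rng(0)):
    """Draw one candidate policy with probability proportional to exp(log weight).
    `candidates` is a list of dicts holding the policy-dependent terms
    (V_pi, K_pi_t, K_pi_t1, T_pi_t); `shared` holds w_t, w_t1, lam, tau, gamma."""
    logw = np.array([policy_log_weight(**c, **shared) for c in candidates])
    p = np.exp(logw - logw.max())          # subtract max for numerical stability
    return candidates[rng.choice(len(candidates), p=p / p.sum())]
```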

36 In fact, the kernel K = F F^T is the adjacency matrix of a graph (E, S), where (s_i, s_j) ∈ E if and only if ∃ a_i, a_j ∈ A : K((s_i, a_i), (s_j, a_j)) ≠ 0. [sent-94, score-0.613]

37 The local Markov property is verified: ∀ s_i ∈ S, P(π_t(s_i) | π_{t+1:H}, {π_t(s_j) : s_j ∈ S, s_j ≠ s_i}) = P(π_t(s_i) | π_{t+1:H}, {π_t(s_j) : (s_i, s_j) ∈ E, s_j ≠ s_i}). [sent-95, score-1.246]

38 In other terms, the probability of selecting an action in a given state depends on the expected long term reward of the action, as well as on the selected actions in the neighboring states. [sent-96, score-0.452]

39 Since P(π_{0:H}) = P(π_H) ∏_{t=0}^{H−1} P(π_t | π_{t+1:H}), then P(π_{0:H}) ∝ exp( (1/τ) Σ_{t=0}^{H} Σ_{s∈D} [ V̂_{π_{t:H}}(s) + λ_t^T (K_{π_t} w_t − γ T_{π_t} K_{π_{t+1}} w_{t+1}) ] ). [sent-102, score-0.359]

40 The algorithm iterates between two main steps: (i) sampling and executing policies from Equation 2, and (ii), updating the value functions and the parameters λt using the samples. [sent-108, score-0.334]

41 Discard policies π_{0:H} that have an empirical Bellman error higher than ε. [sent-114, score-0.362]
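
The loop described above (sample and execute policies, reject those with a high empirical Bellman error, then refit the value weights and the multipliers λ_t) can be sketched as follows; every helper name is a hypothetical placeholder standing in for the corresponding step of the algorithm.

```python
def learn_mrf_policies(sample_policies, execute, empirical_bellman_error,
                       update_values, update_lambdas, epsilon=0.1, n_iters=50):
    """Skeleton of the forward-RL procedure described above:
    (i) sample and execute policies from the MRF distribution (Equation 2),
    discarding any policy whose empirical Bellman error exceeds epsilon;
    (ii) update the value functions and the parameters lambda_t from the samples."""
    kept = []
    for _ in range(n_iters):
        for pi in sample_policies():                    # step (i): MCMC samples
            trajectory = execute(pi)
            if empirical_bellman_error(pi, trajectory) > epsilon:
                continue                                # reject poorly structured policies
            kept.append((pi, trajectory))
        update_values(kept)                             # step (ii): refit w_t
        update_lambdas(kept)                            # ... and the multipliers lambda_t
    return kept
```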

42 This approach is straightforward to extend to handle samples of continuous states and actions, in which case a policy is represented by a vector Θ_t ∈ R^N of continuous parameters (for instance, the center and the width of a Gaussian). [sent-127, score-0.558]

43 (Markov Random Field Policies for Apprenticeship Learning) We now derive a policy shaping approach for apprenticeship learning using Markov Random Fields. [sent-130, score-0.53]

44 (Apprenticeship learning) The aim of apprenticeship learning is to find a policy π that is nearly as good as a policy π̂ demonstrated by an expert. [sent-132, score-0.896]

45 Abbeel & Ng (2004) proposed to learn a reward function, assuming that the expert is optimal, and to use it to recover the expert’s generalized policy. [sent-135, score-0.346]

46 The process of learning a reward function is known as inverse reinforcement learning. [sent-136, score-0.329]

47 The reward function is assumed to be a linear combination of m feature vectors φ_k with weights θ_k: ∀s ∈ S, R(s) = Σ_{k=1}^{m} θ_k φ_k(s). [sent-137, score-0.241]

48 The expected discounted sum of feature φ_k, given policy π_{t:H} and starting from s, is defined as φ_k^{π_{t:H}}(s) = Σ_{i=t}^{H} γ^{i−t} E_{s_{t:H}}[φ_k(s_i) | s_t = s, T_{π_{t:i}}]. [sent-138, score-0.394]

49 Using this definition, the expected return of a policy π can be written as a linear function of the feature expectations: V_{π_{t:H}}(s) = Σ_{k=1}^{m} θ_k φ_k^{π_{t:H}}(s). [sent-139, score-0.394]
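
The feature expectations φ_k^π above can be estimated by Monte Carlo rollouts of the policy; the sketch below shows that standard estimator (not necessarily the estimator used in the paper), with φ a user-supplied state-feature map.

```python
import numpy as np

def feature_expectations(trajectories, phi, n_features, gamma=0.95):
    """Monte Carlo estimate of phi^pi = E[ sum_t gamma^t phi(s_t) ].
    trajectories: list of state sequences obtained by running the policy,
    phi:          callable mapping a state to an (n_features,) vector."""
    mu = np.zeros(n_features)
    for states in trajectories:
        discount = 1.0
        for s in states:
            mu += discount * phi(s)
            discount *= gamma
    return mu / len(trajectories)

# The expected return is then linear in these expectations: V_pi ≈ theta @ mu.
```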

50 (Structure matching) The classical framework of apprenticeship learning is based on designing features φ of the reward and learning corresponding weights θ. [sent-144, score-0.411]

51 In practice, as we show in the experiments, it is often difficult to find an appropriate set of reward features. [sent-145, score-0.213]

52 Moreover, the values of the reward features are usually obtained from empirical data and are subject to measurement errors. [sent-146, score-0.247]

53 This information about the structure of the expert’s policy can be used to partially overcome the problem of finding reward features. [sent-148, score-0.579]

54 Given an expert’s policy π̂_{0:H} and feature matrix F, we are interested in finding a distribution P on policies π_{0:H} that has a Bellman error similar to that of the expert’s policy. [sent-150, score-0.756]

55 Let P be a distribution on policies π_t and π_{t+1} such that E_{π_{t:t+1} ∼ P}[K_{π_t}] = K_{π̂_t} and E_{π_{t:t+1} ∼ P}[γ T_{π_t} K_{π_{t+1}} T_{π_t}^T] = γ T_{π̂_t} K_{π̂_{t+1}} T_{π̂_t}^T; then BE*(F, π̂_{t:t+1}) = BE*(F, P). [sent-154, score-0.334]

56 Let ŵ_t, ŵ_{t+1} ∈ R^{|S|} be the weight vectors that minimize the Bellman error of the expert’s policy, i.e. [sent-160, score-0.387]

57 ||Π_{π̂_t} F ŵ_t − γ T_{π̂_t} Π_{π̂_{t+1}} F ŵ_{t+1} − R||_p = BE*(F, π̂_{t:t+1}). [sent-162, score-0.359]

58 Let us write ŵ_t = ŵ_t^∥ + ŵ_t^⊥, where ŵ_t^∥ is the projection of ŵ_t on the rows of Π_{π̂_t} F, i.e. [sent-163, score-1.436]

59 ∃ α̂_t ∈ R^{|S|} : ŵ_t^∥ = F^T Π^T_{π̂_t} α̂_t, and ŵ_t^⊥ is orthogonal to the rows of Π_{π̂_t} F. [sent-165, score-0.359]

60 Thus, Π_{π̂_t} F ŵ_t = Π_{π̂_t} F (ŵ_t^∥ + ŵ_t^⊥) = Π_{π̂_t} F ŵ_t^∥ = K_{π̂_t} α̂_t. [sent-166, score-0.718]

61 Let w_t = F^T Π^T_{π_t} α̂_t and w_{t+1} = F^T Π^T_{π_{t+1}} T_{π_t} α̂_{t+1}; then we have BE*(F, P) ≤ ||E_{π_{t:t+1}}[Π_{π_t} F w_t − γ T_{π_t} Π_{π_{t+1}} F w_{t+1}] − R||_1 = ||E_{π_{t:t+1}}[K_{π_t} α̂_t − γ T_{π_t} K_{π_{t+1}} T_{π_t}^T α̂_{t+1}] − R||_1 = ||K_{π̂_t} α̂_t − γ T_{π̂_t} K_{π̂_{t+1}} T_{π̂_t}^T α̂_{t+1} − R||_1 = BE*(F, π̂_{t:t+1}). [sent-168, score-0.718]

62 (Problem statement) Our problem now is to find a distribution on deterministic policies P that satisfies the conditions stated in Proposition 1 in addition to the feature matching conditions φ_k^π(s) = φ̂_k. [sent-170, score-0.398]

63 The conditions of Proposition 1 ensure that P assigns high probabilities to policies that have a structure similar to the expert’s policy π̂. [sent-171, score-0.7]

64 We derive the distribution P(π_t | π_{t+1:H}) ∝ exp( Σ_{s∈S} Σ_k θ_k φ_k^{π_{t:H}}(s) + Σ_{(s_i,s_j)∈S^2} λ_{i,j} K_{π_t}(s_i, s_j) + γ Σ_{(s_i,s_j)∈S^2} ξ_{i,j} (T_{π_t} K_{π_{t+1}} T_{π_t}^T)(s_i, s_j) ). [sent-182, score-0.444]

65 The parameters θ, λ and ξ are learned by maximizing the likelihood P(π̂_{t:H}) of the expert’s policy π̂_{t:H}. [sent-184, score-0.397]

66 The learned parameters can then be used for sampling policies that have the same expected value (from the second constraint), and the same Bellman error (from the last two constraints and Proposition 1) as the expert’s policy. [sent-185, score-0.393]
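
Since the distribution above is an exponential-family model, maximizing the likelihood of the expert's policy has the usual moment-matching gradient: empirical statistics of the expert minus expected statistics under the current model, the latter approximated with sampled policies. The following is a schematic gradient-ascent step under that assumption; the helper functions are placeholders.

```python
import numpy as np

def mle_step(params, expert_stats, sample_model_policies, statistics, lr=0.01):
    """One likelihood gradient step for the parameters (theta, lambda, xi):
    grad = (statistics of the expert's policy) - (expected statistics under the model).
    `statistics(pi)` stacks the feature counts and the two kernel/structure terms;
    `sample_model_policies(params)` draws policies from the current distribution."""
    model_policies = sample_model_policies(params)
    model_stats = np.mean([statistics(pi) for pi in model_policies], axis=0)
    return params + lr * (expert_stats - model_stats)
```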

67 This simplification is based on the fact that the reward function is independent of the initial state. [sent-190, score-0.213]

68 For a sparse matrix K, one can create a corresponding graph (E, S), where (s_i, s_j) ∈ E if and only if ∃ a_i, a_j ∈ A : K((s_i, a_i), (s_j, a_j)) ≠ 0, or ∃ a_i, a_j ∈ A, (s_i′, s_j′) ∈ E : γ T(s_i, a_i, s_i′) T(s_j, a_j, s_j′) ≠ 0. [sent-192, score-1.391]

69 Finally, the policy distribution can be rewritten as P(π_t | π_{t+1:H}) ∝ exp( Σ_{s∈S} V_θ^{π_{t:H}}(s) + λ Σ_{(s_i,s_j)∈E} K_{π_t}(s_i, s_j) + γ ξ Σ_{(s_i,s_j)∈E} (T_{π_t} K_{π_{t+1}} T_{π_t}^T)(s_i, s_j) ),  (3)  where V_θ^{π_{t:H}}(s) = Σ_k θ_k φ_k(s) + γ Σ_{s′∈S} T_{π_t}(s, s′) V_θ^{π_{t+1:H}}(s′). [sent-193, score-0.81]
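
A direct way to use Equation 3 is to evaluate its exponent (an unnormalized log-probability) for a candidate policy π_t; the sketch below does this with tabular matrices and an explicit edge list, under the assumption that λ and ξ are scalars as in the rewritten form above.

```python
import numpy as np

def eq3_log_potential(V_theta, K_pi_t, K_pi_t1, T_pi_t, edges, lam, xi, gamma=0.95):
    """Exponent of Equation 3 for a candidate pi_t given pi_{t+1:H}:
    sum_s V_theta(s) + lam * sum_{(i,j) in E} K_{pi_t}(i,j)
                     + gamma * xi * sum_{(i,j) in E} (T_{pi_t} K_{pi_{t+1}} T_{pi_t}^T)(i,j)."""
    propagated = T_pi_t @ K_pi_t1 @ T_pi_t.T
    rows, cols = zip(*edges)                    # edges (s_i, s_j) of the similarity graph
    structure = lam * K_pi_t[rows, cols].sum() + gamma * xi * propagated[rows, cols].sum()
    return V_theta.sum() + structure
```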

70 The probability of choosing action a in a given state s depends on the expected value of (s, a) and the actions chosen in neighboring states. [sent-195, score-0.239]

71 (Learning procedure) In the learning phase, Equation 3 is used for finding parameters θ, λ and ξ that maximize the likelihood of the expert’s policy π̂. [sent-203, score-0.366]

72 For instance, we reuse the values calculated for a given policy π as the initial values of all the policies that differ from π in one state only. [sent-207, score-0.748]

73 (Planning procedure) Algorithm 2 describes a dynamic programming procedure for finding a policy (π_0*, π_1*, . . . , π_H*). [sent-211, score-0.366]

74 Denote by π_t* the labeling policy returned by the inference algorithm; end for; return the policy π* = (π_0*, π_1*, . . . , π_H*). [sent-224, score-0.732]

75 (Experimental Results) We present experiments on two problems: learning to swing up and balance an inverted pendulum on a cart, and learning to grasp unknown objects. [sent-227, score-0.244]
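
The planning pass of Algorithm 2 is a backward sweep that, at each time step, calls an MRF inference (labeling) routine conditioned on the policies already chosen for the later steps; a minimal sketch, with infer_labeling standing in for that graph-inference subroutine:

```python
def plan_policy(H, infer_labeling):
    """Backward dynamic-programming sketch of the planning procedure:
    pick pi_t* by MRF inference, conditioning on (pi_{t+1}*, ..., pi_H*)."""
    pi_star = [None] * (H + 1)
    for t in range(H, -1, -1):
        pi_star[t] = infer_labeling(t, pi_star[t + 1:])  # MAP labeling given later policies
    return pi_star                                       # (pi_0*, ..., pi_H*)
```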

76 (Swing-up cart-balancing) The simulated swing-up cart-balancing system (Figure 1) consists of a 6 kg cart running on a 2 m track and a freely-swinging 1 kg pendulum with mass attached to the cart with a 50 cm rod. [sent-229, score-0.377]

77 The state of the system is the position and velocity of the cart (x, ẋ), as well as the angle and angular velocity of the pendulum (θ, θ̇). An action a ∈ R is a horizontal force applied to the cart. [sent-230, score-0.263]

78 The objective is to learn, in a series of 5s episodes, a policy that swings the pendulum up and balances it in the inverted position. [sent-235, score-0.5]

79 Since the pendulum falls down after hitting one of the two track limits, the policy should also learn to maintain the cart in the middle of the track. [sent-236, score-0.599]

80 Moreover, the track has a nonuniform friction modeled as a force slowing down the cart. [sent-237, score-0.242]

81 We consider parametric policies of the form π(x, ẋ, θ, θ̇) = Σ_i p_i q_i(x, ẋ, θ, θ̇), where p_i are real weights and q_i are basis functions corresponding to the signs of the angle and the angular velocity, and an exponential function centered at the middle of the track. [sent-240, score-0.413]
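
The parametric policy just described can be written out directly; the following sketch uses the sign basis functions and an exponential bump at the track center, with the bump width chosen arbitrarily for illustration.

```python
import numpy as np

def basis(x, x_dot, theta, theta_dot, track_center=0.0, width=0.5):
    """Basis functions q_i: signs of the pendulum angle and angular velocity,
    plus an exponential centered at the middle of the track (width is illustrative)."""
    return np.array([
        np.sign(theta),
        np.sign(theta_dot),
        np.exp(-((x - track_center) ** 2) / (2.0 * width ** 2)),
    ])

def policy(p, state):
    """pi(x, x_dot, theta, theta_dot) = sum_i p_i q_i(...): a horizontal force on the cart."""
    return float(p @ basis(*state))
```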

82 A reward of 1 is given for each step the pendulum is above the horizon. [sent-242, score-0.316]

83 Since the friction changes smoothly along the track (domain knowledge), we use the adjacency matrix of a nearest-neighbor graph as the MRF kernel K in Equation 2. [sent-245, score-0.605]
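
A nearest-neighbor adjacency matrix of this kind can be built with a few lines of NumPy; the sketch below is generic (states are treated as vectors and Euclidean distance is assumed), not the authors' exact construction.

```python
import numpy as np

def knn_adjacency(states, k=6):
    """Symmetric adjacency matrix of a k-nearest-neighbor graph over state vectors,
    usable as the MRF kernel K (k is illustrative; the grasping experiment below
    uses k = 6 for surface positions and k = 2 for hand orientations)."""
    X = np.asarray(states, dtype=float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # a state is not its own neighbor
    nearest = np.argsort(d, axis=1)[:, :k]
    K = np.zeros_like(d)
    K[np.repeat(np.arange(len(X)), k), nearest.ravel()] = 1.0
    return np.maximum(K, K.T)                    # symmetrize the graph
```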

84 Figure 1 shows the average reward per time-step of the learned policies as a function of the learning time. [sent-248, score-0.578]

85 Our attempts to solve this variant using different policy gradient methods, e. [sent-249, score-0.366]

86 We report the values of the policies sampled with Metropolis-Hastings using Equation 2, and compare to the case where the policies are sampled solely according to their expected values, i.e. without the structure penalty term of Equation 2. [sent-252, score-0.668]
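
Sampling policies with Metropolis-Hastings can be sketched as a random-walk sampler over the continuous policy parameters, targeting whichever unnormalized log-density is plugged in (expected value only, or value plus the structure term of Equation 2); the Gaussian proposal and step size are assumptions.

```python
import numpy as np

def metropolis_hastings(log_density, p0, n_steps=1000, step=0.1, seed=0):
    """Random-walk Metropolis-Hastings over policy parameters p, targeting the
    unnormalized density exp(log_density(p)); returns the chain of samples."""
    rng = np.random.default_rng(seed)
    p = np.array(p0, dtype=float)
    logd = log_density(p)
    chain = []
    for _ in range(n_steps):
        proposal = p + step * rng.standard_normal(p.shape)
        logd_new = log_density(proposal)
        if np.log(rng.random()) < logd_new - logd:   # accept with prob min(1, ratio)
            p, logd = proposal, logd_new
        chain.append(p.copy())
    return chain
```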

87 In fact, policies that change smoothly have a lower Bellman error as their values can be better approximated with kernel K. [sent-258, score-0.486]

88 The friction is nonuniform; the red area has a higher friction than the blue one. [sent-271, score-0.314]

89 Consequently, restraining the search to smooth policies yields faster convergence. [sent-273, score-0.362]

90 (Precision grasps of unknown objects) From a high-level point of view, grasping an object can be seen as an MDP with three steps: reaching, preshaping, and grasping. [sent-274, score-0.215]

91 At any step, the robot can either proceed to the next step or restart from the beginning and get a reward of 0. [sent-275, score-0.301]

92 At t = 0, the robot always starts from the same initial state s0 , and the set of actions corresponds to the set of points on the surface of the object. [sent-276, score-0.269]

93 At t = 1, the state is given by a surface point and an approach direction, and the set of actions corresponds to the set of all possible hand orientations. [sent-278, score-0.214]

94 The reward of each step depends on the current state. [sent-282, score-0.213]

95 The reward R1 defined at t = 1 is a function of the first three eigenvalues of the scatter matrix defined by the 3D coordinates of the points inside a small ball centered on the selected point (Boularias et al.). [sent-284, score-0.241]
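
The eigenvalue features mentioned above can be computed from a point cloud as follows; the ball radius is an arbitrary illustrative value, and this is only a sketch of the general recipe (scatter matrix of nearby points, then its leading eigenvalues), not the exact feature definition of Boularias et al.

```python
import numpy as np

def scatter_eigenvalues(points, center, radius=0.02):
    """Shape features for a candidate grasp point: eigenvalues (descending) of the
    scatter matrix of the 3D points that fall inside a small ball around the point."""
    P = np.asarray(points, dtype=float)
    ball = P[np.linalg.norm(P - np.asarray(center, dtype=float), axis=1) <= radius]
    centered = ball - ball.mean(axis=0)
    scatter = centered.T @ centered / max(len(ball) - 1, 1)
    return np.sort(np.linalg.eigvalsh(scatter))[::-1][:3]   # first three eigenvalues
```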

96 The reward R2, … (Table 2: Learned Q-values at t = 0; compared methods: Regression, AMN, AL-MRF, MaxEnt IRL.) [sent-286, score-0.213]

97 The black arrow indicates the approach direction in the optimal policy according to the learned reward function. [sent-289, score-0.65]

98 same approach direction and hand orientation), the kernel K is given by the k-nearest neighbors graph, using the Euclidean distance and k = 6 in the state space of positions (or surface points), and the angular distance, with k = 2 in the discretized state space of hand orientations. [sent-295, score-0.239]

99 (Figure 2 axis: Percentage of successful grasps.) Table 2 shows the Q-values at t = 0 and the approach directions at optimal grasping points. [sent-301, score-0.216]

100 The values of the other points are zeros because the optimal action at these points is to restart rather than to grasp. (Figure 2: Percentage of grasps located on a handle.) [sent-303, score-0.245]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('policy', 0.366), ('wt', 0.359), ('policies', 0.334), ('bellman', 0.222), ('sj', 0.222), ('reward', 0.213), ('si', 0.179), ('amn', 0.179), ('apprenticeship', 0.164), ('friction', 0.157), ('expert', 0.133), ('grasp', 0.11), ('maxent', 0.11), ('ziebart', 0.11), ('actions', 0.107), ('pendulum', 0.103), ('boularias', 0.103), ('mrf', 0.097), ('aj', 0.092), ('irl', 0.092), ('grasps', 0.09), ('reinforcement', 0.088), ('grasping', 0.086), ('cart', 0.082), ('peters', 0.082), ('markov', 0.07), ('mer', 0.067), ('kr', 0.065), ('hastings', 0.063), ('graph', 0.062), ('boyan', 0.06), ('surface', 0.059), ('metropolis', 0.058), ('wherein', 0.058), ('ai', 0.058), ('states', 0.057), ('st', 0.057), ('robot', 0.055), ('oliver', 0.055), ('action', 0.054), ('abdeslam', 0.052), ('lstd', 0.052), ('tt', 0.05), ('track', 0.048), ('state', 0.048), ('temperature', 0.048), ('henb', 0.045), ('koltun', 0.045), ('restrain', 0.045), ('taskar', 0.044), ('kernel', 0.044), ('adjacency', 0.043), ('approximated', 0.042), ('field', 0.041), ('optimal', 0.04), ('angular', 0.04), ('lagrangian', 0.04), ('munoz', 0.04), ('velocity', 0.039), ('qt', 0.039), ('mrfs', 0.039), ('object', 0.039), ('smoothly', 0.038), ('entropy', 0.038), ('deisenroth', 0.037), ('collision', 0.037), ('nonuniform', 0.037), ('ormoneit', 0.037), ('deterministic', 0.036), ('calculating', 0.035), ('nger', 0.034), ('dudik', 0.034), ('sen', 0.034), ('domain', 0.034), ('features', 0.034), ('equation', 0.033), ('kober', 0.033), ('restart', 0.033), ('boykov', 0.033), ('contact', 0.033), ('fields', 0.033), ('inverted', 0.031), ('kg', 0.031), ('learned', 0.031), ('kohli', 0.03), ('abbeel', 0.03), ('supposed', 0.03), ('neighboring', 0.03), ('proposition', 0.03), ('mdp', 0.029), ('rewards', 0.029), ('representer', 0.029), ('planning', 0.029), ('handle', 0.028), ('feature', 0.028), ('search', 0.028), ('error', 0.028), ('inverse', 0.028), ('et', 0.028), ('loosely', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 38 nips-2012-Algorithms for Learning Markov Field Policies

Author: Abdeslam Boularias, Jan R. Peters, Oliver B. Kroemer

Abstract: We use a graphical model for representing policies in Markov Decision Processes. This new representation can easily incorporate domain knowledge in the form of a state similarity graph that loosely indicates which states are supposed to have similar optimal actions. A bias is then introduced into the policy search process by sampling policies from a distribution that assigns high probabilities to policies that agree with the provided state similarity graph, i.e. smoother policies. This distribution corresponds to a Markov Random Field. We also present forward and inverse reinforcement learning algorithms for learning such policy distributions. We illustrate the advantage of the proposed approach on two problems: cart-balancing with swing-up, and teaching a robot to grasp unknown objects. 1

2 0.41523373 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

Author: Edouard Klein, Matthieu Geist, Bilal Piot, Olivier Pietquin

Abstract: This paper addresses the inverse reinforcement learning (IRL) problem, that is inferring a reward for which a demonstrated expert behavior is optimal. We introduce a new algorithm, SCIRL, whose principle is to use the so-called feature expectation of the expert as the parameterization of the score function of a multiclass classifier. This approach produces a reward function for which the expert policy is provably near-optimal. Contrary to most of existing IRL algorithms, SCIRL does not require solving the direct RL problem. Moreover, with an appropriate heuristic, it can succeed with only trajectories sampled according to the expert behavior. This is illustrated on a car driving simulator. 1

3 0.3236964 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

Author: Bruno Scherrer, Boris Lesner

Abstract: We consider infinite-horizon stationary γ-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error at each iteration, it is well-known that one can compute stationary policies that are 2γ/(1−γ)^2-optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for computing non-stationary policies that can be up to 2γ/(1−γ)-optimal, which constitutes a significant improvement in the usual situation when γ is close to 1. Surprisingly, this shows that the problem of “computing near-optimal non-stationary policies” is much simpler than that of “computing near-optimal stationary policies”. 1

4 0.3077229 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

Author: Aaron Wilson, Alan Fern, Prasad Tadepalli

Abstract: We consider the problem of learning control policies via trajectory preference queries to an expert. In particular, the agent presents an expert with short runs of a pair of policies originating from the same state and the expert indicates which trajectory is preferred. The agent’s goal is to elicit a latent target policy from the expert with as few queries as possible. To tackle this problem we propose a novel Bayesian model of the querying process and introduce two methods that exploit this model to actively select expert queries. Experimental results on four benchmark problems indicate that our model can effectively learn policies from trajectory preference queries and that active query selection can be substantially more efficient than random selection. 1

5 0.29547912 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

Author: Jiarong Jiang, Adam Teichert, Jason Eisner, Hal Daume

Abstract: Users want inference to be both fast and accurate, but quality often comes at the cost of speed. The field has experimented with approximate inference algorithms that make different speed-accuracy tradeoffs (for particular problems and datasets). We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing [12]. Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is simply too large to explore naively. An attempt to counteract this by applying imitation learning algorithms also fails: the “teacher” follows a far better policy than anything in our learner’s policy space, free of the speed-accuracy tradeoff that arises when oracle information is unavailable, and thus largely insensitive to the known reward function. We propose a hybrid reinforcement/apprenticeship learning algorithm that learns to speed up an initial policy, trading off accuracy for speed according to various settings of a speed term in the loss function. 1

6 0.27145544 348 nips-2012-Tractable Objectives for Robust Policy Optimization

7 0.25104764 344 nips-2012-Timely Object Recognition

8 0.23683941 245 nips-2012-Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions

9 0.22215448 160 nips-2012-Imitation Learning by Coaching

10 0.18516102 296 nips-2012-Risk Aversion in Markov Decision Processes via Near Optimal Chernoff Bounds

11 0.18514845 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

12 0.1842228 292 nips-2012-Regularized Off-Policy TD-Learning

13 0.16452117 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

14 0.16131794 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

15 0.15443741 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes

16 0.15272558 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method

17 0.14742452 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

18 0.14402297 303 nips-2012-Searching for objects driven by context

19 0.14198148 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

20 0.1406372 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.253), (1, -0.534), (2, -0.051), (3, -0.088), (4, -0.03), (5, 0.01), (6, -0.001), (7, -0.064), (8, 0.036), (9, -0.034), (10, -0.027), (11, -0.089), (12, 0.03), (13, 0.078), (14, 0.089), (15, -0.066), (16, 0.059), (17, -0.01), (18, 0.041), (19, 0.001), (20, -0.01), (21, -0.049), (22, 0.033), (23, -0.033), (24, -0.025), (25, 0.02), (26, -0.055), (27, 0.023), (28, -0.039), (29, -0.068), (30, -0.033), (31, -0.019), (32, 0.025), (33, 0.027), (34, 0.028), (35, -0.075), (36, -0.011), (37, -0.025), (38, -0.025), (39, 0.036), (40, 0.012), (41, 0.042), (42, -0.025), (43, 0.022), (44, 0.023), (45, 0.076), (46, -0.001), (47, 0.034), (48, 0.082), (49, -0.008)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95880359 38 nips-2012-Algorithms for Learning Markov Field Policies

Author: Abdeslam Boularias, Jan R. Peters, Oliver B. Kroemer

Abstract: We use a graphical model for representing policies in Markov Decision Processes. This new representation can easily incorporate domain knowledge in the form of a state similarity graph that loosely indicates which states are supposed to have similar optimal actions. A bias is then introduced into the policy search process by sampling policies from a distribution that assigns high probabilities to policies that agree with the provided state similarity graph, i.e. smoother policies. This distribution corresponds to a Markov Random Field. We also present forward and inverse reinforcement learning algorithms for learning such policy distributions. We illustrate the advantage of the proposed approach on two problems: cart-balancing with swing-up, and teaching a robot to grasp unknown objects. 1

2 0.89482212 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

Author: Aaron Wilson, Alan Fern, Prasad Tadepalli

Abstract: We consider the problem of learning control policies via trajectory preference queries to an expert. In particular, the agent presents an expert with short runs of a pair of policies originating from the same state and the expert indicates which trajectory is preferred. The agent’s goal is to elicit a latent target policy from the expert with as few queries as possible. To tackle this problem we propose a novel Bayesian model of the querying process and introduce two methods that exploit this model to actively select expert queries. Experimental results on four benchmark problems indicate that our model can effectively learn policies from trajectory preference queries and that active query selection can be substantially more efficient than random selection. 1

3 0.88955379 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

Author: Edouard Klein, Matthieu Geist, Bilal Piot, Olivier Pietquin

Abstract: This paper addresses the inverse reinforcement learning (IRL) problem, that is inferring a reward for which a demonstrated expert behavior is optimal. We introduce a new algorithm, SCIRL, whose principle is to use the so-called feature expectation of the expert as the parameterization of the score function of a multiclass classifier. This approach produces a reward function for which the expert policy is provably near-optimal. Contrary to most of existing IRL algorithms, SCIRL does not require solving the direct RL problem. Moreover, with an appropriate heuristic, it can succeed with only trajectories sampled according to the expert behavior. This is illustrated on a car driving simulator. 1

4 0.84941858 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

Author: Bruno Scherrer, Boris Lesner

Abstract: We consider infinite-horizon stationary γ-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error at each iteration, it is well-known that one can compute stationary policies that are 2γ/(1−γ)^2-optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for computing non-stationary policies that can be up to 2γ/(1−γ)-optimal, which constitutes a significant improvement in the usual situation when γ is close to 1. Surprisingly, this shows that the problem of “computing near-optimal non-stationary policies” is much simpler than that of “computing near-optimal stationary policies”. 1

5 0.84047371 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

Author: Jiarong Jiang, Adam Teichert, Jason Eisner, Hal Daume

Abstract: Users want inference to be both fast and accurate, but quality often comes at the cost of speed. The field has experimented with approximate inference algorithms that make different speed-accuracy tradeoffs (for particular problems and datasets). We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing [12]. Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is simply too large to explore naively. An attempt to counteract this by applying imitation learning algorithms also fails: the “teacher” follows a far better policy than anything in our learner’s policy space, free of the speed-accuracy tradeoff that arises when oracle information is unavailable, and thus largely insensitive to the known reward function. We propose a hybrid reinforcement/apprenticeship learning algorithm that learns to speed up an initial policy, trading off accuracy for speed according to various settings of a speed term in the loss function. 1

6 0.77685571 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

7 0.76410609 160 nips-2012-Imitation Learning by Coaching

8 0.75418895 350 nips-2012-Trajectory-Based Short-Sighted Probabilistic Planning

9 0.73187667 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

10 0.72962773 348 nips-2012-Tractable Objectives for Robust Policy Optimization

11 0.71845192 245 nips-2012-Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions

12 0.69318247 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method

13 0.66893679 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes

14 0.64614922 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

15 0.63193172 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

16 0.62245828 51 nips-2012-Bayesian Hierarchical Reinforcement Learning

17 0.61288428 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

18 0.60279334 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

19 0.57408452 344 nips-2012-Timely Object Recognition

20 0.55499655 296 nips-2012-Risk Aversion in Markov Decision Processes via Near Optimal Chernoff Bounds


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.026), (21, 0.023), (38, 0.129), (39, 0.017), (42, 0.042), (54, 0.105), (55, 0.019), (62, 0.196), (74, 0.101), (76, 0.128), (80, 0.068), (92, 0.05)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.80299765 38 nips-2012-Algorithms for Learning Markov Field Policies

Author: Abdeslam Boularias, Jan R. Peters, Oliver B. Kroemer

Abstract: We use a graphical model for representing policies in Markov Decision Processes. This new representation can easily incorporate domain knowledge in the form of a state similarity graph that loosely indicates which states are supposed to have similar optimal actions. A bias is then introduced into the policy search process by sampling policies from a distribution that assigns high probabilities to policies that agree with the provided state similarity graph, i.e. smoother policies. This distribution corresponds to a Markov Random Field. We also present forward and inverse reinforcement learning algorithms for learning such policy distributions. We illustrate the advantage of the proposed approach on two problems: cart-balancing with swing-up, and teaching a robot to grasp unknown objects. 1

2 0.78736323 160 nips-2012-Imitation Learning by Coaching

Author: He He, Jason Eisner, Hal Daume

Abstract: Imitation Learning has been shown to be successful in solving many challenging real-world problems. Some recent approaches give strong performance guarantees by training the policy iteratively. However, it is important to note that these guarantees depend on how well the policy we found can imitate the oracle on the training data. When there is a substantial difference between the oracle’s ability and the learner’s policy space, we may fail to find a policy that has low error on the training set. In such cases, we propose to use a coach that demonstrates easy-to-learn actions for the learner and gradually approaches the oracle. By a reduction of learning by demonstration to online learning, we prove that coaching can yield a lower regret bound than using the oracle. We apply our algorithm to cost-sensitive dynamic feature selection, a hard decision problem that considers a user-specified accuracy-cost trade-off. Experimental results on UCI datasets show that our method outperforms state-of-the-art imitation learning methods in dynamic feature selection and two static feature selection methods. 1

3 0.77119935 344 nips-2012-Timely Object Recognition

Author: Sergey Karayev, Tobias Baumgartner, Mario Fritz, Trevor Darrell

Abstract: In a large visual multi-class detection framework, the timeliness of results can be crucial. Our method for timely multi-class detection aims to give the best possible performance at any single point after a start time; it is terminated at a deadline time. Toward this goal, we formulate a dynamic, closed-loop policy that infers the contents of the image in order to decide which detector to deploy next. In contrast to previous work, our method significantly diverges from the predominant greedy strategies, and is able to learn to take actions with deferred values. We evaluate our method with a novel timeliness measure, computed as the area under an Average Precision vs. Time curve. Experiments are conducted on the PASCAL VOC object detection dataset. If execution is stopped when only half the detectors have been run, our method obtains 66% better AP than a random ordering, and 14% better performance than an intelligent baseline. On the timeliness measure, our method obtains at least 11% better performance. Our method is easily extensible, as it treats detectors and classifiers as black boxes and learns from execution traces using reinforcement learning. 1

4 0.76991659 346 nips-2012-Topology Constraints in Graphical Models

Author: Marcelo Fiori, Pablo Musé, Guillermo Sapiro

Abstract: Graphical models are a very useful tool to describe and understand natural phenomena, from gene expression to climate change and social interactions. The topological structure of these graphs/networks is a fundamental part of the analysis, and in many cases the main goal of the study. However, little work has been done on incorporating prior topological knowledge onto the estimation of the underlying graphical models from sample data. In this work we propose extensions to the basic joint regression model for network estimation, which explicitly incorporate graph-topological constraints into the corresponding optimization approach. The first proposed extension includes an eigenvector centrality constraint, thereby promoting this important prior topological property. The second developed extension promotes the formation of certain motifs, triangle-shaped ones in particular, which are known to exist for example in genetic regulatory networks. The presentation of the underlying formulations, which serve as examples of the introduction of topological constraints in network estimation, is complemented with examples in diverse datasets demonstrating the importance of incorporating such critical prior knowledge. 1

5 0.75514466 177 nips-2012-Learning Invariant Representations of Molecules for Atomization Energy Prediction

Author: Grégoire Montavon, Katja Hansen, Siamac Fazli, Matthias Rupp, Franziska Biegler, Andreas Ziehe, Alexandre Tkatchenko, Anatole V. Lilienfeld, Klaus-Robert Müller

Abstract: The accurate prediction of molecular energetics in chemical compound space is a crucial ingredient for rational compound design. The inherently graph-like, non-vectorial nature of molecular data gives rise to a unique and difficult machine learning problem. In this paper, we adopt a learning-from-scratch approach where quantum-mechanical molecular energies are predicted directly from the raw molecular geometry. The study suggests a benefit from setting flexible priors and enforcing invariance stochastically rather than structurally. Our results improve the state-of-the-art by a factor of almost three, bringing statistical methods one step closer to chemical accuracy. 1

6 0.75047922 287 nips-2012-Random function priors for exchangeable arrays with applications to graphs and relational data

7 0.74342918 70 nips-2012-Clustering by Nonnegative Matrix Factorization Using Graph Random Walk

8 0.74013096 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions

9 0.73775709 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

10 0.72714794 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization

11 0.72299176 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

12 0.7223286 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

13 0.72218633 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

14 0.72191238 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

15 0.71857518 303 nips-2012-Searching for objects driven by context

16 0.71604151 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

17 0.71477592 348 nips-2012-Tractable Objectives for Robust Policy Optimization

18 0.7147184 176 nips-2012-Learning Image Descriptors with the Boosting-Trick

19 0.70848364 112 nips-2012-Efficient Spike-Coding with Multiplicative Adaptation in a Spike Response Model

20 0.70827967 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes