nips nips2012 nips2012-160 knowledge-graph by maker-knowledge-mining

160 nips-2012-Imitation Learning by Coaching


Source: pdf

Author: He He, Jason Eisner, Hal Daume

Abstract: Imitation Learning has been shown to be successful in solving many challenging real-world problems. Some recent approaches give strong performance guarantees by training the policy iteratively. However, it is important to note that these guarantees depend on how well the policy we found can imitate the oracle on the training data. When there is a substantial difference between the oracle’s ability and the learner’s policy space, we may fail to find a policy that has low error on the training set. In such cases, we propose to use a coach that demonstrates easy-to-learn actions for the learner and gradually approaches the oracle. By a reduction of learning by demonstration to online learning, we prove that coaching can yield a lower regret bound than using the oracle. We apply our algorithm to cost-sensitive dynamic feature selection, a hard decision problem that considers a user-specified accuracy-cost trade-off. Experimental results on UCI datasets show that our method outperforms state-of-the-art imitation learning methods in dynamic feature selection and two static feature selection methods. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Some recent approaches give strong performance guarantees by training the policy iteratively. [sent-6, score-0.418]

2 However, it is important to note that these guarantees depend on how well the policy we found can imitate the oracle on the training data. [sent-7, score-0.787]

3 When there is a substantial difference between the oracle’s ability and the learner’s policy space, we may fail to find a policy that has low error on the training set. [sent-8, score-0.827]

4 In such cases, we propose to use a coach that demonstrates easy-to-learn actions for the learner and gradually approaches the oracle. [sent-9, score-0.481]

5 By a reduction of learning by demonstration to online learning, we prove that coaching can yield a lower regret bound than using the oracle. [sent-10, score-0.795]

6 Experimental results on UCI datasets show that our method outperforms state-of-the-art imitation learning methods in dynamic feature selection and two static feature selection methods. [sent-12, score-0.367]

7 However, this method ignores the difference between the distributions of states induced by executing the oracle’s policy and the learner’s, and thus incurs a loss quadratic in the task horizon T. [sent-15, score-0.664]

8 A recent approach called Dataset Aggregation [3] (DAgger) yields a loss linear in T by iteratively training the policy in states induced by all previously learned policies. [sent-16, score-0.619]

9 Its theoretical guarantees are relative to performance of the policy that best mimics the oracle on the training data. [sent-17, score-0.73]

10 In difficult decision-making problems, however, it can be hard to find a good policy that has a low training error, since the oracle’s policy may reside in a space that is not imitable within the learner’s policy space. [sent-18, score-1.24]

11 When the optimal action is hard to achieve, we propose to coach the learner with easy-to-learn actions and let it gradually approach the oracle (Section 3). [sent-20, score-0.97]

12 A coach trains the learner iteratively in a fashion similar to DAgger. [sent-21, score-0.384]

13 At each iteration it demonstrates actions that the learner’s current policy prefers and that have a small task loss. [sent-22, score-0.524]

14 The coach becomes harsher by showing more oracle actions as the learner makes progress. [sent-23, score-0.73]

15 Intuitively, this allows the learner to move towards a better action without much effort. [sent-24, score-0.385]

16 Thus our algorithm achieves the best action gradually instead of aiming at an impractical goal from the beginning. [sent-25, score-0.235]

17 We analyze our algorithm by a reduction to online learning and show that our approach achieves a lower regret bound than DAgger, which uses the oracle action (Section 3). [sent-26, score-0.693]

18 Our method is also related to direct loss minimization [4] for structured prediction and methods of selecting oracle translations in machine translation [5, 6] (Section 5). [sent-28, score-0.487]

19 Experimental results show that coaching has a more stable training curve and achieves lower task loss than state-of-the-art imitation learning algorithms. [sent-32, score-0.998]

20 Our major contribution is a meta-algorithm for hard imitation learning tasks where the available policy space is not adequate for imitating the oracle. [sent-33, score-0.694]

21 Our main theoretical result is Theorem 4, which states that coaching, as a smooth transition from the learner to the oracle, has a lower regret bound than using only the oracle. [sent-34, score-1.288]

22 2 Background In a sequential decision-making problem, we have a set of states S, a set of actions A and a policy space Π. [sent-35, score-0.51]

23 An agent follows a policy π : S → A that determines which action to take in a given state. [sent-36, score-0.575]

24 After taking action a in state s, the environment responds with some immediate loss L(s, a). [sent-37, score-0.323]

25 Our goal is to learn a policy π ∈ Π that minimizes the task loss J(π). [sent-43, score-0.525]

26 We assume that Π is a closed, bounded and non-empty convex set in Euclidean space; a policy π can be parameterized by a vector w ∈ Rd . [sent-44, score-0.434]

27 In imitation learning, we define an oracle that executes policy π∗ and demonstrates the action a∗_s = arg min_{a∈A} L(s, a) in state s. [sent-45, score-1.127]

28 The learner only attempts to imitate the oracle’s behavior, without any notion of the task loss function. [sent-46, score-0.409]

29 Thus minimizing the task loss is reduced to minimizing a surrogate loss with respect to the oracle’s policy. [sent-47, score-0.318]

30 2.1 Imitation by Classification A typical approach to imitation learning is to use the oracle’s trajectories as supervised data and learn a policy (a multiclass classifier) that predicts the oracle action under the distribution of states induced by running the oracle’s policy. [sent-49, score-1.194]
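
To make this supervised reduction concrete, the following is a minimal Python sketch of imitation by classification. The environment interface (reset/step/features), the oracle_action helper, and the choice of classifier are illustrative assumptions, not the paper's actual setup.

```python
# Minimal sketch of imitation-by-classification: run the oracle, record
# (state features, oracle action) pairs, and fit a multiclass classifier.
# env, oracle_action and their interfaces are assumed for illustration.
from sklearn.linear_model import LogisticRegression

def collect_oracle_data(env, oracle_action, horizon, n_episodes):
    X, y = [], []
    for _ in range(n_episodes):
        s = env.reset()
        for _ in range(horizon):
            a_star = oracle_action(s)        # a*_s = arg min_a L(s, a)
            X.append(env.features(s))        # feature vector for state s
            y.append(a_star)
            s, done = env.step(s, a_star)    # states are drawn from d_{pi*}
            if done:
                break
    return X, y

def train_supervised_policy(env, oracle_action, horizon=20, n_episodes=100):
    X, y = collect_oracle_data(env, oracle_action, horizon, n_episodes)
    return LogisticRegression().fit(X, y)    # minimizes a surrogate loss w.r.t. pi*
```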

31 At each step t, we collect a training example (st , π ∗ (st )), where π ∗ (st ) is the oracle’s action (class label) in state st . [sent-50, score-0.318]

32 Let ℓ(s, π, π∗(s)) denote the surrogate loss of executing π in state s with respect to π∗(s). [sent-51, score-0.297]

33 This can be any convex loss function used for training the classifier, for example, hinge loss in SVM. [sent-52, score-0.269]

34 Using any standard supervised learning algorithm, we can learn a policy π̂ = arg min_{π∈Π} E_{s∼d_{π∗}}[ℓ(s, π, π∗(s))]. (1) [sent-53, score-0.493]

35 We then bound J(π̂) based on how well the learner imitates the oracle. [sent-54, score-0.24]

36 One drawback of the supervised approach is that it ignores the fact that the state distribution is different for the oracle and the learner. [sent-57, score-0.405]

37 When the learner cannot mimic the oracle perfectly (i.e., it incurs nonzero surrogate loss), its errors compound at execution time. [sent-58, score-0.557]

38 Thus the learned policy is not able to handle situations where the learner follows a wrong path that is never chosen by the oracle, hence the quadratically increasing loss. [sent-61, score-0.645]

39 [3] generalized Theorem 1 to any policy that has surrogate loss ε under its own state distribution, i.e., E_{s∼d_π}[ℓ(s, π, π∗(s))] = ε. [sent-64, score-0.618]

40 It basically says that when π chooses a different action from π∗ at time step t, if the cumulative cost due to this error is bounded by u, then the relative task loss is O(uT). [sent-73, score-0.379]

41 2.2 Dataset Aggregation The above problem of insufficient exploration can be alleviated by iteratively learning a policy trained under states visited by both the oracle and the learner. [sent-75, score-0.77]

42 For example, during training one can use a “mixture oracle” that at times takes an action given by the previously learned policy [11]. [sent-76, score-0.61]

43 Alternatively, at each iteration one can learn a policy from trajectories generated by all previous policies [3]. [sent-77, score-0.461]

44 In the first iteration, we collect a training set D_1 = {(s_{π∗}, π∗(s_{π∗}))} from the oracle (π_1 = π∗) and learn a policy π_2. [sent-80, score-0.761]

45 In iteration i, we collect trajectories by executing the previous policy π_i and form the training set D_i by labeling each visited state s_{π_i} with the oracle action π∗(s_{π_i}); π_{i+1} is then learned on the aggregated dataset D_{1:i}. [sent-82, score-1.093]
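
The loop below is a sketch of this iterative procedure, under the same assumed environment and oracle interfaces as in the earlier sketch; aggregating D_1, ..., D_i into a single training set is the essential difference from the purely supervised approach.

```python
# Sketch of a DAgger-style loop: execute the current policy to induce its own
# state distribution, label every visited state with the oracle action,
# aggregate all data so far, and retrain. Interfaces are assumptions.
from sklearn.linear_model import LogisticRegression

def dagger(env, oracle_action, n_iterations=10, horizon=20, n_episodes=20):
    X_agg, y_agg = [], []              # aggregated dataset D_{1:i}
    policy = None                      # pi_1 = pi*: use the oracle in iteration 1
    for _ in range(n_iterations):
        for _ in range(n_episodes):
            s = env.reset()
            for _ in range(horizon):
                if policy is None:
                    a = oracle_action(s)
                else:
                    a = policy.predict([env.features(s)])[0]
                X_agg.append(env.features(s))
                y_agg.append(oracle_action(s))   # always label with the oracle
                s, done = env.step(s, a)         # but follow the executed policy
                if done:
                    break
        policy = LogisticRegression().fit(X_agg, y_agg)   # pi_{i+1} trained on D_{1:i}
    return policy
```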

46 Intuitively, this enables the learner to make up for past failures to mimic the oracle. [sent-86, score-0.245]

47 Thus we can obtain a policy that performs well under its own induced state distribution. [sent-87, score-0.458]

48 2.3 Reduction to Online Learning Let ℓ_i(π) = E_{s∼d_{π_i}}[ℓ(s, π, π∗(s))] denote the expected surrogate loss of executing π in states distributed according to d_{π_i}. [sent-89, score-0.281]

49 In an online learning setting, in iteration i an algorithm executes policy π_i and observes loss ℓ_i(π_i). [sent-90, score-0.625]

50 It then provides a different policy π_{i+1} in the next iteration and observes ℓ_{i+1}(π_{i+1}). [sent-91, score-0.444]

51 In each iteration it picks the policy that works best so far: π_{i+1} = arg min_{π∈Π} Σ_{j=1}^{i} ℓ_j(π). [sent-94, score-0.5]

52 Similarly, in DAgger at iteration i we choose the policy that has the minimum surrogate loss on all previous data. [sent-95, score-0.606]
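
As a toy illustration of this Follow-The-Leader view (not the paper's algorithm), the sketch below plays FTL on one-dimensional quadratic losses ℓ_i(w) = (w − c_i)²; in DAgger the "leader" step is instead the classifier trained on all data collected so far.

```python
# Toy Follow-The-Leader on quadratic losses ell_i(w) = (w - c_i)^2.
# The leader w_{i+1} minimizes the cumulative loss seen so far, which for
# quadratics is simply the running mean of the observed targets c_1..c_i.
def follow_the_leader(targets, w_init=0.0):
    w, played, seen = w_init, [], []
    for c in targets:
        played.append(w)              # play the current leader w_i
        seen.append(c)                # then observe ell_i
        w = sum(seen) / len(seen)     # w_{i+1} = arg min_w sum_j (w - c_j)^2
    return played

print(follow_the_leader([1.0, 0.0, 1.0]))   # -> [0.0, 1.0, 0.5]
```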

53 Let ε_N = min_{π∈Π} (1/N) Σ_{i=1}^{N} E_{s∼d_{π_i}}[ℓ(s, π, π∗(s))] be the minimum loss we can achieve in the policy space Π. [sent-102, score-0.513]

54 For DAgger, if N is O(uT log T) and Q^{π∗}_{T−t+1}(s, π) − Q^{π∗}_{T−t+1}(s, π∗) ≤ u, then there exists a policy π ∈ π_{1:N} s.t. J(π) ≤ J(π∗) + uT ε_N + O(1). [sent-104, score-0.389]

55 3 Imitation by Coaching An oracle can be hard to imitate in two ways. [sent-108, score-0.396]

56 First, the learner’s policy space may be far from the space in which the oracle’s policy lies, meaning that the learner has only limited learning ability. [sent-109, score-1.34]

57 Second, the environment information known by the oracle cannot be sufficiently inferred from the state, meaning that the learner does not have access to good learning resources. [sent-110, score-0.528]

58 In the online learning setting, a too-good oracle may result in adversarially varying loss functions over iterations from the learner’s perspective. [sent-111, score-0.467]

59 These difficulties result in a substantial gap between the oracle’s performance and the best performance achievable in the policy space Π. [sent-113, score-0.469]

60 To better instruct the learner, a coach should demonstrate actions that are not much worse than the oracle action but are easier to achieve within the learner’s ability. [sent-117, score-0.725]

61 The lower an action’s task loss is, the closer it is to the oracle action. [sent-118, score-0.448]

62 Therefore, similar to [6], we define a hope action that combines the task loss and the score of the learner’s current policy. [sent-120, score-0.373]

63 Let score_{π_i}(s, a) be a measure of how likely π_i is to choose action a in state s. [sent-121, score-0.237]

64 In the first iteration, we set λ1 = 0 as the learner has not learned any model yet. [sent-123, score-0.239]

65 Our intuition is that when the learner has difficulty performing the optimal action, the coach should lower the goal appropriately and let the learner gradually achieve the original goal in a more stable way. [sent-125, score-0.626]
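
Putting these pieces together, a coaching loop has the same aggregate-and-improve structure as DAgger, except that each visited state is paired with a hope action rather than the oracle action, and λ shrinks so the coach gradually hands over to the oracle. The sketch below uses a simple perceptron-style update toward the hope action; the update rule, the geometric decay of λ, and all interfaces are illustrative simplifications, not the paper's exact algorithm.

```python
# Sketch of a coaching loop with a perceptron-style learner over phi(s, a).
# lambda_1 = 0 (pure oracle, since there is no model yet); afterwards lambda
# decays geometrically so the hope actions approach the oracle actions.
# phi, task_loss, actions and env are assumed helpers; the perceptron update
# stands in for the paper's Follow-The-Leader retraining for brevity.
import numpy as np

def coaching(env, actions, phi, task_loss, n_iterations=10, horizon=20,
             n_episodes=20, lam0=1.0, decay=0.5):
    w = None
    for i in range(n_iterations):
        lam = 0.0 if i == 0 else lam0 * decay ** (i - 1)
        for _ in range(n_episodes):
            s = env.reset()
            for _ in range(horizon):
                acts = actions(s)
                feats = {a: np.asarray(phi(s, a), dtype=float) for a in acts}
                if w is None:
                    w = np.zeros_like(next(iter(feats.values())))
                score = lambda a: float(w @ feats[a])
                a_hope = max(acts, key=lambda a: lam * score(a) - task_loss(s, a))
                a_pred = max(acts, key=score)
                if a_pred != a_hope:                   # nudge w toward the hope action
                    w = w + feats[a_hope] - feats[a_pred]
                a_exec = a_hope if i == 0 else a_pred  # iteration 1 executes the oracle
                s, done = env.step(s, a_exec)
                if done:
                    break
    return w
```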

66 We denote by ε̃_N = min_{π∈Π} (1/N) Σ_{i=1}^{N} ℓ̃_i(π) the minimum loss of the best policy in hindsight with respect to hope actions. [sent-128, score-0.528]

67 For DAgger with coaching, if N is O(uT log T) and Q^{π∗}_{T−t+1}(s, π) − Q^{π∗}_{T−t+1}(s, π∗) ≤ u, then there exists a policy π ∈ π_{1:N} s.t. J(π) ≤ J(π∗) + uT ε̃_N + O(1). [sent-130, score-0.389]

68 It is important to note that both the DAgger theorem and the coaching theorem provide a relative guarantee. [sent-133, score-0.673]

69 They depend on whether we can find a policy that has small training error in each Follow-The-Leader step. [sent-134, score-0.418]

70 Through coaching, we can always adjust λ to create a more learnable oracle policy space, and thus obtain a relatively good policy with small training error, at the price of running a few more iterations. [sent-136, score-1.119]

71 We consider a policy π parameterized by a vector w ∈ Rd . [sent-138, score-0.389]

72 The predicted action is â_{π,s} = arg max_{a∈A} w^T φ(s, a) (4), and the hope action is ã_{π,s} = arg max_{a∈A} λ · w^T φ(s, a) − L(s, a). [sent-140, score-0.486]
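
A small sketch of these two decision rules, assuming φ(s, a) returns a NumPy feature vector and task_loss(s, a) implements L(s, a); both helpers, and how λ is scheduled, are assumptions for illustration.

```python
import numpy as np

def predicted_action(w, s, actions, phi):
    """Predicted action: arg max_a w^T phi(s, a)."""
    return max(actions, key=lambda a: float(np.dot(w, phi(s, a))))

def hope_action(w, s, actions, phi, task_loss, lam):
    """Hope action: arg max_a  lam * w^T phi(s, a) - L(s, a).

    lam = 0 recovers the oracle action arg min_a L(s, a); larger lam favors
    actions the current policy already scores highly (easier to learn).
    """
    return max(actions, key=lambda a: lam * float(np.dot(w, phi(s, a))) - task_loss(s, a))
```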

73 It has been shown that given a strongly convex loss function ℓ, Follow-The-Leader has O(log N) regret [12, 13]. [sent-144, score-0.233]

74 Then Follow-The-Leader on the functions ℓ_i has the following regret: Σ_{i=1}^{N} ℓ_i(π_i) − min_{π∈Π} Σ_{i=1}^{N} ℓ_i(π) ≤ Σ_{i=1}^{N} (ℓ_i(π_i) − ℓ_i(π_{i+1})) ≤ (2nb²/m) log(DRmNb/(2nb²) + 1). To analyze the regret using the surrogate loss with respect to hope actions, we use Lemma 1. [sent-148, score-0.35]

75 Similar arguments yield the same bound. In the online learning setting, imitating the coach means observing the loss ℓ̃_i(π_i) and learning a policy π_{i+1} = arg min_{π∈Π} Σ_{j=1}^{i} ℓ̃_j(π) at iteration i. [sent-180, score-0.843]

76 Considering the best policy in π_{1:N}, we have min_{π∈π_{1:N}} E_{s∼d_π}[ℓ(s, π, π∗(s))] ≤ (1/N) Σ_{i=1}^{N} E_{s∼d_{π_i}}[ℓ(s, π_i, π∗(s))] ≤ ε̃_N + 2nλ_max/N + (2nb²/(mN)) log(DRmNb/(2nb²) + 1). When N is Ω(T log T), the regret is O(1/T). [sent-185, score-0.543]

77 4 Experiments We apply imitation learning to a novel dynamic feature selection problem. [sent-187, score-0.3]

78 The action space includes a set of non-selected features and the stop action. [sent-192, score-0.234]

79 At each time step, the policy decides whether to stop acquiring features and make a prediction; if not, which feature(s) to purchase next. [sent-193, score-0.437]

80 Achieving an accuracy-cost trade-off corresponds to finding the optimal policy minimizing a loss function. [sent-194, score-0.496]

81 where margin(a) denotes the margin of classifying the instance after action a; cost(s) denotes the user-defined cost of all selected features in the current state s; and α is a user-specified trade-off parameter. [sent-259, score-0.322]
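
A sketch of this reward, with the margin and feature costs supplied by the caller; the function and argument names are assumptions for illustration only.

```python
def feature_selection_reward(margin_after_a, selected_feature_costs, alpha):
    """reward(a) = margin(a) - alpha * cost(s).

    margin_after_a:         classification margin after taking action a
    selected_feature_costs: user-defined costs of the features selected in state s
    alpha:                  user-specified accuracy-cost trade-off parameter
    """
    return margin_after_a - alpha * sum(selected_feature_costs)
```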

82 4.1 Dynamic Feature Selection by Imitation Learning Ideally, an oracle should lead to a subset of features having the maximum reward. [sent-262, score-0.343]

83 In addition, the oracle action may not be unique, since the optimal subset of features does not have to be selected in a fixed order. [sent-264, score-0.512]

84 Given a state s, the oracle iterates through the action space and calculates each action’s loss; it then chooses the action that leads to the minimum immediate loss in the current state. [sent-266, score-0.842]
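
A sketch of this greedy oracle, assuming an immediate_loss(s, a) helper (for instance, the negative of the accuracy-cost reward above); this is an illustrative reading of the description, not the paper's implementation.

```python
def greedy_oracle_action(state, candidate_actions, immediate_loss):
    """Enumerate the remaining actions (non-selected features plus 'stop') and
    return the one with the smallest immediate loss in the current state."""
    return min(candidate_actions, key=lambda a: immediate_loss(state, a))
```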

85 In most cases, our oracle can achieve high accuracy with rather small cost. [sent-268, score-0.336]

86 Consider a linear classifier: since the oracle already knows the correct class label of an instance, it can simply choose, for example, a positive-valued feature with a positive weight to correctly classify a positive instance. [sent-269, score-0.348]

87 In addition, at the start state, even when φ(s_0, a) is almost the same for all instances, the oracle may still tend to choose features that favor the instance’s class. [sent-270, score-0.39]

88 In the next section we show that in this case coaching achieves better results than using an oracle. [sent-272, score-0.627]

89 In addition, the reward of coaching changes smoothly and grows stably, which means coaching avoids drastic changes in the policy. [sent-279, score-0.648]

90 We use the best policy evaluated on the development set for each α. [sent-292, score-0.389]

91 However, the learned policy is still much worse than the oracle’s policy. [sent-300, score-0.412]

92 This is because coaching is still inherently limited by the insufficient policy space, which could be addressed by using more expensive kernels or nonlinear policies. [sent-301, score-0.994]

93 A similar idea has been used in [5] for selecting oracle translations in machine translation. [sent-304, score-0.341]

94 They maximized a linear combination of the BLEU score (similar to negative task loss in our case) and the model score to find good translations that are easier to train against. [sent-305, score-0.262]

95 We propose a coaching algorithm that lets the learner target easier goals first and gradually approach the oracle. [sent-314, score-0.89]

96 We show that coaching has a lower regret bound both theoretically and empirically. [sent-315, score-0.729]

97 In the future, we are interested in formally defining the hardness of a problem so that we know exactly in which cases coaching is more suitable than DAgger. [sent-316, score-0.605]

98 Another direction is to develop a similar coaching process in online convex optimization by optimizing a sequence of approximating functions. [sent-317, score-0.679]

99 We are also interested in applying coaching to more complex structured prediction problems in natural language processing and computer vision. [sent-318, score-0.626]

100 A reduction of imitation learning and structured prediction to no-regret online learning. [sent-338, score-0.293]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('coaching', 0.605), ('policy', 0.389), ('dagger', 0.317), ('oracle', 0.312), ('learner', 0.216), ('imitation', 0.206), ('action', 0.169), ('coach', 0.15), ('loss', 0.107), ('regret', 0.1), ('drmn', 0.076), ('surrogate', 0.075), ('executing', 0.068), ('wt', 0.06), ('imitate', 0.057), ('actions', 0.052), ('online', 0.048), ('state', 0.047), ('ut', 0.047), ('gradually', 0.044), ('reward', 0.043), ('ross', 0.043), ('st', 0.042), ('arg', 0.04), ('digit', 0.039), ('imitating', 0.038), ('trajectories', 0.037), ('score', 0.036), ('feature', 0.036), ('min', 0.036), ('iteration', 0.035), ('cost', 0.034), ('theorem', 0.034), ('radar', 0.033), ('forward', 0.033), ('hope', 0.032), ('collect', 0.031), ('states', 0.031), ('selection', 0.031), ('features', 0.031), ('aggregation', 0.03), ('task', 0.029), ('translations', 0.029), ('mimic', 0.029), ('training', 0.029), ('rd', 0.028), ('supervised', 0.028), ('chiang', 0.028), ('dynamic', 0.027), ('hard', 0.027), ('executes', 0.026), ('convex', 0.026), ('proximal', 0.026), ('easier', 0.025), ('di', 0.025), ('accuracy', 0.024), ('bound', 0.024), ('hal', 0.023), ('learned', 0.023), ('achieves', 0.022), ('daum', 0.022), ('wi', 0.022), ('liang', 0.022), ('gap', 0.022), ('induced', 0.022), ('jason', 0.021), ('chooses', 0.021), ('margin', 0.021), ('es', 0.021), ('sequential', 0.021), ('structured', 0.021), ('achievable', 0.021), ('visited', 0.02), ('observes', 0.02), ('dynamically', 0.02), ('instance', 0.02), ('md', 0.02), ('substantial', 0.02), ('mcallester', 0.019), ('insuf', 0.019), ('demonstrates', 0.019), ('hazan', 0.019), ('ranked', 0.019), ('bounded', 0.019), ('classi', 0.018), ('max', 0.018), ('translation', 0.018), ('ignores', 0.018), ('reduction', 0.018), ('iteratively', 0.018), ('wrong', 0.017), ('stop', 0.017), ('space', 0.017), ('agent', 0.017), ('er', 0.017), ('instruct', 0.017), ('argall', 0.017), ('chernova', 0.017), ('adequate', 0.017), ('bleu', 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 160 nips-2012-Imitation Learning by Coaching

Author: He He, Jason Eisner, Hal Daume

Abstract: Imitation Learning has been shown to be successful in solving many challenging real-world problems. Some recent approaches give strong performance guarantees by training the policy iteratively. However, it is important to note that these guarantees depend on how well the policy we found can imitate the oracle on the training data. When there is a substantial difference between the oracle’s ability and the learner’s policy space, we may fail to find a policy that has low error on the training set. In such cases, we propose to use a coach that demonstrates easy-to-learn actions for the learner and gradually approaches the oracle. By a reduction of learning by demonstration to online learning, we prove that coaching can yield a lower regret bound than using the oracle. We apply our algorithm to cost-sensitive dynamic feature selection, a hard decision problem that considers a user-specified accuracy-cost trade-off. Experimental results on UCI datasets show that our method outperforms state-of-the-art imitation learning methods in dynamic feature selection and two static feature selection methods. 1

2 0.30996558 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

Author: Jiarong Jiang, Adam Teichert, Jason Eisner, Hal Daume

Abstract: Users want inference to be both fast and accurate, but quality often comes at the cost of speed. The field has experimented with approximate inference algorithms that make different speed-accuracy tradeoffs (for particular problems and datasets). We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing [12]. Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is simply too large to explore naively. An attempt to counteract this by applying imitation learning algorithms also fails: the “teacher” follows a far better policy than anything in our learner’s policy space, free of the speed-accuracy tradeoff that arises when oracle information is unavailable, and thus largely insensitive to the known reward function. We propose a hybrid reinforcement/apprenticeship learning algorithm that learns to speed up an initial policy, trading off accuracy for speed according to various settings of a speed term in the loss function. 1

3 0.23216219 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

Author: Bruno Scherrer, Boris Lesner

Abstract: We consider infinite-horizon stationary γ-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error at each iteration, it is well-known that one can compute stationary policies that are 2γ/(1−γ)²-optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for computing non-stationary policies that can be up to 2γ/(1−γ)-optimal, which constitutes a significant improvement in the usual situation when γ is close to 1. Surprisingly, this shows that the problem of “computing near-optimal non-stationary policies” is much simpler than that of “computing near-optimal stationary policies”. 1

4 0.22215448 38 nips-2012-Algorithms for Learning Markov Field Policies

Author: Abdeslam Boularias, Jan R. Peters, Oliver B. Kroemer

Abstract: We use a graphical model for representing policies in Markov Decision Processes. This new representation can easily incorporate domain knowledge in the form of a state similarity graph that loosely indicates which states are supposed to have similar optimal actions. A bias is then introduced into the policy search process by sampling policies from a distribution that assigns high probabilities to policies that agree with the provided state similarity graph, i.e. smoother policies. This distribution corresponds to a Markov Random Field. We also present forward and inverse reinforcement learning algorithms for learning such policy distributions. We illustrate the advantage of the proposed approach on two problems: cart-balancing with swing-up, and teaching a robot to grasp unknown objects. 1

5 0.2106773 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

Author: Aaron Wilson, Alan Fern, Prasad Tadepalli

Abstract: We consider the problem of learning control policies via trajectory preference queries to an expert. In particular, the agent presents an expert with short runs of a pair of policies originating from the same state and the expert indicates which trajectory is preferred. The agent’s goal is to elicit a latent target policy from the expert with as few queries as possible. To tackle this problem we propose a novel Bayesian model of the querying process and introduce two methods that exploit this model to actively select expert queries. Experimental results on four benchmark problems indicate that our model can effectively learn policies from trajectory preference queries and that active query selection can be substantially more efficient than random selection. 1

6 0.20665459 344 nips-2012-Timely Object Recognition

7 0.20486352 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

8 0.16442432 348 nips-2012-Tractable Objectives for Robust Policy Optimization

9 0.1495972 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

10 0.14680368 285 nips-2012-Query Complexity of Derivative-Free Optimization

11 0.14473766 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

12 0.12748727 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

13 0.12702557 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

14 0.11589382 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

15 0.11584995 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes

16 0.10270942 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method

17 0.099627368 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

18 0.09262199 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

19 0.091082454 153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model

20 0.088616975 199 nips-2012-Link Prediction in Graphs with Autoregressive Features


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.185), (1, -0.402), (2, -0.007), (3, -0.057), (4, 0.044), (5, 0.016), (6, 0.005), (7, -0.0), (8, -0.006), (9, -0.006), (10, 0.038), (11, 0.058), (12, -0.03), (13, 0.039), (14, 0.048), (15, 0.015), (16, 0.026), (17, 0.034), (18, 0.022), (19, 0.039), (20, -0.039), (21, -0.001), (22, -0.006), (23, 0.022), (24, -0.028), (25, -0.009), (26, -0.075), (27, 0.017), (28, -0.02), (29, -0.108), (30, 0.032), (31, 0.04), (32, 0.021), (33, -0.007), (34, 0.005), (35, -0.007), (36, 0.014), (37, -0.03), (38, 0.048), (39, -0.015), (40, 0.04), (41, -0.016), (42, -0.072), (43, 0.029), (44, 0.013), (45, -0.031), (46, -0.017), (47, 0.161), (48, 0.08), (49, 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94554603 160 nips-2012-Imitation Learning by Coaching

Author: He He, Jason Eisner, Hal Daume

Abstract: Imitation Learning has been shown to be successful in solving many challenging real-world problems. Some recent approaches give strong performance guarantees by training the policy iteratively. However, it is important to note that these guarantees depend on how well the policy we found can imitate the oracle on the training data. When there is a substantial difference between the oracle’s ability and the learner’s policy space, we may fail to find a policy that has low error on the training set. In such cases, we propose to use a coach that demonstrates easy-to-learn actions for the learner and gradually approaches the oracle. By a reduction of learning by demonstration to online learning, we prove that coaching can yield a lower regret bound than using the oracle. We apply our algorithm to cost-sensitive dynamic feature selection, a hard decision problem that considers a user-specified accuracy-cost trade-off. Experimental results on UCI datasets show that our method outperforms state-of-the-art imitation learning methods in dynamic feature selection and two static feature selection methods. 1

2 0.8437472 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

Author: Jiarong Jiang, Adam Teichert, Jason Eisner, Hal Daume

Abstract: Users want inference to be both fast and accurate, but quality often comes at the cost of speed. The field has experimented with approximate inference algorithms that make different speed-accuracy tradeoffs (for particular problems and datasets). We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing [12]. Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is simply too large to explore naively. An attempt to counteract this by applying imitation learning algorithms also fails: the “teacher” follows a far better policy than anything in our learner’s policy space, free of the speed-accuracy tradeoff that arises when oracle information is unavailable, and thus largely insensitive to the known reward function. We propose a hybrid reinforcement/apprenticeship learning algorithm that learns to speed up an initial policy, trading off accuracy for speed according to various settings of a speed term in the loss function. 1

3 0.79971313 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

Author: Bruno Scherrer, Boris Lesner

Abstract: We consider infinite-horizon stationary γ-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error at each iteration, it is well-known that one can compute stationary policies that are 2γ/(1−γ)²-optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for computing non-stationary policies that can be up to 2γ/(1−γ)-optimal, which constitutes a significant improvement in the usual situation when γ is close to 1. Surprisingly, this shows that the problem of “computing near-optimal non-stationary policies” is much simpler than that of “computing near-optimal stationary policies”. 1

4 0.79552561 38 nips-2012-Algorithms for Learning Markov Field Policies

Author: Abdeslam Boularias, Jan R. Peters, Oliver B. Kroemer

Abstract: We use a graphical model for representing policies in Markov Decision Processes. This new representation can easily incorporate domain knowledge in the form of a state similarity graph that loosely indicates which states are supposed to have similar optimal actions. A bias is then introduced into the policy search process by sampling policies from a distribution that assigns high probabilities to policies that agree with the provided state similarity graph, i.e. smoother policies. This distribution corresponds to a Markov Random Field. We also present forward and inverse reinforcement learning algorithms for learning such policy distributions. We illustrate the advantage of the proposed approach on two problems: cart-balancing with swing-up, and teaching a robot to grasp unknown objects. 1

5 0.79390007 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

Author: Aaron Wilson, Alan Fern, Prasad Tadepalli

Abstract: We consider the problem of learning control policies via trajectory preference queries to an expert. In particular, the agent presents an expert with short runs of a pair of policies originating from the same state and the expert indicates which trajectory is preferred. The agent’s goal is to elicit a latent target policy from the expert with as few queries as possible. To tackle this problem we propose a novel Bayesian model of the querying process and introduce two methods that exploit this model to actively select expert queries. Experimental results on four benchmark problems indicate that our model can effectively learn policies from trajectory preference queries and that active query selection can be substantially more efficient than random selection. 1

6 0.78644007 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

7 0.72186017 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

8 0.71870208 350 nips-2012-Trajectory-Based Short-Sighted Probabilistic Planning

9 0.63899744 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

10 0.62849021 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

11 0.62809247 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method

12 0.61756951 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

13 0.61361635 348 nips-2012-Tractable Objectives for Robust Policy Optimization

14 0.61217213 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

15 0.5864442 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes

16 0.56788993 344 nips-2012-Timely Object Recognition

17 0.55889928 245 nips-2012-Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions

18 0.55583435 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

19 0.54865861 51 nips-2012-Bayesian Hierarchical Reinforcement Learning

20 0.51487088 285 nips-2012-Query Complexity of Derivative-Free Optimization


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.045), (21, 0.013), (38, 0.134), (42, 0.035), (54, 0.093), (62, 0.218), (74, 0.066), (76, 0.119), (80, 0.115), (92, 0.038)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.7961089 160 nips-2012-Imitation Learning by Coaching

Author: He He, Jason Eisner, Hal Daume

Abstract: Imitation Learning has been shown to be successful in solving many challenging real-world problems. Some recent approaches give strong performance guarantees by training the policy iteratively. However, it is important to note that these guarantees depend on how well the policy we found can imitate the oracle on the training data. When there is a substantial difference between the oracle’s ability and the learner’s policy space, we may fail to find a policy that has low error on the training set. In such cases, we propose to use a coach that demonstrates easy-to-learn actions for the learner and gradually approaches the oracle. By a reduction of learning by demonstration to online learning, we prove that coaching can yield a lower regret bound than using the oracle. We apply our algorithm to cost-sensitive dynamic feature selection, a hard decision problem that considers a user-specified accuracy-cost trade-off. Experimental results on UCI datasets show that our method outperforms state-of-the-art imitation learning methods in dynamic feature selection and two static feature selection methods. 1

2 0.78483176 38 nips-2012-Algorithms for Learning Markov Field Policies

Author: Abdeslam Boularias, Jan R. Peters, Oliver B. Kroemer

Abstract: We use a graphical model for representing policies in Markov Decision Processes. This new representation can easily incorporate domain knowledge in the form of a state similarity graph that loosely indicates which states are supposed to have similar optimal actions. A bias is then introduced into the policy search process by sampling policies from a distribution that assigns high probabilities to policies that agree with the provided state similarity graph, i.e. smoother policies. This distribution corresponds to a Markov Random Field. We also present forward and inverse reinforcement learning algorithms for learning such policy distributions. We illustrate the advantage of the proposed approach on two problems: cart-balancing with swing-up, and teaching a robot to grasp unknown objects. 1

3 0.76257175 346 nips-2012-Topology Constraints in Graphical Models

Author: Marcelo Fiori, Pablo Musé, Guillermo Sapiro

Abstract: Graphical models are a very useful tool to describe and understand natural phenomena, from gene expression to climate change and social interactions. The topological structure of these graphs/networks is a fundamental part of the analysis, and in many cases the main goal of the study. However, little work has been done on incorporating prior topological knowledge onto the estimation of the underlying graphical models from sample data. In this work we propose extensions to the basic joint regression model for network estimation, which explicitly incorporate graph-topological constraints into the corresponding optimization approach. The first proposed extension includes an eigenvector centrality constraint, thereby promoting this important prior topological property. The second developed extension promotes the formation of certain motifs, triangle-shaped ones in particular, which are known to exist for example in genetic regulatory networks. The presentation of the underlying formulations, which serve as examples of the introduction of topological constraints in network estimation, is complemented with examples in diverse datasets demonstrating the importance of incorporating such critical prior knowledge. 1

4 0.73493499 287 nips-2012-Random function priors for exchangeable arrays with applications to graphs and relational data

Author: James Lloyd, Peter Orbanz, Zoubin Ghahramani, Daniel M. Roy

Abstract: A fundamental problem in the analysis of structured relational data like graphs, networks, databases, and matrices is to extract a summary of the common structure underlying relations between individual entities. Relational data are typically encoded in the form of arrays; invariance to the ordering of rows and columns corresponds to exchangeable arrays. Results in probability theory due to Aldous, Hoover and Kallenberg show that exchangeable arrays can be represented in terms of a random measurable function which constitutes the natural model parameter in a Bayesian model. We obtain a flexible yet simple Bayesian nonparametric model by placing a Gaussian process prior on the parameter function. Efficient inference utilises elliptical slice sampling combined with a random sparse approximation to the Gaussian process. We demonstrate applications of the model to network data and clarify its relation to models in the literature, several of which emerge as special cases. 1

5 0.7253772 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

Author: Jiarong Jiang, Adam Teichert, Jason Eisner, Hal Daume

Abstract: Users want inference to be both fast and accurate, but quality often comes at the cost of speed. The field has experimented with approximate inference algorithms that make different speed-accuracy tradeoffs (for particular problems and datasets). We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing [12]. Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is simply too large to explore naively. An attempt to counteract this by applying imitation learning algorithms also fails: the “teacher” follows a far better policy than anything in our learner’s policy space, free of the speed-accuracy tradeoff that arises when oracle information is unavailable, and thus largely insensitive to the known reward function. We propose a hybrid reinforcement/apprenticeship learning algorithm that learns to speed up an initial policy, trading off accuracy for speed according to various settings of a speed term in the loss function. 1

6 0.7250734 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions

7 0.72435296 177 nips-2012-Learning Invariant Representations of Molecules for Atomization Energy Prediction

8 0.72419012 344 nips-2012-Timely Object Recognition

9 0.72245234 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

10 0.72221869 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

11 0.71447277 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

12 0.71318465 70 nips-2012-Clustering by Nonnegative Matrix Factorization Using Graph Random Walk

13 0.70834476 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

14 0.70765173 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

15 0.7073431 292 nips-2012-Regularized Off-Policy TD-Learning

16 0.70673805 51 nips-2012-Bayesian Hierarchical Reinforcement Learning

17 0.70642483 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

18 0.70072794 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

19 0.699893 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

20 0.69912368 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning