nips nips2012 nips2012-173 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jiarong Jiang, Adam Teichert, Jason Eisner, Hal Daume
Abstract: Users want inference to be both fast and accurate, but quality often comes at the cost of speed. The field has experimented with approximate inference algorithms that make different speed-accuracy tradeoffs (for particular problems and datasets). We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing [12]. Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is simply too large to explore naively. An attempt to counteract this by applying imitation learning algorithms also fails: the “teacher” follows a far better policy than anything in our learner’s policy space, free of the speed-accuracy tradeoff that arises when oracle information is unavailable, and thus largely insensitive to the known reward function. We propose a hybrid reinforcement/apprenticeship learning algorithm that learns to speed up an initial policy, trading off accuracy for speed according to various settings of a speed term in the loss function.
Reference: text
sentIndex sentText sentNum sentScore
1 We propose a hybrid reinforcement/apprenticeship learning algorithm that learns to speed up an initial policy, trading off accuracy for speed according to various settings of a speed term in the loss function. [sent-9, score-0.426]
2 Prioritization heuristics govern the order in which search actions are taken while pruning heuristics explicitly dictate whether particular actions should be taken at all. [sent-18, score-0.577]
3 Examples of prioritization include A∗ [13] and Hierarchical A∗ [19] heuristics, which, in the case of agenda-based parsing, prioritize parse actions so as to reduce work while maintaining the guarantee that the most likely parse is found. [sent-19, score-0.894]
4 1 Prioritized Parsing Given a probabilistic context-free grammar, one approach to inferring the best parse tree for a given sentence is to build the tree from the bottom up by dynamic programming, as in CKY [29]. [sent-35, score-0.383]
5 1 A standard extension of the CKY algorithm [12] uses an agenda—a priority queue of constituents built so far—to decide which constituent is most promising to extend next, as detailed in section 2. [sent-37, score-0.458]
6 The success of the inference algorithm in terms of speed and accuracy hinges on its ability to prioritize “good” actions before “bad” actions. [sent-39, score-0.509]
7 Either CKY or an agenda-based parser that prioritizes by Viterbi inside score will find the highest-scoring parse. [sent-42, score-0.346]
8 In this paper, we consider a very simple notion of time: the number of constituents popped from/pushed into the agenda during inference, halting inference as soon as the parser pops its first complete parse. [sent-53, score-1.0]
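To make the agenda mechanism just described concrete, here is a minimal Python sketch (not the authors' implementation): the priority function, the grammar-combination step, and the completeness test are hypothetical callbacks, duplicates are merged by keeping the best Viterbi inside score, "time" is counted as the number of pops, and parsing halts at the first complete parse popped.

```python
import heapq
from collections import namedtuple

# Hypothetical constituent record: label, span [start, end), Viterbi inside score, backpointers.
Item = namedtuple("Item", "label start end score back")

def agenda_parse(init_items, priority, combine, is_complete):
    """Pop the highest-priority item, merge it into the chart (keeping the best
    Viterbi score per (label, start, end)), push the new items it licenses, and
    halt at the first complete parse.  'Time' is the number of pops."""
    heap, tick = [], 0
    for it in init_items:                       # single-word constituents to start
        heap.append((-priority(it), tick, it)); tick += 1
    heapq.heapify(heap)
    chart, pops = {}, 0
    while heap:
        _, _, item = heapq.heappop(heap)
        pops += 1
        if is_complete(item):
            return item, pops                   # first complete parse: stop here
        key = (item.label, item.start, item.end)
        if key in chart and chart[key].score >= item.score:
            continue                            # duplicate with a worse (or equal) score
        chart[key] = item
        for new in combine(item, chart):        # apply grammar rules to (item, chart)
            heapq.heappush(heap, (-priority(new), tick, new)); tick += 1
    return None, pops
```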
9 Second, the parser’s total runtime and accuracy on a sentence are unknown until parsing is complete, making this an instance of delayed reward. [sent-56, score-0.363]
10 An agent’s policy π describes how the (memoryless) agent chooses an action based on its current state, where π is either a deterministic function of the state (i. [sent-64, score-0.629]
11 The initial state has an agenda consisting of all single-word constituents, and an empty chart of previously popped constituents. [sent-71, score-0.411]
12 (Duplicates in the chart or agenda are merged: the one of highest Viterbi inside score is kept. [sent-78, score-0.389]
13 We are interested in learning a deterministic policy that always pops the highest-priority available action. [sent-80, score-0.604]
14 Thus, learning a policy corresponds to learning a priority function. [sent-81, score-0.427]
15 Formally, our policy is πθ(s) = argmax_a θ · φ(a, s) (2). An admissible policy in the sense of A∗ search [13] would guarantee that we always return the parse of highest Viterbi inside score—but we do not require this, instead aiming to optimize Eq (1). [sent-84, score-1.018]
16 (1) Viterbi inside score; (2) constituent touches start of sentence; (3) constituent touches end of sentence; (4) constituent length; (5) constituent length / sentence length; (6) log p(constituent label | prev. [sent-87, score-0.762]
17 The log-probability features (1), (6) are inspired by work on figures of merit for agenda-based parsing [4], while case and punctuation patterns (7), (8) are inspired by structure-free parsing [14]. [sent-89, score-0.388]
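As an illustration of Eq (2) over a few of the features listed above, the following sketch scores each candidate action with the linear priority θ · φ and pops the argmax; the attribute names (constituent, viterbi_inside_logprob, sentence_length) are assumptions for the sketch, and only five of the listed features are shown.

```python
def phi(action, state):
    """Illustrative subset of the features listed above (attribute names are assumed)."""
    c, n = action.constituent, state.sentence_length
    return [
        c.viterbi_inside_logprob,          # (1) Viterbi inside score
        1.0 if c.start == 0 else 0.0,      # (2) touches start of sentence
        1.0 if c.end == n else 0.0,        # (3) touches end of sentence
        float(c.end - c.start),            # (4) constituent length
        (c.end - c.start) / float(n),      # (5) constituent length / sentence length
    ]

def dot(theta, feats):
    return sum(t * f for t, f in zip(theta, feats))

def greedy_policy(theta, state, actions):
    """Eq (2): deterministically pop the action with the highest priority theta . phi."""
    return max(actions, key=lambda a: dot(theta, phi(a, state)))
```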
18 The reward function takes a state of the world s and an agent’s chosen action a and returns a real value r that indicates the “immediate reward” the agent receives for taking that action. [sent-91, score-0.506]
19 The reward function we consider is: r(s, a) = acc(a) − λ · time(s) if a is a full parse tree, and 0 otherwise (3). Here, acc(a) measures the accuracy of the full parse tree popped by the action a (against a gold standard) and time(s) is a user-defined measure of time. [sent-93, score-1.111]
20 In words, when the parser completes parsing, it receives reward given by Eq (1); at all other times, it receives no reward. [sent-94, score-0.528]
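A direct transcription of Eq (3) as a small sketch; acc, elapsed and is_full_parse are stand-ins for the user-chosen accuracy measure, time measure and halting test, not names from the paper.

```python
def reward(state, action, lam, acc, elapsed, is_full_parse):
    """Eq (3): r(s, a) = acc(a) - lam * time(s) when a pops a full parse tree, else 0."""
    if is_full_parse(action):
        return acc(action) - lam * elapsed(state)
    return 0.0
```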
21 1 Boltzmann Exploration At test time, the transition between states is deterministic: our policy always chooses the action a that has highest priority in the current state s. [sent-96, score-0.612]
22 However, during training, we promote exploration of policy space by running with stochastic policies πθ (a | s). [sent-97, score-0.427]
23 In particular, we use Boltzmann exploration to construct a stochastic policy with a Gibbs distribution. [sent-99, score-0.4]
24 Our policy is: πθ(a | s) = (1/Z(s)) exp((1/temp) θ · φ(a, s)), with Z(s) as the appropriate normalizing constant (4). That is, the log-likelihood of action a at state s is an affine function of its priority. [sent-100, score-0.619]
25 As temp → 0, πθ approaches the deterministic policy in Eq (2); as temp → ∞, πθ approaches the uniform distribution over available actions. [sent-102, score-0.563]
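A sketch of Boltzmann exploration (Eq (4)), assuming the same theta and phi as in the earlier sketches: priorities are divided by a temperature and normalized with a softmax; subtracting the maximum score is only for numerical stability.

```python
import math, random

def boltzmann_sample(theta, state, actions, phi, temp):
    """Eq (4): P(a | s) proportional to exp(theta . phi(a, s) / temp)."""
    scores = [sum(t * f for t, f in zip(theta, phi(a, state))) / temp for a in actions]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]   # unnormalized Gibbs weights
    z = sum(weights)
    r, cum = random.random() * z, 0.0
    for a, w in zip(actions, weights):
        cum += w
        if r <= cum:
            return a
    return actions[-1]                            # guard against rounding
```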
26 The objective is the expected total reward Eτ∼πθ[R(τ)] = Eτ∼πθ[Σ_{t=0}^{T} rt] (5), where τ is a random trajectory chosen by policy πθ, and rt is the reward at step t of τ. [sent-110, score-0.756]
27 We carry out this optimization using a stochastic gradient ascent algorithm known as policy gradient [27, 26]. [sent-113, score-0.533]
28 It also requires computing the gradient of each policy decision, which, by Eq (4), is: ∇θ log πθ(at | st) = (1/temp) [φ(at, st) − Σ_{a′∈A} πθ(a′ | st) φ(a′, st)] (7). Combining Eq (6) and Eq (7) gives the form of the gradient with respect to a single trajectory. [sent-115, score-0.972]
29 The policy gradient algorithm samples one trajectory (or several) according to the current πθ , and then takes a gradient step according to Eq (6). [sent-116, score-0.607]
30 This increases the probability of actions on high-reward trajectories more than actions on low-reward trajectories. [sent-117, score-0.569]
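The REINFORCE-style update just described, sketched under the same assumptions as the earlier snippets: grad_log_policy implements the single-step term of Eq (7), and the update weights the summed per-step gradients by the trajectory's total reward as in Eq (6). The learning rate lr is a hypothetical hyperparameter.

```python
import math

def grad_log_policy(theta, state, action, actions, phi, temp):
    """Eq (7): (1/temp) * (phi(a_t, s_t) - sum_{a'} pi(a' | s_t) phi(a', s_t))."""
    scores = [sum(t * f for t, f in zip(theta, phi(a, state))) / temp for a in actions]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    probs = [x / z for x in w]
    feats = [phi(a, state) for a in actions]
    f_a = phi(action, state)
    expected = [sum(p * f[k] for p, f in zip(probs, feats)) for k in range(len(f_a))]
    return [(fa - e) / temp for fa, e in zip(f_a, expected)]

def policy_gradient_step(theta, trajectory, total_reward, phi, temp, lr):
    """Eqs (5)-(6): one stochastic ascent step from a sampled trajectory of
    (state, chosen_action, available_actions) triples."""
    grad = [0.0] * len(theta)
    for state, action, actions in trajectory:
        g = grad_log_policy(theta, state, action, actions, phi, temp)
        grad = [gi + total_reward * gj for gi, gj in zip(grad, g)]
    return [th + lr * gi for th, gi in zip(theta, grad)]
```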
31 The baseline system from Running Example 1 always returns the target parse (the complete parse with maximum Viterbi inside score). [sent-119, score-0.56]
32 Unfortunately, running policy gradient from this starting point degrades speed and accuracy. [sent-123, score-0.581]
33 3 Analysis One might wonder why policy gradient performed so poorly on this problem. [sent-126, score-0.449]
34 Under almost every condition, policy gradient was able to learn a near-optimal policy by placing high weight on this cheating feature. [sent-129, score-0.846]
35 The main difficulty with policy gradient is credit assignment: it has no way to determine which actions were “responsible” for a trajectory’s reward. [sent-132, score-0.816]
36 This is a significant problem for us, since the average trajectory length of an A∗ parser on a 15 word sentence is about 30,000 steps, only about 40 of which (less than 0. [sent-134, score-0.416]
37 4 Reward Shaping A classic approach to attenuating the credit assignment problem when one has some knowledge about the domain is reward shaping [10]. [sent-137, score-0.511]
38 The goal of reward shaping is to heuristically associate portions of the total reward with specific time steps, and to favor actions that are observed to be soon followed by a reward, on the assumption that they caused that reward. [sent-138, score-0.872]
39 Thus, the shaped reward is: r̃(s, a) = 1 − ∆(s, a) − λ if a pops a complete parse (causing the parser to halt and return a); 1 − λ if a pops a labeled constituent that appears in the gold parse; −λ otherwise (8). λ is from Eq (1), penalizing the runtime of each step. [sent-141, score-1.542]
40 The correction ∆(s, a) is the number of correct constituents popped into the chart of s that were not in the first-popped parse a. [sent-143, score-0.643]
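A sketch of the shaped reward in Eq (8); here gold_constituents is assumed to be a set of (label, start, end) triples from the gold tree, and delta is a user-supplied function computing the correction ∆(s, a) described above.

```python
def shaped_reward(state, action, lam, gold_constituents, is_complete, delta):
    """Eq (8): every pop pays the time penalty lam; correct constituent pops earn 1;
    the halting pop is corrected by delta(s, a) for correct items left out of the parse."""
    if is_complete(action):
        return 1.0 - delta(state, action) - lam
    if (action.label, action.start, action.end) in gold_constituents:
        return 1.0 - lam
    return -lam
```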
41 When rewards are discounted over time, the policy gradient becomes ∇θ Eτ∼πθ[R̃γ(τ)] = Eτ∼πθ[Σ_{t=0}^{T} (Σ_{t′=t}^{T} γ^{t′−t} r̃_{t′}) ∇θ log πθ(at | st)] (9), where r̃t = r̃(st, at). [sent-149, score-0.659]
42 When γ = 1, following the gradient is equivalent to policy gradient. [sent-151, score-0.449]
43 When γ = 0, the parser gets only immediate reward—and in general, a small γ assigns the credit for a local reward r̃t mainly to actions at at closely preceding times. [sent-152, score-0.98]
44 If an action is on a good trajectory but occurs after most of the useful actions (pops of correct constituents), then it does not receive credit for those previously occurring actions. [sent-154, score-0.599]
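Eq (9) sketched as code, reusing grad_log_policy from the policy-gradient sketch above: each step's gradient term is weighted by the discounted sum of the shaped rewards that follow it, so with a small γ credit stays close to the action that earned it.

```python
def discounted_policy_gradient_step(theta, steps, gamma, phi, temp, lr):
    """steps: list of (state, chosen_action, available_actions, shaped_reward)."""
    T = len(steps)
    returns, g = [0.0] * T, 0.0
    for t in range(T - 1, -1, -1):              # discounted shaped reward-to-go
        g = steps[t][3] + gamma * g
        returns[t] = g
    grad = [0.0] * len(theta)
    for (state, action, actions, _), G in zip(steps, returns):
        glp = grad_log_policy(theta, state, action, actions, phi, temp)   # Eq (7)
        grad = [gi + G * gj for gi, gj in zip(grad, glp)]
    return [th + lr * gi for th, gi in zip(theta, grad)]
```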
45 4 Apprenticeship Learning In reinforcement learning, an agent interacts with an environment and attempts to learn to maximize its reward by repeating actions that led to high reward in the past. [sent-162, score-0.958]
46 In apprenticeship learning, we assume access to a collection of trajectories taken by an optimal policy and attempt to learn to mimic those trajectories. [sent-163, score-0.654]
47 In contrast, the related task of inverse reinforcement learning/optimal control [17, 11] attempts to infer a reward function from the teacher’s optimal behavior. [sent-165, score-0.379]
48 Some of them work by first executing inverse reinforcement learning [11, 17] to induce a reward function and then feeding this reward function into an off-the-shelf reinforcement learning algorithm like policy gradient to learn an approximately optimal agent [1]. [sent-167, score-1.286]
49 1 Oracle Actions With a teacher to help guide the learning process, we would like to explore more intelligently than Boltzmann exploration, in particular, focusing on high-reward regions of policy space. [sent-170, score-0.43]
50 We introduce oracle actions as guidance for areas to explore. [sent-171, score-0.394]
51 Ideally, oracle actions should lead to a maximum-reward tree. [sent-172, score-0.394]
52 In training, we will identify oracle actions to be those that build items in the maximum likelihood parse consistent with the gold parse. [sent-173, score-0.701]
53 When multiple oracle actions are available on the agenda, we will break ties according to the priority assigned by the current policy (i. [sent-174, score-0.845]
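A sketch of the oracle-action selection just described; for simplicity it tests membership in the gold constituent set (the paper's oracle builds the maximum-likelihood parse consistent with the gold tree, which this only approximates) and breaks ties by the current policy's priority.

```python
def pick_oracle_action(actions, gold_constituents, theta, state, phi):
    """Return the highest-priority available action that builds a gold item, or None."""
    oracle = [a for a in actions
              if (a.label, a.start, a.end) in gold_constituents]
    if not oracle:
        return None
    return max(oracle, key=lambda a: sum(t * f for t, f in zip(theta, phi(a, state))))
```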
54 2 Apprenticeship Learning via Classification Given a notion of oracle actions, a straightforward approach to policy learning is to simply train a classifier to follow the oracle—a popular approach in incremental parsing [6, 5]. [sent-178, score-0.704]
55 Trajectories are generated by following oracle actions, breaking ties using the initial policy (Viterbi inside score) when multiple oracle actions are available. [sent-181, score-1.0]
56 At each step in the trajectory, (st , at ), a classification example is generated, where the action taken by the oracle (at ) is considered the correct class and all other available actions are considered incorrect. [sent-183, score-0.552]
57 In fact, the gradient of this classifier (Eq (10)) is nearly identical to the policy gradient (Eq (6)) except that τ is distributed differently and the total reward R(τ ) does not appear: instead of mimicking high-reward trajectories we now try to mimic oracle trajectories. [sent-185, score-1.121]
58 Eτ∼π∗[Σ_{t=0}^{T} (φ(at, st) − Σ_{a′∈A} πθ(a′ | st) φ(a′, st))] (10), where π∗ denotes the oracle policy, so at is the oracle action. [sent-186, score-0.952]
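The classifier objective sketched as code, again reusing grad_log_policy from the earlier sketch: it sums the same per-step term as Eq (7) (up to the temperature factor not shown in Eq (10)) over an oracle trajectory, with no reward weight, i.e. a conditional-likelihood classifier over the oracle's choices.

```python
def classifier_gradient_step(theta, oracle_trajectory, phi, temp, lr):
    """Eq (10): raise the likelihood of oracle actions along an oracle trajectory
    of (state, oracle_action, available_actions) triples."""
    grad = [0.0] * len(theta)
    for state, oracle_action, actions in oracle_trajectory:
        g = grad_log_policy(theta, state, oracle_action, actions, phi, temp)
        grad = [gi + gj for gi, gj in zip(grad, g)]
    return [th + lr * gi for th, gi in zip(theta, grad)]
```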
59 The potential benefit of the classifier-based approach over policy gradient with shaped rewards is increased credit assignment. [sent-187, score-0.663]
60 In policy gradient with reward shaping, an action gets credit for all future reward (though no past reward). [sent-188, score-1.255]
61 The classifier-based approach performs only marginally better than policy gradient with shaped rewards. [sent-191, score-0.489]
62 Unfortunately, this is not computationally feasible due to the poor quality of the policy learned in the first iteration. [sent-196, score-0.365]
63 Attempting to follow the learned policy essentially tries to build all possible constituents licensed by the grammar, which can be prohibitively expensive. [sent-197, score-0.572]
64 Even though our current feature set depends only on the action and not on the state, making action scores independent of the current state, there is still an issue since the set of actions to choose from does depend on the state. [sent-202, score-0.474]
65 That is, the classifier is trained to discriminate only among the small set of agenda items available on the oracle trajectory (which are always combinations of correct constituents). [sent-203, score-0.49]
66 But the action sets the parser faces at test time are much larger and more diverse. [sent-204, score-0.379]
67 Some incorrect actions are more expensive than others, if they create constituents that can be combined in many locally-attractive ways and hence slow the parser down or result in errors. [sent-206, score-0.659]
68 The SEARN algorithm [7] would distinguish them by explicitly evaluating the future reward of each possible action (instead of using a teacher) and incorporating this into the classification problem. [sent-208, score-0.395]
69 Recall that the oracle is “supposed to” choose optimal actions for the given reward. [sent-213, score-0.394]
70 There seems to be a contradiction here: our oracle action selector ignores λ, the tradeoff between accuracy and speed, and only focuses on accuracy. [sent-215, score-0.399]
71 This means that under the apprenticeship learning setting, we are actually never going to be able to learn to trade off accuracy and speed: as far as the oracle is concerned, you can have both! [sent-218, score-0.397]
72 We start with the policy gradient algorithm (Section 3. [sent-221, score-0.449]
73 Our formulation preserves the reinforcement learning flavor of our overall setting, which involves delayed reward for a known reward function. [sent-223, score-0.691]
74 This makes the notion of “interleaving” oracle actions with policy actions both feasible and sensible. [sent-225, score-0.987]
75 Like policy gradient, we draw trajectories from a policy and take gradient steps that favor actions with high reward under reward shaping. [sent-226, score-1.699]
76 Like SEARN and DAGGER, we begin by exploring the space around the optimal policy and slowly explore out from there. [sent-227, score-0.365]
77 We define the oracle-infused policy πδ+ as follows: πδ+(a | s) = δπ∗(a | s) + (1 − δ)π(a | s) (11). In other words, when choosing an action, πδ+ explores the policy space with probability 1 − δ (according to its current model), but with probability δ, we force it to take an oracle action. [sent-230, score-0.896]
78 Our algorithm takes policy gradient steps with reward shaping (Eqs (9) and (7)), but with respect to trajectories drawn from πδ+ rather than π. [sent-231, score-0.934]
79 If δ = 0, it reduces to policy gradient, with reward shaping if γ < 1 and immediate reward if γ = 0. [sent-232, score-1.049]
80 Similar to DAGGER and SEARN, we do not stay at δ = 1, but wean our learner off the oracle supervision as it starts to find a good policy π that imitates the classifier reasonably well. [sent-234, score-0.531]
81 Over time, δ → 0, so that eventually we are training the policy to do well on the same distribution of states that it will pass through at test time (as in policy gradient). [sent-238, score-0.73]
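A sketch of Eq (11) and the weaning schedule just described; the exponential decay of δ is a hypothetical choice (the paper's actual schedule may differ), and sample_from_policy stands for Boltzmann sampling from the current πθ.

```python
import random

def oracle_infused_sample(delta, oracle_action, sample_from_policy):
    """Eq (11): take the oracle action with probability delta, otherwise follow pi_theta.
    Falls back to the learned policy if no oracle action is available."""
    if oracle_action is not None and random.random() < delta:
        return oracle_action
    return sample_from_policy()

def delta_schedule(epoch, decay=0.8):
    """Hypothetical schedule: start at delta = 1 and decay toward 0 over training passes."""
    return decay ** epoch
```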
82 With intermediate values of δ (and γ ≈ 1), an iteration behaves similarly to an iteration of SEARN, except that it “rolls out” the consequences of an action chosen randomly from (11) instead of evaluating all possible actions in parallel. [sent-239, score-0.351]
83 We measure accuracy in terms of labeled recall (including preterminals) and measure speed in terms of the number of pops from the agenda. [sent-250, score-0.438]
84 A constituent is pruned if its Viterbi inside score is more than ∆ worse than that of some other constituent that covers the same substring. [sent-254, score-0.532]
85 performed stochastic gradient descent for 25 passes over the data, sampling 5 trajectories in a row for each sentence (when δ < 1 so that trajectories are random). [sent-275, score-0.427]
86 Our “oracle-infused” compromise “+” uses some oracle actions: after several passes through the data, the parser learns to make good decisions without help from the oracle. [sent-279, score-0.453]
87 [Figure: change of recall and number of agenda pops; the y-axis counts pops in units of 10^7.] [sent-280, score-0.508]
88 Figure 2: Pareto frontiers: Our I+ parser at different values of λ, against the baselines at different pruning levels. [sent-288, score-0.403]
89 Given our previous results in Table 1, we only consider the I+ model: immediate reward with oracle infusion. [sent-295, score-0.478]
90 To investigate trading off speed and accuracy, we learn and then evaluate a policy for each of several settings of the tradeoff parameter: λ. [sent-296, score-0.563]
91 We train our policy using sentences of at most 15 words from Section 22 and evaluate the learned policy on the held out data (from Section 23). [sent-297, score-0.784]
92 We measure accuracy as labeled constituent recall and evaluate speed in terms of the number of pops (or pushes) performed on the agenda. [sent-298, score-0.659]
93 , 10−8 }, using agenda pops as the measure of time. [sent-302, score-0.422]
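A sketch of the experimental sweep described above: one policy is trained per setting of λ and evaluated on held-out sentences, yielding the (pops, recall) points that trace the Pareto frontier; train_policy and evaluate are hypothetical stand-ins for the learner and the evaluation harness.

```python
def pareto_sweep(lambdas, train_policy, evaluate):
    """Train one policy per lambda and record (mean agenda pops, labeled recall)."""
    points = []
    for lam in lambdas:
        theta = train_policy(lam)          # e.g. the oracle-infused I+ learner
        mean_pops, recall = evaluate(theta)
        points.append((lam, mean_pops, recall))
    return points
```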
94 I+ always does better than the coarse-to-fine parser (CTF) in terms of both speed and accuracy, though using the number of agenda pops as our measure of speed puts both of our hierarchical baselines at a disadvantage. [sent-308, score-0.914]
95 Since our reward shaping was crafted with agenda pops in mind, perhaps it is not surprising that learning performs relatively poorly in this setting. [sent-310, score-0.794]
96 7 Conclusions and Future Work In this paper, we considered the application of both reinforcement learning and apprenticeship learning to prioritize search in a way that is sensitive to a user-defined tradeoff between speed and accuracy. [sent-316, score-0.477]
97 We found that a novel oracle-infused variant of the policy gradient algorithm for reinforcement learning is effective for learning a fast and accurate parser with only a simple set of features. [sent-317, score-0.812]
98 The parser might decide to continue working past the first complete parse, or give up (returning a partial or default parse) before any complete parse is found. [sent-322, score-0.563]
99 Both train on oracle trajectories where all actions receive a reward of 1 − λ, and simply try to make these oracle actions probable. [sent-324, score-1.173]
100 However, D- trains more aggressively on long trajectories, since (9) implies that it weights a given training action by T − t + 1, the number of future actions on that trajectory. [sent-325, score-0.351]
wordName wordTfidf (topN-words)
[('policy', 0.365), ('reward', 0.272), ('parser', 0.256), ('pops', 0.239), ('parse', 0.237), ('actions', 0.228), ('constituent', 0.221), ('agenda', 0.183), ('constituents', 0.175), ('parsing', 0.173), ('oracle', 0.166), ('apprenticeship', 0.139), ('credit', 0.139), ('action', 0.123), ('pruning', 0.121), ('chart', 0.116), ('trajectories', 0.113), ('prioritization', 0.112), ('viterbi', 0.111), ('reinforcement', 0.107), ('speed', 0.105), ('shaping', 0.1), ('eq', 0.099), ('temp', 0.099), ('sentence', 0.086), ('st', 0.085), ('gradient', 0.084), ('grammar', 0.083), ('popped', 0.08), ('prioritize', 0.08), ('agent', 0.079), ('trajectory', 0.074), ('teacher', 0.065), ('accuracy', 0.064), ('ida', 0.064), ('priority', 0.062), ('earn', 0.056), ('dagger', 0.056), ('sentences', 0.054), ('cky', 0.052), ('inside', 0.051), ('linguistics', 0.048), ('ctf', 0.048), ('prioritized', 0.048), ('trading', 0.047), ('tradeoff', 0.046), ('rt', 0.045), ('punctuation', 0.042), ('delayed', 0.04), ('shaped', 0.04), ('immediate', 0.04), ('score', 0.039), ('hal', 0.039), ('daum', 0.038), ('gold', 0.038), ('pushes', 0.037), ('mimic', 0.037), ('exploration', 0.035), ('complete', 0.035), ('correct', 0.035), ('imitation', 0.035), ('pos', 0.035), ('rewards', 0.035), ('classi', 0.034), ('temperature', 0.034), ('nlp', 0.033), ('acl', 0.033), ('items', 0.032), ('er', 0.032), ('state', 0.032), ('cheating', 0.032), ('licensed', 0.032), ('roark', 0.032), ('inference', 0.032), ('passes', 0.031), ('recall', 0.03), ('tree', 0.03), ('chooses', 0.03), ('item', 0.029), ('tag', 0.029), ('coling', 0.028), ('pauls', 0.028), ('tests', 0.028), ('mdp', 0.028), ('boltzmann', 0.028), ('ha', 0.028), ('trade', 0.028), ('running', 0.027), ('brian', 0.026), ('baselines', 0.026), ('eugene', 0.026), ('eisner', 0.026), ('treebank', 0.026), ('penn', 0.026), ('pop', 0.026), ('andrew', 0.025), ('ties', 0.024), ('triples', 0.024), ('pareto', 0.024), ('touches', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000008 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed
Author: Jiarong Jiang, Adam Teichert, Jason Eisner, Hal Daume
Abstract: Users want inference to be both fast and accurate, but quality often comes at the cost of speed. The field has experimented with approximate inference algorithms that make different speed-accuracy tradeoffs (for particular problems and datasets). We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing [12]. Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is simply too large to explore naively. An attempt to counteract this by applying imitation learning algorithms also fails: the “teacher” follows a far better policy than anything in our learner’s policy space, free of the speed-accuracy tradeoff that arises when oracle information is unavailable, and thus largely insensitive to the known reward function. We propose a hybrid reinforcement/apprenticeship learning algorithm that learns to speed up an initial policy, trading off accuracy for speed according to various settings of a speed term in the loss function.
2 0.30996558 160 nips-2012-Imitation Learning by Coaching
Author: He He, Jason Eisner, Hal Daume
Abstract: Imitation Learning has been shown to be successful in solving many challenging real-world problems. Some recent approaches give strong performance guarantees by training the policy iteratively. However, it is important to note that these guarantees depend on how well the policy we found can imitate the oracle on the training data. When there is a substantial difference between the oracle’s ability and the learner’s policy space, we may fail to find a policy that has low error on the training set. In such cases, we propose to use a coach that demonstrates easy-to-learn actions for the learner and gradually approaches the oracle. By a reduction of learning by demonstration to online learning, we prove that coaching can yield a lower regret bound than using the oracle. We apply our algorithm to cost-sensitive dynamic feature selection, a hard decision problem that considers a user-specified accuracy-cost trade-off. Experimental results on UCI datasets show that our method outperforms state-of-the-art imitation learning methods in dynamic feature selection and two static feature selection methods. 1
3 0.30317867 162 nips-2012-Inverse Reinforcement Learning through Structured Classification
Author: Edouard Klein, Matthieu Geist, Bilal Piot, Olivier Pietquin
Abstract: This paper adresses the inverse reinforcement learning (IRL) problem, that is inferring a reward for which a demonstrated expert behavior is optimal. We introduce a new algorithm, SCIRL, whose principle is to use the so-called feature expectation of the expert as the parameterization of the score function of a multiclass classifier. This approach produces a reward function for which the expert policy is provably near-optimal. Contrary to most of existing IRL algorithms, SCIRL does not require solving the direct RL problem. Moreover, with an appropriate heuristic, it can succeed with only trajectories sampled according to the expert behavior. This is illustrated on a car driving simulator. 1
4 0.29547912 38 nips-2012-Algorithms for Learning Markov Field Policies
Author: Abdeslam Boularias, Jan R. Peters, Oliver B. Kroemer
Abstract: We use a graphical model for representing policies in Markov Decision Processes. This new representation can easily incorporate domain knowledge in the form of a state similarity graph that loosely indicates which states are supposed to have similar optimal actions. A bias is then introduced into the policy search process by sampling policies from a distribution that assigns high probabilities to policies that agree with the provided state similarity graph, i.e. smoother policies. This distribution corresponds to a Markov Random Field. We also present forward and inverse reinforcement learning algorithms for learning such policy distributions. We illustrate the advantage of the proposed approach on two problems: cart-balancing with swing-up, and teaching a robot to grasp unknown objects. 1
5 0.26644257 245 nips-2012-Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions
Author: Jaedeug Choi, Kee-eung Kim
Abstract: We present a nonparametric Bayesian approach to inverse reinforcement learning (IRL) for multiple reward functions. Most previous IRL algorithms assume that the behaviour data is obtained from an agent who is optimizing a single reward function, but this assumption is hard to guarantee in practice. Our approach is based on integrating the Dirichlet process mixture model into Bayesian IRL. We provide an efficient Metropolis-Hastings sampling algorithm utilizing the gradient of the posterior to estimate the underlying reward functions, and demonstrate that our approach outperforms previous ones via experiments on a number of problem domains. 1
6 0.24984069 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries
7 0.2492345 344 nips-2012-Timely Object Recognition
8 0.22511537 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes
9 0.21215992 348 nips-2012-Tractable Objectives for Robust Policy Optimization
10 0.19235475 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress
11 0.17834534 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning
12 0.17527935 364 nips-2012-Weighted Likelihood Policy Search with Model Selection
13 0.16950879 334 nips-2012-Tensor Decomposition for Fast Parsing with Latent-Variable PCFGs
14 0.16828285 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes
15 0.16435888 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning
16 0.15630551 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search
17 0.14330155 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning
18 0.14219473 153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model
19 0.13170572 204 nips-2012-MAP Inference in Chains using Column Generation
20 0.12787978 51 nips-2012-Bayesian Hierarchical Reinforcement Learning
topicId topicWeight
[(0, 0.211), (1, -0.507), (2, -0.081), (3, -0.083), (4, -0.068), (5, 0.012), (6, 0.006), (7, -0.053), (8, -0.0), (9, 0.008), (10, -0.004), (11, 0.013), (12, 0.027), (13, 0.062), (14, 0.029), (15, 0.026), (16, 0.008), (17, 0.06), (18, 0.083), (19, 0.058), (20, -0.02), (21, 0.011), (22, -0.007), (23, -0.042), (24, -0.024), (25, -0.009), (26, -0.028), (27, -0.074), (28, 0.054), (29, -0.072), (30, 0.049), (31, 0.022), (32, 0.034), (33, -0.025), (34, -0.05), (35, -0.046), (36, 0.039), (37, -0.078), (38, 0.039), (39, -0.065), (40, -0.036), (41, -0.071), (42, -0.035), (43, -0.004), (44, 0.045), (45, -0.053), (46, -0.024), (47, -0.028), (48, 0.101), (49, 0.002)]
simIndex simValue paperId paperTitle
same-paper 1 0.96860033 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed
Author: Jiarong Jiang, Adam Teichert, Jason Eisner, Hal Daume
Abstract: Users want inference to be both fast and accurate, but quality often comes at the cost of speed. The field has experimented with approximate inference algorithms that make different speed-accuracy tradeoffs (for particular problems and datasets). We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing [12]. Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is simply too large to explore naively. An attempt to counteract this by applying imitation learning algorithms also fails: the “teacher” follows a far better policy than anything in our learner’s policy space, free of the speed-accuracy tradeoff that arises when oracle information is unavailable, and thus largely insensitive to the known reward function. We propose a hybrid reinforcement/apprenticeship learning algorithm that learns to speed up an initial policy, trading off accuracy for speed according to various settings of a speed term in the loss function.
2 0.85499954 162 nips-2012-Inverse Reinforcement Learning through Structured Classification
Author: Edouard Klein, Matthieu Geist, Bilal Piot, Olivier Pietquin
Abstract: This paper adresses the inverse reinforcement learning (IRL) problem, that is inferring a reward for which a demonstrated expert behavior is optimal. We introduce a new algorithm, SCIRL, whose principle is to use the so-called feature expectation of the expert as the parameterization of the score function of a multiclass classifier. This approach produces a reward function for which the expert policy is provably near-optimal. Contrary to most of existing IRL algorithms, SCIRL does not require solving the direct RL problem. Moreover, with an appropriate heuristic, it can succeed with only trajectories sampled according to the expert behavior. This is illustrated on a car driving simulator. 1
3 0.83247304 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries
Author: Aaron Wilson, Alan Fern, Prasad Tadepalli
Abstract: We consider the problem of learning control policies via trajectory preference queries to an expert. In particular, the agent presents an expert with short runs of a pair of policies originating from the same state and the expert indicates which trajectory is preferred. The agent’s goal is to elicit a latent target policy from the expert with as few queries as possible. To tackle this problem we propose a novel Bayesian model of the querying process and introduce two methods that exploit this model to actively select expert queries. Experimental results on four benchmark problems indicate that our model can effectively learn policies from trajectory preference queries and that active query selection can be substantially more efficient than random selection. 1
4 0.79539353 38 nips-2012-Algorithms for Learning Markov Field Policies
Author: Abdeslam Boularias, Jan R. Peters, Oliver B. Kroemer
Abstract: We use a graphical model for representing policies in Markov Decision Processes. This new representation can easily incorporate domain knowledge in the form of a state similarity graph that loosely indicates which states are supposed to have similar optimal actions. A bias is then introduced into the policy search process by sampling policies from a distribution that assigns high probabilities to policies that agree with the provided state similarity graph, i.e. smoother policies. This distribution corresponds to a Markov Random Field. We also present forward and inverse reinforcement learning algorithms for learning such policy distributions. We illustrate the advantage of the proposed approach on two problems: cart-balancing with swing-up, and teaching a robot to grasp unknown objects. 1
5 0.77566981 160 nips-2012-Imitation Learning by Coaching
Author: He He, Jason Eisner, Hal Daume
Abstract: Imitation Learning has been shown to be successful in solving many challenging real-world problems. Some recent approaches give strong performance guarantees by training the policy iteratively. However, it is important to note that these guarantees depend on how well the policy we found can imitate the oracle on the training data. When there is a substantial difference between the oracle’s ability and the learner’s policy space, we may fail to find a policy that has low error on the training set. In such cases, we propose to use a coach that demonstrates easy-to-learn actions for the learner and gradually approaches the oracle. By a reduction of learning by demonstration to online learning, we prove that coaching can yield a lower regret bound than using the oracle. We apply our algorithm to cost-sensitive dynamic feature selection, a hard decision problem that considers a user-specified accuracy-cost trade-off. Experimental results on UCI datasets show that our method outperforms state-of-the-art imitation learning methods in dynamic feature selection and two static feature selection methods. 1
6 0.77202147 245 nips-2012-Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions
7 0.72209036 364 nips-2012-Weighted Likelihood Policy Search with Model Selection
8 0.71386588 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning
9 0.71355188 350 nips-2012-Trajectory-Based Short-Sighted Probabilistic Planning
10 0.70508486 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes
11 0.70077658 51 nips-2012-Bayesian Hierarchical Reinforcement Learning
12 0.6930058 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning
13 0.66683757 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress
14 0.64947909 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search
15 0.63198638 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes
16 0.5946942 348 nips-2012-Tractable Objectives for Robust Policy Optimization
17 0.58378369 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning
18 0.57642025 344 nips-2012-Timely Object Recognition
19 0.53488672 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method
20 0.51480931 153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model
topicId topicWeight
[(0, 0.053), (10, 0.217), (17, 0.015), (21, 0.023), (22, 0.017), (38, 0.104), (42, 0.028), (53, 0.011), (54, 0.107), (55, 0.023), (62, 0.011), (74, 0.055), (76, 0.105), (80, 0.11), (92, 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.82187659 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed
Author: Jiarong Jiang, Adam Teichert, Jason Eisner, Hal Daume
Abstract: Users want inference to be both fast and accurate, but quality often comes at the cost of speed. The field has experimented with approximate inference algorithms that make different speed-accuracy tradeoffs (for particular problems and datasets). We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing [12]. Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is simply too large to explore naively. An attempt to counteract this by applying imitation learning algorithms also fails: the “teacher” follows a far better policy than anything in our learner’s policy space, free of the speed-accuracy tradeoff that arises when oracle information is unavailable, and thus largely insensitive to the known reward function. We propose a hybrid reinforcement/apprenticeship learning algorithm that learns to speed up an initial policy, trading off accuracy for speed according to various settings of a speed term in the loss function.
2 0.79911399 11 nips-2012-A Marginalized Particle Gaussian Process Regression
Author: Yali Wang, Brahim Chaib-draa
Abstract: We present a novel marginalized particle Gaussian process (MPGP) regression, which provides a fast, accurate online Bayesian filtering framework to model the latent function. Using a state space model established by the data construction procedure, our MPGP recursively filters out the estimation of hidden function values by a Gaussian mixture. Meanwhile, it provides a new online method for training hyperparameters with a number of weighted particles. We demonstrate the estimated performance of our MPGP on both simulated and real large data sets. The results show that our MPGP is a robust estimation algorithm with high computational efficiency, which outperforms other state-of-art sparse GP methods. 1
3 0.75739986 73 nips-2012-Coding efficiency and detectability of rate fluctuations with non-Poisson neuronal firing
Author: Shinsuke Koyama
Abstract: Statistical features of neuronal spike trains are known to be non-Poisson. Here, we investigate the extent to which the non-Poissonian feature affects the efficiency of transmitting information on fluctuating firing rates. For this purpose, we introduce the Kullback-Leibler (KL) divergence as a measure of the efficiency of information encoding, and assume that spike trains are generated by time-rescaled renewal processes. We show that the KL divergence determines the lower bound of the degree of rate fluctuations below which the temporal variation of the firing rates is undetectable from sparse data. We also show that the KL divergence, as well as the lower bound, depends not only on the variability of spikes in terms of the coefficient of variation, but also significantly on the higher-order moments of interspike interval (ISI) distributions. We examine three specific models that are commonly used for describing the stochastic nature of spikes (the gamma, inverse Gaussian (IG) and lognormal ISI distributions), and find that the time-rescaled renewal process with the IG distribution achieves the largest KL divergence, followed by the lognormal and gamma distributions.
4 0.70255536 177 nips-2012-Learning Invariant Representations of Molecules for Atomization Energy Prediction
Author: Grégoire Montavon, Katja Hansen, Siamac Fazli, Matthias Rupp, Franziska Biegler, Andreas Ziehe, Alexandre Tkatchenko, Anatole V. Lilienfeld, Klaus-Robert Müller
Abstract: The accurate prediction of molecular energetics in chemical compound space is a crucial ingredient for rational compound design. The inherently graph-like, non-vectorial nature of molecular data gives rise to a unique and difficult machine learning problem. In this paper, we adopt a learning-from-scratch approach where quantum-mechanical molecular energies are predicted directly from the raw molecular geometry. The study suggests a benefit from setting flexible priors and enforcing invariance stochastically rather than structurally. Our results improve the state-of-the-art by a factor of almost three, bringing statistical methods one step closer to chemical accuracy. 1
5 0.70047718 287 nips-2012-Random function priors for exchangeable arrays with applications to graphs and relational data
Author: James Lloyd, Peter Orbanz, Zoubin Ghahramani, Daniel M. Roy
Abstract: A fundamental problem in the analysis of structured relational data like graphs, networks, databases, and matrices is to extract a summary of the common structure underlying relations between individual entities. Relational data are typically encoded in the form of arrays; invariance to the ordering of rows and columns corresponds to exchangeable arrays. Results in probability theory due to Aldous, Hoover and Kallenberg show that exchangeable arrays can be represented in terms of a random measurable function which constitutes the natural model parameter in a Bayesian model. We obtain a flexible yet simple Bayesian nonparametric model by placing a Gaussian process prior on the parameter function. Efficient inference utilises elliptical slice sampling combined with a random sparse approximation to the Gaussian process. We demonstrate applications of the model to network data and clarify its relation to models in the literature, several of which emerge as special cases. 1
6 0.69609457 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions
7 0.68744391 344 nips-2012-Timely Object Recognition
8 0.67955965 70 nips-2012-Clustering by Nonnegative Matrix Factorization Using Graph Random Walk
9 0.67147636 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search
10 0.66419917 162 nips-2012-Inverse Reinforcement Learning through Structured Classification
11 0.66345161 331 nips-2012-Symbolic Dynamic Programming for Continuous State and Observation POMDPs
12 0.66335523 160 nips-2012-Imitation Learning by Coaching
13 0.66181529 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning
14 0.66005778 38 nips-2012-Algorithms for Learning Markov Field Policies
15 0.65983564 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions
16 0.6586951 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration
17 0.65834981 51 nips-2012-Bayesian Hierarchical Reinforcement Learning
18 0.65509474 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning
19 0.65238905 232 nips-2012-Multiplicative Forests for Continuous-Time Processes
20 0.65074164 234 nips-2012-Multiresolution analysis on the symmetric group