nips nips2013 nips2013-150 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: David J. Weiss, Ben Taskar
Abstract: Discriminative methods for learning structured models have enabled wide-spread use of very rich feature representations. However, the computational cost of feature extraction is prohibitive for large-scale or time-sensitive applications, often dominating the cost of inference in the models. Significant efforts have been devoted to sparsity-based model selection to decrease this cost. Such feature selection methods control computation statically and miss the opportunity to fine-tune feature extraction to each input at run-time. We address the key challenge of learning to control fine-grained feature extraction adaptively, exploiting nonhomogeneity of the data. We propose an architecture that uses a rich feedback loop between extraction and prediction. The run-time control policy is learned using efficient value-function approximation, which adaptively determines the value of information of features at the level of individual variables for each input. We demonstrate significant speedups over state-of-the-art methods on two challenging datasets. For articulated pose estimation in video, we achieve a more accurate state-of-the-art model that is also faster, with similar results on an OCR task.
Reference: text
sentIndex sentText sentNum sentScore
1 Discriminative methods for learning structured models have enabled wide-spread use of very rich feature representations. [sent-5, score-0.348]
2 However, the computational cost of feature extraction is prohibitive for large-scale or time-sensitive applications, often dominating the cost of inference in the models. [sent-6, score-0.666]
3 Such feature selection methods control computation statically and miss the opportunity to fine-tune feature extraction to each input at run-time. [sent-8, score-0.778]
4 We address the key challenge of learning to control fine-grained feature extraction adaptively, exploiting nonhomogeneity of the data. [sent-9, score-0.521]
5 We propose an architecture that uses a rich feedback loop between extraction and prediction. [sent-10, score-0.267]
6 The run-time control policy is learned using efficient value-function approximation, which adaptively determines the value of information of features at the level of individual variables for each input. [sent-11, score-0.617]
7 For articulated pose estimation in video, we achieve a more accurate state-of-the-art model that is also faster, with similar results on an OCR task. [sent-13, score-0.254]
8 One source of computational cost is inference in the model, which can be addressed with a variety of approximate inference methods. [sent-15, score-0.238]
9 However, in many applications, computing the scores of the constituent parts of the structured model, i.e., the feature computation, is the dominant cost, often outweighing inference itself. [sent-16, score-0.211]
10 In this work, we show that large gains in the speed/accuracy trade-off can be obtained by departing from the traditional method of “one-size-fits-all” model and feature selection, in which a static set of features is computed for all inputs uniformly. [sent-20, score-0.44]
11 Instead, we employ an adaptive approach: the parts of the structured model are constructed specifically at test-time for each particular instance, for example, at the level of individual video frames. [sent-21, score-0.245]
12 The central component of our method is an approximate value function that utilizes a set of meta-features to estimate future changes in value of information given a predictive model and a proposed feature set as input. [sent-28, score-0.273]
13 We frame the control problem in terms of finding a feature extraction policy that sequentially adds features to the models until a budget limit is reached, and we show how to learn approximate policies that result in accurate structured models that are dramatically more efficient. [sent-31, score-1.712]
14 We first learn a prediction model that is trained to use subsets of features computed sparsely across the structure of the input. [sent-34, score-0.421]
15 These feature combinations factorize over the graph structure, and we allow for sparsely computed features such that different vertices and edges may utilize different features of the input. [sent-35, score-0.744]
16 We then use reinforcement learning to estimate a value function that adaptively computes an approximately optimal set of features given a budget constraint. [sent-36, score-0.642]
17 Finally, we apply our method to two sequential prediction domains: articulated human pose estimation and handwriting recognition. [sent-38, score-0.374]
18 In both domains, we achieve more accurate prediction models that utilize fewer features than the traditional monolithic approach. [sent-39, score-0.394]
19 However, much of this work has focused on the issue of feature extraction for standard classification problems, e.g. [sent-41, score-0.484]
20 through cascades or ensembles of classifiers that use different subsets of features at different stages of processing. [sent-43, score-0.262]
21 More recently, feature computation cost has been incorporated explicitly into the learning procedure itself. [sent-44, score-0.259]
22 And while the goals of these works are similar to ours–explicitly controlling feature computation at test time–none of the classifier cascade literature addresses structured prediction or the batch setting. [sent-49, score-0.57]
23 Most work that addresses learning the accuracy/efficiency trade-off in a structured setting applies primarily to inference, not feature extraction. [sent-50, score-0.348]
24 [23] extend the idea of a classifier cascade to the structured prediction setting, with the objective defined in terms of obtaining accurate inference in models with large state spaces after coarse-to-fine pruning. [sent-53, score-0.482]
25 More similar to this work, [7] incrementally prune the edge space of a parsing model using a meta-feature-based classifier, reducing the total number of features that need to be extracted. [sent-54, score-0.311]
26 However, both of these prior efforts rely entirely on the marginal scores of the predictive model in order to make their pruning decisions, and, unlike our work, do not allow future feature computations to rectify past mistakes. [sent-55, score-0.317]
27 (Figure: processing pipeline with blocks INPUT, EXTRACT FEATURES, POLICY, INFERENCE, EXTRACT META-FEATURES, OUTPUT.) Algorithm 1: Inference for x and budget B. [sent-59, score-0.289]
28 (Left) A high level summary of the processing pipeline: as in standard structured prediction, features are extracted and inference is run to produce an output. [sent-64, score-0.561]
29 However, information may optionally feed back in the form of extracted meta-features that are used by a control policy to determine another set of features to be extracted. [sent-65, score-0.669]
30 Note that we use stochastic subgradient to learn the inference model w first and reinforcement learning to learn the control model β given w. [sent-66, score-0.376]
31 (Right) Detailed algorithm for factorwise inference for an example x given a graph structure G and budget B. [sent-67, score-0.451]
32 The policy repeatedly selects the highest valued action from an action space A that represents extracting features for each constituent part of the graph structure G. [sent-68, score-0.804]
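As a rough illustration of the loop Algorithm 1 describes, the following Python sketch repeatedly scores candidate extraction actions, applies the best one while the budget allows, and re-runs inference. All interfaces (policy_value, extract_block, run_inference) and the exact stopping rule are assumptions for illustration, not the paper's implementation.

import numpy as np

def adaptive_inference(x, budget, costs, policy_value, extract_block, run_inference):
    # costs[a]: cost of extracting feature block a (one block per part/tier of the graph G)
    F = len(costs)
    z = np.zeros(F, dtype=bool)           # which feature blocks have been extracted so far
    cache = {}                            # cache of extracted feature blocks
    spent = 0.0
    y_hat = run_inference(x, z, cache)    # prediction from the minimal (free) feature set
    while True:
        # candidate actions: unextracted blocks that still fit in the remaining budget
        candidates = [a for a in range(F) if not z[a] and spent + costs[a] <= budget]
        if not candidates:
            break
        best = max(candidates, key=lambda a: policy_value(x, z, a))
        if policy_value(x, z, best) <= 0:  # the "stop" action: no further value expected
            break
        cache[best] = extract_block(x, best)
        z[best] = True
        spent += costs[best]
        y_hat = run_inference(x, z, cache)  # refine the prediction with the richer features
    return y_hat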
33 We consider the setting of structured prediction, in which our goal is to learn a hypothesis mapping inputs x ∈ X to outputs y ∈ Y(x), where |x| = L and y is an L-vector of K-valued variables, i.e., y ∈ {1, . . . , K}^L. [sent-70, score-0.215]
34 However, we introduce an additional explicit feature extraction state vector z: h(x, z) = argmax_{y ∈ Y(x)} w^T f(x, y, z).  (1) [sent-77, score-0.556]
35 Above, f(x, y, z) is a sparse vector of D features that takes time c^T z to compute for a non-negative cost vector c and a binary indicator vector z of length |z| = F. [sent-78, score-0.334]
36 Intuitively, z indicates which of F sets of features are extracted when computing f ; z = 1 means every possible feature is extracted, while z = 0 means that only a minimum set of features is extracted. [sent-79, score-0.772]
37 Note that by incorporating z into the feature function, the predictor h can learn to use different linear weights for the same underlying feature value by conditioning the feature on the value of z. [sent-80, score-0.793]
38 We will discuss how to construct such features in more detail in Section 4. [sent-82, score-0.223]
39 At test time, our goal is to make the most accurate predictions possible for an example under a fixed budget B. [sent-84, score-0.394]
40 The state of the MDP s is the tuple (x, z) for an input x and feature extraction state z (for brevity we will simply write s). [sent-95, score-0.628]
41 The start state is s0 = (x, 0), with x ∼ P (X), and z = 0 indicating only a minimal set of features have been extracted. [sent-96, score-0.295]
42 The action space A(s) is {i | z_i = 0} ∪ {0}, where z_i is the i-th element of z; given a state-action pair (s, a), the next state is deterministically s′ = (x, z + e_a), where e_a is the indicator vector with a 1 in the a-th component, or the zero vector if a = 0. [sent-97, score-0.322]
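A small sketch of these dynamics, with None standing in for the stopping action a = 0 and e_a the usual indicator vector (an illustrative encoding, not taken from the paper):

import numpy as np

def action_space(z):
    # A(s) = {i | z_i = 0} ∪ {0}: unextracted feature blocks plus the stop action (encoded as None)
    return [None] + [i for i in range(len(z)) if z[i] == 0]

def transition(x, z, a):
    # deterministic transition s' = (x, z + e_a); stopping leaves z unchanged
    if a is None:
        return x, z.copy()
    z_next = z.copy()
    z_next[a] = 1
    return x, z_next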
43 However, if we ever exceed the budget, then any further decrease does not count; no more reward can be gained. [sent-101, score-0.239]
44 Furthermore, assuming f(x, y, z) can be cached appropriately, it is clear that we pay only the additional computational cost c_a for each action a, so the entire cumulative computational burden of reaching some state s is exactly c^T z for the corresponding z vector. [sent-102, score-0.282]
45 Therefore, the optimal policy π that maximizes expected reward will compute z minimizing (2) while satisfying the budget constraint. [sent-107, score-0.863]
46 In this work, we focus on a straightforward method for learning an approximate policy: a batch version of least-squares policy iteration [11] based on Q-learning [21]. [sent-109, score-0.356]
47 We parametrize the policy using a linear function of meta-features φ computed from the current state s = (x, z): π_β(s) = argmax_a β^T φ(x, z, a). [sent-110, score-0.544]
48 The metafeatures (which we abbreviate as simply φ(s, a) henceforth) need to be rich enough to represent the value of choosing to expand feature a for a given partially-computed example (x, z). [sent-111, score-0.346]
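A sketch of the greedy linear policy π_β(s) = argmax_a β^T φ(s, a); the meta-feature map phi is left abstract, and its signature here is an assumption:

import numpy as np

def policy_action(beta, phi, x, z, candidate_actions):
    # score every candidate action with the linear value model and pick the best
    scores = {a: float(np.dot(beta, phi(x, z, a))) for a in candidate_actions}
    return max(scores, key=scores.get)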
49 To simplify the notation, we will assume that, given the current state s, taking action a deterministically yields state s′. [sent-117, score-0.314]
50 Given a policy π, its value is recursively defined as the immediate expected reward plus the discounted value of the next state: Q^π(s, a) = R(s, a, s′) + γ Q^π(s′, π(s′)).  (5) [sent-118, score-0.839]
51 The goal of Q-learning is to learn Q for the optimal policy π* with maximal Q^π; however, it is clear that we can increase Q by simply stopping early when Q^π(s, a) < 0 (the future reward in this case is simply zero). [sent-119, score-0.59]
52 Given an initial policy π, we execute π for each example (x_j, y_j) to obtain trajectories s_j^0, s_j^1, . . . [sent-129, score-0.539]
53 We perform a simple form of policy iteration as follows. [sent-134, score-0.3]
54 We first initialize β by estimating the expected reward function (this can be estimated by randomly sampling pairs (s, s′), which are more efficient to compute than Q-functions on trajectories). [sent-135, score-0.274]
55 We then compute trajectories under π_β and use these trajectories to compute a β that approximates Q^π. [sent-136, score-0.234]
56 We found that additional iterations of policy iteration did not noticeably change the results. [sent-137, score-0.3]
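A minimal sketch of this batch fitting step under stated assumptions: roll-outs of the current policy are turned into one-step targets r + γ · max(0, max_a' β_old^T φ(s', a')) and β is re-estimated by ridge regression. The clipping at zero reflects the option of stopping early; the regularizer and the exact target construction are assumptions, not the paper's recipe.

import numpy as np

def fit_beta(transitions, phi, beta_old, gamma, reg=1e-3):
    # transitions: list of (x, z, a, r, z_next, next_actions) collected by executing the policy
    X, targets = [], []
    for (x, z, a, r, z_next, next_actions) in transitions:
        X.append(phi(x, z, a))
        q_next = max((np.dot(beta_old, phi(x, z_next, a2)) for a2 in next_actions), default=0.0)
        targets.append(r + gamma * max(0.0, q_next))  # stopping early bounds future value below by 0
    X, targets = np.asarray(X), np.asarray(targets)
    A = X.T @ X + reg * np.eye(X.shape[1])            # ridge-regularized normal equations
    return np.linalg.solve(A, X.T @ targets)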
57 One potential drawback of our approach just described is that we must learn a different policy for every desired budget. [sent-139, score-0.384]
58 A more attractive alternative is to learn a single policy that is tuned to a range of possible budgets. [sent-140, score-0.384]
59 One solution is to set γ = 1 and learn with B = ∞, so that the value Q^π represents the best improvement possible using some optimal budget B′; however, at test time, it may be that B′ is greater than the available budget B and Q^π is an over-estimate. [sent-141, score-0.662]
60 By choosing γ < 1, we can trade off between valuing reward for short-term gain with smaller budgets B < B′ and longer-term gain with the unknown optimal budget B′. [sent-142, score-0.598]
61 In fact, we can further encourage our learned policy to be useful for smaller budgets by adjusting the reward function. [sent-143, score-0.575]
62 Note that two trajectories that start at s_0 and end at s_t will have the same reward, yet one trajectory might maintain a much lower error rate than the other during the process and therefore be more useful for smaller budgets. [sent-144, score-0.317]
63 We therefore add a shaping component to the expected reward in order to favor the more useful trajectory as follows: R_α(s, a, s′) = η(s) − η(s′) − α [η(s′) − η(s)]_+. [sent-145, score-0.341]
64 We use a development set to tune α as well as γ to find the most useful policy when sweeping B across a range of budgets. [sent-149, score-0.3]
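The shaping term above can be written directly; a small sketch with η(s) the error of the current prediction at state s (the error values are assumed to come from an oracle on the training data):

def shaped_reward(err_s, err_s_next, alpha):
    # R_alpha(s, a, s') = eta(s) - eta(s') - alpha * [eta(s') - eta(s)]_+
    gain = err_s - err_s_next
    penalty = alpha * max(err_s_next - err_s, 0.0)  # extra penalty for steps that increase error
    return gain - penalty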
65 In this setting the budget applies to the entire inference process, and it may be useful to spend more of the budget on difficult examples rather than allocate the budget evenly across all examples. [sent-152, score-0.965]
66 The action consists of choosing an example and then choosing an action within that example’s sub-state; our policy searches over the space of all actions for all examples simultaneously. [sent-160, score-0.572]
67 Equation (9) states that there is an inherent ordering of feature extractions, such that we cannot compute the a-th feature set without first computing feature sets 1, . . . , a − 1. [sent-168, score-0.686]
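Under this restriction the admissible actions shrink to "extract the next tier" for each part of the model; a tiny sketch, with levels[i] counting the tiers already extracted at factor i (an illustrative encoding, not the paper's data structure):

def tiered_actions(levels, num_tiers):
    # the only legal new extraction at factor i is tier levels[i] + 1, if any tiers remain
    return [(i, levels[i] + 1) for i in range(len(levels)) if levels[i] < num_tiers]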
68 We compare to two baselines: a simple entropy-based approach and a more complex imitation learning scheme (inspired by [7]) in which we learn a classifier to reproduce a target policy given by an oracle. [sent-174, score-0.546]
69 The entropy-based approach simply computes probabilistic marginals and extracts features for whichever portion of the output space has highest entropy in the predicted distribution. [sent-175, score-0.317]
70 We tune the budget B using a development set to optimize the overall trade-off when the policy is evaluated with multiple budgets. [sent-177, score-0.589]
71 (Footnote 1) While adding features decreases training error on average, even on the training set additional features may lead to increased error for any particular example. [sent-178, score-0.446]
72 Table 1: Trade-off between average elbow and wrist error rate and total runtime achieved by our method on the pose dataset; each row fixes an error rate and determines the amount of time required by each method to achieve that error. [sent-195, score-0.368]
73 In the standard pair-wise graphical model setting (before considering z), we decompose a feature function f (x, y) into unary and pairwise features. [sent-207, score-0.258]
74 The simplest scheme is to use several different feature functions f^1, . . . , f^F. [sent-209, score-0.217]
75 A much more powerful approach is to create a feature vector as the composite of different extracted features for each vertex and edge in the model. [sent-215, score-0.549]
76 In this setting, we set z = [z_u z_e], where |z| = (|V| + |E|)F, and we have f(x, y, z) = Σ_{i∈V} Σ_{a=1}^{F} z_u(a, i) f_u^a(x, y_i) + Σ_{(i,j)∈E} Σ_{a=1}^{F} z_e(a, ij) f_e^a(x, y_i, y_j).  (12) [sent-216, score-0.339]
77 We refer to this latter feature extraction method as factor-level feature extraction, and the former as example-level. [sent-217, score-0.701]
78 (Footnote 2) The restriction (9) also allows us to increase the complexity of the feature function f as follows: when using the a-th extraction, we allow the model to re-weight the features from extractions 1 through a. [sent-220, score-0.535]
79 In other words, we condition the value of the feature on the current set of features that have been computed; since there are only F sets in the restricted setting (and not 2^F), this is a feasible option. [sent-221, score-0.474]
80 That is, the feature vector contains duplicates of features f^1 through f^a for each feature block a, with the entries for unused blocks set to 0. [sent-231, score-0.479]
81 Thus, the model can learn different weights for the same underlying features based on the current level of feature extraction; we found that this was crucial for optimal performance. [sent-232, score-0.558]
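A sketch of this duplication scheme: for extraction level a, the blocks f^1 .. f^a are copied into the a-th slot of a larger vector and all other slots are zero, so each level gets its own weights. The exact layout is an assumption consistent with the text above.

import numpy as np

def level_conditioned_features(blocks, level, dims):
    # blocks[b] holds the already-computed features f^(b+1) for b < level; dims[b] is its length
    F = len(dims)
    out = []
    for a in range(1, F + 1):          # one slot per possible extraction level a
        for b in range(a):             # slot a holds duplicates of f^1 .. f^a
            out.append(blocks[b] if a == level else np.zeros(dims[b]))
    return np.concatenate(out)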
82 We compare to two baselines that involve no learning: forward selection and extracting factorwise features based on the entropy of marginals at each position (“Entropy”). [sent-235, score-0.524]
83 The learned policy results are either greedy (“Greedy” example-level and factor-level) or non-myopic (either our “Q-learning” or the baseline “Imitation”). [sent-236, score-0.382]
84 Note that the example-wise method is far less effective than the factor-wise extraction strategy. [sent-237, score-0.267]
85 Furthermore, Q-learning in particular achieves higher accuracy models at a fraction of the computational cost of using all features, and is more effective than imitation learning. [sent-238, score-0.241]
86 Experiments: Tracking of human pose in video. Setup. [sent-245, score-0.232]
87 For this problem, our goal is to predict the joint locations of human limbs in video clips extracted from Hollywood movies. [sent-246, score-0.218]
88 Our testbed is the MODEC+S model proposed in [22]; the MODEC+S model uses the MODEC model of [15] to generate 32 proposed poses per frame of a video sequence, and then combines the predictions using a linear-chain structured sequential prediction model. [sent-247, score-0.425]
89 There are four types of features used by MODEC+S, the final and most expensive of which is a coarse-to-fine optical flow [13]; we incrementally compute poses and features to minimize the total runtime. [sent-248, score-0.548]
90 Specifically, we concatenate the already computed unary and edge features of y_i and its neighbors (conditioned on the value of z at i), the margin of the current MAP decoding at position i, and a measure of self-consistency computed on y as follows. [sent-254, score-0.505]
91 For all sets of m frames overlapping with frame i, we extract color histograms for the predicted arm segments and compute the maximum χ²-distance from the first frame to any other frame; we then also add an indicator feature for whether each of these maximum distances exceeds a fixed threshold. [sent-255, score-0.582]
92 We also add several bias terms indicating which sets of features have been extracted around position i. [sent-260, score-0.415]
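A sketch of the self-consistency meta-feature just described: a χ² distance between per-frame color histograms of the predicted arm segments, plus an indicator for exceeding a threshold (the histogram extraction and the threshold value are assumptions):

import numpy as np

def chi2_distance(h1, h2, eps=1e-8):
    # chi-squared distance between two (normalized) color histograms
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def self_consistency(histograms, threshold):
    # histograms: color histograms of the predicted arm segments over a window of m frames;
    # return the maximum distance from the first frame and an exceed-threshold indicator
    d_max = max(chi2_distance(histograms[0], h) for h in histograms[1:])
    return d_max, float(d_max > threshold)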
93 While our approach is extremely efficient in terms of how many features are extracted (Left), the additional overhead of inference is prohibitively expensive for the OCR task without applying q-inference (Right) with a large threshold. [sent-279, score-0.518]
94 Furthermore, although the example-wise strategy is less efficient in terms of features extracted, it is more efficient in terms of overhead. [sent-280, score-0.223]
95 Furthermore, while the feature extraction decisions of the Q-learning model are significantly correlated with the error of the starting predictions (ρ = 0. [sent-284, score-0.531]
96 We use three sets of features: the original pixels (free), and two sets of Histogram-of-Gradient (HoG) features computed on the images for different bin sizes. [sent-290, score-0.256]
97 Unlike the pose setting, the features are very fast to compute compared to inference. [sent-291, score-0.381]
98 Our method is extremely efficient in terms of the features computed for h; however, unlike the pose setting, the overhead of inference is on par with the feature computation. [sent-296, score-0.749]
99 Conclusion: We have introduced a framework for learning feature extraction policies and predictive models that adaptively select features for extraction in a factor-wise, on-line fashion. [sent-300, score-1.087]
100 On two tasks our approach yields models that are both more accurate and far more efficient; our work is a significant step towards eliminating the feature extraction bottleneck in structured prediction. [sent-301, score-0.673]
wordName wordTfidf (topN-words)
[('policy', 0.3), ('budget', 0.289), ('extraction', 0.267), ('features', 0.223), ('feature', 0.217), ('reward', 0.206), ('st', 0.172), ('ocr', 0.168), ('imitation', 0.162), ('modec', 0.159), ('structured', 0.131), ('pose', 0.123), ('extracted', 0.109), ('frame', 0.106), ('elbow', 0.103), ('action', 0.102), ('inference', 0.098), ('extractions', 0.095), ('metafeatures', 0.095), ('tier', 0.088), ('overhead', 0.088), ('sapp', 0.084), ('learn', 0.084), ('greedy', 0.082), ('trajectories', 0.082), ('yj', 0.079), ('sj', 0.078), ('wrist', 0.078), ('taskar', 0.075), ('prediction', 0.073), ('articulated', 0.073), ('reinforcement', 0.073), ('state', 0.072), ('budgets', 0.069), ('video', 0.068), ('runtime', 0.064), ('factorwise', 0.064), ('handwriting', 0.064), ('wrists', 0.064), ('ze', 0.064), ('trajectory', 0.063), ('weiss', 0.06), ('entropy', 0.058), ('predictor', 0.058), ('accurate', 0.058), ('classi', 0.057), ('adaptively', 0.057), ('predictive', 0.056), ('batch', 0.056), ('baselines', 0.054), ('parsing', 0.054), ('zu', 0.052), ('cascade', 0.05), ('er', 0.049), ('fe', 0.049), ('discriminative', 0.047), ('predictions', 0.047), ('th', 0.046), ('adaptive', 0.046), ('extract', 0.045), ('scores', 0.044), ('position', 0.044), ('controlling', 0.043), ('decoding', 0.043), ('argmaxa', 0.043), ('cost', 0.042), ('extracting', 0.041), ('margin', 0.041), ('unary', 0.041), ('sparsely', 0.041), ('human', 0.041), ('selection', 0.04), ('utilize', 0.04), ('ea', 0.04), ('yi', 0.04), ('ensembles', 0.039), ('daum', 0.039), ('concatenate', 0.039), ('fu', 0.039), ('add', 0.039), ('ey', 0.038), ('argmin', 0.037), ('accuracy', 0.037), ('emnlp', 0.037), ('control', 0.037), ('output', 0.036), ('constituent', 0.036), ('compute', 0.035), ('choosing', 0.034), ('deterministically', 0.034), ('messages', 0.034), ('incrementally', 0.034), ('current', 0.034), ('indicator', 0.034), ('pixels', 0.033), ('expected', 0.033), ('exceed', 0.033), ('optical', 0.033), ('pay', 0.033), ('ca', 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999994 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
Author: David J. Weiss, Ben Taskar
Abstract: Discriminative methods for learning structured models have enabled wide-spread use of very rich feature representations. However, the computational cost of feature extraction is prohibitive for large-scale or time-sensitive applications, often dominating the cost of inference in the models. Significant efforts have been devoted to sparsity-based model selection to decrease this cost. Such feature selection methods control computation statically and miss the opportunity to finetune feature extraction to each input at run-time. We address the key challenge of learning to control fine-grained feature extraction adaptively, exploiting nonhomogeneity of the data. We propose an architecture that uses a rich feedback loop between extraction and prediction. The run-time control policy is learned using efficient value-function approximation, which adaptively determines the value of information of features at the level of individual variables for each input. We demonstrate significant speedups over state-of-the-art methods on two challenging datasets. For articulated pose estimation in video, we achieve a more accurate state-of-the-art model that is also faster, with similar results on an OCR task. 1
2 0.27452716 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
Author: Paul Wagner
Abstract: Approximate dynamic programming approaches to the reinforcement learning problem are often categorized into greedy value function methods and value-based policy gradient methods. As our first main result, we show that an important subset of the latter methodology is, in fact, a limiting special case of a general formulation of the former methodology; optimistic policy iteration encompasses not only most of the greedy value function methods but also natural actor-critic methods, and permits one to directly interpolate between them. The resulting continuum adjusts the strength of the Markov assumption in policy improvement and, as such, can be seen as dual in spirit to the continuum in TD(λ)-style algorithms in policy evaluation. As our second main result, we show for a substantial subset of softgreedy value function approaches that, while having the potential to avoid policy oscillation and policy chattering, this subset can never converge toward an optimal policy, except in a certain pathological case. Consequently, in the context of approximations (either in state estimation or in value function representation), the majority of greedy value function methods seem to be deemed to suffer either from the risk of oscillation/chattering or from the presence of systematic sub-optimality. 1
3 0.26846793 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods
Author: Matteo Pirotta, Marcello Restelli, Luca Bascetta
Abstract: In the last decade, policy gradient methods have significantly grown in popularity in the reinforcement–learning field. In particular, they have been largely employed in motor control and robotic applications, thanks to their ability to cope with continuous state and action domains and partial observable problems. Policy gradient researches have been mainly focused on the identification of effective gradient directions and the proposal of efficient estimation algorithms. Nonetheless, the performance of policy gradient methods is determined not only by the gradient direction, since convergence properties are strongly influenced by the choice of the step size: small values imply slow convergence rate, while large values may lead to oscillations or even divergence of the policy parameters. Step–size value is usually chosen by hand tuning and still little attention has been paid to its automatic selection. In this paper, we propose to determine the learning rate by maximizing a lower bound to the expected performance gain. Focusing on Gaussian policies, we derive a lower bound that is second–order polynomial of the step size, and we show how a simplified version of such lower bound can be maximized when the gradient is estimated from trajectory samples. The properties of the proposed approach are empirically evaluated in a linear–quadratic regulator problem. 1
4 0.22338587 348 nips-2013-Variational Policy Search via Trajectory Optimization
Author: Sergey Levine, Vladlen Koltun
Abstract: In order to learn effective control policies for dynamical systems, policy search methods must be able to discover successful executions of the desired task. While random exploration can work well in simple domains, complex and highdimensional tasks present a serious challenge, particularly when combined with high-dimensional policies that make parameter-space exploration infeasible. We present a method that uses trajectory optimization as a powerful exploration strategy that guides the policy search. A variational decomposition of a maximum likelihood policy objective allows us to use standard trajectory optimization algorithms such as differential dynamic programming, interleaved with standard supervised learning for the policy itself. We demonstrate that the resulting algorithm can outperform prior methods on two challenging locomotion tasks. 1
5 0.21025352 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion
Author: Nguyen Viet Cuong, Wee Sun Lee, Nan Ye, Kian Ming A. Chai, Hai Leong Chieu
Abstract: We introduce a new objective function for pool-based Bayesian active learning with probabilistic hypotheses. This objective function, called the policy Gibbs error, is the expected error rate of a random classifier drawn from the prior distribution on the examples adaptively selected by the active learning policy. Exact maximization of the policy Gibbs error is hard, so we propose a greedy strategy that maximizes the Gibbs error at each iteration, where the Gibbs error on an instance is the expected error of a random classifier selected from the posterior label distribution on that instance. We apply this maximum Gibbs error criterion to three active learning scenarios: non-adaptive, adaptive, and batch active learning. In each scenario, we prove that the criterion achieves near-maximal policy Gibbs error when constrained to a fixed budget. For practical implementations, we provide approximations to the maximum Gibbs error criterion for Bayesian conditional random fields and transductive Naive Bayes. Our experimental results on a named entity recognition task and a text classification task show that the maximum Gibbs error criterion is an effective active learning criterion for noisy models. 1
6 0.20709695 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
7 0.18667574 241 nips-2013-Optimizing Instructional Policies
8 0.18533495 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs
9 0.18403704 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning
10 0.18184887 55 nips-2013-Bellman Error Based Feature Generation using Random Projections on Sparse Spaces
11 0.1804454 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
12 0.1771071 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search
13 0.16911621 79 nips-2013-DESPOT: Online POMDP Planning with Regularization
14 0.16662909 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
15 0.16491394 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs
16 0.1639421 257 nips-2013-Projected Natural Actor-Critic
17 0.13434143 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
18 0.13429298 38 nips-2013-Approximate Dynamic Programming Finally Performs Well in the Game of Tetris
19 0.13214423 124 nips-2013-Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting
20 0.12922439 347 nips-2013-Variational Planning for Graph-based MDPs
topicId topicWeight
[(0, 0.327), (1, -0.27), (2, -0.229), (3, 0.096), (4, 0.056), (5, -0.004), (6, -0.054), (7, 0.032), (8, -0.039), (9, 0.027), (10, -0.009), (11, 0.009), (12, -0.045), (13, -0.014), (14, -0.039), (15, 0.002), (16, -0.035), (17, 0.006), (18, 0.018), (19, -0.043), (20, -0.054), (21, 0.01), (22, 0.036), (23, -0.038), (24, 0.016), (25, 0.013), (26, 0.047), (27, -0.038), (28, 0.033), (29, -0.053), (30, 0.052), (31, 0.008), (32, -0.003), (33, 0.019), (34, 0.039), (35, 0.053), (36, -0.14), (37, -0.027), (38, 0.025), (39, 0.016), (40, 0.0), (41, -0.033), (42, 0.086), (43, -0.066), (44, -0.004), (45, 0.007), (46, -0.02), (47, 0.007), (48, 0.056), (49, -0.054)]
simIndex simValue paperId paperTitle
same-paper 1 0.9507935 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
Author: David J. Weiss, Ben Taskar
Abstract: Discriminative methods for learning structured models have enabled wide-spread use of very rich feature representations. However, the computational cost of feature extraction is prohibitive for large-scale or time-sensitive applications, often dominating the cost of inference in the models. Significant efforts have been devoted to sparsity-based model selection to decrease this cost. Such feature selection methods control computation statically and miss the opportunity to finetune feature extraction to each input at run-time. We address the key challenge of learning to control fine-grained feature extraction adaptively, exploiting nonhomogeneity of the data. We propose an architecture that uses a rich feedback loop between extraction and prediction. The run-time control policy is learned using efficient value-function approximation, which adaptively determines the value of information of features at the level of individual variables for each input. We demonstrate significant speedups over state-of-the-art methods on two challenging datasets. For articulated pose estimation in video, we achieve a more accurate state-of-the-art model that is also faster, with similar results on an OCR task. 1
2 0.84790355 38 nips-2013-Approximate Dynamic Programming Finally Performs Well in the Game of Tetris
Author: Victor Gabillon, Mohammad Ghavamzadeh, Bruno Scherrer
Abstract: Tetris is a video game that has been widely used as a benchmark for various optimization techniques including approximate dynamic programming (ADP) algorithms. A look at the literature of this game shows that while ADP algorithms that have been (almost) entirely based on approximating the value function (value function based) have performed poorly in Tetris, the methods that search directly in the space of policies by learning the policy parameters using an optimization black box, such as the cross entropy (CE) method, have achieved the best reported results. This makes us conjecture that Tetris is a game in which good policies are easier to represent, and thus, learn than their corresponding value functions. So, in order to obtain a good performance with ADP, we should use ADP algorithms that search in a policy space, instead of the more traditional ones that search in a value function space. In this paper, we put our conjecture to test by applying such an ADP algorithm, called classification-based modified policy iteration (CBMPI), to the game of Tetris. Our experimental results show that for the first time an ADP algorithm, namely CBMPI, obtains the best results reported in the literature for Tetris in both small 10 × 10 and large 10 × 20 boards. Although the CBMPI’s results are similar to those of the CE method in the large board, CBMPI uses considerably fewer (almost 1/6) samples (calls to the generative model) than CE. 1
3 0.80515289 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs
Author: Aswin Raghavan, Roni Khardon, Alan Fern, Prasad Tadepalli
Abstract: This paper addresses the scalability of symbolic planning under uncertainty with factored states and actions. Our first contribution is a symbolic implementation of Modified Policy Iteration (MPI) for factored actions that views policy evaluation as policy-constrained value iteration (VI). Unfortunately, a naïve approach to enforce policy constraints can lead to large memory requirements, sometimes making symbolic MPI worse than VI. We address this through our second and main contribution, symbolic Opportunistic Policy Iteration (OPI), which is a novel convergent algorithm lying between VI and MPI, that applies policy constraints if it does not increase the size of the value function representation, and otherwise performs VI backups. We also give a memory bounded version of this algorithm allowing a space-time tradeoff. Empirical results show significantly improved scalability over state-of-the-art symbolic planners. 1
4 0.78630507 165 nips-2013-Learning from Limited Demonstrations
Author: Beomjoon Kim, Amir massoud Farahmand, Joelle Pineau, Doina Precup
Abstract: We propose a Learning from Demonstration (LfD) algorithm which leverages expert data, even if they are very few or inaccurate. We achieve this by using both expert data, as well as reinforcement signals gathered through trial-and-error interactions with the environment. The key idea of our approach, Approximate Policy Iteration with Demonstration (APID), is that expert’s suggestions are used to define linear constraints which guide the optimization performed by Approximate Policy Iteration. We prove an upper bound on the Bellman error of the estimate computed by APID at each iteration. Moreover, we show empirically that APID outperforms pure Approximate Policy Iteration, a state-of-the-art LfD algorithm, and supervised learning in a variety of scenarios, including when very few and/or suboptimal demonstrations are available. Our experiments include simulations as well as a real robot path-finding task. 1
5 0.74700987 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning
Author: Shane Griffith, Kaushik Subramanian, Jonathan Scholz, Charles Isbell, Andrea L. Thomaz
Abstract: A long term goal of Interactive Reinforcement Learning is to incorporate nonexpert human feedback to solve complex tasks. Some state-of-the-art methods have approached this problem by mapping human information to rewards and values and iterating over them to compute better control policies. In this paper we argue for an alternate, more effective characterization of human feedback: Policy Shaping. We introduce Advise, a Bayesian approach that attempts to maximize the information gained from human feedback by utilizing it as direct policy labels. We compare Advise to state-of-the-art approaches and show that it can outperform them and is robust to infrequent and inconsistent human feedback.
6 0.74048448 79 nips-2013-DESPOT: Online POMDP Planning with Regularization
7 0.73447877 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods
8 0.72450203 241 nips-2013-Optimizing Instructional Policies
9 0.70677608 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
10 0.6909346 257 nips-2013-Projected Natural Actor-Critic
11 0.6826843 55 nips-2013-Bellman Error Based Feature Generation using Random Projections on Sparse Spaces
12 0.66592729 348 nips-2013-Variational Policy Search via Trajectory Optimization
13 0.6585955 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search
14 0.6585626 347 nips-2013-Variational Planning for Graph-based MDPs
15 0.64587396 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs
16 0.6334824 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion
17 0.61947292 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
18 0.60459393 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
19 0.59512603 270 nips-2013-Regret based Robust Solutions for Uncertain Markov Decision Processes
20 0.58745819 248 nips-2013-Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs
topicId topicWeight
[(2, 0.015), (16, 0.044), (33, 0.139), (34, 0.158), (36, 0.012), (41, 0.028), (49, 0.043), (56, 0.144), (63, 0.147), (70, 0.054), (85, 0.035), (89, 0.028), (93, 0.082), (95, 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.90269393 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
Author: David J. Weiss, Ben Taskar
Abstract: Discriminative methods for learning structured models have enabled wide-spread use of very rich feature representations. However, the computational cost of feature extraction is prohibitive for large-scale or time-sensitive applications, often dominating the cost of inference in the models. Significant efforts have been devoted to sparsity-based model selection to decrease this cost. Such feature selection methods control computation statically and miss the opportunity to finetune feature extraction to each input at run-time. We address the key challenge of learning to control fine-grained feature extraction adaptively, exploiting nonhomogeneity of the data. We propose an architecture that uses a rich feedback loop between extraction and prediction. The run-time control policy is learned using efficient value-function approximation, which adaptively determines the value of information of features at the level of individual variables for each input. We demonstrate significant speedups over state-of-the-art methods on two challenging datasets. For articulated pose estimation in video, we achieve a more accurate state-of-the-art model that is also faster, with similar results on an OCR task. 1
2 0.86528075 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
Author: Xiaoxiao Guo, Satinder Singh, Richard L. Lewis
Abstract: We consider how to transfer knowledge from previous tasks (MDPs) to a current task in long-lived and bounded agents that must solve a sequence of tasks over a finite lifetime. A novel aspect of our transfer approach is that we reuse reward functions. While this may seem counterintuitive, we build on the insight of recent work on the optimal rewards problem that guiding an agent’s behavior with reward functions other than the task-specifying reward function can help overcome computational bounds of the agent. Specifically, we use good guidance reward functions learned on previous tasks in the sequence to incrementally train a reward mapping function that maps task-specifying reward functions into good initial guidance reward functions for subsequent tasks. We demonstrate that our approach can substantially improve the agent’s performance relative to other approaches, including an approach that transfers policies. 1
3 0.86185324 5 nips-2013-A Deep Architecture for Matching Short Texts
Author: Zhengdong Lu, Hang Li
Abstract: Many machine learning problems can be interpreted as learning for matching two types of objects (e.g., images and captions, users and products, queries and documents, etc.). The matching level of two objects is usually measured as the inner product in a certain feature space, while the modeling effort focuses on mapping of objects from the original space to the feature space. This schema, although proven successful on a range of matching tasks, is insufficient for capturing the rich structure in the matching process of more complicated objects. In this paper, we propose a new deep architecture to more effectively model the complicated matching relations between two objects from heterogeneous domains. More specifically, we apply this model to matching tasks in natural language, e.g., finding sensible responses for a tweet, or relevant answers to a given question. This new architecture naturally combines the localness and hierarchy intrinsic to the natural language problems, and therefore greatly improves upon the state-of-the-art models. 1
4 0.85306549 249 nips-2013-Polar Operators for Structured Sparse Estimation
Author: Xinhua Zhang, Yao-Liang Yu, Dale Schuurmans
Abstract: Structured sparse estimation has become an important technique in many areas of data analysis. Unfortunately, these estimators normally create computational difficulties that entail sophisticated algorithms. Our first contribution is to uncover a rich class of structured sparse regularizers whose polar operator can be evaluated efficiently. With such an operator, a simple conditional gradient method can then be developed that, when combined with smoothing and local optimization, significantly reduces training time vs. the state of the art. We also demonstrate a new reduction of polar to proximal maps that enables more efficient latent fused lasso. 1
5 0.85305011 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning
Author: Xinghao Pan, Joseph E. Gonzalez, Stefanie Jegelka, Tamara Broderick, Michael Jordan
Abstract: Research on distributed machine learning algorithms has focused primarily on one of two extremes—algorithms that obey strict concurrency constraints or algorithms that obey few or no such constraints. We consider an intermediate alternative in which algorithms optimistically assume that conflicts are unlikely and if conflicts do arise a conflict-resolution protocol is invoked. We view this “optimistic concurrency control” paradigm as particularly appropriate for large-scale machine learning algorithms, particularly in the unsupervised setting. We demonstrate our approach in three problem areas: clustering, feature learning and online facility location. We evaluate our methods via large-scale experiments in a cluster computing environment. 1
6 0.85203075 99 nips-2013-Dropout Training as Adaptive Regularization
7 0.84962898 69 nips-2013-Context-sensitive active sensing in humans
8 0.84808296 348 nips-2013-Variational Policy Search via Trajectory Optimization
9 0.84728974 201 nips-2013-Multi-Task Bayesian Optimization
10 0.84589511 318 nips-2013-Structured Learning via Logistic Regression
11 0.84571135 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies
12 0.84547061 97 nips-2013-Distributed Submodular Maximization: Identifying Representative Elements in Massive Data
13 0.84526449 79 nips-2013-DESPOT: Online POMDP Planning with Regularization
14 0.84460115 251 nips-2013-Predicting Parameters in Deep Learning
15 0.84431535 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
16 0.8442921 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
17 0.84383649 158 nips-2013-Learning Multiple Models via Regularized Weighting
18 0.84375358 101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models
19 0.84312522 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
20 0.84272045 135 nips-2013-Heterogeneous-Neighborhood-based Multi-Task Local Learning Algorithms