nips nips2010 nips2010-208 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Emanuel Todorov
Abstract: We present policy gradient results within the framework of linearly-solvable MDPs. For the first time, compatible function approximators and natural policy gradients are obtained by estimating the cost-to-go function, rather than the (much larger) state-action advantage function as is necessary in traditional MDPs. We also develop the first compatible function approximators and natural policy gradients for continuous-time stochastic systems.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: We present policy gradient results within the framework of linearly-solvable MDPs. [sent-3, score-0.772]
2 For the first time, compatible function approximators and natural policy gradients are obtained by estimating the cost-to-go function, rather than the (much larger) state-action advantage function as is necessary in traditional MDPs. [sent-4, score-1.026]
3 We also develop the first compatible function approximators and natural policy gradients for continuous-time stochastic systems. [sent-5, score-0.966]
4 In particular it has been shown that one can replace the true advantage function with a compatible function approximator without affecting the gradient [8, 14], and that a natural policy gradient (with respect to Fisher information) can be computed [2, 5, 11]. [sent-7, score-1.363]
5 The goal of this paper is to apply policy gradient ideas to the linearly-solvable MDPs (or LMDPs) we have recently developed [15, 16], as well as to a class of continuous stochastic systems with similar properties [4, 7, 16]. [sent-8, score-0.854]
6 This framework has already produced a number of unique results – such as linear Bellman equations, general estimation-control dualities, compositionality of optimal control laws, path-integral methods for optimal control, etc. [sent-9, score-0.166]
7 The present results with regard to policy gradients are also unique, as summarized in the Abstract. [sent-10, score-0.615]
8 We will then develop the new results regarding policy gradients. [sent-14, score-0.571]
9 1 Background on LMDPs An LMDP is defined by a state cost q(x) over a (discrete for now) state space X, and a transition probability density p(x'|x) corresponding to the notion of passive dynamics. [sent-16, score-0.188]
10 The cost function is ℓ(x, u(·|x)) = q(x) + KL(u(·|x) || p(·|x)) (1) Thus the controller is free to modify the default/passive dynamics in any way it wishes, but incurs a control cost related to the amount of modification. [sent-21, score-0.172]
11 The average cost c and differential cost-to-go v(x) for a given u(x'|x) satisfy the Bellman equation c + v(x) = q(x) + Σ_{x'} u(x'|x) [ log(u(x'|x)/p(x'|x)) + v(x') ] (2) where v(x) is defined up to a constant. [sent-22, score-0.164]
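To make (1) and (2) concrete, the following sketch (not taken from the paper; the passive dynamics, state costs, and policy are randomly generated assumptions) evaluates a fixed policy on a tiny discrete LMDP by solving the Bellman equation (2) as a linear system for the average cost c and the differential cost-to-go v, pinning v at one state to fix the free constant.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5                                   # number of states (assumed small random LMDP)

# Passive dynamics p(x'|x) and state cost q(x), randomly generated for illustration.
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)
q = rng.random(n)

# A fixed controlled transition density u(x'|x) with the same support as p.
U = P * (0.5 + rng.random((n, n)))
U /= U.sum(axis=1, keepdims=True)

# Immediate cost (1): ell(x) = q(x) + KL(u(.|x) || p(.|x)).
ell = q + np.sum(U * np.log(U / P), axis=1)

# Bellman equation (2): c + v(x) = ell(x) + sum_x' u(x'|x) v(x').
# v is defined up to a constant, so pin v[0] = 0 and solve the square linear system.
A = np.zeros((n + 1, n + 1))
A[:n, :n] = np.eye(n) - U               # (I - U) v
A[:n, n] = 1.0                          # + c
A[n, 0] = 1.0                           # constraint row: v[0] = 0
b = np.concatenate([ell, [0.0]])
sol = np.linalg.solve(A, b)
v, c = sol[:n], sol[n]
print("average cost c:", c)
print("differential cost-to-go v:", v)
```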
12 2 Policy gradient for a general parameterization Consider a parameterization u(x'|x; w) which is valid in the sense that it satisfies the above conditions, and such that ∇_w u exists for all w ∈ R^n. [sent-25, score-0.41]
13 Let μ(x; w) be the corresponding stationary density. [sent-26, score-0.108]
14 We will also need the pair-wise density μ(x, x'; w) = μ(x; w) u(x'|x; w). [sent-27, score-0.075]
15 On the other hand, while in traditional MDPs one has to estimate the state-action value function (or rather the advantage function) in order to compute the policy gradient, it will turn out that in LMDPs it is sufficient to estimate the cost-to-go v. [sent-35, score-0.677]
16 3 A suitable policy parameterization The relation (4) between the optimal policy u∗ and the optimal cost-to-go v∗ suggests parameterizing u as a p-weighted Gibbs distribution. [sent-37, score-1.288]
17 Since linear function approximators have proven very successful, we will use an energy function (for the Gibbs distribution) which is linear in w: u(x'|x; w) ≜ p(x'|x) exp(−w^T f(x')) / Σ_{x''} p(x''|x) exp(−w^T f(x'')) (7) Here f(x) ∈ R^n is a vector of features. [sent-38, score-0.17]
18 The LMDP policy gradient for parameterization (7) is ∇_w c = Σ_{x,x'} μ(x, x'; w) (Π[f](x) − f(x')) (v(x') − w^T f(x')) (9) As expected from (4), we see that the energy function w^T f(x) and the cost-to-go v(x) are related. [sent-42, score-0.869]
19 Indeed if they are equal the gradient vanishes (the converse is not true). [sent-43, score-0.279]
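The parameterization (7) and the gradient expression (9) can be checked numerically. The sketch below is an illustration under assumed random passive dynamics and features, with Π[f](x) taken to be the expectation of the features under u(·|x; w), which is an assumption about the notation; it evaluates (9) exactly from the stationary and pairwise densities and compares it with finite differences of the average cost.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 3                                                  # states, features (assumed sizes)
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)    # passive dynamics p(x'|x)
q = rng.random(n)                                            # state cost q(x)
F = rng.standard_normal((n, d))                              # rows are feature vectors f(x)

def evaluate(w):
    """Average cost c, differential cost-to-go v, stationary density mu, policy U for (7)."""
    G = P * np.exp(-F @ w)                                   # p(x'|x) exp(-w^T f(x'))
    U = G / G.sum(axis=1, keepdims=True)
    ell = q + np.sum(U * np.log(U / P), axis=1)              # cost (1)
    A = np.zeros((n + 1, n + 1))                             # Bellman equation (2) with v[0] = 0
    A[:n, :n] = np.eye(n) - U
    A[:n, n] = 1.0
    A[n, 0] = 1.0
    sol = np.linalg.solve(A, np.concatenate([ell, [0.0]]))
    evals, evecs = np.linalg.eig(U.T)                        # stationary density mu(x; w)
    mu = np.abs(np.real(evecs[:, np.argmax(np.real(evals))]))
    return sol[n], sol[:n], mu / mu.sum(), U

w = rng.standard_normal(d)
c, v, mu, U = evaluate(w)

# Gradient (9): sum over (x, x') of mu(x, x') (Pi[f](x) - f(x')) (v(x') - w^T f(x')).
Pif = U @ F                                                  # Pi[f](x): expected next-state features
grad = np.zeros(d)
for x in range(n):
    for xp in range(n):
        grad += mu[x] * U[x, xp] * (Pif[x] - F[xp]) * (v[xp] - w @ F[xp])

# Check against central finite differences of the average cost.
eps, fd = 1e-5, np.zeros(d)
for i in range(d):
    e = np.zeros(d); e[i] = eps
    fd[i] = (evaluate(w + e)[0] - evaluate(w - e)[0]) / (2 * eps)
print("gradient from (9):", grad)
print("finite differences:", fd)
```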
20 4 Compatible cost-to-go function approximation One of the more remarkable aspects of policy gradient results [8, 14] in traditional MDPs is that, when the true Q-function is replaced with a compatible approximation satisfying certain conditions, the gradient remains unchanged. [sent-45, score-1.306]
21 Key to obtaining such results is making sure that the approximation error is orthogonal to the remaining terms in the expression for the policy gradient. [sent-46, score-0.648]
22 Our goal in this section is to construct a compatible function approximator for LMDPs. [sent-47, score-0.336]
23 Given the form of (9), it makes sense to approximate v(x) as a linear combination of the same features f(x) used to represent the energy function: v̂(x; r) ≜ r^T f(x). [sent-49, score-0.098]
24 However the baseline vanishes after the simplification, and the result is again (11). [sent-52, score-0.078]
25 The approximation error v − v̂ is now orthogonal to the features f; thus for this least-squares choice of r the second term in (11) vanishes, but the first term does not. [sent-58, score-0.125]
26 The first term in (15) involves the projection of v on the auxiliary features g. [sent-67, score-0.14]
27 This projection can be computed by defining the auxiliary function approximator ṽ(x; s) ≜ s^T g(x) and fitting it to v in a least-squares sense, as in (12) but using g(x) rather than f(x). [sent-68, score-0.262]
28 The approximation error is now orthogonal to the auxiliary features g(x), and so replacing v(x) with ṽ(x; s) in (15) does not affect the gradient. [sent-69, score-0.231]
29 The following procedure yields the exact LMDP policy gradient: 1. [sent-71, score-0.595]
30 The requirement that v − ṽ be orthogonal to g is somewhat restrictive; however an equivalent requirement arises in traditional MDPs [14]. [sent-77, score-0.145]
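The least-squares fits underlying this construction can be sketched as follows (illustrative only; the auxiliary features g below are set to Π[f] − f as a plausible discrete analogue of the continuous-time choice L[f] used later, not the paper's exact definition): the residual of each μ-weighted fit is orthogonal to its own features, which is the property the compatibility argument relies on.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 6, 3
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)    # passive dynamics
q = rng.random(n)                                            # state cost
F = rng.standard_normal((n, d))                              # primary features f(x)
w = rng.standard_normal(d)

# Gibbs policy (7) with exact cost-to-go v and stationary density mu (as in the earlier sketches).
G = P * np.exp(-F @ w); U = G / G.sum(axis=1, keepdims=True)
ell = q + np.sum(U * np.log(U / P), axis=1)
A = np.zeros((n + 1, n + 1)); A[:n, :n] = np.eye(n) - U; A[:n, n] = 1.0; A[n, 0] = 1.0
v = np.linalg.solve(A, np.concatenate([ell, [0.0]]))[:n]
evals, evecs = np.linalg.eig(U.T)
mu = np.abs(np.real(evecs[:, np.argmax(np.real(evals))])); mu /= mu.sum()

def weighted_fit(Phi, target, weights):
    """mu-weighted least squares: minimize sum_x weights[x] (target[x] - Phi[x] r)^2."""
    W = np.diag(weights)
    return np.linalg.solve(Phi.T @ W @ Phi, Phi.T @ W @ target)

# Primary fit v_hat(x; r) = r^T f(x): its residual is mu-orthogonal to the features f.
r = weighted_fit(F, v, mu)
print("residual projected on f (should be ~0):", F.T @ (mu * (v - F @ r)))

# Auxiliary fit v_tilde(x; s) = s^T g(x), with g = Pi[f] - f used here as a stand-in choice.
Gfeat = U @ F - F
s = weighted_fit(Gfeat, v, mu)
print("residual projected on g (should be ~0):", Gfeat.T @ (mu * (v - Gfeat @ s)))
```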
31 5 Natural policy gradient When the parameter space has a natural metric G(w), optimization algorithms tend to work better if the gradient of the objective function is pre-multiplied by G(w)^{−1}. [sent-79, score-1.027]
32 In the context of policy gradient methods [5, 11] where w parameterizes a probability density, the natural metric is given by Fisher information (which depends on x because w parameterizes the conditional density). [sent-81, score-0.922]
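A sketch of a natural gradient step for the parameterization (7): the score of the conditional transition density is ∇_w log u(x'|x; w) = Π[f](x) − f(x'), its outer product averaged under the stationary and conditional densities gives a Fisher information matrix, and the regular gradient is pre-multiplied by its inverse. The exact construction in the paper may differ in details; this is an illustration under the same assumed random LMDP as above.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 6, 3
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)    # passive dynamics
q = rng.random(n)                                            # state cost
F = rng.standard_normal((n, d))                              # features f(x)
w = rng.standard_normal(d)                                   # current policy parameters

# Gibbs policy (7) and its exact evaluation (same construction as in the earlier sketches).
G = P * np.exp(-F @ w); U = G / G.sum(axis=1, keepdims=True)
ell = q + np.sum(U * np.log(U / P), axis=1)
A = np.zeros((n + 1, n + 1)); A[:n, :n] = np.eye(n) - U; A[:n, n] = 1.0; A[n, 0] = 1.0
v = np.linalg.solve(A, np.concatenate([ell, [0.0]]))[:n]
evals, evecs = np.linalg.eig(U.T)
mu = np.abs(np.real(evecs[:, np.argmax(np.real(evals))])); mu /= mu.sum()

Pif = U @ F                                      # Pi[f](x)
grad = np.zeros(d)                               # regular policy gradient, equation (9)
fisher = np.zeros((d, d))                        # Fisher information averaged under mu
for x in range(n):
    for xp in range(n):
        score = Pif[x] - F[xp]                   # grad_w log u(x'|x; w) for parameterization (7)
        grad += mu[x] * U[x, xp] * score * (v[xp] - w @ F[xp])
        fisher += mu[x] * U[x, xp] * np.outer(score, score)

natural_grad = np.linalg.solve(fisher, grad)     # pre-multiply the gradient by G(w)^{-1}
print("regular gradient:", grad)
print("natural gradient:", natural_grad)
```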
33 Recall that in traditional MDPs the policy is parameterized using features over the state-action space while in LMDPs we only need features over the state space. [sent-84, score-0.799]
34 Another difference is that in LMDPs the (regular as well as natural) policy gradient vanishes when w = r, which is a sensible fixed-point condition. [sent-86, score-0.877]
35 In traditional MDPs the policy gradient vanishes when r = 0, which is peculiar because it corresponds to the advantage function approximation being identically 0. [sent-87, score-0.991]
36 The true advantage function is of course different, but if the policy becomes deterministic and only one action is sampled per state, the resulting data can be fit with r = 0. [sent-88, score-0.608]
37 Thus any deterministic policy is a local maximum in traditional MDPs. [sent-89, score-0.689]
38 At these local maxima the policy gradient theorem cannot actually be applied because it requires a stochastic policy. [sent-90, score-0.883]
39 When the policy becomes near-deterministic, the number of samples needed to obtain accurate estimates increases because of the lack of exploration [6]. [sent-91, score-0.597]
40 6 A Gauss-Newton method for approximating the optimal cost-to-go Instead of using policy gradient, we can solve (3) for the optimal v∗ directly. [sent-94, score-0.641]
41 One option is approximate policy iteration – which in our context takes on a simple form. [sent-95, score-0.621]
42 Given the policy parameters w^(k) at iteration k, approximate the cost-to-go function and obtain the feature weights r^(k), and then set w^(k+1) = r^(k). [sent-96, score-0.6]
43 This is equivalent to the above natural gradient method with step size 1, using a biased approximator instead of the compatible approximator given by Theorem 3. [sent-97, score-0.753]
44 The other option is approximate value iteration – which is a fixed-point method for solving (3) while replacing v∗(x) with w^T f(x). [sent-98, score-0.086]
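The approximate policy iteration variant can be sketched as follows (illustrative; the cost-to-go is computed exactly here rather than estimated from samples, and the LMDP is the same assumed random construction as above): at each iteration the cost-to-go of the current policy is fitted with the features f in a μ-weighted least-squares sense to obtain r^(k), and the parameters are set to w^(k+1) = r^(k).

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 10, 4                                                 # assumed problem and feature sizes
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)    # passive dynamics
q = rng.random(n)                                            # state cost
F = rng.standard_normal((n, d))                              # features f(x)

def evaluate(w):
    """Exact average cost c, cost-to-go v, and stationary density mu for the Gibbs policy (7)."""
    G = P * np.exp(-F @ w); U = G / G.sum(axis=1, keepdims=True)
    ell = q + np.sum(U * np.log(U / P), axis=1)
    A = np.zeros((n + 1, n + 1)); A[:n, :n] = np.eye(n) - U; A[:n, n] = 1.0; A[n, 0] = 1.0
    sol = np.linalg.solve(A, np.concatenate([ell, [0.0]]))
    evals, evecs = np.linalg.eig(U.T)
    mu = np.abs(np.real(evecs[:, np.argmax(np.real(evals))])); mu /= mu.sum()
    return sol[n], sol[:n], mu

w = np.zeros(d)
for k in range(10):
    c, v, mu = evaluate(w)
    W = np.diag(mu)                                          # fit v_hat(x; r) = r^T f(x) under mu
    r = np.linalg.solve(F.T @ W @ F, F.T @ W @ v)
    print(f"iteration {k}: average cost {c:.4f}")
    w = r                                                    # approximate policy iteration: w <- r
```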
45 The sampling versions use 400 samples per evaluation: 20 trajectories with 20 steps each, starting from the stationary distribution. [sent-104, score-0.161]
46 The sampling version of the Gauss-Newton method worked well with 400 samples per evaluation; the natural gradient needed around 2500 samples. [sent-109, score-0.313]
47 Normally the density μ(x) would be fixed; however we have found empirically that the resulting algorithm yields better policies if we set μ(x) to the policy-specific stationary density μ(x; w) at each iteration. [sent-110, score-0.282]
48 It is notable that minimization of (23) is closely related to policy evaluation via Bellman residual minimization. [sent-114, score-0.571]
49 Note that the Gauss-Newton method proposed here would be expected to have second-order convergence, even though the amount of computation/sampling per iteration is the same as in a policy gradient method. [sent-116, score-0.801]
50 7 Numerical experiments We compared the natural policy gradient and the Gauss-Newton method, both in exact form and with sampling, on two classes of LMDPs: randomly generated, and a discretization of a continuous "metronome" problem taken from [17]. [sent-118, score-0.906]
51 Fitting the auxiliary approximator ṽ(x; s) was done using the LSTD(λ) algorithm [3]. [sent-119, score-0.237]
52 The natural gradient was used with the BFGS minimizer "minFunc" [12]. [sent-122, score-0.255]
53 Figure 1B compares the optimal cost-to-go v∗, the least-squares fit to the known v∗ using our features (which were a 5-by-5 grid of Gaussians), and the solution of the policy gradient method initialized with w = 0. [sent-127, score-0.85]
54 1 Policy gradient for general controlled diffusions Consider the controlled Ito diffusion dx = b(x, u) dt + σ(x) dω (26) where ω(t) is a standard multidimensional Brownian motion process, and u is now a traditional control vector. [sent-133, score-0.574]
55 As before we focus on infinite-horizon average-cost optimal control problems. [sent-135, score-0.11]
56 Let L be the infinitesimal generator of an Ito diffusion which has a stationary density μ, and let h be a twice-differentiable function. [sent-140, score-0.331]
57 Then ∫ μ(x) L[h](x) dx = 0 (29) Proof: The adjoint L∗ of the infinitesimal generator L is known to be the Fokker-Planck operator – which computes the time-evolution of a density under the diffusion [10]. [sent-141, score-0.298]
58 Since μ is the stationary density, L∗[μ](x) = 0 for all x, and so ⟨L∗[μ], h⟩ = 0. [sent-142, score-0.108]
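A quick numerical check of this lemma on a hypothetical one-dimensional Ornstein-Uhlenbeck diffusion (not an example from the paper): dx = −k x dt + σ dω has stationary density N(0, σ²/(2k)) and generator L[h](x) = −k x h'(x) + (σ²/2) h''(x), and the μ-weighted integral of L[h] vanishes for a smooth test function h.

```python
import numpy as np

k, sigma = 1.5, 0.7                        # assumed feedback gain and noise magnitude
var = sigma**2 / (2 * k)                   # stationary variance of the OU process

xs = np.linspace(-8 * np.sqrt(var), 8 * np.sqrt(var), 40001)
mu = np.exp(-xs**2 / (2 * var)) / np.sqrt(2 * np.pi * var)   # stationary density

def L_of_h(x):
    """Generator applied to the test function h(x) = sin(x) + 0.3 x^2."""
    h1 = np.cos(x) + 0.6 * x               # h'(x)
    h2 = -np.sin(x) + 0.6                  # h''(x)
    return (-k * x) * h1 + 0.5 * sigma**2 * h2

dx = xs[1] - xs[0]
integral = np.sum(mu * L_of_h(xs)) * dx    # equation (29): should be ~0
print("integral of mu(x) L[h](x) dx:", integral)
```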
59 For example the density of Brownian motion initialized at the origin is a zero-mean Gaussian whose covariance grows linearly with time – thus there is no stationary density. [sent-147, score-0.183]
60 If however the diffusion is controlled and the policy tends to keep the state within some region, then a stationary density would normally exist. [sent-148, score-0.909]
61 The existence of a stationary density may actually be a sensible definition of stability for stochastic systems (although this point will not be pursued in the present paper). [sent-149, score-0.28]
62 Now consider any policy parameterization u = π(x; w) such that (for the current value of w) the diffusion (26) has a stationary density μ and ∇_w π exists. [sent-150, score-0.906]
63 The policy gradient of the controlled diffusion (26) is ∇_w c = ∫ μ(x) ( ∇_w ℓ(x, π(x; w)) + ∇_w b(x, π(x; w))^T ∇_x v(x) ) dx (31) Here L[∇_w v] is meant component-wise. [sent-152, score-0.88]
64 This is essential for a policy gradient procedure which seeks to avoid finite differencing; indeed ∇_w v could not be estimated while sampling from a single policy. [sent-154, score-0.804]
65 Unlike most other results in stochastic optimal control, equation (31) does not involve the Hessian ∇_xx v, although we can obtain a ∇_xx v-dependent term here if we allow σ to depend on u. [sent-155, score-0.16]
66 Let the parameterized policy be the linear state feedback u = −k x with gain k > 0. [sent-159, score-0.607]
67 Substituting in the HJB equation and matching powers of x yields the quadratic coefficient of the cost-to-go and the average cost as functions of k, and so the policy gradient ∇_k c can be computed directly. [sent-161, score-0.83]
68 The stationary density μ_k(x) is a zero-mean Gaussian with variance σ²/(2k). [sent-162, score-0.183]
69 One can now verify that the gradient given by Theorem 5 is identical to the ∇_k c computed above. [sent-163, score-0.201]
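A worked check of Theorem 5 on a scalar example of this type (the particular cost ℓ(x, u) = x²/2 + u²/2 and additive noise dx = u dt + σ dω below are assumptions made for the illustration and need not match the constants in the paper): for the linear feedback u = −k x the differential cost-to-go and the average cost are available in closed form, and the gradient from equation (31) agrees with the derivative of the average cost.

```python
sigma = 0.8                                  # assumed noise: dx = u dt + sigma dw
k = 1.3                                      # feedback gain, policy u = -k x

# Closed-loop quantities for the assumed cost ell(x, u) = x^2/2 + u^2/2:
#   v(x) = a x^2 with a = (1 + k^2)/(4 k),  average cost c(k) = a sigma^2,
#   stationary density N(0, sigma^2/(2 k)), so E[x^2] = sigma^2/(2 k).
a = (1 + k**2) / (4 * k)
second_moment = sigma**2 / (2 * k)

def average_cost(kk):
    return sigma**2 * (1 + kk**2) / (4 * kk)

# Equation (31): integrate mu(x) [ d/dk ell(x, -k x) + (d/dk b(x, -k x)) v_x(x) ] dx.
# Here d/dk ell = k x^2, d/dk b = -x, v_x = 2 a x, so the integrand is (k - 2 a) x^2 mu(x).
grad_from_31 = (k - 2 * a) * second_moment

# Direct derivative of the average cost, via central finite differences.
eps = 1e-6
grad_direct = (average_cost(k + eps) - average_cost(k - eps)) / (2 * eps)

print("gradient from (31):        ", grad_from_31)
print("derivative of average cost:", grad_direct)
```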
70 Another interesting aspect of Theorem 5 is that it is a natural generalization of classic results from finite-horizon deterministic optimal control [13], even though it cannot be derived from those results. [sent-164, score-0.201]
71 Suppose we have an open-loop control trajectory u(t), 0 ≤ t ≤ T; the resulting state trajectory (starting from a given x_0) is x(t), and the corresponding co-state trajectory (obtained by integrating Pontryagin's ODE backwards in time) is λ(t). [sent-165, score-0.19]
72 It is known that the gradient of the total cost J w.r.t. the control is ∇_{u(t)} J = ∇_u ℓ(x(t), u(t)) + ∇_u b(x(t), u(t))^T λ(t). [sent-166, score-0.237]
73 Then ∇_w J = ∫ ∇_w u(t)^T ∇_{u(t)} J dt = ∫ ( ∇_w ℓ(x(t), u(t)) + ∇_w b(x(t), u(t))^T λ(t) ) dt (32) The co-state λ(t) is known to be equal to the gradient ∇_x v(x(t), t) of the cost-to-go function for the (closed-loop) deterministic problem. [sent-171, score-0.238]
74 Of course in finite-horizon settings there is no stationary density, and instead the integral in (32) is over the trajectory. [sent-173, score-0.108]
75 Theorem 5 suggests a simple procedure for estimating the policy gradient via sampling: fit a function approximator v̂ to v, and use ∇_x v̂ in (31). [sent-175, score-0.934]
76 Alternatively, a compatible approximation scheme can be obtained by fitting ∇_x v̂ to ∇_x v in a least-squares sense, using a linear approximator with features ∇_w b(x). [sent-176, score-0.439]
77 Ideally we would construct a compatible approximation scheme which involves fitting v rather than ∇_x v. [sent-178, score-0.259]
78 2 Natural gradient and compatible approximation for linearly-solvable diffusions We now focus on a more restricted family of stochastic optimal control problems which arise in many situations (e.g. [sent-181, score-0.647]
79 The optimal control law u∗ and the optimal differential cost-to-go v∗(x) are known to be related as u∗ = −R^{−1} B(x)^T ∇_x v∗. [sent-184, score-0.217]
80 As in the discrete case we use this relation to motivate the choice of policy parameterization and cost-to-go function approximator. [sent-185, score-0.676]
81 Setting aside the question of compatibility (i.e. whether the approximate gradient coincides with the exact one), we seek a natural gradient version of (35). [sent-189, score-0.255]
82 To this end we need to interpret the metric appearing in (35) as Fisher information for the (infinitesimal) transition probability density of our parameterized diffusion. [sent-190, score-0.111]
83 More precisely, the exponentiated HJB equation for the optimal v∗ in problem (33, 38) is linear in exp(−v∗). [sent-194, score-0.135]
84 We have also shown [16] that the continuous problem (33, 38) is the limit (as the discretization time step goes to 0) of continuous-state discrete-time LMDPs constructed via Euler discretization as above. [sent-195, score-0.08]
85 The compatible function approximation scheme from Theorem 3 can then be applied to these LMDPs. [sent-196, score-0.234]
86 Thus the auxiliary function approximator is now ṽ(x; s) ≜ s^T L[f](x), and we have Theorem 6. [sent-199, score-0.237]
87 The following procedure yields the exact policy gradient for problem (33, 38): 1. [sent-200, score-0.796]
88 It is very similar to the corresponding results in the discrete case (Theorems 3,4) except it involves the differential operator L rather than the integral operator Π. [sent-205, score-0.21]
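To illustrate what the auxiliary features look like in the continuous case, the sketch below (an assumed scalar diffusion with Gaussian bump features, not the paper's setup) evaluates g_i(x) = L[f_i](x) = b(x) f_i'(x) + (σ²/2) f_i''(x) in closed form.

```python
import numpy as np

sigma = 0.5                                    # assumed noise magnitude
def b(x):                                      # assumed drift of the closed-loop diffusion
    return -1.2 * x

centers = np.linspace(-2.0, 2.0, 5)            # centers of the Gaussian bump features f_i(x)
width = 0.6

def features(x):
    z = (x[:, None] - centers[None, :]) / width
    return np.exp(-0.5 * z**2)

def generator_features(x):
    """Auxiliary features g_i(x) = L[f_i](x) = b(x) f_i'(x) + (sigma^2/2) f_i''(x)."""
    z = (x[:, None] - centers[None, :]) / width
    f = np.exp(-0.5 * z**2)
    f1 = -z / width * f                        # first derivative of a Gaussian bump
    f2 = (z**2 - 1) / width**2 * f             # second derivative
    return b(x)[:, None] * f1 + 0.5 * sigma**2 * f2

xs = np.linspace(-3.0, 3.0, 7)
print("primary features f(x):\n", features(xs))
print("auxiliary features L[f](x):\n", generator_features(xs))
```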
89 4 Summary Here we developed compatible function approximators and natural policy gradients which only require estimation of the cost-to-go function. [sent-206, score-0.92]
90 The resulting approximation scheme is unusual, using policy-specific auxiliary features derived from the primary features. [sent-208, score-0.178]
91 In continuous time we also obtained a new policy gradient result for control problems that are not linearly-solvable, and showed that it generalizes results from deterministic optimal control. [sent-209, score-0.955]
92 We also derived a somewhat heuristic but nevertheless promising Gauss-Newton method for solving for the optimal cost-to-go directly; it appears to be a hybrid between value iteration and policy gradient. [sent-210, score-0.661]
93 One might wonder why we need policy gradients here given that the (exponentiated) Bellman equation is linear, and approximating its solution using features is faster than any other procedure in Reinforcement Learning and Approximate Dynamic Programming. [sent-211, score-0.692]
94 The answer is that minimizing Bellman error does not always give the best policy – as illustrated in Figure 1B. [sent-212, score-0.571]
95 Indeed a combined approach may be optimal: solve the linear Bellman equation approximately [17], and then use the solution to initialize the policy gradient method. [sent-213, score-0.806]
96 We do not see this as a shortcoming because, despite all the effort that has gone into model-free RL, the resulting methods do not seem applicable to truly complex optimal control problems. [sent-216, score-0.11]
97 Our methods involve model-based sampling which combines the best of both worlds: computational speed, and grounding in reality (assuming we have a good model of reality). [sent-217, score-0.083]
98 Optimal control and nonlinear filtering for nondegenerate diffusion processes. [sent-237, score-0.151]
99 Policy gradient methods for reinforcement learning with function approximation. [sent-282, score-0.247]
100 Simple statistical gradient-following algorithms for connectionist reinforcement learning. [sent-298, score-0.247]
wordName wordTfidf (topN-words)
[('policy', 0.571), ('ow', 0.338), ('ox', 0.31), ('lmdps', 0.262), ('gradient', 0.201), ('lmdp', 0.191), ('compatible', 0.174), ('approximator', 0.162), ('mdps', 0.112), ('bellman', 0.112), ('stationary', 0.108), ('wt', 0.104), ('traditional', 0.081), ('vanishes', 0.078), ('approximators', 0.077), ('diffusions', 0.077), ('parameterization', 0.076), ('diffusion', 0.076), ('density', 0.075), ('auxiliary', 0.075), ('control', 0.075), ('nitesimal', 0.073), ('differential', 0.072), ('generator', 0.072), ('hjb', 0.071), ('oxx', 0.071), ('natural', 0.054), ('ou', 0.051), ('fisher', 0.049), ('ito', 0.048), ('metronome', 0.048), ('parameterizes', 0.048), ('shortcut', 0.048), ('stochastic', 0.046), ('reinforcement', 0.046), ('compatibility', 0.045), ('gradients', 0.044), ('discretization', 0.044), ('features', 0.043), ('operator', 0.042), ('lqg', 0.042), ('theorem', 0.041), ('approximation', 0.039), ('hl', 0.038), ('orthogonal', 0.038), ('rt', 0.038), ('deterministic', 0.037), ('exp', 0.036), ('continuous', 0.036), ('replacing', 0.036), ('cost', 0.036), ('euler', 0.036), ('todorov', 0.036), ('parameterized', 0.036), ('optimal', 0.035), ('sense', 0.034), ('equation', 0.034), ('adjoint', 0.033), ('controlled', 0.032), ('sampling', 0.032), ('ergodic', 0.031), ('exponentiated', 0.03), ('trajectory', 0.03), ('rl', 0.029), ('unusual', 0.029), ('peters', 0.029), ('iteration', 0.029), ('discrete', 0.029), ('brownian', 0.028), ('reality', 0.028), ('passive', 0.027), ('sensible', 0.027), ('supplement', 0.027), ('dividing', 0.027), ('needed', 0.026), ('somewhat', 0.026), ('differentiating', 0.025), ('state', 0.025), ('dynamics', 0.025), ('rather', 0.025), ('coincides', 0.024), ('yields', 0.024), ('actually', 0.024), ('valid', 0.023), ('involve', 0.023), ('normally', 0.022), ('satisfy', 0.022), ('term', 0.022), ('option', 0.021), ('scheme', 0.021), ('energy', 0.021), ('trajectories', 0.021), ('vanish', 0.021), ('compositionality', 0.021), ('stochastics', 0.021), ('peculiar', 0.021), ('covariant', 0.021), ('ended', 0.021), ('differencing', 0.021), ('konda', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999934 208 nips-2010-Policy gradients in linearly-solvable MDPs
Author: Emanuel Todorov
Abstract: We present policy gradient results within the framework of linearly-solvable MDPs. For the first time, compatible function approximators and natural policy gradients are obtained by estimating the cost-to-go function, rather than the (much larger) state-action advantage function as is necessary in traditional MDPs. We also develop the first compatible function approximators and natural policy gradients for continuous-time stochastic systems.
2 0.44196939 189 nips-2010-On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient
Author: Tang Jie, Pieter Abbeel
Abstract: Likelihood ratio policy gradient methods have been some of the most successful reinforcement learning algorithms, especially for learning on physical systems. We describe how the likelihood ratio policy gradient can be derived from an importance sampling perspective. This derivation highlights how likelihood ratio methods under-use past experience by (i) using the past experience to estimate only the gradient of the expected return U (θ) at the current policy parameterization θ, rather than to obtain a more complete estimate of U (θ), and (ii) using past experience under the current policy only rather than using all past experience to improve the estimates. We present a new policy search method, which leverages both of these observations as well as generalized baselines—a new technique which generalizes commonly used baseline techniques for policy gradient methods. Our algorithm outperforms standard likelihood ratio policy gradient algorithms on several testbeds. 1
3 0.37028357 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks
Author: Atsushi Miyamae, Yuichi Nagata, Isao Ono, Shigenobu Kobayashi
Abstract: In this paper, we propose an efficient algorithm for estimating the natural policy gradient using parameter-based exploration; this algorithm samples directly in the parameter space. Unlike previous methods based on natural gradients, our algorithm calculates the natural policy gradient using the inverse of the exact Fisher information matrix. The computational cost of this algorithm is equal to that of conventional policy gradients whereas previous natural policy gradient methods have a prohibitive computational cost. Experimental results show that the proposed method outperforms several policy gradient methods. 1
4 0.35453713 152 nips-2010-Learning from Logged Implicit Exploration Data
Author: Alex Strehl, John Langford, Lihong Li, Sham M. Kakade
Abstract: We provide a sound and consistent foundation for the use of nonrandom exploration data in “contextual bandit” or “partially labeled” settings where only the value of a chosen action is learned. The primary challenge in a variety of settings is that the exploration policy, in which “offline” data is logged, is not explicitly known. Prior solutions here require either control of the actions during the learning process, recorded random exploration, or actions chosen obliviously in a repeated manner. The techniques reported here lift these restrictions, allowing the learning of a policy for choosing actions given features from historical data where no randomization occurred or was logged. We empirically verify our solution on two reasonably sized sets of real-world data obtained from Yahoo!.
5 0.33079317 14 nips-2010-A Reduction from Apprenticeship Learning to Classification
Author: Umar Syed, Robert E. Schapire
Abstract: We provide new theoretical results for apprenticeship learning, a variant of reinforcement learning in which the true reward function is unknown, and the goal is to perform well relative to an observed expert. We study a common approach to learning from expert demonstrations: using a classification algorithm to learn to imitate the expert’s behavior. Although this straightforward learning strategy is widely-used in practice, it has been subject to very little formal analysis. We prove that, if the learned classifier has error rate ǫ, the difference between the √ value of the apprentice’s policy and the expert’s policy is O( ǫ). Further, we prove that this difference is only O(ǫ) when the expert’s policy is close to optimal. This latter result has an important practical consequence: Not only does imitating a near-optimal expert result in a better policy, but far fewer demonstrations are required to successfully imitate such an expert. This suggests an opportunity for substantial savings whenever the expert is known to be good, but demonstrations are expensive or difficult to obtain. 1
6 0.31654754 160 nips-2010-Linear Complementarity for Regularized Policy Evaluation and Improvement
7 0.28807643 184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning
8 0.28598681 78 nips-2010-Error Propagation for Approximate Policy and Value Iteration
9 0.25717348 134 nips-2010-LSTD with Random Projections
10 0.21899027 43 nips-2010-Bootstrapping Apprenticeship Learning
11 0.20239128 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
12 0.18071423 181 nips-2010-Network Flow Algorithms for Structured Sparsity
13 0.15602547 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning
14 0.15288159 212 nips-2010-Predictive State Temporal Difference Learning
15 0.1488024 38 nips-2010-Batch Bayesian Optimization via Simulation Matching
16 0.14331332 269 nips-2010-Throttling Poisson Processes
17 0.13711326 141 nips-2010-Layered image motion with explicit occlusions, temporal consistency, and depth ordering
18 0.12824807 64 nips-2010-Distributionally Robust Markov Decision Processes
19 0.12691754 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains
20 0.12110718 93 nips-2010-Feature Construction for Inverse Reinforcement Learning
topicId topicWeight
[(0, 0.255), (1, -0.508), (2, -0.09), (3, -0.049), (4, 0.024), (5, -0.124), (6, 0.105), (7, 0.039), (8, 0.026), (9, -0.202), (10, 0.023), (11, -0.114), (12, 0.031), (13, 0.088), (14, -0.073), (15, 0.045), (16, 0.091), (17, -0.054), (18, -0.037), (19, -0.066), (20, 0.018), (21, 0.013), (22, -0.009), (23, -0.004), (24, -0.04), (25, -0.059), (26, 0.038), (27, 0.082), (28, -0.087), (29, 0.027), (30, -0.0), (31, -0.014), (32, 0.08), (33, -0.031), (34, -0.101), (35, 0.036), (36, -0.008), (37, -0.008), (38, -0.022), (39, 0.078), (40, -0.018), (41, 0.003), (42, -0.077), (43, -0.069), (44, 0.008), (45, -0.008), (46, -0.055), (47, -0.046), (48, 0.024), (49, -0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.97432172 208 nips-2010-Policy gradients in linearly-solvable MDPs
Author: Emanuel Todorov
Abstract: We present policy gradient results within the framework of linearly-solvable MDPs. For the first time, compatible function approximators and natural policy gradients are obtained by estimating the cost-to-go function, rather than the (much larger) state-action advantage function as is necessary in traditional MDPs. We also develop the first compatible function approximators and natural policy gradients for continuous-time stochastic systems.
2 0.9037869 189 nips-2010-On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient
Author: Tang Jie, Pieter Abbeel
Abstract: Likelihood ratio policy gradient methods have been some of the most successful reinforcement learning algorithms, especially for learning on physical systems. We describe how the likelihood ratio policy gradient can be derived from an importance sampling perspective. This derivation highlights how likelihood ratio methods under-use past experience by (i) using the past experience to estimate only the gradient of the expected return U (θ) at the current policy parameterization θ, rather than to obtain a more complete estimate of U (θ), and (ii) using past experience under the current policy only rather than using all past experience to improve the estimates. We present a new policy search method, which leverages both of these observations as well as generalized baselines—a new technique which generalizes commonly used baseline techniques for policy gradient methods. Our algorithm outperforms standard likelihood ratio policy gradient algorithms on several testbeds. 1
3 0.87883544 160 nips-2010-Linear Complementarity for Regularized Policy Evaluation and Improvement
Author: Jeffrey Johns, Christopher Painter-wakefield, Ronald Parr
Abstract: Recent work in reinforcement learning has emphasized the power of L1 regularization to perform feature selection and prevent overfitting. We propose formulating the L1 regularized linear fixed point problem as a linear complementarity problem (LCP). This formulation offers several advantages over the LARS-inspired formulation, LARS-TD. The LCP formulation allows the use of efficient off-theshelf solvers, leads to a new uniqueness result, and can be initialized with starting points from similar problems (warm starts). We demonstrate that warm starts, as well as the efficiency of LCP solvers, can speed up policy iteration. Moreover, warm starts permit a form of modified policy iteration that can be used to approximate a “greedy” homotopy path, a generalization of the LARS-TD homotopy path that combines policy evaluation and optimization. 1
4 0.85037369 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks
Author: Atsushi Miyamae, Yuichi Nagata, Isao Ono, Shigenobu Kobayashi
Abstract: In this paper, we propose an efficient algorithm for estimating the natural policy gradient using parameter-based exploration; this algorithm samples directly in the parameter space. Unlike previous methods based on natural gradients, our algorithm calculates the natural policy gradient using the inverse of the exact Fisher information matrix. The computational cost of this algorithm is equal to that of conventional policy gradients whereas previous natural policy gradient methods have a prohibitive computational cost. Experimental results show that the proposed method outperforms several policy gradient methods. 1
5 0.82811916 14 nips-2010-A Reduction from Apprenticeship Learning to Classification
Author: Umar Syed, Robert E. Schapire
Abstract: We provide new theoretical results for apprenticeship learning, a variant of reinforcement learning in which the true reward function is unknown, and the goal is to perform well relative to an observed expert. We study a common approach to learning from expert demonstrations: using a classification algorithm to learn to imitate the expert’s behavior. Although this straightforward learning strategy is widely-used in practice, it has been subject to very little formal analysis. We prove that, if the learned classifier has error rate ǫ, the difference between the √ value of the apprentice’s policy and the expert’s policy is O( ǫ). Further, we prove that this difference is only O(ǫ) when the expert’s policy is close to optimal. This latter result has an important practical consequence: Not only does imitating a near-optimal expert result in a better policy, but far fewer demonstrations are required to successfully imitate such an expert. This suggests an opportunity for substantial savings whenever the expert is known to be good, but demonstrations are expensive or difficult to obtain. 1
6 0.82384455 152 nips-2010-Learning from Logged Implicit Exploration Data
7 0.79383743 78 nips-2010-Error Propagation for Approximate Policy and Value Iteration
8 0.73721045 184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning
9 0.66788787 38 nips-2010-Batch Bayesian Optimization via Simulation Matching
10 0.64408201 43 nips-2010-Bootstrapping Apprenticeship Learning
11 0.61763704 134 nips-2010-LSTD with Random Projections
12 0.57857239 50 nips-2010-Constructing Skill Trees for Reinforcement Learning Agents from Demonstration Trajectories
13 0.54625392 212 nips-2010-Predictive State Temporal Difference Learning
14 0.53150499 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning
15 0.5095517 269 nips-2010-Throttling Poisson Processes
16 0.47938043 64 nips-2010-Distributionally Robust Markov Decision Processes
17 0.45346996 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains
18 0.40075547 93 nips-2010-Feature Construction for Inverse Reinforcement Learning
19 0.40032309 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
20 0.3822009 4 nips-2010-A Computational Decision Theory for Interactive Assistants
topicId topicWeight
[(13, 0.038), (17, 0.014), (27, 0.054), (30, 0.047), (35, 0.02), (45, 0.254), (50, 0.062), (52, 0.028), (53, 0.289), (60, 0.043), (77, 0.04), (90, 0.025)]
simIndex simValue paperId paperTitle
1 0.85319376 226 nips-2010-Repeated Games against Budgeted Adversaries
Author: Jacob D. Abernethy, Manfred K. Warmuth
Abstract: We study repeated zero-sum games against an adversary on a budget. Given that an adversary has some constraint on the sequence of actions that he plays, we consider what ought to be the player’s best mixed strategy with knowledge of this budget. We show that, for a general class of normal-form games, the minimax strategy is indeed efficiently computable and relies on a “random playout” technique. We give three diverse applications of this new algorithmic template: a cost-sensitive “Hedge” setting, a particular problem in Metrical Task Systems, and the design of combinatorial prediction markets. 1
2 0.85311174 2 nips-2010-A Bayesian Approach to Concept Drift
Author: Stephen Bach, Mark Maloof
Abstract: To cope with concept drift, we placed a probability distribution over the location of the most-recent drift point. We used Bayesian model comparison to update this distribution from the predictions of models trained on blocks of consecutive observations and pruned potential drift points with low probability. We compare our approach to a non-probabilistic method for drift and a probabilistic method for change-point detection. In our experiments, our approach generally yielded improved accuracy and/or speed over these other methods. 1
same-paper 3 0.81436664 208 nips-2010-Policy gradients in linearly-solvable MDPs
Author: Emanuel Todorov
Abstract: We present policy gradient results within the framework of linearly-solvable MDPs. For the first time, compatible function approximators and natural policy gradients are obtained by estimating the cost-to-go function, rather than the (much larger) state-action advantage function as is necessary in traditional MDPs. We also develop the first compatible function approximators and natural policy gradients for continuous-time stochastic systems.
4 0.80613893 90 nips-2010-Fast Large-scale Mixture Modeling with Component-specific Data Partitions
Author: Bo Thiesson, Chong Wang
Abstract: Remarkably easy implementation and guaranteed convergence has made the EM algorithm one of the most used algorithms for mixture modeling. On the downside, the E-step is linear in both the sample size and the number of mixture components, making it impractical for large-scale data. Based on the variational EM framework, we propose a fast alternative that uses component-specific data partitions to obtain a sub-linear E-step in sample size, while the algorithm still maintains provable convergence. Our approach builds on previous work, but is significantly faster and scales much better in the number of mixture components. We demonstrate this speedup by experiments on large-scale synthetic and real data. 1
5 0.78879899 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods
Author: Xinhua Zhang, Ankan Saha, S.v.n. Vishwanathan
Abstract: In a recent paper Joachims [1] presented SVM-Perf, a cutting plane method (CPM) for training linear Support Vector Machines (SVMs) which converges to an accurate solution in O(1/ 2 ) iterations. By tightening the analysis, Teo et al. [2] showed that O(1/ ) iterations suffice. Given the impressive convergence speed of CPM on a number of practical problems, it was conjectured that these rates could be further improved. In this paper we disprove this conjecture. We present counter examples which are not only applicable for training linear SVMs with hinge loss, but also hold for support vector methods which optimize a multivariate performance score. However, surprisingly, these problems are not inherently hard. By exploiting the structure of the objective function we can devise an algo√ rithm that converges in O(1/ ) iterations. 1
6 0.71009171 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
7 0.70727831 160 nips-2010-Linear Complementarity for Regularized Policy Evaluation and Improvement
8 0.70703524 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks
9 0.70447552 38 nips-2010-Batch Bayesian Optimization via Simulation Matching
10 0.70391786 180 nips-2010-Near-Optimal Bayesian Active Learning with Noisy Observations
11 0.70336854 212 nips-2010-Predictive State Temporal Difference Learning
12 0.70191509 64 nips-2010-Distributionally Robust Markov Decision Processes
13 0.6978476 184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning
14 0.69687235 78 nips-2010-Error Propagation for Approximate Policy and Value Iteration
15 0.6948756 287 nips-2010-Worst-Case Linear Discriminant Analysis
16 0.69363469 63 nips-2010-Distributed Dual Averaging In Networks
17 0.69293207 239 nips-2010-Sidestepping Intractable Inference with Structured Ensemble Cascades
18 0.69280529 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
19 0.69276851 269 nips-2010-Throttling Poisson Processes
20 0.69234806 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models