nips nips2008 nips2008-210 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: John W. Roberts, Russ Tedrake
Abstract: Policy gradient (PG) reinforcement learning algorithms have strong (local) convergence guarantees, but their learning performance is typically limited by a large variance in the estimate of the gradient. In this paper, we formulate the variance reduction problem by describing a signal-to-noise ratio (SNR) for policy gradient algorithms, and evaluate this SNR carefully for the popular Weight Perturbation (WP) algorithm. We confirm that SNR is a good predictor of long-term learning performance, and that in our episodic formulation, the cost-to-go function is indeed the optimal baseline. We then propose two modifications to traditional model-free policy gradient algorithms in order to optimize the SNR. First, we examine WP using anisotropic sampling distributions, which introduces a bias into the update but increases the SNR; this bias can be interpreted as following the natural gradient of the cost function. Second, we show that non-Gaussian distributions can also increase the SNR, and argue that the optimal isotropic distribution is a ‘shell’ distribution with a constant magnitude and uniform distribution in direction. We demonstrate that both modifications produce substantial improvements in learning performance in challenging policy gradient experiments. 1
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper, we formulate the variance reduction problem by describing a signal-to-noise ratio (SNR) for policy gradient algorithms, and evaluate this SNR carefully for the popular Weight Perturbation (WP) algorithm. [sent-3, score-0.401]
2 We then propose two modifications to traditional model-free policy gradient algorithms in order to optimize the SNR. [sent-5, score-0.348]
3 First, we examine WP using anisotropic sampling distributions, which introduces a bias into the update but increases the SNR; this bias can be interpreted as following the natural gradient of the cost function. [sent-6, score-0.545]
4 Second, we show that non-Gaussian distributions can also increase the SNR, and argue that the optimal isotropic distribution is a ‘shell’ distribution with a constant magnitude and uniform distribution in direction. [sent-7, score-0.177]
5 We demonstrate that both modifications produce substantial improvements in learning performance in challenging policy gradient experiments. [sent-8, score-0.371]
6 1 Introduction Model-free policy gradient algorithms allow for the optimization of control policies on systems which are impractical to model effectively, whether due to cost, complexity or uncertainty in the very structure and dynamics of the system (Kohl & Stone, 2004; Tedrake et al. [sent-9, score-0.465]
7 As the same systems on which one wishes to use these algorithms tend to have a high cost of policy evaluation, much work has been done on maximizing the policy improvement from any individual evaluation (Meuleau et al. [sent-13, score-0.497]
8 , 2003a) and GPOMDP (Baxter & Bartlett, 2001) have become popular through their ability to match the performance gains of more basic model-free policy gradient algorithms while using fewer policy evaluations. [sent-17, score-0.532]
9 As practitioners of policy gradient algorithms in complicated mechanical systems, our group has a vested interest in making practical and substantial improvements to the performance of these algorithms. [sent-18, score-0.371]
10 Variance reduction, in itself, is not a sufficient metric for optimizing the performance of PG algorithms - of greater significance is the magnitude of the variance relative to the magnitude of the gradient update. [sent-19, score-0.392]
11 Here we formulate a signal-to-noise ratio (SNR) which facilitates simple and fast evaluations of a PG algorithm’s average performance, and facilitates algorithmic performance improvements. [sent-20, score-0.063]
12 Though the SNR does not capture all facets of a policy gradient algorithm’s capability to learn, we show that achieving a high SNR will often result in a superior convergence rate with less violent variations in the policy. [sent-21, score-0.406]
13 Through a close analysis of the SNR, and the means by which it is maximized, we find several modifications to traditional model-free policy gradient updates that improve learning performance. [sent-22, score-0.378]
14 The first of these is the reshaping of distributions such that they are different on different parameters, a modification which introduces a bias to the update. [sent-23, score-0.171]
15 We show that this reshaping can improve performance, and that the introduced bias results in following the natural gradient of the cost function, rather than the true point gradient. [sent-24, score-0.474]
16 The second improvement is the use of non-Gaussian distributions for sampling, and through the SNR we find a simple distribution which improves performance without increasing the complexity of implementation. [sent-25, score-0.11]
17 2 The weight perturbation update Consider minimizing a scalar function J(w) with respect to the parameters w (note that it is possible that J(w) is a long-term cost and results from running a system with the parameters w until conclusion). [sent-26, score-0.312]
18 Performing a first-order Taylor expansion of J(w + z) yields: $\Delta w = -\eta \left( J(w) + \sum_i \frac{\partial J}{\partial w_i} z_i - J(w) \right) z = -\eta \left( \sum_i \frac{\partial J}{\partial w_i} z_i \right) z$. (2) [sent-28, score-0.185]
19 In expectation, this becomes the gradient times a (diagonal) covariance matrix, and reduces to $E[\Delta w] = -\eta \sigma^2 \frac{\partial J}{\partial w}$, (3) an unbiased estimate of the gradient, scaled by the learning rate and σ², the variance of the perturbation. [sent-29, score-0.217]
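As an illustrative sketch of the update above (Equation (1) and its expansion in Equations (2)-(3)), the following Python snippet implements one weight-perturbation step; the cost function, learning rate, and perturbation scale are placeholder assumptions, not settings from the paper.

```python
import numpy as np

def wp_step(J, w, eta=0.01, sigma=0.1, rng=np.random.default_rng()):
    """One weight-perturbation update: dw = -eta * (J(w + z) - J(w)) * z."""
    z = rng.normal(0.0, sigma, size=w.shape)   # isotropic Gaussian perturbation
    dJ = J(w + z) - J(w)                       # finite-difference cost change along z
    return w - eta * dJ * z                    # step opposite the estimated slope

# Usage: minimize a simple quadratic cost J(w) = w^T w.
w = np.array([1.0, -2.0, 0.5])
for _ in range(200):
    w = wp_step(lambda v: v @ v, w)
```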
20 However, this unbiasedness comes with a very high variance, as the direction of an update is uniformly distributed. [sent-30, score-0.146]
21 It is only the fact that updates near the direction of the true gradient have a larger magnitude than do those nearly perpendicular to the gradient that allows for the true gradient to be achieved in expectation. [sent-31, score-0.806]
22 Note also that all samples parallel to the gradient are equally useful, whether they be in the same or opposite direction, as the sign does not affect the resulting update. [sent-32, score-0.235]
23 The WP algorithm is one of the simplest examples of a policy gradient reinforcement learning algorithm, and thus is well suited for analysis. [sent-33, score-0.402]
24 In the special case when z is drawn from a Gaussian distribution, weight perturbation can be interpreted as a REINFORCE update (Williams, 1992). [sent-34, score-0.119]
25 3 SNR for policy gradient algorithms The SNR is the expected power of the signal (update in the direction of the true gradient) divided by the expected power of the noise (update perpendicular to the true gradient). [sent-35, score-0.537]
26 Taking care to ensure that the magnitude of the true gradient does not affect the SNR, we have: $\mathrm{SNR} = \frac{E[\Delta w_\parallel^T \Delta w_\parallel]}{E[\Delta w_\perp^T \Delta w_\perp]}$, (4) with $\Delta w_\parallel = \left( \Delta w^T \hat{J}_w \right) \hat{J}_w$ and $\Delta w_\perp = \Delta w - \Delta w_\parallel$, (5) using $J_w(w_0) = \left. \frac{\partial J(w)}{\partial w} \right|_{w=w_0}$ and $\hat{J}_w = J_w / \| J_w \|$ for convenience. [sent-36, score-0.291]
27 Intuitively, this expression measures how large a proportion of the update is “useful”. [sent-37, score-0.138]
28 If the update is purely in the direction of the gradient the SNR would be infinite, while if the update moved perpendicular to the true gradient, it would be zero. [sent-38, score-0.488]
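The ratio in Equations (4)-(5) can be estimated by Monte Carlo, projecting sampled updates onto the true gradient direction; a sketch follows, in which the locally linear cost, gradient, and sample count are illustrative assumptions.

```python
import numpy as np

def snr_estimate(grad, sample_update, n_samples=50_000):
    """Monte Carlo SNR: E[|dw_par|^2] / E[|dw_perp|^2], per Equations (4)-(5)."""
    g_hat = grad / np.linalg.norm(grad)
    num = den = 0.0
    for _ in range(n_samples):
        dw = sample_update()
        dw_par = (dw @ g_hat) * g_hat          # component along the true gradient
        dw_perp = dw - dw_par                  # component perpendicular to it
        num += dw_par @ dw_par
        den += dw_perp @ dw_perp
    return num / den

# WP updates on a locally linear cost with gradient g and Gaussian perturbations.
N, sigma, eta = 5, 0.1, 1.0
g = np.ones(N)
rng = np.random.default_rng(0)

def wp_update():
    z = rng.normal(0.0, sigma, size=N)
    return -eta * (g @ z) * z                  # first-order WP update, Equation (2)

print(snr_estimate(g, wp_update))
```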
29 As such, all else being equal, an update with a higher SNR should generally perform as well as or better than one with a lower SNR, and result in less violent swings in cost and policy for the same improvement in performance. [sent-39, score-0.356]
30 1 Weight perturbation with Gaussian distributions Evaluating the SNR for the WP update in Equation 1 with a deterministic J(w) and z drawn from a Gaussian distribution yields a surprisingly simple result. [sent-41, score-0.247]
31 We now expand the denominator as follows: $E[\Delta w_\perp^T \Delta w_\perp] = E[\Delta w^T \Delta w - 2 \Delta w_\parallel^T (\Delta w_\parallel + \Delta w_\perp) + \Delta w_\parallel^T \Delta w_\parallel] = E[\Delta w^T \Delta w] - 2Q + Q$, (7) where $Q = E[\Delta w_\parallel^T \Delta w_\parallel]$ is the numerator of the SNR. Substituting Equation (1) into Equation (7) and simplifying results in: $E[\Delta w_\perp^T \Delta w_\perp] = \eta^2\, E\!\left[ \sum_{i,j,k} J_{w_i} J_{w_j}\, z_i z_j z_k^2 \right] - Q$. (8) [sent-43, score-0.148]
32 2E i,j,k Jw (8) We now assume that each component zi is drawn from a Gaussian distribution with variance σ 2 . [sent-44, score-0.16]
33 As can be seen in Figure 1a, both the SNR and the reduction in cost after running WP for 100 iterations decrease monotonically as the number of parameters N increases. [sent-54, score-0.126]
34 The fact that this occurs in the case of randomly generated cost functions demonstrates that this effect is not related to the simple form of the identity cost function, but is in fact related to the number of dimensions. [sent-55, score-0.229]
35 For randomly generated cost functions, 150 A matrices were tested. [sent-59, score-0.103]
36 True gradient descent was run on the identity cost function. [sent-60, score-0.313]
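Using an estimator like the one sketched after Equation (5), the dimension dependence can be checked numerically; the sketch below estimates the Gaussian-WP SNR as the number of parameters N grows (the random gradients, σ, and sample counts are illustrative assumptions, not the paper's experimental settings).

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_wp_snr(N, sigma=0.1, n_samples=100_000):
    """Monte Carlo SNR of the Gaussian WP update for an N-parameter cost."""
    g = rng.normal(size=N)                      # a random true gradient
    g_hat = g / np.linalg.norm(g)
    z = rng.normal(0.0, sigma, size=(n_samples, N))
    dw = -(z @ g)[:, None] * z                  # first-order WP updates (eta = 1)
    par = dw @ g_hat
    perp_sq = (dw ** 2).sum(axis=1) - par ** 2
    return (par ** 2).mean() / perp_sq.mean()

# SNR versus the number of parameters N (cf. Figure 1a).
for N in (2, 5, 10, 20, 50):
    print(N, gaussian_wp_snr(N))
```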
37 (B) Relationship as the Gaussian is reshaped by changing variances, for the case of a 2D anisotropic cost function (ratio of the gradients in the two directions is 5) (Section 4. [sent-62, score-0.368]
38 The plot shows that variances which increase the SNR also improve the performance of the update. [sent-69, score-0.077]
39 3 SNR with parameter-independent additive noise In many real world systems, the evaluation of the cost J(w) is not deterministic, a property which can significantly affect learning performance. [sent-71, score-0.223]
40 In this section we investigate how additive ‘noise’ in the function evaluation affects the analytical expression for the SNR. [sent-72, score-0.085]
41 We demonstrate that for very high noise WP begins to behave like a random walk, and we find in the SNR the motivation for an improvement in the WP algorithm that will be examined in Section 4. [sent-73, score-0.08]
42 Consider modifying the update seen in Equation (1) to allow for a parameter-independent additive noise term v and a more general baseline b(w), and again perform the Taylor expansion. [sent-75, score-0.168]
43 Writing the update with these terms gives: $\Delta w = -\eta \left( J(w) + \sum_i J_{w_i} z_i - b(w) + v \right) z = -\eta \left( \sum_i J_{w_i} z_i + \xi(w) \right) z$. [sent-76, score-0.261]
44 The new variable ξ(w) has two important properties: its mean can be controlled through the value of b(w), and its distribution is independent of the parameters w; thus ξ(w) is independent of all the zi. [sent-78, score-0.105]
45 This expression depends upon the fact that the noise v is mean-zero and independent of the parameters, although as stated earlier, the assumption that v is mean-zero is not limiting. [sent-82, score-0.112]
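A sketch of the update with measurement noise and a baseline b(w), following the expansion above; the noise level, baseline choice, and cost function are assumptions for illustration (the paper argues the optimal baseline is the cost-to-go, which the noiseless J(w) stands in for here).

```python
import numpy as np

def wp_step_noisy(J_noisy, w, baseline, eta=0.01, sigma=0.1,
                  rng=np.random.default_rng()):
    """WP update with a noisy cost evaluation and a general baseline b(w).

    J_noisy(w) returns J(w) + v with v mean-zero, so the effective noise term
    is xi(w) = J(w) - b(w) + v; a good baseline keeps xi(w) small.
    """
    z = rng.normal(0.0, sigma, size=w.shape)
    return w - eta * (J_noisy(w + z) - baseline) * z

# Illustration: quadratic cost with additive Gaussian measurement noise.
rng = np.random.default_rng(1)
J_true = lambda w: w @ w
J_noisy = lambda w: J_true(w) + rng.normal(0.0, 0.5)
w = np.array([2.0, -1.0])
for _ in range(500):
    w = wp_step_noisy(J_noisy, w, baseline=J_true(w), rng=rng)   # idealized baseline
```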
46 It is clear that in the limit of small φ the expression reduces to that seen in Equation (11), while in the limit of very large φ it becomes the expression for the SNR of a random walk (see Section 3. [sent-83, score-0.13]
47 This expression makes it clear that minimizing φ is desirable, a result that suggests two things: (1) the optimal baseline (from the perspective of the SNR) is the value function (i. [sent-85, score-0.067]
48 Thus, in the case of noisy measurements, there is some optimal sampling distance that is as large as possible without resulting in poor sampling of the local gradient. [sent-89, score-0.104]
49 4 SNR of a Random Walk Due to the fact that the update is squared in the SNR, only the degree to which it is parallel to the true gradient is relevant, not its direction. [sent-94, score-0.31]
50 In the case of WP on a deterministic function, this is not a concern as the update is always within 90◦ of the gradient, and thus the parallel component is always in the correct direction. [sent-95, score-0.115]
51 For a system with noise, however, components of the update parallel to the gradient can in fact be in the incorrect direction, contributing to the SNR even though they do not actually result in learning. [sent-96, score-0.322]
52 This effect only becomes significant when the noise is particularly large, and reaches its extreme in the case of a true random walk (a strong bias in the “wrong” direction is in fact a good update with an incorrect sign). [sent-97, score-0.294]
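The random-walk limit can be checked numerically: as the measurement noise dominates the cost difference, the update direction becomes essentially uniform and the measured SNR approaches that of a uniformly random update. A sketch, with illustrative noise levels and dimension:

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_snr(updates, g_hat):
    """SNR of a batch of updates with respect to the unit gradient g_hat."""
    par = updates @ g_hat                        # signed parallel components
    perp_sq = (updates ** 2).sum(axis=1) - par ** 2
    return (par ** 2).mean() / perp_sq.mean()

N, sigma, n = 10, 0.1, 200_000
g = np.ones(N)
g_hat = g / np.linalg.norm(g)
z = rng.normal(0.0, sigma, size=(n, N))

for noise_std in (0.0, 0.1, 10.0):               # increasing measurement noise
    xi = rng.normal(0.0, noise_std, size=n)
    dw = -((z @ g) + xi)[:, None] * z            # WP updates with noisy evaluations
    print(noise_std, empirical_snr(dw, g_hat))

# Reference: uniformly random update directions (a pure random walk).
u = rng.normal(size=(n, N))
u /= np.linalg.norm(u, axis=1, keepdims=True)
print("random walk", empirical_snr(u, g_hat))
```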
53 1 Reshaping the Gaussian Distribution Consider a generalized WP algorithm, in which we allow each component zi to be drawn independently from separate mean-zero distributions. [sent-101, score-0.108]
54 1, we no longer assume each zi is drawn from an identical distribution, but rather associate each with its own σi (the vector of the σi will be referred to as σ). [sent-103, score-0.108]
55 (16) An important property of this SNR is that it depends only upon the direction of Jw and the relative magnitude of the σi (as opposed to parameters such as the learning rate η and the absolute magnitudes σ and Jw ). [sent-105, score-0.2]
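A sketch of the generalized update in which each zi is drawn with its own standard deviation σi; the particular σ values below are illustrative.

```python
import numpy as np

def wp_step_reshaped(J, w, sigmas, eta=0.01, rng=np.random.default_rng()):
    """WP update with per-parameter standard deviations sigma_i.

    In expectation the step is -eta * diag(sigmas**2) @ grad J, i.e. the
    gradient premultiplied by the (diagonal) sampling covariance.
    """
    z = rng.normal(0.0, sigmas, size=w.shape)    # anisotropic mean-zero sampling
    return w - eta * (J(w + z) - J(w)) * z

# Example: put more exploration on the first parameter than on the second.
w = np.array([1.0, 1.0])
sigmas = np.array([0.2, 0.05])
w = wp_step_reshaped(lambda v: v @ v, w, sigmas)
```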
56 1 Effect of reshaping on performance While the absolute magnitudes of the variance and true gradient do not affect the SNR given in Equation (16), the relative magnitudes of the different σi and their relationship to the true gradient can affect it. [sent-108, score-0.777]
57 To study this property, we investigate a cost function with a significant degree of anisotropy. [sent-109, score-0.103]
58 Using a cost function of the form given in Equation (12) and N = 2, we choose an A matrix whose first diagonal component is five times that of the second. [sent-110, score-0.103]
59 We observe the performance of the first update (rather than the full trial) as the true gradient can vary significantly over the course of a trial, thereby having major effects on the SNR even as the variances are unchanged. [sent-112, score-0.387]
60 As is clear in Figure 1b, as the SNR is increased through the choice of variances the performance of this update is improved. [sent-113, score-0.172]
61 The variation of the SNR is much more significant than the change in performance; however, this is not surprising, as the SNR is infinite if the update is exactly along the correct direction, while the improvement from this update will eventually saturate. [sent-114, score-0.22]
62 2 Demonstration in simulation The improved performance of the previous section suggests the possibility of a modification to the WP algorithm in which an estimate of the true gradient is used before each update to select new variances which are more likely to learn effectively. [sent-117, score-0.411]
63 Changing the shape of the distribution does add a bias to the update direction, but the resulting biased update is in fact descending the natural gradient of the cost function. [sent-118, score-0.548]
64 To make use of this opportunity, some knowledge of the likely gradient direction is required. [sent-119, score-0.238]
65 This knowledge can be provided via a momentum estimate (an average of previous updates) or through an inaccurate model that is able to capture some facets of the geometry of the cost function. [sent-120, score-0.13]
66 With this estimated gradient the expression given in Equation (16) can be optimized over the σi numerically using a method such as Sequential Quadratic Programming (SQP). [sent-121, score-0.23]
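Since Equation (16) is not reproduced in this extract, the sketch below optimizes a Monte Carlo estimate of the SNR over the relative σi, using SciPy's SLSQP routine in place of a dedicated SQP solver; the gradient estimate, total-variance constraint, and sample count are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
d = 2
base = rng.normal(size=(20_000, d))              # fixed base samples (common random numbers)

def neg_snr(sigmas, g_est):
    """Negative Monte Carlo SNR of the WP update for per-parameter sigmas."""
    g_hat = g_est / np.linalg.norm(g_est)
    z = base * sigmas                            # z_i ~ N(0, sigma_i^2)
    dw = -(z @ g_est)[:, None] * z               # first-order WP updates
    par = dw @ g_hat
    perp_sq = (dw ** 2).sum(axis=1) - par ** 2
    return -(par ** 2).mean() / perp_sq.mean()

g_est = np.array([5.0, 1.0])                     # e.g. a momentum-based gradient estimate
total_var = d * 0.1 ** 2                         # hold the total variance fixed
res = minimize(
    neg_snr, x0=np.full(d, 0.1), args=(g_est,), method="SLSQP",
    bounds=[(1e-3, None)] * d,                   # small floor on every sigma
    constraints={"type": "eq", "fun": lambda s: (s ** 2).sum() - total_var},
)
print(res.x)                                     # reshaped sigmas for the next trial
```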
67 placing some small minimum noise on all parameters regardless of the optimization), but ultimately this reshaping of the Gaussian can provide real performance benefits. [sent-124, score-0.183]
68 The task is to apply a horizontal force f to the cart such that the pole swings to the vertical position. [sent-126, score-0.069]
69 (b) The average of 200 curves showing reduction in cost versus trial number for both a symmetric Gaussian distribution and a distribution reshaped using the SNR. [sent-127, score-0.275]
70 The blue shaded region marks the area within one standard deviation for a symmetric Gaussian distribution, the red region marks one standard deviation for the reshaped distribution, and the purple region lies within one standard deviation of both. [sent-128, score-0.225]
71 The reshaping began on the eighth trial to give time for the momentum-based gradient estimate to stabilize. [sent-129, score-0.33]
72 To demonstrate the improvement in convergence time this reshaping can achieve, weight perturbation was used to develop a barycentric feedback policy for the cart-pole swingup task, where the cost was defined as a weighted sum of the actuation used and the squared distance from the upright position. [sent-130, score-0.498]
73 A gradient estimate was obtained through averaging previous updates, and SQP was used to optimize the SNR prior to each trial. [sent-131, score-0.187]
74 Figure 2 demonstrates the superior performance of the reshaped distribution over a symmetric Gaussian using the same total variance (i. [sent-132, score-0.147]
75 3 WP with Gaussian distributions follows the natural gradient The natural gradient for a policy that samples with a mean-zero Gaussian of covariance Σ may be written (see (Peters et al. [sent-137, score-0.633]
76 It is important to note that the natural gradient is a function of the shape of the sampling distribution, and it is because of this that all sampling distributions of this form can follow the natural gradient. [sent-142, score-0.366]
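A brief sketch of why the reshaped update can be read as a natural-gradient step, assuming the standard Fisher information of a Gaussian with respect to its mean (this reconstruction is not taken from the paper's own derivation):

```latex
% Sampling w + z with z ~ N(0, \Sigma) gives Fisher information F = \Sigma^{-1}
% with respect to the mean, so the natural gradient is
\tilde{\nabla} J(w) = F^{-1} \nabla_w J(w) = \Sigma\, \nabla_w J(w).
% The expected reshaped WP update (generalizing Equation (3)) is
E[\Delta w] = -\eta\, \Sigma\, \nabla_w J(w),
% i.e. a step along the natural gradient induced by the sampling covariance.
```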
77 3 suggests that for a function with noisy measurements there is an optimal sampling distance which depends upon the local noise and gradient as well as the strength of higher-order terms in that region. [sent-145, score-0.332]
78 For a two-dimensional cost function of the form given in Equation (12), Figure 3 shows the SNR’s dependence upon the radius of the shell distribution (i. [sent-146, score-0.27]
79 For various levels of additive mean-zero noise the SNR was computed for a distribution uniform in angle and fixed in its distance from the mean (this distance is the “sampling magnitude”). [sent-149, score-0.095]
80 The fact that there is a unique maximum for each case suggests the possibility of sampling only at that maximal magnitude, rather than over all magnitudes as is done with a Gaussian, and thus improving SNR and performance. [sent-150, score-0.13]
81 While determining the exact magnitude of maximum SNR may be impractical, performance can be improved by choosing a distribution with a uniformly distributed direction and a constant magnitude close to this optimal value. [sent-151, score-0.248]
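A sketch of sampling from the 'shell' distribution (constant magnitude, uniform direction); the radius below is an illustrative placeholder for the near-optimal sampling magnitude.

```python
import numpy as np

def shell_sample(dim, radius, rng=np.random.default_rng()):
    """Draw z with uniformly distributed direction and fixed magnitude ('shell')."""
    u = rng.normal(size=dim)                 # isotropic direction via a Gaussian draw
    return radius * u / np.linalg.norm(u)

def wp_step_shell(J, w, radius, eta=0.01, rng=np.random.default_rng()):
    z = shell_sample(w.size, radius, rng)
    return w - eta * (J(w + z) - J(w)) * z

# Example step on a quadratic cost with an assumed sampling magnitude of 0.1.
w = np.array([1.0, -1.0, 0.5])
w = wp_step_shell(lambda v: v @ v, w, radius=0.1)
```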
82 Mean-zero measurement noise is included with variances from 0 to . [sent-158, score-0.104]
83 As the noise is increased, the sampling magnitude producing the maximum SNR is larger and the SNR achieved is lower. [sent-160, score-0.178]
84 Note that the highest SNR achieved is for the smallest sampling magnitude with no noise where it approaches the theoretical value (for 2D) of 3. [sent-161, score-0.178]
85 Also note that for small sampling magnitudes and large noise levels the SNR approaches the random walk value. [sent-162, score-0.177]
86 Experimental Demonstration To provide compelling evidence of improved performance, the shell distribution was implemented on a laboratory experimental system with actuator limitations and innate stochasticity. [sent-163, score-0.168]
87 We have recently been exploring the use of PG algorithms in an incredibly difficult and exciting control domain - fluid dynamics - and as such applied the shell distribution to a fluid dynamical system. [sent-164, score-0.193]
88 Specifically, we applied learning to a system used to study the dynamics of flapping flight via a wing submerged in water (see Figure 4 for a description of the system (Vandenberghe et al. [sent-165, score-0.119]
, 2005) - in contrast, optimization on the experimental physical flapping wing can be done in real-time, at the cost of dealing with noise in the evaluation of the cost function; success here would be enabling for experimental fluid dynamics. [sent-169, score-0.311]
90 The plate rotates freely about its vertical axis, while the vertical motion is prescribed by the learnt policy. [sent-172, score-0.136]
91 (b) 5 averaged runs on the flapping plate using Gaussian or Shell distributions for sampling. [sent-174, score-0.071]
92 Beginning with a smoothed square wave, WP was run for 20 updates using shell distributions and Gaussians. [sent-178, score-0.191]
93 The sampling magnitude of the shell distribution was set to be the expected value of the length of a sample from the Gaussian distribution, while all other parameters were set equal. [sent-180, score-0.276]
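For reference, the expected length of an N-dimensional isotropic Gaussian sample (the quantity used above to set the shell's sampling magnitude) is the mean of a chi distribution; a small helper with a Monte Carlo sanity check, where the dimension and σ are arbitrary:

```python
import numpy as np
from scipy.special import gammaln

def expected_gaussian_norm(dim, sigma):
    """E[||z||] for z ~ N(0, sigma^2 I): sigma * sqrt(2) * Gamma((d+1)/2) / Gamma(d/2)."""
    return sigma * np.sqrt(2.0) * np.exp(gammaln((dim + 1) / 2) - gammaln(dim / 2))

# Sanity check against samples.
rng = np.random.default_rng(0)
dim, sigma = 7, 0.1
samples = rng.normal(0.0, sigma, size=(100_000, dim))
print(expected_gaussian_norm(dim, sigma), np.linalg.norm(samples, axis=1).mean())
```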
94 With optimized sampling, we acquired locally optimal policies in as little as 15 minutes, with repeated optimizations from very different initial policies converging to the same waveform. [sent-181, score-0.087]
95 This expression gives us a quantitative means of evaluating the expected performance of a PG algorithm, although the SNR does not completely capture an algorithm’s capacity to learn. [sent-184, score-0.066]
96 SNR analysis revealed two distinct mechanisms for improving the WP update - perturbing different parameters with different distributions, and using non-Gaussian distributions. [sent-185, score-0.095]
97 Variance reduction techniques for gradient estimates in reinforcement learning. [sent-204, score-0.264]
98 Policy gradient methods for robot control (Technical Report CS-03-787). [sent-231, score-0.187]
99 Evaluation of policy gradient methods and variants on the cart-pole benchmark. [sent-243, score-0.348]
100 Stochastic policy gradient reinforcement learning on a simple 3D biped. [sent-258, score-0.402]
wordName wordTfidf (topN-words)
[('snr', 0.649), ('jw', 0.39), ('jwi', 0.288), ('gradient', 0.187), ('wp', 0.182), ('policy', 0.161), ('pg', 0.141), ('jwj', 0.126), ('shell', 0.126), ('reshaping', 0.11), ('cost', 0.103), ('update', 0.095), ('apping', 0.09), ('uid', 0.09), ('zi', 0.083), ('magnitude', 0.076), ('reshaped', 0.072), ('perturbation', 0.07), ('wt', 0.057), ('variances', 0.054), ('magnitudes', 0.054), ('tedrake', 0.054), ('reinforcement', 0.054), ('sampling', 0.052), ('direction', 0.051), ('noise', 0.05), ('peters', 0.048), ('walk', 0.044), ('expression', 0.043), ('vandenberghe', 0.038), ('vertical', 0.038), ('plate', 0.036), ('anisotropic', 0.036), ('flower', 0.036), ('greensmith', 0.036), ('jabri', 0.036), ('jwk', 0.036), ('jwp', 0.036), ('kohl', 0.036), ('sqp', 0.036), ('wing', 0.036), ('baxter', 0.035), ('schaal', 0.035), ('distributions', 0.035), ('policies', 0.034), ('zj', 0.034), ('trial', 0.033), ('perpendicular', 0.032), ('williams', 0.032), ('zk', 0.031), ('violent', 0.031), ('shelley', 0.031), ('riedmiller', 0.031), ('meuleau', 0.031), ('swings', 0.031), ('gaussian', 0.031), ('variance', 0.03), ('improvement', 0.03), ('updates', 0.03), ('deviation', 0.029), ('affect', 0.028), ('true', 0.028), ('equation', 0.028), ('stone', 0.027), ('humanoid', 0.027), ('facets', 0.027), ('noises', 0.027), ('bias', 0.026), ('exciting', 0.025), ('amari', 0.025), ('drawn', 0.025), ('zhang', 0.025), ('vijayakumar', 0.024), ('zp', 0.024), ('demonstration', 0.024), ('suggests', 0.024), ('weight', 0.024), ('motion', 0.024), ('bartlett', 0.024), ('identity', 0.023), ('reduction', 0.023), ('additive', 0.023), ('performance', 0.023), ('et', 0.023), ('marks', 0.022), ('distribution', 0.022), ('taylor', 0.022), ('cations', 0.021), ('modi', 0.021), ('impractical', 0.02), ('facilitates', 0.02), ('parallel', 0.02), ('relationship', 0.02), ('dynamics', 0.02), ('natural', 0.02), ('system', 0.02), ('evaluation', 0.019), ('converging', 0.019), ('upon', 0.019), ('expansion', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999899 210 nips-2008-Signal-to-Noise Ratio Analysis of Policy Gradient Algorithms
Author: John W. Roberts, Russ Tedrake
Abstract: Policy gradient (PG) reinforcement learning algorithms have strong (local) convergence guarantees, but their learning performance is typically limited by a large variance in the estimate of the gradient. In this paper, we formulate the variance reduction problem by describing a signal-to-noise ratio (SNR) for policy gradient algorithms, and evaluate this SNR carefully for the popular Weight Perturbation (WP) algorithm. We confirm that SNR is a good predictor of long-term learning performance, and that in our episodic formulation, the cost-to-go function is indeed the optimal baseline. We then propose two modifications to traditional model-free policy gradient algorithms in order to optimize the SNR. First, we examine WP using anisotropic sampling distributions, which introduces a bias into the update but increases the SNR; this bias can be interpreted as following the natural gradient of the cost function. Second, we show that non-Gaussian distributions can also increase the SNR, and argue that the optimal isotropic distribution is a ‘shell’ distribution with a constant magnitude and uniform distribution in direction. We demonstrate that both modifications produce substantial improvements in learning performance in challenging policy gradient experiments. 1
2 0.46463087 198 nips-2008-Resolution Limits of Sparse Coding in High Dimensions
Author: Sundeep Rangan, Vivek Goyal, Alyson K. Fletcher
Abstract: This paper addresses the problem of sparsity pattern detection for unknown ksparse n-dimensional signals observed through m noisy, random linear measurements. Sparsity pattern recovery arises in a number of settings including statistical model selection, pattern detection, and image acquisition. The main results in this paper are necessary and sufficient conditions for asymptotically-reliable sparsity pattern recovery in terms of the dimensions m, n and k as well as the signal-tonoise ratio (SNR) and the minimum-to-average ratio (MAR) of the nonzero entries of the signal. We show that m > 2k log(n − k)/(SNR · MAR) is necessary for any algorithm to succeed, regardless of complexity; this matches a previous sufficient condition for maximum likelihood estimation within a constant factor under certain scalings of k, SNR and MAR with n. We also show a sufficient condition for a computationally-trivial thresholding algorithm that is larger than the previous expression by only a factor of 4(1 + SNR) and larger than the requirement for lasso by only a factor of 4/MAR. This provides insight on the precise value and limitations of convex programming-based algorithms.
3 0.13004819 181 nips-2008-Policy Search for Motor Primitives in Robotics
Author: Jens Kober, Jan R. Peters
Abstract: Many motor skills in humanoid robotics can be learned using parametrized motor primitives as done in imitation learning. However, most interesting motor learning problems are high-dimensional reinforcement learning problems often beyond the reach of current methods. In this paper, we extend previous work on policy learning from the immediate reward case to episodic reinforcement learning. We show that this results in a general, common framework also connected to policy gradient methods and yielding a novel algorithm for policy learning that is particularly well-suited for dynamic motor primitives. The resulting algorithm is an EM-inspired algorithm applicable to complex motor learning tasks. We compare this algorithm to several well-known parametrized policy search methods and show that it outperforms them. We apply it in the context of motor learning and show that it can learn a complex Ball-in-a-Cup task using a real Barrett WAMTM robot arm. 1
4 0.11156663 131 nips-2008-MDPs with Non-Deterministic Policies
Author: Mahdi M. Fard, Joelle Pineau
Abstract: Markov Decision Processes (MDPs) have been extensively studied and used in the context of planning and decision-making, and many methods exist to find the optimal policy for problems modelled as MDPs. Although finding the optimal policy is sufficient in many domains, in certain applications such as decision support systems where the policy is executed by a human (rather than a machine), finding all possible near-optimal policies might be useful as it provides more flexibility to the person executing the policy. In this paper we introduce the new concept of non-deterministic MDP policies, and address the question of finding near-optimal non-deterministic policies. We propose two solutions to this problem, one based on a Mixed Integer Program and the other one based on a search algorithm. We include experimental results obtained from applying this framework to optimize treatment choices in the context of a medical decision support system. 1
5 0.11025301 195 nips-2008-Regularized Policy Iteration
Author: Amir M. Farahmand, Mohammad Ghavamzadeh, Shie Mannor, Csaba Szepesvári
Abstract: In this paper we consider approximate policy-iteration-based reinforcement learning algorithms. In order to implement a flexible function approximation scheme we propose the use of non-parametric methods with regularization, providing a convenient way to control the complexity of the function approximator. We propose two novel regularized policy iteration algorithms by adding L2 -regularization to two widely-used policy evaluation methods: Bellman residual minimization (BRM) and least-squares temporal difference learning (LSTD). We derive efficient implementation for our algorithms when the approximate value-functions belong to a reproducing kernel Hilbert space. We also provide finite-sample performance bounds for our algorithms and show that they are able to achieve optimal rates of convergence under the studied conditions. 1
6 0.087268203 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs
7 0.085136242 1 nips-2008-A Convergent $O(n)$ Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation
8 0.075520135 214 nips-2008-Sparse Online Learning via Truncated Gradient
9 0.060340479 25 nips-2008-An interior-point stochastic approximation method and an L1-regularized delta rule
10 0.056836028 173 nips-2008-Optimization on a Budget: A Reinforcement Learning Approach
11 0.055780094 24 nips-2008-An improved estimator of Variance Explained in the presence of noise
12 0.055576835 16 nips-2008-Adaptive Template Matching with Shift-Invariant Semi-NMF
13 0.053239174 199 nips-2008-Risk Bounds for Randomized Sample Compressed Classifiers
14 0.053232074 87 nips-2008-Fitted Q-iteration by Advantage Weighted Regression
15 0.051865689 150 nips-2008-Near-optimal Regret Bounds for Reinforcement Learning
16 0.051230866 74 nips-2008-Estimating the Location and Orientation of Complex, Correlated Neural Activity using MEG
17 0.048180077 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization
18 0.046876058 32 nips-2008-Bayesian Kernel Shaping for Learning Control
19 0.044504371 218 nips-2008-Spectral Clustering with Perturbed Data
20 0.043335553 62 nips-2008-Differentiable Sparse Coding
topicId topicWeight
[(0, -0.147), (1, 0.126), (2, -0.021), (3, 0.006), (4, 0.111), (5, 0.104), (6, -0.05), (7, -0.035), (8, -0.121), (9, 0.092), (10, -0.111), (11, -0.058), (12, 0.044), (13, 0.067), (14, -0.048), (15, -0.04), (16, -0.087), (17, 0.057), (18, 0.042), (19, -0.064), (20, 0.259), (21, 0.122), (22, 0.259), (23, 0.002), (24, -0.085), (25, 0.05), (26, -0.343), (27, -0.172), (28, 0.11), (29, -0.135), (30, -0.364), (31, 0.148), (32, -0.004), (33, 0.204), (34, -0.028), (35, -0.082), (36, -0.049), (37, -0.013), (38, -0.032), (39, 0.024), (40, 0.102), (41, -0.001), (42, 0.049), (43, -0.022), (44, 0.033), (45, 0.092), (46, 0.031), (47, -0.052), (48, -0.041), (49, -0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.96171713 210 nips-2008-Signal-to-Noise Ratio Analysis of Policy Gradient Algorithms
Author: John W. Roberts, Russ Tedrake
Abstract: Policy gradient (PG) reinforcement learning algorithms have strong (local) convergence guarantees, but their learning performance is typically limited by a large variance in the estimate of the gradient. In this paper, we formulate the variance reduction problem by describing a signal-to-noise ratio (SNR) for policy gradient algorithms, and evaluate this SNR carefully for the popular Weight Perturbation (WP) algorithm. We confirm that SNR is a good predictor of long-term learning performance, and that in our episodic formulation, the cost-to-go function is indeed the optimal baseline. We then propose two modifications to traditional model-free policy gradient algorithms in order to optimize the SNR. First, we examine WP using anisotropic sampling distributions, which introduces a bias into the update but increases the SNR; this bias can be interpreted as following the natural gradient of the cost function. Second, we show that non-Gaussian distributions can also increase the SNR, and argue that the optimal isotropic distribution is a ‘shell’ distribution with a constant magnitude and uniform distribution in direction. We demonstrate that both modifications produce substantial improvements in learning performance in challenging policy gradient experiments. 1
2 0.85130531 198 nips-2008-Resolution Limits of Sparse Coding in High Dimensions
Author: Sundeep Rangan, Vivek Goyal, Alyson K. Fletcher
Abstract: This paper addresses the problem of sparsity pattern detection for unknown ksparse n-dimensional signals observed through m noisy, random linear measurements. Sparsity pattern recovery arises in a number of settings including statistical model selection, pattern detection, and image acquisition. The main results in this paper are necessary and sufficient conditions for asymptotically-reliable sparsity pattern recovery in terms of the dimensions m, n and k as well as the signal-tonoise ratio (SNR) and the minimum-to-average ratio (MAR) of the nonzero entries of the signal. We show that m > 2k log(n − k)/(SNR · MAR) is necessary for any algorithm to succeed, regardless of complexity; this matches a previous sufficient condition for maximum likelihood estimation within a constant factor under certain scalings of k, SNR and MAR with n. We also show a sufficient condition for a computationally-trivial thresholding algorithm that is larger than the previous expression by only a factor of 4(1 + SNR) and larger than the requirement for lasso by only a factor of 4/MAR. This provides insight on the precise value and limitations of convex programming-based algorithms.
3 0.30737308 181 nips-2008-Policy Search for Motor Primitives in Robotics
Author: Jens Kober, Jan R. Peters
Abstract: Many motor skills in humanoid robotics can be learned using parametrized motor primitives as done in imitation learning. However, most interesting motor learning problems are high-dimensional reinforcement learning problems often beyond the reach of current methods. In this paper, we extend previous work on policy learning from the immediate reward case to episodic reinforcement learning. We show that this results in a general, common framework also connected to policy gradient methods and yielding a novel algorithm for policy learning that is particularly well-suited for dynamic motor primitives. The resulting algorithm is an EM-inspired algorithm applicable to complex motor learning tasks. We compare this algorithm to several well-known parametrized policy search methods and show that it outperforms them. We apply it in the context of motor learning and show that it can learn a complex Ball-in-a-Cup task using a real Barrett WAMTM robot arm. 1
4 0.25779063 214 nips-2008-Sparse Online Learning via Truncated Gradient
Author: John Langford, Lihong Li, Tong Zhang
Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of online-learning algorithms with convex loss. This method has several essential properties. First, the degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. Second, the approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. Finally, the approach works well empirically. We apply it to several datasets and find for datasets with large numbers of features, substantial sparsity is discoverable. 1
5 0.25095451 195 nips-2008-Regularized Policy Iteration
Author: Amir M. Farahmand, Mohammad Ghavamzadeh, Shie Mannor, Csaba Szepesvári
Abstract: In this paper we consider approximate policy-iteration-based reinforcement learning algorithms. In order to implement a flexible function approximation scheme we propose the use of non-parametric methods with regularization, providing a convenient way to control the complexity of the function approximator. We propose two novel regularized policy iteration algorithms by adding L2 -regularization to two widely-used policy evaluation methods: Bellman residual minimization (BRM) and least-squares temporal difference learning (LSTD). We derive efficient implementation for our algorithms when the approximate value-functions belong to a reproducing kernel Hilbert space. We also provide finite-sample performance bounds for our algorithms and show that they are able to achieve optimal rates of convergence under the studied conditions. 1
6 0.24439642 131 nips-2008-MDPs with Non-Deterministic Policies
7 0.22345115 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs
8 0.2168024 16 nips-2008-Adaptive Template Matching with Shift-Invariant Semi-NMF
9 0.2156359 215 nips-2008-Sparse Signal Recovery Using Markov Random Fields
10 0.21500033 173 nips-2008-Optimization on a Budget: A Reinforcement Learning Approach
11 0.21000053 74 nips-2008-Estimating the Location and Orientation of Complex, Correlated Neural Activity using MEG
12 0.1921431 24 nips-2008-An improved estimator of Variance Explained in the presence of noise
13 0.19134764 1 nips-2008-A Convergent $O(n)$ Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation
14 0.18211682 87 nips-2008-Fitted Q-iteration by Advantage Weighted Regression
15 0.1820126 149 nips-2008-Near-minimax recursive density estimation on the binary hypercube
16 0.17580195 32 nips-2008-Bayesian Kernel Shaping for Learning Control
17 0.16652629 25 nips-2008-An interior-point stochastic approximation method and an L1-regularized delta rule
18 0.15974851 186 nips-2008-Probabilistic detection of short events, with application to critical care monitoring
19 0.15220293 172 nips-2008-Optimal Response Initiation: Why Recent Experience Matters
20 0.15081294 134 nips-2008-Mixed Membership Stochastic Blockmodels
topicId topicWeight
[(4, 0.039), (6, 0.097), (7, 0.068), (12, 0.028), (15, 0.013), (28, 0.145), (31, 0.263), (38, 0.011), (57, 0.061), (59, 0.017), (63, 0.05), (71, 0.019), (77, 0.037), (78, 0.013), (83, 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.77452141 210 nips-2008-Signal-to-Noise Ratio Analysis of Policy Gradient Algorithms
Author: John W. Roberts, Russ Tedrake
Abstract: Policy gradient (PG) reinforcement learning algorithms have strong (local) convergence guarantees, but their learning performance is typically limited by a large variance in the estimate of the gradient. In this paper, we formulate the variance reduction problem by describing a signal-to-noise ratio (SNR) for policy gradient algorithms, and evaluate this SNR carefully for the popular Weight Perturbation (WP) algorithm. We confirm that SNR is a good predictor of long-term learning performance, and that in our episodic formulation, the cost-to-go function is indeed the optimal baseline. We then propose two modifications to traditional model-free policy gradient algorithms in order to optimize the SNR. First, we examine WP using anisotropic sampling distributions, which introduces a bias into the update but increases the SNR; this bias can be interpreted as following the natural gradient of the cost function. Second, we show that non-Gaussian distributions can also increase the SNR, and argue that the optimal isotropic distribution is a ‘shell’ distribution with a constant magnitude and uniform distribution in direction. We demonstrate that both modifications produce substantial improvements in learning performance in challenging policy gradient experiments. 1
2 0.72660166 106 nips-2008-Inferring rankings under constrained sensing
Author: Srikanth Jagabathula, Devavrat Shah
Abstract: Motivated by applications like elections, web-page ranking, revenue maximization etc., we consider the question of inferring popular rankings using constrained data. More specifically, we consider the problem of inferring a probability distribution over the group of permutations using its first order marginals. We first prove that it is not possible to recover more than O(n) permutations over n elements with the given information. We then provide a simple and novel algorithm that can recover up to O(n) permutations under a natural stochastic model; in this sense, the algorithm is optimal. In certain applications, the interest is in recovering only the most popular (or mode) ranking. As a second result, we provide an algorithm based on the Fourier Transform over the symmetric group to recover the mode under a natural majority condition; the algorithm turns out to be a maximum weight matching on an appropriately defined weighted bipartite graph. The questions considered are also thematically related to Fourier Transforms over the symmetric group and the currently popular topic of compressed sensing. 1
3 0.60924751 62 nips-2008-Differentiable Sparse Coding
Author: J. A. Bagnell, David M. Bradley
Abstract: Prior work has shown that features which appear to be biologically plausible as well as empirically useful can be found by sparse coding with a prior such as a laplacian (L1 ) that promotes sparsity. We show how smoother priors can preserve the benefits of these sparse priors while adding stability to the Maximum A-Posteriori (MAP) estimate that makes it more useful for prediction problems. Additionally, we show how to calculate the derivative of the MAP estimate efficiently with implicit differentiation. One prior that can be differentiated this way is KL-regularization. We demonstrate its effectiveness on a wide variety of applications, and find that online optimization of the parameters of the KL-regularized model can significantly improve prediction performance. 1
4 0.60164011 75 nips-2008-Estimating vector fields using sparse basis field expansions
Author: Stefan Haufe, Vadim V. Nikulin, Andreas Ziehe, Klaus-Robert Müller, Guido Nolte
Abstract: We introduce a novel framework for estimating vector fields using sparse basis field expansions (S-FLEX). The notion of basis fields, which are an extension of scalar basis functions, arises naturally in our framework from a rotational invariance requirement. We consider a regression setting as well as inverse problems. All variants discussed lead to second-order cone programming formulations. While our framework is generally applicable to any type of vector field, we focus in this paper on applying it to solving the EEG/MEG inverse problem. It is shown that significantly more precise and neurophysiologically more plausible location and shape estimates of cerebral current sources from EEG/MEG measurements become possible with our method when comparing to the state-of-the-art. 1
5 0.59884185 202 nips-2008-Robust Regression and Lasso
Author: Huan Xu, Constantine Caramanis, Shie Mannor
Abstract: We consider robust least-squares regression with feature-wise disturbance. We show that this formulation leads to tractable convex optimization problems, and we exhibit a particular uncertainty set for which the robust problem is equivalent to 1 regularized regression (Lasso). This provides an interpretation of Lasso from a robust optimization perspective. We generalize this robust formulation to consider more general uncertainty sets, which all lead to tractable convex optimization problems. Therefore, we provide a new methodology for designing regression algorithms, which generalize known formulations. The advantage is that robustness to disturbance is a physical property that can be exploited: in addition to obtaining new formulations, we use it directly to show sparsity properties of Lasso, as well as to prove a general consistency result for robust regression problems, including Lasso, from a unified robustness perspective. 1
6 0.59380478 29 nips-2008-Automatic online tuning for fast Gaussian summation
7 0.59266585 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost
8 0.59242171 96 nips-2008-Hebbian Learning of Bayes Optimal Decisions
9 0.59215963 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization
10 0.592107 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
11 0.59083706 85 nips-2008-Fast Rates for Regularized Objectives
12 0.59022826 226 nips-2008-Supervised Dictionary Learning
13 0.58971274 149 nips-2008-Near-minimax recursive density estimation on the binary hypercube
14 0.58881688 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
15 0.58784044 196 nips-2008-Relative Margin Machines
16 0.58743107 143 nips-2008-Multi-label Multiple Kernel Learning
17 0.58651185 195 nips-2008-Regularized Policy Iteration
18 0.58649552 182 nips-2008-Posterior Consistency of the Silverman g-prior in Bayesian Model Choice
19 0.5864802 54 nips-2008-Covariance Estimation for High Dimensional Data Vectors Using the Sparse Matrix Transform
20 0.58566672 21 nips-2008-An Homotopy Algorithm for the Lasso with Online Observations