nips nips2008 nips2008-210 nips2008-210-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Authors: John W. Roberts, Russ Tedrake
Abstract: Policy gradient (PG) reinforcement learning algorithms have strong (local) convergence guarantees, but their learning performance is typically limited by a large variance in the estimate of the gradient. In this paper, we formulate the variance reduction problem by describing a signal-to-noise ratio (SNR) for policy gradient algorithms, and evaluate this SNR carefully for the popular Weight Perturbation (WP) algorithm. We confirm that SNR is a good predictor of long-term learning performance, and that in our episodic formulation, the cost-to-go function is indeed the optimal baseline. We then propose two modifications to traditional model-free policy gradient algorithms in order to optimize the SNR. First, we examine WP using anisotropic sampling distributions, which introduces a bias into the update but increases the SNR; this bias can be interpreted as following the natural gradient of the cost function. Second, we show that non-Gaussian distributions can also increase the SNR, and argue that the optimal isotropic distribution is a ‘shell’ distribution with a constant magnitude and uniform distribution in direction. We demonstrate that both modifications produce substantial improvements in learning performance in challenging policy gradient experiments.
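To make the update concrete, below is a minimal NumPy sketch of the standard weight-perturbation step the abstract builds on, using the unperturbed cost as the baseline and an optional constant-magnitude 'shell' perturbation. This is an illustrative sketch, not the authors' implementation; the function and parameter names (wp_update, eta, sigma, shell) are assumptions introduced here.

```python
import numpy as np

def wp_update(theta, J, eta=0.1, sigma=0.3, shell=False, rng=None):
    """One weight-perturbation (WP) step on a scalar cost J(theta).

    Update: theta <- theta - eta * (J(theta + z) - J(theta)) * z,
    with the unperturbed cost J(theta) as the baseline. With shell=True the
    perturbation z is drawn from a constant-magnitude 'shell' distribution
    (uniform in direction, |z| = sigma) instead of an isotropic Gaussian.
    Names and defaults are illustrative, not taken from the paper's code.
    """
    rng = np.random.default_rng() if rng is None else rng
    z = rng.normal(scale=sigma, size=theta.shape)
    if shell:
        z *= sigma / np.linalg.norm(z)  # project onto the sphere of radius sigma
    baseline = J(theta)                  # cost-to-go baseline
    return theta - eta * (J(theta + z) - baseline) * z

# Toy usage: minimize a quadratic cost with the model-free WP update.
if __name__ == "__main__":
    J = lambda th: float(np.sum(th ** 2))
    theta = np.ones(5)
    for _ in range(2000):
        theta = wp_update(theta, J, shell=True)
    print(J(theta))  # should be small after training
```

In expectation, the Gaussian version of this update follows the negative cost gradient scaled by eta and sigma squared; the shell variant keeps the same mean direction while removing the variability contributed by the random perturbation magnitude, which is the intuition behind the SNR argument in the abstract.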
References:
Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251–276.
Baxter, J., & Bartlett, P. (2001). Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15, 319–350.
Greensmith, E., Bartlett, P. L., & Baxter, J. (2004). Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5, 1471–1530.
Jabri, M., & Flower, B. (1992). Weight perturbation: An optimal architecture and learning technique for analog VLSI feedforward and recurrent multilayer networks. IEEE Transactions on Neural Networks, 3, 154–157.
Kohl, N., & Stone, P. (2004). Policy gradient reinforcement learning for fast quadrupedal locomotion. Proceedings of the IEEE International Conference on Robotics and Automation (ICRA).
Meuleau, N., Peshkin, L., Kaelbling, L. P., & Kim, K.-E. (2000). Off-policy policy search. NIPS.
Peters, J., Vijayakumar, S., & Schaal, S. (2003a). Policy gradient methods for robot control (Technical Report CS-03-787). University of Southern California.
Peters, J., Vijayakumar, S., & Schaal, S. (2003b). Reinforcement learning for humanoid robotics. Proceedings of the Third IEEE-RAS International Conference on Humanoid Robots.
Riedmiller, M., Peters, J., & Schaal, S. (2007). Evaluation of policy gradient methods and variants on the cart-pole benchmark. Symposium on Approximate Dynamic Programming and Reinforcement Learning (pp. 254–261).
Shelley, M., Vandenberghe, N., & Zhang, J. (2005). Heavy flags undergo spontaneous oscillations in flowing water. Physical Review Letters, 94.
Tedrake, R., Zhang, T. W., & Seung, H. S. (2004). Stochastic policy gradient reinforcement learning on a simple 3D biped. Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS) (pp. 2849–2854). Sendai, Japan.
Vandenberghe, N., Zhang, J., & Childress, S. (2004). Symmetry breaking leads to forward flapping flight. Journal of Fluid Mechanics, 506, 147–155.
Williams, J. L., Fisher, J. W., III, & Willsky, A. S. (2006). Importance sampling actor-critic algorithms. Proceedings of the 2006 American Control Conference.
Williams, R. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229–256.