nips nips2013 nips2013-28 nips2013-28-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Matteo Pirotta, Marcello Restelli, Luca Bascetta
Abstract: In the last decade, policy gradient methods have significantly grown in popularity in the reinforcement-learning field. In particular, they have been largely employed in motor control and robotic applications, thanks to their ability to cope with continuous state and action domains and partially observable problems. Policy gradient research has mainly focused on identifying effective gradient directions and proposing efficient estimation algorithms. Nonetheless, the performance of policy gradient methods is determined not only by the gradient direction, since convergence properties are strongly influenced by the choice of the step size: small values imply a slow convergence rate, while large values may lead to oscillations or even divergence of the policy parameters. The step-size value is usually chosen by hand tuning, and little attention has been paid to its automatic selection. In this paper, we propose to determine the learning rate by maximizing a lower bound on the expected performance gain. Focusing on Gaussian policies, we derive a lower bound that is a second-order polynomial of the step size, and we show how a simplified version of this lower bound can be maximized when the gradient is estimated from trajectory samples. The properties of the proposed approach are empirically evaluated on a linear-quadratic regulator problem.
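The core idea in the abstract is that a concave quadratic lower bound on the performance gain admits a closed-form maximizing step size. Below is a minimal sketch of that idea, assuming the bound takes the hypothetical form B(alpha) = alpha * ||grad||^2 - c * alpha^2 for some constant c > 0; the constant `c` and the gradient estimate are placeholders for illustration, not the coefficients derived in the paper.

```python
# Sketch: pick the policy-gradient step size by maximizing a concave quadratic
# lower bound on the performance gain (assumed form, see lead-in above).
import numpy as np

def step_size_from_quadratic_bound(grad, c):
    """Maximize B(alpha) = alpha * ||grad||^2 - c * alpha^2 over alpha >= 0.

    The maximizer of this concave quadratic is alpha* = ||grad||^2 / (2 * c).
    """
    g2 = float(np.dot(grad, grad))
    return g2 / (2.0 * c)

# Toy usage with an arbitrary gradient estimate and a placeholder penalty constant.
grad_estimate = np.array([0.3, -0.1])   # e.g., estimated from trajectory samples
c_penalty = 5.0                         # hypothetical: the paper derives this from policy/MDP constants
alpha = step_size_from_quadratic_bound(grad_estimate, c_penalty)

theta = np.zeros(2)
theta = theta + alpha * grad_estimate   # gradient-ascent update with the chosen step size
print(alpha, theta)
```

The point of the sketch is only the mechanics: once the gain is lower-bounded by a concave quadratic in the step size, the "optimal" step follows from setting its derivative to zero, so no hand tuning or line search is needed.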
[1] Jan Peters and Stefan Schaal. Policy gradient methods for robotics. In 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2219–2225. IEEE, 2006.
[2] James C. Spall. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37(3):332–341, 1992.
[3] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, May 1992.
[4] Jonathan Baxter and Peter L. Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.
[5] Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, volume 12, 2000.
[6] Jan Peters and Stefan Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697, 2008.
[7] Sham Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems, volume 14, pages 1531–1538, 2001.
[8] Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomputing, 71(7):1180–1190, 2008.
[9] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
[10] P. Wagner. A reinterpretation of the policy oscillation phenomenon in approximate policy iteration. Advances in Neural Information Processing Systems, 24, 2011.
[11] Jorge J. Moré and David J. Thuente. Line search algorithms with guaranteed sufficient decrease. ACM Transactions on Mathematical Software (TOMS), 20(3):286–307, 1994.
[12] J. Kober and J. Peters. Policy search for motor primitives in robotics. In Advances in Neural Information Processing Systems 22 (NIPS 2008). MIT Press, Cambridge, MA, 2009.
[13] Nikos Vlassis, Marc Toussaint, Georgios Kontes, and Savas Piperidis. Learning model-free robot control by a Monte Carlo EM algorithm. Autonomous Robots, 27(2):123–130, 2009.
[14] S.M. Kakade. On the sample complexity of reinforcement learning. PhD thesis, University College London, 2003.
[15] Matteo Pirotta, Marcello Restelli, Alessio Pecorino, and Daniele Calandriello. Safe policy iteration. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 307–315. JMLR Workshop and Conference Proceedings, May 2013.
[16] M.S. Pinsker. Information and Information Stability of Random Variables and Processes. Holden-Day Series in Time Series Analysis. Holden-Day, Inc., 1964.
[17] Tingting Zhao, Hirotaka Hachiya, Gang Niu, and Masashi Sugiyama. Analysis and improvement of policy gradient estimation. Neural Networks, 26:118–129, 2012.