
36 nips-2011-Analysis and Improvement of Policy Gradient Estimation


Source: pdf

Author: Tingting Zhao, Hirotaka Hachiya, Gang Niu, Masashi Sugiyama

Abstract: Policy gradient is a useful model-free reinforcement learning approach, but it tends to suffer from instability of gradient estimates. In this paper, we analyze and improve the stability of policy gradient methods. We first prove that the variance of gradient estimates in the PGPE (policy gradients with parameter-based exploration) method is smaller than that of the classical REINFORCE method under a mild assumption. We then derive the optimal baseline for PGPE, which contributes to further reducing the variance. We also theoretically show that PGPE with the optimal baseline is preferable to REINFORCE with the optimal baseline in terms of the variance of gradient estimates. Finally, we demonstrate the usefulness of the improved PGPE method through experiments.
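To make the abstract's idea concrete, below is a minimal sketch (not the authors' code) of PGPE-style gradient estimation with a sample-based baseline of the variance-minimizing form E[R ||∇ log p(θ)||²] / E[||∇ log p(θ)||²]. It assumes a factorized Gaussian hyper-distribution over policy parameters, θ ~ N(η, τ²), and a hypothetical environment hook `sample_policy_return(theta)` that runs one episode with the deterministic policy parameterized by θ and returns its return.

```python
import numpy as np

def pgpe_gradient(sample_policy_return, eta, tau, n_rollouts=100, rng=None):
    """One PGPE gradient estimate with a sample-based (approximately optimal) baseline.

    A sketch under the assumptions stated above; `sample_policy_return` is a
    user-supplied, hypothetical rollout function, not part of the paper.
    """
    rng = rng or np.random.default_rng()
    # Parameter-space exploration: draw policy parameters from N(eta, tau^2).
    thetas = rng.normal(eta, tau, size=(n_rollouts, eta.size))
    returns = np.array([sample_policy_return(th) for th in thetas])

    # Per-sample gradients of log N(theta; eta, tau^2) w.r.t. eta and tau.
    g_eta = (thetas - eta) / tau**2
    g_tau = ((thetas - eta)**2 - tau**2) / tau**3

    # Baseline of the form E[R * ||g||^2] / E[||g||^2], estimated from the rollouts.
    sq_norm = (g_eta**2).sum(axis=1) + (g_tau**2).sum(axis=1)
    baseline = (returns * sq_norm).mean() / sq_norm.mean()

    # Baseline-subtracted Monte Carlo gradient estimates.
    centered = (returns - baseline)[:, None]
    grad_eta = (centered * g_eta).mean(axis=0)
    grad_tau = (centered * g_tau).mean(axis=0)
    return grad_eta, grad_tau
```

Because the policy itself is deterministic and all exploration noise lives in parameter space, each trajectory contributes a single sampled θ, which is the intuition behind the variance advantage over REINFORCE discussed in the abstract.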


reference text

[1] N. Abe, P. Melville, C. Pendus, C. K. Reddy, D. L. Jensen, V. P. Thomas, J. J. Bennett, G. F. Anderson, B. R. Cooley, M. Kowalczyk, M. Domick, and T. Gardinier. Optimizing debt collections using constrained reinforcement learning. In Proceedings of the 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 75–84, 2010.

[2] J. Baxter, P. Bartlett, and L. Weaver. Experiments with infinite-horizon, policy-gradient estimation. Journal of Artificial Intelligence Research, 15:351–381, 2001.

[3] M. Bugeja. Non-linear swing-up and stabilizing control of an inverted pendulum system. In Proceedings of IEEE Region 8 EUROCON, volume 2, pages 437–441, 2003.

[4] P. Dayan and G. E. Hinton. Using expectation-maximization for reinforcement learning. Neural Computation, 9(2):271–278, 1997.

[5] E. Greensmith, P. L. Bartlett, and J. Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5:1471–1530, 2004.

[6] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.

[7] S. Kakade. A natural policy gradient. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 1531–1538, Cambridge, MA, 2002. MIT Press.

[8] M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.

[9] P. Marbach and J. N. Tsitsiklis. Approximate gradient methods in policy-space optimization of Markov reward processes. Discrete Event Dynamic Systems, 13(1-2):111–148, 2004.

[10] J. Peters and S. Schaal. Policy gradient methods for robotics. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2006.

[11] F. Sehnke, C. Osendorfer, T. Rückstiess, A. Graves, J. Peters, and J. Schmidhuber. Policy gradients with parameter-based exploration for control. In Proceedings of the 18th International Conference on Artificial Neural Networks, pages 387–396, 2008.

[12] F. Sehnke, C. Osendorfer, T. Rückstiess, A. Graves, J. Peters, and J. Schmidhuber. Parameter-exploring policy gradients. Neural Networks, 23(4):551–559, 2010.

[13] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA, 1998.

[14] G. Tesauro. TD-gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6(2):215–219, 1994.

[15] L. Weaver and J. Baxter. Reinforcement learning from state and temporal differences. Technical report, Department of Computer Science, Australian National University, 1999.

[16] L. Weaver and N. Tao. The optimal reward baseline for gradient-based reinforcement learning. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 538–545, 2001.

[17] J. D. Williams and S. Young. Partially observable Markov decision processes for spoken dialog systems. Computer Speech and Language, 21(2):393–422, 2007.

[18] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.