Source: pdf
Author: Tetsuro Morimura, Eiji Uchibe, Junichiro Yoshimoto, Kenji Doya
Abstract: Policy gradient Reinforcement Learning (RL) algorithms have received substantial attention, seeking stochastic policies that maximize the average (or discounted cumulative) reward. In addition, extensions based on the concept of the Natural Gradient (NG) show promising learning efficiency because they take the metric of the task into account. Although there are two candidate metrics, Kakade’s Fisher Information Matrix (FIM) for the policy (action) distribution and Morimura’s FIM for the state-action joint distribution, all RL algorithms with NG have followed Kakade’s approach. In this paper, we describe a generalized Natural Gradient (gNG) that linearly interpolates the two FIMs and propose an efficient implementation for gNG learning based on the theory of estimating functions, the generalized Natural Actor-Critic (gNAC) algorithm. The gNAC algorithm involves a near-optimal auxiliary function to reduce the variance of the gNG estimates. Interestingly, the gNAC can be regarded as a natural extension of the current state-of-the-art NAC algorithm [1], as long as the interpolating parameter is appropriately selected. Numerical experiments showed that the proposed gNAC algorithm estimates the gNG efficiently and outperforms the NAC algorithm.
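To make the interpolation idea concrete, the following is a minimal sketch (not the paper's implementation) of how a generalized natural-gradient direction could be formed from two FIM estimates: the metric is a convex combination of Kakade's policy-distribution FIM and Morimura's state-action joint-distribution FIM, and the ordinary policy gradient is preconditioned by its inverse. The function name, the interpolation parameter name kappa, and the ridge term are assumptions for illustration only.

import numpy as np

def generalized_natural_gradient(F_policy, F_state_action, policy_grad,
                                 kappa=0.5, ridge=1e-6):
    # F_policy       : estimated FIM of the policy (action) distribution, shape (d, d)
    # F_state_action : estimated FIM of the state-action joint distribution, shape (d, d)
    # policy_grad    : ordinary policy-gradient estimate, shape (d,)
    # kappa          : interpolation parameter in [0, 1]; kappa = 0 recovers the
    #                  policy-distribution metric, kappa = 1 the joint-distribution metric
    # Linearly interpolate the two candidate metrics.
    F_kappa = (1.0 - kappa) * F_policy + kappa * F_state_action
    # Precondition the ordinary gradient by the interpolated metric;
    # a small ridge term keeps the linear solve well-posed.
    d = F_kappa.shape[0]
    return np.linalg.solve(F_kappa + ridge * np.eye(d), policy_grad)

In an actor-critic setting, the FIMs and the policy gradient would themselves be estimated from sampled trajectories; the sketch only shows how an interpolating parameter trades off the two metrics in the update direction.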
[1] J. Peters, S. Vijayakumar, and S. Schaal. Natural actor-critic. In European Conference on Machine Learning, 2005.
[2] V. Gullapalli. A stochastic reinforcement learning algorithm for learning real-valued functions. Neural Networks, 3(6):671–692, 1990.
[3] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
[4] J. Baxter and P. Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.
[5] R. Tedrake, T. W. Zhang, and H. S. Seung. Stochastic policy gradient reinforcement learning on a simple 3D biped. In IEEE International Conference on Intelligent Robots and Systems, 2004.
[6] J. Peters and S. Schaal. Policy gradient methods for robotics. In IEEE International Conference on Intelligent Robots and Systems, 2006.
[7] S. Richter, D. Aberdeen, and J. Yu. Natural actor-critic for road traffic optimisation. In Advances in Neural Information Processing Systems. MIT Press, 2007.
[8] S. Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems, volume 14. MIT Press, 2002.
[9] S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
[10] T. Morimura, E. Uchibe, and K. Doya. Utilizing natural gradient in temporal difference reinforcement learning with eligibility traces. In International Symposium on Information Geometry and its Applications, pages 256–263, 2005.
[11] S. Bhatnagar, R. Sutton, M. Ghavamzadeh, and M. Lee. Incremental natural actor-critic algorithms. In Advances in Neural Information Processing Systems, pages 105–112. MIT Press, 2008.
[12] T. Morimura, E. Uchibe, J. Yoshimoto, and K. Doya. A new natural policy gradient by stationary distribution metric. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2008.
[13] T. Morimura, E. Uchibe, J. Yoshimoto, J. Peters, and K. Doya. Derivatives of logarithmic stationary distributions for policy gradient reinforcement learning. Neural Computation (in press).
[14] V. Godambe, editor. Estimating Functions. Oxford University Press, 1991.
[15] D. P. Bertsekas. Dynamic Programming and Optimal Control, Volumes 1 and 2. Athena Scientific, 1995.
[16] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[17] S. Amari and H. Nagaoka. Methods of Information Geometry. Oxford University Press, 2000.
[18] M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.
[19] D. Bagnell and J. Schneider. Covariant policy search. In Proceedings of the International Joint Conference on Artificial Intelligence, July 2003.
[20] S. Amari and M. Kawanabe. Information geometry of estimating functions in semi-parametric statistical models. Bernoulli, 3(1), 1997.
[21] A. C. Singh and R. P. Rao. Optimal instrumental variable estimation for linear models with stochastic regressors using estimating functions. In Symposium on Estimating Functions, pages 177–192, 1996.
[22] B. Chandrasekhar and B. K. Kale. Unbiased statistical estimating functions in presence of nuisance parameters. Journal of Statistical Planning and Inference, 9:45–54, 1984.
[23] V. S. Konda and J. N. Tsitsiklis. On actor-critic algorithms. SIAM Journal on Control and Optimization, 42(4):1143–1166, 2003.
[24] T. Ueno, M. Kawanabe, T. Mori, S. Maeda, and S. Ishii. A semiparametric statistical approach to model-free policy evaluation. In International Conference on Machine Learning, pages 857–864, 2008.
[25] J. A. Boyan. Technical update: Least-squares temporal difference learning. Machine Learning, 49(2-3):233–246, 2002.