nips nips2012 nips2012-364 nips2012-364-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Tsuyoshi Ueno, Kohei Hayashi, Takashi Washio, Yoshinobu Kawahara
Abstract: Reinforcement learning (RL) methods based on direct policy search (DPS) have been actively studied as an efficient approach to complicated Markov decision processes (MDPs). Although they have brought much progress in practical applications of RL, model selection for the policy remains an unsolved problem in DPS. In this paper, we propose a novel DPS method, weighted likelihood policy search (WLPS), in which a policy is efficiently learned through weighted likelihood estimation. WLPS naturally connects DPS to a statistical inference problem, so various sophisticated techniques from statistics can be applied directly to DPS. Following the idea of information criteria, we then develop a new measure for model comparison in DPS based on the weighted log-likelihood.
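The abstract describes two ingredients: fitting a policy by maximizing a weighted log-likelihood, and comparing candidate policy models with an information-criterion-style score built from that same weighted log-likelihood. The paper's exact WLPS formulation is not reproduced in this excerpt, so the sketch below is only a rough illustration under assumed simplifications (a linear-Gaussian policy, normalized returns used as weights, and an AIC-like penalty); the function names and toy data are hypothetical, not taken from the paper.

# Minimal sketch (not the authors' WLPS algorithm): return-weighted maximum
# likelihood for a linear-Gaussian policy, plus an AIC-style score based on
# the weighted log-likelihood for comparing policy models of different sizes.
import numpy as np

def fit_weighted_gaussian_policy(Phi, A, w, ridge=1e-6):
    """Weighted MLE for a policy a ~ N(Phi @ theta, sigma^2).

    Phi : (N, d) state-feature matrix
    A   : (N,)   executed actions
    w   : (N,)   nonnegative weights (e.g. normalized returns)
    """
    d = Phi.shape[1]
    theta = np.linalg.solve(Phi.T @ (w[:, None] * Phi) + ridge * np.eye(d),
                            Phi.T @ (w * A))
    resid = A - Phi @ theta
    sigma2 = float(w @ resid**2 / w.sum())
    return theta, sigma2

def weighted_loglik(Phi, A, w, theta, sigma2):
    """Weighted Gaussian log-likelihood of the actions under the fitted policy."""
    resid = A - Phi @ theta
    ll = -0.5 * (np.log(2 * np.pi * sigma2) + resid**2 / sigma2)
    return float(w @ ll)

def aic_like_score(Phi, A, w, theta, sigma2):
    """AIC-style criterion: weighted log-likelihood minus number of parameters."""
    k = Phi.shape[1] + 1  # theta plus sigma^2
    return weighted_loglik(Phi, A, w, theta, sigma2) - k

# Toy comparison of two policy models (different feature sets).
rng = np.random.default_rng(0)
S = rng.uniform(-1, 1, size=200)
A = 0.8 * S + 0.1 * rng.normal(size=200)   # actions from a near-linear behavior
w = np.exp(-(A - 0.8 * S)**2)              # stand-in for normalized returns
for name, Phi in [("linear", np.c_[S, np.ones_like(S)]),
                  ("cubic",  np.c_[S, S**2, S**3, np.ones_like(S)])]:
    theta, sigma2 = fit_weighted_gaussian_policy(Phi, A, w)
    print(name, round(aic_like_score(Phi, A, w, theta, sigma2), 2))

On this toy data both models fit almost equally well, so the extra parameters of the cubic model lower its score, which is the kind of trade-off an information-criterion-style comparison is meant to capture.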
[1] P. Dayan and G. Hinton, “Using expectation-maximization for reinforcement learning,” Neural Computation, vol. 9, no. 2, pp. 271–278, 1997.
[2] J. Baxter and P. L. Bartlett, “Infinite-horizon policy-gradient estimation,” Journal of Artificial Intelligence Research, vol. 15, no. 4, pp. 319–350, 2001.
[3] V. R. Konda and J. N. Tsitsiklis, “On actor-critic algorithms,” SIAM Journal on Control and Optimization, vol. 42, no. 4, pp. 1143–1166, 2003.
[4] J. Peters and S. Schaal, “Reinforcement learning by reward-weighted regression for operational space control,” in Proceedings of the 24th International Conference on Machine Learning, 2007.
[5] ——, “Natural actor-critic,” Neurocomputing, vol. 71, no. 7-9, pp. 1180–1190, 2008.
[6] N. Vlassis, M. Toussaint, G. Kontes, and S. Piperidis, “Learning model-free robot control by a Monte Carlo EM algorithm,” Autonomous Robots, vol. 27, no. 2, pp. 123–130, 2009.
[7] E. Theodorou, J. Buchli, and S. Schaal, “A generalized path integral control approach to reinforcement learning,” Journal of Machine Learning Research, vol. 11, pp. 3137–3181, 2010.
[8] J. Peters, K. Mülling, and Y. Altün, “Relative entropy policy search,” in Proceedings of the 24th National Conference on Artificial Intelligence, 2010.
[9] J. Kober and J. Peters, “Policy search for motor primitives in robotics,” Machine Learning, vol. 84, no. 1-2, pp. 171–203, 2011.
[10] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.
[11] H. Akaike, “A new look at the statistical model identification,” IEEE Transactions on Automatic Control, vol. 19, no. 6, pp. 716–723, 1974.
[12] G. Schwarz, “Estimating the dimension of a model,” The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.
[13] A. Farahmand and C. Szepesvári, “Model selection in reinforcement learning,” Machine Learning, pp. 1–34, 2011.
[14] M. M. Fard and J. Pineau, “PAC-Bayesian model selection for reinforcement learning,” in Advances in Neural Information Processing Systems 22, 2010.
[15] H. Hachiya, J. Peters, and M. Sugiyama, “Reward-weighted regression with sample reuse for direct policy search in reinforcement learning,” Neural Computation, vol. 23, no. 11, pp. 2798–2832, 2011.
[16] M. G. Azar and H. J. Kappen, “Dynamic policy programming,” Tech. Rep. arXiv:1004.202, 2010.
[17] H. Kappen, V. Gómez, and M. Opper, “Optimal control as a graphical model inference problem,” Machine Learning, pp. 1–24, 2012.
[18] K. Rawlik, M. Toussaint, and S. Vijayakumar, “On stochastic optimal control and reinforcement learning by approximate inference,” in International Conference on Robotics Science and Systems, 2012.
[19] R. C. Bradley, “Basic properties of strong mixing conditions. A survey and some open questions,” Probability Surveys, vol. 2, pp. 107–144, 2005.
[20] R. Munos and C. Szepesvári, “Finite-time bounds for fitted value iteration,” Journal of Machine Learning Research, vol. 9, pp. 815–857, 2008.
[21] A. Lazaric, M. Ghavamzadeh, and R. Munos, “Finite-sample analysis of least-squares policy iteration,” Journal of Machine Learning Research, vol. 13, pp. 3041–3074, 2012.
[22] V. P. Godambe, Ed., Estimating Functions. Oxford University Press, 1991.
[23] S. Amari and M. Kawanabe, “Information geometry of estimating functions in semi-parametric statistical models,” Bernoulli, vol. 3, no. 1, pp. 29–54, 1997.
[24] P. J. Bickel, C. A. Klaassen, Y. Ritov, and J. A. Wellner, Efficient and Adaptive Estimation for Semiparametric Models. Springer, 1998.
[25] G. Bouchard and B. Triggs, “The tradeoff between generative and discriminative classifiers,” in Proceedings of the 16th IASC International Symposium on Computational Statistics, 2004, pp. 721–728.
[26] E. Greensmith, P. L. Bartlett, and J. Baxter, “Variance reduction techniques for gradient estimates in reinforcement learning,” Journal of Machine Learning Research, vol. 5, pp. 1471–1530, 2004.
[27] R. Munos, “Geometric variance reduction in Markov chains: application to value function and gradient estimation,” Journal of Machine Learning Research, vol. 7, pp. 413–427, 2006.
[28] T. Zhao, H. Hachiya, G. Niu, and M. Sugiyama, “Analysis and improvement of policy gradient estimation,” Neural Networks, 2011.
[29] T. Ueno, S. Maeda, M. Kawanabe, and S. Ishii, “Generalized TD learning,” Journal of Machine Learning Research, vol. 12, pp. 1977–2020, 2011.