
364 nips-2012-Weighted Likelihood Policy Search with Model Selection


Source: pdf

Author: Tsuyoshi Ueno, Kohei Hayashi, Takashi Washio, Yoshinobu Kawahara

Abstract: Reinforcement learning (RL) methods based on direct policy search (DPS) have been actively discussed as an efficient approach to complicated Markov decision processes (MDPs). Although they have brought much progress in practical applications of RL, an unsolved problem remains in DPS: model selection for the policy. In this paper, we propose a novel DPS method, weighted likelihood policy search (WLPS), in which a policy is efficiently learned through weighted likelihood estimation. WLPS naturally connects DPS to statistical inference, so that various sophisticated techniques from statistics can be applied directly to DPS problems. Following the idea of the information criterion, we then develop a new measure for model comparison in DPS based on the weighted log-likelihood.
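The paper itself defines the WLPS update and its weighted-log-likelihood criterion; the sketch below is only a rough illustration of the general flavor of reward-weighted maximum-likelihood policy fitting (in the spirit of reward-weighted regression [4, 15]) combined with an AIC-style penalty for comparing candidate policy models. The linear-Gaussian policy class, the exponentiated-return weights, and the aic_like_score penalty are assumptions made here for illustration, not the authors' construction.

import numpy as np

def fit_weighted_gaussian_policy(Phi, actions, weights):
    # Weighted maximum-likelihood fit of a linear-Gaussian policy a ~ N(Phi @ theta, sigma^2):
    # weighted least squares for theta, weighted residual variance for sigma^2.
    W = np.diag(weights)
    theta = np.linalg.solve(Phi.T @ W @ Phi, Phi.T @ W @ actions)
    resid = actions - Phi @ theta
    sigma2 = np.sum(weights * resid ** 2) / np.sum(weights)
    return theta, sigma2

def weighted_loglik(Phi, actions, weights, theta, sigma2):
    # Weighted Gaussian log-likelihood of the observed actions under the fitted policy.
    resid = actions - Phi @ theta
    ll = -0.5 * (np.log(2.0 * np.pi * sigma2) + resid ** 2 / sigma2)
    return float(np.sum(weights * ll))

def aic_like_score(Phi, actions, weights, theta, sigma2):
    # Hypothetical AIC-style criterion: weighted log-likelihood penalized by the
    # number of free parameters (theta plus the noise variance). Higher is better.
    k = Phi.shape[1] + 1
    return weighted_loglik(Phi, actions, weights, theta, sigma2) - k

# Toy usage: compare two candidate state-feature maps on synthetic one-step data.
rng = np.random.default_rng(0)
states = rng.uniform(-1.0, 1.0, size=200)
actions = 0.8 * states + 0.1 * rng.standard_normal(200)
returns = -np.abs(actions - 0.8 * states)          # surrogate per-sample returns
weights = np.exp(returns - returns.max())          # exponentiated returns as weights
candidates = {
    "linear": np.column_stack([states]),
    "cubic": np.column_stack([states, states ** 2, states ** 3]),
}
for name, Phi in candidates.items():
    theta, sigma2 = fit_weighted_gaussian_policy(Phi, actions, weights)
    print(name, round(aic_like_score(Phi, actions, weights, theta, sigma2), 2))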


reference text

[1] P. Dayan and G. Hinton, “Using expectation-maximization for reinforcement learning,” Neural Computation, vol. 9, no. 2, pp. 271–278, 1997.

[2] J. Baxter and P. L. Bartlett, “Infinite-horizon policy-gradient estimation,” Journal of Artificial Intelligence Research, vol. 15, no. 4, pp. 319–350, 2001.

[3] V. R. Konda and J. N. Tsitsiklis, “On actor-critic algorithms,” SIAM Journal on Control and Optimization, vol. 42, no. 4, pp. 1143–1166, 2003.

[4] J. Peters and S. Schaal, “Reinforcement learning by reward-weighted regression for operational space control,” in Proceedings of the 24th International Conference on Machine Learning, 2007.

[5] J. Peters and S. Schaal, “Natural actor-critic,” Neurocomputing, vol. 71, no. 7-9, pp. 1180–1190, 2008.

[6] N. Vlassis, M. Toussaint, G. Kontes, and S. Piperidis, “Learning model-free robot control by a Monte Carlo EM algorithm,” Autonomous Robots, vol. 27, no. 2, pp. 123–130, 2009.

[7] E. Theodorou, J. Buchli, and S. Schaal, “A generalized path integral control approach to reinforcement learning,” Journal of Machine Learning Research, vol. 11, pp. 3137–3181, 2010.

[8] J. Peters, K. Mülling, and Y. Altün, “Relative entropy policy search,” in Proceedings of the 24th National Conference on Artificial Intelligence, 2010.

[9] J. Kober and J. Peters, “Policy search for motor primitives in robotics,” Machine Learning, vol. 84, no. 1-2, pp. 171–203, 2011.

[10] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.

[11] H. Akaike, “A new look at the statistical model identification,” IEEE Transactions on Automatic Control, vol. 19, no. 6, pp. 716–723, 1974.

[12] G. Schwarz, “Estimating the dimension of a model,” The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.

[13] A. Farahmand and C. Szepesvári, “Model selection in reinforcement learning,” Machine Learning, pp. 1–34, 2011.

[14] M. M. Fard and J. Pineau, “PAC-Bayesian model selection for reinforcement learning,” in Advances in Neural Information Processing Systems 22, 2010.

[15] H. Hachiya, J. Peters, and M. Sugiyama, “Reward-weighted regression with sample reuse for direct policy search in reinforcement learning,” Neural Computation, vol. 23, no. 11, pp. 2798–2832, 2011.

[16] M. G. Azar and H. J. Kappen, “Dynamic policy programming,” Tech. Rep. arXiv:1004.2027, 2010.

[17] H. Kappen, V. Gómez, and M. Opper, “Optimal control as a graphical model inference problem,” Machine Learning, pp. 1–24, 2012.

[18] K. Rawlik, M. Toussaint, and S. Vijayakumar, “On stochastic optimal control and reinforcement learning by approximate inference,” in Proceedings of Robotics: Science and Systems, 2012.

[19] R. C. Bradley, “Basic properties of strong mixing conditions. A survey and some open questions,” Probability Surveys, vol. 2, pp. 107–144, 2005.

[20] R. Munos and C. Szepesvári, “Finite-time bounds for fitted value iteration,” Journal of Machine Learning Research, vol. 9, pp. 815–857, 2008.

[21] A. Lazaric, M. Ghavamzadeh, and R. Munos, “Finite-sample analysis of least-squares policy iteration,” Journal of Machine Learning Research, vol. 13, pp. 3041–3074, 2012.

[22] V. P. Godambe, Ed., Estimating Functions. Oxford University Press, 1991.

[23] S. Amari and M. Kawanabe, “Information geometry of estimating functions in semi-parametric statistical models,” Bernoulli, vol. 3, no. 1, pp. 29–54, 1997.

[24] P. J. Bickel, C. A. Klaassen, Y. Ritov, and J. A. Wellner, Efficient and Adaptive Estimation for Semiparametric Models. Springer, 1998.

[25] G. Bouchard and B. Triggs, “The tradeoff between generative and discriminative classifiers,” in Proceedings of the 16th IASC International Symposium on Computational Statistics, 2004, pp. 721–728.

[26] E. Greensmith, P. L. Bartlett, and J. Baxter, “Variance reduction techniques for gradient estimates in reinforcement learning,” Journal of Machine Learning Research, vol. 5, pp. 1471–1530, 2004.

[27] R. Munos, “Geometric variance reduction in Markov chains: application to value function and gradient estimation,” Journal of Machine Learning Research, vol. 7, pp. 413–427, 2006.

[28] T. Zhao, H. Hachiya, G. Niu, and M. Sugiyama, “Analysis and improvement of policy gradient estimation,” Neural Networks, 2011.

[29] T. Ueno, S. Maeda, M. Kawanabe, and S. Ishii, “Generalized TD learning,” Journal of Machine Learning Research, vol. 12, pp. 1977–2020, 2011.