Authors: Amir M. Farahmand, Mohammad Ghavamzadeh, Shie Mannor, Csaba Szepesvári
Abstract: In this paper we consider approximate policy-iteration-based reinforcement learning algorithms. In order to implement a flexible function approximation scheme, we propose the use of non-parametric methods with regularization, providing a convenient way to control the complexity of the function approximator. We propose two novel regularized policy iteration algorithms by adding L2-regularization to two widely-used policy evaluation methods: Bellman residual minimization (BRM) and least-squares temporal difference learning (LSTD). We derive efficient implementations for our algorithms when the approximate value functions belong to a reproducing kernel Hilbert space. We also provide finite-sample performance bounds for our algorithms and show that they achieve optimal rates of convergence under the studied conditions.
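As a hedged illustration of the regularized policy evaluation step described above (a sketch in standard notation, not necessarily the paper's exact empirical loss, which must also account for the bias of the single-sample Bellman residual): given transitions (x_i, a_i, r_i, x'_i), a policy π, a discount factor γ, and an RKHS H of action-value functions, the L2-regularized BRM step can be written as

\[
\hat{Q} \;=\; \operatorname*{arg\,min}_{Q \in \mathcal{H}} \; \frac{1}{n} \sum_{i=1}^{n} \Big( Q(x_i, a_i) - r_i - \gamma\, Q\big(x'_i, \pi(x'_i)\big) \Big)^2 \;+\; \lambda \, \|Q\|_{\mathcal{H}}^2 .
\]

By the representer theorem, the minimizer admits a finite expansion \(\hat{Q}(\cdot) = \sum_j \alpha_j k(z_j, \cdot)\) over the observed state-action pairs \(z_j \in \{(x_i, a_i),\, (x'_i, \pi(x'_i))\}\), so the optimization reduces to a finite-dimensional quadratic problem in the coefficients \(\alpha\); this is what makes the efficient RKHS implementations mentioned in the abstract possible.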
[1] A. Antos, R. Munos, and Cs. Szepesvári. Fitted Q-iteration in continuous action-space MDPs. In Advances in Neural Information Processing Systems 20 (NIPS-2007), pages 9–16, 2008.
[2] A. Antos, Cs. Szepesvári, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71:89–129, 2008.
[3] L.C. Baird. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pages 30–37, 1995.
[4] S.J. Bradtke and A.G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33–57, 1996.
[5] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. JMLR, 6:503–556, 2005.
[6] A. M. Farahmand, M. Ghavamzadeh, Cs. Szepesvári, and S. Mannor. L2-regularized policy iteration, 2009. Under preparation.
[7] L. Györfi, M. Kohler, A. Krzyżak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer-Verlag, New York, 2002.
[8] R.A. Howard. Dynamic Programming and Markov Processes. The MIT Press, Cambridge, MA, 1960.
[9] T. Jung and D. Polani. Least squares SVM for least squares TD learning. In ECAI, pages 499–503, 2006.
[10] M. Lagoudakis and R. Parr. Least-squares policy iteration. JMLR, 4:1107–1149, 2003.
[11] M. Loth, M. Davy, and P. Preux. Sparse temporal difference learning using LASSO. In IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, 2007.
[12] D. Ormoneit and S. Sen. Kernel-based reinforcement learning. Machine Learning, 49:161–178, 2002.
[13] M. Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In 16th European Conference on Machine Learning, pages 317–328, 2005.
[14] P. J. Schweitzer and A. Seidmann. Generalized polynomial approximations in Markovian decision processes. Journal of Mathematical Analysis and Applications, 110:568–582, 1985.
[15] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. Bradford Book. MIT Press, 1998.
[16] R. J. Williams and L.C. Baird. Tight performance bounds on greedy policies based on imperfect value functions. In Proceedings of the Tenth Yale Workshop on Adaptive and Learning Systems, 1994.
[17] X. Xu, D. Hu, and X. Lu. Kernel-based least squares policy iteration for reinforcement learning. IEEE Transactions on Neural Networks, 18:973–992, 2007.