
134 nips-2010-LSTD with Random Projections


Source: pdf

Author: Mohammad Ghavamzadeh, Alessandro Lazaric, Odalric Maillard, Rémi Munos

Abstract: We consider the problem of reinforcement learning in high-dimensional spaces when the number of features exceeds the number of samples. In particular, we study the least-squares temporal difference (LSTD) learning algorithm when a space of low dimension is generated with a random projection from a high-dimensional space. We provide a thorough theoretical analysis of LSTD with random projections and derive performance bounds for the resulting algorithm. We also show how the error of LSTD with random projections is propagated through the iterations of a policy iteration algorithm and provide a performance bound for the resulting least-squares policy iteration (LSPI) algorithm.
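To make the setting concrete, below is a minimal sketch of the idea described in the abstract: project high-dimensional features into a low-dimensional space with a random Gaussian matrix and then run standard LSTD on the projected features. All names (phi, phi_next, rewards, d, gamma) and the small ridge term are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def lstd_random_projection(phi, phi_next, rewards, d, gamma=0.95, seed=0):
    """Sketch of LSTD in a randomly projected feature space.

    phi      : (n, D) high-dimensional features of visited states
    phi_next : (n, D) features of the successor states
    rewards  : (n,)   observed rewards
    d        : target (low) dimension, d << D
    """
    n, D = phi.shape
    rng = np.random.default_rng(seed)

    # Random projection: Gaussian entries scaled by 1/sqrt(d) so that
    # inner products are preserved in expectation (Johnson-Lindenstrauss style).
    A = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, D))

    psi = phi @ A.T           # (n, d) projected features
    psi_next = phi_next @ A.T

    # Standard LSTD fixed point in the projected space:
    #   theta = (Psi^T (Psi - gamma * Psi'))^{-1} Psi^T r
    M = psi.T @ (psi - gamma * psi_next)
    b = psi.T @ rewards
    theta = np.linalg.solve(M + 1e-8 * np.eye(d), b)  # tiny ridge for numerical stability

    # The value of a state s is then estimated as (A @ phi(s)) @ theta.
    return theta, A
```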


reference text

[1] A. Antos, Cs. Szepesvári, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning Journal, 71:89–129, 2008.

[2] J. Boyan. Least-squares temporal difference learning. In Proceedings of the Sixteenth International Conference on Machine Learning, pages 49–56, 1999.

[3] S. Bradtke and A. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33–57, 1996.

[4] A. M. Farahmand, M. Ghavamzadeh, Cs. Szepesvári, and S. Mannor. Regularized policy iteration. In Proceedings of Advances in Neural Information Processing Systems 21, pages 441–448. MIT Press, 2008.

[5] A. M. Farahmand, M. Ghavamzadeh, Cs. Szepesvári, and S. Mannor. Regularized fitted Q-iteration for planning in continuous-space Markovian decision problems. In Proceedings of the American Control Conference, pages 725–730, 2009.

[6] M. Ghavamzadeh, A. Lazaric, O. Maillard, and R. Munos. LSPI with random projections. Technical Report inria-00530762, INRIA, 2010.

[7] P. Keller, S. Mannor, and D. Precup. Automatic basis function construction for approximate dynamic programming and reinforcement learning. In Proceedings of the Twenty-Third International Conference on Machine Learning, pages 449–456, 2006.

[8] Z. Kolter and A. Ng. Regularization and feature selection in least-squares temporal difference learning. In Proceedings of the Twenty-Sixth International Conference on Machine Learning, pages 521–528, 2009.

[9] M. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.

[10] A. Lazaric, M. Ghavamzadeh, and R. Munos. Finite-sample analysis of least-squares policy iteration. Technical Report inria-00528596, INRIA, 2010.

[11] A. Lazaric, M. Ghavamzadeh, and R. Munos. Finite-sample analysis of LSTD. In Proceedings of the Twenty-Seventh International Conference on Machine Learning, pages 615–622, 2010.

[12] M. Loth, M. Davy, and P. Preux. Sparse temporal difference learning using lasso. In IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning, pages 352–359, 2007.

[13] S. Mahadevan. Representation policy iteration. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, pages 372–379, 2005.

[14] O. Maillard and R. Munos. Compressed least-squares regression. In Proceedings of Advances in Neural Information Processing Systems 22, pages 1213–1221, 2009.

[15] O. Maillard and R. Munos. Brownian motions and scrambled wavelets for least-squares regression. Technical Report inria-00483014, INRIA, 2010.

[16] I. Menache, S. Mannor, and N. Shimkin. Basis function adaptation in temporal difference reinforcement learning. Annals of Operations Research, 134:215–238, 2005.

[17] R. Parr, C. Painter-Wakefield, L. Li, and M. Littman. Analyzing feature generation for value function approximation. In Proceedings of the Twenty-Fourth International Conference on Machine Learning, pages 737–744, 2007.

[18] M. Petrik, G. Taylor, R. Parr, and S. Zilberstein. Feature selection using regularization in approximate linear programs for Markov decision processes. In Proceedings of the Twenty-Seventh International Conference on Machine Learning, pages 871–878, 2010.

[19] M. Rudelson and R. Vershynin. Non-asymptotic theory of random matrices: extreme singular values. In Proceedings of the International Congress of Mathematicians, 2010.

[20] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[21] S. Vempala. The Random Projection Method. American Mathematical Society, 2004.