Paper: nips2011-283 (NIPS 2011)
Source: pdf
Author: J. Z. Kolter
Abstract: Off-policy learning, the ability of an agent to learn about a policy other than the one it is following, is a key element of Reinforcement Learning, and in recent years there has been much work on developing Temporal Difference (TD) algorithms that are guaranteed to converge under off-policy sampling. It has remained an open question, however, whether anything can be said a priori about the quality of the TD solution when off-policy sampling is employed with function approximation. In general the answer is no: for arbitrary off-policy sampling the error of the TD solution can be unboundedly large, even when the approximator can represent the true value function well. In this paper we propose a novel approach to address this problem: we show that by considering a certain convex subset of off-policy distributions we can indeed provide guarantees on the solution quality similar to those in the on-policy case. Furthermore, we show that we can efficiently project onto this convex set using only samples generated from the system. The end result is a novel TD algorithm that has approximation guarantees even in the case of off-policy sampling and which empirically outperforms existing TD methods.
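To make the setting concrete, below is a minimal sketch of off-policy linear TD(0) with per-decision importance weighting, the kind of update whose fixed point the abstract discusses. It is not the projection-based algorithm proposed in the paper; the toy ring-walk chain, the random features Phi, the reward, and the behavior/target policy probabilities (mu_right, pi_right) are illustrative assumptions chosen only to show the update rule.

import numpy as np

rng = np.random.default_rng(0)

n_states, n_features = 5, 3
gamma, alpha = 0.9, 0.05

Phi = rng.normal(size=(n_states, n_features))  # linear features phi(s), one row per state
mu_right = 0.7   # behavior policy: probability of stepping right
pi_right = 0.3   # target policy:   probability of stepping right

theta = np.zeros(n_features)
s = 0
for _ in range(20000):
    go_right = rng.random() < mu_right                 # act with the behavior policy
    # Per-decision importance weight rho = pi(a|s) / mu(a|s)
    rho = (pi_right / mu_right) if go_right else ((1 - pi_right) / (1 - mu_right))
    s_next = (s + (1 if go_right else -1)) % n_states  # ring-walk transition
    r = 1.0 if s_next == 0 else 0.0                    # toy reward
    td_error = r + gamma * Phi[s_next] @ theta - Phi[s] @ theta
    theta += alpha * rho * td_error * Phi[s]           # off-policy linear TD(0) update
    s = s_next

print("learned weights:", theta)

On this particular chain both policies share a uniform stationary distribution, so the update is stable; for general off-policy sampling distributions it need not be, which is the failure mode the paper's convex-projection approach is designed to rule out.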
[1] L. C. Baird. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the International Conference on Machine Learning, 1995.
[2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[3] P. Dayan. The convergence of TD(λ) for general λ. Machine Learning, 8(3–4), 1992.
[4] T. Jaakkola, M. I. Jordan, and S. P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6), 1994.
[5] M. Journée, F. Bach, P.-A. Absil, and R. Sepulchre. Low-rank optimization on the cone of positive semidefinite matrices. SIAM Journal on Optimization, 20(5):2327–2351, 2010.
[6] J.Z. Kolter and A.Y. Ng. Regularization and feature selection in least-squares temporal difference learning. In Proceedings of the International Conference on Machine Learning, 2009.
[7] M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.
[8] A. Lazaric, M. Ghavamzadeh, and R. Munos. Finite-sample analysis of LSTD. In Proceedings of the International Conference on Machine Learning, 2010.
[9] H.R. Maei and R.S. Sutton. GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Proceedings of the Third Conference on Artificial General Intelligence, 2010.
[10] H.R. Maei, Cs. Szepesvari, S. Bhatnagar, and R.S. Sutton. Toward off-policy learning control with function approximation. In Proceedings of the International Conference on Machine Learning, 2010.
[11] R. Munos. Error bounds for approximate policy iteration. In Proceedings of the International Conference on Machine Learning, 2003.
[12] J. Nocedal and S.J. Wright. Numerical Optimization. Springer, 1999.
[13] B. Scherrer. Personal communication, 2011.
[14] M. Schmidt. minFunc, 2005. Available at http://www.cs.ubc.ca/~schmidtm/Software/minFunc.html.
[15] R.S. Sutton, H.R. Maei, D. Precup, S. Bhatnagar, D. Silver, Cs. Szepesvari, and E. Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the International Conference on Machine Learning, 2009.
[16] R.S. Sutton, Cs. Szepesvari, and H.R. Maei. A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. In Advances in Neural Information Processing Systems, 2008.
[17] J.N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42:674–690, 1997.
[18] J.N. Tsitsiklis and B. Van Roy. Average cost temporal difference learning. Automatica, 35(11):1799–1808, 1999.
[19] H. Yu and D. P. Bertsekas. Error bounds for approximations from projected linear equations. Mathematics of Operations Research, 35:306–329, 2010.