NIPS 2011, paper 65: reference knowledge graph (maker-knowledge-mining)
Source: pdf
Author: Daniel J. Lizotte
Abstract: Fitted value iteration (FVI) with ordinary least squares regression is known to diverge. We present a new method, “Expansion-Constrained Ordinary Least Squares” (ECOLS), that produces a linear approximation but also guarantees convergence when used with FVI. To ensure convergence, we constrain the least squares regression operator to be a non-expansion in the ∞-norm. We show that the space of function approximators that satisfy this constraint is richer than the space of “averagers,” we prove a minimax property of the ECOLS residual error, and we give an efficient algorithm for computing the coefficients of ECOLS based on constraint generation. We illustrate the algorithmic convergence of FVI with ECOLS in a suite of experiments, and discuss its properties.
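The core idea in the abstract can be sketched in a few lines: rather than using the ordinary least-squares hat matrix, fit a linear smoother whose ∞-operator norm (maximum absolute row sum) is at most 1, so the regression step cannot expand value-function errors across FVI iterations. The sketch below is illustrative only and is not the paper's exact ECOLS algorithm: it assumes numpy and cvxpy are available, names (X, y, W, S, expansion_constrained_fit) are ours, it states the row-sum constraints directly and passes them to a generic convex solver instead of the paper's constraint-generation procedure, and the symmetric parameterization of W is our assumption.

```python
# Illustrative sketch of an expansion-constrained least-squares fit (not the
# paper's exact ECOLS algorithm). We fit beta = W X^T y so that the induced
# smoother S = X W X^T is a non-expansion in the infinity norm: every row of
# S has absolute sum at most 1. Names and parameterization are assumptions.
import numpy as np
import cvxpy as cp

def expansion_constrained_fit(X, y):
    n, p = X.shape
    W = cp.Variable((p, p), symmetric=True)  # symmetric weight matrix (our assumption)
    S = X @ W @ X.T                          # linear smoother mapping y to fitted values
    objective = cp.Minimize(cp.sum_squares(S @ y - y))
    # Non-expansion in the infinity norm: max_i sum_j |S_ij| <= 1,
    # written as one constraint per row of S.
    constraints = [cp.sum(cp.abs(S), axis=1) <= 1]
    cp.Problem(objective, constraints).solve()
    return W.value @ X.T @ y                 # fitted coefficient vector

# Toy usage on random data
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = rng.normal(size=40)
beta = expansion_constrained_fit(X, y)
print(beta)
```

By contrast, the ordinary least-squares hat matrix X(X^T X)^{-1} X^T can have rows whose absolute sums exceed 1, which is what allows FVI to diverge; the constraint above rules that out at the cost of some additional residual error.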
[1] A. Antos, R. Munos, and Cs. Szepesvári. Fitted Q-iteration in continuous action-space MDPs. In Advances in Neural Information Processing Systems 20, pages 9–16. MIT Press, 2008.
[2] L. Baird. Residual Algorithms: Reinforcement Learning with Function Approximation. In A. Prieditis and S. Russell, editors, Proceedings of the Twelfth International Conference on Machine Learning, pages 30–37. Morgan Kaufmann, 1995.
[3] D. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 2007.
[4] J. Boyan and A. W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In Advances in Neural Information Processing Systems, pages 369–376, 1995.
[5] D. Ernst, P. Geurts, and L. Wehenkel. Tree-Based Batch Mode Reinforcement Learning. Journal of Machine Learning Research, 6:503–556, 2005.
[6] A. M. Farahmand, M. Ghavamzadeh, Cs. Szepesvári, and S. Mannor. Regularized fitted Q-iteration for planning in continuous-space Markovian decision problems. In American Control Conference, pages 725–730, 2009.
[7] R. Fonteneau. Contributions to Batch Mode Reinforcement Learning. PhD thesis, University of Liege, 2011.
[8] G. J. Gordon. Approximate Solutions to Markov Decision Processes. PhD thesis, Carnegie Mellon University, 1999.
[9] M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming, version 1.21. http://cvxr.com/cvx, Apr. 2011.
[10] M. C. Grant. Disciplined convex programming and the CVX modeling framework. Information Systems Journal, 2006.
[11] A. Guez, R. D. Vincent, M. Avoli, and J. Pineau. Adaptive treatment of epilepsy via batch-mode reinforcement learning. In D. Fox and C. P. Gomes, editors, Innovative Applications of Artificial Intelligence, pages 1671–1678, 2008.
[12] IBM. IBM ILOG CPLEX Optimization Studio V12.2, 2011.
[13] S. Kalyanakrishnan and P. Stone. Batch reinforcement learning in a complex domain. In Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS 2007), 2007.
[14] R. Munos and Cs. Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9:815–857, 2008.
[15] D. Ormoneit and S. Sen. Kernel-based reinforcement learning. Machine Learning, 49(2):161–178, 2002.
[16] M. Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In ECML 2005, pages 317–328. Springer, 2005.
[17] J. Rust. Using randomization to break the curse of dimensionality. Econometrica, 65(3):487–516, 1997.
[18] G. A. F. Seber. A Matrix Handbook for Statisticians. Wiley, 2007.
[19] S. M. Shortreed, E. Laber, D. J. Lizotte, T. S. Stroup, J. Pineau, and S. A. Murphy. Informing sequential clinical decision-making through reinforcement learning: an empirical study. Machine Learning, 2010.
[20] S. Siddiqi, B. Boots, and G. Gordon. A Constraint Generation Approach to Learning Stable Linear Dynamical Systems. In Advances in Neural Information Processing Systems 20, pages 1329–1336. MIT Press, 2008.
[21] Cs. Szepesvári. Algorithms for Reinforcement Learning. Morgan and Claypool, 2010.
[22] J. N. Tsitsiklis and B. van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.