NIPS 2011, paper 65: reference knowledge graph (maker-knowledge-mining)
Source: pdf
Author: Daniel J. Lizotte
Abstract: Fitted value iteration (FVI) with ordinary least squares regression is known to diverge. We present a new method, “Expansion-Constrained Ordinary Least Squares” (ECOLS), that produces a linear approximation but also guarantees convergence when used with FVI. To ensure convergence, we constrain the least squares regression operator to be a non-expansion in the ∞-norm. We show that the space of function approximators that satisfy this constraint is richer than the space of “averagers,” we prove a minimax property of the ECOLS residual error, and we give an efficient algorithm for computing the coefficients of ECOLS based on constraint generation. We illustrate the algorithmic convergence of FVI with ECOLS in a suite of experiments, and discuss its properties.
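The core idea in the abstract can be sketched in a few lines: rather than using the ordinary least-squares hat matrix, fit a linear smoother whose ∞-operator norm (maximum absolute row sum) is at most 1, so the regression step cannot expand value-function errors across FVI iterations. The sketch below is illustrative only and is not the paper's exact ECOLS algorithm: it assumes numpy and cvxpy are available, names (X, y, W, S, expansion_constrained_fit) are ours, it states the row-sum constraints directly and passes them to a generic convex solver instead of the paper's constraint-generation procedure, and the symmetric parameterization of W is our assumption.

```python
# Illustrative sketch of an expansion-constrained least-squares fit (not the
# paper's exact ECOLS algorithm). We fit beta = W X^T y so that the induced
# smoother S = X W X^T is a non-expansion in the infinity norm: every row of
# S has absolute sum at most 1. Names and parameterization are assumptions.
import numpy as np
import cvxpy as cp

def expansion_constrained_fit(X, y):
    n, p = X.shape
    W = cp.Variable((p, p), symmetric=True)  # symmetric weight matrix (our assumption)
    S = X @ W @ X.T                          # linear smoother mapping y to fitted values
    objective = cp.Minimize(cp.sum_squares(S @ y - y))
    # Non-expansion in the infinity norm: max_i sum_j |S_ij| <= 1,
    # written as one constraint per row of S.
    constraints = [cp.sum(cp.abs(S), axis=1) <= 1]
    cp.Problem(objective, constraints).solve()
    return W.value @ X.T @ y                 # fitted coefficient vector

# Toy usage on random data
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = rng.normal(size=40)
beta = expansion_constrained_fit(X, y)
print(beta)
```

By contrast, the ordinary least-squares hat matrix X(X^T X)^{-1} X^T can have rows whose absolute sums exceed 1, which is what allows FVI to diverge; the constraint above rules that out at the cost of some additional residual error.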
[1] A. Antos, R. Munos, and Cs. Szepesvári. Fitted Q-iteration in continuous action-space MDPs. In Advances in Neural Information Processing Systems 20, pages 9–16. MIT Press, 2008.
[2] L. Baird. Residual Algorithms: Reinforcement Learning with Function Approximation. In A. Prieditis and S. Russell, editors, Proceedings of the Twelfth International Conference on Machine Learning, pages 30–37. Morgan Kaufmann, 1995.
[3] D. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 2007.
[4] J. Boyan and A. W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In Advances in Neural Information Processing Systems, pages 369–376, 1995.
[5] D. Ernst, P. Geurts, and L. Wehenkel. Tree-Based Batch Mode Reinforcement Learning. Journal of Machine Learning Research, 6:503–556, 2005.
[6] A. M. Farahmand, M. Ghavamzadeh, Cs. Szepesvári, and S. Mannor. Regularized fitted Q-iteration for planning in continuous-space Markovian decision problems. In American Control Conference, pages 725–730, 2009.
[7] R. Fonteneau. Contributions to Batch Mode Reinforcement Learning. PhD thesis, University of Liege, 2011.
[8] G. J. Gordon. Approximate Solutions to Markov Decision Processes. PhD thesis, Carnegie Mellon University, 1999.
[9] M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming, version 1.21. http://cvxr.com/cvx, Apr. 2011.
[10] M. C. Grant. Disciplined convex programming and the CVX modeling framework. Information Systems Journal, 2006.
[11] A. Guez, R. D. Vincent, M. Avoli, and J. Pineau. Adaptive treatment of epilepsy via batch-mode reinforcement learning. In D. Fox and C. P. Gomes, editors, Innovative Applications of Artificial Intelligence, pages 1671–1678, 2008.
[12] IBM. IBM ILOG CPLEX Optimization Studio V12.2, 2011.
[13] S. Kalyanakrishnan and P. Stone. Batch reinforcement learning in a complex domain. In Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS 2007), 2007.
[14] R. Munos and Cs. Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9:815–857, 2008.
[15] D. Ormoneit and S. Sen. Kernel-based reinforcement learning. Machine Learning, 49(2):161–178, 2002.
[16] M. Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In ECML 2005, pages 317–328. Springer, 2005.
[17] J. Rust. Using randomization to break the curse of dimensionality. Econometrica, 65(3):487–516, 1997.
[18] G. A. F. Seber. A Matrix Handbook for Statisticians. Wiley, 2007.
[19] S. M. Shortreed, E. Laber, D. J. Lizotte, T. S. Stroup, J. Pineau, and S. A. Murphy. Informing sequential clinical decision-making through reinforcement learning: an empirical study. Machine Learning, 2010.
[20] S. Siddiqi, B. Boots, and G. Gordon. A Constraint Generation Approach to Learning Stable Linear Dynamical Systems. In Advances in Neural Information Processing Systems 20, pages 1329–1336. MIT Press, 2008.
[21] Cs. Szepesvári. Algorithms for Reinforcement Learning. Morgan and Claypool, 2010.
[22] J. N. Tsitsiklis and B. van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.