nips nips2002 nips2002-3 nips2002-3-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Theodore J. Perkins, Doina Precup
Abstract: We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a “policy improvement operator” to generate a new policy based on the learned state-action values. We prove that if the policy improvement operator produces ε-soft policies and is Lipschitz continuous in the action values, with a constant that is not too large, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. To our knowledge, this is the first convergence result for any form of approximate policy iteration under similar computational-resource assumptions.
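The procedure described in the abstract can be sketched in code. The following Python sketch is illustrative only, not the paper's exact algorithm: the softmax-with-uniform-mixing improvement operator, the `reset`/`step` environment interface, the feature map `phi`, and all step sizes, episode counts, and the temperature are assumptions introduced here for concreteness.

```python
import numpy as np

def epsilon_soft_softmax(q_values, epsilon=0.1, temperature=1.0):
    """A 'policy improvement operator' in the abstract's sense: maps action
    values to an epsilon-soft distribution.  Softmax is Lipschitz in the
    action values, and mixing with the uniform distribution guarantees that
    every action has probability at least epsilon / |A|."""
    z = (q_values - np.max(q_values)) / temperature   # shift for numerical stability
    soft = np.exp(z) / np.sum(np.exp(z))
    return (1.0 - epsilon) * soft + epsilon / len(q_values)

def approximate_policy_iteration(env, phi, n_actions, n_features,
                                 n_iterations=20, episodes_per_eval=200,
                                 alpha=0.05, gamma=0.99, epsilon=0.1):
    """Model-free approximate policy iteration sketch:
      evaluation:  Sarsa with linear approximation Q(s, a) ~ theta . phi(s, a)
      improvement: the epsilon-soft, Lipschitz operator above.
    `env` is assumed to provide reset() -> state and step(a) -> (state, reward, done)."""
    policy_theta = np.zeros(n_features)          # weights defining the current policy

    for _ in range(n_iterations):
        def policy(state):                       # current policy, held fixed during evaluation
            q = np.array([policy_theta @ phi(state, a) for a in range(n_actions)])
            return epsilon_soft_softmax(q, epsilon)

        theta = np.zeros(n_features)             # Sarsa estimate of Q^pi
        for _ in range(episodes_per_eval):
            s = env.reset()
            a = np.random.choice(n_actions, p=policy(s))
            done = False
            while not done:
                s_next, reward, done = env.step(a)
                q_sa = theta @ phi(s, a)
                if done:
                    td_error = reward - q_sa
                else:
                    a_next = np.random.choice(n_actions, p=policy(s_next))
                    td_error = reward + gamma * theta @ phi(s_next, a_next) - q_sa
                theta += alpha * td_error * phi(s, a)
                if not done:
                    s, a = s_next, a_next

        policy_theta = theta                     # policy improvement step
    return policy_theta
```

In this sketch the softmax temperature controls the Lipschitz constant of the improvement operator, which is the quantity the convergence result requires to be sufficiently small, while mixing with the uniform distribution makes every generated policy ε-soft.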
[1] L. C. Baird. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pages 30–37. Morgan Kaufmann, 1995.
[2] A. G. Barto, S. J. Bradtke, and S. P. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1):81–138, 1995.
[3] D. P. Bertsekas. Dynamic Programming and Optimal Control, Volumes 1 and 2. Athena Scientific, 2001.
[4] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
[5] D. P. De Farias and B. Van Roy. On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Opt. Theory and Applications, 105(3), 2000.
[6] G. Gordon. Chattering in Sarsa(λ). CMU Learning Lab Internal Report, 1996. Available at www.cs.cmu.edu/~ggordon.
[7] G. Gordon. Approximate Solutions to Markov Decision Processes. PhD thesis, Carnegie Mellon University, 1999.
[8] G. J. Gordon. Reinforcement learning with function approximation converges to a region. In Advances in Neural Information Processing Systems 13, pages 1040–1046. MIT Press, 2001.
[9] C. D. Meyer. Matrix Analysis and Applied Linear Algebra. SIAM, 2000.
[10] T. J. Perkins and M. D. Pendrith. On the existence of fixed points for Q-learning and Sarsa in partially observable domains. In Proceedings of the Nineteenth International Conference on Machine Learning, 2002.
[11] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc, New York, 1994.
[12] E. Seneta. Sensitivity analysis, ergodicity coefficients, and rank-one updates for finite Markov chains. In W. J. Stewart, editor, Numerical Solution of Markov Chains. Dekker, NY, 1991.
[13] S. Singh, T. Jaakkola, M. L. Littman, and C. Szepesvari. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38(3):287–308, 2000.
[14] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press/Bradford Books, Cambridge, Massachusetts, 1998.
[15] G. J. Tesauro. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6(2):215–219, 1994.
[16] J. N. Tsitsiklis and B. Van Roy. Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives. IEEE Transactions on Automatic Control, 44(10):1840–1851, 1999.
[17] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.

Appendix

Lemma 7 For an $n$-by-$n$ matrix $A$, let $\sigma(A) = \min_{\|x\|=1} \|Ax\|$. Then:
1. $\|Ax\| \ge \sigma(A)\,\|x\|$ for all $x$,
2. $\sigma(A) > 0$ iff $A$ is non-singular,
3. $\|A^{-1}x\| \le \|x\| / \sigma(A)$ for any $x$, whenever $A$ is non-singular,
4. $\sigma$ is continuous.

Proof: The first three points readily follow from elementary arguments. We focus on the last point. We want to show that given a sequence of matrices $A_1, A_2, \ldots$ that converge to some $A$, then $\sigma(A_i) \to \sigma(A)$. Note that $A_i \to A$ means that $\sup_{\|x\|=1} \|(A_i - A)x\| \to 0$. Let $x$ be any unit vector. Then $\|A_i x\| \le \|A x\| + \|(A_i - A)x\|$ and $\|A x\| \le \|A_i x\| + \|(A_i - A)x\|$, so $|\sigma(A_i) - \sigma(A)| \le \sup_{\|x\|=1} \|(A_i - A)x\| \to 0$.
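The facts in Lemma 7 can be checked numerically. The short script below is an illustrative sketch only, assuming $\sigma(A)$ denotes $\min_{\|x\|=1}\|Ax\|$ (the smallest singular value of $A$); the random test matrix, tolerances, and NumPy routines are choices made here and are not part of the paper.

```python
import numpy as np

# Numerical check of the linear-algebra facts in Lemma 7, with sigma(A)
# taken to be the smallest singular value of A (illustrative assumption).
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
x = rng.standard_normal(4)

sigma = np.linalg.svd(A, compute_uv=False).min()            # sigma(A)

# Point 1: ||Ax|| >= sigma(A) ||x|| for all x.
assert np.linalg.norm(A @ x) >= sigma * np.linalg.norm(x) - 1e-12
# Point 2: sigma(A) > 0 iff A is non-singular.
assert (sigma > 0) == (np.linalg.matrix_rank(A) == 4)
# Point 3: ||A^{-1} x|| <= ||x|| / sigma(A) when A is non-singular.
assert np.linalg.norm(np.linalg.solve(A, x)) <= np.linalg.norm(x) / sigma + 1e-12
# Point 4: sigma is continuous -- a small perturbation E changes sigma by at most ||E||_2.
for eps in (1e-1, 1e-2, 1e-3):
    E = eps * rng.standard_normal((4, 4))
    sigma_pert = np.linalg.svd(A + E, compute_uv=False).min()
    assert abs(sigma_pert - sigma) <= np.linalg.norm(E, 2) + 1e-12
```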