nips2010-212: Predictive State Temporal Difference Learning
Source: pdf
Authors: Byron Boots, Geoffrey J. Gordon
Abstract: We propose a new approach to value function approximation which combines linear temporal difference reinforcement learning with subspace identification. In practical applications, reinforcement learning (RL) is complicated by the fact that state is either high-dimensional or partially observable. Therefore, RL methods are designed to work with features of state rather than state itself, and the success or failure of learning is often determined by the suitability of the selected features. By comparison, subspace identification (SSID) methods are designed to select a feature set which preserves as much information as possible about state. In this paper we connect the two approaches, looking at the problem of reinforcement learning with a large set of features, each of which may only be marginally useful for value function approximation. We introduce a new algorithm for this situation, called Predictive State Temporal Difference (PSTD) learning. As in SSID for predictive state representations, PSTD finds a linear compression operator that projects a large set of features down to a small set that preserves the maximum amount of predictive information. As in RL, PSTD then uses a Bellman recursion to estimate a value function. We discuss the connection between PSTD and prior approaches in RL and SSID. We prove that PSTD is statistically consistent, perform several experiments that illustrate its properties, and demonstrate its potential on a difficult optimal stopping problem.
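To make the two-stage structure described in the abstract concrete, here is a minimal NumPy sketch of that kind of pipeline: an SSID-style linear compression of a large feature set, followed by an LSTD-style Bellman recursion on the compressed features. This is an illustrative sketch, not the authors' algorithm: all names (pstd_sketch, past_feats, future_feats, gamma, k) are assumptions, and details the paper actually specifies (canonical-correlation-style whitening, the precise predictive-state construction, and the consistency analysis) are omitted; see the paper and its technical report [12] for the full method.

```python
import numpy as np

def pstd_sketch(past_feats, future_feats, rewards, gamma=0.9, k=10):
    """Illustrative PSTD-style pipeline (hypothetical, not the authors' code).

    past_feats   : (T, d)  features of history at each time step
    future_feats : (T, d') features of the future ("tests") at each step
    rewards      : (T,)    observed immediate rewards
    gamma        : discount factor
    k            : size of the compressed feature set (k <= min(d, d'))
    """
    T = past_feats.shape[0]

    # Stage 1 (SSID-style compression): estimate the cross-covariance
    # between history and future features, and keep its top-k left
    # singular vectors as a linear compression operator U.
    cov = past_feats.T @ future_feats / T          # (d, d')
    U, _, _ = np.linalg.svd(cov, full_matrices=False)
    U = U[:, :k]                                   # (d, k)

    # Project the raw features down to a small predictive feature set.
    s = past_feats @ U                             # (T, k)

    # Stage 2 (Bellman recursion, here plain LSTD): solve A w = b for
    # the value-function weights on the compressed features.
    A = np.zeros((k, k))
    b = np.zeros(k)
    for t in range(T - 1):
        A += np.outer(s[t], s[t] - gamma * s[t + 1])
        b += s[t] * rewards[t]
    w = np.linalg.lstsq(A, b, rcond=None)[0]

    # The estimated value at time t is s[t] @ w.
    return U, w
```

The design choice this sketch is meant to highlight mirrors the abstract: the compression operator U is chosen to preserve predictive information about the future (the SSID step), and only afterwards is a Bellman recursion run on the compressed features (the RL step).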
[1] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.
[2] Justin A. Boyan. Least-squares temporal difference learning. In Proc. Intl. Conf. Machine Learning, pages 49–56. Morgan Kaufmann, San Francisco, CA, 1999.
[3] Steven J. Bradtke and Andrew G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1–3):33–57, 1996.
[4] Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. J. Mach. Learn. Res., 4:1107–1149, 2003.
[5] Ronald Parr, Lihong Li, Gavin Taylor, Christopher Painter-Wakefield, and Michael L. Littman. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In ICML ’08: Proceedings of the 25th international conference on Machine learning, pages 752–759, New York, NY, USA, 2008. ACM.
[6] Matthew Rosencrantz, Geoffrey J. Gordon, and Sebastian Thrun. Learning low dimensional predictive representations. In Proc. ICML, 2004.
[7] Byron Boots, Sajid M. Siddiqi, and Geoffrey J. Gordon. Closing the learning-planning loop with predictive state representations. In Proceedings of Robotics: Science and Systems VI, 2010.
[8] Pascal Poupart and Craig Boutilier. Value-directed compression of POMDPs. In NIPS, pages 1547–1554, 2002.
[9] J. Zico Kolter and Andrew Y. Ng. Regularization and feature selection in least-squares temporal difference learning. In ICML ’09: Proceedings of the 26th Annual International Conference on Machine Learning, pages 521–528, New York, NY, USA, 2009. ACM.
[10] Gregory C. Reinsel and Rajabather Palani Velu. Multivariate Reduced-rank Regression: Theory and Applications. Springer, 1998.
[11] Gene H. Golub and Charles F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 1996.
[12] Byron Boots and Geoffrey J. Gordon. Predictive state temporal difference learning. Technical report, arXiv.org.
[13] Harold Hotelling. The most predictable criterion. Journal of Educational Psychology, 26:139–142, 1935.
[14] S. Soatto and A. Chiuso. Dynamic data factorization. Technical report, UCLA, 2001.
[15] Michael Littman, Richard Sutton, and Satinder Singh. Predictive representations of state. In Advances in Neural Information Processing Systems (NIPS), 2002.
[16] Judea Pearl. Causality: models, reasoning, and inference. Cambridge University Press, 2000.
[17] Herbert Jaeger. Observable operator models for discrete stochastic time series. Neural Computation, 12:1371–1398, 2000.
[18] Satinder Singh, Michael James, and Matthew Rudary. Predictive state representations: A new theory for modeling dynamical systems. In Proc. UAI, 2004.
[19] P. Van Overschee and B. De Moor. Subspace Identification for Linear Systems: Theory, Implementation, Applications. Kluwer, 1996.
[20] Tohru Katayama. Subspace Methods for System Identification. Springer-Verlag, 2005.
[21] Daniel Hsu, Sham Kakade, and Tong Zhang. A spectral algorithm for learning hidden Markov models. In COLT, 2009.
[22] Michael R. James, Ton Wessling, and Nikos A. Vlassis. Improving approximate value iteration using memories and predictive state representations. In AAAI, 2006.
[23] Sajid Siddiqi, Byron Boots, and Geoffrey J. Gordon. Reduced-rank hidden Markov models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS-2010), 2010.
[24] John N. Tsitsiklis and Benjamin Van Roy. Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives. IEEE Transactions on Automatic Control, 44(10):1840–1851, 1999.
[25] David Choi and Benjamin Van Roy. A generalized Kalman filter for fixed point approximation and efficient temporal-difference learning. Discrete Event Dynamic Systems, 16(2):207–239, 2006.