
37 nips-2010-Basis Construction from Power Series Expansions of Value Functions


Source: pdf

Author: Sridhar Mahadevan, Bo Liu

Abstract: This paper explores links between basis construction methods in Markov decision processes and power series expansions of value functions. This perspective provides a useful framework to analyze properties of existing bases, as well as insight into constructing more effective bases. Krylov and Bellman error bases are based on the Neumann series expansion. These bases incur very large initial Bellman errors, and can converge rather slowly as the discount factor approaches unity. The Laurent series expansion, which relates discounted and average-reward formulations, provides both an explanation for this slow convergence and a way to construct more efficient basis representations. The first two terms in the Laurent series represent the scaled average-reward and the average-adjusted sum of rewards, and subsequent terms expand the discounted value function using powers of a generalized inverse called the Drazin (or group) inverse of a singular matrix derived from the transition matrix. Experiments show that Drazin bases converge considerably more quickly than several other bases, particularly for large values of the discount factor. An incremental variant of Drazin bases called Bellman average-reward bases (BARBs) is described, which provides some of the same benefits at lower computational cost.
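To make the construction concrete, here is a minimal Python sketch of the two families of bases discussed in the abstract. It is not the authors' code: the 50-state random-walk chain, the single terminal reward, and the helper names (random_walk_chain, cesaro_limit, drazin_inverse_of_I_minus_P) are illustrative assumptions. For a fixed policy with transition matrix P and reward vector R, the Krylov basis spans {R, PR, P^2 R, ...}, while a Drazin basis built from the Laurent series spans {P*R, HR, H^2 R, ...}, where P* is the limiting matrix of P and H = (I - P + P*)^{-1}(I - P*) is the Drazin (group) inverse of I - P. The sketch compares how well each k-dimensional basis captures the exact discounted value function under least-squares projection.

```python
# Minimal sketch (not the paper's experimental code): compare Krylov and
# Drazin bases on a toy ergodic chain.  Chain, reward, and helper names
# are illustrative assumptions.
import numpy as np

def random_walk_chain(n=50):
    """Ergodic chain: move left/right with prob 0.5, lazy reflecting ends."""
    P = np.zeros((n, n))
    for i in range(n):
        P[i, max(i - 1, 0)] += 0.5
        P[i, min(i + 1, n - 1)] += 0.5
    R = np.zeros(n)
    R[-1] = 1.0  # reward only in the right-most state (assumption)
    return P, R

def cesaro_limit(P):
    """Limiting matrix P* = 1 pi^T, with pi the stationary distribution of P."""
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    pi /= pi.sum()
    return np.outer(np.ones(len(pi)), pi)

def drazin_inverse_of_I_minus_P(P, P_star):
    """Drazin (group) inverse of I - P:  H = (I - P + P*)^{-1} (I - P*)."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - P + P_star, np.eye(n) - P_star)

def projection_error(Phi, V):
    """L2 error of the least-squares projection of V onto span(Phi)."""
    w, *_ = np.linalg.lstsq(Phi, V, rcond=None)
    return np.linalg.norm(Phi @ w - V)

P, R = random_walk_chain()
gamma = 0.99
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)  # exact discounted value function

P_star = cesaro_limit(P)
H = drazin_inverse_of_I_minus_P(P, P_star)

for k in range(1, 6):
    krylov = np.column_stack([np.linalg.matrix_power(P, i) @ R for i in range(k)])
    drazin = np.column_stack([P_star @ R] +
                             [np.linalg.matrix_power(H, i + 1) @ R for i in range(k - 1)])
    print(k, projection_error(krylov, V), projection_error(drazin, V))
```

Consistent with the abstract, one would expect the Drazin basis to fare relatively better as the discount factor approaches 1, since its first vector P*R already captures the scaled average-reward component that the Neumann (Krylov) expansion recovers only after many terms.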


reference text

[1] D. Bertsekas and D. Castañon. Adaptive aggregation methods for infinite horizon dynamic programming. IEEE Transactions on Automatic Control, 34:589–598, 1989.

[2] S. Campbell and C. Meyer. Generalized Inverses of Linear Transformations. Pitman, 1979.

[3] M. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.

[4] B. Liu and S. Mahadevan. An investigation of basis construction from power series expansions of value functions. Technical report, University of Massachusetts, Amherst, 2010.

[5] S. Mahadevan. Sensitive-discount optimality: Unifying discounted and average reward reinforcement learning. In Proceedings of the International Conference on Machine Learning, 1996.

[6] S. Mahadevan. Learning representation and control in Markov Decision Processes: New frontiers. Foundations and Trends in Machine Learning, 1(4):403–565, 2009.

[7] S. Mahadevan and M. Maggioni. Proto-value functions: A Laplacian framework for learning representation and control in Markov Decision Processes. Journal of Machine Learning Research, 8:2169–2231, 2007.

[8] R. Parr, L. Li, G. Taylor, C. Painter-Wakefield, and M. Littman. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), 2008.

[9] R. Parr, C. Painter-Wakefield, L. Li, and M. Littman. Analyzing feature generation for value function approximation. In Proceedings of the International Conference on Machine Learning (ICML), pages 737–744, 2007.

[10] M. Petrik. An analysis of Laplacian methods for value function approximation in MDPs. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 2574–2579, 2007.

[11] M. L. Puterman. Markov Decision Processes. Wiley Interscience, New York, USA, 1994.

[12] Y. Saad. Iterative Methods for Sparse Linear Systems. SIAM Press, 2003.

[13] A. Schwartz. A reinforcement learning method for maximizing undiscounted rewards. In Proceedings of the International Conference on Machine Learning. Morgan Kaufmann, San Francisco, CA, 1993.

[14] W. J. Stewart. Numerical methods for computing stationary distributions of finite irreducible Markov chains. In Advances in Computational Probability. Kluwer Academic Publishers, 1997.

[15] Y. Wei. Successive matrix squaring algorithm for computing the Drazin inverse. Applied Mathematics and Computation, 108:67–75, 2000.