
358 nips-2012-Value Pursuit Iteration


Source: pdf

Authors: Amir M. Farahmand, Doina Precup

Abstract: Value Pursuit Iteration (VPI) is an approximate value iteration algorithm that finds a close-to-optimal policy for reinforcement learning problems with large state spaces. VPI has two main features. First, it is a nonparametric algorithm that finds a good sparse approximation of the optimal value function given a dictionary of features; the algorithm is almost insensitive to the number of irrelevant features. Second, after each iteration of VPI, the algorithm adds to the dictionary a set of functions based on the currently learned value function. This increases the representational power of the dictionary in a way that is directly relevant to the goal of having a good approximation of the optimal value function. We theoretically study VPI and provide a finite-sample error upper bound for it.
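To make the two features described in the abstract concrete, below is a minimal, illustrative sketch (not the paper's exact algorithm) of a dictionary-based approximate value iteration loop: a greedy Orthogonal Matching Pursuit (OMP) step fits the Bellman targets sparsely over the current dictionary, and the fitted value estimate is then appended to the dictionary for the next iteration. Everything in the sketch is an assumption made for illustration: a batch of transitions under a fixed policy (so the max over actions is omitted), NumPy-only helpers, and the names omp, vpi_sketch, the sparsity level k, and the initial dictionary of feature functions.

```python
import numpy as np


def omp(Phi, y, k):
    """Greedy Orthogonal Matching Pursuit: select at most k columns of Phi
    and fit y by least squares on the selected columns."""
    n_samples, n_atoms = Phi.shape
    residual = y.astype(float).copy()
    support, coef = [], np.zeros(n_atoms)
    w = np.zeros(0)
    for _ in range(min(k, n_atoms)):
        # Atom most correlated with the current residual (never reselect).
        corr = np.abs(Phi.T @ residual)
        corr[support] = -np.inf
        support.append(int(np.argmax(corr)))
        # Refit least squares on the selected support and update the residual.
        w, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ w
    coef[support] = w
    return coef


def vpi_sketch(states, rewards, next_states, gamma, dictionary, n_iters=10, k=5):
    """Dictionary-based approximate value iteration (policy-evaluation form).

    `dictionary` is a list of feature functions f: state -> float.  Each
    iteration (i) fits a sparse value estimate to the Bellman targets with
    OMP and (ii) appends the fitted estimate itself to the dictionary,
    loosely mirroring VPI's dictionary-augmentation step.
    """
    atoms = list(dictionary)
    coef = np.zeros(len(atoms))
    for _ in range(n_iters):
        Phi = np.column_stack([[f(s) for s in states] for f in atoms])
        Phi_next = np.column_stack([[f(s) for s in next_states] for f in atoms])
        # Bellman targets under the current value estimate.
        targets = rewards + gamma * (Phi_next @ coef)
        coef = omp(Phi, targets, k)
        # Freeze the current estimate and add it as a new dictionary atom.
        frozen_coef, frozen_atoms = coef.copy(), list(atoms)
        atoms.append(lambda s, c=frozen_coef, fs=frozen_atoms:
                     float(np.dot(c, [f(s) for f in fs])))
        coef = np.append(coef, 0.0)  # keep coef aligned with the grown dictionary
    return atoms, coef


# Hypothetical usage on a 1-d toy problem with a small monomial dictionary:
# rng = np.random.default_rng(0)
# S = rng.uniform(-1, 1, size=200); R = -S**2; S_next = 0.9 * S
# feats = [lambda s: 1.0, lambda s: s, lambda s: s**2, lambda s: s**3]
# atoms, coef = vpi_sketch(S, R, S_next, gamma=0.95, dictionary=feats, k=3)
```

Each appended atom closes over frozen copies of the coefficients and of the atoms it was fit on, so later dictionary growth does not alter earlier atoms; this is only one way to realize the augmentation idea described in the abstract.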


reference text

[1] Sridhar Mahadevan and Mauro Maggioni. Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes. Journal of Machine Learning Research, 8:2169–2231, 2007.

[2] Ronald Parr, Christopher Painter-Wakefield, Lihong Li, and Michael Littman. Analyzing feature generation for value-function approximation. In ICML '07: Proceedings of the 24th International Conference on Machine Learning, pages 737–744, New York, NY, USA, 2007. ACM.

[3] Amir-massoud Farahmand, Mohammad Ghavamzadeh, Csaba Szepesvári, and Shie Mannor. Regularized policy iteration. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems (NIPS - 21), pages 441–448. MIT Press, 2009.

[4] Amir-massoud Farahmand, Mohammad Ghavamzadeh, Csaba Szepesvári, and Shie Mannor. Regularized fitted Q-iteration for planning in continuous-space Markovian Decision Problems. In Proceedings of the American Control Conference (ACC), pages 725–730, June 2009.

[5] Gavin Taylor and Ronald Parr. Kernelized value function approximation for reinforcement learning. In ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning, pages 1017–1024, New York, NY, USA, 2009. ACM.

[6] J. Zico Kolter and Andrew Y. Ng. Regularization and feature selection in least-squares temporal difference learning. In ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning, pages 521–528. ACM, 2009.

[7] Jeff Johns, Christopher Painter-Wakefield, and Ronald Parr. Linear complementarity for regularized policy evaluation and improvement. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems (NIPS - 23), pages 1009–1017, 2010.

[8] Mohammad Ghavamzadeh, Alessandro Lazaric, Rémi Munos, and Matthew Hoffman. Finite-sample analysis of Lasso-TD. In Lise Getoor and Tobias Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1177–1184, New York, NY, USA, June 2011. ACM. ISBN 978-1-4503-0619-5.

[9] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Proceedings of the 27th Annual Asilomar Conference on Signals, Systems, and Computers, pages 40–44, 1993.

[10] Geoffrey M. Davis, Stéphane Mallat, and Marco Avellaneda. Adaptive greedy approximations. Journal of Constructive Approximation, 13:57–98, 1997.

[11] Jeff Johns. Basis Construction and Utilization for Markov Decision Processes using Graphs. PhD thesis, University of Massachusetts Amherst, 2010.

[12] Christopher Painter-Wakefield and Ronald Parr. Greedy algorithms for sparse reinforcement learning. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.

[13] Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005.

[14] Csaba Szepesvári. Algorithms for Reinforcement Learning. Morgan & Claypool Publishers, 2010.

[15] Andrew R. Barron, Albert Cohen, Wolfgang Dahmen, and Ronald A. DeVore. Approximation and learning by greedy algorithms. The Annals of Statistics, 36(1):64–94, 2008.

[16] Amir-massoud Farahmand. Regularization in Reinforcement Learning. PhD thesis, University of Alberta, 2011.

[17] Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9:815–857, 2008.

[18] Amir-massoud Farahmand, Rémi Munos, and Csaba Szepesvári. Error propagation for approximate policy and value iteration. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems (NIPS - 23), pages 568–576, 2010.

[19] Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1):94–116, January 1994.