nips nips2011 nips2011-18 nips2011-18-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Amir-massoud Farahmand
Abstract: Many practitioners of reinforcement learning have observed that oftentimes the agent's performance gets very close to optimal even though the estimated (action-)value function is still far from the optimal one. The goal of this paper is to explain and formalize this phenomenon by introducing the concept of the action-gap regularity. As a typical result, we prove that for an agent following the greedy policy π̂ with respect to an action-value function Q̂, the performance loss E[V∗(X) − V^π̂(X)] is upper bounded by O(‖Q̂ − Q∗‖_∞^{1+ζ}), in which ζ ≥ 0 is the parameter quantifying the action-gap regularity. For ζ > 0, our results indicate a smaller performance loss than previous analyses had suggested. Finally, we show how this regularity affects the performance of the family of approximate value iteration algorithms.
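Stated in display form, the abstract's bound reads as in the sketch below. This is a hedged reconstruction: the constants c and c_g and the particular margin-type form of the action-gap condition are illustrative assumptions (reminiscent of the Tsybakov-style margin conditions of [3, 4]), not quotations from the paper body; only the exponent 1+ζ is taken directly from the abstract.

% Hedged sketch: c, c_g, and the exact form of the gap g_{Q^*} are assumed
% for illustration (a unique maximizing action is assumed for simplicity).
\[
  \mathbb{E}\big[ V^*(X) - V^{\hat{\pi}}(X) \big]
  \;\le\; c \,\big\| \hat{Q} - Q^* \big\|_{\infty}^{1+\zeta},
  \qquad \zeta \ge 0,
\]
% with \hat{\pi} greedy with respect to \hat{Q}, and the action-gap
% regularity playing a role analogous to a margin condition, e.g.
\[
  \mathbb{P}\big( 0 < g_{Q^*}(X) \le t \big) \;\le\; c_g\, t^{\zeta},
  \qquad
  g_{Q^*}(x) = \max_{a} Q^*(x,a) - \max_{a \ne a^*(x)} Q^*(x,a),
  \quad a^*(x) = \operatorname*{arg\,max}_{a} Q^*(x,a).
\]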
[1] Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.
[2] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming (Optimization and Neural Computation Series, 3). Athena Scientific, 1996.
[3] Enno Mammen and Alexander B. Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829, 1999.
[4] Alexander B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.
[5] Jean-Yves Audibert and Alexander B. Tsybakov. Fast learning rates for plug-in classifiers. The Annals of Statistics, 35(2):608–633, 2007.
[6] Alessandro Rinaldo and Larry Wasserman. Generalized density clustering. The Annals of Statistics, 38(5):2678–2722, 2010.
[7] Michail G. Lagoudakis and Ronald Parr. Reinforcement learning as classification: Leveraging modern classifiers. In ICML ’03: Proceedings of the 20th International Conference on Machine Learning, pages 424–431, 2003.
[8] Alessandro Lazaric, Mohammad Ghavamzadeh, and Rémi Munos. Analysis of a classification-based policy iteration algorithm. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 607–614. Omnipress, 2010.
[9] Umar Syed and Robert E. Schapire. A reduction from apprenticeship learning to classification. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems (NIPS - 23), pages 2253–2261, 2010.
[10] András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71:89–129, 2008.
[11] Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9:815–857, 2008.
[12] Amir-massoud Farahmand, Mohammad Ghavamzadeh, Csaba Szepesvári, and Shie Mannor. Regularized fitted Q-iteration for planning in continuous-space Markovian Decision Problems. In Proceedings of American Control Conference (ACC), pages 725–730, June 2009.
[13] Amir-massoud Farahmand, Mohammad Ghavamzadeh, Csaba Szepesvári, and Shie Mannor. Regularized policy iteration. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems (NIPS - 21), pages 441–448. MIT Press, 2009.
[14] Odalric Maillard, Rémi Munos, Alessandro Lazaric, and Mohammad Ghavamzadeh. Finite-sample analysis of Bellman residual minimization. In Proceedings of the Second Asian Conference on Machine Learning (ACML), 2010.
[15] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning). The MIT Press, 1998.
[16] Csaba Szepesvári. Algorithms for Reinforcement Learning. Morgan & Claypool Publishers, 2010.
[17] Hamid Reza Maei, Csaba Szepesvári, Shalabh Bhatnagar, and Richard S. Sutton. Toward off-policy learning control with function approximation. In Johannes Fürnkranz and Thorsten Joachims, editors, Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 719–726, Haifa, Israel, June 2010. Omnipress.
[18] J. Zico Kolter and Andrew Y. Ng. Regularization and feature selection in least-squares temporal difference learning. In ICML ’09: Proceedings of the 26th Annual International Conference on Machine Learning, pages 521–528. ACM, 2009.
[19] Martin Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In 16th European Conference on Machine Learning, pages 317–328, 2005.
[20] Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005.
[21] Daniela Pucci de Farias and Benjamin Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51(6):850–865, 2003.
[22] Marek Petrik and Shlomo Zilberstein. Constraint relaxation in approximate linear programs. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 809–816, New York, NY, USA, 2009. ACM.
[23] Rémi Munos. Error bounds for approximate policy iteration. In ICML 2003: Proceedings of the 20th Annual International Conference on Machine Learning, pages 560–567, 2003.
[24] Rémi Munos. Performance bounds in Lp norm for approximate value iteration. SIAM Journal on Control and Optimization, pages 541–561, 2007.
[25] Amir-massoud Farahmand, Rémi Munos, and Csaba Szepesvári. Error propagation for approximate policy and value iteration. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems (NIPS - 23), pages 568–576, 2010.