
292 nips-2012-Regularized Off-Policy TD-Learning


Source: pdf

Author: Bo Liu, Sridhar Mahadevan, Ji Liu

Abstract: We present a novel ℓ1-regularized off-policy convergent TD-learning method (termed RO-TD), which is able to learn sparse representations of value functions with low computational complexity. The algorithmic framework underlying RO-TD integrates two key ideas: off-policy convergent gradient TD methods, such as TDC, and a convex-concave saddle-point formulation of non-smooth convex optimization, which enables first-order solvers and feature selection using online convex regularization. A detailed theoretical and experimental analysis of RO-TD is presented, with a variety of experiments illustrating the off-policy convergence, sparse feature selection capability, and low computational cost of the RO-TD algorithm.
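
As a rough illustration of the saddle-point idea the abstract refers to (the notation A, b, θ, y, ρ below is illustrative and not necessarily the paper's), an ℓ1-regularized linear least-squares TD objective can be rewritten, via Fenchel duality, as a convex-concave saddle-point problem to which first-order primal-dual solvers such as the extragradient method [9] apply:

\[
\min_{\theta}\; \tfrac{1}{2}\|A\theta - b\|_2^2 + \rho\|\theta\|_1
\;=\;
\min_{\theta}\,\max_{y}\; \Big\{\, y^{\top}(A\theta - b) - \tfrac{1}{2}\|y\|_2^2 + \rho\|\theta\|_1 \,\Big\},
\]

which follows from \(\max_y \{\, y^{\top}z - \tfrac{1}{2}\|y\|_2^2 \,\} = \tfrac{1}{2}\|z\|_2^2\). The reformulated objective is convex in θ and concave in y, so non-smooth first-order methods [6] can be applied using only matrix-vector products per iteration.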


reference text

[1] A. Ben-Tal and A. Nemirovski. Non-Euclidean restricted memory level method for large-scale convex optimization. Mathematical Programming, 102(3):407–456, 2005.

[2] T. Degris, M. White, and R. S. Sutton. Linear off-policy actor-critic. In International Conference on Machine Learning, 2012.

[3] M. Geist, B. Scherrer, A. Lazaric, and M. Ghavamzadeh. A Dantzig Selector Approach to Temporal Difference Learning. In International Conference on Machine Learning, 2012.

[4] M. Ghavamzadeh, A. Lazaric, R. Munos, and M. Hoffman. Finite-Sample Analysis of Lasso-TD. In Proceedings of the 28th International Conference on Machine Learning, 2011.

[5] J. Johns, C. Painter-Wakefield, and R. Parr. Linear complementarity for regularized policy evaluation and improvement. In Proceedings of the International Conference on Neural Information Processing Systems, 2010.

[6] A. Juditsky and A. Nemirovski. Optimization for Machine Learning, chapter First-Order Methods for Nonsmooth Convex Large-Scale Optimization. MIT Press, 2011.

[7] J. Zico Kolter and A. Y. Ng. Regularization and feature selection in least-squares temporal difference learning. In Proceedings of the 26th International Conference on Machine Learning, 2009.

[8] G. Konidaris, S. Osentoski, and P. S. Thomas. Value function approximation in reinforcement learning using the Fourier basis. In Proceedings of the Twenty-Fifth Conference on Artificial Intelligence, 2011.

[9] G. M. Korpelevich. The extragradient method for finding saddle points and other problems. 1976.

[10] H. R. Maei and R. S. Sutton. GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Proceedings of the Third Conference on Artificial General Intelligence, pages 91–96, 2010.

[11] S. Mahadevan and B. Liu. Sparse Q-learning with Mirror Descent. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2012.

[12] A. Nedić and A. Ozdaglar. Subgradient methods for saddle-point problems. Journal of Optimization Theory and Applications, 142(1):205–228, 2009.

[13] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19:1574–1609, 2009.

[14] Y. Nesterov. Gradient methods for minimizing composite objective function. Technical report, 2007. Available at www.optimization-online.org.

[15] C. Painter-Wakefield and R. Parr. Greedy algorithms for sparse reinforcement learning. In International Conference on Machine Learning, 2012.

[16] C. Painter-Wakefield and R. Parr. L1 regularized linear temporal difference learning. Technical Report TR-2012-01, Duke University, Department of Computer Science, 2012.

[17] M. Petrik, G. Taylor, R. Parr, and S. Zilberstein. Feature selection using regularization in approximate linear programs for Markov decision processes. In Proceedings of the International Conference on Machine learning (ICML), 2010.

[18] J. Si and Y. Wang. Online learning control by association and reinforcement. IEEE Transactions on Neural Networks, 12:264–276, 2001.

[19] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[20] R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvári, and E. Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In International Conference on Machine Learning, pages 993–1000, 2009.

[21] J. Zico Kolter. The Fixed Points of Off-Policy TD. In Advances in Neural Information Processing Systems 24, pages 2169–2177, 2011.