jmlr jmlr2011 jmlr2011-36 jmlr2011-36-reference knowledge-graph by maker-knowledge-mining

36 jmlr-2011-Generalized TD Learning


Source: pdf

Author: Tsuyoshi Ueno, Shin-ichi Maeda, Motoaki Kawanabe, Shin Ishii

Abstract: Since the invention of temporal difference (TD) learning (Sutton, 1988), many new algorithms for model-free policy evaluation have been proposed. Although they have brought much progress in practical applications of reinforcement learning (RL), fundamental problems remain concerning the statistical properties of value function estimation. To address these problems, we introduce a new framework, semiparametric statistical inference, for model-free policy evaluation. This framework generalizes TD learning and its extensions, and allows us to investigate the statistical properties of both batch and online learning procedures for value function estimation in a unified way in terms of estimating functions. Furthermore, based on this framework, we derive an optimal estimating function with the minimum asymptotic variance and propose batch and online learning algorithms that achieve this optimality. Keywords: reinforcement learning, model-free policy evaluation, TD learning, semiparametric model, estimating function
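To make the setting concrete, here is a minimal Python/NumPy sketch of the two standard model-free policy-evaluation procedures that the framework above generalizes: online TD(0) (Sutton, 1988) and batch least-squares TD (LSTD; Bradtke and Barto, 1996), both using a linear value function V(s) = theta^T phi(s). The feature map phi, the discount factor gamma, the step size alpha, and the toy trajectory are illustrative assumptions, not details taken from the paper, and the sketch does not implement the paper's minimum-asymptotic-variance estimating function.

import numpy as np

def td0_online(trajectory, phi, n_features, gamma=0.95, alpha=0.05):
    # Online TD(0): after each transition, theta <- theta + alpha * delta * phi(s),
    # where delta is the temporal-difference error.
    theta = np.zeros(n_features)
    for s, r, s_next in trajectory:
        delta = r + gamma * phi(s_next) @ theta - phi(s) @ theta
        theta += alpha * delta * phi(s)
    return theta

def lstd_batch(trajectory, phi, n_features, gamma=0.95, reg=1e-6):
    # Batch LSTD: solve A theta = b with A = sum phi(s) (phi(s) - gamma phi(s'))^T
    # and b = sum phi(s) r; reg adds a small ridge term for numerical stability.
    A = reg * np.eye(n_features)
    b = np.zeros(n_features)
    for s, r, s_next in trajectory:
        A += np.outer(phi(s), phi(s) - gamma * phi(s_next))
        b += phi(s) * r
    return np.linalg.solve(A, b)

# Toy usage: a deterministic 3-state cycle with a single rewarded transition,
# one-hot features, and a trajectory given as (state, reward, next state) triples.
phi = lambda s: np.eye(3)[s]
trajectory = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, 0)] * 200
print(td0_online(trajectory, phi, 3))
print(lstd_batch(trajectory, phi, 3))

Both procedures can be read as solving an estimating equation built from the TD error; the semiparametric framework of the paper makes this viewpoint precise and identifies the estimating function whose solution has the smallest asymptotic variance.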


reference text

H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, 1974.
S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
S. Amari and J. F. Cardoso. Blind source separation-semiparametric statistical approach. IEEE Transactions on Signal Processing, 45(11):2692–2700, 1997.
S. Amari and M. Kawanabe. Information geometry of estimating functions in semi-parametric statistical models. Bernoulli, 3(1):29–54, 1997.
S. Amari, H. Park, and K. Fukumizu. Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural Computation, 12(6):1399–1409, 2000.
L. Baird. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Machine Learning, pages 30–37, 1995.
D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
P. J. Bickel, C. A. Klaassen, Y. Ritov, and J. A. Wellner. Efficient and Adaptive Estimation for Semiparametric Models. Springer, 1998.
P. Billingsley. The Lindeberg-Lévy theorem for martingales. Proceedings of the American Mathematical Society, 12(5):788–792, 1961.
P. Billingsley. Probability and Measure. John Wiley and Sons, 1995.
L. Bottou and Y. LeCun. Large scale online learning. In Advances in Neural Information Processing Systems 16, 2004.
L. Bottou and Y. LeCun. On-line learning for very large datasets. Applied Stochastic Models in Business and Industry, 21(4):137–151, 2005.
J. A. Boyan. Technical update: Least-squares temporal difference learning. Machine Learning, 49(2):233–246, 2002.
R. C. Bradley. Basic properties of strong mixing conditions. A survey and some open questions. Probability Surveys, 2:107–144, 2005.
S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1):33–57, 1996.
R. H. Crites and A. G. Barto. Improving elevator performance using reinforcement learning. In Advances in Neural Information Processing Systems 8, pages 1017–1023, 1996.
A. Geramifard, M. Bowling, and R. S. Sutton. Incremental least-squares temporal difference learning. In Proceedings of the 21st National Conference on Artificial Intelligence, pages 356–361. AAAI Press, 2006.
A. Geramifard, M. Bowling, M. Zinkevich, and R. S. Sutton. iLSTD: Eligibility traces and convergence analysis. In Advances in Neural Information Processing Systems 19, pages 441–448, 2007.
V. P. Godambe. An optimum property of regular maximum likelihood estimation. The Annals of Mathematical Statistics, 31(4):1208–1211, 1960.
V. P. Godambe. The foundations of finite sample estimation in stochastic processes. Biometrika, 72(2):419–428, 1985.
V. P. Godambe, editor. Estimating Functions. Oxford University Press, 1991.
S. Grünewälder and K. Obermayer. Optimality of LSTD and its relation to TD and MC. Technical report, Berlin University of Technology, 2006.
R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1985.
P. J. Huber and E. M. Ronchetti. Robust Statistics. John Wiley and Sons, 2009.
I. A. Ibragimov and Yu. V. Linnik. Independent and Stationary Sequences of Random Variables. Wolters-Noordhoff, 1971.
M. Kawanabe and K. Müller. Estimating functions for blind separation when sources have variance dependencies. Journal of Machine Learning Research, 6(1):453–482, 2005.
V. R. Konda. Actor-Critic Algorithms. PhD thesis, Massachusetts Institute of Technology, 2002.
S. Konishi and G. Kitagawa. Generalised information criteria in model selection. Biometrika, 83(4):875–890, 1996.
N. Le Roux, P. A. Manzagol, and Y. Bengio. Topmoumoute online natural gradient algorithm. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20. MIT Press, 2008.
P. Liang and M. I. Jordan. An asymptotic analysis of generative, discriminative, and pseudolikelihood estimators. In Proceedings of the 25th International Conference on Machine Learning, pages 584–591, 2008.
S. Mahadevan and M. Maggioni. Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes. Journal of Machine Learning Research, 8:2169–2231, 2007.
S. Mannor, D. Simester, P. Sun, and J. N. Tsitsiklis. Bias and variance in value function estimation. In Proceedings of the 21st International Conference on Machine Learning, pages 308–322, 2004.
S. Mannor, D. Simester, P. Sun, and J. N. Tsitsiklis. Bias and variance approximation in value function estimates. Management Science, 53(2):308–322, 2007.
N. Murata and S. Amari. Statistical analysis of learning dynamics. Signal Processing, 74(1):3–28, 1999.
A. Nedić and D. P. Bertsekas. Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dynamic Systems, 13(1):79–110, 2003.
J. Neveu. Discrete-parameter Martingales. Elsevier, 1975.
S. Singh and D. P. Bertsekas. Reinforcement learning for dynamic channel allocation in cellular telephone systems. In Advances in Neural Information Processing Systems 9, pages 974–980, 1997.
S. Singh and P. Dayan. Analytical mean squared error curves for temporal difference learning. Machine Learning, 32(1):5–40, 1998.
M. Sørensen. On asymptotics of estimating functions. Brazilian Journal of Probability and Statistics, 13(2):419–428, 1999.
R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.
R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvári, and E. Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th International Conference on Machine Learning, pages 993–1000, 2009a.
R. S. Sutton, C. Szepesvári, and H. R. Maei. A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems 21, 2009b.
G. Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–68, 1995.
A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 2000.
W. Wefelmeyer. Quasi-likelihood models and optimal inference. The Annals of Statistics, 24(1):405–422, 1996.
H. Yu and D. P. Bertsekas. Convergence results for some temporal difference methods based on least squares. Technical report, LIDS Report 2697, 2006.
W. Zhang and T. G. Dietterich. A reinforcement learning approach to job-shop scheduling. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, pages 1114–1120, 1995.