jmlr jmlr2013 jmlr2013-28 jmlr2013-28-reference knowledge-graph by maker-knowledge-mining

28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning


Source: pdf

Author: Wendelin Böhmer, Steffen Grünewälder, Yun Shen, Marek Musial, Klaus Obermayer

Abstract: Linear reinforcement learning (RL) algorithms like least-squares temporal difference learning (LSTD) require basis functions that span approximation spaces of potential value functions. This article investigates methods to construct these bases from samples. We hypothesize that an ideal approximation space should encode diffusion distances and that slow feature analysis (SFA) constructs such spaces. To validate our hypothesis we provide theoretical statements about the LSTD value approximation error and the induced metric of approximation spaces constructed by SFA and the state-of-the-art methods Krylov bases and proto-value functions (PVF). In particular, we prove that SFA minimizes the average (over all tasks in the same environment) bound on the above approximation error. Compared to other methods, SFA is very sensitive to sampling and can sometimes fail to encode the whole state space. We derive a novel importance sampling modification to compensate for this effect. Finally, the LSTD and least-squares policy iteration (LSPI) performance of approximation spaces constructed by Krylov bases, PVF, SFA and PCA is compared in benchmark tasks and a visual robot navigation experiment (both in a realistic simulation and with a robot). The results support our hypothesis and suggest that (i) SFA provides subspace-invariant features for MDPs with self-adjoint transition operators, which allows strong guarantees on the approximation error, (ii) the modified SFA algorithm is best suited for LSPI in both discrete and continuous state spaces and (iii) approximation spaces encoding diffusion distances facilitate LSPI performance.

Keywords: reinforcement learning, diffusion distance, proto-value functions, slow feature analysis, least-squares policy iteration, visual robot navigation

© 2013 Wendelin Böhmer, Steffen Grünewälder, Yun Shen, Marek Musial and Klaus Obermayer.
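The abstract's pipeline can be made concrete with a minimal sketch (Python, not the authors' implementation): batch linear slow feature analysis (SFA) extracts a basis from an observation trajectory, and LSTD fits value-function weights in that basis. Function names (sfa_basis, lstd_weights), the discount factor, the regularizer and the random stand-in data are illustrative assumptions; the paper's importance-sampling modification of SFA and the full LSPI loop are not shown.

import numpy as np

def sfa_basis(X, p):
    """Batch linear SFA: return a (d x p) matrix mapping centered observations to the p slowest features."""
    Xc = X - X.mean(axis=0)                      # center the trajectory (T x d)
    cov = Xc.T @ Xc / len(Xc)                    # covariance of observations
    evals, evecs = np.linalg.eigh(cov)
    keep = evals > 1e-10                         # drop (near-)degenerate directions
    W = evecs[:, keep] / np.sqrt(evals[keep])    # whitening matrix: Xc @ W has identity covariance
    dZ = np.diff(Xc @ W, axis=0)                 # temporal differences of the whitened signal
    dcov = dZ.T @ dZ / len(dZ)                   # covariance of the differences
    _, dvecs = np.linalg.eigh(dcov)              # eigh sorts ascending: slowest directions first
    return W @ dvecs[:, :p]                      # centered observation -> p slow features

def lstd_weights(Phi, Phi_next, r, gamma=0.95, reg=1e-6):
    """LSTD: solve A w = b with A = Phi^T (Phi - gamma Phi') and b = Phi^T r."""
    A = Phi.T @ (Phi - gamma * Phi_next) + reg * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ r)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 20))          # stand-in for an observed state trajectory x_1..x_T
    r = rng.standard_normal(999)                 # stand-in rewards for transitions (x_t, x_{t+1})
    B = sfa_basis(X, p=5)                        # approximation space constructed by SFA
    F = (X - X.mean(axis=0)) @ B                 # slow features of every state
    w = lstd_weights(F[:-1], F[1:], r)           # value weights: V(x) ~= phi(x)^T w
    print("estimated value of the first state:", F[0] @ w)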


reference text

P. Abbeel, A. Coates, M. Quigley, and A. Y. Ng. An application of reinforcement learning to aerobatic helicopter flight. In Advances in Neural Information Processing Systems, pages 1–8, 2007.
M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
Footnote 46: If Pπ is a transition kernel, Jensen's inequality (e.g., Boyd and Vandenberghe, 2004) gives (∫ Pπ(dy|x) f(y))² ≤ ∫ Pπ(dy|x) f²(y), ∀x ∈ Z, ∀f ∈ L²(Z, ξ).
P. Berkes and L. Wiskott. Slow feature analysis yields a rich repertoire of complex cell properties. Journal of Vision, 5:579–602, 2005.
D. P. Bertsekas. Dynamic Programming and Optimal Control, volume 2. Athena Scientific, 3rd edition, 2007.
D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
W. Böhmer, S. Grünewälder, H. Nickisch, and K. Obermayer. Generating feature spaces for linear algorithms with regularized sparse kernel slow feature analysis. Machine Learning, 89(1-2):67–86, 2012.
M. Bowling, A. Ghodsi, and D. Wilkinson. Action respecting embedding. In International Conference on Machine Learning, 2005.
J. A. Boyan and A. W. Moore. Generalization in reinforcement learning: safely approximating the value function. In Advances in Neural Information Processing Systems, pages 369–376, 1995.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1/2/3):33–57, 1996.
R. Coifman, S. Lafon, A. Lee, M. Maggioni, B. Nadler, F. Warner, and S. Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data. Part I: diffusion maps. Proceedings of the National Academy of Sciences, 102(21):7426–7431, May 2005.
L. Csató and M. Opper. Sparse on-line Gaussian processes. Neural Computation, 14(3):641–668, 2002.
E. Brian Davies. Linear Operators and their Spectra. Cambridge University Press, 2007.
A. J. Davison. Real-time simultaneous localization and mapping with a single camera. In IEEE International Conference on Computer Vision, volume 2, page 1403, 2003.
D. P. de Farias and B. Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51(6):850–865, 2003.
Y. Engel, S. Mannor, and R. Meir. Bayes meets Bellman: the Gaussian process approach to temporal difference learning. In International Conference on Machine Learning, pages 154–161, 2003.
K. Ferguson and S. Mahadevan. Proto-transfer learning in Markov decision processes using spectral methods. In ICML Workshop on Transfer Learning, 2006.
E. Ferrante, A. Lazaric, and M. Restelli. Transfer of task representation in reinforcement learning using policy-based proto-value functions. In International Joint Conference on Autonomous Agents and Multiagent Systems, 2008.
P. Földiák. Learning invariance from transformation sequences. Neural Computation, 3(2):194–200, 1991.
M. Franzius, H. Sprekeler, and L. Wiskott. Slowness and sparseness lead to place, head-direction, and spatial-view cells. PLoS Computational Biology, 3(8):e166, 2007.
S. Grünewälder and K. Obermayer. The optimal unbiased value estimator and its relation to LSTD, TD and MC. Machine Learning, 83:289–330, 2011.
C. Guestrin, D. Koller, and R. Parr. Max-norm projections for factored MDPs. In International Joint Conference on Artificial Intelligence, pages 673–682, 2001.
C. Guestrin, M. Hauskrecht, and B. Kveton. Solving factored MDPs with continuous and discrete variables. In Uncertainty in Artificial Intelligence, pages 235–242, 2004.
M. Hauskrecht and B. Kveton. Linear program approximations for factored continuous-state Markov decision processes. In Advances in Neural Information Processing Systems, pages 895–902, 2003.
S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, 2nd edition, 1998. ISBN 978-0132733502.
G. E. Hinton and S. Osindero. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006.
H. Holden, B. Øksendal, J. Ubøe, and T. Zhang. Stochastic Partial Differential Equations. Springer Science+Business Media, 2nd edition, 2010.
O. C. Jenkins and M. J. Mataric. A spatio-temporal extension to Isomap nonlinear dimension reduction. In International Conference on Machine Learning, 2004.
S. T. Jensen and A. Rahbek. On the law of large numbers for (geometrically) ergodic Markov chains. Economic Theory, 23:761–766, 2007.
S. Jodogne and J. H. Piater. Closed-loop learning of visual control policies. Journal of Artificial Intelligence Research, 28:349–391, 2007.
L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: a survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99–134, 1998.
J. Kober and J. Peters. Policy search for motor primitives in robotics. In Advances in Neural Information Processing Systems, 2009.
D. Koller and R. Parr. Computing factored value functions for policies in structured MDPs. In International Joint Conference on Artificial Intelligence, pages 1332–1339, 1999.
D. Koller and R. Parr. Policy iteration for factored MDPs. In Uncertainty in Artificial Intelligence, pages 326–334, 2000.
V. R. Kompella, M. D. Luciw, and J. Schmidhuber. Incremental slow feature analysis: adaptive low-complexity slow feature updating from high-dimensional input streams. Neural Computation, 24(11):2994–3024, 2012.
G. D. Konidaris, S. Osentoski, and P. S. Thomas. Value function approximation in reinforcement learning using the Fourier basis. In Proceedings of the Twenty-Fifth Conference on Artificial Intelligence, 2011.
M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.
S. Lange and M. Riedmiller. Deep auto-encoder neural networks in reinforcement learning. In International Joint Conference on Neural Networks, pages 1–8, 2010.
R. Legenstein, N. Wilbert, and L. Wiskott. Reinforcement learning on slow features of high-dimensional input streams. PLoS Computational Biology, 6(8):e1000894, 2010.
M. L. Littman, R. S. Sutton, and S. Singh. Predictive representations of state. In Advances in Neural Information Processing Systems 14, 2001.
D. G. Lowe. Object recognition from local scale-invariant features. In International Conference on Computer Vision, 1999.
M. Luciw and J. Schmidhuber. Low complexity proto-value function learning from sensory observations with incremental slow feature analysis. In International Conference on Artificial Neural Networks and Machine Learning, volume III, pages 279–287. Springer-Verlag, 2012.
S. Mahadevan and B. Liu. Basis construction from power series expansions of value functions. In Advances in Neural Information Processing Systems, pages 1540–1548, 2010.
S. Mahadevan and M. Maggioni. Proto-value functions: a Laplacian framework for learning representations and control in Markov decision processes. Journal of Machine Learning Research, 8:2169–2231, 2007.
S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41:3397–3415, 1993.
N. Mehta, S. Natarajan, P. Tadepalli, and A. Fern. Transfer in variable-reward hierarchical reinforcement learning. Machine Learning, 73:289–312, 2008.
R. Parr, C. Painter-Wakefield, L. Li, and M. Littman. Analyzing feature generation for value-function approximation. In International Conference on Machine Learning, 2007.
R. Parr, L. Li, G. Taylor, C. Painter-Wakefield, and M. L. Littman. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In International Conference on Machine Learning, 2008.
K. Pearson. On lines and planes of closest fit to systems of points in space. Philosophical Magazine Series 6, 2(11):559–572, 1901.
M. Petrik. An analysis of Laplacian methods for value function approximation in MDPs. In International Joint Conference on Artificial Intelligence, pages 2574–2579, 2007.
M. Petrik and S. Zilberstein. Robust approximate bilinear programming for value function approximation. Journal of Machine Learning Research, 12:3027–3063, 2011.
C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
M. Reed and B. Simon. Methods of Modern Mathematical Physics I: Functional Analysis. Academic Press, 1980. ISBN 0-12-585050-6.
B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002. ISBN 978-0262194754.
B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.
R. Smith, M. Self, and P. Cheeseman. Estimating uncertain spatial relationships in robotics. In Autonomous Robot Vehicles. Springer-Verlag, 1990.
A. J. Smola and B. Schölkopf. Sparse greedy matrix approximation for machine learning. In Proceedings of the 17th International Conference on Machine Learning, pages 911–918, 2000.
M. Snel and S. Whiteson. Multi-task reinforcement learning: shaping and feature selection. In European Workshop on Reinforcement Learning, pages 237–248, 2011.
N. Sprague. Predictive projections. In International Joint Conference on Artificial Intelligence, pages 1223–1229, 2009.
H. Sprekeler. On the relationship of slow feature analysis and Laplacian eigenmaps. Neural Computation, 23(12):3287–3302, 2011.
Y. Sun, F. Gomez, M. Ring, and J. Schmidhuber. Incremental basis construction from temporal difference error. In International Conference on Machine Learning, pages 481–488, 2011.
R. S. Sutton. Generalization in reinforcement learning: successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems, pages 1038–1044, 1996.
R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
M. E. Taylor and P. Stone. Transfer learning for reinforcement learning domains: a survey. Journal of Machine Learning Research, 10:1633–1685, 2009.
J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global framework for nonlinear dimensionality reduction. Science, 290:2319–2323, 2000.
S. Thrun and A. Schwartz. Issues in using function approximation for reinforcement learning. In Proceedings of the Fourth Connectionist Models Summer School, 1993.
J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.
G. Wahba. Spline Models for Observational Data. Society for Industrial and Applied Mathematics, 1990.
C. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279–292, 1992.
D. Wingate. Predictively defined representations of state. In M. Wiering and M. van Otterlo, editors, Reinforcement Learning: State-of-the-Art, pages 415–439. Springer-Verlag Berlin Heidelberg, 2012.
D. Wingate and S. P. Singh. On discovery and learning of models with predictive representations of state for agents with continuous actions and observations. In International Joint Conference on Autonomous Agents and Multiagent Systems, pages 1128–1135, 2007.
L. Wiskott. Slow feature analysis: a theoretical analysis of optimal free responses. Neural Computation, 15(9):2147–2177, 2003.
L. Wiskott and T. Sejnowski. Slow feature analysis: unsupervised learning of invariances. Neural Computation, 14(4):715–770, 2002.
X. Xu. A sparse kernel-based least-squares temporal difference algorithm for reinforcement learning. In Advances in Natural Computation, volume 4221 of Lecture Notes in Computer Science, pages 47–56. Springer Berlin / Heidelberg, 2006.