Authors: Doron Blatt, Alfred O. Hero
Abstract: This paper proposes an algorithm that converts a T-stage stochastic decision problem with a continuous state space into a sequence of supervised learning problems. The optimization problem associated with the trajectory tree and random trajectory methods of Kearns, Mansour, and Ng (2000) is solved using the Gauss-Seidel method. The algorithm breaks the multistage reinforcement learning problem into a sequence of single-stage reinforcement learning subproblems, each of which is solved via an exact reduction to a weighted-classification problem that can be handled with off-the-shelf methods. The algorithm thus converts a reinforcement learning problem into simpler supervised learning subproblems. It is shown that the method converges in a finite number of steps to a solution that cannot be further improved by componentwise optimization. An implication of the proposed algorithm is that a plethora of classification methods can be applied to find policies for the reinforcement learning problem.
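The sketch below illustrates the kind of componentwise (Gauss-Seidel) sweep the abstract describes: cycling over the T stages and refitting each stage's policy as a weighted-classification problem while the policies at the other stages are held fixed. It is not the authors' implementation; the rollout function `simulate`, the regret-style sample weights, and the use of a scikit-learn decision tree as the off-the-shelf weighted classifier are all illustrative assumptions.

```python
# Minimal sketch of a Gauss-Seidel sweep over stage-wise policies, where each
# single-stage subproblem is posed as weighted classification.  All helper
# names and the weighting scheme are assumptions, not the paper's definitions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # stand-in off-the-shelf weighted classifier


def estimate_returns(stage, states, policies, simulate, n_actions, T):
    """Assumed helper: estimate the return of each action at `stage` for each
    sampled state, following the currently fixed policies at later stages.
    `simulate(s, a, stage, policies, T)` is a user-supplied rollout that is
    expected to act with policies[stage+1:].predict(...) downstream."""
    returns = np.zeros((len(states), n_actions))
    for i, s in enumerate(states):
        for a in range(n_actions):
            returns[i, a] = simulate(s, a, stage, policies, T)
    return returns


def gauss_seidel_policy_search(states_per_stage, simulate, n_actions, T, n_sweeps=5):
    """Cycle over the T stages; at each stage, reduce the single-stage problem
    to weighted classification: label = empirically best action, weight = how
    much that action beats the average (one possible regret-style choice)."""
    # Trivial initial policies: always predict action 0.
    policies = [
        DecisionTreeClassifier(max_depth=3).fit(
            states_per_stage[t], np.zeros(len(states_per_stage[t]), dtype=int))
        for t in range(T)
    ]
    for _ in range(n_sweeps):
        for t in range(T):
            X = states_per_stage[t]
            R = estimate_returns(t, X, policies, simulate, n_actions, T)
            labels = R.argmax(axis=1)                        # best action per state
            weights = R.max(axis=1) - R.mean(axis=1) + 1e-12  # cost of misclassifying
            policies[t] = DecisionTreeClassifier(max_depth=3).fit(
                X, labels, sample_weight=weights)
    return policies
```

Any classifier that accepts per-example weights could replace the decision tree here, which is the point made in the abstract: once each stage is reduced to weighted classification, standard supervised learning tools apply.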
[1] N. Abe, B. Zadrozny, and J. Langford. An iterative method for multi-class cost-sensitive learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 3–11, 2004.
[2] J. Bagnell, S. Kakade, A. Ng, and J. Schneider. Policy search by dynamic programming. In Advances in Neural Information Processing Systems, volume 16. MIT Press, 2003.
[3] A. G. Barto and T. G. Dietterich. Reinforcement learning and its relationship to supervised learning. In J. Si, A. Barto, W. Powell, and D. Wunsch, editors, Handbook of learning and approximate dynamic programming. John Wiley & Sons, Inc., 2004.
[4] A. Fern, S. Yoon, and R. Givan. Approximate policy iteration with a policy language bias. In Advances in Neural Information Processing Systems, volume 16, 2003.
[5] M. Kearns, Y. Mansour, and A. Ng. Approximate planning in large POMDPs via reusable trajectories. In Advances in Neural Information Processing Systems, volume 12. MIT Press, 2000.
[6] M. Lagoudakis and R. Parr. Reinforcement learning as classification: Leveraging modern classifiers. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.
[7] J. Langford and A. Beygelzimer. Sensitive error correcting output codes. In Proceedings of the 18th Annual Conference on Learning Theory, pages 158–172, 2005.
[8] J. Langford and B. Zadrozny. Reducing T-step reinforcement learning to classification. http://hunch.net/~jl/projects/reductions/reductions.html, 2003.
[9] J. Langford and B. Zadrozny. Relating reinforcement learning performance to classification performance. In Proceedings of the Twenty-Second International Conference on Machine Learning, pages 473–480, 2005.
[10] M. L. Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, Inc., 1994.