Authors: Doron Blatt, Alfred O. Hero
Abstract: This paper proposes an algorithm that converts a T-stage stochastic decision problem with a continuous state space into a sequence of supervised learning problems. The optimization problem associated with the trajectory tree and random trajectory methods of Kearns, Mansour, and Ng (2000) is solved using the Gauss-Seidel method. The algorithm breaks the multistage reinforcement learning problem into a sequence of single-stage reinforcement learning subproblems, each of which is solved via an exact reduction to a weighted-classification problem that can be handled with off-the-shelf methods. The algorithm thus converts a reinforcement learning problem into simpler supervised learning subproblems. It is shown that the method converges in a finite number of steps to a solution that cannot be further improved by componentwise optimization. An implication of the proposed algorithm is that a plethora of classification methods can be applied to find policies for the reinforcement learning problem.
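The sketch below illustrates the kind of componentwise (Gauss-Seidel) sweep the abstract describes: cycling over the T stages and refitting each stage's policy as a weighted-classification problem while the policies at the other stages are held fixed. It is not the authors' implementation; the rollout function `simulate`, the regret-style sample weights, and the use of a scikit-learn decision tree as the off-the-shelf weighted classifier are all illustrative assumptions.

```python
# Minimal sketch of a Gauss-Seidel sweep over stage-wise policies, where each
# single-stage subproblem is posed as weighted classification.  All helper
# names and the weighting scheme are assumptions, not the paper's definitions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # stand-in off-the-shelf weighted classifier


def estimate_returns(stage, states, policies, simulate, n_actions, T):
    """Assumed helper: estimate the return of each action at `stage` for each
    sampled state, following the currently fixed policies at later stages.
    `simulate(s, a, stage, policies, T)` is a user-supplied rollout that is
    expected to act with policies[stage+1:].predict(...) downstream."""
    returns = np.zeros((len(states), n_actions))
    for i, s in enumerate(states):
        for a in range(n_actions):
            returns[i, a] = simulate(s, a, stage, policies, T)
    return returns


def gauss_seidel_policy_search(states_per_stage, simulate, n_actions, T, n_sweeps=5):
    """Cycle over the T stages; at each stage, reduce the single-stage problem
    to weighted classification: label = empirically best action, weight = how
    much that action beats the average (one possible regret-style choice)."""
    # Trivial initial policies: always predict action 0.
    policies = [
        DecisionTreeClassifier(max_depth=3).fit(
            states_per_stage[t], np.zeros(len(states_per_stage[t]), dtype=int))
        for t in range(T)
    ]
    for _ in range(n_sweeps):
        for t in range(T):
            X = states_per_stage[t]
            R = estimate_returns(t, X, policies, simulate, n_actions, T)
            labels = R.argmax(axis=1)                        # best action per state
            weights = R.max(axis=1) - R.mean(axis=1) + 1e-12  # cost of misclassifying
            policies[t] = DecisionTreeClassifier(max_depth=3).fit(
                X, labels, sample_weight=weights)
    return policies
```

Any classifier that accepts per-example weights could replace the decision tree here, which is the point made in the abstract: once each stage is reduced to weighted classification, standard supervised learning tools apply.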
[1] N. Abe, B. Zadrozny, and J. Langford. An iterative method for multi-class cost-sensitive learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 3–11, 2004.
[2] J. Bagnell, S. Kakade, A. Ng, and J. Schneider. Policy search by dynamic programming. In Advances in Neural Information Processing Systems, volume 16. MIT Press, 2003.
[3] A. G. Barto and T. G. Dietterich. Reinforcement learning and its relationship to supervised learning. In J. Si, A. Barto, W. Powell, and D. Wunsch, editors, Handbook of learning and approximate dynamic programming. John Wiley & Sons, Inc., 2004.
[4] A. Fern, S. Yoon, and R. Givan. Approximate policy iteration with a policy language bias. In Advances in Neural Information Processing Systems, volume 16, 2003.
[5] M. Kearns, Y. Mansour, and A. Ng. Approximate planning in large POMDPs via reusable trajectories. In Advances in Neural Information Processing Systems, volume 12. MIT Press, 2000.
[6] M. Lagoudakis and R. Parr. Reinforcement learning as classification: Leveraging modern classifiers. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.
[7] J. Langford and A. Beygelzimer. Sensitive error correcting output codes. In Proceedings of the 18th Annual Conference on Learning Theory, pages 158–172, 2005.
[8] J. Langford and B. Zadrozny. Reducing T-step reinforcement learning to classification. http://hunch.net/~jl/projects/reductions/reductions.html, 2003.
[9] J. Langford and B. Zadrozny. Relating reinforcement learning performance to classification performance. In Proceedings of the Twenty-Second International Conference on Machine Learning, pages 473–480, 2005.
[10] M. L. Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, Inc., 1994.