
175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm


Source: pdf

Author: Xin Wang, Thomas G. Dietterich

Abstract: We address the problem of non-convergence of online reinforcement learning algorithms (e.g., Q-learning and SARSA(λ)) by adopting an incremental-batch approach that separates the exploration process from the function-fitting process. Our BFBP (Batch Fit to Best Paths) algorithm alternates between an exploration phase (during which trajectories are generated to try to find fragments of the optimal policy) and a function-fitting phase (during which a function approximator is fit to the best known paths from start states to terminal states). An advantage of this approach is that batch value-function fitting is a global process, which allows it to address tradeoffs in function approximation that cannot be handled by local, online algorithms. This approach was pioneered by Boyan and Moore with their Grow-Support and ROUT algorithms. We show how to improve upon their work by applying a better exploration process and by enriching the function-fitting procedure to incorporate Bellman error and advantage error measures into the objective function. The results show improved performance on several benchmark problems.
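
The following is a minimal Python sketch of the incremental-batch alternation the abstract describes, not the paper's implementation. The environment interface (reset/successors), the epsilon-greedy rollout policy, and the tabular value table standing in for a function approximator are illustrative assumptions; the Bellman-error and advantage-error terms of the actual fitting objective are omitted.

import random

def bfbp_sketch(env, n_iterations=50, episodes_per_iter=100, epsilon=0.2):
    # Tabular value table as a stand-in for the paper's function approximator.
    value = {}          # state -> estimated return-to-go
    best_paths = {}     # start state -> (trajectory, total return)

    for _ in range(n_iterations):
        # Exploration phase: epsilon-greedy rollouts under the current estimate,
        # keeping the best known path from each start state.
        for _ in range(episodes_per_iter):
            path, ret = rollout(env, value, epsilon)
            start = path[0][0]
            if start not in best_paths or ret > best_paths[start][1]:
                best_paths[start] = (path, ret)

        # Fitting phase: batch-fit the value table to the returns observed
        # along the best known paths (a global, batch fit).
        value = {}
        for path, _ in best_paths.values():
            ret_to_go = 0.0
            for state, reward in reversed(path):
                ret_to_go += reward
                if state not in value or ret_to_go > value[state]:
                    value[state] = ret_to_go
    return value

def rollout(env, value, epsilon, max_steps=500):
    # One trajectory: greedy one-step lookahead on the current value estimate,
    # with epsilon-random exploration. env.successors(state) is assumed to
    # return a list of (action, reward, next_state, done) tuples.
    state, done, path, total = env.reset(), False, [], 0.0
    for _ in range(max_steps):
        choices = env.successors(state)
        if random.random() < epsilon:
            _, reward, nxt, done = random.choice(choices)
        else:
            _, reward, nxt, done = max(choices, key=lambda c: c[1] + value.get(c[2], 0.0))
        path.append((state, reward))
        total += reward
        state = nxt
        if done:
            break
    return path, total

In the paper the fitting phase is a regression over a parametric approximator whose objective also weights Bellman error and advantage error, rather than the direct tabular assignment used in this sketch.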


reference text

Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In ICML-95, 30-37. Morgan Kaufmann, San Francisco, CA.

Baxter, J., & Bartlett, P. L. (2000). Reinforcement learning in POMDP's via direct gradient ascent. In ICML-2000, 41-48. Morgan Kaufmann, San Francisco, CA.

Boyan, J. A., & Moore, A. W. (1995). Generalization in reinforcement learning: Safely approximating the value function. In NIPS-7, 369-376. The MIT Press, Cambridge, MA.

Boyan, J. A., & Moore, A. W. (1996). Learning evaluation functions for large acyclic domains. In ICML-96, 63-70. Morgan Kaufmann, San Francisco, CA.

Gordon, G. J. (2001). Reinforcement learning with function approximation converges to a region. In NIPS-13, 1040-1046. The MIT Press, Cambridge, MA.

Harvey, W. D., & Ginsberg, M. L. (1995). Limited discrepancy search. In IJCAI-95, 825-830. Morgan Kaufmann, San Francisco, CA.

Konda, V. R., & Tsitsiklis, J. N. (2000). Actor-critic algorithms. In NIPS-12, 1008-1014. The MIT Press, Cambridge, MA.

Moll, R., Barto, A. G., Perkins, T. J., & Sutton, R. S. (1999). Learning instance-independent value functions to enhance local search. In NIPS-11, 1017-1023. The MIT Press, Cambridge, MA.

Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In NIPS-12, 1057-1063. The MIT Press, Cambridge, MA.

Sutton, R. S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. In NIPS-8, 1038-1044. The MIT Press, Cambridge, MA.

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229-256.

Zhang, W., & Dietterich, T. G. (1995). A reinforcement learning approach to job-shop scheduling. In IJCAI-95, 1114-1120. Morgan Kaufmann, San Francisco, CA.