nips2001-187: reference knowledge graph (NIPS 2001)
Source: pdf
Author: Shie Mannor, Nahum Shimkin
Abstract: We consider the problem of learning to attain multiple goals in a dynamic environment, which is initially unknown. In addition, the environment may contain arbitrarily varying elements related to actions of other agents or to non-stationary moves of Nature. This problem is modelled as a stochastic (Markov) game between the learning agent and an arbitrary player, with a vector-valued reward function. The objective of the learning agent is to have its long-term average reward vector belong to a given target set. We devise an algorithm for achieving this task, which is based on the theory of approachability for stochastic games. This algorithm combines, in an appropriate way, a finite set of standard, scalar-reward learning algorithms. Sufficient conditions are given for the convergence of the learning algorithm to a general target set. The specialization of these results to the single-controller Markov decision problem is discussed as well.
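A minimal, self-contained sketch of the approachability idea the abstract builds on (not the paper's algorithm, which handles unknown stochastic games): a repeated matrix game with 2-dimensional payoffs and a box-shaped target set, where the agent steers its running average reward vector toward the target. The payoff tensor, the target box, and helper names such as `project_to_target` are hypothetical, chosen only for illustration.

```python
import numpy as np

# Illustrative Blackwell-style approachability in a repeated two-player game
# with vector-valued payoffs.  This simplifies the stochastic-game setting of
# the paper: the opponent picks an arbitrary column each round, and the agent
# steers its long-term average payoff vector toward a convex target set.
# All numbers below are hypothetical.

rng = np.random.default_rng(0)

# payoff[a, b] is the 2-dimensional reward when the agent plays a and the
# opponent plays b.
payoff = np.array([
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [1.0, 0.0]],
])

def project_to_target(x, lo=0.4, hi=1.0):
    """Euclidean projection onto the target set, here the box [lo, hi]^2."""
    return np.clip(x, lo, hi)

avg = np.zeros(2)  # running average reward vector
for t in range(1, 5001):
    # Steering direction: from the current average toward its projection on
    # the target set (zero once the average is already inside the target).
    direction = project_to_target(avg) - avg
    if np.linalg.norm(direction) < 1e-12:
        a = rng.integers(payoff.shape[0])      # inside the target: any action
    else:
        # Scalarize the vector payoff along the steering direction and play a
        # security (max-min) action for that scalar game.  This step is where
        # the paper instead invokes a standard scalar-reward learning
        # algorithm, since the game parameters are unknown there.
        scalar = payoff @ direction            # shape: (agent actions, opponent actions)
        a = int(np.argmax(scalar.min(axis=1)))
    b = rng.integers(payoff.shape[1])          # arbitrary opponent move
    avg += (payoff[a, b] - avg) / t            # incremental average update

print("long-run average reward vector:", avg)
print("distance to target set:", np.linalg.norm(project_to_target(avg) - avg))
```

In this sketch the per-direction max-min action is computed directly from the known payoff matrix; in the paper's setting the payoffs and transition structure are unknown, so each such scalarized game is handled by one of the finitely many scalar-reward learning algorithms that the proposed method combines.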
[1] J. Abounadi, D. Bertsekas, and V. Borkar. Learning algorithms for Markov decision processes with average cost. LIDS-P 2434, Lab. for Information and Decision Systems, MIT, October 1998.
[2] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[3] D. Blackwell. An analog of the minimax theorem for vector payoffs. Pacific J. Math., 6(1):1–8, 1956.
[4] R.I. Brafman and M. Tennenholtz. A near-optimal polynomial time algorithm for learning in certain classes of stochastic games. Artificial Intelligence, 121(1-2):31–47, April 2000.
[5] C. Derman. Finite State Markovian Decision Processes. Academic Press, 1970.
[6] J. Filar and K. Vrieze. Competitive Markov Decision Processes. Springer Verlag, 1996.
[7] L.P. Kaelbling, M. Littman, and A.W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, May 1996.
[8] M. Kearns and S. Singh. Near-optimal reinforcement learning in polynomial time. In Proc. of the 15th Int. Conf. on Machine Learning, pages 260–268. Morgan Kaufmann, 1998.
[9] M.L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 157–163. Morgan Kaufmann, 1994.
[10] S. Mahadevan. Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, 22(1):159–196, 1996.
[11] S. Mannor and N. Shimkin. The empirical Bayes envelope approach to regret minimization in stochastic games. Technical Report EE-1262, Faculty of Electrical Engineering, Technion, Israel, October 2000.
[12] J.F. Mertens and A. Neyman. Stochastic games. International Journal of Game Theory, 10(2):53–66, 1981.
[13] A. Schwartz. A reinforcement learning method for maximizing undiscounted rewards. In Proceedings of the Tenth International Conference on Machine Learning, pages 298–305. Morgan Kaufmann, 1993.
[14] N. Shimkin and A. Shwartz. Guaranteed performance regions in Markovian systems with competing decision makers. IEEE Trans. on Automatic Control, 38(1):84–95, January 1993.