jmlr2009-67
Author: Shie Mannor, John N. Tsitsiklis, Jia Yuan Yu
Abstract: We study online learning where a decision maker interacts with Nature with the objective of maximizing her long-term average reward subject to some sample path average constraints. We define the reward-in-hindsight as the highest reward the decision maker could have achieved, while satisfying the constraints, had she known Nature's choices in advance. We show that in general the reward-in-hindsight is not attainable. The convex hull of the reward-in-hindsight function is, however, attainable. For the important case of a single constraint, the convex hull turns out to be the highest attainable function. Using a calibrated forecasting rule, we provide an explicit strategy that attains this convex hull. We also measure the performance of heuristic methods based on non-calibrated forecasters in experiments involving a CPU power management problem.
Keywords: online learning, calibration, regret minimization, approachability
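For concreteness, the reward-in-hindsight described in the abstract can be written down explicitly in the single-constraint matrix-game setting. The notation below (reward function r, cost function c, budget c_0, mixed action p, empirical frequency q of Nature's choices) is a reconstruction from the abstract's definitions, not quoted from the paper:

r^*(q) = \max_{p \in \Delta(A)} \sum_{a \in A} \sum_{b \in B} p(a)\, q(b)\, r(a, b)
\quad \text{subject to} \quad \sum_{a \in A} \sum_{b \in B} p(a)\, q(b)\, c(a, b) \le c_0,

where \Delta(A) is the simplex of mixed actions for the decision maker and q \in \Delta(B) is the empirical distribution of Nature's choices over the sample path. The attainable object identified in the abstract is then the convex hull of r^*, evaluated at q.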
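For a fixed empirical distribution q, the maximization above is a linear program over p (cf. the Bertsimas and Tsitsiklis reference below), so r^*(q) is straightforward to evaluate numerically. The sketch below is illustrative only: the matrices, the budget, and the use of scipy.optimize.linprog are assumptions of this note, not the paper's method.

# Sketch (assumed setup, not from the paper): evaluate the
# reward-in-hindsight r*(q) for a fixed empirical distribution q of
# Nature's actions by solving the induced linear program over mixed
# actions p.
import numpy as np
from scipy.optimize import linprog

def reward_in_hindsight(R, C, q, c0):
    """max_p p @ R @ q  subject to  p @ C @ q <= c0  and  p in the simplex."""
    r_q = R @ q  # expected reward of each pure action against q
    c_q = C @ q  # expected cost of each pure action against q
    n = len(r_q)
    res = linprog(
        c=-r_q,                            # linprog minimizes, so negate
        A_ub=c_q[None, :], b_ub=[c0],      # single average-cost constraint
        A_eq=np.ones((1, n)), b_eq=[1.0],  # p sums to one
        bounds=[(0.0, None)] * n,          # p is nonnegative
    )
    return -res.fun if res.success else None  # None if the LP is infeasible

# Toy example with hypothetical 2x2 reward/cost matrices:
R = np.array([[1.0, 0.0], [0.5, 0.5]])
C = np.array([[1.0, 1.0], [0.2, 0.2]])
q = np.array([0.6, 0.4])
print(reward_in_hindsight(R, C, q, c0=0.5))

In this toy instance the cost constraint binds, so the optimal mixed action is interior to the simplex; evaluating r^* on a grid of q values is one way to visualize why its convex hull, rather than r^* itself, is the attainable object.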
References:
E. Altman. Constrained Markov Decision Processes. Chapman and Hall, 1999.
D. Bertsimas and J. N. Tsitsiklis. Introduction to Linear Optimization. Athena Scientific, 1997.
D. Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6(1):1–8, 1956a.
D. Blackwell. Controlled random walks. In Proceedings of the International Congress of Mathematicians 1954, volume 3, pages 336–338. North Holland, Amsterdam, 1956b.
N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
D. P. Foster and R. V. Vohra. Calibrated learning and correlated equilibrium. Games and Economic Behavior, 21:40–55, 1997.
J. Hannan. Approximation to Bayes risk in repeated play. In Contributions to the Theory of Games, volume III, pages 97–139. Princeton University Press, 1957.
E. Hazan and N. Megiddo. Online learning with prior information. In Proceedings of the 20th Annual Conference on Learning Theory, 2007.
B. Kveton, J. Y. Yu, G. Theocharous, and S. Mannor. Online learning with expert advice and finite-horizon constraints. In Proceedings of AAAI, 2008.
S. Mannor and N. Shimkin. The empirical Bayes envelope and regret minimization in competitive Markov decision processes. Mathematics of Operations Research, 28(2):327–345, 2003.
S. Mannor and N. Shimkin. A geometric approach to multi-criterion reinforcement learning. Journal of Machine Learning Research, 5:325–360, 2004.
S. Mannor, J. S. Shamma, and G. Arslan. Online calibrated forecasts: Memory efficiency versus universality for learning in games. Machine Learning, 67(1–2):77–115, 2007.
J. F. Mertens, S. Sorin, and S. Zamir. Repeated games. CORE Discussion Papers 9420, 9421, and 9422, Center for Operations Research and Econometrics, Université Catholique de Louvain, Belgium, 1994.
N. Shimkin. Stochastic games with average cost constraints. In T. Başar and A. Haurie, editors, Advances in Dynamic Games and Applications, pages 219–230. Birkhäuser, 1994.
X. Spinat. A necessary and sufficient condition for approachability. Mathematics of Operations Research, 27(1):31–44, 2002.
M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, 2003.