nips nips2013 nips2013-231 knowledge-graph by maker-knowledge-mining

231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries


Source: pdf

Author: Nicolò Cesa-Bianchi, Ofer Dekel, Ohad Shamir

Abstract: We study the power of different types of adaptive (nonoblivious) adversaries in the setting of prediction with expert advice, under both full-information and bandit feedback. We measure the player’s performance using a new notion of regret, also known as policy regret, which better captures the adversary’s adaptiveness to the player’s behavior. In a setting where losses are allowed to drift, we characterize —in a nearly complete manner— the power of adaptive adversaries with bounded memories and switching costs. In particular, we show that with switching costs, the attainable rate with bandit feedback is Θ̃(T^{2/3}). Interestingly, this rate is significantly worse than the Θ(√T) rate attainable with switching costs in the full-information case. Via a novel reduction from experts to bandits, we also show that a bounded memory adversary can force Θ̃(T^{2/3}) regret even in the full-information case, proving that switching costs are easier to control than bounded memory adversaries. Our lower bounds rely on a new stochastic adversary strategy that generates loss processes with strong dependencies.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 In a setting where losses are allowed to drift, we characterize —in a nearly complete manner— the power of adaptive adversaries with bounded memories and switching costs. [sent-3, score-0.689]

2 Via a novel reduction from experts to bandits, we also show that a bounded memory adversary can force Θ̃(T^{2/3}) regret even in the full-information case, proving that switching costs are easier to control than bounded memory adversaries. [sent-6, score-1.506]

3 Our lower bounds rely on a new stochastic adversary strategy that generates loss processes with strong dependencies. [sent-7, score-0.649]

4 Prediction with expert advice —see, e.g., [8]— is defined as the following repeated game, between a randomized player with a finite and fixed set of available actions and an adversary. [sent-10, score-0.563]

5 At the beginning of each round of the game, the adversary assigns a loss to each action. [sent-11, score-0.696]

6 Next, the player defines a probability distribution over the actions, draws an action from this distribution, and suffers the loss associated with that action. [sent-12, score-0.728]

7 Two versions of this game are typically considered: in the full-information feedback version, at the end of each round, the player observes the adversary’s assignment of loss values to each action. [sent-14, score-0.864]

8 In the bandit feedback version, the player only observes the loss associated with his chosen action, but not the loss values of other actions. [sent-15, score-1.151]

9 We assume that the adversary is adaptive (also called nonoblivious by [8] or reactive by [16]), which means that the adversary chooses the loss values on round t based on the player’s actions on rounds 1, …, t − 1. [sent-16, score-1.414]

10 In other words, the adversary can perform all of the calculations needed to specify, in advance, how he plans to react on each round to any sequence of actions chosen by the player. [sent-22, score-0.702]

11 We assume that the adversary defines, in advance, a sequence of history-dependent loss functions f1, f2, …. [sent-28, score-0.627]

12 The input to each loss function ft is the entire history of the player’s actions so far; therefore, the player’s loss on round t is ft(X1:t). [sent-32, score-0.971]

13 Note that the player doesn’t observe the functions ft , only the losses that result from his past actions. [sent-33, score-0.842]

14 Specifically, in the bandit feedback model, the player observes ft (X1:t ) on round t, whereas in the full-information model, the player observes ft (X1:t−1 , x) for all x ∈ A. [sent-34, score-2.049]

15 1 On any round T , we evaluate the player’s performance so far using the notion of regret, which compares his cumulative loss on the first T rounds to the cumulative loss of the best fixed action in hindsight. [sent-35, score-0.573]

16 Formally, the player’s regret on round T is defined as RT = Σ_{t=1}^{T} ft(X1:t) − min_{x∈A} Σ_{t=1}^{T} ft(x, …, x). [sent-36, score-0.925]
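
As an illustration of this policy-regret definition, the following sketch (not from the paper; loss_fns, played, and action_set are hypothetical names introduced here) computes RT by comparing the player's realized cumulative loss to the best fixed action replayed through the same history-dependent loss functions:

```python
# Illustrative sketch of the policy-regret definition above (not from the paper).
# loss_fns[t] is a callable f_t that takes the action history x_{1:t+1} (a list) and
# returns a real-valued loss; played[t] is the action actually chosen on round t+1.

def policy_regret(loss_fns, played, action_set):
    T = len(played)
    # Player's cumulative loss: sum_t f_t(X_{1:t}).
    player_loss = sum(loss_fns[t](played[: t + 1]) for t in range(T))
    # Best fixed action in hindsight: min_x sum_t f_t(x, ..., x).
    best_fixed = min(
        sum(loss_fns[t]([x] * (t + 1)) for t in range(T)) for x in action_set
    )
    return player_loss - best_fixed
```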

17 This definition is the same as the one used in [18] and [3] (in the latter, it is called policy regret), but differs from the more common definition of expected regret, E[ Σ_{t=1}^{T} ft(X1:t) − min_{x∈A} Σ_{t=1}^{T} ft(X1:t−1, x) ]. [sent-42, score-0.843]

18 Indeed, if the adversary is adaptive, the quantity ft(X1:t−1, x) is hardly interpretable —see [3] for a more detailed discussion. [sent-47, score-0.7]

19 The weakest adversary we discuss is the oblivious adversary, which determines the loss on round t based only on the current action Xt . [sent-51, score-1.058]

20 In other words, this adversary is oblivious to the player’s past actions. [sent-52, score-0.706]

21 Formally, the oblivious adversary is constrained to choose a sequence of loss functions that satisfies ∀t, ∀x1:t ∈ A^t, and ∀x′1:t−1 ∈ A^{t−1}: ft(x1:t) = ft(x′1:t−1, xt). [sent-53, score-1.492]

22 In this example, the investor is the player and the stock market is the adversary. [sent-59, score-0.601]

23 A stronger adversary is the oblivious adversary with switching costs. [sent-62, score-1.377]

24 This adversary is similar to the oblivious adversary defined above, but charges the player an additional switching cost of 1 whenever Xt ≠ Xt−1. [sent-63, score-1.841]

25 More formally, this adversary defines his sequence of loss functions in two steps: first he chooses an oblivious sequence of loss functions ℓ1, ℓ2, …. [sent-64, score-1.072]
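
A minimal sketch of this two-step construction (illustrative names, not the paper's code): the oblivious loss plus a unit charge whenever the action changes, which makes visible that the resulting ft depends only on xt and xt−1, as the next sentence notes.

```python
# Sketch of a switching-costs adversary built from an oblivious loss sequence,
# following the two-step description above (illustrative only).
# oblivious_losses[t] maps an action to its oblivious loss l_t(x).

def switching_cost_loss(oblivious_losses, t, history):
    x_t = history[-1]
    x_prev = history[-2] if len(history) >= 2 else x_t   # no switching charge on the first round
    switch = 1 if x_t != x_prev else 0
    return oblivious_losses[t](x_t) + switch
```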

26 The switching costs adversary defines ft to be a function of Xt and Xt−1 , and is therefore a special case of a more general adversary called an adaptive adversary with a memory of 1. [sent-78, score-2.1]

27 This adversary is constrained to choose loss functions that satisfy ∀t, ∀x1:t ∈ A^t, and ∀x′1:t−2 ∈ A^{t−2}: ft(x1:t) = ft(x′1:t−2, xt−1, xt). (5) [sent-79, score-1.199]

28 This adversary is more general than the switching costs adversary because his loss functions can depend on the previous action in an arbitrary way. [sent-80, score-1.499]

29 We can further strengthen this adversary and define the bounded memory adaptive adversary, which has a bounded memory of an arbitrary size. [sent-81, score-0.905]

30 In other words, this adversary is allowed to set his loss function based on the player’s m most recent past actions, where m is a predefined parameter. [sent-82, score-0.576]

31 Formally, the bounded memory adversary must choose loss functions that satisfy ∀t, ∀x1:t ∈ A^t, and ∀x′1:t−m−1 ∈ A^{t−m−1}: ft(x1:t) = ft(x′1:t−m−1, xt−m:t). [sent-83, score-1.307]

32 In addition to the adversary types described above, the bounded memory adaptive adversary has additional interesting special cases. [sent-85, score-1.145]

33 One of them is the delayed feedback oblivious adversary of [19], which defines an oblivious loss sequence, but reveals each loss value with a delay of m rounds. [sent-86, score-1.389]

34 Since the loss at time t depends on the player’s action at time t − m, this adversary is a special case of a bounded memory adversary with a memory of size m. [sent-87, score-1.427]

35 The delayed feedback adversary is not a focus of our work, and we present it merely as an interesting special case. [sent-88, score-0.641]

36 As mentioned above, the oblivious adversary has been studied extensively and is the best understood of all the adversaries discussed in this paper. [sent-96, score-0.874]

37 The oblivious setting with bandit feedback, where the player only observes the incurred loss ft (X1:t ), is called the nonstochastic (or adversarial) multi-armed bandit problem. [sent-103, score-1.63]

38 The analysis of FLL guarantees that the oblivious component of the player’s expected regret (without counting the switching costs), as well as the expected number of switches, is upper bounded by O(√T), implying an expected regret of O(√T). [sent-106, score-1.302]

39 The work in [3] focuses on the bounded memory adversary with bandit feedback and guarantees an expected regret of O(T^{2/3}). [sent-107, score-1.392]

40 We note that [18, 12] study this problem in a different feedback model, which we call counterfactual feedback, where the player receives a full description of the history-dependent function ft at the end of round t. [sent-109, score-1.024]

41 Learning with bandit feedback and switching costs has mostly been considered in the economics literature, using a different setting than ours and with prior knowledge assumptions (see [13] for an overview). [sent-111, score-0.777]

42 For example, several bandit algorithms have extensions to the “adaptive” adversary case, with a regret upper bound of O(√T) [1]. [sent-120, score-1.009]

43 Specifically, our lower bound applies to the switching costs adversary with bandit feedback and to all strictly stronger adversaries. [sent-131, score-1.267]

44 • Building on this lower bound, we prove another regret lower bound in the bounded memory setting with full-information feedback, again matching the known upper bound. [sent-132, score-0.629]

45 • Despite the lower bound, we show that for switching costs and bandit feedback, if we also assume stochastic i.i.d. losses … [sent-134, score-0.608]

46 First, observe that the optimal regret against the switching costs adversary is Θ(√T) with full-information feedback, versus Θ̃(T^{2/3}) with bandit feedback. [sent-141, score-1.301]

47 Second, observe that in the full-information feedback case, the optimal regret against a switching costs adversary is Θ(√T), whereas the optimal regret against the more general bounded memory adversary is Ω(T^{2/3}). [sent-151, score-2.169]

48 We assume that the loss values on each individual round are bounded in an interval of constant size C, but we allow this interval to drift from round to round. [sent-158, score-0.573]

49 Ω(T^{2/3}) with Switching Costs and Bandit Feedback: We begin with an Ω(T^{2/3}) regret lower bound against an oblivious adversary with switching costs, when the player receives bandit feedback. [sent-168, score-1.969]

50 We use ℓ1, ℓ2, … to denote the oblivious sequence of loss functions chosen by the adversary before adding the switching cost. [sent-173, score-1.126]

51 For any player strategy that relies on bandit feedback and for any number of rounds T, there exist loss functions f1, …. [sent-175, score-1.122]

52 , fT that are oblivious with switching costs, with a range bounded by C = 2, and a drift bounded by Dt = 3 log(t) + 16, such that E[RT] ≥ (1/40) T^{2/3}. [sent-178, score-0.76]

53 It is straightforward to confirm that this loss sequence has a bounded range, as required by the theorem: by construction we have |ℓt(1) − ℓt(2)| = T^{−1/3} ≤ 1 for all t, and since the switching cost can add at most 1 to the loss on each round, we conclude that |ft(1) − ft(2)| ≤ 2 for all t. [sent-199, score-0.888]
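
The exact stochastic construction is not reproduced in these extracted sentences; the sketch below is only a plausible instantiation of the stated properties (a shared drifting random-walk component, so losses have strong dependencies, plus a fixed gap of exactly T^{-1/3} between the two actions, with the better action hidden from the player). The paper's exact construction and constants may differ.

```python
# Hedged sketch of a two-action oblivious loss sequence with the properties stated above.
# This only illustrates those properties; it is not the paper's construction.
import random

def sample_loss_sequence(T, step_std=1.0):
    better = random.choice([1, 2])            # which action is better, unknown to the player
    gap = T ** (-1.0 / 3.0)
    walk = 0.0
    losses = []                               # losses[t][a] = oblivious loss l_t(a)
    for _ in range(T):
        walk += random.gauss(0.0, step_std)   # shared drifting component
        l = {1: walk, 2: walk}
        l[better] -= gap                      # |l_t(1) - l_t(2)| = T^{-1/3} on every round
        losses.append(l)
    return losses, better
```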

54 Next, we show that the expected regret of any player against this random loss sequence is Ω(T^{2/3}), where expectation is taken over the randomization of both the adversary and the player. [sent-200, score-1.373]

55 The intuition is that the player can only gain information about which action is better by switching between them. [sent-201, score-0.835]

56 Since the gap between the two losses on each round is T^{−1/3}, the player must perform Ω(T^{2/3}) switches before he can identify the better action. [sent-203, score-0.729]

57 If the player performs that many switches, the total regret incurred due to the switching costs is Ω(T^{2/3}). [sent-204, score-1.096]

58 Alternatively, if the player performs o(T^{2/3}) switches, he … [Figure 1: A particular realization of the random loss sequence ℓt(1), ℓt(2) defined in Eq. (8).] [sent-205, score-0.627]

59 can’t identify the better action; as a result he suffers an expected regret of Ω(T^{−1/3}) on each round and a total regret of Ω(T^{2/3}). [sent-208, score-0.728]

60 Since the random loss sequence defined in Eq. (8), plus a switching cost, achieves an expected regret of Ω(T^{2/3}), there must exist at least one deterministic loss sequence ℓ1, …. [sent-210, score-0.717]

61 To get this strong result, we need to give the adversary a little bit of extra power: memory of size 2 instead of size 1 as in the case of switching costs. [sent-221, score-0.79]

62 For any player strategy that relies on full-information feedback and for any number of rounds T ≥ 2, there exist loss functions f1, …. [sent-224, score-0.892]

63 We construct the adversarial loss sequence as follows: on each round, the adversary assigns the same loss to both actions. [sent-230, score-0.743]

64 Recall that even in the full-information version of the game, the player doesn’t know what the losses would have been had he chosen different actions in the past. [sent-232, score-0.625]

65 In words, Eq. (9) defines the loss on round t of the full-information game to be equal to the loss on round t − 1 of a bandits-with-switching-costs game in which the player chooses the same sequence of actions. [sent-239, score-1.151]
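
Taken literally, this verbal description corresponds to the mapping below (a reconstruction from the description, not a quote of the paper's equation; ℓ_{t−1} denotes the oblivious loss of the simulated switching-costs game):

```latex
% Reconstruction of Eq. (9) from the verbal description above; \ell_{t-1} is the
% oblivious loss of the simulated bandits-with-switching-costs game.
f_t(x_{1:t}) \;=\; \ell_{t-1}(x_{t-1}) \;+\; \mathbb{1}\{x_{t-1} \neq x_{t-2}\},
\qquad t \ge 3 .
```

Under this reading, the right-hand side depends on the history only through x_{t−1} and x_{t−2} (a memory of size 2) and not on x_t at all, which is why both actions receive the same loss on round t.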

66 Therefore, the Ω(T^{2/3}) lower bound for switching costs and bandit feedback extends to the full-information setting with a memory of size at least 2. [sent-242, score-0.936]

67 Specifically, of the upper bounds that appear in Table 1, we prove the following: • O(√T) for an oblivious adversary with switching costs, with full-information feedback. [sent-244, score-1.025]

68 • Õ(√T) for an oblivious adversary with bandit feedback (where Õ hides logarithmic factors). [sent-245, score-1.121]

69 • Õ(T^{2/3}) for a bounded memory adversary with bandit feedback. [sent-246, score-0.868]

70 The remaining upper bounds in Table 1 are either trivial or follow from the principle that an upper bound still holds if we weaken the adversary or provide more informative feedback. [sent-247, score-0.591]

71 If the oblivious losses ℓ1, …, ℓT (without the additional switching costs) were all bounded in [0, 1], the Follow the Lazy Leader (FLL) algorithm of [14] would guarantee a regret of O(√T) with respect to these losses (again, without the additional switching costs). [sent-253, score-0.948]

72 On round t, after choosing an action and receiving the loss function ℓt, the player defines the modified loss ℓ′t(x) = C^{−1}(ℓt(x) − min_y ℓt(y)) and feeds it to the FLL algorithm. [sent-256, score-0.976]
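
A minimal sketch of this per-round reduction (FLL is treated as a black box; fll, observe_losses, and the exact 1/C scaling are illustrative assumptions rather than the paper's precise recipe):

```python
# Sketch of the per-round reduction described above: shift and rescale the observed
# full-information loss vector into [0, 1] and feed it to FLL, treated here as a black
# box with next_action() and update() (these interface names are assumptions).

def play_round(t, fll, observe_losses, C):
    x_t = fll.next_action()                    # FLL picks the action; it switches rarely
    ell_t = observe_losses(t)                  # full information: l_t(x) for every action x
    m = min(ell_t.values())
    ell_mod = {x: (v - m) / C for x, v in ell_t.items()}   # modified loss in [0, 1]
    fll.update(ell_mod)
    return x_t
```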

73 If the loss sequence is oblivious with switching costs and has a range bounded by C, then the player strategy described above attains O(C√T) expected regret. [sent-262, score-1.245]

74 We first show that each ℓ′t is bounded in [0, 1], and therefore the standard regret bound for FLL holds with respect to the sequence of modified loss functions ℓ′1, ℓ′2, …. [sent-264, score-0.592]

75 The player sets a fixed horizon T and focuses on controlling his regret at time T ; he can then use a standard doubling trick [8] to handle an infinite horizon. [sent-275, score-0.779]

76 The player uses the fact that each ft has a range bounded by C. [sent-276, score-0.841]

77 Note that f′t(X1:t) can be computed by the player using only bandit feedback. [sent-278.

78 The player then feeds f′t(X1:t) to an algorithm that guarantees an O(√T) standard regret (see definition in Eq. [sent-279, score-0.793]

79 The player chooses his actions according to the choices made by Exp3. [sent-282, score-0.578]
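
The reduction treats Exp3 as a black box; for concreteness, here is a standard textbook-style Exp3 sketch (a common loss-based variant without explicit uniform exploration), not the paper's exact tuning:

```python
# Standard Exp3 sketch (loss-based variant); used here only as the generic black-box
# bandit algorithm to which the reduction feeds the rescaled losses.
import math
import random

class Exp3:
    def __init__(self, n_actions, eta):
        self.n = n_actions
        self.eta = eta
        self.weights = [1.0] * n_actions
        self.last_probs = [1.0 / n_actions] * n_actions

    def next_action(self):
        total = sum(self.weights)
        self.last_probs = [w / total for w in self.weights]
        r, acc = random.random(), 0.0
        for i, p in enumerate(self.last_probs):
            acc += p
            if r <= acc:
                return i
        return self.n - 1

    def update(self, action, loss):
        # loss is assumed to lie in [0, 1]; build the importance-weighted estimate
        estimate = loss / self.last_probs[action]
        self.weights[action] *= math.exp(-self.eta * estimate)
```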

80 The following theorem states that this reduction results in a bandit algorithm that guarantees a regret of Õ(√T) against oblivious adversaries. [sent-283, score-0.808]

81 If the loss sequence f1, …, fT is oblivious with a range bounded by C and a drift bounded by Dt = O(√log(t)), then the player strategy described above attains Õ(C√T) expected regret. [sent-288, score-1.162]

82 In a nutshell, we show that each f′t is a loss function bounded in [0, 1] and that the analysis of Exp3 guarantees a regret of Õ(√T) with respect to the loss sequence f′1, …, f′T. [sent-290, score-0.686]

83 Õ(T^{2/3}) with Bounded Memory and Bandit Feedback: Proving an upper bound against an adversary with a memory of size m, with bandit feedback, requires a more delicate reduction. [sent-299, score-0.831]

84 Since ft(x1:t) depends only on the last m + 1 actions in x1:t, we slightly overload our notation and define ft(xt−m:t) to mean the same as ft(x1:t). [sent-302, score-0.606]

85 To define the reduction, the player fixes a base action x0 ∈ A and, for each t > m, defines a loss function built from the difference ft(xt−m:t) − ft−m−1(x0, …, x0). [sent-303, score-0.971]

86 At the beginning of each epoch, the player plans his action sequence for the entire epoch. [sent-309, score-0.66]

87 For each action in A, the player chooses an exploration interval of 2(m + 1) consecutive rounds within the epoch. [sent-311, score-0.754]

88 The player runs the Hedge algorithm [11] in the background, invoking it only at the beginning of each epoch and using it to choose one exploitation action that will be played consistently on all of the exploitation rounds in the epoch. [sent-315, score-0.753]

89 In the exploration interval for action x, the player first plays m + 1 rounds of the base action x0 followed by m + 1 rounds of the action x. [sent-316, score-1.056]

90 Letting tx denote the first round in this interval, the player uses the observed losses ftx+m(x0, … [sent-317, score-0.702]
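
A schematic sketch of one epoch of this exploration/exploitation schedule (epoch length, interval placement, Hedge's interface, and the form of the loss estimates are simplifying assumptions, not the paper's exact choices):

```python
# Schematic sketch of one epoch: 2(m+1)-round exploration intervals (m+1 rounds of the
# base action x0, then m+1 rounds of the explored action), and a single exploitation
# action chosen by Hedge at the start of the epoch.  Interval placement and the loss
# estimates are simplified placeholders.  Assumes epoch_len >= len(actions) * 2 * (m + 1).

def run_epoch(hedge, actions, base_action, m, epoch_len, play):
    """play(x) plays action x for one round and returns the observed loss."""
    exploit_action = hedge.next_action()        # Hedge is invoked once per epoch
    block = 2 * (m + 1)
    # Simplification: put the exploration intervals back to back at the start of the epoch.
    intervals = {x: range(i * block, (i + 1) * block) for i, x in enumerate(actions)}
    estimates = {x: 0.0 for x in actions}
    for t in range(epoch_len):
        owner = next((x for x, iv in intervals.items() if t in iv), None)
        if owner is None:
            play(exploit_action)                # exploitation round
            continue
        offset = t - intervals[owner].start
        chosen = base_action if offset <= m else owner
        loss = play(chosen)
        # Crude estimate of action owner's loss relative to the base action x0.
        if offset == m:
            estimates[owner] -= loss
        elif offset == 2 * m + 1:
            estimates[owner] += loss
    hedge.update(estimates)
    return exploit_action
```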

91 If the loss sequence f1, …, fT has a memory of size m, a range bounded by C, and a drift bounded by Dt = O(√log(t)), then the player strategy described above attains Õ(T^{2/3}) expected regret. [sent-337, score-0.892]

92 Discussion: In this paper, we studied the problem of prediction with expert advice against different types of adversaries, ranging from the oblivious adversary to the general adaptive adversary. [sent-338, score-0.824]

93 We proved upper and lower bounds on the player’s regret against each of these adversary types, in both the full-information and the bandit feedback models. [sent-339, score-1.229]

94 Our lower bounds essentially matched our upper bounds in all but one case: the adaptive adversary with a unit memory in the full-information setting, where we only know that regret is Ω(√T) and O(T^{2/3}). [sent-340, score-1.034]

95 First, we characterize the regret attainable with switching costs, and show a setting where predicting with bandit feedback is strictly more difficult than predicting with full-information feedback —even in terms of the dependence on T , and even on small finite action sets. [sent-342, score-1.331]

96 Second, in the full-information setting, we show that predicting against a switching costs adversary is strictly easier than predicting against an arbitrary adversary with a bounded memory. [sent-343, score-1.37]

97 In addition to the adversaries discussed in this paper, there are other interesting classes of adversaries that lie between the oblivious and the adaptive. [sent-350, score-0.618]

98 For example, imagine playing a multi-armed bandit game where the loss values are initially oblivious, but whenever the player chooses an arm with zero loss, the loss of the same arm on the next round is deterministically changed to zero. [sent-352, score-1.189]

99 Asymptotically efficient adaptive allocation rules for the multiarmed bandit problem with switching cost. [sent-365, score-0.561]

100 Online bandit learning against an adaptive adversary: from regret to policy regret. [sent-371, score-0.594]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('player', 0.464), ('adversary', 0.439), ('regret', 0.282), ('ft', 0.261), ('oblivious', 0.252), ('switching', 0.247), ('bandit', 0.23), ('adversaries', 0.183), ('feedback', 0.178), ('action', 0.124), ('loss', 0.122), ('round', 0.121), ('fll', 0.106), ('memory', 0.104), ('costs', 0.103), ('bounded', 0.095), ('xt', 0.091), ('actions', 0.084), ('rounds', 0.084), ('losses', 0.077), ('investor', 0.071), ('adaptive', 0.068), ('switches', 0.067), ('game', 0.065), ('drift', 0.05), ('lt', 0.047), ('hedge', 0.046), ('dt', 0.046), ('stock', 0.045), ('sequence', 0.041), ('bounds', 0.041), ('ftx', 0.04), ('epoch', 0.037), ('walk', 0.037), ('observes', 0.035), ('nes', 0.032), ('interval', 0.032), ('rt', 0.032), ('upper', 0.031), ('chooses', 0.03), ('lower', 0.028), ('bound', 0.027), ('nonoblivious', 0.027), ('online', 0.026), ('attainable', 0.026), ('expert', 0.026), ('functions', 0.025), ('advice', 0.025), ('expected', 0.025), ('guarantees', 0.024), ('leader', 0.024), ('delayed', 0.024), ('invests', 0.023), ('feeds', 0.023), ('doesn', 0.022), ('trading', 0.022), ('weaken', 0.022), ('logarithmic', 0.022), ('market', 0.021), ('deterministically', 0.021), ('range', 0.021), ('deferred', 0.021), ('formally', 0.02), ('exploration', 0.02), ('reduction', 0.02), ('strategy', 0.019), ('maxt', 0.019), ('supplementary', 0.019), ('adversarial', 0.019), ('attains', 0.019), ('setting', 0.019), ('seventeenth', 0.018), ('suffers', 0.018), ('controlling', 0.018), ('lazy', 0.018), ('nonstochastic', 0.017), ('plans', 0.017), ('proving', 0.017), ('predicting', 0.016), ('competitive', 0.016), ('multiarmed', 0.016), ('past', 0.015), ('randomized', 0.015), ('prove', 0.015), ('proof', 0.015), ('microsoft', 0.015), ('strictly', 0.015), ('log', 0.015), ('dealing', 0.015), ('exploitation', 0.015), ('focuses', 0.015), ('prediction', 0.014), ('beginning', 0.014), ('shifted', 0.014), ('sublinear', 0.014), ('policy', 0.014), ('bandits', 0.014), ('dekel', 0.014), ('imagine', 0.014), ('implying', 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries

Author: Nicolò Cesa-Bianchi, Ofer Dekel, Ohad Shamir

Abstract: We study the power of different types of adaptive (nonoblivious) adversaries in the setting of prediction with expert advice, under both full-information and bandit feedback. We measure the player’s performance using a new notion of regret, also known as policy regret, which better captures the adversary’s adaptiveness to the player’s behavior. In a setting where losses are allowed to drift, we characterize —in a nearly complete manner— the power of adaptive adversaries with bounded memories and switching costs. In particular, we show that with switching costs, the attainable rate with bandit feedback is Θ̃(T^{2/3}). Interestingly, this rate is significantly worse than the Θ(√T) rate attainable with switching costs in the full-information case. Via a novel reduction from experts to bandits, we also show that a bounded memory adversary can force Θ̃(T^{2/3}) regret even in the full-information case, proving that switching costs are easier to control than bounded memory adversaries. Our lower bounds rely on a new stochastic adversary strategy that generates loss processes with strong dependencies.

2 0.37526459 240 nips-2013-Optimization, Learning, and Games with Predictable Sequences

Author: Sasha Rakhlin, Karthik Sridharan

Abstract: We provide several applications of Optimistic Mirror Descent, an online learning algorithm based on the idea of predictable sequences. First, we recover the Mirror Prox algorithm for offline optimization, prove an extension to Hölder-smooth functions, and apply the results to saddle-point type problems. Next, we prove that a version of Optimistic Mirror Descent (which has a close relation to the Exponential Weights algorithm) can be used by two strongly-uncoupled players in a finite zero-sum matrix game to converge to the minimax equilibrium at the rate of O((log T)/T). This addresses a question of Daskalakis et al [6]. Further, we consider a partial information version of the problem. We then apply the results to convex programming and exhibit a simple algorithm for the approximate Max Flow problem.

3 0.34231877 191 nips-2013-Minimax Optimal Algorithms for Unconstrained Linear Optimization

Author: Brendan McMahan, Jacob Abernethy

Abstract: We design and analyze minimax-optimal algorithms for online linear optimization games where the player’s choice is unconstrained. The player strives to minimize regret, the difference between his loss and the loss of a post-hoc benchmark strategy. While the standard benchmark is the loss of the best strategy chosen from a bounded comparator set, we consider a very broad range of benchmark functions. The problem is cast as a sequential multi-stage zero-sum game, and we give a thorough analysis of the minimax behavior of the game, providing characterizations for the value of the game, as well as both the player’s and the adversary’s optimal strategy. We show how these objects can be computed efficiently under certain circumstances, and by selecting an appropriate benchmark, we construct a novel hedging strategy for an unconstrained betting game. 1

4 0.30150276 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings

Author: Abhradeep Guha Thakurta, Adam Smith

Abstract: We give differentially private algorithms for a large class of online learning algorithms, in both the full information and bandit settings. Our algorithms aim to minimize a convex loss function which is a sum of smaller convex loss terms, one for each data point. To design our algorithms, we modify the popular mirror descent approach, or rather a variant called follow the approximate leader. The technique leads to the first nonprivate algorithms for private online learning in the bandit setting. In the full information setting, our algorithms improve over the regret bounds of previous work (due to Dwork, Naor, Pitassi and Rothblum (2010) and Jain, Kothari and Thakurta (2012)). In many cases, our algorithms (in both settings) match the dependence on the input length, T , of the optimal nonprivate regret bounds up to logarithmic factors in T . Our algorithms require logarithmic space and update time. 1

5 0.29172292 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions

Author: Yasin Abbasi, Peter Bartlett, Varun Kanade, Yevgeny Seldin, Csaba Szepesvari

Abstract: We study the problem of online learning Markov Decision Processes (MDPs) when both the transition distributions and loss functions are chosen by an adversary. We present an algorithm that, under a mixing assumption, achieves O(√(T log |Π|) + log |Π|) regret with respect to a comparison set of policies Π. The regret is independent of the size of the state and action spaces. When expectations over sample paths can be computed efficiently and the comparison set Π has polynomial size, this algorithm is efficient. We also consider the episodic adversarial online shortest path problem. Here, in each episode an adversary may choose a weighted directed acyclic graph with an identified start and finish node. The goal of the learning algorithm is to choose a path that minimizes the loss while traversing from the start to finish node. At the end of each episode the loss function (given by weights on the edges) is revealed to the learning algorithm. The goal is to minimize regret with respect to a fixed policy for selecting paths. This problem is a special case of the online MDP problem. It was shown that for randomly chosen graphs and adversarial losses, the problem can be efficiently solved. We show that it also can be efficiently solved for adversarial graphs and randomly chosen losses. When both graphs and losses are adversarially chosen, we show that designing efficient algorithms for the adversarial online shortest path problem (and hence for the adversarial MDP problem) is as hard as learning parity with noise, a notoriously difficult problem that has been used to design efficient cryptographic schemes. Finally, we present an efficient algorithm whose regret scales linearly with the number of distinct graphs.

6 0.2587046 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence

7 0.24705547 95 nips-2013-Distributed Exploration in Multi-Armed Bandits

8 0.23527104 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration

9 0.21528403 89 nips-2013-Dimension-Free Exponentiated Gradient

10 0.19856584 230 nips-2013-Online Learning with Costly Features and Labels

11 0.18995295 325 nips-2013-The Pareto Regret Frontier

12 0.16571064 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes

13 0.14864537 137 nips-2013-High-Dimensional Gaussian Process Bandits

14 0.1379803 338 nips-2013-Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards

15 0.13492982 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search

16 0.13274328 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling

17 0.11537497 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems

18 0.11251782 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning

19 0.11040115 7 nips-2013-A Gang of Bandits

20 0.10934352 25 nips-2013-Adaptive Anonymity via $b$-Matching


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.201), (1, -0.266), (2, 0.288), (3, -0.306), (4, -0.011), (5, -0.182), (6, 0.082), (7, -0.067), (8, -0.008), (9, -0.041), (10, -0.024), (11, -0.086), (12, 0.051), (13, 0.031), (14, 0.004), (15, -0.092), (16, 0.027), (17, -0.115), (18, -0.018), (19, -0.022), (20, -0.064), (21, -0.013), (22, -0.029), (23, 0.045), (24, -0.021), (25, -0.052), (26, 0.066), (27, 0.005), (28, 0.054), (29, -0.015), (30, 0.007), (31, -0.038), (32, 0.037), (33, -0.086), (34, 0.138), (35, -0.029), (36, -0.021), (37, 0.073), (38, 0.023), (39, 0.013), (40, 0.028), (41, -0.049), (42, 0.037), (43, 0.043), (44, 0.005), (45, -0.042), (46, -0.069), (47, 0.009), (48, -0.045), (49, -0.065)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96442842 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries

Author: Nicolò Cesa-Bianchi, Ofer Dekel, Ohad Shamir

Abstract: We study the power of different types of adaptive (nonoblivious) adversaries in the setting of prediction with expert advice, under both full-information and bandit feedback. We measure the player’s performance using a new notion of regret, also known as policy regret, which better captures the adversary’s adaptiveness to the player’s behavior. In a setting where losses are allowed to drift, we characterize —in a nearly complete manner— the power of adaptive adversaries with bounded memories and switching costs. In particular, we show that with switching costs, the attainable rate with bandit feedback is Θ̃(T^{2/3}). Interestingly, this rate is significantly worse than the Θ(√T) rate attainable with switching costs in the full-information case. Via a novel reduction from experts to bandits, we also show that a bounded memory adversary can force Θ̃(T^{2/3}) regret even in the full-information case, proving that switching costs are easier to control than bounded memory adversaries. Our lower bounds rely on a new stochastic adversary strategy that generates loss processes with strong dependencies.

2 0.85069805 191 nips-2013-Minimax Optimal Algorithms for Unconstrained Linear Optimization

Author: Brendan McMahan, Jacob Abernethy

Abstract: We design and analyze minimax-optimal algorithms for online linear optimization games where the player’s choice is unconstrained. The player strives to minimize regret, the difference between his loss and the loss of a post-hoc benchmark strategy. While the standard benchmark is the loss of the best strategy chosen from a bounded comparator set, we consider a very broad range of benchmark functions. The problem is cast as a sequential multi-stage zero-sum game, and we give a thorough analysis of the minimax behavior of the game, providing characterizations for the value of the game, as well as both the player’s and the adversary’s optimal strategy. We show how these objects can be computed efficiently under certain circumstances, and by selecting an appropriate benchmark, we construct a novel hedging strategy for an unconstrained betting game. 1

3 0.83778501 240 nips-2013-Optimization, Learning, and Games with Predictable Sequences

Author: Sasha Rakhlin, Karthik Sridharan

Abstract: We provide several applications of Optimistic Mirror Descent, an online learning algorithm based on the idea of predictable sequences. First, we recover the Mirror Prox algorithm for offline optimization, prove an extension to Hölder-smooth functions, and apply the results to saddle-point type problems. Next, we prove that a version of Optimistic Mirror Descent (which has a close relation to the Exponential Weights algorithm) can be used by two strongly-uncoupled players in a finite zero-sum matrix game to converge to the minimax equilibrium at the rate of O((log T)/T). This addresses a question of Daskalakis et al [6]. Further, we consider a partial information version of the problem. We then apply the results to convex programming and exhibit a simple algorithm for the approximate Max Flow problem.

4 0.79756045 325 nips-2013-The Pareto Regret Frontier

Author: Wouter M. Koolen

Abstract: Performance guarantees for online learning algorithms typically take the form of regret bounds, which express that the cumulative loss overhead compared to the best expert in hindsight is small. In the common case of large but structured expert sets we typically wish to keep the regret especially small compared to simple experts, at the cost of modest additional overhead compared to more complex others. We study which such regret trade-offs can be achieved, and how. We analyse regret w.r.t. each individual expert as a multi-objective criterion in the simple but fundamental case of absolute loss. We characterise the achievable and Pareto optimal trade-offs, and the corresponding optimal strategies for each sample size both exactly for each finite horizon and asymptotically. 1

5 0.78293782 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence

Author: Noga Alon, Nicolò Cesa-Bianchi, Claudio Gentile, Yishay Mansour

Abstract: We consider the partial observability model for multi-armed bandits, introduced by Mannor and Shamir [14]. Our main result is a characterization of regret in the directed observability model in terms of the dominating and independence numbers of the observability graph (which must be accessible before selecting an action). In the undirected case, we show that the learner can achieve optimal regret without even accessing the observability graph before selecting an action. Both results are shown using variants of the Exp3 algorithm operating on the observability graph in a time-efficient manner. 1

6 0.61308783 230 nips-2013-Online Learning with Costly Features and Labels

7 0.59594488 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions

8 0.5763461 89 nips-2013-Dimension-Free Exponentiated Gradient

9 0.55467796 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings

10 0.53341049 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration

11 0.50702071 159 nips-2013-Learning Prices for Repeated Auctions with Strategic Buyers

12 0.48984119 95 nips-2013-Distributed Exploration in Multi-Armed Bandits

13 0.46328959 270 nips-2013-Regret based Robust Solutions for Uncertain Markov Decision Processes

14 0.44138619 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems

15 0.42559418 338 nips-2013-Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards

16 0.4238838 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling

17 0.41572654 292 nips-2013-Sequential Transfer in Multi-armed Bandit with Finite Set of Models

18 0.41416675 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes

19 0.40671751 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search

20 0.38285792 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.134), (16, 0.015), (33, 0.147), (34, 0.072), (41, 0.021), (49, 0.021), (54, 0.018), (56, 0.218), (70, 0.012), (73, 0.129), (85, 0.052), (89, 0.018), (91, 0.012), (93, 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91543907 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries

Author: Nicolò Cesa-Bianchi, Ofer Dekel, Ohad Shamir

Abstract: We study the power of different types of adaptive (nonoblivious) adversaries in the setting of prediction with expert advice, under both full-information and bandit feedback. We measure the player’s performance using a new notion of regret, also known as policy regret, which better captures the adversary’s adaptiveness to the player’s behavior. In a setting where losses are allowed to drift, we characterize —in a nearly complete manner— the power of adaptive adversaries with bounded memories and switching costs. In particular, we show that with switching costs, the attainable rate with bandit feedback is Θ̃(T^{2/3}). Interestingly, this rate is significantly worse than the Θ(√T) rate attainable with switching costs in the full-information case. Via a novel reduction from experts to bandits, we also show that a bounded memory adversary can force Θ̃(T^{2/3}) regret even in the full-information case, proving that switching costs are easier to control than bounded memory adversaries. Our lower bounds rely on a new stochastic adversary strategy that generates loss processes with strong dependencies.

2 0.89980263 191 nips-2013-Minimax Optimal Algorithms for Unconstrained Linear Optimization

Author: Brendan McMahan, Jacob Abernethy

Abstract: We design and analyze minimax-optimal algorithms for online linear optimization games where the player’s choice is unconstrained. The player strives to minimize regret, the difference between his loss and the loss of a post-hoc benchmark strategy. While the standard benchmark is the loss of the best strategy chosen from a bounded comparator set, we consider a very broad range of benchmark functions. The problem is cast as a sequential multi-stage zero-sum game, and we give a thorough analysis of the minimax behavior of the game, providing characterizations for the value of the game, as well as both the player’s and the adversary’s optimal strategy. We show how these objects can be computed efficiently under certain circumstances, and by selecting an appropriate benchmark, we construct a novel hedging strategy for an unconstrained betting game. 1

3 0.88727182 66 nips-2013-Computing the Stationary Distribution Locally

Author: Christina E. Lee, Asuman Ozdaglar, Devavrat Shah

Abstract: Computing the stationary distribution of a large finite or countably infinite state space Markov Chain (MC) has become central in many problems such as statistical inference and network analysis. Standard methods involve large matrix multiplications as in power iteration, or simulations of long random walks, as in Markov Chain Monte Carlo (MCMC). Power iteration is costly, as it involves computation at every state. For MCMC, it is difficult to determine whether the random walks are long enough to guarantee convergence. In this paper, we provide a novel algorithm that answers whether a chosen state in a MC has stationary probability larger than some ∆ ∈ (0, 1), and outputs an estimate of the stationary probability. Our algorithm is constant time, using information from a local neighborhood of the state on the graph induced by the MC, which has constant size relative to the state space. The multiplicative error of the estimate is upper bounded by a function of the mixing properties of the MC. Simulation results show MCs for which this method gives tight estimates. 1

4 0.86479902 240 nips-2013-Optimization, Learning, and Games with Predictable Sequences

Author: Sasha Rakhlin, Karthik Sridharan

Abstract: We provide several applications of Optimistic Mirror Descent, an online learning algorithm based on the idea of predictable sequences. First, we recover the Mirror Prox algorithm for offline optimization, prove an extension to Hölder-smooth functions, and apply the results to saddle-point type problems. Next, we prove that a version of Optimistic Mirror Descent (which has a close relation to the Exponential Weights algorithm) can be used by two strongly-uncoupled players in a finite zero-sum matrix game to converge to the minimax equilibrium at the rate of O((log T)/T). This addresses a question of Daskalakis et al [6]. Further, we consider a partial information version of the problem. We then apply the results to convex programming and exhibit a simple algorithm for the approximate Max Flow problem.

5 0.86422008 171 nips-2013-Learning with Noisy Labels

Author: Nagarajan Natarajan, Inderjit Dhillon, Pradeep Ravikumar, Ambuj Tewari

Abstract: In this paper, we theoretically study the problem of binary classification in the presence of random classification noise — the learner, instead of seeing the true labels, sees labels that have independently been flipped with some small probability. Moreover, random label noise is class-conditional — the flip probability depends on the class. We provide two approaches to suitably modify any given surrogate loss function. First, we provide a simple unbiased estimator of any loss, and obtain performance bounds for empirical risk minimization in the presence of iid data with noisy labels. If the loss function satisfies a simple symmetry condition, we show that the method leads to an efficient algorithm for empirical minimization. Second, by leveraging a reduction of risk minimization under noisy labels to classification with weighted 0-1 loss, we suggest the use of a simple weighted surrogate loss, for which we are able to obtain strong empirical risk bounds. This approach has a very remarkable consequence — methods used in practice such as biased SVM and weighted logistic regression are provably noise-tolerant. On a synthetic non-separable dataset, our methods achieve over 88% accuracy even when 40% of the labels are corrupted, and are competitive with respect to recently proposed methods for dealing with label noise in several benchmark datasets.

6 0.85857069 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence

7 0.85413408 248 nips-2013-Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs

8 0.84979832 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions

9 0.84491187 182 nips-2013-Manifold-based Similarity Adaptation for Label Propagation

10 0.84313399 230 nips-2013-Online Learning with Costly Features and Labels

11 0.83860803 325 nips-2013-The Pareto Regret Frontier

12 0.83518225 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings

13 0.83193702 95 nips-2013-Distributed Exploration in Multi-Armed Bandits

14 0.82863379 123 nips-2013-Flexible sampling of discrete data correlations without the marginal distributions

15 0.82828808 269 nips-2013-Regression-tree Tuning in a Streaming Setting

16 0.82673842 142 nips-2013-Information-theoretic lower bounds for distributed statistical estimation with communication constraints

17 0.82086289 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion

18 0.82055789 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search

19 0.82025409 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes

20 0.81880736 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting