nips nips2013 nips2013-125 knowledge-graph by maker-knowledge-mining

125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence


Source: pdf

Author: Noga Alon, Nicolò Cesa-Bianchi, Claudio Gentile, Yishay Mansour

Abstract: We consider the partial observability model for multi-armed bandits, introduced by Mannor and Shamir [14]. Our main result is a characterization of regret in the directed observability model in terms of the dominating and independence numbers of the observability graph (which must be accessible before selecting an action). In the undirected case, we show that the learner can achieve optimal regret without even accessing the observability graph before selecting an action. Both results are shown using variants of the Exp3 algorithm operating on the observability graph in a time-efficient manner. 1

Reference: text


Summary: the most important sentences generated by the tf-idf model

sentIndex sentText sentNum sentScore

1 Abstract: We consider the partial observability model for multi-armed bandits, introduced by Mannor and Shamir [14]. [sent-9, score-0.45]

2 Our main result is a characterization of regret in the directed observability model in terms of the dominating and independence numbers of the observability graph (which must be accessible before selecting an action). [sent-10, score-1.75]

3 In the undirected case, we show that the learner can achieve optimal regret without even accessing the observability graph before selecting an action. [sent-11, score-0.944]

4 Both results are shown using variants of the Exp3 algorithm operating on the observability graph in a time-efficient manner. [sent-12, score-0.582]

5 A well-studied example of a prediction game is the following: in each round, the adversary privately assigns a loss value to each action in a fixed set. [sent-16, score-0.331]

6 Then the player chooses an action (possibly using randomization) and incurs the corresponding loss. [sent-17, score-0.353]

7 The goal of the player is to control regret, which is defined as the excess loss incurred by the player as compared to the best fixed action over a sequence of rounds. [sent-18, score-0.57]

8 The best possible regret for the expert setting is of order √(T log K). [sent-21, score-0.319]

9 In the bandit setting, the optimal regret is of order √(TK), achieved by the INF algorithm [2]. [sent-23, score-0.298]

10 A bandit variant of Hedge, called Exp3 [3], achieves a slightly worse regret bound of order √(TK log K). [sent-24, score-0.381]

11 Recently, Mannor and Shamir [14] introduced an elegant way of defining intermediate observability models between the expert setting (full observability) and the bandit setting (single observability). [sent-25, score-0.697]

12 An intuitive way of representing an observability model is through a directed graph over actions: an arc from action i to action j implies that when playing action i we get information also about the loss of action j. [sent-26, score-1.691]

13 Thus, the expert setting is obtained by choosing a complete graph over actions (playing any action reveals all losses), and the bandit setting is obtained by choosing an empty edge set (playing an action only reveals the loss of that action). [sent-27, score-1.062]

14 According to the standard terminology in directed graph theory, throughout this paper a directed edge will be called an arc. [sent-28, score-0.73]

15 The main result of [14] concerns undirected observability graphs. [sent-29, score-0.52]

16 The regret is characterized in terms of the independence number α of the undirected observability graph. [sent-30, score-0.81]

17 Specifically, they prove that √(T α log K) is the optimal regret (up to logarithmic factors) and show that a variant of Exp3, called ELP, achieves this bound when the graph is known ahead of time. [sent-31, score-0.476]

18 Here α ∈ {1, …, K} interpolates between full observability (α = 1 for the clique) and single observability (α = K for the graph with no edges). [sent-34, score-0.999]

19 Given the observability graph, ELP runs a linear program to compute the desired distribution over actions. [sent-35, score-0.417]

20 In the case when the graph changes over time, and at each time step ELP observes the current observability graph before prediction, a bound of √(Σ_{t=1}^T α_t log K) is shown, where α_t is the independence number of the graph at time t. [sent-36, score-1.097]

21 A major problem left open in [14] was the characterization of regret for directed observability graphs, a setting for which they only proved partial results. [sent-37, score-0.953]

22 Our main result is a full characterization (to within logarithmic factors) of regret in the case of directed observability graphs. [sent-38, score-0.915]

23 This algorithm is efficient to run even when the graph changes over time: it just needs to compute a small dominating set of the current observability graph (which must be given as side information) before prediction. [sent-40, score-0.938]

24 As in the undirected case, the regret for the directed case is characterized in terms of the independence numbers of the observability graphs (computed ignoring edge directions). [sent-41, score-1.17]

25 We also explore the possibility of the learning algorithm receiving the observability graph only after prediction, and not before. [sent-44, score-0.582]

26 For this setting, we introduce a new variant of Exp3, called Exp3-SET, which achieves the same regret as ELP for undirected graphs, but without the need of accessing the current observability graph before each prediction. [sent-45, score-0.938]

27 We show that in some random directed graph models Exp3-SET also performs well. [sent-46, score-0.435]

28 In general, we can upper bound the regret of Exp3-SET as a function of the maximum acyclic subgraph of the observability graph, but this upper bound may not be tight. [sent-47, score-0.865]

29 There are a variety of real-world settings where partial observability models corresponding to directed and undirected graphs are applicable. [sent-49, score-0.888]

30 We are given a graph of possible routes connecting cities: when we select a route r connecting two cities, we observe the cost (say, driving time or fuel consumption) of the “edges” along that route and, in addition, we have complete information on any sub-route r′ of r, but not vice versa. [sent-51, score-0.267]

31 We abstract this in our model by having an observability graph over routes r, and an arc from r to any of its sub-routes r′. [sent-52, score-0.712]

32 Sequential prediction problems with partial observability models also arise in the context of recommendation systems. [sent-53, score-0.519]

33 This knowledge can be represented as a graph over the set of products, where two products are joined by an edge if and only if users who buy any one of the two are likely to buy the other as well. [sent-55, score-0.352]

34 Such observability models may also arise in the case when a recommendation system operates in a network of users. [sent-59, score-0.56]

35 Computing an approximately minimum dominating set can be done by running a standard greedy set cover algorithm, see Section 2. [sent-68, score-0.257]

36 2 Learning protocol, notation, and preliminaries. As stated in the introduction, we consider an adversarial multi-armed bandit setting with a finite action set V = {1, …, K}. [sent-71, score-0.424]

37 At each time step t, a player (the “learning algorithm”) picks some action I_t ∈ V and incurs a bounded loss ℓ_{I_t,t} ∈ [0, 1]. [sent-78, score-0.395]

38 Unlike the standard adversarial bandit problem [3, 7], where only the played action I_t reveals its loss ℓ_{I_t,t}, here we assume all the losses in a subset S_{I_t,t} ⊆ V of actions are revealed after I_t is played. [sent-79, score-0.623]

39 We also assume i ∈ S_{i,t} for any i and t, that is, any action reveals its own loss when played. [sent-81, score-0.284]

40 Note that the bandit setting (S_{i,t} = {i}) and the expert setting (S_{i,t} = V) are both special cases of this framework. [sent-82, score-0.28]

41 We call S_{i,t} the observation set of action i at time t, and write i → j when at time t playing action i also reveals the loss of action j. [sent-83, score-0.815]

42 The family of observation sets {S_{i,t}}_{i∈V} we collectively call the observation system at time t. [sent-85, score-0.29]

43 The performance of a player A is measured through the regret max_{k∈V} (L_{A,T} − L_{k,T}), where L_{A,T} = ℓ_{I_1,1} + ··· + ℓ_{I_T,T} and L_{k,T} = ℓ_{k,1} + ··· + ℓ_{k,T} are the cumulative losses of the player and of action k, respectively. [sent-91, score-0.788]
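
To make the regret definition concrete, here is a minimal sketch (the function name and the array layout are our own choices, not from the paper) computing this quantity from a recorded play:

```python
import numpy as np

def regret(losses, played):
    """losses: T x K array of per-round losses; played: length-T action indices.
    Returns L_{A,T} - min_k L_{k,T}, the player's excess cumulative loss."""
    T = len(played)
    player_loss = losses[np.arange(T), played].sum()   # L_{A,T}
    best_fixed_loss = losses.sum(axis=0).min()         # best fixed action k
    return player_loss - best_fixed_loss
```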

44 The observation system {S_{i,t}}_{i∈V} is also adversarially generated, and each S_{i,t} can be an arbitrary function of the player's past actions, just like losses are. [sent-93, score-0.296]

45 However, in Section 3 we also consider a variant in which the observation system is randomly generated according to a specific stochastic model. [sent-94, score-0.233]

46 In the first setting, called the informed setting, the full observation system {S_{i,t}}_{i∈V} selected by the adversary is made available to the learner before making its choice I_t. [sent-97, score-0.381]

47 In the second setting, called the uninformed setting, no information whatsoever regarding the time-t observation system is given to the learner prior to prediction. [sent-99, score-0.413]

48 At each time t, the observation system {S_{i,t}}_{i∈V} defines a directed graph G_t = (V, D_t), where V is the set of actions and D_t is the set of arcs, i.e., pairs (i, j) with j ∈ S_{i,t} and j ≠ i. [sent-104, score-0.635]

49 A notable special case of the above is when the observation system is symmetric over time: j ∈ S_{i,t} if and only if i ∈ S_{j,t} for all i, j and t. [sent-113, score-0.25]

50 In words, playing i at time t reveals the loss of j if and only if playing j at time t reveals the loss of i. [sent-114, score-0.382]

51 A symmetric observation system is equivalent to G_t being an undirected graph or, more precisely, to a directed graph having, for every pair of nodes i, j ∈ V, either no arcs or length-two directed cycles. [sent-115, score-1.341]

52 Thus, from the point of view of the symmetry of the observation system, we also distinguish between the directed case (Gt is a general directed graph) and the symmetric case (Gt is an undirected graph for all t). [sent-116, score-0.948]
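
To illustrate this correspondence, the sketch below (function names are ours) builds the arc set of G_t from the observation sets and tests for the symmetric case:

```python
def arcs_of(S):
    """Arc set D_t of G_t = (V, D_t): (i, j) is an arc iff j in S_{i,t}, j != i."""
    return {(i, j) for i, S_i in S.items() for j in S_i if j != i}

def is_symmetric(S):
    """True iff j in S_{i,t} <=> i in S_{j,t} for all i, j, i.e. G_t is
    effectively undirected (no arcs or a length-two cycle for every pair)."""
    D = arcs_of(S)
    return all((j, i) in D for (i, j) in D)

# Action 0 observes action 1 but not vice versa: a genuinely directed system.
S = {0: {0, 1}, 1: {1}, 2: {2}}
assert not is_symmetric(S)
```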

53 Two graph-theoretic notions playing an important role here are those of independent sets and dominating sets. [sent-118, score-0.276]

54 Given an undirected graph G = (V, E), an independent set of G is any subset T ⊆ V such that no two i, j ∈ T are connected by an edge in E. [sent-119, score-0.293]

55 If G is directed, we can still associate with it an independence number: we simply view G as undirected by ignoring arc orientation. [sent-122, score-0.303]

56 If G = (V, D) is a directed graph, then a subset R ⊆ V is a dominating set for G if for all j ∉ R there exists some i ∈ R such that arc (i, j) ∈ D. [sent-123, score-0.561]

57 Play action I_t drawn according to distribution p_t = (p_{1,t}, …, p_{K,t}). [sent-134, score-0.24]

58 A dominating set is minimal if no proper subset thereof is itself a dominating set. [sent-142, score-0.415]

59 The domination number of a directed graph G, denoted by γ(G), is the size of a smallest (minimal) dominating set of G. [sent-143, score-0.658]

60 Computing a minimum dominating set for an arbitrary directed graph G_t is equivalent to solving a minimum set cover problem on the associated observation system {S_{i,t}}_{i∈V}. [sent-144, score-0.864]

61 Although minimum set cover is NP-hard, the well-known Greedy Set Cover algorithm [9], which repeatedly selects from {S_{i,t}}_{i∈V} the set containing the largest number of uncovered elements so far, computes a dominating set R_t such that |R_t| ≤ γ(G_t) (1 + ln K). [sent-145, score-0.337]
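
A minimal sketch of this greedy construction, assuming the observation sets are given as a dict of Python sets keyed by action (the representation is our choice):

```python
def greedy_dominating_set(S):
    """Greedy Set Cover over the observation sets {S_{i,t}}: repeatedly pick
    the action whose observation set covers the most still-uncovered actions.
    Terminates because i is always in S[i]; |R_t| <= gamma(G_t)(1 + ln K)."""
    uncovered = set(S)
    R = []
    while uncovered:
        i = max(S, key=lambda a: len(S[a] & uncovered))
        R.append(i)
        uncovered -= S[i]
    return R
```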

62 Note that when G is undirected (more precisely, as above, when G is a directed graph having for every pair of nodes i, j ∈ V either no arcs or length-two cycles), then mas(G) = α(G); otherwise mas(G) ≥ α(G). [sent-148, score-0.656]

63 In particular, when G is itself a directed acyclic graph, then mas(G) = |V |. [sent-149, score-0.328]
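
For small graphs, α(G) can be computed by brute force exactly as described: ignore arc orientation, then look for the largest subset with no internal edge. A sketch (exponential time, for illustration only; the function name is ours):

```python
from itertools import combinations

def independence_number(V, arcs):
    """alpha(G) of a directed graph: view G as undirected by ignoring arc
    orientation, then search subsets from largest to smallest."""
    edges = {frozenset(a) for a in arcs}
    for size in range(len(V), 0, -1):
        for T in combinations(V, size):
            if all(frozenset(pair) not in edges for pair in combinations(T, 2)):
                return size
    return 0
```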

64 3 Algorithms without Explicit Exploration: The Uninformed Setting. In this section, we show that a simple variant of the Exp3 algorithm [3] obtains optimal regret (to within logarithmic factors) in the symmetric and uninformed setting. [sent-150, score-0.485]

65 We then show that even the harder adversarial directed setting lends itself to an analysis, though with a weaker regret bound. [sent-151, score-0.598]

66 Next, we bound the regret of Exp3-SET in terms of the key quantity Q_t = Σ_{i∈V} p_{i,t}/q_{i,t} = Σ_{i∈V} p_{i,t} / (Σ_{j : i∈S_{j,t}} p_{j,t}). [sent-155, score-0.267]

67 Theorem 1. The regret of Exp3-SET satisfies max_{k∈V} E[L_{A,T}] − L_{k,T} ≤ (ln K)/η + (η/2) Σ_{t=1}^T E[Q_t]. [sent-162, score-0.298]

68 As we said, in the adversarial and symmetric case the observation system at time t can be described by an undirected graph G_t = (V, E_t). [sent-163, score-0.613]
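
The following is a minimal sketch of Exp3-SET as we read it from the text: exponential weights with importance-weighted loss estimates ℓ_{i,t}/q_{i,t} for the observed actions, where q_{i,t} is the denominator appearing in Q_t above. The callback names are our own, and the observation system is handed over only after the play, matching the uninformed setting:

```python
import numpy as np

def exp3_set(K, T, eta, observation_system, get_losses, seed=0):
    """Sketch of Exp3-SET: no explicit exploration, no prior graph access.
    observation_system(t) -> {i: S_{i,t}} is disclosed AFTER the play;
    get_losses(t) -> length-K loss vector (only observed entries are used)."""
    rng = np.random.default_rng(seed)
    w = np.ones(K)
    for t in range(T):
        p = w / w.sum()
        I = rng.choice(K, p=p)
        S = observation_system(t)                     # revealed after choosing I
        loss = get_losses(t)
        for i in S[I]:                                # losses observed by playing I
            q_i = sum(p[j] for j in S if i in S[j])   # q_{i,t}
            w[i] *= np.exp(-eta * loss[i] / q_i)      # importance-weighted update
    return w
```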

69 Corollary 2. In the symmetric setting, the regret of Exp3-SET satisfies max_{k∈V} E[L_{A,T}] − L_{k,T} ≤ (ln K)/η + (η/2) Σ_{t=1}^T E[α(G_t)]. [sent-167, score-0.348]

70 We start by considering a setting in which the observation system is stochastically generated. [sent-179, score-0.243]

71 The Erdős–Rényi model is a standard model for random directed graphs G = (V, D), where we are given a density parameter r ∈ [0, 1] and, for any pair i, j ∈ V, arc (i, j) ∈ D with independent probability r. [sent-181, score-0.435]

72 Then the regret of Exp3-SET satisfies max_{k∈V} E[L_{A,T}] − L_{k,T} ≤ (ln K)/η + (ηT/(2r)) (1 − (1 − r)^K). [sent-184, score-0.298]

73 Note that as r ranges in [0, 1] we interpolate between the bandit (r = 0) and the expert (r = 1) regret bounds. [sent-190, score-0.384]
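
A quick Monte-Carlo check of the quantity behind the Erdős–Rényi bound just stated (a sketch under our own assumptions: uniform p and self-observation always present); the estimate of E[Q] can be compared against (1 − (1 − r)^K)/r, the per-round factor appearing in the bound above:

```python
import numpy as np

def mc_expected_Q(K, r, trials=2000, seed=0):
    """Monte-Carlo estimate of E[Q] under the Erdos-Renyi observability model."""
    rng = np.random.default_rng(seed)
    p = np.full(K, 1.0 / K)                           # uniform p (our choice)
    total = 0.0
    for _ in range(trials):
        A = (rng.random((K, K)) < r).astype(float)    # A[j, i] = 1 iff arc j -> i
        np.fill_diagonal(A, 1.0)                      # every action observes itself
        q = A.T @ p                                   # q_i = sum_{j: i in S_j} p_j
        total += float(np.sum(p / q))
    return total / trials

print(mc_expected_Q(10, 0.3), (1 - (1 - 0.3) ** 10) / 0.3)
```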

74 Corollary 4. In the directed setting, the regret of Exp3-SET satisfies max_{k∈V} E[L_{A,T}] − L_{k,T} ≤ (ln K)/η + (η/2) Σ_{t=1}^T E[mas(G_t)]. [sent-192, score-0.568]

75 The simple observation is that given a directed graph G, we can define a new graph G′ which is made undirected just by reciprocating arcs; namely, if there is an arc (i, j) in G we add arcs (i, j) and (j, i) in G′. [sent-208, score-1.011]

76 Corollary 5 Fix a directed graph G, and suppose Gt = G for all t. [sent-212, score-0.435]

77 Then there exists a �randomized) � � adversarial strategy such that for any T = Ω α(G)3 and for any learning strategy, the expected � �� regret of the learner is Ω α(G)T . [sent-213, score-0.324]

78 Standard results on Erdős–Rényi graphs (see, e.g., [11]) show that, when the density parameter r is constant, the independence number of the resulting graph has an inverse dependence on r. [sent-216, score-0.265]

79 One may wonder whether a sharper lower bound argument exists which applies to the general directed adversarial setting and involves the larger quantity mas(G). [sent-218, score-0.511]

80 Unfortunately, the above measure does not seem to be related to the optimal regret: Using Claim 1 in the appendix (see proof of Theorem 3) one can exhibit a sequence of graphs each having a large acyclic subgraph, on which the regret of Exp3-SET is still small. [sent-219, score-0.313]

81 In the next section, we show an allocation strategy that delivers optimal (to within logarithmic factors) regret bounds using prior knowledge of the graphs Gt . [sent-222, score-0.322]

82 This is due to the fact that, when the graph induced by the observation system is directed, the key quantity Q_t defined in (1) cannot be nonvacuously upper bounded independent of the choice of probabilities p_{i,t}. [sent-225, score-0.422]

83 This exploration term is supported on a dominating set of the current graph Gt . [sent-227, score-0.401]

84 For this reason, Exp3-DOM requires prior access to a dominating set R_t at each time step t which, in turn, requires prior knowledge of the entire observation system {S_{i,t}}_{i∈V}. [sent-228, score-0.391]

85 As announced, the next result shows that, even for simple directed graphs, there exist distributions pt on the vertices such that Qt is linear in the number of nodes while the independence number is 1. [sent-229, score-0.432]

86 Play action I_t drawn according to distribution p_t^{(b_t)} = (p_{1,t}^{(b_t)}, …, p_{K,t}^{(b_t)}). [sent-247, score-0.24]

87 Then Q = Σ_{i=1}^K p_i / (p_i + Σ_{j : j→i} p_j) = Σ_{i=1}^K p_i / (Σ_{j=i}^K p_j) = (K + 1)/2. [sent-266, score-0.372]
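
The extracted sentence omits the graph and the distribution realizing these equalities; one construction consistent with them (an assumption on our part) is the total order with arcs j → i for all j > i, which has independence number 1 as stated in sentence 85, together with the distribution p_i = 2^{-i} for i < K and p_K = 2^{-(K-1)}. A quick numeric check:

```python
def Q_total_order(K):
    """Q for the total-order graph (arc j -> i for all j > i) under
    p_i = 2^{-i}, p_K = 2^{-(K-1)}: each term p_i / sum_{j >= i} p_j
    equals 1/2 except the last, which equals 1, giving (K + 1) / 2."""
    p = [2.0 ** -(i + 1) for i in range(K)]
    p[-1] = 2.0 ** -(K - 1)                 # adjust the last mass so p sums to 1
    return sum(p[i] / sum(p[i:]) for i in range(K))

assert Q_total_order(10) == (10 + 1) / 2    # exact in binary floating point
```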

88 We are now ready to introduce and analyze the new algorithm Exp3-DOM for the informed and directed setting. [sent-267, score-0.372]

89 At time t the algorithm is given the observation system {S_{i,t}}_{i∈V}, and computes a dominating set R_t of the directed graph G_t induced by {S_{i,t}}_{i∈V}. [sent-272, score-0.826]

90 Based on the size |R_t| of R_t, the algorithm uses instance b_t = ⌊log₂|R_t|⌋ to pick action I_t. [sent-273, score-0.256]
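
Putting the pieces together, here is a sketch of Exp3-DOM as we read it, under our own simplifications (fixed exploration rates γ^{(b)} rather than the doubling trick, and a generic learning rate eta; the class layout and parameter names are ours). It reuses greedy_dominating_set from the earlier sketch:

```python
import math
import numpy as np

class Exp3DOM:
    """Sketch of Exp3-DOM for the informed, directed setting."""
    def __init__(self, K, eta, gamma, seed=0):
        self.K, self.eta, self.gamma = K, eta, gamma   # gamma[b] per instance
        self.rng = np.random.default_rng(seed)
        self.w = np.ones((int(math.log2(K)) + 1, K))   # floor(log2 K)+1 instances

    def play(self, S):
        """S = {i: S_{i,t}} is given BEFORE the choice (informed setting)."""
        R = greedy_dominating_set(S)                   # small dominating set R_t
        b = int(math.log2(len(R)))                     # b_t = floor(log2 |R_t|)
        g = self.gamma[b]
        p = (1 - g) * self.w[b] / self.w[b].sum()      # exploit instance-b weights
        for i in R:
            p[i] += g / len(R)                         # explore on the dominating set
        I = int(self.rng.choice(self.K, p=p))
        return I, b, p

    def update(self, S, loss, I, b, p):
        """Importance-weighted update of instance b with the observed losses."""
        for i in S[I]:
            q_i = sum(p[j] for j in S if i in S[j])
            self.w[b, i] *= math.exp(-self.eta * loss[i] / q_i)
```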

91 For this reason, we do not prove a regret bound for Exp3-DOM, where each instance uses a fixed γ^{(b)}, but for a slight variant (described in the proof of Theorem 7; see the appendix) where each γ^{(b)} is set through a doubling trick. [sent-289, score-0.314]

92 Theorem 7. In the directed case, the regret of Exp3-DOM satisfies max_{k∈V} E[L_{A,T}] − L_{k,T} ≤ Σ_{b=0}^{⌊log₂K⌋} ( (2^b ln K)/γ^{(b)} + γ^{(b)} E[ Σ_{t : b_t=b} (1 + Q_t^{(b)}/2^{b+1}) ] ). [sent-290, score-0.568]

93 Comparing Corollary 8 to Corollary 5 delivers the announced characterization in the general adversarial and directed setting. [sent-297, score-0.426]

94 5 Conclusions and work in progress. We have investigated online prediction problems in partial information regimes that interpolate between the classical bandit and expert settings. [sent-301, score-0.263]

95 Our improvements are diverse, ranging from considering both informed and uninformed settings to delivering more refined graph-theoretic characterizations, and from providing more efficient algorithmic solutions to relying on simpler (and often more general) analytical tools. [sent-304, score-0.276]

96 Some research directions we are currently pursuing are the following: (1) We are currently investigating the extent to which our results could be applied to the case when the observation system {S_{i,t}}_{i∈V} may depend on the loss ℓ_{I_t,t} of the player's action I_t. [sent-305, score-0.42]

97 (2) The upper bound contained in Corollary 4 and expressed in terms of mas(·) is almost certainly suboptimal, even in the uninformed setting, and we are trying to see if more adequate graph complexity measures can be used instead. [sent-307, score-0.419]

98 (3) Our lower bound in Corollary 5 heavily relies on the corresponding lower bound in [14] which, in turn, refers to a constant graph sequence. [sent-308, score-0.265]

99 … α(G_1), …, α(G_T) (or variants thereof), in both the uninformed and the informed settings. [sent-315, score-0.276]

100 (4) All our upper bounds rely on parameters to be tuned as a function of sequences of observation system quantities (e.g., the independence numbers of the observability graphs). [sent-316, score-0.257]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('observability', 0.417), ('gt', 0.29), ('directed', 0.27), ('dominating', 0.191), ('regret', 0.19), ('action', 0.178), ('player', 0.175), ('uninformed', 0.174), ('graph', 0.165), ('si', 0.159), ('mas', 0.156), ('elp', 0.156), ('qt', 0.143), ('rt', 0.121), ('arcs', 0.118), ('lk', 0.118), ('la', 0.11), ('system', 0.11), ('tt', 0.11), ('bandit', 0.108), ('ln', 0.108), ('undirected', 0.103), ('informed', 0.102), ('corollary', 0.101), ('arc', 0.1), ('independence', 0.1), ('pi', 0.097), ('adversarial', 0.095), ('observation', 0.09), ('pj', 0.089), ('sit', 0.086), ('expert', 0.086), ('playing', 0.085), ('qi', 0.08), ('bt', 0.078), ('losses', 0.07), ('actions', 0.066), ('graphs', 0.065), ('reveals', 0.064), ('pt', 0.062), ('followers', 0.059), ('wi', 0.059), ('acyclic', 0.058), ('disclosed', 0.052), ('bound', 0.05), ('symmetric', 0.05), ('buy', 0.045), ('exploration', 0.045), ('israel', 0.043), ('setting', 0.043), ('nicol', 0.043), ('loss', 0.042), ('doubling', 0.041), ('users', 0.041), ('bandits', 0.041), ('adversary', 0.04), ('subgraph', 0.04), ('learner', 0.039), ('logarithmic', 0.038), ('cover', 0.038), ('erd', 0.038), ('prediction', 0.036), ('route', 0.036), ('game', 0.035), ('liked', 0.035), ('observes', 0.035), ('mannor', 0.033), ('thereof', 0.033), ('partial', 0.033), ('social', 0.033), ('variant', 0.033), ('recommendation', 0.033), ('announced', 0.032), ('bsf', 0.032), ('domination', 0.032), ('israeli', 0.032), ('products', 0.031), ('accessing', 0.03), ('routes', 0.03), ('randomization', 0.03), ('upper', 0.03), ('freund', 0.029), ('round', 0.029), ('delivers', 0.029), ('kt', 0.029), ('greedy', 0.028), ('irrespective', 0.027), ('buying', 0.027), ('hedge', 0.027), ('alon', 0.027), ('quantities', 0.027), ('quantity', 0.027), ('claudio', 0.026), ('sharper', 0.026), ('adversarially', 0.026), ('cities', 0.026), ('pk', 0.026), ('factors', 0.026), ('edge', 0.025), ('yoav', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence

Author: Noga Alon, Nicolò Cesa-Bianchi, Claudio Gentile, Yishay Mansour

Abstract: We consider the partial observability model for multi-armed bandits, introduced by Mannor and Shamir [14]. Our main result is a characterization of regret in the directed observability model in terms of the dominating and independence numbers of the observability graph (which must be accessible before selecting an action). In the undirected case, we show that the learner can achieve optimal regret without even accessing the observability graph before selecting an action. Both results are shown using variants of the Exp3 algorithm operating on the observability graph in a time-efficient manner. 1

2 0.32129827 191 nips-2013-Minimax Optimal Algorithms for Unconstrained Linear Optimization

Author: Brendan McMahan, Jacob Abernethy

Abstract: We design and analyze minimax-optimal algorithms for online linear optimization games where the player’s choice is unconstrained. The player strives to minimize regret, the difference between his loss and the loss of a post-hoc benchmark strategy. While the standard benchmark is the loss of the best strategy chosen from a bounded comparator set, we consider a very broad range of benchmark functions. The problem is cast as a sequential multi-stage zero-sum game, and we give a thorough analysis of the minimax behavior of the game, providing characterizations for the value of the game, as well as both the player’s and the adversary’s optimal strategy. We show how these objects can be computed efficiently under certain circumstances, and by selecting an appropriate benchmark, we construct a novel hedging strategy for an unconstrained betting game. 1

3 0.27819011 325 nips-2013-The Pareto Regret Frontier

Author: Wouter M. Koolen

Abstract: Performance guarantees for online learning algorithms typically take the form of regret bounds, which express that the cumulative loss overhead compared to the best expert in hindsight is small. In the common case of large but structured expert sets we typically wish to keep the regret especially small compared to simple experts, at the cost of modest additional overhead compared to more complex others. We study which such regret trade-offs can be achieved, and how. We analyse regret w.r.t. each individual expert as a multi-objective criterion in the simple but fundamental case of absolute loss. We characterise the achievable and Pareto optimal trade-offs, and the corresponding optimal strategies for each sample size both exactly for each finite horizon and asymptotically. 1

4 0.2587046 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries

Author: Nicolò Cesa-Bianchi, Ofer Dekel, Ohad Shamir

Abstract: We study the power of different types of adaptive (nonoblivious) adversaries in the setting of prediction with expert advice, under both full-information and bandit feedback. We measure the player’s performance using a new notion of regret, also known as policy regret, which better captures the adversary’s adaptiveness to the player’s behavior. In a setting where losses are allowed to drift, we characterize —in a nearly complete manner— the power of adaptive adversaries with bounded memories and switching costs. In particular, we show that with switch� ing costs, the attainable rate with bandit feedback is Θ(T 2/3 ). Interestingly, this √ rate is significantly worse than the Θ( T ) rate attainable with switching costs in the full-information case. Via a novel reduction from experts to bandits, we also � show that a bounded memory adversary can force Θ(T 2/3 ) regret even in the full information case, proving that switching costs are easier to control than bounded memory adversaries. Our lower bounds rely on a new stochastic adversary strategy that generates loss processes with strong dependencies. 1

5 0.24901403 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions

Author: Yasin Abbasi, Peter Bartlett, Varun Kanade, Yevgeny Seldin, Csaba Szepesvari

Abstract: We study the problem of online learning Markov Decision Processes (MDPs) when both the transition distributions and loss functions are chosen by an adversary. We present an algorithm that, under a mixing assumption, achieves p O( T log |⇧| + log |⇧|) regret with respect to a comparison set of policies ⇧. The regret is independent of the size of the state and action spaces. When expectations over sample paths can be computed efficiently and the comparison set ⇧ has polynomial size, this algorithm is efficient. We also consider the episodic adversarial online shortest path problem. Here, in each episode an adversary may choose a weighted directed acyclic graph with an identified start and finish node. The goal of the learning algorithm is to choose a path that minimizes the loss while traversing from the start to finish node. At the end of each episode the loss function (given by weights on the edges) is revealed to the learning algorithm. The goal is to minimize regret with respect to a fixed policy for selecting paths. This problem is a special case of the online MDP problem. It was shown that for randomly chosen graphs and adversarial losses, the problem can be efficiently solved. We show that it also can be efficiently solved for adversarial graphs and randomly chosen losses. When both graphs and losses are adversarially chosen, we show that designing efficient algorithms for the adversarial online shortest path problem (and hence for the adversarial MDP problem) is as hard as learning parity with noise, a notoriously difficult problem that has been used to design efficient cryptographic schemes. Finally, we present an efficient algorithm whose regret scales linearly with the number of distinct graphs. 1

6 0.23696758 230 nips-2013-Online Learning with Costly Features and Labels

7 0.21983594 240 nips-2013-Optimization, Learning, and Games with Predictable Sequences

8 0.18566148 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search

9 0.15210047 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration

10 0.14167567 7 nips-2013-A Gang of Bandits

11 0.14065337 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes

12 0.13952103 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings

13 0.13273847 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems

14 0.11540889 137 nips-2013-High-Dimensional Gaussian Process Bandits

15 0.11496753 95 nips-2013-Distributed Exploration in Multi-Armed Bandits

16 0.1109376 35 nips-2013-Analyzing the Harmonic Structure in Graph-Based Learning

17 0.098888859 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching

18 0.094322257 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling

19 0.087321348 89 nips-2013-Dimension-Free Exponentiated Gradient

20 0.086853951 26 nips-2013-Adaptive Market Making via Online Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.225), (1, -0.197), (2, 0.222), (3, -0.224), (4, 0.026), (5, -0.102), (6, 0.062), (7, -0.1), (8, -0.01), (9, -0.038), (10, 0.01), (11, -0.115), (12, 0.181), (13, -0.013), (14, 0.083), (15, -0.092), (16, 0.013), (17, -0.132), (18, -0.082), (19, -0.049), (20, -0.145), (21, -0.032), (22, -0.102), (23, 0.028), (24, -0.016), (25, -0.053), (26, 0.003), (27, 0.07), (28, -0.022), (29, 0.04), (30, -0.069), (31, -0.012), (32, 0.032), (33, -0.002), (34, 0.067), (35, -0.017), (36, -0.064), (37, 0.004), (38, -0.049), (39, -0.004), (40, 0.045), (41, 0.012), (42, 0.07), (43, -0.017), (44, -0.018), (45, -0.047), (46, -0.05), (47, -0.029), (48, -0.037), (49, -0.056)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96493262 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence

Author: Noga Alon, Nicolò Cesa-Bianchi, Claudio Gentile, Yishay Mansour

Abstract: We consider the partial observability model for multi-armed bandits, introduced by Mannor and Shamir [14]. Our main result is a characterization of regret in the directed observability model in terms of the dominating and independence numbers of the observability graph (which must be accessible before selecting an action). In the undirected case, we show that the learner can achieve optimal regret without even accessing the observability graph before selecting an action. Both results are shown using variants of the Exp3 algorithm operating on the observability graph in a time-efficient manner. 1

2 0.8467859 325 nips-2013-The Pareto Regret Frontier

Author: Wouter M. Koolen

Abstract: Performance guarantees for online learning algorithms typically take the form of regret bounds, which express that the cumulative loss overhead compared to the best expert in hindsight is small. In the common case of large but structured expert sets we typically wish to keep the regret especially small compared to simple experts, at the cost of modest additional overhead compared to more complex others. We study which such regret trade-offs can be achieved, and how. We analyse regret w.r.t. each individual expert as a multi-objective criterion in the simple but fundamental case of absolute loss. We characterise the achievable and Pareto optimal trade-offs, and the corresponding optimal strategies for each sample size both exactly for each finite horizon and asymptotically. 1

3 0.84325159 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries

Author: Nicolò Cesa-Bianchi, Ofer Dekel, Ohad Shamir

Abstract: We study the power of different types of adaptive (nonoblivious) adversaries in the setting of prediction with expert advice, under both full-information and bandit feedback. We measure the player’s performance using a new notion of regret, also known as policy regret, which better captures the adversary’s adaptiveness to the player’s behavior. In a setting where losses are allowed to drift, we characterize —in a nearly complete manner— the power of adaptive adversaries with bounded memories and switching costs. In particular, we show that with switch� ing costs, the attainable rate with bandit feedback is Θ(T 2/3 ). Interestingly, this √ rate is significantly worse than the Θ( T ) rate attainable with switching costs in the full-information case. Via a novel reduction from experts to bandits, we also � show that a bounded memory adversary can force Θ(T 2/3 ) regret even in the full information case, proving that switching costs are easier to control than bounded memory adversaries. Our lower bounds rely on a new stochastic adversary strategy that generates loss processes with strong dependencies. 1

4 0.77564132 191 nips-2013-Minimax Optimal Algorithms for Unconstrained Linear Optimization

Author: Brendan McMahan, Jacob Abernethy

Abstract: We design and analyze minimax-optimal algorithms for online linear optimization games where the player’s choice is unconstrained. The player strives to minimize regret, the difference between his loss and the loss of a post-hoc benchmark strategy. While the standard benchmark is the loss of the best strategy chosen from a bounded comparator set, we consider a very broad range of benchmark functions. The problem is cast as a sequential multi-stage zero-sum game, and we give a thorough analysis of the minimax behavior of the game, providing characterizations for the value of the game, as well as both the player’s and the adversary’s optimal strategy. We show how these objects can be computed efficiently under certain circumstances, and by selecting an appropriate benchmark, we construct a novel hedging strategy for an unconstrained betting game. 1

5 0.74200237 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions

Author: Yasin Abbasi, Peter Bartlett, Varun Kanade, Yevgeny Seldin, Csaba Szepesvari

Abstract: We study the problem of online learning Markov Decision Processes (MDPs) when both the transition distributions and loss functions are chosen by an adversary. We present an algorithm that, under a mixing assumption, achieves p O( T log |⇧| + log |⇧|) regret with respect to a comparison set of policies ⇧. The regret is independent of the size of the state and action spaces. When expectations over sample paths can be computed efficiently and the comparison set ⇧ has polynomial size, this algorithm is efficient. We also consider the episodic adversarial online shortest path problem. Here, in each episode an adversary may choose a weighted directed acyclic graph with an identified start and finish node. The goal of the learning algorithm is to choose a path that minimizes the loss while traversing from the start to finish node. At the end of each episode the loss function (given by weights on the edges) is revealed to the learning algorithm. The goal is to minimize regret with respect to a fixed policy for selecting paths. This problem is a special case of the online MDP problem. It was shown that for randomly chosen graphs and adversarial losses, the problem can be efficiently solved. We show that it also can be efficiently solved for adversarial graphs and randomly chosen losses. When both graphs and losses are adversarially chosen, we show that designing efficient algorithms for the adversarial online shortest path problem (and hence for the adversarial MDP problem) is as hard as learning parity with noise, a notoriously difficult problem that has been used to design efficient cryptographic schemes. Finally, we present an efficient algorithm whose regret scales linearly with the number of distinct graphs. 1

6 0.64367354 230 nips-2013-Online Learning with Costly Features and Labels

7 0.63116044 240 nips-2013-Optimization, Learning, and Games with Predictable Sequences

8 0.55328029 159 nips-2013-Learning Prices for Repeated Auctions with Strategic Buyers

9 0.54007 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search

10 0.49675161 270 nips-2013-Regret based Robust Solutions for Uncertain Markov Decision Processes

11 0.48626041 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems

12 0.47736472 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes

13 0.45903885 26 nips-2013-Adaptive Market Making via Online Learning

14 0.44618425 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings

15 0.4372313 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration

16 0.43206698 35 nips-2013-Analyzing the Harmonic Structure in Graph-Based Learning

17 0.42196971 87 nips-2013-Density estimation from unweighted k-nearest neighbor graphs: a roadmap

18 0.40970269 7 nips-2013-A Gang of Bandits

19 0.40805903 89 nips-2013-Dimension-Free Exponentiated Gradient

20 0.40455773 25 nips-2013-Adaptive Anonymity via $b$-Matching


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.081), (16, 0.014), (33, 0.112), (34, 0.069), (41, 0.035), (49, 0.036), (51, 0.166), (56, 0.167), (70, 0.03), (85, 0.125), (89, 0.04), (93, 0.038), (95, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.86195576 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence

Author: Noga Alon, Nicolò Cesa-Bianchi, Claudio Gentile, Yishay Mansour

Abstract: We consider the partial observability model for multi-armed bandits, introduced by Mannor and Shamir [14]. Our main result is a characterization of regret in the directed observability model in terms of the dominating and independence numbers of the observability graph (which must be accessible before selecting an action). In the undirected case, we show that the learner can achieve optimal regret without even accessing the observability graph before selecting an action. Both results are shown using variants of the Exp3 algorithm operating on the observability graph in a time-efficient manner. 1

2 0.79432106 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration

Author: Bruno Scherrer

Abstract: Given a Markov Decision Process (MDP) with n states and m actions per state, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal “-discounted optimal policy. We consider two variations of PI: Howard’s PI that changes the actions in all states with a positive advantage, and Simplex-PI that only changes the action in the state with maximal Ï advantage. We show that Howard’s PI terminates 1 2Ì 1 1 22 1 1 nm 1 after at most n(m ≠ 1) 1≠“ log 1≠“ = O 1≠“ log 1≠“ iterations, improving by a factor O(log 1 a result by [3], while Simplex-PI terminates n) 1 22 1 2 1 22 2 1 1 2 after at most n (m ≠ 1) 1 + 1≠“ log 1≠“ = O n m log 1≠“ 1≠“ iterations, improving by a factor O(log n) a result by [11]. Under some structural assumptions of the MDP, we then consider bounds that are independent of the discount factor “: given a measure of the maximal transient time ·t and the maximal time ·r to revisit states in recurrent classes under all policies, we show that Simplex-PI terminates after at most n2 (m≠ # $ 1) (Á·r log(n·r )Ë + Á·r log(n·t )Ë) (m ≠ 1)Án·t log(n·t )Ë + Án·t log(n2 ·t )Ë = !

3 0.79363424 332 nips-2013-Tracking Time-varying Graphical Structure

Author: Erich Kummerfeld, David Danks

Abstract: Structure learning algorithms for graphical models have focused almost exclusively on stable environments in which the underlying generative process does not change; that is, they assume that the generating model is globally stationary. In real-world environments, however, such changes often occur without warning or signal. Real-world data often come from generating models that are only locally stationary. In this paper, we present LoSST, a novel, heuristic structure learning algorithm that tracks changes in graphical model structure or parameters in a dynamic, real-time manner. We show by simulation that the algorithm performs comparably to batch-mode learning when the generating graphical structure is globally stationary, and significantly better when it is only locally stationary. 1

4 0.78394276 66 nips-2013-Computing the Stationary Distribution Locally

Author: Christina E. Lee, Asuman Ozdaglar, Devavrat Shah

Abstract: Computing the stationary distribution of a large finite or countably infinite state space Markov Chain (MC) has become central in many problems such as statistical inference and network analysis. Standard methods involve large matrix multiplications as in power iteration, or simulations of long random walks, as in Markov Chain Monte Carlo (MCMC). Power iteration is costly, as it involves computation at every state. For MCMC, it is difficult to determine whether the random walks are long enough to guarantee convergence. In this paper, we provide a novel algorithm that answers whether a chosen state in a MC has stationary probability larger than some ∆ ∈ (0, 1), and outputs an estimate of the stationary probability. Our algorithm is constant time, using information from a local neighborhood of the state on the graph induced by the MC, which has constant size relative to the state space. The multiplicative error of the estimate is upper bounded by a function of the mixing properties of the MC. Simulation results show MCs for which this method gives tight estimates. 1

5 0.7831468 89 nips-2013-Dimension-Free Exponentiated Gradient

Author: Francesco Orabona

Abstract: I present a new online learning algorithm that extends the exponentiated gradient framework to infinite dimensional spaces. My analysis shows that the algorithm is implicitly able to estimate the L2 norm of the unknown competitor, U , achieving √ a regret bound of the order of O(U log(U T + 1)) T ), instead of the standard √ O((U 2 + 1) T ), achievable without knowing U . For this analysis, I introduce novel tools for algorithms with time-varying regularizers, through the use of local smoothness. Through a lower bound, I also show that the algorithm is optimal up to log(U T ) term for linear and Lipschitz losses. 1

6 0.78137201 86 nips-2013-Demixing odors - fast inference in olfaction

7 0.77624881 7 nips-2013-A Gang of Bandits

8 0.77073985 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings

9 0.77048278 240 nips-2013-Optimization, Learning, and Games with Predictable Sequences

10 0.76883698 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions

11 0.76870257 206 nips-2013-Near-Optimal Entrywise Sampling for Data Matrices

12 0.76555181 182 nips-2013-Manifold-based Similarity Adaptation for Label Propagation

13 0.76506621 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting

14 0.76335239 272 nips-2013-Regularized Spectral Clustering under the Degree-Corrected Stochastic Blockmodel

15 0.76207864 191 nips-2013-Minimax Optimal Algorithms for Unconstrained Linear Optimization

16 0.76044112 228 nips-2013-Online Learning of Dynamic Parameters in Social Networks

17 0.75902772 232 nips-2013-Online PCA for Contaminated Data

18 0.75746411 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries

19 0.75694758 25 nips-2013-Adaptive Anonymity via $b$-Matching

20 0.75591671 142 nips-2013-Information-theoretic lower bounds for distributed statistical estimation with communication constraints