nips nips2013 nips2013-95 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Eshcar Hillel, Zohar S. Karnin, Tomer Koren, Ronny Lempel, Oren Somekh
Abstract: We study exploration in Multi-Armed Bandits in a setting where k players collaborate in order to identify an ε-optimal arm. Our motivation comes from recent employment of bandit algorithms in computationally intensive, large-scale applications. Our results demonstrate a non-trivial tradeoff between the number of arm pulls required by each of the players, and the amount of communication between them. In particular, our main result shows that by allowing the k players to communicate only once, they are able to learn √k times faster than a single player. That is, distributing learning to k players gives rise to a factor √k parallel speedup. We complement this result with a lower bound showing this is in general the best possible. On the other extreme, we present an algorithm that achieves the ideal factor k speed-up in learning performance, with communication only logarithmic in 1/ε.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We study exploration in Multi-Armed Bandits in a setting where k players collaborate in order to identify an ε-optimal arm. [sent-8, score-0.525]
2 Our results demonstrate a non-trivial tradeoff between the number of arm pulls required by each of the players, and the amount of communication between them. [sent-10, score-1.012]
3 In particular, our main result shows that by allowing the k players to communicate only once, they are able to learn √k times faster than a single player. [sent-11, score-0.41]
4 That is, distributing learning to k players gives rise to a factor √k parallel speedup. [sent-12, score-0.444]
5 Following recent MAB literature [14, 3, 15, 18], we focus on the problem of identifying a “good” bandit arm with high confidence. [sent-21, score-0.655]
6 In this problem, we may repeatedly choose one arm (corresponding to an action) and observe a reward drawn from a probability distribution associated with that arm. [sent-22, score-0.672]
7 Our goal is to find an arm with an (almost) optimal expected reward, with as few arm pulls as possible (that is, minimize the simple regret [7]). [sent-23, score-1.437]
8 In our setup, a distributed strategy is evaluated by the number of arm pulls per node required for the task, which correlates with the parallel speed-up obtained by distributing the learning process. [sent-27, score-0.993]
9 In our model, there are k players that correspond to k independent machines in a cluster. [sent-29, score-0.391]
10 The players are presented with a set of arms, with a common goal of identifying a good arm. [sent-30, score-0.417]
11 Each player receives a stream of queries; upon each query, it chooses an arm to pull. [sent-31, score-0.98]
12 To collaborate, the players may communicate with each other. [sent-33, score-0.455]
13 We assume that the bandwidth of the underlying network is limited, so that players cannot simply share every piece of information. [sent-34, score-0.391]
14 Also, communicating over the network might incur substantial latencies, so players should refrain from doing so as much as possible. [sent-35, score-0.408]
15 When measuring communication of a certain multi-player protocol we consider the number of communication rounds it requires, where in a round of communication each player broadcasts a single message (of arbitrary size) to all other players. [sent-36, score-1.094]
16 At one extreme, if all players broadcast to each other each and every arm reward as it is observed, they can simply simulate the decisions of a serial, optimal algorithm. [sent-39, score-1.048]
17 At the other extreme, if the players never communicate, each will suffer the learning curve of a single player, thereby avoiding any possible speed-up the distributed system may provide. [sent-41, score-0.451]
18 Considering the high cost of communication, perhaps the simplest and most important question that arises is how well the players can learn while keeping communication to a bare minimum. [sent-43, score-0.547]
19 More specifically, is there a non-trivial strategy by which the players can identify a “good” arm while communicating only once, at the end of the process? [sent-44, score-1.067]
20 As a corollary, we derive an algorithm that demonstrates an explicit trade-off between the number of arm pulls and the amount of inter-player communication. [sent-51, score-0.837]
21 In the MAB literature, several recent works consider multi-player MAB scenarios in which players actually compete with each other, either on arm-pulls resources [15] or on the rewards received [19]. [sent-54, score-0.419]
22 In contrast, we study a collaborative multi-player problem and investigate how sharing observations helps players achieve their common goal. [sent-55, score-0.418]
23 The players are given n arms, enumerated by [n] := {1, 2, . . . , n}. [sent-64, score-0.391]
24 Each arm i ∈ [n] is associated with a reward, which is a [0, 1]-valued random variable with expectation pi . [sent-68, score-0.654]
25 In each time step t = 1, . . . , T, each player pulls one arm of his choice and observes an independent sample of its reward. [sent-73, score-1.24]
26 Each player may choose any of the arms, regardless of the other players and their actions. [sent-74, score-0.809]
27 At the end of the game, each player must commit to a single arm. [sent-75, score-0.421]
28 In a communication round, which may take place at any predefined time step, each player may broadcast a message to all other players. [sent-76, score-0.604]
29 In the best-arm identification version of the problem, the goal of a multi-player algorithm, given some target confidence level δ > 0, is that with probability at least 1 − δ all players correctly identify the best arm (i.e., the arm with the highest expected reward). [sent-78, score-1.093]
30 For simplicity, we assume in this setting that the best arm is unique. [sent-81, score-0.597]
31 Similarly, in the (ε, δ)-PAC variant the goal is that each player finds an ε-optimal (or “ε-best”) arm, that is an arm i with pi ≥ p1 − ε, with high probability. [sent-82, score-1.057]
32 We use the notation ∆i := p1 − pi to denote the suboptimality gap of arm i, and occasionally use ∆ := ∆2 for denoting the minimal gap. [sent-84, score-0.654]
33 In the best-arm version of the problem, where we assume that the best arm is unique, we have ∆i > 0 for all i > 1. [sent-85, score-0.597]
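A tiny illustration of the notation just introduced (our own snippet, not from the paper; the reward means and ε below are arbitrary):

```python
# Toy illustration: means p_i, gaps Delta_i = p_1 - p_i, minimal gap Delta = Delta_2,
# and the set of eps-best arms (p_i >= p_1 - eps).
import numpy as np

p = np.array([0.9, 0.85, 0.7, 0.5])      # hypothetical means, sorted so p[0] = p_1
eps = 0.1

gaps = p[0] - p                          # Delta = (0, 0.05, 0.2, 0.4)
minimal_gap = gaps[1]                    # Delta := Delta_2 = 0.05
eps_best = np.where(p >= p[0] - eps)[0]  # eps-best arms: indices 0 and 1
print(gaps, minimal_gap, eps_best)
```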
34 … 1/(∆_i^ε)²   (1)   Intuitively, the hardness of the task is therefore captured by the quantity Hε, which is roughly the number of arm pulls needed to find an ε-best arm with a reasonable probability; see also [3] for a discussion. [sent-99, score-1.414]
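Only the tail of display (1) survived extraction. Reading it together with the sums that appear later (e.g., in the bound for Algorithm 2 and in the R-round theorem), a plausible reconstruction is the following; the exact truncation of the gap at ε is our assumption, not something stated in the surviving text:

```latex
% Hedged reconstruction of (1); the form of \Delta_i^\varepsilon is assumed.
\[
  H_\varepsilon \;:=\; \sum_{i=2}^{n} \frac{1}{(\Delta_i^\varepsilon)^2},
  \qquad \Delta_i^\varepsilon \;\approx\; \max\{\Delta_i,\varepsilon\}.
\]
% Roughly: a serial learner needs \tilde{O}(H_\varepsilon) pulls; with a single
% communication round, each of the k players needs \tilde{O}(H_\varepsilon/\sqrt{k});
% with O(\log(1/\varepsilon)) rounds, \tilde{O}(H_\varepsilon/k) per player suffices.
```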
35 The first obvious approach, already mentioned earlier, is the no-communication strategy: just let each player explore the arms in isolation from the other players, following an independent instance of some serial strategy; at the end of the executions, all players hold an ε-best arm. [sent-103, score-1.149]
36 Clearly, this approach performs poorly in terms of learning performance, needing Ω̃(Hε) pulls per player in the worst case and not leading to any parallel speed-up. [sent-104, score-0.725]
37 Another straightforward approach is to employ a majority vote among the players: let each player independently identify an arm, and choose the arm having most of the votes (alternatively, at least half of the votes). [sent-105, score-1.188]
38 However, this approach does not lead to any improvement in performance: for this vote to work, each player has to solve the problem correctly with reasonable probability, which already requires Ω̃(Hε) pulls from each of them. [sent-106, score-0.743]
39 This approach is of course prohibited in our context, but is able to achieve the optimal parallel speed-up, linear in the number of players k. [sent-111, score-0.419]
40 For the best-arm identification task, our observation is that by letting each player explore a smaller set of n/√k arms chosen at random and choose one of them as “best”, about √k of the players would come up with the global best arm. [sent-116, score-1.103]
41 This (partial) consensus on a single arm is a key aspect in our approach, since it allows the players to identify the correct best arm among the votes of all k players, after sharing information only once. [sent-117, score-1.657]
42 Although our goal here is pure exploration, in our algorithms each player follows an explore-exploit strategy. [sent-119, score-0.403]
43 The idea is that a player should sample his recommended arm as much as his budget permits, even if it was easy to identify in his small-sized problem. [sent-120, score-1.033]
44 This way we can guarantee that the top arms are sampled to a sufficient precision by the time each of the players has to choose a single best arm. [sent-121, score-0.7]
45 As mentioned above, an agreement on a single arm is essential for a vote to work. [sent-123, score-0.637]
46 In the case of multiple communication rounds, we present a distributed elimination-based algorithm that discards arms right after each communication round. [sent-126, score-0.646]
47 Between rounds, we share the workload among the players uniformly. [sent-127, score-0.422]
48 We show that the number of such rounds can be reduced to as low as O(log(1/ε)), by eliminating all 2−r -suboptimal arms in the r’th round. [sent-128, score-0.391]
49 3 One Communication Round This section considers the most basic variant of the multi-player MAB problem, where each player is only allowed a single transmission, when finishing her queries. [sent-133, score-0.403]
50 We complement these algorithms (in Section 3.3) with a lower bound for the required budget of arm pulls in this setting. [sent-139, score-0.853]
51 Our algorithms in this section assume the availability of a serial algorithm A(A, ε) that, given a set of arms A and target accuracy ε, identifies an ε-best arm in A with probability at least 2/3 using no more than cA · Σ_{i∈A} (1/(∆_i^ε)²) log(|A|/(ε ∆_i))   (2)   arm pulls, for some constant cA > 1. [sent-140, score-1.539]
52 For simplicity, we present a version matching δ = 1/3, meaning that the algorithm produces the correct arm with probability at least 2/3; we later explain how to extend it to deal with arbitrary values of δ. [sent-145, score-0.625]
53 Our algorithm is akin to a majority vote among the multiple players, in which each player pulls arms in two stages. [sent-146, score-1.025]
54 In the first EXPLORE stage, each player independently solves a “smaller” MAB instance on a random subset of the arms using the exploration strategy A. [sent-147, score-0.777]
55 In the second EXPLOIT stage, each player exploits the arm identified as “best” in the first stage, and communicates that arm and its observed average reward. [sent-148, score-1.599]
56 An appealing feature of our algorithm is that it requires each player to transmit a single message of constant size (up to logarithmic factors). [sent-150, score-0.448]
57 The algorithm uses a single communication round, in which each player communicates Õ(1) bits. (Interleaved Algorithm 1 box: “if the algorithm fails to identify any arm or does not terminate gracefully, let ij be an …”) [sent-156, score-1.145]
58 Note that we can still do that with one communication round (at the end of all executions), but each player now has to communicate O(log(1/δ)) values. (Interleaved Algorithm 1 box: “7: let ki be the number of players j with ij = i, and define A = {i : ki > √k}; 8: let p̂i = (1/ki) Σ_{j : ij = i} q̂j for all i.”) [sent-158, score-1.479]
59 …algorithm that, given Õ((1/√k) · Σ_{i=2}^n 1/∆_i²) arm pulls, identifies the best arm correctly with probability at least 1 − δ. [sent-162, score-1.242]
60 The algorithm uses a single communication round, in which each player communicates O(log(1/δ)) numerical values. [sent-163, score-0.601]
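The following is a minimal simulation of the two-stage EXPLORE/EXPLOIT protocol described above (our own illustrative code, not the authors' implementation). The 6n/√k subset size, the even budget split between the stages, the stand-in uniform-sampling subroutine used in place of the serial algorithm A, and the fallback when no arm clears the vote threshold are all assumptions; only the √k vote threshold and the averaging of exploit-phase rewards follow the interleaved Algorithm 1 steps quoted above.

```python
# Sketch of a one-round collaborative best-arm protocol in the spirit of Algorithm 1.
import numpy as np

rng = np.random.default_rng(0)

def pull(p, arm, times):
    """Sample `times` Bernoulli rewards of one arm and return their mean."""
    return rng.binomial(times, p[arm]) / times

def one_round_best_arm(p, k, T):
    n = len(p)
    m = min(n, max(1, int(np.ceil(6 * n / np.sqrt(k)))))   # subset size ~ c*n/sqrt(k) (assumption)
    votes, exploit_means = [], []
    for _ in range(k):                                      # each player acts independently
        subset = rng.choice(n, size=m, replace=False)
        # EXPLORE: uniform sampling over the subset, as a stand-in for the serial algorithm A.
        per_arm = max(1, (T // 2) // len(subset))
        means = np.array([pull(p, a, per_arm) for a in subset])
        best_local = subset[np.argmax(means)]
        # EXPLOIT: sample the locally-best arm with the remaining budget, then broadcast.
        q = pull(p, best_local, max(1, T - per_arm * len(subset)))
        votes.append(best_local)
        exploit_means.append(q)
    votes, exploit_means = np.array(votes), np.array(exploit_means)
    # Single communication round: keep arms with more than sqrt(k) votes (step 7 above),
    # average the exploit rewards of their voters (step 8), return the empirical best.
    counts = np.bincount(votes, minlength=n)
    candidates = np.where(counts > np.sqrt(k))[0]
    if len(candidates) == 0:
        return int(votes[0])                                # fallback (assumption)
    avg = [exploit_means[votes == i].mean() for i in candidates]
    return int(candidates[int(np.argmax(avg))])

p = np.array([0.7, 0.6] + [0.5] * 38)
print(one_round_best_arm(p, k=100, T=6000))                 # typically prints 0 (the best arm)
```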
61 We show that a budget T of samples (arm pulls) per player, where T ≥ (24cA/√k) · Σ_{i=2}^n (1/∆_i²) ln(n/∆_i)   (3)   suffices for the players to jointly identify the best arm i∗ with the desired probability. [sent-166, score-1.084]
62 Our first lemma shows that each player chooses the global best arm and identifies it as the local best arm with sufficiently large probability. [sent-171, score-1.633]
63 In fact, by letting each player pick a slightly larger subset of O(log(1/δ) · n/√k) arms, we can amplify the success probability to 1 − δ without needing to communicate more than 2 values per player. [sent-172, score-0.541]
64 When (3) holds, each player identifies the (global) best arm correctly after the EXPLORE phase with probability at least 2/√k. [sent-176, score-1.1]
65 Provided that (3) holds, we have |p̂i − pi| ≤ ∆/2 for all arms i ∈ A with probability at least 5/6. [sent-183, score-0.399]
66 Let us first show that with probability at least 5/6, the best arm i∗ is contained in the set A. [sent-189, score-0.645]
67 …Bernoulli random variables {Ij}j, where Ij is the indicator of whether player j chooses arm i∗ after the EXPLORE phase. [sent-193, score-0.98]
68 Next, note that with probability at least 5/6, the arm i ∈ A having the highest empirical reward p̂i is the one with the highest expected reward pi. [sent-196, score-0.903]
69 …Lemma 3.4 shows that with probability at least 5/6, for all arms i ∈ A the estimate p̂i is within ∆/2 of the true bias pi. [sent-198, score-0.476]
70 Hence, via a union bound we conclude that with probability at least 2/3, the best arm is in A and has the highest empirical reward. [sent-199, score-0.645]
71 In other words, with probability at least 2/3 the algorithm outputs the best arm i∗. [sent-200, score-0.645]
72 Here, there might be more than one ε-best arm, so each “successful” player might come up with a different ε-best arm. [sent-203, score-0.403]
73 Nevertheless, our analysis below shows that with high probability, a subset of the players can still agree on a single ε-best arm, which makes it possible to identify it among the votes of all players. [sent-204, score-0.466]
74 Algorithm 2 identifies a 2ε-best arm with probability at least 2/3 using no more than (1/√k) · Σ_{i=2}^n (1/(∆_i^ε)²) log(n/(ε ∆_i)) arm pulls per player, provided that 24 ≤ k ≤ n. [sent-208, score-1.478]
75 The algorithm uses a single communication round, in which each player communicates Õ(1) bits. [sent-209, score-0.601]
76 Our analysis considers two different regimes: n2ε ≤ (1/50)√k and n2ε > (1/50)√k, and shows that in any case, T ≥ (400cA/√k) · Σ_{i=2}^n (1/(∆_i^ε)²) ln(24n/∆_i^ε)   (4)   suffices for identifying a 2ε-best arm with the desired probability. [sent-212, score-0.63]
77 The first lemma shows that at least one of the players is able to find an ε-best arm. [sent-215, score-0.457]
78 When (4) holds, at least one player successfully identifies an ε-best arm in the EXPLORE phase, with probability at least 5/6. [sent-219, score-1.058]
79 The next lemma is more refined and states that in case there are few 2ε-best arms, the probability that each player successfully identifies an ε-best arm grows linearly with nε. [sent-220, score-1.071]
80 When (4) holds, each player identifies an ε-best arm in the EXPLORE phase, with probability at least 2nε/√k. [sent-224, score-1.028]
81 With probability at least 5/6, we have |p̂i − pi| ≤ ε/2 for all arms i ∈ A. [sent-228, score-0.976]
82 For the proofs of the above lemmas, refer to [16]. (Interleaved Algorithm 2 box: “1: for player j = 1 to k do; 2: choose a subset Aj of 12n/√k arms uniformly at random.”) [sent-229, score-0.714]
83 This would complete the proof, since Lemma 3. [sent-234, score-0.614]
84 First, consider the case n2ε > (1/50)√k. [sent-236, score-0.513]
85 Lemma 3.6 shows that with probability 5/6 there exists a player j that identifies an ε-best arm ij. [sent-238, score-1.066]
86 Since ∆i ≤ 2ε for at least n2ε arms, we have … (Interleaved Algorithm 2 box: “9: define A = {i ∈ [n] : ti ≥ (1/ε²) ln(12n)}; 10: return arg maxi∈A p̂i; if the set A is empty, output an arbitrary arm.”) [sent-239, score-0.396]
87 The random variable N is a sum of Bernoulli random variables {Ij}j, where Ij indicates whether player j identified some ε-best arm. [sent-243, score-0.403]
88 That is, with probability 5/6, at least nε√k players found an ε-best arm. [sent-246, score-0.439]
89 A pigeon-hole argument now shows that in this case there exists an ε-best arm i selected by at least √k players. [sent-247, score-0.607]
90 Hence, with probability 5/6 the number of samples of this arm collected in the EXPLOIT phase is at least ti ≥ √k · T/2 > (1/ε²) ln(12n), which means that i ∈ A. [sent-248, score-0.672]
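For the (ε, δ)-PAC variant, the aggregation rule differs from the vote count of Algorithm 1: following the interleaved Algorithm 2 fragments above (steps 9–10), an arm is kept only if its total EXPLOIT-phase sample count clears the (1/ε²) ln(12n) threshold. A small sketch of that step; the message format (arm, exploit sample count, exploit mean) is our assumption:

```python
# Aggregation step of the PAC variant, reconstructed from the Algorithm 2 fragments.
import numpy as np

def aggregate_pac(messages, n, eps):
    """messages: list of (arm i_j, exploit-phase samples s_j, exploit-phase mean q_j)."""
    t = np.zeros(n)                                   # total exploit samples per arm
    s = np.zeros(n)                                   # total exploit reward per arm
    for arm, samples, mean in messages:
        t[arm] += samples
        s[arm] += samples * mean
    A = np.where(t >= np.log(12 * n) / eps**2)[0]     # arms sampled to precision ~eps (step 9)
    if len(A) == 0:
        return 0                                      # arbitrary arm, as in step 10
    p_hat = s[A] / t[A]
    return int(A[np.argmax(p_hat)])                   # arg max over well-sampled arms (step 10)

msgs = [(0, 2000, 0.71), (0, 2000, 0.69), (2, 2000, 0.52), (1, 2000, 0.61)]
print(aggregate_pac(msgs, n=5, eps=0.1))              # prints 0: highest mean among well-sampled arms
```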
91 3 Lower Bound The following theorem suggests that, in general, for identifying the best arm k players achieve a multiplicative speed-up of at most Õ(√k) when allowing one transmission per player (at the end of the game). [sent-250, score-1.515]
92 We accomplish this by letting each player sample uniformly all remaining arms and communicate the results to other players. (Interleaved Algorithm 3 box: “output an arm; 1: initialize S0 ← [n], r ← 0, t0 ← 0; 2: repeat; 3: set r ← r + 1.”) [sent-266, score-1.318]
93 Then, players are able to eliminate suboptimal arms with high confidence. (Interleaved Algorithm 3 box: “4: let εr ← 2^(−r), tr ← (2/(k εr²)) ln(4nr²/δ).”) [sent-267, score-0.665]
94 If each such round is successful, after log2(1/ε) rounds only ε-best arms survive. (Interleaved Algorithm 3 box: “5: for player j = 1 to k do; 6: sample each arm i ∈ Sr−1 for tr − tr−1 times.”) [sent-268, score-1.45]
95 …1 below bounds the number of arm pulls used by this algorithm (a proof can be found in [16]). (Interleaved Algorithm 3 box: “7: let p̂r_{j,i} be the average reward of arm i …”) [sent-270, score-1.527]
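Putting the interleaved Algorithm 3 fragments back together, a simulation sketch of the multi-round elimination protocol might look as follows (our code, not the authors'). The round schedule εr = 2^(−r), tr = (2/(k εr²)) ln(4nr²/δ) follows the pseudocode above; the elimination rule (drop arms whose averaged mean falls εr below the current leader) and the stopping condition are our assumptions about the standard successive-elimination pattern.

```python
# Sketch of the multi-round distributed elimination protocol (Algorithm 3 style).
import numpy as np

rng = np.random.default_rng(1)

def distributed_elimination(p, k, eps, delta):
    n = len(p)
    surviving = np.arange(n)             # S_0 = [n]
    totals = np.zeros(n)                 # cumulative reward sums (all players combined)
    pulls = np.zeros(n)                  # cumulative pull counts (all players combined)
    r, t_prev = 0, 0.0
    while True:
        r += 1
        eps_r = 2.0 ** (-r)
        t_r = (2.0 / (k * eps_r**2)) * np.log(4 * n * r**2 / delta)
        m = max(1, int(np.ceil(t_r - t_prev)))        # new pulls per arm per player this round
        t_prev = t_r
        for i in surviving:                           # all k players sample every surviving arm
            totals[i] += rng.binomial(k * m, p[i])
            pulls[i] += k * m
        # Communication round: players exchange sums, so everyone holds the same averages.
        means = totals[surviving] / pulls[surviving]
        keep = means >= means.max() - eps_r           # elimination rule (assumption)
        surviving = surviving[keep]
        if len(surviving) == 1 or eps_r <= eps / 2:   # stopping rule (assumption)
            return int(surviving[np.argmax(means[keep])]), r

p = np.array([0.9, 0.8, 0.6, 0.55, 0.4, 0.3])
print(distributed_elimination(p, k=10, eps=0.05, delta=0.1))  # typically returns arm 0 after a few rounds
```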
96 By properly tuning the elimination thresholds εr of Algorithm 3 in accordance with the target accuracy ε, we can establish an explicit trade-off between the number of communication rounds and the number of arm pulls each player needs. [sent-277, score-1.549]
97 With probability at least 1 − δ, the modified algorithm • identifies an ε-best arm using Õ((ε^(−2/R)/k) · Σ_{i=2}^n 1/(∆_i^ε)²) arm pulls per player; • terminates after at most R rounds of communication. [sent-283, score-1.617]
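A quick sanity check of the trade-off (our own arithmetic on the bound as stated above):

```latex
% With R communication rounds the per-player cost carries an extra factor
% \varepsilon^{-2/R}; choosing R = \log_2(1/\varepsilon) makes this factor constant:
\[
  \varepsilon^{-2/R} \;=\; 2^{\frac{2}{R}\log_2(1/\varepsilon)} \;=\; 2^{2} \;=\; 4
  \qquad\text{when } R = \log_2(1/\varepsilon),
\]
% so O(\log(1/\varepsilon)) rounds recover, up to constants, the ideal factor-k
% speed-up claimed in the abstract, while R = 1 reads off an \varepsilon^{-2} blow-up.
```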
98 5 Conclusions and Further Research We have considered a collaborative MAB exploration problem, in which several independent players explore a set of arms with a common goal, and obtained the first non-trivial results in such a setting. [sent-284, score-0.765]
99 Our main results apply to the particularly interesting regime where each of the players is allowed a single transmission; this setting fits naturally with common distributed frameworks such as MapReduce. [sent-285, score-0.451]
100 Intuitively, the difficulty here is that in the second phase of a reasonable strategy each player should focus on the arms that excelled in the first phase; this makes the sub-problems faced in the second phase as hard as the entire MAB instance, in terms of the quantity Hε. [sent-288, score-0.768]
wordName wordTfidf (topN-words)
[('arm', 0.577), ('player', 0.403), ('players', 0.391), ('arms', 0.274), ('pulls', 0.26), ('mab', 0.158), ('communication', 0.156), ('rounds', 0.117), ('xplore', 0.082), ('round', 0.079), ('pi', 0.077), ('exploration', 0.073), ('xploit', 0.068), ('ij', 0.068), ('communicate', 0.064), ('serial', 0.063), ('reward', 0.062), ('sr', 0.061), ('distributed', 0.06), ('vote', 0.06), ('identi', 0.054), ('ki', 0.054), ('bandit', 0.052), ('pr', 0.051), ('qj', 0.047), ('labs', 0.043), ('communicates', 0.042), ('yahoo', 0.038), ('votes', 0.038), ('identify', 0.037), ('elimination', 0.036), ('karnin', 0.036), ('lemma', 0.036), ('haifa', 0.035), ('kanade', 0.033), ('phase', 0.032), ('setup', 0.031), ('load', 0.031), ('least', 0.03), ('koren', 0.03), ('majority', 0.028), ('hoeffding', 0.028), ('pull', 0.028), ('parallel', 0.028), ('bandits', 0.028), ('rewards', 0.028), ('strategy', 0.027), ('ln', 0.027), ('collaborative', 0.027), ('eshcar', 0.027), ('halting', 0.027), ('hillel', 0.027), ('plore', 0.027), ('ulti', 0.027), ('aj', 0.027), ('message', 0.027), ('identifying', 0.026), ('transmission', 0.026), ('distributing', 0.025), ('ucb', 0.024), ('collaborate', 0.024), ('radlinski', 0.024), ('mannor', 0.023), ('regret', 0.023), ('executions', 0.022), ('lempel', 0.022), ('amplify', 0.022), ('formly', 0.022), ('terminates', 0.022), ('mapreduce', 0.021), ('best', 0.02), ('correctly', 0.02), ('tradeoff', 0.019), ('saha', 0.019), ('audibert', 0.019), ('theorem', 0.019), ('allowing', 0.019), ('maxi', 0.018), ('terminate', 0.018), ('gracefully', 0.018), ('broadcast', 0.018), ('probability', 0.018), ('logarithmic', 0.018), ('phillips', 0.018), ('needing', 0.018), ('end', 0.018), ('omitted', 0.017), ('communicating', 0.017), ('consensus', 0.017), ('bubeck', 0.017), ('protocols', 0.017), ('daum', 0.017), ('agarwal', 0.016), ('budget', 0.016), ('multiarmed', 0.016), ('ideal', 0.016), ('per', 0.016), ('choose', 0.015), ('ti', 0.015), ('pac', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 95 nips-2013-Distributed Exploration in Multi-Armed Bandits
Author: Eshcar Hillel, Zohar S. Karnin, Tomer Koren, Ronny Lempel, Oren Somekh
Abstract: We study exploration in Multi-Armed Bandits in a setting where k players collaborate in order to identify an ε-optimal arm. Our motivation comes from recent employment of bandit algorithms in computationally intensive, large-scale applications. Our results demonstrate a non-trivial tradeoff between the number of arm pulls required by each of the players, and the amount of communication between them. In particular, our main result shows that by allowing the k players to communicate only once, they are able to learn √k times faster than a single player. That is, distributing learning to k players gives rise to a factor √k parallel speedup. We complement this result with a lower bound showing this is in general the best possible. On the other extreme, we present an algorithm that achieves the ideal factor k speed-up in learning performance, with communication only logarithmic in 1/ε.
2 0.50763005 338 nips-2013-Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards
Author: Thomas Bonald, Alexandre Proutiere
Abstract: We consider an infinite-armed bandit problem with Bernoulli rewards. The mean rewards are independent, uniformly distributed over [0, 1]. Rewards 0 and 1 are referred to as a success and a failure, respectively. We propose a novel algorithm where the decision to exploit any arm is based on two successive targets, namely, the total number of successes until the first failure and until the first m failures, respectively, where m is a fixed parameter. This two-target algorithm achieves a long-term average regret in √(2n) for a large parameter m and a known time horizon n. This regret is optimal and strictly less than the regret achieved by the best known algorithms, which is in 2√n. The results are extended to any mean-reward distribution whose support contains 1 and to unknown time horizons. Numerical experiments show the performance of the algorithm for finite time horizons. 1
3 0.28200242 292 nips-2013-Sequential Transfer in Multi-armed Bandit with Finite Set of Models
Author: Mohammad Gheshlaghi azar, Alessandro Lazaric, Emma Brunskill
Abstract: Learning from prior tasks and transferring that experience to improve future performance is critical for building lifelong learning agents. Although results in supervised and reinforcement learning show that transfer may significantly improve the learning performance, most of the literature on transfer is focused on batch learning tasks. In this paper we study the problem of sequential transfer in online learning, notably in the multi–armed bandit framework, where the objective is to minimize the total regret over a sequence of tasks by transferring knowledge from prior tasks. We introduce a novel bandit algorithm based on a method-of-moments approach for estimating the possible tasks and derive regret bounds for it. 1
4 0.26273695 240 nips-2013-Optimization, Learning, and Games with Predictable Sequences
Author: Sasha Rakhlin, Karthik Sridharan
Abstract: We provide several applications of Optimistic Mirror Descent, an online learning algorithm based on the idea of predictable sequences. First, we recover the Mirror Prox algorithm for offline optimization, prove an extension to Hölder-smooth functions, and apply the results to saddle-point type problems. Next, we prove that a version of Optimistic Mirror Descent (which has a close relation to the Exponential Weights algorithm) can be used by two strongly-uncoupled players in a finite zero-sum matrix game to converge to the minimax equilibrium at the rate of O((log T)/T). This addresses a question of Daskalakis et al [6]. Further, we consider a partial information version of the problem. We then apply the results to convex programming and exhibit a simple algorithm for the approximate Max Flow problem. 1
5 0.24705547 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries
Author: Nicolò Cesa-Bianchi, Ofer Dekel, Ohad Shamir
Abstract: We study the power of different types of adaptive (nonoblivious) adversaries in the setting of prediction with expert advice, under both full-information and bandit feedback. We measure the player’s performance using a new notion of regret, also known as policy regret, which better captures the adversary’s adaptiveness to the player’s behavior. In a setting where losses are allowed to drift, we characterize —in a nearly complete manner— the power of adaptive adversaries with bounded memories and switching costs. In particular, we show that with switching costs, the attainable rate with bandit feedback is Θ(T 2/3). Interestingly, this rate is significantly worse than the Θ(√T) rate attainable with switching costs in the full-information case. Via a novel reduction from experts to bandits, we also show that a bounded memory adversary can force Θ(T 2/3) regret even in the full information case, proving that switching costs are easier to control than bounded memory adversaries. Our lower bounds rely on a new stochastic adversary strategy that generates loss processes with strong dependencies. 1
6 0.22727264 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling
7 0.20870923 330 nips-2013-Thompson Sampling for 1-Dimensional Exponential Family Bandits
8 0.18323791 191 nips-2013-Minimax Optimal Algorithms for Unconstrained Linear Optimization
9 0.17794815 124 nips-2013-Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting
10 0.11496753 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence
11 0.089335442 112 nips-2013-Estimation Bias in Multi-Armed Bandit Algorithms for Search Advertising
12 0.078671537 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration
13 0.076015525 257 nips-2013-Projected Natural Actor-Critic
14 0.070331447 142 nips-2013-Information-theoretic lower bounds for distributed statistical estimation with communication constraints
15 0.067710169 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting
16 0.0605672 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies
17 0.057419028 162 nips-2013-Learning Trajectory Preferences for Manipulators via Iterative Improvement
18 0.052907228 7 nips-2013-A Gang of Bandits
19 0.052424349 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
20 0.048862159 230 nips-2013-Online Learning with Costly Features and Labels
topicId topicWeight
[(0, 0.141), (1, -0.164), (2, 0.153), (3, -0.163), (4, 0.01), (5, -0.142), (6, 0.186), (7, -0.199), (8, -0.097), (9, -0.111), (10, 0.044), (11, 0.212), (12, -0.128), (13, -0.054), (14, -0.054), (15, -0.073), (16, -0.131), (17, 0.063), (18, -0.049), (19, 0.209), (20, 0.152), (21, -0.068), (22, 0.102), (23, 0.001), (24, -0.064), (25, -0.118), (26, 0.14), (27, 0.143), (28, -0.022), (29, -0.114), (30, 0.008), (31, 0.045), (32, 0.09), (33, -0.09), (34, 0.194), (35, 0.06), (36, -0.066), (37, 0.127), (38, -0.059), (39, 0.031), (40, -0.101), (41, 0.01), (42, 0.011), (43, 0.016), (44, -0.1), (45, -0.077), (46, 0.034), (47, -0.042), (48, -0.082), (49, -0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.96857315 95 nips-2013-Distributed Exploration in Multi-Armed Bandits
Author: Eshcar Hillel, Zohar S. Karnin, Tomer Koren, Ronny Lempel, Oren Somekh
Abstract: We study exploration in Multi-Armed Bandits in a setting where k players collaborate in order to identify an ε-optimal arm. Our motivation comes from recent employment of bandit algorithms in computationally intensive, large-scale applications. Our results demonstrate a non-trivial tradeoff between the number of arm pulls required by each of the players, and the amount of communication between them. In particular, our main result shows that by allowing the k players to communicate only once, they are able to learn √k times faster than a single player. That is, distributing learning to k players gives rise to a factor √k parallel speedup. We complement this result with a lower bound showing this is in general the best possible. On the other extreme, we present an algorithm that achieves the ideal factor k speed-up in learning performance, with communication only logarithmic in 1/ε.
2 0.84817797 338 nips-2013-Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards
Author: Thomas Bonald, Alexandre Proutiere
Abstract: We consider an infinite-armed bandit problem with Bernoulli rewards. The mean rewards are independent, uniformly distributed over [0, 1]. Rewards 0 and 1 are referred to as a success and a failure, respectively. We propose a novel algorithm where the decision to exploit any arm is based on two successive targets, namely, the total number of successes until the first failure and until the first m failures, respectively, where m is a fixed parameter. This two-target algorithm achieves a long-term average regret in √(2n) for a large parameter m and a known time horizon n. This regret is optimal and strictly less than the regret achieved by the best known algorithms, which is in 2√n. The results are extended to any mean-reward distribution whose support contains 1 and to unknown time horizons. Numerical experiments show the performance of the algorithm for finite time horizons. 1
3 0.671345 292 nips-2013-Sequential Transfer in Multi-armed Bandit with Finite Set of Models
Author: Mohammad Gheshlaghi azar, Alessandro Lazaric, Emma Brunskill
Abstract: Learning from prior tasks and transferring that experience to improve future performance is critical for building lifelong learning agents. Although results in supervised and reinforcement learning show that transfer may significantly improve the learning performance, most of the literature on transfer is focused on batch learning tasks. In this paper we study the problem of sequential transfer in online learning, notably in the multi–armed bandit framework, where the objective is to minimize the total regret over a sequence of tasks by transferring knowledge from prior tasks. We introduce a novel bandit algorithm based on a method-of-moments approach for estimating the possible tasks and derive regret bounds for it. 1
4 0.54035568 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling
Author: Sebastien Bubeck, Che-Yu Liu
Abstract: We consider the stochastic multi-armed bandit problem with a prior distribution on the reward distributions. We are interested in studying prior-free and prior-dependent regret bounds, very much in the same spirit than the usual distribution-free and distribution-dependent bounds for the non-Bayesian stochastic bandit. We first show that Thompson Sampling attains an optimal prior-free bound in the sense that for any prior distribution its Bayesian regret is bounded from above by 14√(nK). This result is unimprovable in the sense that there exists a prior distribution such that any algorithm has a Bayesian regret bounded from below by (1/20)√(nK). We also study the case of priors for the setting of Bubeck et al. [2013] (where the optimal mean is known as well as a lower bound on the smallest gap) and we show that in this case the regret of Thompson Sampling is in fact uniformly bounded over time, thus showing that Thompson Sampling can greatly take advantage of the nice properties of these priors. 1
5 0.5315851 330 nips-2013-Thompson Sampling for 1-Dimensional Exponential Family Bandits
Author: Nathaniel Korda, Emilie Kaufmann, Remi Munos
Abstract: Thompson Sampling has been demonstrated in many complex bandit models, however the theoretical guarantees available for the parametric multi-armed bandit are still limited to the Bernoulli case. Here we extend them by proving asymptotic optimality of the algorithm using the Jeffreys prior for 1-dimensional exponential family bandits. Our proof builds on previous work, but also makes extensive use of closed forms for Kullback-Leibler divergence and Fisher information (through the Jeffreys prior) available in an exponential family. This allow us to give a finite time exponential concentration inequality for posterior distributions on exponential families that may be of interest in its own right. Moreover our analysis covers some distributions for which no optimistic algorithm has yet been proposed, including heavy-tailed exponential families. 1
6 0.47947514 124 nips-2013-Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting
7 0.41057107 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries
8 0.37525669 240 nips-2013-Optimization, Learning, and Games with Predictable Sequences
9 0.35870072 112 nips-2013-Estimation Bias in Multi-Armed Bandit Algorithms for Search Advertising
10 0.35095268 191 nips-2013-Minimax Optimal Algorithms for Unconstrained Linear Optimization
11 0.2827228 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies
12 0.26419494 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence
13 0.24720168 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning
14 0.24714445 142 nips-2013-Information-theoretic lower bounds for distributed statistical estimation with communication constraints
15 0.2438781 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration
16 0.23474357 325 nips-2013-The Pareto Regret Frontier
17 0.23372446 19 nips-2013-Accelerated Mini-Batch Stochastic Dual Coordinate Ascent
18 0.23102535 198 nips-2013-More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server
19 0.21781948 162 nips-2013-Learning Trajectory Preferences for Manipulators via Iterative Improvement
20 0.2144389 7 nips-2013-A Gang of Bandits
topicId topicWeight
[(2, 0.018), (10, 0.215), (16, 0.027), (33, 0.197), (34, 0.059), (41, 0.02), (49, 0.016), (56, 0.15), (70, 0.02), (85, 0.07), (89, 0.036), (93, 0.061), (95, 0.017)]
simIndex simValue paperId paperTitle
1 0.87754309 342 nips-2013-Unsupervised Spectral Learning of Finite State Transducers
Author: Raphael Bailly, Xavier Carreras, Ariadna Quattoni
Abstract: Finite-State Transducers (FST) are a standard tool for modeling paired inputoutput sequences and are used in numerous applications, ranging from computational biology to natural language processing. Recently Balle et al. [4] presented a spectral algorithm for learning FST from samples of aligned input-output sequences. In this paper we address the more realistic, yet challenging setting where the alignments are unknown to the learning algorithm. We frame FST learning as finding a low rank Hankel matrix satisfying constraints derived from observable statistics. Under this formulation, we provide identifiability results for FST distributions. Then, following previous work on rank minimization, we propose a regularized convex relaxation of this objective which is based on minimizing a nuclear norm penalty subject to linear constraints and can be solved efficiently. 1
2 0.85938692 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation
Author: Boqing Gong, Kristen Grauman, Fei Sha
Abstract: In visual recognition problems, the common data distribution mismatches between training and testing make domain adaptation essential. However, image data is difficult to manually divide into the discrete domains required by adaptation algorithms, and the standard practice of equating datasets with domains is a weak proxy for all the real conditions that alter the statistics in complex ways (lighting, pose, background, resolution, etc.) We propose an approach to automatically discover latent domains in image or video datasets. Our formulation imposes two key properties on domains: maximum distinctiveness and maximum learnability. By maximum distinctiveness, we require the underlying distributions of the identified domains to be different from each other to the maximum extent; by maximum learnability, we ensure that a strong discriminative model can be learned from the domain. We devise a nonparametric formulation and efficient optimization procedure that can successfully discover domains among both training and test data. We extensively evaluate our approach on object recognition and human activity recognition tasks. 1
same-paper 3 0.84323245 95 nips-2013-Distributed Exploration in Multi-Armed Bandits
Author: Eshcar Hillel, Zohar S. Karnin, Tomer Koren, Ronny Lempel, Oren Somekh
Abstract: We study exploration in Multi-Armed Bandits in a setting where k players collaborate in order to identify an ε-optimal arm. Our motivation comes from recent employment of bandit algorithms in computationally intensive, large-scale applications. Our results demonstrate a non-trivial tradeoff between the number of arm pulls required by each of the players, and the amount of communication between them. In particular, our main result shows that by allowing the k players to communicate only once, they are able to learn √k times faster than a single player. That is, distributing learning to k players gives rise to a factor √k parallel speedup. We complement this result with a lower bound showing this is in general the best possible. On the other extreme, we present an algorithm that achieves the ideal factor k speed-up in learning performance, with communication only logarithmic in 1/ε.
4 0.81153715 158 nips-2013-Learning Multiple Models via Regularized Weighting
Author: Daniel Vainsencher, Shie Mannor, Huan Xu
Abstract: We consider the general problem of Multiple Model Learning (MML) from data, from the statistical and algorithmic perspectives; this problem includes clustering, multiple regression and subspace clustering as special cases. A common approach to solving new MML problems is to generalize Lloyd’s algorithm for clustering (or Expectation-Maximization for soft clustering). However this approach is unfortunately sensitive to outliers and large noise: a single exceptional point may take over one of the models. We propose a different general formulation that seeks for each model a distribution over data points; the weights are regularized to be sufficiently spread out. This enhances robustness by making assumptions on class balance. We further provide generalization bounds and explain how the new iterations may be computed efficiently. We demonstrate the robustness benefits of our approach with some experimental results and prove for the important case of clustering that our approach has a non-trivial breakdown point, i.e., is guaranteed to be robust to a fixed percentage of adversarial unbounded outliers. 1
Author: Yuchen Zhang, John Duchi, Michael Jordan, Martin J. Wainwright
Abstract: We establish lower bounds on minimax risks for distributed statistical estimation under a communication budget. Such lower bounds reveal the minimum amount of communication required by any procedure to achieve the centralized minimax-optimal rates for statistical estimation. We study two classes of protocols: one in which machines send messages independently, and a second allowing for interactive communication. We establish lower bounds for several problems, including various types of location models, as well as for parameter estimation in regression models. 1
6 0.77531028 206 nips-2013-Near-Optimal Entrywise Sampling for Data Matrices
7 0.77090234 179 nips-2013-Low-Rank Matrix and Tensor Completion via Adaptive Sampling
8 0.76818955 307 nips-2013-Speedup Matrix Completion with Side Information: Application to Multi-Label Learning
9 0.76814008 311 nips-2013-Stochastic Convex Optimization with Multiple Objectives
10 0.76800108 338 nips-2013-Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards
11 0.7676034 233 nips-2013-Online Robust PCA via Stochastic Optimization
12 0.76648313 113 nips-2013-Exact and Stable Recovery of Pairwise Interaction Tensors
13 0.76367468 110 nips-2013-Estimating the Unseen: Improved Estimators for Entropy and other Properties
14 0.76291084 74 nips-2013-Convex Tensor Decomposition via Structured Schatten Norm Regularization
15 0.76274389 149 nips-2013-Latent Structured Active Learning
16 0.76273078 291 nips-2013-Sensor Selection in High-Dimensional Gaussian Trees with Nuisances
17 0.76147467 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints
18 0.76138759 111 nips-2013-Estimation, Optimization, and Parallelism when Data is Sparse
19 0.76127791 335 nips-2013-Transfer Learning in a Transductive Setting
20 0.76088721 216 nips-2013-On Flat versus Hierarchical Classification in Large-Scale Taxonomies