nips nips2013 nips2013-338 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Thomas Bonald, Alexandre Proutiere
Abstract: We consider an infinite-armed bandit problem with Bernoulli rewards. The mean rewards are independent, uniformly distributed over [0, 1]. Rewards 0 and 1 are referred to as a success and a failure, respectively. We propose a novel algorithm where the decision to exploit any arm is based on two successive targets, namely, the total number of successes until the first failure and until the first m failures, respectively, where m is a fixed parameter. This two-target algorithm achieves a long-term average regret in √(2n) for a large parameter m and a known time horizon n. This regret is optimal and strictly less than the regret achieved by the best known algorithms, which is in 2√n. The results are extended to any mean-reward distribution whose support contains 1 and to unknown time horizons. Numerical experiments show the performance of the algorithm for finite time horizons.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We consider an infinite-armed bandit problem with Bernoulli rewards. [sent-4, score-0.141]
2 The mean rewards are independent, uniformly distributed over [0, 1]. [sent-5, score-0.203]
3 We propose a novel algorithm where the decision to exploit any arm is based on two successive targets, namely, the total number of successes until the first failure and until the first m failures, respectively, where m is a fixed parameter. [sent-7, score-0.856]
4 This two-target algorithm achieves a long-term average regret in √(2n) for a large parameter m and a known time horizon n. [sent-8, score-0.602]
5 This regret is optimal and strictly less than the regret achieved by the best known algorithms, which is in 2√n. [sent-9, score-0.507]
6 While classical multi-armed bandit problems assume a finite number of arms [9], many practical situations involve a large, possibly infinite set of options for the player. [sent-13, score-0.554]
7 In such situations, it is usually not possible to explore all options, a constraint that is best represented by a bandit problem with an infinite number of arms. [sent-15, score-0.231]
8 Moreover, even when the set of options is limited, the time horizon may be too short in practice to enable the full exploration of these options. [sent-16, score-0.447]
9 Unlike classical algorithms like UCB [10, 1], which rely on an initial phase where all arms are sampled once, algorithms for infinite-armed bandits have an intrinsic stopping rule in the number of arms to explore. [sent-17, score-0.825]
10 We believe that this provides useful insights into the design of efficient algorithms for usual finite-armed bandits when the time horizon is relatively short. [sent-18, score-0.377]
11 We consider a stochastic infinite-armed bandit with Bernoulli rewards, the mean reward of each arm having a uniform distribution over [0, 1]. [sent-20, score-0.914]
12 We propose a two-target algorithm based on some fixed parameter m that achieves a long-term average regret in √(2n) for large m and a large known time horizon n. [sent-22, score-0.602]
13 The anytime version of this algorithm achieves a long-term average regret in 2√n for an unknown time horizon n, which we conjecture to be also optimal. [sent-24, score-0.758]
14 Specifically, if the probability that the mean reward exceeds u is equivalent to α(1 − u)^β when u → 1− (footnote: the authors are members of the LINCS, Paris, France), [sent-26, score-0.136]
15 the two-target algorithm achieves a long-term average regret in C(α, β) n^(β/(β+1)), with some explicit constant C(α, β) that depends on whether the time horizon is known or not. [sent-31, score-0.602]
16 This regret is provably optimal when the time horizon is known. [sent-32, score-0.568]
17 The stochastic infinite-armed bandit problem was first considered in a general setting by Mallows and Robbins [12] and then in the particular case of Bernoulli rewards by Herschkorn, Peköz and Ross [6]. [sent-35, score-0.318]
18 The proposed algorithms are first-order optimal in the sense that they minimize the ratio Rn/n for large n, where Rn is the regret after n time steps. [sent-36, score-0.297]
19 In the considered setting of Bernoulli rewards with mean rewards uniformly distributed over [0, 1], this means that the ratio Rn /n tends to 0 almost surely. [sent-37, score-0.403]
20 Berry et al. [2], who propose various algorithms achieving a long-term average regret in 2√n, conjecture that this regret is optimal and provide a lower bound in √(2n). [sent-41, score-0.531]
21 Our algorithm achieves a regret that is arbitrarily close to √(2n), which invalidates the conjecture. [sent-42, score-0.294]
22 The algorithms proposed in [2] and applied in [11, 4, 5, 7] to various mean-reward distributions are variants of the 1-failure strategy, where each arm is played until the first failure, called a run. [sent-46, score-0.679]
23 For instance, the non-recalling √n-run policy consists in exploiting the first arm giving a run larger than √n. [sent-47, score-0.734]
24 For a uniform mean-reward distribution over [0, 1], the average number of explored arms is √n and the selected arm is exploited for the equivalent of n time steps with an expected failure rate of 1/√n, yielding the regret of 2√n. [sent-48, score-1.552]
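As a concrete illustration of this baseline, here is a minimal simulation sketch of the non-recalling √n-run policy under the uniform mean-reward model; the function and variable names are ours, not the paper's, and the script is only a sanity check of the 2√n figure quoted above.

    import numpy as np

    def sqrt_n_run_policy(n, rng):
        """Non-recalling sqrt(n)-run policy: explore fresh arms until one of them
        produces a run of at least sqrt(n) successes before its first failure,
        then exploit that arm for the remaining time."""
        target = int(np.sqrt(n))
        total_reward, t = 0, 0
        while t < n:
            theta = rng.uniform()              # mean reward of a fresh arm
            run = 0
            while t < n:                       # play the arm until its first failure
                x = rng.random() < theta
                total_reward += x
                t += 1
                if x:
                    run += 1
                else:
                    break
            if run >= target:                  # target reached: exploit until the horizon
                total_reward += rng.binomial(n - t, theta)
                t = n
        return n - total_reward                # regret R_n

    rng = np.random.default_rng(0)
    n = 10_000
    print(np.mean([sqrt_n_run_policy(n, rng) for _ in range(200)]) / np.sqrt(n))  # close to 2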
25 We introduce a second target to improve the expected failure rate of the selected arm, at the expense of a slightly more expensive exploration phase. [sent-49, score-0.318]
26 Specifically, we show that it is optimal to explore √(n/2) arms on average, resulting in the expected failure rate 1/√(2n) of the exploited arm, for the equivalent of n time steps, hence the regret of √(2n). [sent-50, score-0.925]
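A heuristic version of this accounting (our own back-of-the-envelope restatement of the sentence above, not a bound from the paper):

    \mathbb{E}(R_n) \;\approx\; \underbrace{\sqrt{n/2}}_{\text{explored arms, one failure each}}
      \;+\; \underbrace{n \cdot \tfrac{1}{\sqrt{2n}}}_{\text{exploitation failures}}
      \;=\; \sqrt{n/2} + \sqrt{n/2} \;=\; \sqrt{2n}.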
27 For unknown time horizons, anytime versions of the algorithms of Berry et al. [sent-51, score-0.398]
28 are proposed by Teytaud, Gelly and Sebag in [13] and proved to achieve a regret in O(√n). [sent-53, score-0.23]
29 We show that the anytime version of our algorithm achieves a regret arbitrarily close to 2√n, which we conjecture to be optimal. [sent-54, score-0.434]
30 Our results extend to any mean-reward distribution whose support contains 1, the regret depending on the characteristics of this distribution around 1. [sent-55, score-0.253]
31 This problem has been considered in the more general setting of bounded rewards by Wang, Audibert and Munos [15]. [sent-56, score-0.177]
32 When the time horizon is known, their algorithms consist in exploring a pre-defined set of K arms, which depends on the parameter β mentioned above, using variants of the UCB policy [10, 1]. [sent-57, score-0.392]
33 In the present case of Bernoulli rewards and mean-reward distributions whose support contains 1, the corresponding regret is in n^(β/(β+1)), up to logarithmic terms coming from the exploration of the K arms, as in usual finite-armed bandit algorithms [9]. [sent-58, score-0.594]
34 The nature of our algorithm is very different in that it is based on a stopping rule in the exploration phase that depends on the observed rewards. [sent-59, score-0.159]
35 This not only removes the logarithmic terms in the regret but also achieves the optimal constant. [sent-60, score-0.298]
36 2 Model We consider a stochastic multi-armed bandit with an infinite number of arms. [sent-61, score-0.141]
37 For each k = 1, 2, . . . , the rewards of arm k are Bernoulli with unknown parameter θk. [sent-65, score-0.841]
38 We refer to rewards 0 and 1 as a failure and a success, respectively, and to a run as a consecutive sequence of successes followed by a failure. [sent-66, score-0.367]
39 Specifically, it is assumed that for any policy, the mean rewards of the explored arms have a uniform distribution over [0, 1], independently of the number of explored arms. [sent-71, score-0.691]
40 For the 1-failure policy for instance, given that only one arm has been explored until time n, the mean reward of this arm has a beta distribution with parameters 1, n. [sent-73, score-1.622]
41 This lower bound is 4√(n/3) for a beta distribution with parameters 1/2, 1, see [11], while our algorithm achieves a regret arbitrarily close to 2√n in this case, since C(α, β) = 2 for α = 1/2 and β = 1, see the appendix. [sent-74, score-0.393]
42 At each time t = 1, 2, . . . , we select some arm It and receive the corresponding reward Xt, which is a Bernoulli random variable with parameter θIt. [sent-79, score-0.716]
43 The arm selection only depends on previous arm selections and rewards; formally, the random variable It is Ft−1-measurable, where Ft denotes the σ-field generated by the set {I1, X1, . . . , It, Xt}. [sent-84, score-1.256]
44 Let Kt be the number of arms selected until time t. [sent-88, score-0.42]
45 We also assume that It+1 = It whenever Xt = 1: if the selection of arm It gives a success at time t, the same arm is selected at time t + 1. [sent-101, score-1.416]
46 The objective is to maximize the cumulative reward or, equivalently, to minimize the regret defined by Rn = n − Σ_{t=1}^{n} Xt. [sent-102, score-0.318]
47 Specifically, we focus on the average regret E(Rn ), where expectation is taken over all random variables, including the sequence of mean rewards θ1 , θ2 , . [sent-103, score-0.433]
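To make the model and the regret above concrete, here is a minimal Python sketch of the environment; the class and function names are ours and only illustrate the setting (arms indexed lazily, mean rewards drawn uniformly on [0, 1] by default).

    import numpy as np

    class BernoulliBanditEnv:
        """Infinite-armed Bernoulli bandit: arm k has a mean reward theta_k drawn
        i.i.d. from the chosen distribution the first time that arm is pulled."""
        def __init__(self, seed=0, draw=None):
            self.rng = np.random.default_rng(seed)
            self.draw = draw or (lambda rng: rng.uniform())   # mean-reward distribution
            self.thetas = {}                                   # lazily drawn mean rewards

        def pull(self, k):
            if k not in self.thetas:
                self.thetas[k] = self.draw(self.rng)
            return int(self.rng.random() < self.thetas[k])

    def regret(rewards):
        """R_n = n - sum_t X_t for a 0/1 reward sequence of length n."""
        return len(rewards) - sum(rewards)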
48 3.1 Two-target algorithm. The two-target algorithm consists in exploring new arms until two successive targets ℓ1 and ℓ2 are reached, in which case the current arm is exploited until the time horizon n. [sent-109, score-1.441]
49 The first target aims at discarding “bad” arms while the second aims at selecting a “good” arm. [sent-110, score-0.401]
50 We prove in Proposition 1 below that, for large m, the target values³ ℓ1 = ∛(n/2) and ℓ2 = m√(n/2) provide a regret in √(2n). [sent-112, score-0.282]
51 Algorithm 1: Two-target algorithm with known time horizon n. [sent-113, score-0.331]
52 forall t = 1, . . . , n do
      Get reward X from arm I
      if not Exploit then
        if X = 1 then L ← L + 1
        else
          M ← M + 1
          if M = 1 then
            if L < ℓ1 then Explore
          else if M = m then
            if L < ℓ2 then Explore
            else Exploit ← true
    ³ The first target could actually be any function ℓ1 of the time horizon n such that … when n → +∞. [sent-117, score-1.322]
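The listing above, read this way, translates into the following Python sketch; it is a best-effort reading of the extracted pseudo-code with the reconstructed targets ℓ1 = (n/2)^(1/3) and ℓ2 = m√(n/2), not the authors' reference implementation, and it assumes an environment object with a pull(k) method such as the BernoulliBanditEnv sketched earlier.

    import numpy as np

    def two_target_known_horizon(env, n, m=10):
        """Two-target algorithm, known horizon n: move to a fresh arm whenever the
        current arm misses the first target (l1 successes before its 1st failure)
        or the second target (l2 successes before its m-th failure); otherwise
        exploit it until the horizon."""
        l1 = (n / 2) ** (1 / 3)
        l2 = m * np.sqrt(n / 2)
        rewards, arm, L, M, exploit = [], 0, 0, 0, False
        for t in range(n):
            x = env.pull(arm)
            rewards.append(x)
            if exploit:
                continue
            if x == 1:
                L += 1                                  # one more success on this arm
            else:
                M += 1                                  # one more failure on this arm
                if M == 1 and L < l1:
                    arm, L, M = arm + 1, 0, 0           # missed the first target: explore
                elif M == m:
                    if L < l2:
                        arm, L, M = arm + 1, 0, 0       # missed the second target: explore
                    else:
                        exploit = True                  # both targets met: exploit until n
        return rewards

With the environment sketch above, regret(two_target_known_horizon(BernoulliBanditEnv(), n)) / np.sqrt(2 * n) should hover around 1 for large n and moderately large m; this is only a sanity check of Proposition 1, not a reproduction of the paper's experiments.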
53 3.2 Regret analysis. Proposition 1 The two-target algorithm with targets ℓ1 = ∛(n/2) and ℓ2 = m√(n/2) satisfies, for all n ≥ m², E(Rn) ≤ m + …; in particular, lim sup_{n→+∞} E(Rn)/√(2n) → 1 as m → +∞. [sent-120, score-0.227]
54 Let U1 = 1 if arm 1 is used until time n and U1 = 0 otherwise. [sent-123, score-0.668]
55 Denote by M1 the total number of 0’s received from arm 1. [sent-124, score-0.628]
56 E(Rn) ≤ E(M1)/P(U1 = 1). (1) Let Nt be the number of 0's received from arm 1 until time t when this arm is played until time t. [sent-126, score-1.387]
57 Since P(N_ℓ1 = 0 | θ1 = u) = u^ℓ1, the probability that the first target is achieved by arm 1 is given by: P(N_ℓ1 = 0) = ∫₀¹ u^ℓ1 du = 1/(ℓ1 + 1). [sent-128, score-0.767]
58 Similarly, P(N_ℓ2 − N_ℓ1 < m | θ1 = u) = Σ_{j=0}^{m−1} C(ℓ2 − ℓ1, j) u^(ℓ2 − ℓ1 − j) (1 − u)^j, so that the probability that arm 1 is used until time n is given by: P(U1 = 1) = ∫₀¹ P(N_ℓ1 = 0 | θ1 = u) P(N_ℓ2 − N_ℓ1 < m | θ1 = u) du. We deduce: … [sent-129, score-0.668]
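A quick Monte-Carlo check of the first identity above (the chance that a uniformly drawn arm survives the first target), assuming the reconstruction P(N_ℓ1 = 0) = 1/(ℓ1 + 1) is the intended one:

    import numpy as np

    rng = np.random.default_rng(1)
    l1 = 9
    thetas = rng.uniform(size=1_000_000)
    # an arm survives the first target iff its first l1 pulls are all successes,
    # which happens with probability theta**l1 given its mean reward theta
    survive = rng.random(size=thetas.size) < thetas ** l1
    print(survive.mean(), 1 / (l1 + 1))   # both close to 0.1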
59 Theorem 1 For any algorithm with known time horizon n, lim inf_{n→+∞} E(Rn)/√n ≥ √2. [sent-145, score-0.471]
60 Assume an oracle reveals the parameter of each arm after the first failure of this arm. [sent-149, score-0.786]
61 With this information, the optimal policy explores a random number of arms, each until the first failure, then plays only one of these arms until time n. [sent-150, score-0.497]
62 Let µ be the parameter of the best known arm at time t. [sent-151, score-0.688]
63 Since the probability that any new arm is better than this arm is 1 − µ, the mean cost of exploration to find a better arm is 1/(1 − µ). [sent-152, score-2.008]
64 The corresponding mean reward has a uniform distribution over [µ, 1], so that the mean gain of exploitation is less than (n − t)(1 − µ)/2 (it is not equal to this quantity due to the time spent in exploration). [sent-153, score-0.239]
65 Thus if 1 − µ < √(2/(n − t)), it is preferable not to explore new arms and to play the best known arm, with mean reward µ, until time n. [sent-154, score-0.613]
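One way to recover this threshold from the two quantities above (our own arithmetic, matching the reconstructed inequality): exploring is worthwhile only if the expected gain exceeds the expected cost, i.e.

    (n - t)\,\frac{1-\mu}{2} \;>\; \frac{1}{1-\mu}
    \quad\Longleftrightarrow\quad
    (1-\mu)^2 \;>\; \frac{2}{n-t}
    \quad\Longleftrightarrow\quad
    1-\mu \;>\; \sqrt{\frac{2}{n-t}}.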
66 We denote by An the first arm whose parameter is larger than 1 − √(2/n); we have Kn ≤ An (the optimal policy cannot explore more than An arms) and E(An) = √(n/2). [sent-157, score-0.826]
67 The parameter θAn of arm An is uniformly distributed over [1 − √(2/n), 1], so that E(θAn) = 1 − 1/√(2n). [sent-158, score-0.628]
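Filling in one step of this computation (our own arithmetic, consistent with the reconstructed quantities): each fresh arm exceeds 1 − √(2/n) with probability √(2/n), so An is geometric and

    \mathbb{E}(A_n) = \frac{1}{\sqrt{2/n}} = \sqrt{n/2}, \qquad
    \theta_{A_n} \sim \mathrm{Unif}\big[1-\sqrt{2/n},\,1\big]
    \;\Rightarrow\;
    \mathbb{E}(\theta_{A_n}) = 1 - \tfrac{1}{2}\sqrt{2/n} = 1 - \tfrac{1}{\sqrt{2n}}.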
68 For each arm k, let L1(k) be the length of the first run of arm k. [sent-162, score-0.653]
69 In particular, lim_{n→+∞} (1/n) E(L1(1) + . . . + L1(An − 1)) = 0. [sent-167, score-0.112]
70 To conclude, we write: E(Rn) ≥ E(Kn) + E((n − L1(1) − . . . − L1(An − 1))(1 − θAn)). [sent-174, score-0.112]
71 On the event {L1(1) + . . . + L1(An − 1) ≤ n^(4/5)}, the number of explored arms satisfies Kn ≥ A′n, where A′n denotes the first arm whose parameter is larger than 1 − √(2/(n − n^(4/5))). [sent-181, score-1.031]
72 Since P(L1(1) + . . . + L1(A′n − 1) ≤ n^(4/5)) → 1 and E(A′n) = √((n − n^(4/5))/2), we deduce lim inf_{n→+∞} E(Kn)/√n ≥ 1/√2. [sent-185, score-0.14]
73 4 Unknown time horizon. 4.1 Anytime version of the algorithm. When the time horizon is unknown, the targets depend on the current time t, say ℓ1(t) and ℓ2(t). [sent-198, score-0.742]
74 Now any arm that is exploited may be eventually discarded, in the sense that a new arm is explored. [sent-199, score-1.308]
75 This happens whenever either L1 < ℓ1(t) or L2 < ℓ2(t), where L1 and L2 are the respective lengths of the first run and the first m runs of this arm. [sent-200, score-0.129]
76 Thus, unlike the previous version of the algorithm which consists in an exploration phase followed by an exploitation phase, the anytime version of the algorithm continuously switches between exploration and exploitation. [sent-201, score-0.353]
77 We prove in Proposition 2 below that, for large m, the target values ℓ1(t) = ∛t and ℓ2(t) = m√t given in the pseudo-code achieve an asymptotic regret in 2√n. [sent-202, score-0.31]
78 forall t = 1, 2, . . . do
      Get reward X from arm I
      ℓ1 ← ∛t, ℓ2 ← m√t
      if Exploit then
        if L1 < ℓ1 or L2 < ℓ2 then
          Explore
          Exploit ← false
      else if X = 1 then L ← L + 1
      else
        M ← M + 1
        if M = 1 then
          if L < ℓ1 then Explore
          else L1 ← L
        else if M = m then
          if L < ℓ2 then Explore
          else
            L2 ← L
            Exploit ← true [sent-207, score-1.151]
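For completeness, a Python sketch of this anytime variant, read off the reconstructed pseudo-code with targets ℓ1(t) = t^(1/3) and ℓ2(t) = m√t; again a best-effort reading rather than the authors' code, and it reuses the BernoulliBanditEnv-style environment sketched earlier.

    import numpy as np

    def two_target_anytime(env, n, m=10):
        """Anytime two-target algorithm: the targets grow with t, and an exploited
        arm is discarded as soon as its recorded run lengths fall below them."""
        rewards, arm = [], 0
        L = M = 0
        L1 = L2 = None          # lengths of the first run / first m runs of the current arm
        exploit = False
        for t in range(1, n + 1):
            x = env.pull(arm)
            rewards.append(x)
            l1, l2 = t ** (1 / 3), m * np.sqrt(t)
            if exploit:
                if L1 < l1 or L2 < l2:                       # current arm no longer good enough
                    arm, L, M, L1, L2, exploit = arm + 1, 0, 0, None, None, False
            elif x == 1:
                L += 1
            else:
                M += 1
                if M == 1:
                    if L < l1:
                        arm, L, M = arm + 1, 0, 0            # missed the first target
                    else:
                        L1 = L
                elif M == m:
                    if L < l2:
                        arm, L, M, L1 = arm + 1, 0, 0, None  # missed the second target
                    else:
                        L2 = L
                        exploit = True
        return rewards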
79 4.2 Regret analysis. Proposition 2 The two-target algorithm with time-dependent targets ℓ1(t) = ∛t and ℓ2(t) = m√t satisfies: lim sup_{n→+∞} E(Rn)/√n ≤ 2 + 1/m. [sent-208, score-0.227]
80 For all k = 1, 2, . . . , denote by L1(k) and L2(k) the respective lengths of the first run and of the first m runs of arm k when this arm is played continuously. [sent-213, score-1.408]
81 Since arm k cannot be selected before time k, the regret at time n satisfies: Rn ≤ Kn + m Σ_{k=1}^{Kn} 1{L1(k) > ℓ1(k)} + Σ_{t=1}^{n} (1 − Xt) 1{L2(It) > ℓ2(t)}. [sent-214, score-0.969]
82 First observe that, since the target functions ℓ1(t) and ℓ2(t) are non-decreasing, Kn is less than or equal to K̄n, the number of arms selected by a two-target policy with known time horizon n and fixed targets ℓ1(n) and ℓ2(n). [sent-215, score-0.924]
83 In this scheme, let U1 = 1 if arm 1 is used until time n and U1 = 0 otherwise. [sent-216, score-0.668]
84 Finally, E((1 − Xt) 1{L2(It) > ℓ2(t)}) ≤ E(1 − Xt | L2(It) > ℓ2(t)) ∼ ((m + 1)/m) · 1/(2√t) when t → +∞, so that lim sup_{n→+∞} (1/√n) Σ_{t=1}^{n} E((1 − Xt) 1{L2(It) > ℓ2(t)}) ≤ ((m + 1)/m) · lim_{n→+∞} (1/√n) Σ_{t=1}^{n} 1/(2√t) = (m + 1)/m. Combining the previous results yields: lim sup_{n→+∞} E(Rn)/√n ≤ 2 + 1/m. [sent-229, score-0.406]
85 To support this conjecture, consider an oracle that reveals the parameter of each arm after the first failure of this arm, as in the proof of Theorem 1. [sent-233, score-0.809]
86 With this information, an optimal policy exploits an arm whenever its parameter is larger than some increasing function θ̄t of time t. [sent-234, score-0.804]
87 Then proceeding as in the proof of Theorem 1, we get: lim inf_{n→+∞} E(Rn)/√n ≥ c + lim_{n→+∞} (1/n) Σ_{t=1}^{n} (1/(2c)) √(n/t) = c + (1/(2c)) ∫₀¹ du/√u = c + 1/c ≥ 2. [sent-236, score-0.339]
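The last step is the AM–GM inequality, valid for any c > 0:

    c + \frac{1}{c} \;\ge\; 2\sqrt{c \cdot \frac{1}{c}} \;=\; 2,

with equality at c = 1, so no choice of the exploitation threshold improves on the conjectured 2√n in this oracle argument.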
88 5 Numerical results. Figure 1 gives the expected failure rate E(Rn)/n with respect to the time horizon n, which is assumed to be known. [sent-237, score-0.448]
89 The mean rewards have (a) a uniform distribution or (b) a Beta(1,2) distribution, corresponding to the probability density function u → 2(1 − u). [sent-239, score-0.234]
90 The performance gains of the two-target algorithm turn out to be negligible for the uniform distribution but substantial for the Beta(1,2) distribution, where “good” arms are less frequent. [sent-245, score-0.38]
92 Figure 1: Expected failure rate E(Rn)/n with respect to the time horizon n; (a) uniform mean-reward distribution, (b) Beta(1,2) mean-reward distribution. [sent-257, score-0.99]
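A sketch of the kind of experiment behind Figure 1, reusing the helpers sketched earlier (BernoulliBanditEnv, two_target_known_horizon, regret); the horizon grid and the number of Monte-Carlo runs are arbitrary choices of ours, not the paper's.

    import numpy as np

    def expected_failure_rate(policy, make_env, n, runs=100):
        """Monte-Carlo estimate of E(R_n)/n for a given policy and environment."""
        regrets = [regret(policy(make_env(seed), n)) for seed in range(runs)]
        return np.mean(regrets) / n

    for n in (100, 1000, 10000):
        uniform = expected_failure_rate(
            two_target_known_horizon, lambda s: BernoulliBanditEnv(seed=s), n)
        beta12 = expected_failure_rate(
            two_target_known_horizon,
            lambda s: BernoulliBanditEnv(seed=s, draw=lambda rng: rng.beta(1, 2)), n)
        print(n, round(uniform, 3), round(beta12, 3))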
93 6 Conclusion The proposed algorithm uses two levels of sampling in the exploration phase: the first eliminates “bad” arms while the second selects “good” arms. [sent-258, score-0.447]
94 To our knowledge, this is the first algorithm that achieves the optimal regrets in √(2n) and 2√n for known and unknown time horizons, respectively. [sent-259, score-0.395]
95 Future work will be devoted to the proof of the lower bound in the case of an unknown time horizon. [sent-260, score-0.329]
96 Finally, we would like to compare the performance of our algorithm for finite-armed bandits with those of the best known algorithms like KL-UCB [10, 3] and Thompson sampling [14, 8] over short time horizons where the full exploration of the arms is generally not optimal. [sent-262, score-0.573]
97 A note on strategies for bandit problems with infinitely many arms. [sent-280, score-0.141]
98 A note on infinite-armed bernoulli bandit problems with generalized beta prior distributions. [sent-283, score-0.316]
99 Policies without memory for the infinite-armed bernoulli bandit under the average-reward criterion. [sent-286, score-0.239]
100 Some optimal strategies for bandit problems with beta prior distributions. [sent-303, score-0.245]
wordName wordTfidf (topN-words)
[('arm', 0.628), ('arms', 0.349), ('horizon', 0.271), ('regret', 0.23), ('kn', 0.228), ('rewards', 0.177), ('bandit', 0.141), ('failure', 0.137), ('lim', 0.112), ('rn', 0.108), ('bernoulli', 0.098), ('exploration', 0.098), ('anytime', 0.091), ('explore', 0.09), ('reward', 0.088), ('du', 0.087), ('else', 0.081), ('policy', 0.081), ('targets', 0.08), ('beta', 0.077), ('berry', 0.076), ('bandits', 0.066), ('deduce', 0.064), ('explored', 0.054), ('exploited', 0.052), ('target', 0.052), ('forall', 0.051), ('herschkorn', 0.051), ('prouti', 0.051), ('recommandation', 0.051), ('teytaud', 0.051), ('played', 0.051), ('xt', 0.05), ('conjecture', 0.049), ('swedish', 0.045), ('tze', 0.045), ('proposition', 0.044), ('exploit', 0.042), ('achieves', 0.041), ('time', 0.04), ('gelly', 0.039), ('mallows', 0.039), ('phase', 0.038), ('options', 0.038), ('herbert', 0.037), ('unknown', 0.036), ('audibert', 0.036), ('sup', 0.035), ('alexandre', 0.033), ('thompson', 0.032), ('nitely', 0.032), ('uniform', 0.031), ('selected', 0.031), ('ucb', 0.03), ('false', 0.03), ('olivier', 0.03), ('munos', 0.03), ('mi', 0.029), ('lengths', 0.029), ('whenever', 0.028), ('exploitation', 0.028), ('asymptotic', 0.028), ('inf', 0.028), ('successes', 0.028), ('council', 0.028), ('optimal', 0.027), ('paris', 0.027), ('situations', 0.026), ('mean', 0.026), ('asymptotically', 0.025), ('respective', 0.025), ('kt', 0.025), ('run', 0.025), ('discarded', 0.024), ('tends', 0.023), ('arbitrarily', 0.023), ('content', 0.023), ('annals', 0.023), ('support', 0.023), ('leung', 0.023), ('metrika', 0.023), ('liated', 0.023), ('stockholm', 0.023), ('paristech', 0.023), ('telecom', 0.023), ('stopping', 0.023), ('exceeds', 0.022), ('ft', 0.022), ('runs', 0.022), ('bound', 0.022), ('success', 0.021), ('successive', 0.021), ('announced', 0.021), ('networking', 0.021), ('sebag', 0.021), ('emilie', 0.021), ('mich', 0.021), ('fortiori', 0.021), ('reveals', 0.021), ('known', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 338 nips-2013-Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards
Author: Thomas Bonald, Alexandre Proutiere
Abstract: We consider an infinite-armed bandit problem with Bernoulli rewards. The mean rewards are independent, uniformly distributed over [0, 1]. Rewards 0 and 1 are referred to as a success and a failure, respectively. We propose a novel algorithm where the decision to exploit any arm is based on two successive targets, namely, the total number of successes until the first failure and until the first m failures, respectively, where m is a fixed parameter. This two-target algorithm achieves a long-term average regret in √(2n) for a large parameter m and a known time horizon n. This regret is optimal and strictly less than the regret achieved by the best known algorithms, which is in 2√n. The results are extended to any mean-reward distribution whose support contains 1 and to unknown time horizons. Numerical experiments show the performance of the algorithm for finite time horizons.
2 0.50763005 95 nips-2013-Distributed Exploration in Multi-Armed Bandits
Author: Eshcar Hillel, Zohar S. Karnin, Tomer Koren, Ronny Lempel, Oren Somekh
Abstract: We study exploration in Multi-Armed Bandits in a setting where k players collaborate in order to identify an ε-optimal arm. Our motivation comes from recent employment of bandit algorithms in computationally intensive, large-scale applications. Our results demonstrate a non-trivial tradeoff between the number of arm pulls required by each of the players, and the amount of communication between them. In particular, our main result shows that by allowing the k players to communicate only once, they are able to learn √k times faster than a single player. That is, distributing learning to k players gives rise to a factor √k parallel speedup. We complement this result with a lower bound showing this is in general the best possible. On the other extreme, we present an algorithm that achieves the ideal factor k speed-up in learning performance, with communication only logarithmic in 1/ε.
3 0.3409335 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling
Author: Sebastien Bubeck, Che-Yu Liu
Abstract: We consider the stochastic multi-armed bandit problem with a prior distribution on the reward distributions. We are interested in studying prior-free and prior-dependent regret bounds, very much in the same spirit as the usual distribution-free and distribution-dependent bounds for the non-Bayesian stochastic bandit. We first show that Thompson Sampling attains an optimal prior-free bound in the sense that for any prior distribution its Bayesian regret is bounded from above by 14√(nK). This result is unimprovable in the sense that there exists a prior distribution such that any algorithm has a Bayesian regret bounded from below by (1/20)√(nK). We also study the case of priors for the setting of Bubeck et al. [2013] (where the optimal mean is known as well as a lower bound on the smallest gap) and we show that in this case the regret of Thompson Sampling is in fact uniformly bounded over time, thus showing that Thompson Sampling can greatly take advantage of the nice properties of these priors.
4 0.33885998 292 nips-2013-Sequential Transfer in Multi-armed Bandit with Finite Set of Models
Author: Mohammad Gheshlaghi azar, Alessandro Lazaric, Emma Brunskill
Abstract: Learning from prior tasks and transferring that experience to improve future performance is critical for building lifelong learning agents. Although results in supervised and reinforcement learning show that transfer may significantly improve the learning performance, most of the literature on transfer is focused on batch learning tasks. In this paper we study the problem of sequential transfer in online learning, notably in the multi–armed bandit framework, where the objective is to minimize the total regret over a sequence of tasks by transferring knowledge from prior tasks. We introduce a novel bandit algorithm based on a method-of-moments approach for estimating the possible tasks and derive regret bounds for it. 1
5 0.32851326 330 nips-2013-Thompson Sampling for 1-Dimensional Exponential Family Bandits
Author: Nathaniel Korda, Emilie Kaufmann, Remi Munos
Abstract: Thompson Sampling has been demonstrated in many complex bandit models, however the theoretical guarantees available for the parametric multi-armed bandit are still limited to the Bernoulli case. Here we extend them by proving asymptotic optimality of the algorithm using the Jeffreys prior for 1-dimensional exponential family bandits. Our proof builds on previous work, but also makes extensive use of closed forms for Kullback-Leibler divergence and Fisher information (through the Jeffreys prior) available in an exponential family. This allow us to give a finite time exponential concentration inequality for posterior distributions on exponential families that may be of interest in its own right. Moreover our analysis covers some distributions for which no optimistic algorithm has yet been proposed, including heavy-tailed exponential families. 1
6 0.26005393 124 nips-2013-Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting
7 0.16618331 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration
8 0.1379803 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries
9 0.13739033 137 nips-2013-High-Dimensional Gaussian Process Bandits
10 0.1351309 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
11 0.12528142 257 nips-2013-Projected Natural Actor-Critic
12 0.11865278 230 nips-2013-Online Learning with Costly Features and Labels
13 0.1168965 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting
14 0.11646269 325 nips-2013-The Pareto Regret Frontier
15 0.11463946 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings
16 0.10986156 7 nips-2013-A Gang of Bandits
17 0.10938449 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
18 0.10889183 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems
19 0.10319264 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling
20 0.10111122 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search
topicId topicWeight
[(0, 0.19), (1, -0.28), (2, 0.135), (3, -0.129), (4, -0.01), (5, -0.138), (6, 0.222), (7, -0.189), (8, -0.133), (9, -0.107), (10, 0.037), (11, 0.289), (12, -0.187), (13, -0.07), (14, -0.028), (15, -0.042), (16, -0.132), (17, 0.104), (18, -0.057), (19, 0.186), (20, 0.215), (21, -0.059), (22, 0.137), (23, -0.009), (24, -0.004), (25, -0.047), (26, 0.085), (27, 0.097), (28, -0.036), (29, -0.061), (30, 0.004), (31, 0.073), (32, 0.063), (33, 0.007), (34, 0.02), (35, 0.043), (36, -0.018), (37, 0.028), (38, -0.032), (39, 0.011), (40, -0.086), (41, 0.067), (42, -0.011), (43, -0.04), (44, -0.031), (45, -0.042), (46, 0.024), (47, -0.031), (48, -0.04), (49, 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.97076076 338 nips-2013-Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards
Author: Thomas Bonald, Alexandre Proutiere
Abstract: We consider an infinite-armed bandit problem with Bernoulli rewards. The mean rewards are independent, uniformly distributed over [0, 1]. Rewards 0 and 1 are referred to as a success and a failure, respectively. We propose a novel algorithm where the decision to exploit any arm is based on two successive targets, namely, the total number of successes until the first failure and until the first m failures, respectively, where m is a fixed parameter. This two-target algorithm achieves a √ long-term average regret in 2n for a large parameter m and a known time horizon n. This regret is optimal and strictly less than the regret achieved by the best √ known algorithms, which is in 2 n. The results are extended to any mean-reward distribution whose support contains 1 and to unknown time horizons. Numerical experiments show the performance of the algorithm for finite time horizons. 1
2 0.88823056 95 nips-2013-Distributed Exploration in Multi-Armed Bandits
Author: Eshcar Hillel, Zohar S. Karnin, Tomer Koren, Ronny Lempel, Oren Somekh
Abstract: We study exploration in Multi-Armed Bandits in a setting where k players collaborate in order to identify an ε-optimal arm. Our motivation comes from recent employment of bandit algorithms in computationally intensive, large-scale applications. Our results demonstrate a non-trivial tradeoff between the number of arm pulls required by each of the players, and the amount of communication between them. In particular, our main result shows that by allowing the k players to communicate only once, they are able to learn √k times faster than a single player. That is, distributing learning to k players gives rise to a factor √k parallel speedup. We complement this result with a lower bound showing this is in general the best possible. On the other extreme, we present an algorithm that achieves the ideal factor k speed-up in learning performance, with communication only logarithmic in 1/ε.
3 0.83308959 292 nips-2013-Sequential Transfer in Multi-armed Bandit with Finite Set of Models
Author: Mohammad Gheshlaghi azar, Alessandro Lazaric, Emma Brunskill
Abstract: Learning from prior tasks and transferring that experience to improve future performance is critical for building lifelong learning agents. Although results in supervised and reinforcement learning show that transfer may significantly improve the learning performance, most of the literature on transfer is focused on batch learning tasks. In this paper we study the problem of sequential transfer in online learning, notably in the multi–armed bandit framework, where the objective is to minimize the total regret over a sequence of tasks by transferring knowledge from prior tasks. We introduce a novel bandit algorithm based on a method-of-moments approach for estimating the possible tasks and derive regret bounds for it. 1
4 0.7658127 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling
Author: Sebastien Bubeck, Che-Yu Liu
Abstract: We consider the stochastic multi-armed bandit problem with a prior distribution on the reward distributions. We are interested in studying prior-free and prior-dependent regret bounds, very much in the same spirit as the usual distribution-free and distribution-dependent bounds for the non-Bayesian stochastic bandit. We first show that Thompson Sampling attains an optimal prior-free bound in the sense that for any prior distribution its Bayesian regret is bounded from above by 14√(nK). This result is unimprovable in the sense that there exists a prior distribution such that any algorithm has a Bayesian regret bounded from below by (1/20)√(nK). We also study the case of priors for the setting of Bubeck et al. [2013] (where the optimal mean is known as well as a lower bound on the smallest gap) and we show that in this case the regret of Thompson Sampling is in fact uniformly bounded over time, thus showing that Thompson Sampling can greatly take advantage of the nice properties of these priors.
5 0.725236 330 nips-2013-Thompson Sampling for 1-Dimensional Exponential Family Bandits
Author: Nathaniel Korda, Emilie Kaufmann, Remi Munos
Abstract: Thompson Sampling has been demonstrated in many complex bandit models, however the theoretical guarantees available for the parametric multi-armed bandit are still limited to the Bernoulli case. Here we extend them by proving asymptotic optimality of the algorithm using the Jeffreys prior for 1-dimensional exponential family bandits. Our proof builds on previous work, but also makes extensive use of closed forms for Kullback-Leibler divergence and Fisher information (through the Jeffreys prior) available in an exponential family. This allow us to give a finite time exponential concentration inequality for posterior distributions on exponential families that may be of interest in its own right. Moreover our analysis covers some distributions for which no optimistic algorithm has yet been proposed, including heavy-tailed exponential families. 1
6 0.59596515 124 nips-2013-Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting
7 0.41258052 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration
8 0.38683474 112 nips-2013-Estimation Bias in Multi-Armed Bandit Algorithms for Search Advertising
9 0.38229433 270 nips-2013-Regret based Robust Solutions for Uncertain Markov Decision Processes
10 0.35829574 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries
11 0.35062602 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling
12 0.34051159 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting
13 0.32891861 7 nips-2013-A Gang of Bandits
14 0.32366967 137 nips-2013-High-Dimensional Gaussian Process Bandits
15 0.30201223 159 nips-2013-Learning Prices for Repeated Auctions with Strategic Buyers
16 0.30099729 257 nips-2013-Projected Natural Actor-Critic
17 0.29351434 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
18 0.26001474 162 nips-2013-Learning Trajectory Preferences for Manipulators via Iterative Improvement
19 0.25923881 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search
20 0.25082317 223 nips-2013-On the Relationship Between Binary Classification, Bipartite Ranking, and Binary Class Probability Estimation
topicId topicWeight
[(2, 0.02), (16, 0.025), (33, 0.275), (34, 0.085), (41, 0.033), (42, 0.11), (49, 0.019), (56, 0.202), (70, 0.023), (85, 0.047), (89, 0.039), (93, 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.96530652 338 nips-2013-Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards
Author: Thomas Bonald, Alexandre Proutiere
Abstract: We consider an infinite-armed bandit problem with Bernoulli rewards. The mean rewards are independent, uniformly distributed over [0, 1]. Rewards 0 and 1 are referred to as a success and a failure, respectively. We propose a novel algorithm where the decision to exploit any arm is based on two successive targets, namely, the total number of successes until the first failure and until the first m failures, respectively, where m is a fixed parameter. This two-target algorithm achieves a long-term average regret in √(2n) for a large parameter m and a known time horizon n. This regret is optimal and strictly less than the regret achieved by the best known algorithms, which is in 2√n. The results are extended to any mean-reward distribution whose support contains 1 and to unknown time horizons. Numerical experiments show the performance of the algorithm for finite time horizons.
2 0.9466942 179 nips-2013-Low-Rank Matrix and Tensor Completion via Adaptive Sampling
Author: Akshay Krishnamurthy, Aarti Singh
Abstract: We study low rank matrix and tensor completion and propose novel algorithms that employ adaptive sampling schemes to obtain strong performance guarantees. Our algorithms exploit adaptivity to identify entries that are highly informative for learning the column space of the matrix (tensor) and consequently, our results hold even when the row space is highly coherent, in contrast with previous analyses. In the absence of noise, we show that one can exactly recover an n × n matrix of rank r from merely Ω(nr^(3/2) log(r)) matrix entries. We also show that one can recover an order T tensor using Ω(nr^(T−1/2) T^2 log(r)) entries. For noisy recovery, our algorithm consistently estimates a low rank matrix corrupted with noise using Ω(nr^(3/2) polylog(n)) entries. We complement our study with simulations that verify our theory and demonstrate the scalability of our algorithms.
3 0.94468445 113 nips-2013-Exact and Stable Recovery of Pairwise Interaction Tensors
Author: Shouyuan Chen, Michael R. Lyu, Irwin King, Zenglin Xu
Abstract: Tensor completion from incomplete observations is a problem of significant practical interest. However, it is unlikely that there exists an efficient algorithm with provable guarantee to recover a general tensor from a limited number of observations. In this paper, we study the recovery algorithm for pairwise interaction tensors, which has recently gained considerable attention for modeling multiple attribute data due to its simplicity and effectiveness. Specifically, in the absence of noise, we show that one can exactly recover a pairwise interaction tensor by solving a constrained convex program which minimizes the weighted sum of nuclear norms of matrices from O(nr log2 (n)) observations. For the noisy cases, we also prove error bounds for a constrained convex program for recovering the tensors. Our experiments on the synthetic dataset demonstrate that the recovery performance of our algorithm agrees well with the theory. In addition, we apply our algorithm on a temporal collaborative filtering task and obtain state-of-the-art results. 1
Author: Yuchen Zhang, John Duchi, Michael Jordan, Martin J. Wainwright
Abstract: We establish lower bounds on minimax risks for distributed statistical estimation under a communication budget. Such lower bounds reveal the minimum amount of communication required by any procedure to achieve the centralized minimax-optimal rates for statistical estimation. We study two classes of protocols: one in which machines send messages independently, and a second allowing for interactive communication. We establish lower bounds for several problems, including various types of location models, as well as for parameter estimation in regression models. 1
5 0.94279104 311 nips-2013-Stochastic Convex Optimization with Multiple Objectives
Author: Mehrdad Mahdavi, Tianbao Yang, Rong Jin
Abstract: In this paper, we are interested in the development of efficient algorithms for convex optimization problems in the simultaneous presence of multiple objectives and stochasticity in the first-order information. We cast the stochastic multiple objective optimization problem into a constrained optimization problem by choosing one function as the objective and try to bound other objectives by appropriate thresholds. We first examine a two-stage exploration-exploitation based algorithm which first approximates the stochastic objectives by sampling and then solves a constrained stochastic optimization problem by the projected gradient method. This method attains a suboptimal convergence rate even under strong assumptions on the objectives. Our second approach is an efficient primal-dual stochastic algorithm. It leverages the theory of the Lagrangian method in constrained optimization and attains the optimal convergence rate of O(1/√T) in high probability for general Lipschitz continuous objectives.
6 0.94197911 307 nips-2013-Speedup Matrix Completion with Side Information: Application to Multi-Label Learning
7 0.94138491 330 nips-2013-Thompson Sampling for 1-Dimensional Exponential Family Bandits
8 0.93921661 127 nips-2013-Generalized Denoising Auto-Encoders as Generative Models
10 0.9352119 247 nips-2013-Phase Retrieval using Alternating Minimization
11 0.93379295 216 nips-2013-On Flat versus Hierarchical Classification in Large-Scale Taxonomies
12 0.93370283 233 nips-2013-Online Robust PCA via Stochastic Optimization
13 0.93317443 344 nips-2013-Using multiple samples to learn mixture models
14 0.9318893 110 nips-2013-Estimating the Unseen: Improved Estimators for Entropy and other Properties
15 0.93171448 111 nips-2013-Estimation, Optimization, and Parallelism when Data is Sparse
16 0.93143415 95 nips-2013-Distributed Exploration in Multi-Armed Bandits
17 0.93143219 177 nips-2013-Local Privacy and Minimax Bounds: Sharp Rates for Probability Estimation
18 0.93095672 74 nips-2013-Convex Tensor Decomposition via Structured Schatten Norm Regularization
19 0.93034089 137 nips-2013-High-Dimensional Gaussian Process Bandits
20 0.92958182 242 nips-2013-PAC-Bayes-Empirical-Bernstein Inequality