nips nips2013 nips2013-330 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Nathaniel Korda, Emilie Kaufmann, Remi Munos
Abstract: Thompson Sampling has been demonstrated in many complex bandit models; however, the theoretical guarantees available for the parametric multi-armed bandit are still limited to the Bernoulli case. Here we extend them by proving asymptotic optimality of the algorithm using the Jeffreys prior for 1-dimensional exponential family bandits. Our proof builds on previous work, but also makes extensive use of closed forms for Kullback-Leibler divergence and Fisher information (through the Jeffreys prior) available in an exponential family. This allows us to give a finite time exponential concentration inequality for posterior distributions on exponential families that may be of interest in its own right. Moreover our analysis covers some distributions for which no optimistic algorithm has yet been proposed, including heavy-tailed exponential families.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Thompson Sampling has been demonstrated in many complex bandit models; however, the theoretical guarantees available for the parametric multi-armed bandit are still limited to the Bernoulli case. [sent-6, score-0.35]
2 Here we extend them by proving asymptotic optimality of the algorithm using the Jeffreys prior for 1-dimensional exponential family bandits. [sent-7, score-0.387]
3 Our proof builds on previous work, but also makes extensive use of closed forms for Kullback-Leibler divergence and Fisher information (through the Jeffreys prior) available in an exponential family. [sent-8, score-0.222]
4 This allows us to give a finite time exponential concentration inequality for posterior distributions on exponential families that may be of interest in its own right. [sent-9, score-0.674]
5 Moreover our analysis covers some distributions for which no optimistic algorithm has yet been proposed, including heavy-tailed exponential families. [sent-10, score-0.216]
6 1 Introduction K-armed bandit problems provide an elementary model for exploration-exploitation tradeoffs found at the heart of many online learning problems. [sent-11, score-0.156]
7 In such problems, an agent is presented with K distributions (also called arms, or actions) {pa}, a = 1, . . . , K, from which she draws samples interpreted as rewards she wants to maximize. [sent-12, score-0.188]
8 This objective induces a trade-off between choosing to sample a distribution that has already yielded high rewards, and choosing to sample a relatively unexplored distribution at the risk of losing rewards in the short term. [sent-13, score-0.194]
9 Here we make the assumption that the distributions, pa , belong to a parametric family of distributions P = {p(· | θ), θ ∈ Θ} where Θ ⊂ R. [sent-14, score-0.251]
10 The bandit model is described by a parameter θ0 = (θ1, . . . , θK). [sent-15, score-0.156]
11 We introduce the mean function µ(θ) = EX∼p(·|θ) [X], and the optimal arm θ∗ = θa∗ where a∗ = argmaxa µ(θa ). [sent-19, score-0.33]
12 An algorithm, A, for a K-armed bandit problem is a (possibly randomised) method for choosing which arm at to sample from at time t, given a history of previous arm choices and obtained rewards, Ht−1 := ((as, xs))s=1,...,t−1: each reward xs is drawn from the distribution pas. [sent-20, score-0.842]
13 This quantity (the regret R(A, T)) measures the expected performance of algorithm A compared to the expected performance of an optimal algorithm given knowledge of the reward distributions, i.e. one that always plays the arm with the highest mean. [sent-22, score-0.073]
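The displayed definition of the regret is lost in the extraction; for reference, the standard definition for this setting, written in the notation R(A, T) used below (and assumed here to coincide with the paper's own), is:

```latex
R(\mathcal{A}, T) \;=\; \mathbb{E}\!\left[\sum_{t=1}^{T}\bigl(\mu(\theta_{a^{*}}) - \mu(\theta_{a_t})\bigr)\right],
```

i.e. the expected cumulative gap between the mean of the optimal arm and the mean of the arm actually played.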
14 Since the early 2000s the “optimism in the face of uncertainty” heuristic has been a popular approach to this problem, providing both simplicity of implementation and finite-time upper bounds on the regret. [sent-25, score-0.155]
15 While the Thompson Sampling heuristic was first put forward to solve bandit problems eighty years ago in [15], it was not until recently that theoretical analyses of its performance were achieved [1, 2, 11, 13]. [sent-29, score-0.204]
16 In this paper we take a major step towards generalising these analyses to the same level of generality already achieved for “optimistic” algorithms. [sent-30, score-0.134]
17 Thompson Sampling Unlike optimistic algorithms, which are often based on confidence intervals, the Thompson Sampling algorithm, denoted by Aπ0, uses Bayesian tools and puts a prior distribution πa,0 = π0 on each parameter θa. [sent-31, score-0.18]
18 A posterior distribution, πa,t , is then maintained according to the rewards observed in Ht−1 . [sent-32, score-0.237]
19 At each time a sample θa,t is drawn from each posterior πa,t and then the algorithm chooses to sample at = argmaxa∈{1,...,K} µ(θa,t). [sent-33, score-0.127]
20 Note that actions are sampled according to their posterior probabilities of being optimal. [sent-37, score-0.152]
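As a minimal illustration of this sampling rule, the sketch below implements one round of Thompson Sampling in Python; the posterior objects and the mean function are placeholders standing in for the per-arm posteriors πa,t and the mean function µ, not code from the paper.

```python
import numpy as np

def thompson_step(posteriors, mean, rng):
    """One round of Thompson Sampling.

    posteriors: list of objects with a .sample(rng) method, each returning one
                draw theta_a from the current posterior pi_{a,t} of arm a.
    mean:       the mean function mu(theta) of the reward distribution family.
    Returns the index of the arm to play this round.
    """
    # Draw one parameter per arm from its posterior ...
    thetas = [post.sample(rng) for post in posteriors]
    # ... and play the arm whose sampled parameter has the largest mean, so each
    # arm is chosen with its posterior probability of being optimal.
    return int(np.argmax([mean(theta) for theta in thetas]))
```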
21 As explained in [1, 2], Thompson Sampling with uniform prior for Bernoulli rewards can be slightly adapted to deal with bounded rewards. [sent-40, score-0.242]
22 However, there is no notion of asymptotic optimality for this non-parametric family of rewards. [sent-41, score-0.135]
23 In this paper, we extend the optimality property that holds for Bernoulli distributions to more general families of parametric rewards, namely 1-dimensional exponential families if the algorithm uses the Jeffreys prior: Theorem 1. [sent-42, score-0.408]
24 Suppose that the reward distributions belong to a 1-dimensional canonical exponential family and let πJ denote the associated Jeffreys prior. [sent-43, score-0.384]
25 Then, limT→∞ R(AπJ, T)/ln T = Σa≠a∗ (µ(θa∗) − µ(θa))/K(θa, θa∗), (1) where K(θ, θ′) := KL(pθ, pθ′) is the Kullback-Leibler divergence between pθ and pθ′. [sent-44, score-0.131]
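To make the right-hand side of (1) concrete, the sketch below evaluates the asymptotic constant Σa≠a∗ (µ(θa∗) − µ(θa))/K(θa, θa∗) for a Bernoulli bandit, where the KL divergence has a simple closed form. The arm means are illustrative values, not taken from the paper.

```python
import numpy as np

def bernoulli_kl(p, q, eps=1e-12):
    """KL(Ber(p) || Ber(q)), i.e. K(theta_p, theta_q) for Bernoulli arms."""
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def asymptotic_constant(means):
    """Leading constant of (1): sum over suboptimal arms of gap / KL-to-optimal."""
    means = np.asarray(means, dtype=float)
    best = means.max()
    return float(sum((best - m) / bernoulli_kl(m, best) for m in means if m < best))

# Regret of TS with the Jeffreys prior grows like C * ln(T) with this constant C.
print(asymptotic_constant([0.9, 0.8, 0.5]))
```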
26 In the proof of this result we provide in Theorem 4 a finite-time, exponential concentration bound for posterior distributions of exponential family random variables, something that to the best of our knowledge is new to the literature and of interest in its own right. [sent-46, score-0.743]
27 Our proof also exploits the connection between the Jeffreys prior, Fisher information and the Kullback-Leibler divergence in exponential families. [sent-47, score-0.222]
28 [2] establishes that R(AπU, T) = O(√(KT ln T)) for Thompson Sampling for bounded rewards (with the classic uniform prior πU on the underlying Bernoulli parameter). [sent-49, score-0.242]
29 [14] go beyond the Bernoulli model, and give an upper bound on the Bayes risk (i.e. [sent-50, score-0.101]
30 the regret averaged over the prior) independent of the prior distribution. [sent-52, score-0.263]
31 For the parametric multi-armed bandit with K arms described above, their result states that the regret of Thompson Sampling using a prior π0 is not too big when averaged over this same prior: Eθ∼π0⊗K [R(Aπ0, T)(θ)] ≤ 4 + K + 4√(KT log(T)). [sent-53, score-0.515]
32 Building on the same ideas, [6] have improved this upper bound to 14√(KT). [sent-54, score-0.079]
33 In our paper, we rather see the prior used by Thompson Sampling as a tool, and we therefore want to derive regret bounds for any given problem parametrized by θ that depend on this parameter. [sent-55, score-0.286]
34 [14] also use Thompson Sampling in more general models, like the linear bandit model. [sent-56, score-0.156]
35 Their result is a bound on the Bayes risk that does not depend on the prior, whereas [3] gives a first bound on the regret in this model. [sent-57, score-0.263]
36 Linear bandits consider a possibly infinite number of arms whose mean rewards are linearly related by a single, unknown coefficient vector. [sent-58, score-0.245]
37 Once again, the analysis in [3] encounters the problem of describing the concentration of posterior distributions. [sent-59, score-0.267]
38 However by using a conjugate normal prior, they can employ explicit concentration bounds available for Normal distributions to complete their argument. [sent-60, score-0.188]
39 Paper Structure In Section 2 we describe important features of the one-dimensional canonical exponential families we consider, including closed-form expressions for KL-divergences and the Jeffreys prior. [sent-61, score-0.259]
40 Section 3 gives statements of the main results, and provides the proof of the regret bound. [sent-62, score-0.187]
41 Section 4 proves the posterior concentration result used in the proof of the regret bound. [sent-63, score-0.479]
42 2 Exponential Families and the Jeffreys Prior A distribution is said to belong to a one-dimensional canonical exponential family if it has a density with respect to some reference measure ν of the form: p(x | θ) = A(x) exp(T (x)θ − F (θ)), (2) where θ ∈ Θ ⊂ R. [sent-64, score-0.289]
43 T and A are some fixed functions that characterize the exponential family and F(θ) = log ∫ A(x) exp[T(x)θ] dν(x). [sent-65, score-0.197]
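A minimal way to encode such a family in code is to store T, A and F directly. The sketch below does this for the Bernoulli distribution in its natural parameterisation (θ = log(p/(1−p)), T(x) = x, A(x) = 1, F(θ) = log(1 + e^θ)); the class and its names are illustrative choices, not from the paper.

```python
import numpy as np
from dataclasses import dataclass
from typing import Callable

@dataclass
class ExpFamily:
    """1-d canonical exponential family: p(x | theta) = A(x) exp(T(x) theta - F(theta))."""
    T: Callable[[float], float]   # sufficient statistic
    A: Callable[[float], float]   # base-measure factor
    F: Callable[[float], float]   # log-normaliser

    def logpdf(self, x, theta):
        return np.log(self.A(x)) + self.T(x) * theta - self.F(theta)

# Bernoulli written in its natural parameterisation theta = log(p / (1 - p)).
bernoulli = ExpFamily(T=lambda x: x, A=lambda x: 1.0, F=lambda t: np.log1p(np.exp(t)))

p = 0.3
theta = np.log(p / (1 - p))
print(np.exp(bernoulli.logpdf(1, theta)), np.exp(bernoulli.logpdf(0, theta)))  # ~0.3, ~0.7
```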
44 KL-divergence in Exponential Families In an exponential family, a direct computation shows that the Kullback-Leibler divergence can be expressed as a Bregman divergence of the normalisation function F: K(θ, θ′) = DF(θ′, θ) := F(θ′) − [F(θ) + F′(θ)(θ′ − θ)]. (3) [sent-71, score-0.25]
45 Jeffreys prior in Exponential Families In the Bayesian literature, a special “non-informative” prior, introduced by Jeffreys in [10], is sometimes considered. [sent-72, score-0.132]
46 In the special case of the canonical exponential family, the Fisher information takes the form I(θ) = F″(θ), hence the Jeffreys prior for the model (2) is πJ(θ) ∝ √(F″(θ)). [sent-74, score-0.302]
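These closed forms are easy to check numerically: the sketch below evaluates K(θ, θ′) through the Bregman form (3) and the Jeffreys density through √(F″(θ)) for the Bernoulli family, using finite differences for the derivatives. All names are illustrative; this is a sanity check, not code from the paper.

```python
import numpy as np

def F(theta):                 # Bernoulli log-normaliser, F(theta) = log(1 + e^theta)
    return np.log1p(np.exp(theta))

def dF(theta, h=1e-5):        # F'(theta) = mu(theta), by central differences
    return (F(theta + h) - F(theta - h)) / (2 * h)

def d2F(theta, h=1e-4):       # F''(theta) = Fisher information I(theta)
    return (F(theta + h) - 2 * F(theta) + F(theta - h)) / h ** 2

def kl_bregman(theta, theta2):
    """K(theta, theta2) = F(theta2) - [F(theta) + F'(theta) (theta2 - theta)], as in (3)."""
    return F(theta2) - F(theta) - dF(theta) * (theta2 - theta)

def jeffreys_unnormalised(theta):
    """Jeffreys density on the natural parameter, up to normalisation: sqrt(F''(theta))."""
    return np.sqrt(d2F(theta))

# Sanity check against the closed-form Bernoulli KL divergence.
p, q = 0.3, 0.6
t_p, t_q = np.log(p / (1 - p)), np.log(q / (1 - q))
closed_form = p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))
print(kl_bregman(t_p, t_q), closed_form)                  # the two values agree
print(jeffreys_unnormalised(t_p), np.sqrt(p * (1 - p)))   # F''(theta) = p(1-p) here
```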
47 Under the Jeffreys prior, the posterior on θ after n observations y1, . . . , yn is given by [sent-75, score-0.149]
48 p(θ | y1, . . . , yn) ∝ √(F″(θ)) exp(θ Σi=1..n T(yi) − nF(θ)). (4) When ∫Θ √(F″(θ)) dθ < +∞, the prior is called proper. [sent-78, score-0.132]
49 However, statisticians often use priors which are not proper: the prior is called improper if ∫Θ √(F″(θ)) dθ = +∞ and any observation makes the corresponding posterior (4) integrable. [sent-79, score-0.259]
50 Some Intuition for choosing the Jeffreys Prior In the proof of our concentration result for posterior distributions (Theorem 4) it will be crucial to lower bound the prior probability of an ε-sized KL-divergence ball around each of the parameters θa. [sent-80, score-0.589]
51 Since the Fisher information F″(θ) = limθ′→θ 2K(θ, θ′)/|θ − θ′|², choosing a prior proportional to √(F″(θ)) ensures that the prior measure of such balls is Ω(√ε). [sent-81, score-0.325]
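A quick numerical check of this intuition for the Bernoulli family: the normalised Jeffreys mass of the KL ball {θ′ : K(θ, θ′) ≤ ε} scales like √ε as ε shrinks. The grid-based sketch below is purely illustrative; the grid range and resolution are arbitrary choices, not from the paper.

```python
import numpy as np

def F(theta):
    return np.log1p(np.exp(theta))

theta0 = 0.5
grid = np.linspace(-10.0, 10.0, 200001)
dtheta = grid[1] - grid[0]
jeffreys = np.sqrt(np.exp(grid) / (1.0 + np.exp(grid)) ** 2)   # sqrt(F''(theta))
Z = jeffreys.sum() * dtheta                                    # proper for Bernoulli

mu0 = np.exp(theta0) / (1.0 + np.exp(theta0))                  # F'(theta0)
kl_ball = F(grid) - F(theta0) - mu0 * (grid - theta0)          # K(theta0, theta')

for eps in (1e-2, 1e-4, 1e-6):
    mass = jeffreys[kl_ball <= eps].sum() * dtheta / Z
    print(eps, mass / np.sqrt(eps))   # roughly constant: the mass is Theta(sqrt(eps))
```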
52 Examples and Pseudocode Algorithm 1 presents pseudocode for Thompson Sampling with the Jeffreys prior for distributions parametrized by their natural parameter θ. [sent-82, score-0.231]
53 But as the Jeffreys prior is invariant under reparametrization, if a distribution is parametrised by some other parameter λ, the algorithm can use the Jeffreys prior ∝ √(I(λ)) on λ, drawing samples from the posterior on λ. [sent-83, score-0.414]
54 Note that the posterior sampling step (in bold) is always tractable using, for example, a Hastings-Metropolis algorithm. [sent-84, score-0.224]
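A hedged sketch of that step: a random-walk Metropolis-Hastings chain targeting the unnormalised log of the posterior (4). The step size, chain length and the Bernoulli example at the bottom are illustrative choices, not tuned values from the paper.

```python
import numpy as np

def log_posterior(theta, n, s, F, d2F):
    """Unnormalised log of (4): 0.5*log F''(theta) + theta*s - n*F(theta), with s = sum_i T(y_i)."""
    return 0.5 * np.log(d2F(theta)) + theta * s - n * F(theta)

def sample_posterior_mh(n, s, F, d2F, rng, theta0=0.0, steps=2000, step_size=0.5):
    """Random-walk Metropolis-Hastings targeting the Jeffreys posterior; returns the final state."""
    theta = theta0
    logp = log_posterior(theta, n, s, F, d2F)
    for _ in range(steps):
        proposal = theta + step_size * rng.standard_normal()
        logp_prop = log_posterior(proposal, n, s, F, d2F)
        if np.log(rng.random()) < logp_prop - logp:   # accept with prob min(1, ratio)
            theta, logp = proposal, logp_prop
    return theta

# Example with the Bernoulli family: F(theta) = log(1 + e^theta), F''(theta) = sigmoid*(1 - sigmoid).
F = lambda t: np.log1p(np.exp(t))
d2F = lambda t: np.exp(t) / (1.0 + np.exp(t)) ** 2
rng = np.random.default_rng(1)
print(sample_posterior_mh(n=20, s=14.0, F=F, d2F=d2F, rng=rng))   # one posterior draw of theta
```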
55 Algorithm 1 (Thompson Sampling for Exponential Families with the Jeffreys prior). Require: F normalization function, T sufficient statistic, µ mean function. [sent-85, score-0.132]
56 Initialization: for t = 1, . . . , K do: sample arm t, get reward xt, and set Nt = 1, St = T(xt). [sent-88, score-0.405]
57 The posterior given y1, . . . , yn depends only on n and s = Σi=1..n T(yi). Some examples of common exponential family models are given in Figure 1, together with the posterior distributions on the parameter λ that is used by TS with the Jeffreys prior. [sent-99, score-0.372]
58 These last two distributions are not covered even by the work in [8], and belong to the family of heavy-tailed distributions. [sent-101, score-0.192]
59 For the Bernoulli model, we note further that the use of the Jeffreys prior is not covered by the previous analyses. [sent-102, score-0.157]
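For the Bernoulli model the Jeffreys prior is the Beta(1/2, 1/2) distribution on the success probability, so the posterior stays Beta and the sampling step is exact (no Metropolis-Hastings needed). The self-contained simulation below is a hedged sketch; the arm means, horizon and seed are illustrative, not from the paper.

```python
import numpy as np

def thompson_bernoulli_jeffreys(means, horizon, seed=0):
    """Thompson Sampling for Bernoulli arms with the Jeffreys prior Beta(1/2, 1/2) on each arm."""
    rng = np.random.default_rng(seed)
    K = len(means)
    successes = np.zeros(K)
    failures = np.zeros(K)
    best = max(means)
    regret = 0.0
    for _ in range(horizon):
        # Posterior of arm a is Beta(1/2 + successes[a], 1/2 + failures[a]).
        samples = rng.beta(0.5 + successes, 0.5 + failures)
        a = int(np.argmax(samples))
        reward = float(rng.random() < means[a])
        successes[a] += reward
        failures[a] += 1.0 - reward
        regret += best - means[a]
    return regret

# Expected regret should grow logarithmically in the horizon, as in Theorem 1.
print(thompson_bernoulli_jeffreys([0.9, 0.8, 0.5], horizon=10000))
```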
60 3 Results and Proof of Regret Bound An exponential family K-armed bandit is a K-armed bandit for which the reward distributions pa are known to be elements of an exponential family of distributions P(Θ). [sent-104, score-0.895]
61 We denote by pθa the distribution of arm a and its mean by µa = µ(θa ). [sent-105, score-0.271]
62 Assume that µ1 > µa for all a ≠ 1, and that πa,0 is taken to be the Jeffreys prior over Θ. [sent-107, score-0.132]
63 Then for every ε > 0 there exists a constant C(ε, P) depending on ε and on the problem P such that the regret of Thompson Sampling using the Jeffreys prior satisfies R(AπJ, T) ≤ ((1 + ε)/(1 − ε)) Σa=2..K (µ1 − µa)/K(θa, θ1) · ln(T) + C(ε, P). [sent-108, score-0.263]
64 Proof: We give here the main argument of the proof of the regret bound, which proceeds by bounding the expected number of draws of any suboptimal arm. [sent-109, score-0.253]
65 Along the way we shall state concentration results whose proofs are postponed to later sections. [sent-110, score-0.14]
66 Step 0: Notation We denote by ya,s the s-th observation of arm a and by Na,t the number of times arm a is chosen up to time t. [sent-111, score-0.542]
67 Let Ya^u := (ya,s)1≤s≤u be the vector of the first u observations from arm a. [sent-116, score-0.293]
68 Ya,t := Ya^(Na,t) is therefore the vector of observations from arm a available at the beginning of round t. [sent-117, score-0.32]
69 Observations from arm a such that p(ya,s | θa) ≥ L(θa) can therefore be seen as likely observations. [sent-120, score-0.271]
70 On Ẽa,t, the empirical sufficient statistic of arm a at round t is well concentrated around its mean, and a 'likely' realization of arm a has been observed. [sent-123, score-0.633]
71 Step 1: Decomposition The idea of the proof is to decompose the probability of playing a suboptimal arm using the events given in Step 0 and the fact that E[Na,T] = Σt=1..T P(at = a): E[Na,T] = Σt=1..T P(at = a, Eθa,t, Ẽa,t) (A) + Σt=1..T P(at = a, (Eθa,t)^c, Ẽa,t) (B) + Σt=1..T P(at = a, (Ẽa,t)^c) (C). [sent-126, score-0.327]
72 Term (C) is controlled by the concentration of the empirical sufficient statistic, and (B) is controlled by the tail probabilities of the posterior distribution. [sent-128, score-0.289]
73 We give the needed concentration results in Step 2. [sent-129, score-0.14]
74 When conditioned on the event that the optimal arm is played at least polynomially often, term (A) can be decomposed further, and then controlled by the results from Step 2. [sent-130, score-0.356]
75 Step 3 proves that the optimal arm is played this many times. [sent-131, score-0.347]
76 Step 2: Concentration Results We state here the two concentration results that are necessary to evaluate the probability of the above events. [sent-132, score-0.14]
77 Step 3: Lower Bound the Number of Optimal Arm Plays with High Probability The main difficulty addressed in previous regret analyses for Thompson Sampling is the control of the number of draws of the optimal arm. [sent-145, score-0.235]
78 The proof of this result, an outline of which is given in Appendix D, explores in depth the randomised nature of Thompson Sampling. [sent-147, score-0.096]
79 ∀b ∈ (0, 1), ∃Cb(π, µ1, µ2, K) < ∞ such that Σt≥1 P(N1,t ≤ t^b) ≤ Cb. [sent-149, score-0.132]
80 Step 4: Bounding the Terms of the Decomposition Now we bound the terms of the decomposition as discussed in Step 1: An upper bound on term (C) is given in (6), whereas a bound on term (B) follows from Lemma 6 below. [sent-150, score-0.189]
81 Although the proof of this lemma is standard, and bears a strong similarity to Lemma 3 of [3], we provide it in Appendix C for the sake of completeness. [sent-151, score-0.104]
82 The first inequality comes from Proposition 5, and the second inequality comes from the following fact: if arm 1 is not chosen and arm a is such that µ(θa,t ) ≤ µa + ∆a , then µ(θ1,t ) ≤ µa + ∆a . [sent-156, score-0.602]
83 In Theorem 4, we bound the conditional probability that µ(θa,t) exceeds the true mean. [sent-158, score-0.081]
84 Step 2: Upper bounding the numerator of (9) We first note that on Θθ,∆ the leading term in the exponential is K(θ, θ′). [sent-173, score-0.202]
85 Step 3: Lower bounding the denominator of (9) To lower bound the denominator, we reduce the integral on the whole space Θ to a KL-ball, and use the structure of the prior to lower bound the measure of that KL-ball under the posterior obtained with the well-chosen observation ys . [sent-177, score-0.829]
86 Using this inequality we can then bound the denominator of (9) whenever u ≥ N1(θ, F) and δ < 1: ∫θ′∈Θ e^(−(u−1)(K(θ,θ′)+δ|θ−θ′|)) π(θ′|ys) dθ′ ≥ ∫θ′∈B1/u²(θ) e^(−(u−1)(K(θ,θ′)+δ|θ−θ′|)) π(θ′|ys) dθ′ ≥ ∫θ′∈B1/u²(θ) e^(−(u−1)(K(θ,θ′)+δ√(2K(θ,θ′)/F″(θ)))) π(θ′|ys) dθ′ ≥ π(B1/u²(θ) | ys) e^(−(1+√(2/F″(θ)))). [sent-181, score-0.138]
87 Thus, it follows that L′(θ) = L′(θ, F) := ∫Θ sup{y: p(y|θ)>L(θ)} p(y|θ′) √(F″(θ′)) dθ′ < ∞ is an upper bound on the denominator ∫Θ p(ys|θ′) π0(θ′) dθ′ of (12). [sent-187, score-0.132]
88 Note that when the prior is proper we do not need to introduce the observation ys , which significantly simplifies the argument. [sent-191, score-0.503]
89 5 Conclusion We have shown that choosing to use the Jeffreys prior in Thompson Sampling leads to an asymptotically optimal algorithm for bandit models whose rewards belong to a 1-dimensional canonical exponential family. [sent-194, score-0.708]
90 The cornerstone of our proof is a finite time concentration bound for posterior distributions in exponential families, which, to the best of our knowledge, is new to the literature. [sent-195, score-0.546]
91 Thompson Sampling with Jeffreys prior is now a provably competitive alternative to KL-UCB for exponential family bandits. [sent-197, score-0.329]
92 Moreover our proof holds for slightly more general problems than those for which KL-UCB is provably optimal, including some heavy-tailed exponential family bandits. [sent-198, score-0.253]
93 Notably generalising to n-dimensional exponential family bandits requires only generalising Lemma 3 and Step 3 in the proof of Theorem 4. [sent-200, score-0.45]
94 Another possible future direction lies in the optimal choice of the prior distribution. [sent-203, score-0.158]
95 Our theoretical guarantees only hold for the Jeffreys prior, but a careful examination of our proof shows that the important property is to have, for every θa, −ln ∫{θ′: K(θa,θ′) ≤ n^(−2)} π0(θ′) dθ′ = o(n), which could hold for prior distributions other than the Jeffreys prior. [sent-204, score-0.321]
96 Analysis of Thompson Sampling for the multi-armed bandit problem. [sent-208, score-0.35]
97 A note on the Bayesian regret of Thompson Sampling with an arbitrary prior. [sent-234, score-0.558]
98 The KL-UCB algorithm for bounded stochastic bandits and beyond. [sent-248, score-0.077]
99 An asymptotically optimal bandit algorithm for bounded support models. [sent-253, score-0.223]
100 An invariant form for the prior probability in estimation problems. [sent-257, score-0.155]
wordName wordTfidf (topN-words)
[('jeffreys', 0.566), ('ys', 0.371), ('thompson', 0.356), ('arm', 0.271), ('bandit', 0.156), ('concentration', 0.14), ('prior', 0.132), ('tb', 0.132), ('regret', 0.131), ('posterior', 0.127), ('exponential', 0.12), ('rewards', 0.11), ('families', 0.089), ('cb', 0.088), ('ln', 0.085), ('family', 0.077), ('bandits', 0.077), ('sampling', 0.071), ('statistic', 0.064), ('bernoulli', 0.06), ('generalising', 0.06), ('arms', 0.058), ('pareto', 0.057), ('proof', 0.056), ('bound', 0.055), ('fisher', 0.053), ('denominator', 0.053), ('korda', 0.051), ('canonical', 0.05), ('distributions', 0.048), ('analyses', 0.048), ('agrawal', 0.048), ('optimistic', 0.048), ('lemma', 0.048), ('reward', 0.047), ('numerator', 0.046), ('pa', 0.046), ('divergence', 0.046), ('xm', 0.045), ('belong', 0.042), ('asymptotically', 0.041), ('randomised', 0.04), ('kaufmann', 0.039), ('parametric', 0.038), ('normalisation', 0.038), ('garivier', 0.038), ('weibull', 0.038), ('sequel', 0.036), ('bounding', 0.036), ('capp', 0.036), ('lai', 0.036), ('nord', 0.036), ('kt', 0.036), ('event', 0.034), ('asymptotic', 0.034), ('xs', 0.033), ('sat', 0.033), ('nat', 0.033), ('lille', 0.033), ('argmaxa', 0.033), ('beta', 0.032), ('choosing', 0.031), ('ya', 0.031), ('ea', 0.031), ('balls', 0.03), ('europe', 0.03), ('inequality', 0.03), ('draws', 0.03), ('team', 0.029), ('munos', 0.029), ('proposition', 0.028), ('continuity', 0.028), ('eu', 0.028), ('pseudocode', 0.028), ('round', 0.027), ('colt', 0.026), ('step', 0.026), ('ts', 0.026), ('optimal', 0.026), ('constants', 0.026), ('exceed', 0.026), ('inria', 0.025), ('proves', 0.025), ('actions', 0.025), ('covered', 0.025), ('played', 0.025), ('appendix', 0.024), ('kl', 0.024), ('contextual', 0.024), ('xt', 0.024), ('upper', 0.024), ('optimality', 0.024), ('parametrized', 0.023), ('invariant', 0.023), ('theorem', 0.022), ('observations', 0.022), ('tail', 0.022), ('risk', 0.022), ('covx', 0.022), ('nf', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999997 330 nips-2013-Thompson Sampling for 1-Dimensional Exponential Family Bandits
Author: Nathaniel Korda, Emilie Kaufmann, Remi Munos
Abstract: Thompson Sampling has been demonstrated in many complex bandit models, however the theoretical guarantees available for the parametric multi-armed bandit are still limited to the Bernoulli case. Here we extend them by proving asymptotic optimality of the algorithm using the Jeffreys prior for 1-dimensional exponential family bandits. Our proof builds on previous work, but also makes extensive use of closed forms for Kullback-Leibler divergence and Fisher information (through the Jeffreys prior) available in an exponential family. This allow us to give a finite time exponential concentration inequality for posterior distributions on exponential families that may be of interest in its own right. Moreover our analysis covers some distributions for which no optimistic algorithm has yet been proposed, including heavy-tailed exponential families. 1
2 0.3731927 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling
Author: Sebastien Bubeck, Che-Yu Liu
Abstract: We consider the stochastic multi-armed bandit problem with a prior distribution on the reward distributions. We are interested in studying prior-free and priordependent regret bounds, very much in the same spirit than the usual distributionfree and distribution-dependent bounds for the non-Bayesian stochastic bandit. We first show that Thompson Sampling attains an optimal prior-free bound in the sense that for any prior distribution its Bayesian regret is bounded from above by √ 14 nK. This result is unimprovable in the sense that there exists a prior distribution such that any algorithm has a Bayesian regret bounded from below by √ 1 20 nK. We also study the case of priors for the setting of Bubeck et al. [2013] (where the optimal mean is known as well as a lower bound on the smallest gap) and we show that in this case the regret of Thompson Sampling is in fact uniformly bounded over time, thus showing that Thompson Sampling can greatly take advantage of the nice properties of these priors. 1
3 0.32851326 338 nips-2013-Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards
Author: Thomas Bonald, Alexandre Proutiere
Abstract: We consider an infinite-armed bandit problem with Bernoulli rewards. The mean rewards are independent, uniformly distributed over [0, 1]. Rewards 0 and 1 are referred to as a success and a failure, respectively. We propose a novel algorithm where the decision to exploit any arm is based on two successive targets, namely, the total number of successes until the first failure and until the first m failures, respectively, where m is a fixed parameter. This two-target algorithm achieves a √ long-term average regret in 2n for a large parameter m and a known time horizon n. This regret is optimal and strictly less than the regret achieved by the best √ known algorithms, which is in 2 n. The results are extended to any mean-reward distribution whose support contains 1 and to unknown time horizons. Numerical experiments show the performance of the algorithm for finite time horizons. 1
4 0.24412963 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration
Author: Dan Russo, Benjamin Van Roy
Abstract: This paper considers the sample complexity of the multi-armed bandit with dependencies among the arms. Some of the most successful algorithms for this problem use the principle of optimism in the face of uncertainty to guide exploration. The clearest example of this is the class of upper confidence bound (UCB) algorithms, but recent work has shown that a simple posterior sampling algorithm, sometimes called Thompson sampling, can be analyzed in the same manner as optimistic approaches. In this paper, we develop a regret bound that holds for both classes of algorithms. This bound applies broadly and can be specialized to many model classes. It depends on a new notion we refer to as the eluder dimension, which measures the degree of dependence among action rewards. Compared to UCB algorithm regret bounds for specific model classes, our general bound matches the best available for linear models and is stronger than the best available for generalized linear models. 1
5 0.20870923 95 nips-2013-Distributed Exploration in Multi-Armed Bandits
Author: Eshcar Hillel, Zohar S. Karnin, Tomer Koren, Ronny Lempel, Oren Somekh
Abstract: We study exploration in Multi-Armed Bandits in a setting where k players collaborate in order to identify an ε-optimal arm. Our motivation comes from recent employment of bandit algorithms in computationally intensive, large-scale applications. Our results demonstrate a non-trivial tradeoff between the number of arm pulls required by each of the players, and the amount of communication between them. In particular, our main result shows that by allowing the k players to com√ municate only once, they are able to learn k times faster than a single player. √ That is, distributing learning to k players gives rise to a factor k parallel speedup. We complement this result with a lower bound showing this is in general the best possible. On the other extreme, we present an algorithm that achieves the ideal factor k speed-up in learning performance, with communication only logarithmic in 1/ε. 1
6 0.17418441 67 nips-2013-Conditional Random Fields via Univariate Exponential Families
7 0.1584537 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search
8 0.14778928 292 nips-2013-Sequential Transfer in Multi-armed Bandit with Finite Set of Models
9 0.12119826 124 nips-2013-Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting
10 0.11512452 242 nips-2013-PAC-Bayes-Empirical-Bernstein Inequality
11 0.10835017 137 nips-2013-High-Dimensional Gaussian Process Bandits
12 0.1078539 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling
13 0.10123151 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries
14 0.10055266 7 nips-2013-A Gang of Bandits
15 0.10007557 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion
16 0.095625058 325 nips-2013-The Pareto Regret Frontier
17 0.089354381 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings
18 0.08836703 230 nips-2013-Online Learning with Costly Features and Labels
19 0.080056109 89 nips-2013-Dimension-Free Exponentiated Gradient
20 0.07851591 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence
topicId topicWeight
[(0, 0.191), (1, -0.166), (2, 0.126), (3, -0.116), (4, -0.005), (5, -0.035), (6, 0.249), (7, -0.167), (8, -0.069), (9, -0.078), (10, 0.033), (11, 0.265), (12, -0.162), (13, -0.064), (14, -0.004), (15, -0.048), (16, -0.096), (17, 0.063), (18, -0.057), (19, 0.135), (20, 0.12), (21, 0.04), (22, 0.101), (23, -0.045), (24, -0.014), (25, 0.009), (26, 0.018), (27, 0.024), (28, 0.017), (29, 0.048), (30, 0.067), (31, -0.045), (32, -0.059), (33, 0.037), (34, -0.068), (35, 0.045), (36, 0.035), (37, -0.108), (38, 0.015), (39, -0.047), (40, 0.028), (41, -0.046), (42, 0.026), (43, -0.084), (44, 0.031), (45, -0.07), (46, 0.036), (47, -0.011), (48, 0.049), (49, 0.071)]
simIndex simValue paperId paperTitle
same-paper 1 0.94320071 330 nips-2013-Thompson Sampling for 1-Dimensional Exponential Family Bandits
Author: Nathaniel Korda, Emilie Kaufmann, Remi Munos
Abstract: Thompson Sampling has been demonstrated in many complex bandit models, however the theoretical guarantees available for the parametric multi-armed bandit are still limited to the Bernoulli case. Here we extend them by proving asymptotic optimality of the algorithm using the Jeffreys prior for 1-dimensional exponential family bandits. Our proof builds on previous work, but also makes extensive use of closed forms for Kullback-Leibler divergence and Fisher information (through the Jeffreys prior) available in an exponential family. This allow us to give a finite time exponential concentration inequality for posterior distributions on exponential families that may be of interest in its own right. Moreover our analysis covers some distributions for which no optimistic algorithm has yet been proposed, including heavy-tailed exponential families. 1
2 0.86501837 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling
Author: Sebastien Bubeck, Che-Yu Liu
Abstract: We consider the stochastic multi-armed bandit problem with a prior distribution on the reward distributions. We are interested in studying prior-free and priordependent regret bounds, very much in the same spirit than the usual distributionfree and distribution-dependent bounds for the non-Bayesian stochastic bandit. We first show that Thompson Sampling attains an optimal prior-free bound in the sense that for any prior distribution its Bayesian regret is bounded from above by √ 14 nK. This result is unimprovable in the sense that there exists a prior distribution such that any algorithm has a Bayesian regret bounded from below by √ 1 20 nK. We also study the case of priors for the setting of Bubeck et al. [2013] (where the optimal mean is known as well as a lower bound on the smallest gap) and we show that in this case the regret of Thompson Sampling is in fact uniformly bounded over time, thus showing that Thompson Sampling can greatly take advantage of the nice properties of these priors. 1
3 0.82914758 338 nips-2013-Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards
Author: Thomas Bonald, Alexandre Proutiere
Abstract: We consider an infinite-armed bandit problem with Bernoulli rewards. The mean rewards are independent, uniformly distributed over [0, 1]. Rewards 0 and 1 are referred to as a success and a failure, respectively. We propose a novel algorithm where the decision to exploit any arm is based on two successive targets, namely, the total number of successes until the first failure and until the first m failures, respectively, where m is a fixed parameter. This two-target algorithm achieves a √ long-term average regret in 2n for a large parameter m and a known time horizon n. This regret is optimal and strictly less than the regret achieved by the best √ known algorithms, which is in 2 n. The results are extended to any mean-reward distribution whose support contains 1 and to unknown time horizons. Numerical experiments show the performance of the algorithm for finite time horizons. 1
4 0.75576425 292 nips-2013-Sequential Transfer in Multi-armed Bandit with Finite Set of Models
Author: Mohammad Gheshlaghi azar, Alessandro Lazaric, Emma Brunskill
Abstract: Learning from prior tasks and transferring that experience to improve future performance is critical for building lifelong learning agents. Although results in supervised and reinforcement learning show that transfer may significantly improve the learning performance, most of the literature on transfer is focused on batch learning tasks. In this paper we study the problem of sequential transfer in online learning, notably in the multi–armed bandit framework, where the objective is to minimize the total regret over a sequence of tasks by transferring knowledge from prior tasks. We introduce a novel bandit algorithm based on a method-of-moments approach for estimating the possible tasks and derive regret bounds for it. 1
5 0.64781684 95 nips-2013-Distributed Exploration in Multi-Armed Bandits
Author: Eshcar Hillel, Zohar S. Karnin, Tomer Koren, Ronny Lempel, Oren Somekh
Abstract: We study exploration in Multi-Armed Bandits in a setting where k players collaborate in order to identify an ε-optimal arm. Our motivation comes from recent employment of bandit algorithms in computationally intensive, large-scale applications. Our results demonstrate a non-trivial tradeoff between the number of arm pulls required by each of the players, and the amount of communication between them. In particular, our main result shows that by allowing the k players to com√ municate only once, they are able to learn k times faster than a single player. √ That is, distributing learning to k players gives rise to a factor k parallel speedup. We complement this result with a lower bound showing this is in general the best possible. On the other extreme, we present an algorithm that achieves the ideal factor k speed-up in learning performance, with communication only logarithmic in 1/ε. 1
6 0.5948506 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration
7 0.44963849 67 nips-2013-Conditional Random Fields via Univariate Exponential Families
8 0.44840875 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling
9 0.44831172 124 nips-2013-Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting
10 0.4043718 137 nips-2013-High-Dimensional Gaussian Process Bandits
11 0.38974172 112 nips-2013-Estimation Bias in Multi-Armed Bandit Algorithms for Search Advertising
12 0.38784939 270 nips-2013-Regret based Robust Solutions for Uncertain Markov Decision Processes
13 0.38504979 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search
14 0.37797371 152 nips-2013-Learning Efficient Random Maximum A-Posteriori Predictors with Non-Decomposable Loss Functions
15 0.34955546 242 nips-2013-PAC-Bayes-Empirical-Bernstein Inequality
17 0.3347325 7 nips-2013-A Gang of Bandits
18 0.33157933 218 nips-2013-On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations
19 0.32501665 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries
20 0.31487858 217 nips-2013-On Poisson Graphical Models
topicId topicWeight
[(2, 0.016), (16, 0.027), (33, 0.256), (34, 0.099), (36, 0.01), (41, 0.012), (49, 0.019), (56, 0.173), (59, 0.155), (70, 0.026), (85, 0.052), (89, 0.039), (93, 0.025)]
simIndex simValue paperId paperTitle
1 0.95768952 213 nips-2013-Nonparametric Multi-group Membership Model for Dynamic Networks
Author: Myunghwan Kim, Jure Leskovec
Abstract: unkown-abstract
2 0.95449471 134 nips-2013-Graphical Models for Inference with Missing Data
Author: Karthika Mohan, Judea Pearl, Jin Tian
Abstract: We address the problem of recoverability i.e. deciding whether there exists a consistent estimator of a given relation Q, when data are missing not at random. We employ a formal representation called ‘Missingness Graphs’ to explicitly portray the causal mechanisms responsible for missingness and to encode dependencies between these mechanisms and the variables being measured. Using this representation, we derive conditions that the graph should satisfy to ensure recoverability and devise algorithms to detect the presence of these conditions in the graph. 1
3 0.94082892 40 nips-2013-Approximate Inference in Continuous Determinantal Processes
Author: Raja Hafiz Affandi, Emily Fox, Ben Taskar
Abstract: Determinantal point processes (DPPs) are random point processes well-suited for modeling repulsion. In machine learning, the focus of DPP-based models has been on diverse subset selection from a discrete and finite base set. This discrete setting admits an efficient sampling algorithm based on the eigendecomposition of the defining kernel matrix. Recently, there has been growing interest in using DPPs defined on continuous spaces. While the discrete-DPP sampler extends formally to the continuous case, computationally, the steps required are not tractable in general. In this paper, we present two efficient DPP sampling schemes that apply to a wide range of kernel functions: one based on low rank approximations via Nystr¨ m o and random Fourier feature techniques and another based on Gibbs sampling. We demonstrate the utility of continuous DPPs in repulsive mixture modeling and synthesizing human poses spanning activity spaces. 1
same-paper 4 0.93399066 330 nips-2013-Thompson Sampling for 1-Dimensional Exponential Family Bandits
Author: Nathaniel Korda, Emilie Kaufmann, Remi Munos
Abstract: Thompson Sampling has been demonstrated in many complex bandit models, however the theoretical guarantees available for the parametric multi-armed bandit are still limited to the Bernoulli case. Here we extend them by proving asymptotic optimality of the algorithm using the Jeffreys prior for 1-dimensional exponential family bandits. Our proof builds on previous work, but also makes extensive use of closed forms for Kullback-Leibler divergence and Fisher information (through the Jeffreys prior) available in an exponential family. This allow us to give a finite time exponential concentration inequality for posterior distributions on exponential families that may be of interest in its own right. Moreover our analysis covers some distributions for which no optimistic algorithm has yet been proposed, including heavy-tailed exponential families. 1
5 0.89716804 338 nips-2013-Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards
Author: Thomas Bonald, Alexandre Proutiere
Abstract: We consider an infinite-armed bandit problem with Bernoulli rewards. The mean rewards are independent, uniformly distributed over [0, 1]. Rewards 0 and 1 are referred to as a success and a failure, respectively. We propose a novel algorithm where the decision to exploit any arm is based on two successive targets, namely, the total number of successes until the first failure and until the first m failures, respectively, where m is a fixed parameter. This two-target algorithm achieves a √ long-term average regret in 2n for a large parameter m and a known time horizon n. This regret is optimal and strictly less than the regret achieved by the best √ known algorithms, which is in 2 n. The results are extended to any mean-reward distribution whose support contains 1 and to unknown time horizons. Numerical experiments show the performance of the algorithm for finite time horizons. 1
6 0.89621019 179 nips-2013-Low-Rank Matrix and Tensor Completion via Adaptive Sampling
8 0.89372486 233 nips-2013-Online Robust PCA via Stochastic Optimization
9 0.89353448 216 nips-2013-On Flat versus Hierarchical Classification in Large-Scale Taxonomies
10 0.89335227 307 nips-2013-Speedup Matrix Completion with Side Information: Application to Multi-Label Learning
11 0.89248759 127 nips-2013-Generalized Denoising Auto-Encoders as Generative Models
12 0.89022654 279 nips-2013-Robust Bloom Filters for Large MultiLabel Classification Tasks
13 0.88957787 223 nips-2013-On the Relationship Between Binary Classification, Bipartite Ranking, and Binary Class Probability Estimation
14 0.88953233 113 nips-2013-Exact and Stable Recovery of Pairwise Interaction Tensors
15 0.88883233 344 nips-2013-Using multiple samples to learn mixture models
16 0.88826102 311 nips-2013-Stochastic Convex Optimization with Multiple Objectives
17 0.88740361 247 nips-2013-Phase Retrieval using Alternating Minimization
18 0.88732082 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions
19 0.88623339 44 nips-2013-B-test: A Non-parametric, Low Variance Kernel Two-sample Test
20 0.88526809 74 nips-2013-Convex Tensor Decomposition via Structured Schatten Norm Regularization