nips nips2013 nips2013-253 knowledge-graph by maker-knowledge-mining

253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling


Source: pdf

Author: Sebastien Bubeck, Che-Yu Liu

Abstract: We consider the stochastic multi-armed bandit problem with a prior distribution on the reward distributions. We are interested in studying prior-free and prior-dependent regret bounds, very much in the same spirit as the usual distribution-free and distribution-dependent bounds for the non-Bayesian stochastic bandit. We first show that Thompson Sampling attains an optimal prior-free bound in the sense that for any prior distribution its Bayesian regret is bounded from above by 14√(nK). This result is unimprovable in the sense that there exists a prior distribution such that any algorithm has a Bayesian regret bounded from below by (1/20)√(nK). We also study the case of priors for the setting of Bubeck et al. [2013] (where the optimal mean is known as well as a lower bound on the smallest gap) and we show that in this case the regret of Thompson Sampling is in fact uniformly bounded over time, thus showing that Thompson Sampling can greatly take advantage of the nice properties of these priors.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Prior-free and prior-dependent regret bounds for Thompson Sampling Sébastien Bubeck, Che-Yu Liu Department of Operations Research and Financial Engineering, Princeton University sbubeck@princeton.edu [sent-1, score-0.304]

2 Abstract We consider the stochastic multi-armed bandit problem with a prior distribution on the reward distributions. [sent-3, score-0.318]

3 We are interested in studying prior-free and prior-dependent regret bounds, very much in the same spirit as the usual distribution-free and distribution-dependent bounds for the non-Bayesian stochastic bandit. [sent-4, score-0.365]

4 We first show that Thompson Sampling attains an optimal prior-free bound in the sense that for any prior distribution its Bayesian regret is bounded from above by 14√(nK). [sent-5, score-0.449]

5 This result is unimprovable in the sense that there exists a prior distribution such that any algorithm has a Bayesian regret bounded from below by (1/20)√(nK). [sent-6, score-0.426]

6 We also study the case of priors for the setting of Bubeck et al. [sent-7, score-0.056]

7 [2013] (where the optimal mean is known as well as a lower bound on the smallest gap) and we show that in this case the regret of Thompson Sampling is in fact uniformly bounded over time, thus showing that Thompson Sampling can greatly take advantage of the nice properties of these priors. [sent-8, score-0.33]

8 1 Introduction In this paper we are interested in the Bayesian multi-armed bandit problem which can be described as follows. [sent-9, score-0.138]

9 For i ∈ [K], let (Xi,s )s≥1 be identically distributed random variables taking values in [0, 1] and which are independent conditionally on θ. [sent-11, score-0.052]

10 Consider now an agent facing K actions (or arms). [sent-13, score-0.069]

11 The agent receives the reward Xi,s when he pulls arm i for the s-th time. [sent-18, score-0.545]

12 The arm selection is based only on past observed rewards and potentially on an external source of randomness. [sent-19, score-0.304]

13 Let (Us)s≥1 be a sequence of random variables uniformly distributed on [0, 1], and let Ti(s) = Σ_{t=1}^{s} 1{It = i}; then It is a random variable measurable with respect to σ(I1, X1,1, . . . [sent-23, score-0.051]

14 We measure the performance of the agent through the Bayesian regret defined as BRn = E Σ_{t=1}^{n} ( max_{i∈[K]} µi(θ) − µIt(θ) ), where the expectation is taken with respect to the parameter θ, the rewards (Xi,s)s≥1, and the external source of randomness (Us)s≥1. [sent-27, score-0.407]
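
For concreteness, here is a minimal Monte Carlo sketch of how BRn can be estimated in simulation. The sampler and agent interfaces (sample_theta, sample_reward, means, agent_factory) are hypothetical names introduced purely for illustration; they are not from the paper.

```python
import numpy as np

def bayesian_regret(sample_theta, sample_reward, means, agent_factory,
                    n, n_sims=200, seed=0):
    """Monte Carlo estimate of BRn = E[ sum_{t=1}^n ( max_i mu_i(theta) - mu_{I_t}(theta) ) ].

    sample_theta(rng)            -- draw theta from the prior pi_0   (assumed interface)
    sample_reward(theta, i, rng) -- draw a reward X_{i,s} in [0, 1] for arm i
    means(theta)                 -- vector (mu_1(theta), ..., mu_K(theta))
    agent_factory()              -- fresh agent exposing .select(rng) and .update(i, x)
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_sims):
        theta = sample_theta(rng)
        mu = np.asarray(means(theta), dtype=float)
        agent = agent_factory()
        for _ in range(n):
            i = agent.select(rng)                # arm I_t
            x = sample_reward(theta, i, rng)     # observed reward
            agent.update(i, x)
            total += mu.max() - mu[i]            # instantaneous pseudo-regret
    return total / n_sims
```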

15 We will also be interested in the individual regret Rn (θ) which is defined similarly except that θ is fixed (instead of being integrated over π0 ). [sent-28, score-0.332]

16 Given a prior π0 the problem of finding an optimal strategy to minimize the Bayesian regret BRn is a well-defined optimization problem and as such it is merely a computational problem. [sent-30, score-0.371]

17 In this latter view the agent’s strategy must have a low regret Rn (θ) for any θ ∈ Θ. [sent-32, score-0.337]

18 Both formulations of the problem have a long history and we refer the interested reader to Bubeck and Cesa-Bianchi [2012] for a survey of the extensive recent literature on the learning setting. [sent-33, score-0.038]

19 In the Bayesian setting a major breakthrough was achieved in Gittins [1979] where it was shown that when the prior distribution takes a product form an optimal strategy is given by the Gittins indices (which are relatively easy to compute). [sent-34, score-0.101]

20 The product assumption on the prior means that the reward processes (Xi,s )s≥1 are independent across arms. [sent-35, score-0.218]

21 In the present paper we are precisely interested in the situations where this assumption is not satisfied. [sent-36, score-0.038]

22 Indeed we believe that one of the strengths of the Bayesian setting is that one can incorporate prior knowledge on the arms in a very transparent way. [sent-37, score-0.12]

23 A prototypical example that we shall consider later on in this paper is when one knows the distributions of the arms up to a permutation, in which case the reward processes are strongly dependent. [sent-38, score-0.256]

24 In general without the product assumption on the prior it seems hopeless (from a computational perspective) to look for the optimal Bayesian strategy. [sent-39, score-0.058]

25 Thus, despite being in a Bayesian setting, it makes sense to view it as a learning problem and to evaluate the agent’s performance through its Bayesian regret. [sent-40, score-0.049]

26 In this paper we are particularly interested in studying the Thompson Sampling strategy which was proposed in the very first paper on the multi-armed bandit problem Thompson [1933]. [sent-41, score-0.181]

27 This strategy can be described very succinctly: let πt be the posterior distribution on θ given the history Ht = (I1 , X1,1 , . [sent-42, score-0.065]

28 Then Thompson Sampling first draws a parameter θ(t) from πt (independently from the past given πt) and it pulls It ∈ argmax_{i∈[K]} µi(θ(t)). [sent-46, score-0.112]
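
The loop just described — draw θ(t) from the posterior πt, then pull an arm maximizing µi(θ(t)) — is easy to sketch in code. The snippet below instantiates it with independent Beta posteriors for Bernoulli rewards purely as an illustration; the paper itself works with general (possibly correlated) priors, and the class and method names are ours. It plugs into the Monte Carlo harness sketched after sentence 14 via agent_factory=lambda: BetaBernoulliTS(K).

```python
import numpy as np

class BetaBernoulliTS:
    """Thompson Sampling with independent Beta(1, 1) priors (an illustrative special case).

    At round t: draw theta^(t) ~ pi_t, then pull I_t in argmax_i mu_i(theta^(t)).
    """
    def __init__(self, n_arms):
        self.a = np.ones(n_arms)   # Beta alpha parameters (1 + number of successes)
        self.b = np.ones(n_arms)   # Beta beta parameters  (1 + number of failures)

    def select(self, rng):
        theta = rng.beta(self.a, self.b)   # one posterior sample per arm
        return int(np.argmax(theta))       # I_t = argmax_i mu_i(theta^(t))

    def update(self, i, x):
        self.a[i] += x                     # Bernoulli reward x in {0, 1}
        self.b[i] += 1 - x
```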

29 Recently there has been a surge of interest in this simple policy, mainly because of its flexibility to incorporate prior knowledge on the arms; see for example Chapelle and Li [2011]. [sent-47, score-0.079]

30 The specific case of binary rewards with a Beta prior is now very well understood thanks to the papers Agrawal and Goyal [2012a], Kaufmann et al. [sent-49, score-0.156]

31 However, as we pointed out above, here we are interested in proving regret bounds for the more realistic scenario where one runs Thompson Sampling with a hand-tuned prior distribution, possibly very different from a Beta prior. [sent-51, score-0.4]

32 The first result in that spirit was obtained very recently by Russo and Roy [2013] who showed that for any prior distribution π0 Thompson Sampling always satisfies BRn ≤ 5√(nK log n). [sent-52, score-0.138]

33 Our first contribution is to show in Section 2 that the extraneous logarithmic factor in these bounds can be removed by using ideas reminiscent of the MOSS algorithm of Audibert and Bubeck [2009]. [sent-54, score-0.072]

34 Our second contribution is to show that Thompson Sampling can take advantage of the properties of some non-trivial priors to attain much better regret guarantees. [sent-55, score-0.318]

35 More precisely in Section 2 and 3 we consider the setting of Bubeck et al. [sent-56, score-0.033]

36 [2013] (which we call the BPR setting) where µ∗ and ε > 0 are known values such that for any θ ∈ Θ, first there is a unique best arm {i∗(θ)} = argmax_{i∈[K]} µi(θ), and furthermore µ_{i∗(θ)}(θ) = µ∗, and ∆i(θ) := µ_{i∗(θ)}(θ) − µi(θ) ≥ ε, ∀i ≠ i∗(θ). [sent-57, score-0.236]

37 In other words the value of the best arm is known as well as a non-trivial lower bound on the gap between the values of the best and second best arms. [sent-58, score-0.258]
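
As a small illustration of the (reconstructed) BPR conditions above, the helper below checks whether a given mean vector µ(θ) satisfies them. It is a sketch with names of our choosing, not code from the paper.

```python
import numpy as np

def is_bpr_instance(mu, mu_star, eps, tol=1e-9):
    """Check the BPR conditions for a mean vector mu = (mu_1(theta), ..., mu_K(theta)):
    a unique best arm i*(theta) whose mean equals mu*, and a gap
    Delta_i(theta) = mu* - mu_i(theta) >= eps for every i != i*(theta)."""
    mu = np.asarray(mu, dtype=float)
    i_star = int(np.argmax(mu))
    unique_best = np.sum(mu >= mu.max() - tol) == 1
    best_is_mu_star = abs(mu[i_star] - mu_star) <= tol
    gaps_ok = bool(np.all(mu_star - np.delete(mu, i_star) >= eps))
    return unique_best and best_is_mu_star and gaps_ok
```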

38 [2013] (which we call the BPR policy), and it was shown that the BPR policy satisfies Rn(θ) = O( Σ_{i ≠ i∗(θ)} ( log(∆i(θ)/ε) / ∆i(θ) ) · log log(1/ε) ), ∀θ ∈ Θ, ∀n ≥ 1. [sent-60, score-0.231]

39 Thus the BPR policy attains a regret uniformly bounded over time in the BPR setting, a feature that standard bandit algorithms such as UCB of Auer et al. [sent-61, score-0.7]

40 Footnote 1: note however that the result of Agrawal and Goyal [2012b] applies to the individual regret Rn(θ) while the result of Russo and Roy [2013] only applies to the integrated Bayesian regret BRn. [sent-63, score-0.588]

41 It is natural to view the assumptions of the BPR setting as a prior over the reward distributions and to ask what regret guarantees Thompson Sampling attains in that situation. [sent-64, score-0.547]

42 More precisely we consider Thompson Sampling with Gaussian reward distributions and uniform prior over the possible range of parameters. [sent-65, score-0.252]

43 We then prove individual regret bounds for any sub-Gaussian distributions (similarly to Bubeck et al. [sent-66, score-0.371]

44 We obtain that Thompson Sampling uses the prior information optimally, in the sense that it also attains regret that is uniformly bounded over time. [sent-68, score-0.206]

45 Furthermore as an added bonus we remove the extraneous log-log factor of the BPR policy’s regret bound. [sent-69, score-0.308]

46 The results presented in Section 2 and 3 can be viewed as a first step towards a better understanding of prior-dependent regret bounds for Thompson Sampling. [sent-70, score-0.33]

47 2 Optimal prior-free regret bound for Thompson Sampling In this section we prove the following result. [sent-72, score-0.27]

48 Theorem 1 For any prior distribution π0 over reward distributions in [0, 1], Thompson Sampling satisfies BRn ≤ 14√(nK). [sent-73, score-0.252]

49 Remark that the above result is unimprovable in the sense that there exist prior distributions π0 such that for any algorithm one has Rn ≥ (1/20)√(nK) (see e.g. [sent-74, score-0.157]

50 Step 1: rewriting of the Bayesian regret in terms of upper confidence bounds. [sent-82, score-0.305]

51 Note that by definition θ(t) and θ are identically distributed conditionally on Ht . [sent-85, score-0.052]

52 Inspired by the MOSS strategy of Audibert and Bubeck [2009] we will now take Bi,t = µi,Ti(t−1) + √( log+( n / (K Ti(t−1)) ) / Ti(t−1) ), where µi,s = (1/s) Σ_{t=1}^{s} Xi,t and log+(x) = log(x) 1{x ≥ 1}. [sent-88, score-0.043]
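
A short sketch of this index as reconstructed above; the exact placement of the log+ term inside the square root is read from the garbled display, so treat it as an assumption.

```python
import numpy as np

def moss_type_index(sum_rewards, pulls, n, K):
    """B_{i,t} = hat_mu_{i,T_i(t-1)} + sqrt( log_+( n / (K * T_i(t-1)) ) / T_i(t-1) ),
    with log_+(x) = log(x) * 1{x >= 1}; unpulled arms get an infinite index."""
    pulls = np.asarray(pulls, dtype=float)
    safe_pulls = np.maximum(pulls, 1.0)
    mu_hat = np.asarray(sum_rewards, dtype=float) / safe_pulls
    log_plus = np.log(np.maximum(n / (K * safe_pulls), 1.0))   # = max(log(.), 0)
    index = mu_hat + np.sqrt(log_plus / safe_pulls)
    index[pulls == 0] = np.inf
    return index
```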

53 In the following we denote δ0 = 2√(K/n). From now on we work conditionally on θ and thus we drop all the dependency on θ. [sent-89, score-0.052]

54 By a simple integration of the deviations one has E(µi∗ − Bi∗,t) ≤ δ0 + ∫_{δ0}^{1} P(µi∗ − Bi∗,t ≥ u) du. [sent-92, score-0.066]

55 Next we extract the following inequality from Audibert and Bubeck [2010] (see p. 2683–2684): for any i ∈ [K], P(µi − Bi,t ≥ u) ≤ (4K/(n u²)) ( log(n u²/K) + 1 ). [sent-93, score-0.101]

56 It is easy to see that one has Σ_{s=1}^{n} P( µi,s + √( log+(n/(Ks)) / s ) − µi ≥ u ) ≤ (3/u²) log(n u²/K) + Σ_{s=s(u)}^{n} P(µi,s − µi ≥ cu). [sent-99, score-0.142]

57 Using an integration already done in Step 2 we have ∫_{δ0}^{+∞} (3/u²) log(n u²/K) du ≤ 3(1 + log(2)) √(n/K) ≤ 5.1 √(n/K). [sent-100, score-0.09]

58 Next using Hoeffding's inequality and the fact that the rewards are in [0, 1] one has, for u ≥ δ0, Σ_{s=s(u)}^{n} P(µi,s − µi ≥ cu) ≤ Σ_{s=s(u)}^{n} exp(−2sc²u²) 1{u ≤ 1/c} ≤ 1{u ≤ 1/c} / (1 − exp(−2c²u²)). Now using that 1 − exp(−x) ≥ x − x²/2 for x ≥ 0 one obtains 1/(1 − exp(−2c²u²)) ≤ 1/(2c²u² − 2c⁴u⁴) for u ≤ 1/c. [sent-102, score-0.125]

59 E Σ_{t=1}^{n} (BIt,t − µIt) = Σ_{t=1}^{n} E( E(BIt,t − µIt(θ) | θ) ) ≤ δ0 n + n ∫_{δ0}^{+∞} P(BIt,t − µIt ≥ u) du. [sent-103, score-0.057]

60 We start again by integrating the deviations; next we use the following simple inequality: ∫_{δ0}^{1/c} du/(1 − exp(−2c²u²)) = ∫_{δ0}^{1/(2c)} du/(1 − exp(−2c²u²)) + ∫_{1/(2c)}^{1/c} du/(1 − exp(−2c²u²)). [sent-104, score-0.254]

61 ≤ ∫_{δ0}^{1/(2c)} du/(2c²u² − 2c⁴u⁴) + 1/(2c(1 − exp(−1/2))) ≤ ∫_{δ0}^{1/(2c)} 2/(3c²u²) du + 1/(2c(1 − exp(−1/2))) ≤ 2/(3c²δ0) − 4/(3c) + 1/(2c(1 − exp(−1/2))). [sent-105, score-0.354]

62 Putting the pieces together we proved E Σ_{t=1}^{n} (BIt,t − µIt) ≤ 7.6 √(nK), [sent-107, score-0.143]

63 which concludes the proof together with the results of Step 1 and Step 2. [sent-108, score-0.052]

64 [2013]] we consider here the two-armed bandit problem with sub-Gaussian reward distributions (that is, they satisfy E e^{λ(X−µ)} ≤ e^{λ²/2} for all λ ∈ R) and such that one reward distribution has mean µ∗ and the other one has mean µ∗ − ∆, where µ∗ and ∆ are known values. [sent-110, score-0.454]

65 In order to derive the Thompson Sampling strategy for this problem we further assume that the reward distributions are in fact Gaussian with variance 1. [sent-111, score-0.237]

66 Recall that Thompson Sampling draws θ(t) from πt and then pulls the best arm for the environment θ(t). [sent-114, score-0.348]

67 Observe that under θ1 the best arm is arm 1 and under θ2 the best arm is arm 2. [sent-115, score-0.944]

68 In other words Thompson Sampling draws It at random with the probabilities given by the posterior πt . [sent-116, score-0.054]

69 This leads to a general algorithm for the two-armed BPR setting with sub-Gaussian reward distributions that we summarize in Figure 1. [sent-117, score-0.194]

70 The next result shows that it attains optimal performance in this setting up to a numerical constant (see Bubeck et al. [sent-118, score-0.096]

71 [2013] for lower bounds), for any sub-Gaussian reward distribution (not necessarily Gaussian) with largest mean µ∗ and gap ∆. [sent-119, score-0.182]

72 Play It at random from pt, where pt(1) = c exp( −(1/2) Σ_{s=1}^{T1(t−1)} (µ∗ − X1,s)² − (1/2) Σ_{s=1}^{T2(t−1)} (µ∗ − ∆ − X2,s)² ), pt(2) = c exp( −(1/2) Σ_{s=1}^{T1(t−1)} (µ∗ − ∆ − X1,s)² − (1/2) Σ_{s=1}^{T2(t−1)} (µ∗ − X2,s)² ), and c > 0 is such that pt(1) + pt(2) = 1. [sent-124, score-1.139]
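
A sketch of this two-armed policy (Figure 1) as reconstructed above, computed in log-space for numerical stability; the function names are ours and the code is illustrative only.

```python
import numpy as np

def figure1_probs(x1, x2, mu_star, delta):
    """p_t(1), p_t(2) for the two candidate environments, given the rewards x1, x2
    observed so far on arms 1 and 2 (arrays); normalized so they sum to one."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    # environment theta_1: arm 1 has mean mu*, arm 2 has mean mu* - Delta
    log_p1 = -0.5 * np.sum((mu_star - x1) ** 2) - 0.5 * np.sum((mu_star - delta - x2) ** 2)
    # environment theta_2: the two means are swapped
    log_p2 = -0.5 * np.sum((mu_star - delta - x1) ** 2) - 0.5 * np.sum((mu_star - x2) ** 2)
    m = max(log_p1, log_p2)
    w = np.exp(np.array([log_p1, log_p2]) - m)
    return w / w.sum()          # the normalizer plays the role of c

def figure1_select(x1, x2, mu_star, delta, rng):
    """Play I_t = 1 with probability p_t(1), else I_t = 2."""
    p = figure1_probs(x1, x2, mu_star, delta)
    return 1 if rng.random() < p[0] else 2
```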

73 Theorem 2 The policy of Figure 1 has regret bounded as Rn ≤ ∆ + 578/∆, uniformly in n. [sent-126, score-0.143]

74 Figure 2: Empirical comparison of the policy of Figure 1 and Policy 1 of Bubeck et al. [2013]. [sent-131, score-0.207]

75 Figure 2 shows an empirical comparison of the policy of Figure 1 with Policy 1 of Bubeck et al. [sent-134, score-0.207]

76 Note in particular that a regret bound of order 16/∆ was proved for the latter algorithm and the (limited) numerical simulation presented here suggests that Thompson Sampling outperforms this strategy. [sent-136, score-0.295]

77 Proof. Without loss of generality we assume that arm 1 is the optimal arm, that is µ1 = µ∗ and µ2 = µ∗ − ∆. [sent-137, score-0.236]

78 Note that large (positive) values of γ1,s or γ2,s might mislead the algorithm into bad decisions, and we will need to control what happens in various regimes for these γ coefficients. [sent-139, score-0.051]

79 This first step will be useful in the rest of the analysis: it shows how the probability ratio of a bad pull over a good pull evolves as a function of the γ coefficients introduced above. [sent-142, score-0.098]

80 We decompose the regret Rn as follows: Rn/∆ = 1 + E Σ_{t=3}^{n} 1{It = 2} = 1 + E Σ_{t=3}^{n} 1{γ2,T2(t−1) > ∆/4, It = 2} + E Σ_{t=3}^{n} 1{γ2,T2(t−1) ≤ ∆/4, γ1,T1(t−1) ≤ ∆/4, It = 2} + E Σ_{t=3}^{n} 1{γ2,T2(t−1) ≤ ∆/4, γ1,T1(t−1) > ∆/4, It = 2}. [sent-145, score-0.313]

81 We use Hoeffding's inequality to control the first term: E Σ_{t=3}^{n} 1{γ2,T2(t−1) > ∆/4, It = 2} ≤ E Σ_{s=1}^{n} 1{γ2,s > ∆/4} ≤ Σ_{s=1}^{n} exp(−s∆²/32) ≤ 32/∆². [sent-146, score-0.206]

82 For the second term, using the rewriting of Step 1 as an upper bound on pt(2), one obtains: E Σ_{t=3}^{n} 1{γ2,T2(t−1) ≤ ∆/4, γ1,T1(t−1) ≤ ∆/4, It = 2} = E Σ_{t=3}^{n} pt(2) 1{γ2,T2(t−1) ≤ ∆/4, γ1,T1(t−1) ≤ ∆/4} ≤ Σ_{t=3}^{n} exp(−t∆²/4) ≤ 4/∆². [sent-147, score-0.517]

83 The third term is more difficult to control, and we further decompose the corresponding event as follows: {γ2,T2(t−1) ≤ ∆/4, γ1,T1(t−1) > ∆/4, It = 2} ⊂ {γ1,T1(t−1) > ∆/4, T1(t − 1) > t/4} ∪ {γ2,T2(t−1) ≤ ∆/4, It = 2, T1(t − 1) ≤ t/4}. [sent-148, score-0.043]

84 The cumulative probability of the first event in the above decomposition is easy to control thanks to Hoeffding's maximal inequality (see footnote 2), which states that for any m ≥ 1 and x > 0 one has P(∃ 1 ≤ s ≤ m s.t. s γ1,s ≥ x) ≤ exp(−x²/(2m)). [sent-149, score-0.079]

85 Hence Σ_{t=3}^{n} P( γ1,T1(t−1) > ∆/4, T1(t − 1) > t/4 ) ≤ Σ_{t=3}^{n} exp(−t∆²/512) ≤ 512/∆². [sent-154, score-0.132]

86 It only remains to control the term E Σ_{t=3}^{n} 1{γ2,T2(t−1) ≤ ∆/4, It = 2, T1(t − 1) ≤ t/4} = E Σ_{t=3}^{n} pt(2) 1{γ2,T2(t−1) ≤ ∆/4, T1(t − 1) ≤ t/4} ≤ E Σ_{t=3}^{n} exp( −t∆²/4 + ∆ max_{1≤s≤t/4} s γ1,s ), where the last inequality follows from Step 1. [sent-155, score-0.381]

87 By integrating the deviations and using again Hoeffding's maximal inequality one obtains E exp( ∆ max_{1≤s≤t/4} s γ1,s ) ≤ 1 + ∫_{1}^{+∞} P( max_{1≤s≤t/4} s γ1,s ≥ (log x)/∆ ) dx ≤ 1 + ∫_{1}^{+∞} exp( −2(log x)²/(∆² t) ) dx. [sent-158, score-0.507]

88 Now, a straightforward computation gives Σ_{t=3}^{n} exp(−t∆²/4) ( 1 + ∫_{1}^{+∞} exp(−2(log x)²/(∆² t)) dx ) ≤ Σ_{t=3}^{n} exp(−t∆²/4) ( 1 + √(π∆²t/2) exp(t∆²/8) ) ≤ 4/∆² + (16√π)/∆² ∫_{0}^{+∞} √u exp(−u) du ≤ 30/∆². [sent-159, score-0.418]

89 This concludes the proof by putting it together with the results of the previous step. [sent-160, score-0.458]

90 Footnote 2: It is an easy exercise to verify that Azuma–Hoeffding holds for martingale differences with sub-Gaussian increments, which implies Hoeffding's maximal inequality for sub-Gaussian distributions. [sent-161, score-0.071]

91 4 Optimal strategy for the BPR setting inspired by Thompson Sampling. In this section we consider the general BPR setting. [sent-162, score-0.075]

92 That is, the reward distributions are sub-Gaussian (they satisfy E e^{λ(X−µ)} ≤ e^{λ²/2} for all λ ∈ R), one reward distribution has mean µ∗, and all the other means are smaller than µ∗ − ε, where µ∗ and ε are known values. [sent-163, score-0.354]

93 Similarly to the previous section we assume that the reward distributions are Gaussian with variance 1 for the derivation of the Thompson Sampling strategy (but we do not make this assumption for the analysis of the resulting algorithm). [sent-164, score-0.237]

94 Assuming a uniform prior over the index of the best arm, and a prior λ over the mean of a suboptimal arm, one obtains by Bayes' rule that the probability density function of the posterior is given by: dπt(θ) ∝ exp( −(1/2) Σ_{j=1}^{K} Σ_{s=1}^{Tj(t−1)} (Xj,s − θj)² ) Π_{j=1, j≠i∗(θ)}^{K} dλ(θj). [sent-168, score-0.544]

95 Now remark that with Thompson Sampling arm i is played at time t if and only if θ(t) ∈ Θi. [sent-169, score-0.262]

96 Play It at random from pt, where pt(i) = c exp( −(1/3) Σ_{s=1}^{Ti(t−1)} (Xi,s − µ∗)² ) / ∫_{−∞}^{µ∗−ε} exp( −(1/3) Σ_{s=1}^{Ti(t−1)} (Xi,s − v)² ) dv, and c > 0 is such that Σ_{i=1}^{K} pt(i) = 1. [sent-176, score-0.789]
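
A sketch of this K-armed policy (Figure 3) as reconstructed above; the 1/3 scaling and the exact form of the denominator are our reading of the garbled display, and the integral is approximated numerically with scipy, so treat this as illustrative only.

```python
import numpy as np
from scipy.integrate import quad

def figure3_probs(rewards_by_arm, mu_star, eps):
    """Sampling probabilities p_t(i), i = 1..K, of the reconstructed Figure 3 policy.
    rewards_by_arm: list of 1-D arrays, the rewards observed so far on each arm.
    Assumes every arm has already been pulled at least once (e.g. in an
    initialization round), otherwise the denominator integral diverges."""
    scores = []
    for x in rewards_by_arm:
        x = np.asarray(x, dtype=float)
        num = np.exp(-np.sum((x - mu_star) ** 2) / 3.0)
        # integral over candidate suboptimal means v <= mu* - eps
        den, _ = quad(lambda v: np.exp(-np.sum((x - v) ** 2) / 3.0),
                      -np.inf, mu_star - eps)
        scores.append(num / max(den, 1e-300))
    p = np.asarray(scores)
    return p / p.sum()          # c is the normalizer so that sum_i p_t(i) = 1
```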

97 The following theorem shows that this policy attains the best known performance for the BPR setting, shaving off a log-log term in the regret bound of the BPR policy. [sent-178, score-0.507]

98 Theorem 3 The policy of Figure 3 has regret bounded as Rn ≤ uniformly in n. [sent-179, score-0.504]

99 Analysis of Thompson sampling for the multi-armed bandit problem. [sent-184, score-0.195]

100 Regret analysis of stochastic and nonstochastic multi-armed bandit problems. [sent-213, score-0.1]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('thompson', 0.509), ('bpr', 0.393), ('bubeck', 0.319), ('regret', 0.27), ('arm', 0.236), ('pt', 0.175), ('policy', 0.174), ('reward', 0.16), ('exp', 0.132), ('brn', 0.131), ('du', 0.118), ('agrawal', 0.102), ('bandit', 0.1), ('sampling', 0.095), ('hoeffding', 0.091), ('audibert', 0.091), ('goyal', 0.085), ('nu', 0.085), ('bit', 0.081), ('russo', 0.08), ('pulls', 0.08), ('argmaxi', 0.069), ('agent', 0.069), ('ti', 0.068), ('attains', 0.063), ('arms', 0.062), ('rn', 0.061), ('prior', 0.058), ('log', 0.057), ('gittins', 0.052), ('moss', 0.052), ('bayesian', 0.049), ('bi', 0.047), ('ks', 0.047), ('ebi', 0.046), ('inequality', 0.044), ('strategy', 0.043), ('decompose', 0.043), ('xit', 0.043), ('rewards', 0.043), ('roy', 0.042), ('unimprovable', 0.04), ('interested', 0.038), ('obtains', 0.038), ('extraneous', 0.038), ('pull', 0.036), ('rewriting', 0.035), ('ku', 0.035), ('distributions', 0.034), ('ee', 0.034), ('ht', 0.034), ('beta', 0.034), ('bounds', 0.034), ('integration', 0.033), ('bounded', 0.033), ('deviations', 0.033), ('et', 0.033), ('inspired', 0.032), ('alt', 0.032), ('draws', 0.032), ('nk', 0.032), ('bulletin', 0.031), ('cu', 0.031), ('control', 0.03), ('chapelle', 0.03), ('rescaled', 0.03), ('conditionally', 0.029), ('round', 0.029), ('colt', 0.028), ('kaufmann', 0.028), ('concludes', 0.028), ('uniformly', 0.027), ('maximal', 0.027), ('auer', 0.026), ('played', 0.026), ('step', 0.026), ('tj', 0.026), ('proved', 0.025), ('attain', 0.025), ('external', 0.025), ('sense', 0.025), ('measurable', 0.024), ('putting', 0.024), ('integrated', 0.024), ('view', 0.024), ('proof', 0.024), ('rounds', 0.024), ('identically', 0.023), ('tower', 0.023), ('perchet', 0.023), ('drop', 0.023), ('priors', 0.023), ('spirit', 0.023), ('posterior', 0.022), ('gap', 0.022), ('dx', 0.022), ('integrating', 0.022), ('thanks', 0.022), ('surge', 0.021), ('mislead', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling

Author: Sebastien Bubeck, Che-Yu Liu

Abstract: We consider the stochastic multi-armed bandit problem with a prior distribution on the reward distributions. We are interested in studying prior-free and prior-dependent regret bounds, very much in the same spirit as the usual distribution-free and distribution-dependent bounds for the non-Bayesian stochastic bandit. We first show that Thompson Sampling attains an optimal prior-free bound in the sense that for any prior distribution its Bayesian regret is bounded from above by 14√(nK). This result is unimprovable in the sense that there exists a prior distribution such that any algorithm has a Bayesian regret bounded from below by (1/20)√(nK). We also study the case of priors for the setting of Bubeck et al. [2013] (where the optimal mean is known as well as a lower bound on the smallest gap) and we show that in this case the regret of Thompson Sampling is in fact uniformly bounded over time, thus showing that Thompson Sampling can greatly take advantage of the nice properties of these priors.

2 0.3731927 330 nips-2013-Thompson Sampling for 1-Dimensional Exponential Family Bandits

Author: Nathaniel Korda, Emilie Kaufmann, Remi Munos

Abstract: Thompson Sampling has been demonstrated in many complex bandit models, however the theoretical guarantees available for the parametric multi-armed bandit are still limited to the Bernoulli case. Here we extend them by proving asymptotic optimality of the algorithm using the Jeffreys prior for 1-dimensional exponential family bandits. Our proof builds on previous work, but also makes extensive use of closed forms for Kullback-Leibler divergence and Fisher information (through the Jeffreys prior) available in an exponential family. This allows us to give a finite time exponential concentration inequality for posterior distributions on exponential families that may be of interest in its own right. Moreover our analysis covers some distributions for which no optimistic algorithm has yet been proposed, including heavy-tailed exponential families.

3 0.3409335 338 nips-2013-Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards

Author: Thomas Bonald, Alexandre Proutiere

Abstract: We consider an infinite-armed bandit problem with Bernoulli rewards. The mean rewards are independent, uniformly distributed over [0, 1]. Rewards 0 and 1 are referred to as a success and a failure, respectively. We propose a novel algorithm where the decision to exploit any arm is based on two successive targets, namely, the total number of successes until the first failure and until the first m failures, respectively, where m is a fixed parameter. This two-target algorithm achieves a long-term average regret in √(2n) for a large parameter m and a known time horizon n. This regret is optimal and strictly less than the regret achieved by the best known algorithms, which is in 2√n. The results are extended to any mean-reward distribution whose support contains 1 and to unknown time horizons. Numerical experiments show the performance of the algorithm for finite time horizons.

4 0.32434514 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration

Author: Dan Russo, Benjamin Van Roy

Abstract: This paper considers the sample complexity of the multi-armed bandit with dependencies among the arms. Some of the most successful algorithms for this problem use the principle of optimism in the face of uncertainty to guide exploration. The clearest example of this is the class of upper confidence bound (UCB) algorithms, but recent work has shown that a simple posterior sampling algorithm, sometimes called Thompson sampling, can be analyzed in the same manner as optimistic approaches. In this paper, we develop a regret bound that holds for both classes of algorithms. This bound applies broadly and can be specialized to many model classes. It depends on a new notion we refer to as the eluder dimension, which measures the degree of dependence among action rewards. Compared to UCB algorithm regret bounds for specific model classes, our general bound matches the best available for linear models and is stronger than the best available for generalized linear models. 1

5 0.23838337 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search

Author: Aijun Bai, Feng Wu, Xiaoping Chen

Abstract: Monte-Carlo tree search (MCTS) has been drawing great interest in recent years for planning and learning under uncertainty. One of the key challenges is the trade-off between exploration and exploitation. To address this, we present a novel approach for MCTS using Bayesian mixture modeling and inference based Thompson sampling and apply it to the problem of online planning in MDPs. Our algorithm, named Dirichlet-NormalGamma MCTS (DNG-MCTS), models the uncertainty of the accumulated reward for actions in the search tree as a mixture of Normal distributions. We perform inferences on the mixture in Bayesian settings by choosing conjugate priors in the form of combinations of Dirichlet and NormalGamma distributions and select the best action at each decision node using Thompson sampling. Experimental results confirm that our algorithm advances the state-of-the-art UCT approach with better values on several benchmark problems. 1

6 0.22727264 95 nips-2013-Distributed Exploration in Multi-Armed Bandits

7 0.18726836 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes

8 0.1834221 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions

9 0.17039888 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling

10 0.16605286 292 nips-2013-Sequential Transfer in Multi-armed Bandit with Finite Set of Models

11 0.1635557 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents

12 0.14943182 124 nips-2013-Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting

13 0.14862807 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result

14 0.14646332 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods

15 0.14267036 137 nips-2013-High-Dimensional Gaussian Process Bandits

16 0.13274328 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries

17 0.13167168 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings

18 0.12894784 325 nips-2013-The Pareto Regret Frontier

19 0.12266271 89 nips-2013-Dimension-Free Exponentiated Gradient

20 0.12219607 26 nips-2013-Adaptive Market Making via Online Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.216), (1, -0.35), (2, 0.095), (3, -0.096), (4, -0.014), (5, -0.094), (6, 0.208), (7, -0.149), (8, -0.086), (9, -0.041), (10, 0.05), (11, 0.24), (12, -0.132), (13, -0.055), (14, 0.023), (15, -0.004), (16, -0.073), (17, 0.047), (18, -0.038), (19, 0.087), (20, 0.1), (21, -0.002), (22, 0.064), (23, -0.022), (24, 0.035), (25, 0.012), (26, 0.025), (27, 0.007), (28, 0.002), (29, 0.014), (30, 0.024), (31, -0.002), (32, -0.04), (33, 0.035), (34, -0.086), (35, 0.043), (36, 0.067), (37, -0.066), (38, 0.014), (39, -0.027), (40, 0.004), (41, -0.013), (42, 0.003), (43, -0.04), (44, -0.04), (45, -0.046), (46, 0.031), (47, 0.018), (48, 0.067), (49, 0.076)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95186895 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling

Author: Sebastien Bubeck, Che-Yu Liu

Abstract: We consider the stochastic multi-armed bandit problem with a prior distribution on the reward distributions. We are interested in studying prior-free and prior-dependent regret bounds, very much in the same spirit as the usual distribution-free and distribution-dependent bounds for the non-Bayesian stochastic bandit. We first show that Thompson Sampling attains an optimal prior-free bound in the sense that for any prior distribution its Bayesian regret is bounded from above by 14√(nK). This result is unimprovable in the sense that there exists a prior distribution such that any algorithm has a Bayesian regret bounded from below by (1/20)√(nK). We also study the case of priors for the setting of Bubeck et al. [2013] (where the optimal mean is known as well as a lower bound on the smallest gap) and we show that in this case the regret of Thompson Sampling is in fact uniformly bounded over time, thus showing that Thompson Sampling can greatly take advantage of the nice properties of these priors.

2 0.8675229 338 nips-2013-Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards

Author: Thomas Bonald, Alexandre Proutiere

Abstract: We consider an infinite-armed bandit problem with Bernoulli rewards. The mean rewards are independent, uniformly distributed over [0, 1]. Rewards 0 and 1 are referred to as a success and a failure, respectively. We propose a novel algorithm where the decision to exploit any arm is based on two successive targets, namely, the total number of successes until the first failure and until the first m failures, respectively, where m is a fixed parameter. This two-target algorithm achieves a long-term average regret in √(2n) for a large parameter m and a known time horizon n. This regret is optimal and strictly less than the regret achieved by the best known algorithms, which is in 2√n. The results are extended to any mean-reward distribution whose support contains 1 and to unknown time horizons. Numerical experiments show the performance of the algorithm for finite time horizons.

3 0.85552931 330 nips-2013-Thompson Sampling for 1-Dimensional Exponential Family Bandits

Author: Nathaniel Korda, Emilie Kaufmann, Remi Munos

Abstract: Thompson Sampling has been demonstrated in many complex bandit models, however the theoretical guarantees available for the parametric multi-armed bandit are still limited to the Bernoulli case. Here we extend them by proving asymptotic optimality of the algorithm using the Jeffreys prior for 1-dimensional exponential family bandits. Our proof builds on previous work, but also makes extensive use of closed forms for Kullback-Leibler divergence and Fisher information (through the Jeffreys prior) available in an exponential family. This allows us to give a finite time exponential concentration inequality for posterior distributions on exponential families that may be of interest in its own right. Moreover our analysis covers some distributions for which no optimistic algorithm has yet been proposed, including heavy-tailed exponential families.

4 0.78761941 292 nips-2013-Sequential Transfer in Multi-armed Bandit with Finite Set of Models

Author: Mohammad Gheshlaghi azar, Alessandro Lazaric, Emma Brunskill

Abstract: Learning from prior tasks and transferring that experience to improve future performance is critical for building lifelong learning agents. Although results in supervised and reinforcement learning show that transfer may significantly improve the learning performance, most of the literature on transfer is focused on batch learning tasks. In this paper we study the problem of sequential transfer in online learning, notably in the multi–armed bandit framework, where the objective is to minimize the total regret over a sequence of tasks by transferring knowledge from prior tasks. We introduce a novel bandit algorithm based on a method-of-moments approach for estimating the possible tasks and derive regret bounds for it. 1

5 0.67212182 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration

Author: Dan Russo, Benjamin Van Roy

Abstract: This paper considers the sample complexity of the multi-armed bandit with dependencies among the arms. Some of the most successful algorithms for this problem use the principle of optimism in the face of uncertainty to guide exploration. The clearest example of this is the class of upper confidence bound (UCB) algorithms, but recent work has shown that a simple posterior sampling algorithm, sometimes called Thompson sampling, can be analyzed in the same manner as optimistic approaches. In this paper, we develop a regret bound that holds for both classes of algorithms. This bound applies broadly and can be specialized to many model classes. It depends on a new notion we refer to as the eluder dimension, which measures the degree of dependence among action rewards. Compared to UCB algorithm regret bounds for specific model classes, our general bound matches the best available for linear models and is stronger than the best available for generalized linear models. 1

6 0.64420694 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling

7 0.64254302 95 nips-2013-Distributed Exploration in Multi-Armed Bandits

8 0.61223012 270 nips-2013-Regret based Robust Solutions for Uncertain Markov Decision Processes

9 0.57309234 124 nips-2013-Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting

10 0.52808416 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search

11 0.50403619 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes

12 0.48798981 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems

13 0.44944939 137 nips-2013-High-Dimensional Gaussian Process Bandits

14 0.44648054 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting

15 0.44003174 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents

16 0.43851495 159 nips-2013-Learning Prices for Repeated Auctions with Strategic Buyers

17 0.42504057 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries

18 0.41904581 112 nips-2013-Estimation Bias in Multi-Armed Bandit Algorithms for Search Advertising

19 0.40630066 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search

20 0.38994893 7 nips-2013-A Gang of Bandits


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(16, 0.013), (33, 0.602), (34, 0.079), (49, 0.019), (56, 0.122), (85, 0.023), (89, 0.018), (93, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99706405 306 nips-2013-Speeding up Permutation Testing in Neuroimaging

Author: Chris Hinrichs, Vamsi Ithapu, Qinyuan Sun, Sterling C. Johnson, Vikas Singh

Abstract: Multiple hypothesis testing is a significant problem in nearly all neuroimaging studies. In order to correct for this phenomenon, we require a reliable estimate of the Family-Wise Error Rate (FWER). The well known Bonferroni correction method, while simple to implement, is quite conservative, and can substantially under-power a study because it ignores dependencies between test statistics. Permutation testing, on the other hand, is an exact, non-parametric method of estimating the FWER for a given α-threshold, but for acceptably low thresholds the computational burden can be prohibitive. In this paper, we show that permutation testing in fact amounts to populating the columns of a very large matrix P. By analyzing the spectrum of this matrix, under certain conditions, we see that P has a low-rank plus a low-variance residual decomposition which makes it suitable for highly sub-sampled — on the order of 0.5% — matrix completion methods. Based on this observation, we propose a novel permutation testing methodology which offers a large speedup, without sacrificing the fidelity of the estimated FWER. Our evaluations on four different neuroimaging datasets show that a computational speedup factor of roughly 50× can be achieved while recovering the FWER distribution up to very high accuracy. Further, we show that the estimated α-threshold is also recovered faithfully, and is stable.

2 0.99559039 217 nips-2013-On Poisson Graphical Models

Author: Eunho Yang, Pradeep Ravikumar, Genevera I. Allen, Zhandong Liu

Abstract: Undirected graphical models, such as Gaussian graphical models, Ising, and multinomial/categorical graphical models, are widely used in a variety of applications for modeling distributions over a large number of variables. These standard instances, however, are ill-suited to modeling count data, which are increasingly ubiquitous in big-data settings such as genomic sequencing data, user-ratings data, spatial incidence data, climate studies, and site visits. Existing classes of Poisson graphical models, which arise as the joint distributions that correspond to Poisson distributed node-conditional distributions, have a major drawback: they can only model negative conditional dependencies for reasons of normalizability given its infinite domain. In this paper, our objective is to modify the Poisson graphical model distribution so that it can capture a rich dependence structure between count-valued variables. We begin by discussing two strategies for truncating the Poisson distribution and show that only one of these leads to a valid joint distribution. While this model can accommodate a wider range of conditional dependencies, some limitations still remain. To address this, we investigate two additional novel variants of the Poisson distribution and their corresponding joint graphical model distributions. Our three novel approaches provide classes of Poisson-like graphical models that can capture both positive and negative conditional dependencies between count-valued variables. One can learn the graph structure of our models via penalized neighborhood selection, and we demonstrate the performance of our methods by learning simulated networks as well as a network from microRNA-sequencing data. 1

3 0.9954502 88 nips-2013-Designed Measurements for Vector Count Data

Author: Liming Wang, David Carlson, Miguel Rodrigues, David Wilcox, Robert Calderbank, Lawrence Carin

Abstract: We consider design of linear projection measurements for a vector Poisson signal model. The projections are performed on the vector Poisson rate, X ∈ R^n_+, and the observed data are a vector of counts, Y ∈ Z^m_+. The projection matrix is designed by maximizing mutual information between Y and X, I(Y; X). When there is a latent class label C ∈ {1, . . . , L} associated with X, we consider the mutual information with respect to Y and C, I(Y; C). New analytic expressions for the gradient of I(Y; X) and I(Y; C) are presented, with gradient performed with respect to the measurement matrix. Connections are made to the more widely studied Gaussian measurement model. Example results are presented for compressive topic modeling of a document corpus (word counting), and hyperspectral compressive sensing for chemical classification (photon counting).

4 0.99071634 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model

Author: Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, Tomas Mikolov

Abstract: Modern visual recognition systems are often limited in their ability to scale to large numbers of object categories. This limitation is in part due to the increasing difficulty of acquiring sufficient training data in the form of labeled images as the number of object categories grows. One remedy is to leverage data from other sources – such as text data – both to train visual models and to constrain their predictions. In this paper we present a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data as well as semantic information gleaned from unannotated text. We demonstrate that this model matches state-of-the-art performance on the 1000-class ImageNet object recognition challenge while making more semantically reasonable errors, and also show that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training. Semantic knowledge improves such zero-shot predictions achieving hit rates of up to 18% across thousands of novel labels never seen by the visual model. 1

5 0.98750079 160 nips-2013-Learning Stochastic Feedforward Neural Networks

Author: Yichuan Tang, Ruslan Salakhutdinov

Abstract: Multilayer perceptrons (MLPs) or neural networks are popular models used for nonlinear regression and classification tasks. As regressors, MLPs model the conditional distribution of the predictor variables Y given the input variables X. However, this predictive distribution is assumed to be unimodal (e.g. Gaussian). For tasks involving structured prediction, the conditional distribution should be multi-modal, resulting in one-to-many mappings. By using stochastic hidden variables rather than deterministic ones, Sigmoid Belief Nets (SBNs) can induce a rich multimodal distribution in the output space. However, previously proposed learning algorithms for SBNs are not efficient and unsuitable for modeling real-valued data. In this paper, we propose a stochastic feedforward network with hidden layers composed of both deterministic and stochastic variables. A new Generalized EM training procedure using importance sampling allows us to efficiently learn complicated conditional distributions. Our model achieves superior performance on synthetic and facial expressions datasets compared to conditional Restricted Boltzmann Machines and Mixture Density Networks. In addition, the latent features of our model improves classification and can learn to generate colorful textures of objects. 1

6 0.98670048 46 nips-2013-Bayesian Estimation of Latently-grouped Parameters in Undirected Graphical Models

same-paper 7 0.98514497 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling

8 0.98315257 212 nips-2013-Non-Uniform Camera Shake Removal Using a Spatially-Adaptive Sparse Penalty

9 0.97894877 222 nips-2013-On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization

10 0.94241476 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer

11 0.93103498 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors

12 0.91764742 254 nips-2013-Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms

13 0.91187227 200 nips-2013-Multi-Prediction Deep Boltzmann Machines

14 0.90920198 349 nips-2013-Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies

15 0.90710157 67 nips-2013-Conditional Random Fields via Univariate Exponential Families

16 0.90651846 315 nips-2013-Stochastic Ratio Matching of RBMs for Sparse High-Dimensional Inputs

17 0.90271378 260 nips-2013-RNADE: The real-valued neural autoregressive density-estimator

18 0.90001929 335 nips-2013-Transfer Learning in a Transductive Setting

19 0.89880508 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks

20 0.89878434 20 nips-2013-Accelerating Stochastic Gradient Descent using Predictive Variance Reduction