nips nips2013 nips2013-292 knowledge-graph by maker-knowledge-mining

292 nips-2013-Sequential Transfer in Multi-armed Bandit with Finite Set of Models


Source: pdf

Author: Mohammad Gheshlaghi Azar, Alessandro Lazaric, Emma Brunskill

Abstract: Learning from prior tasks and transferring that experience to improve future performance is critical for building lifelong learning agents. Although results in supervised and reinforcement learning show that transfer may significantly improve the learning performance, most of the literature on transfer is focused on batch learning tasks. In this paper we study the problem of sequential transfer in online learning, notably in the multi–armed bandit framework, where the objective is to minimize the total regret over a sequence of tasks by transferring knowledge from prior tasks. We introduce a novel bandit algorithm based on a method-of-moments approach for estimating the possible tasks and derive regret bounds for it. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Although results in supervised and reinforcement learning show that transfer may significantly improve the learning performance, most of the literature on transfer is focused on batch learning tasks. [sent-2, score-0.337]

2 In this paper we study the problem of sequential transfer in online learning, notably in the multi–armed bandit framework, where the objective is to minimize the total regret over a sequence of tasks by transferring knowledge from prior tasks. [sent-3, score-0.664]

3 We introduce a novel bandit algorithm based on a method-of-moments approach for estimating the possible tasks and derive regret bounds for it. [sent-4, score-0.397]

4 Recently, multi-task and transfer learning have received much attention in the supervised and reinforcement learning (RL) settings, with encouraging empirical and theoretical results (see recent surveys by Pan and Yang, 2010; Lazaric, 2011). [sent-6, score-0.187]

5 We focus on a sequential transfer scenario where an (online) learner is acting in a series of tasks drawn from a stationary distribution over a finite set of MABs. [sent-13, score-0.286]

6 Prior to learning, the model parameters of each bandit problem are not known to the learner, nor does it know the probability distribution over the bandit problems. [sent-15, score-0.294]

7 In fact, the learner may encounter the same bandit problem over and over throughout the learning, and an efficient algorithm should be able to leverage the knowledge obtained in previous tasks, when it is presented with the same problem again. [sent-18, score-0.212]

8 To address this problem one can transfer the estimates of all the possible models from prior tasks to the current one. [sent-19, score-0.312]

9 , 2002) is able to efficiently exploit this prior knowledge and reduce the regret through tasks (Sec. [sent-21, score-0.287]

10 We also prove that the new algorithm is guaranteed to perform as well as UCB in early episodes, thus avoiding any negative transfer effect, and then to approach the performance of the ideal case when the models are all known in advance (Sec. [sent-28, score-0.194]

11 , 2013, 2012b) and extend it to the multi-task bandit setting: we prove that RTP provides a consistent estimate of the means of all arms (for all models) as long as they are pulled at least three times per task, and we prove sample complexity bounds for it (Sec. [sent-32, score-0.597]

12 2 Preliminaries. We consider a stochastic MAB problem defined by a set of arms A = {1, . [sent-39, score-0.32]

13 , K}, where each arm i ∈ A is characterized by a distribution ν_i, and the samples (rewards) observed from each arm are independent and identically distributed. [sent-42, score-0.464]

14 We denote the mean of an arm i, the best arm, and the best value of a model θ ∈ Θ respectively by μ_i(θ), i*(θ), and μ*(θ). [sent-47, score-0.235]

15 We define the arm gap of an arm i for a model θ as Δ_i(θ) = μ*(θ) − μ_i(θ), while the model gap for an arm i between two models θ and θ' is defined as Γ_i(θ, θ') = |μ_i(θ) − μ_i(θ')|. [sent-48, score-0.781]
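
As a made-up numerical illustration of the two gaps (not taken from the paper): with K = 2 arms and two models whose means are μ(θ) = (0.9, 0.5) and μ(θ') = (0.6, 0.8), the arm gap of arm 2 under θ is Δ_2(θ) = 0.9 − 0.5 = 0.4, while the model gap of arm 1 between θ and θ' is Γ_1(θ, θ') = |0.9 − 0.6| = 0.3.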

16 We also assume that arm rewards are bounded in [0, 1]. [sent-49, score-0.243]

17 We consider the sequential transfer setting where at each episode j the learner interacts with a task θ̄_j, drawn from a distribution ρ over Θ, for n steps. [sent-50, score-0.355]

18 Let X ∈ R^K be a random realization of the rewards of all arms from a random model. [sent-53, score-0.342]

19 3 Multi-arm Bandit with Finite Models. Before considering the transfer problem, we show that a simple variation to UCB allows us (pseudo-code: Require: set of models Θ, number of steps n; for t = 1, . [sent-63, score-0.208]

20 1 takes as input a set of models Θ including the current (unknown) model θ̄ (pseudo-code: Pull arm I_t = i*(θ_t); Observe sample x_{I_t} and update; end for). [sent-68, score-0.279]

21 only the models whose means μ_i(θ) are compatible with the current estimates μ̂_{i,t} of the means μ_i(θ̄) of the current model, obtained by averaging. (Footnote: estimating the models involves solving a latent variable model estimation problem, for which RTP is the state-of-the-art.) [sent-70, score-0.177]

22 Notice that it is enough that one arm does not satisfy the compatibility condition to discard a model θ. [sent-73, score-0.235]

23 Among all the models in Θ_t, mUCB first selects the model with the largest optimal value and then it pulls its corresponding optimal arm. [sent-74, score-0.193]

24 This choice is coherent with the optimism in the face of uncertainty principle used in UCB-based algorithms, since mUCB always pulls the optimal arm corresponding to the optimistic model compatible with the current estimates μ̂_{i,t}. [sent-75, score-0.52]
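
The following Python sketch illustrates the rule just described: discard every model whose means are incompatible with the current per-arm estimates, then pull the optimal arm of the most optimistic surviving model. It is not the authors' code; the confidence width, the constant delta, and the fallback branch are illustrative assumptions.

```python
import numpy as np

def mucb_step(models, mu_hat, counts, t, delta=1e-2):
    """One decision of the mUCB rule sketched above.

    models : (m, K) array; row theta holds the arm means mu_i(theta)
    mu_hat : (K,) current empirical means of the arms
    counts : (K,) number of pulls of each arm so far (assumed >= 1)
    Returns the index of the arm to pull at step t.
    """
    # Hoeffding-style confidence width for each arm estimate (illustrative constant).
    eps = np.sqrt(np.log(t / delta) / (2.0 * counts))

    # Compatibility test: one incompatible arm is enough to discard a model.
    compatible = np.all(np.abs(models - mu_hat) <= eps, axis=1)
    active = np.flatnonzero(compatible)
    if active.size == 0:
        # Fallback (should not trigger with high probability): explore the least-pulled arm.
        return int(np.argmin(counts))

    # Optimism: among compatible models, take the one with the largest optimal value
    # and pull its optimal arm.
    best_model = active[np.argmax(models[active].max(axis=1))]
    return int(np.argmax(models[best_model]))
```

Calling mucb_step inside a loop that updates mu_hat and counts after each observed reward mirrors the structure of the pseudo-code fragments interleaved above.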

25 We show that mUCB incurs a regret which is never worse than that of UCB and is often significantly smaller. [sent-76, score-0.217]

26 We denote the set of arms which are optimal for at least one model in a set Θ' as A*(Θ') = {i ∈ A : ∃θ ∈ Θ' : i*(θ) = i}. [sent-77, score-0.354]

27 The set of models for which the arms in A' are optimal is Θ(A') = {θ ∈ Θ : ∃i ∈ A' : i*(θ) = i}. [sent-78, score-0.384]

28 The set of optimistic models for a given model θ̄ is Θ_+ = {θ ∈ Θ : μ*(θ) ≥ μ*(θ̄)}, and their corresponding optimal arms are A_+ = A*(Θ_+). [sent-79, score-0.518]

29 The following theorem bounds the expected regret (similar bounds hold in high probability). [sent-80, score-0.206]

30 The UCB algorithm incurs a regret E[R_n(UCB)] ≤ O(Σ_{i∈A} log n / Δ_i(θ̄)) ≤ O(K log n / min_i Δ_i(θ̄)). [sent-87, score-0.212]

31 (By contrast, the mUCB bound sums only over i ∈ A_+, with the model gaps min_{θ∈Θ_{+,i}} Γ_i(θ, θ̄) in the denominators.) This result suggests that mUCB tends to discard all the models in Θ_+ from the most optimistic down to the actual model θ̄ which, with high probability, is never discarded. [sent-91, score-0.249]

32 As a result, even if other models are still in Θ_t, the optimal arm of θ̄ is pulled until the end. [sent-92, score-0.37]

33 This significantly reduces the set of arms which are actually pulled by mUCB, and the previous bound only depends on the number of arms in A_+, which satisfies |A_+| ≤ |A*(Θ)| ≤ K. [sent-93, score-0.739]

34 Furthermore, for all arms i, the minimum model gap min_{θ∈Θ_{+,i}} Γ_i(θ, θ̄) is guaranteed to be larger than the arm gap Δ_i(θ̄) (see Lem. [sent-94, score-0.587]

35 4 Online Transfer with Unknown Models. We now consider the case when the set of models is unknown and the regret is accumulated over multiple tasks drawn from ρ (Eq. [sent-101, score-0.285]

36 We introduce tUCB (transfer-UCB), which transfers estimates of Θ whose accuracy is improved through episodes using a method-of-moments approach. [sent-103, score-0.202]

37 2 outlines the structure of our online transfer bandit algorithm tUCB (transfer-UCB). [sent-106, score-0.317]

38 The algorithm uses two sub-algorithms: the bandit algorithm umUCB (uncertain model-UCB), whose objective is to minimize the regret at each episode, and RTP (robust tensor power method), which at each episode j computes an estimate {μ̂_i^j(θ)} of the arm means of all the models. [sent-107, score-0.772]
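
A minimal skeleton of this two-level loop is sketched below. It is illustrative only: run_episode and rtp_update are hypothetical placeholders standing in for umUCB and RTP, whose actual inputs and outputs are described in the following sentences.

```python
def tucb(n_episodes, n_steps, n_arms, n_models, run_episode, rtp_update):
    """Sequential transfer loop: play each task with umUCB, then refine the
    estimated set of models with RTP using all reward samples collected so far."""
    model_estimates = None   # estimated arm means of the n_models models (None before episode 1)
    all_samples = []         # per-episode reward samples kept for the moment estimates
    for j in range(n_episodes):
        # Minimize regret within the current task given the current model estimates.
        episode_samples = run_episode(model_estimates, n_steps, n_arms)
        all_samples.append(episode_samples)
        # Re-estimate the means of all models from the samples of episodes 1..j.
        model_estimates = rtp_update(all_samples, n_models, n_arms)
    return model_estimates
```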

39 Once … Require: number of arms K, number of models m, constant C(θ). [sent-112, score-0.364]

40 steps n; Pull each arm three times; for t = 3K + 1, . [sent-117, score-0.235]

41 Require: samples R ∈ R^{j×n}, number of models m and arms K, episode j; Estimate the second and third moments M̂_2 and M̂_3 using the reward samples from R (Eq. [sent-122, score-0.566]

42 Once the active set is computed, the algorithm computes an upper-confidence bound on the value of each arm i for each model θ and returns the best arm for the most optimistic model. [sent-128, score-0.605]
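
In code, this selection step could be sketched roughly as follows. This is an approximation, not the paper's rule: the exact confidence terms for the model estimates and the within-episode samples are more involved, and mu_model/eps_model stand for the RTP estimates and an assumed bound on their accuracy.

```python
import numpy as np

def umucb_select(active_models, mu_model, eps_model, mu_hat, counts, t):
    """UCB on the value of every arm under every model in the active set,
    then play the best arm of the most optimistic model.

    active_models : indices of models still compatible with the current task
    mu_model      : (m, K) arm means estimated by RTP for each model
    eps_model     : scalar bound on the accuracy of the model estimates
    mu_hat, counts: (K,) empirical means and pull counts within this episode
    """
    width = np.sqrt(np.log(t) / (2.0 * np.maximum(counts, 1)))
    best_arm, best_value = 0, -np.inf
    for theta in active_models:
        # Upper-confidence value of each arm under model theta: the tighter of the
        # model-based and the sample-based upper bounds.
        ucb = np.minimum(mu_model[theta] + eps_model, mu_hat + width)
        arm = int(np.argmax(ucb))
        if ucb[arm] > best_value:
            best_arm, best_value = arm, ucb[arm]
    return best_arm
```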

43 4) updates the estimates of the model means μ̂^j(θ) = {μ̂_i^j(θ)}_i ∈ R^K using the samples obtained from each arm i. [sent-132, score-0.311]

44 At the beginning of each task, umUCB pulls all the arms 3 times, since RTP needs at least 3 samples from each arm to accurately estimate the 2nd and 3rd moments (Anandkumar et al. [sent-133, score-0.745]

45 More precisely, RTP uses all the reward samples generated up to episode j to estimate the 2nd and 3rd moments (see Sec. [sent-135, score-0.199]

46 2) as M̂_2 = (1/j) Σ_{l=1}^{j} μ_{1l} ⊗ μ_{2l} and M̂_3 = (1/j) Σ_{l=1}^{j} μ_{1l} ⊗ μ_{2l} ⊗ μ_{3l}, (4) where the vectors μ_{1l}, μ_{2l}, μ_{3l} ∈ R^K are obtained by dividing the T_{i,n}^l samples observed from arm i in episode l into three batches and taking their average (e. [sent-136, score-0.364]

47 Notice that (1/3)([μ_{1l}]_i + [μ_{2l}]_i + [μ_{3l}]_i) = μ̂_i^l, the empirical mean of arm i at the end of episode l. [sent-151, score-0.342]
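
As a small sketch of how the empirical moments in Eq. 4 could be assembled from the three per-episode batch means (illustrative code assuming the batch means are already computed; not the authors' implementation):

```python
import numpy as np

def empirical_moments(batch_means):
    """batch_means: array of shape (j, 3, K); batch_means[l, b] is the vector of
    per-arm average rewards in batch b of episode l (three batches per episode).
    Returns the empirical second moment (K x K) and third moment (K x K x K)."""
    j, _, K = batch_means.shape
    M2 = np.zeros((K, K))
    M3 = np.zeros((K, K, K))
    for l in range(j):
        m1, m2, m3 = batch_means[l]
        M2 += np.einsum('a,b->ab', m1, m2)           # mu_{1l} ⊗ mu_{2l}
        M3 += np.einsum('a,b,c->abc', m1, m2, m3)    # mu_{1l} ⊗ mu_{2l} ⊗ mu_{3l}
    return M2 / j, M3 / j
```

RTP would then decompose (a whitened version of) the estimated third moment to recover the arm means of the m models.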

48 3 Regret Analysis of umUCB. We now analyze the regret of umUCB when an estimated set of models Θ^j is provided as input. [sent-197, score-0.236]

49 At episode j, for each model θ we define the set of non-dominated arms (i. [sent-198, score-0.455]

50 Among the non-dominated arms, when the actual model is θ̄_j, the set of optimistic arms is A_+^j(θ; θ̄_j) = {i ∈ A^j(θ) : μ̂_i^j(θ) + ε^j ≥ μ*(θ̄_j)}. [sent-201, score-0.479]

51 As a result, the set of optimistic models is Θ_+^j(θ̄_j) = {θ ∈ Θ^j : A_+^j(θ; θ̄_j) ≠ ∅}. In some cases, because of the uncertainty in the model estimates, unlike in mUCB, not all the models θ ≠ θ̄_j can be discarded, not even at the end of a very long episode. [sent-203, score-0.239]

52 Among the optimistic models, the set of models that cannot be discarded is defined as Θ̃_+^j(θ̄_j) = {θ ∈ Θ_+^j(θ̄_j) : ∀i ∈ A_+^j(θ; θ̄_j), |μ̂_i^j(θ) − μ_i(θ̄_j)| ≤ ε^j}. [sent-204, score-0.202]

53 (2013), here we only report the number of pulls, and the corresponding regret bound. [sent-210, score-0.174]

54 , set of arms only proposed by models that can be discarded), and A_2^j = A^j(Θ̃_+^j(θ̄_j); θ̄_j) (i. [sent-218, score-0.314]

55 , set of arms only proposed by models that cannot be discarded). [sent-220, score-0.364]

56 The previous corollary states that arms which cannot be optimal for any optimistic model (i. [sent-221, score-0.474]

57 , the optimistic non-dominated arms) are never pulled by umUCB, which focuses only on arms i ∈ A^j(Θ_+^j(θ̄_j); θ̄_j). [sent-223, score-0.549]

58 , i ∈ A_1^j) are potentially pulled less than by UCB, while the remaining arms, which are optimal for the models that cannot be discarded (i. [sent-226, score-0.187]

59 Similar to mUCB, umUCB first pulls the arms that are more optimistic until either the active set Θ_t^j changes or they are no longer optimistic (because of the evidence from the actual samples). [sent-229, score-0.68]

60 We are now ready to derive the per-episode regret of umUCB. [sent-230, score-0.174]

61 If umUCB is run for n steps on the set of models Θ^j estimated by RTP after j episodes with δ = 1/n, and the actual model is θ̄_j, then its expected regret (w. [sent-232, score-0.432]

62 the random realization in episode j and conditional on θ̄_j) is E[R_j] ≤ K plus two sums, over i ∈ A_1^j and i ∈ A_2^j, of log(2mKn³)-scaled terms that depend on the inverse gaps min_{θ∈Θ_{+,i}^j} Γ_i(θ, θ̄_j) and Δ_i(θ̄_j). [sent-235, score-0.227]

63 The transfer of knowledge introduces a bias in the learning process which is often beneficial. [sent-237, score-0.178]

64 3 we showed that mUCB exploits the knowledge of Θ to focus on a restricted set of arms, which are pulled less than by UCB. [sent-243, score-0.433]

65 3, umUCB focuses on arms in A_1^j ∪ A_2^j, which is potentially a smaller set than A. [sent-250, score-0.32]

66 Furthermore, the number of pulls to arms in A_1^j is smaller than for UCB whenever the estimated model gap Γ̂_i(θ; θ̄_j) is bigger than the arm gap Δ_i(θ̄_j). [sent-251, score-0.47]

67 , Θ̃_+^j(θ̄_j) ≡ Θ_+(θ̄_j)) and all the optimistic models have only optimal arms (i. [sent-255, score-0.504]

68 , for any θ ∈ Θ_+ the set of non-dominated optimistic arms is A_+(θ; θ̄_j) = {i*(θ)}), which corresponds to A_1^j ≡ A*(Θ_+(θ̄_j)) and A_2^j ≡ {i*(θ̄_j)}, which matches the condition of mUCB. [sent-257, score-0.44]

69 For instance, for any model θ, in order to have A*(θ) = {i*(θ)}, for any arm i ≠ i*(θ) we need that μ̂_i^j(θ) + ε^j ≤ μ̂_{i*(θ)}^j(θ) − ε^j. [sent-258, score-0.235]

70 episodes, all the optimistic models have only one optimal arm, independently of the actual identity of the model θ̄_j. [sent-260, score-0.444]

71 4 Regret Analysis of tUCB. Given the previous results, we derive the bound on the cumulative regret over J episodes (Eq. [sent-264, score-0.357]

72 If tUCB is run over J episodes of n steps in which the tasks θ̄_j are drawn from a fixed distribution ρ over a set of models Θ, then its cumulative regret is R_J ≤ JK + … w. [sent-267, score-0.468]

73 the randomization over tasks and the realizations of the arms in each episode. [sent-272, score-0.418]

74 If the identity of the task were revealed at the beginning of each episode, a bandit algorithm could simply cluster all the samples coming from the same task and incur a much smaller cumulative regret, with a logarithmic dependency on episodes and steps, i. [sent-290, score-0.585]

75 Nonetheless, as discussed in the previous section, the cumulative regret of tUCB is never worse than that of UCB and, as the number of tasks increases, it approaches the performance of mUCB, which fully exploits the prior knowledge of Θ. [sent-293, score-0.356]

76 We define a set Θ of m = 5 MAB problems with K = 7 arms each, whose means {μ_i(θ)}_{i,θ} are reported in Fig. [sent-296, score-0.341]

77 (2013) for the actual values), where each model has a different color and squares correspond to optimal arms (e. [sent-299, score-0.379]

78 Models θ_1 and θ_2 only differ in their optimal arms, and this makes it difficult to distinguish them. [sent-303, score-0.34]

79 For arm 3 (which is optimal for model θ_3 and thus potentially selected by mUCB), all the models share exactly the same mean value. [sent-304, score-0.299]

80 Although this might suggest that mUCB gets stuck in pulling arm 3, we showed in Thm. [sent-306, score-0.262]

81 Only 5 out of the 7 arms are actually optimal for a model in Θ. [sent-309, score-0.354]

82 Thus, we also report the performance of UCB+ which, under the assumption that Θ is known, immediately discards all the arms which are not optimal (i ∉ A*(Θ)) and performs UCB on the remaining arms. [sent-310, score-0.354]

83 Before discussing the transfer results, we compare UCB, UCB+, and mUCB, to illustrate the advantage of the prior knowledge of Θ w. [sent-314, score-0.196]

84 7 reports the per-episode regret of the three algorithms. Notice that although Θ satisfies Assumption 1, the smallest singular value 0. [sent-319, score-0.174]

85 6 the horizontal lines correspond to the value of the regret bounds up to the n-dependent terms and constants for the different models in Θ, averaged w. [sent-327, score-0.234]

86 8 we show how the per-episode regret changes through episodes for a transfer problem with J = 5000 tasks of length n = 5000. [sent-335, score-0.534]

87 In fact, at the beginning tUCB selects arms according to a UCB strategy, since no prior information about the models Θ is available. [sent-340, score-0.402]

88 On the other hand, as more tasks are observed, tUCB is able to transfer the knowledge acquired through episodes and build an increasingly accurate estimate of the models, thus approaching the behavior of mUCB. [sent-341, score-0.403]

89 This is due to the fact that some models have relatively small gaps, and thus the number of episodes needed to obtain an estimate of the models accurate enough to reach the performance of mUCB is much larger than 5000 (see also the Remarks of Thm. [sent-345, score-0.231]

90 Since the final objective is to achieve a small global regret (Eq. [sent-347, score-0.174]

91 7 we report the cumulative regret averaged over the total number of tasks (J) for different values of J and n. [sent-349, score-0.267]

92 6 Conclusions and Open Questions. In this paper we introduce the transfer problem in the multi–armed bandit framework when tasks are drawn from a finite set of bandit problems. [sent-351, score-0.497]

93 We first introduced the bandit algorithm mUCB and showed that it is able to leverage the prior knowledge of the set of bandit problems Θ and reduce the regret w. [sent-352, score-0.5]

94 For these algorithms we derive regret bounds, and we show preliminary numerical simulations. [sent-358, score-0.193]

95 At each episode, tUCB transfers the knowledge about Θ acquired from previous tasks to achieve a small per-episode regret using umUCB. [sent-362, score-0.31]

96 Although this strategy guarantees that the per-episode regret of tUCB is never worse than that of UCB, it may not be the optimal strategy in terms of the cumulative regret through episodes. [sent-363, score-0.437]

97 In fact, if J is large, it could be preferable to run a model identification algorithm instead of umUCB in earlier episodes so as to improve the quality of the estimates μ̂_i(θ). [sent-364, score-0.19]

98 Although such an algorithm would incur a much larger regret in earlier tasks (up to linear), it could approach the performance of mUCB in later episodes much faster than tUCB does. [sent-365, score-0.399]

99 This trade-off between identification of the models and transfer of knowledge may suggest that algorithms other than tUCB are possible. [sent-366, score-0.222]

100 Sequential transfer in multi-armed bandit with finite set of models. [sent-434, score-0.29]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mucb', 0.47), ('tucb', 0.346), ('umucb', 0.332), ('arms', 0.32), ('ucb', 0.295), ('arm', 0.221), ('rtp', 0.207), ('regret', 0.174), ('transfer', 0.15), ('episodes', 0.143), ('bandit', 0.14), ('aj', 0.138), ('episode', 0.121), ('optimistic', 0.12), ('azar', 0.111), ('anandkumar', 0.102), ('pulls', 0.095), ('pulled', 0.085), ('tensor', 0.08), ('tasks', 0.067), ('rj', 0.06), ('learner', 0.044), ('models', 0.044), ('lazaric', 0.043), ('pulling', 0.041), ('rk', 0.04), ('mab', 0.04), ('pull', 0.038), ('discarded', 0.038), ('reinforcement', 0.037), ('moments', 0.035), ('transferring', 0.035), ('min', 0.034), ('estimates', 0.033), ('realizations', 0.031), ('kakade', 0.031), ('kleibergen', 0.028), ('lifelong', 0.028), ('paap', 0.028), ('knowledge', 0.028), ('online', 0.027), ('dence', 0.027), ('whitening', 0.027), ('multi', 0.027), ('cumulative', 0.026), ('transfers', 0.026), ('hsu', 0.026), ('armed', 0.025), ('actual', 0.025), ('sequential', 0.025), ('eigenvalues', 0.024), ('choe', 0.024), ('never', 0.024), ('gap', 0.023), ('brunskill', 0.022), ('xit', 0.022), ('rewards', 0.022), ('samples', 0.022), ('bb', 0.022), ('lugosi', 0.022), ('mini', 0.022), ('mann', 0.021), ('bv', 0.021), ('reward', 0.021), ('means', 0.021), ('optimal', 0.02), ('beginning', 0.02), ('preliminary', 0.019), ('transferred', 0.019), ('worse', 0.019), ('log', 0.019), ('cmu', 0.018), ('estimated', 0.018), ('prior', 0.018), ('uncertainty', 0.017), ('et', 0.017), ('remark', 0.017), ('bt', 0.016), ('pan', 0.016), ('robbins', 0.016), ('ge', 0.016), ('eigenvectors', 0.016), ('bounds', 0.016), ('moment', 0.016), ('dependency', 0.015), ('incur', 0.015), ('task', 0.015), ('acquired', 0.015), ('computes', 0.015), ('colt', 0.015), ('multitask', 0.015), ('rl', 0.014), ('bound', 0.014), ('improvement', 0.014), ('notice', 0.014), ('im', 0.014), ('steps', 0.014), ('assumption', 0.014), ('model', 0.014), ('auer', 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 292 nips-2013-Sequential Transfer in Multi-armed Bandit with Finite Set of Models

Author: Mohammad Gheshlaghi Azar, Alessandro Lazaric, Emma Brunskill

Abstract: Learning from prior tasks and transferring that experience to improve future performance is critical for building lifelong learning agents. Although results in supervised and reinforcement learning show that transfer may significantly improve the learning performance, most of the literature on transfer is focused on batch learning tasks. In this paper we study the problem of sequential transfer in online learning, notably in the multi–armed bandit framework, where the objective is to minimize the total regret over a sequence of tasks by transferring knowledge from prior tasks. We introduce a novel bandit algorithm based on a method-of-moments approach for estimating the possible tasks and derive regret bounds for it. 1

2 0.33885998 338 nips-2013-Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards

Author: Thomas Bonald, Alexandre Proutiere

Abstract: We consider an infinite-armed bandit problem with Bernoulli rewards. The mean rewards are independent, uniformly distributed over [0, 1]. Rewards 0 and 1 are referred to as a success and a failure, respectively. We propose a novel algorithm where the decision to exploit any arm is based on two successive targets, namely, the total number of successes until the first failure and until the first m failures, respectively, where m is a fixed parameter. This two-target algorithm achieves a long-term average regret in √(2n) for a large parameter m and a known time horizon n. This regret is optimal and strictly less than the regret achieved by the best known algorithms, which is in 2√n. The results are extended to any mean-reward distribution whose support contains 1 and to unknown time horizons. Numerical experiments show the performance of the algorithm for finite time horizons. 1

3 0.28200242 95 nips-2013-Distributed Exploration in Multi-Armed Bandits

Author: Eshcar Hillel, Zohar S. Karnin, Tomer Koren, Ronny Lempel, Oren Somekh

Abstract: We study exploration in Multi-Armed Bandits in a setting where k players collaborate in order to identify an ε-optimal arm. Our motivation comes from recent employment of bandit algorithms in computationally intensive, large-scale applications. Our results demonstrate a non-trivial tradeoff between the number of arm pulls required by each of the players, and the amount of communication between them. In particular, our main result shows that by allowing the k players to communicate only once, they are able to learn √k times faster than a single player. That is, distributing learning to k players gives rise to a factor √k parallel speedup. We complement this result with a lower bound showing this is in general the best possible. On the other extreme, we present an algorithm that achieves the ideal factor k speed-up in learning performance, with communication only logarithmic in 1/ε. 1

4 0.20522219 137 nips-2013-High-Dimensional Gaussian Process Bandits

Author: Josip Djolonga, Andreas Krause, Volkan Cevher

Abstract: Many applications in machine learning require optimizing unknown functions defined over a high-dimensional space from noisy samples that are expensive to obtain. We address this notoriously hard challenge, under the assumptions that the function varies only along some low-dimensional subspace and is smooth (i.e., it has a low norm in a Reproducible Kernel Hilbert Space). In particular, we present the SI-BO algorithm, which leverages recent low-rank matrix recovery techniques to learn the underlying subspace of the unknown function and applies Gaussian Process Upper Confidence sampling for optimization of the function. We carefully calibrate the exploration–exploitation tradeoff by allocating the sampling budget to subspace estimation and function optimization, and obtain the first subexponential cumulative regret bounds and convergence rates for Bayesian optimization in high-dimensions under noisy observations. Numerical results demonstrate the effectiveness of our approach in difficult scenarios. 1

5 0.19753507 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration

Author: Dan Russo, Benjamin Van Roy

Abstract: This paper considers the sample complexity of the multi-armed bandit with dependencies among the arms. Some of the most successful algorithms for this problem use the principle of optimism in the face of uncertainty to guide exploration. The clearest example of this is the class of upper confidence bound (UCB) algorithms, but recent work has shown that a simple posterior sampling algorithm, sometimes called Thompson sampling, can be analyzed in the same manner as optimistic approaches. In this paper, we develop a regret bound that holds for both classes of algorithms. This bound applies broadly and can be specialized to many model classes. It depends on a new notion we refer to as the eluder dimension, which measures the degree of dependence among action rewards. Compared to UCB algorithm regret bounds for specific model classes, our general bound matches the best available for linear models and is stronger than the best available for generalized linear models. 1

6 0.16605286 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling

7 0.14778928 330 nips-2013-Thompson Sampling for 1-Dimensional Exponential Family Bandits

8 0.13312626 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems

9 0.1232517 124 nips-2013-Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting

10 0.10728221 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes

11 0.10223568 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling

12 0.095802203 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries

13 0.093677372 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search

14 0.093654156 155 nips-2013-Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition

15 0.090444125 230 nips-2013-Online Learning with Costly Features and Labels

16 0.089699365 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings

17 0.089601628 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting

18 0.079285838 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions

19 0.075876758 7 nips-2013-A Gang of Bandits

20 0.070475414 325 nips-2013-The Pareto Regret Frontier


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.152), (1, -0.149), (2, 0.139), (3, -0.076), (4, 0.011), (5, -0.185), (6, 0.183), (7, -0.175), (8, -0.083), (9, -0.066), (10, 0.021), (11, 0.218), (12, -0.13), (13, -0.052), (14, 0.002), (15, -0.027), (16, -0.092), (17, 0.071), (18, -0.005), (19, 0.079), (20, 0.141), (21, 0.001), (22, 0.076), (23, 0.02), (24, 0.052), (25, -0.001), (26, 0.034), (27, 0.031), (28, -0.049), (29, -0.027), (30, 0.009), (31, 0.101), (32, 0.046), (33, 0.035), (34, -0.047), (35, -0.006), (36, -0.027), (37, -0.015), (38, -0.009), (39, 0.018), (40, -0.066), (41, 0.022), (42, -0.07), (43, 0.041), (44, 0.056), (45, 0.013), (46, 0.054), (47, -0.024), (48, -0.076), (49, 0.001)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93661815 292 nips-2013-Sequential Transfer in Multi-armed Bandit with Finite Set of Models

Author: Mohammad Gheshlaghi Azar, Alessandro Lazaric, Emma Brunskill

Abstract: Learning from prior tasks and transferring that experience to improve future performance is critical for building lifelong learning agents. Although results in supervised and reinforcement learning show that transfer may significantly improve the learning performance, most of the literature on transfer is focused on batch learning tasks. In this paper we study the problem of sequential transfer in online learning, notably in the multi–armed bandit framework, where the objective is to minimize the total regret over a sequence of tasks by transferring knowledge from prior tasks. We introduce a novel bandit algorithm based on a method-of-moments approach for estimating the possible tasks and derive regret bounds for it. 1

2 0.89095771 338 nips-2013-Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards

Author: Thomas Bonald, Alexandre Proutiere

Abstract: We consider an infinite-armed bandit problem with Bernoulli rewards. The mean rewards are independent, uniformly distributed over [0, 1]. Rewards 0 and 1 are referred to as a success and a failure, respectively. We propose a novel algorithm where the decision to exploit any arm is based on two successive targets, namely, the total number of successes until the first failure and until the first m failures, respectively, where m is a fixed parameter. This two-target algorithm achieves a long-term average regret in √(2n) for a large parameter m and a known time horizon n. This regret is optimal and strictly less than the regret achieved by the best known algorithms, which is in 2√n. The results are extended to any mean-reward distribution whose support contains 1 and to unknown time horizons. Numerical experiments show the performance of the algorithm for finite time horizons. 1

3 0.7700932 95 nips-2013-Distributed Exploration in Multi-Armed Bandits

Author: Eshcar Hillel, Zohar S. Karnin, Tomer Koren, Ronny Lempel, Oren Somekh

Abstract: We study exploration in Multi-Armed Bandits in a setting where k players collaborate in order to identify an ε-optimal arm. Our motivation comes from recent employment of bandit algorithms in computationally intensive, large-scale applications. Our results demonstrate a non-trivial tradeoff between the number of arm pulls required by each of the players, and the amount of communication between them. In particular, our main result shows that by allowing the k players to communicate only once, they are able to learn √k times faster than a single player. That is, distributing learning to k players gives rise to a factor √k parallel speedup. We complement this result with a lower bound showing this is in general the best possible. On the other extreme, we present an algorithm that achieves the ideal factor k speed-up in learning performance, with communication only logarithmic in 1/ε. 1

4 0.72349697 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling

Author: Sebastien Bubeck, Che-Yu Liu

Abstract: We consider the stochastic multi-armed bandit problem with a prior distribution on the reward distributions. We are interested in studying prior-free and prior-dependent regret bounds, very much in the same spirit as the usual distribution-free and distribution-dependent bounds for the non-Bayesian stochastic bandit. We first show that Thompson Sampling attains an optimal prior-free bound in the sense that for any prior distribution its Bayesian regret is bounded from above by 14√(nK). This result is unimprovable in the sense that there exists a prior distribution such that any algorithm has a Bayesian regret bounded from below by (1/20)√(nK). We also study the case of priors for the setting of Bubeck et al. [2013] (where the optimal mean is known as well as a lower bound on the smallest gap) and we show that in this case the regret of Thompson Sampling is in fact uniformly bounded over time, thus showing that Thompson Sampling can greatly take advantage of the nice properties of these priors. 1

5 0.68879271 330 nips-2013-Thompson Sampling for 1-Dimensional Exponential Family Bandits

Author: Nathaniel Korda, Emilie Kaufmann, Remi Munos

Abstract: Thompson Sampling has been demonstrated in many complex bandit models, however the theoretical guarantees available for the parametric multi-armed bandit are still limited to the Bernoulli case. Here we extend them by proving asymptotic optimality of the algorithm using the Jeffreys prior for 1-dimensional exponential family bandits. Our proof builds on previous work, but also makes extensive use of closed forms for Kullback-Leibler divergence and Fisher information (through the Jeffreys prior) available in an exponential family. This allow us to give a finite time exponential concentration inequality for posterior distributions on exponential families that may be of interest in its own right. Moreover our analysis covers some distributions for which no optimistic algorithm has yet been proposed, including heavy-tailed exponential families. 1

6 0.53378278 124 nips-2013-Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting

7 0.48959699 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration

8 0.44407779 137 nips-2013-High-Dimensional Gaussian Process Bandits

9 0.4407199 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling

10 0.43958434 270 nips-2013-Regret based Robust Solutions for Uncertain Markov Decision Processes

11 0.43282926 112 nips-2013-Estimation Bias in Multi-Armed Bandit Algorithms for Search Advertising

12 0.36821818 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries

13 0.36528519 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems

14 0.36014819 223 nips-2013-On the Relationship Between Binary Classification, Bipartite Ranking, and Binary Class Probability Estimation

15 0.35039786 159 nips-2013-Learning Prices for Repeated Auctions with Strategic Buyers

16 0.34704891 7 nips-2013-A Gang of Bandits

17 0.34438193 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes

18 0.33447975 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search

19 0.32464531 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting

20 0.29183149 230 nips-2013-Online Learning with Costly Features and Labels


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.032), (16, 0.022), (33, 0.179), (34, 0.058), (41, 0.039), (49, 0.019), (56, 0.191), (70, 0.025), (85, 0.061), (89, 0.025), (90, 0.196), (93, 0.031), (95, 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.90912992 225 nips-2013-One-shot learning and big data with n=2

Author: Lee H. Dicker, Dean P. Foster

Abstract: We model a “one-shot learning” situation, where very few observations y1 , ..., yn ∈ R are available. Associated with each observation yi is a very highdimensional vector xi ∈ Rd , which provides context for yi and enables us to predict subsequent observations, given their own context. One of the salient features of our analysis is that the problems studied here are easier when the dimension of xi is large; in other words, prediction becomes easier when more context is provided. The proposed methodology is a variant of principal component regression (PCR). Our rigorous analysis sheds new light on PCR. For instance, we show that classical PCR estimators may be inconsistent in the specified setting, unless they are multiplied by a scalar c > 1; that is, unless the classical estimator is expanded. This expansion phenomenon appears to be somewhat novel and contrasts with shrinkage methods (c < 1), which are far more common in big data analyses. 1

same-paper 2 0.87101895 292 nips-2013-Sequential Transfer in Multi-armed Bandit with Finite Set of Models

Author: Mohammad Gheshlaghi Azar, Alessandro Lazaric, Emma Brunskill

Abstract: Learning from prior tasks and transferring that experience to improve future performance is critical for building lifelong learning agents. Although results in supervised and reinforcement learning show that transfer may significantly improve the learning performance, most of the literature on transfer is focused on batch learning tasks. In this paper we study the problem of sequential transfer in online learning, notably in the multi–armed bandit framework, where the objective is to minimize the total regret over a sequence of tasks by transferring knowledge from prior tasks. We introduce a novel bandit algorithm based on a method-of-moments approach for estimating the possible tasks and derive regret bounds for it. 1

3 0.85104245 54 nips-2013-Bayesian optimization explains human active search

Author: Ali Borji, Laurent Itti

Abstract: Many real-world problems have complicated objective functions. To optimize such functions, humans utilize sophisticated sequential decision-making strategies. Many optimization algorithms have also been developed for this same purpose, but how do they compare to humans in terms of both performance and behavior? We try to unravel the general underlying algorithm people may be using while searching for the maximum of an invisible 1D function. Subjects click on a blank screen and are shown the ordinate of the function at each clicked abscissa location. Their task is to find the function’s maximum in as few clicks as possible. Subjects win if they get close enough to the maximum location. Analysis over 23 non-maths undergraduates, optimizing 25 functions from different families, shows that humans outperform 24 well-known optimization algorithms. Bayesian Optimization based on Gaussian Processes, which exploits all the x values tried and all the f (x) values obtained so far to pick the next x, predicts human performance and searched locations better. In 6 follow-up controlled experiments over 76 subjects, covering interpolation, extrapolation, and optimization tasks, we further confirm that Gaussian Processes provide a general and unified theoretical account to explain passive and active function learning and search in humans. 1

4 0.80945867 27 nips-2013-Adaptive Multi-Column Deep Neural Networks with Application to Robust Image Denoising

Author: Forest Agostinelli, Michael R. Anderson, Honglak Lee

Abstract: Stacked sparse denoising autoencoders (SSDAs) have recently been shown to be successful at removing noise from corrupted images. However, like most denoising techniques, the SSDA is not robust to variation in noise types beyond what it has seen during training. To address this limitation, we present the adaptive multi-column stacked sparse denoising autoencoder (AMC-SSDA), a novel technique of combining multiple SSDAs by (1) computing optimal column weights via solving a nonlinear optimization program and (2) training a separate network to predict the optimal weights. We eliminate the need to determine the type of noise, let alone its statistics, at test time and even show that the system can be robust to noise not seen in the training set. We show that state-of-the-art denoising performance can be achieved with a single system on a variety of different noise types. Additionally, we demonstrate the efficacy of AMC-SSDA as a preprocessing (denoising) algorithm by achieving strong classification performance on corrupted MNIST digits. 1

5 0.80541176 142 nips-2013-Information-theoretic lower bounds for distributed statistical estimation with communication constraints

Author: Yuchen Zhang, John Duchi, Michael Jordan, Martin J. Wainwright

Abstract: We establish lower bounds on minimax risks for distributed statistical estimation under a communication budget. Such lower bounds reveal the minimum amount of communication required by any procedure to achieve the centralized minimax-optimal rates for statistical estimation. We study two classes of protocols: one in which machines send messages independently, and a second allowing for interactive communication. We establish lower bounds for several problems, including various types of location models, as well as for parameter estimation in regression models. 1

6 0.80227512 95 nips-2013-Distributed Exploration in Multi-Armed Bandits

7 0.80142528 242 nips-2013-PAC-Bayes-Empirical-Bernstein Inequality

8 0.80137789 311 nips-2013-Stochastic Convex Optimization with Multiple Objectives

9 0.80075598 179 nips-2013-Low-Rank Matrix and Tensor Completion via Adaptive Sampling

10 0.8004005 206 nips-2013-Near-Optimal Entrywise Sampling for Data Matrices

11 0.80020189 240 nips-2013-Optimization, Learning, and Games with Predictable Sequences

12 0.80005288 113 nips-2013-Exact and Stable Recovery of Pairwise Interaction Tensors

13 0.79899353 89 nips-2013-Dimension-Free Exponentiated Gradient

14 0.79886729 230 nips-2013-Online Learning with Costly Features and Labels

15 0.79874474 338 nips-2013-Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards

16 0.79859847 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings

17 0.79536825 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search

18 0.79248399 224 nips-2013-On the Sample Complexity of Subspace Learning

19 0.79237771 74 nips-2013-Convex Tensor Decomposition via Structured Schatten Norm Regularization

20 0.79206353 344 nips-2013-Using multiple samples to learn mixture models