nips nips2011 nips2011-210 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yevgeny Seldin, Peter Auer, John S. Shawe-taylor, Ronald Ortner, François Laviolette
Abstract: We derive an instantaneous (per-round) data-dependent regret bound for stochastic multiarmed bandits with side information (also known as contextual bandits). The scaling of our regret bound with the number of states (contexts) N goes as √(N I_ρt(S; A)), where I_ρt(S; A) is the mutual information between states and actions (the side information) used by the algorithm at round t. If the algorithm uses all the side information, the regret bound scales as √(N ln K), where K is the number of actions (arms). However, if the side information I_ρt(S; A) is not fully used, the regret bound is significantly tighter. In the extreme case, when I_ρt(S; A) = 0, the dependence on the number of states reduces from linear to logarithmic. Our analysis allows us to provide the algorithm with a large amount of side information, lets the algorithm decide which side information is relevant for the task, and penalizes the algorithm only for the side information that it actually uses. We also present an algorithm for multiarmed bandits with side information with O(K) computational complexity per game round.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We derive an instantaneous (per-round) data-dependent regret bound for stochastic multiarmed bandits with side information (also known as contextual bandits). [sent-12, score-1.405]
2 The scaling of our regret bound with the number of states (contexts) N goes as √(N I_ρt(S; A)), where I_ρt(S; A) is the mutual information between states and actions (the side information) used by the algorithm at round t. [sent-13, score-1.121]
3 If the algorithm uses all the side information, the regret bound scales as √(N ln K), where K is the number of actions (arms). [sent-14, score-1.016]
4 However, if the side information I⇢t (S; A) is not fully used, the regret bound is significantly tighter. [sent-15, score-0.638]
5 Our analysis allows us to provide the algorithm with a large amount of side information, lets the algorithm decide which side information is relevant for the task, and penalizes the algorithm only for the side information that it actually uses. [sent-17, score-0.624]
6 We also present an algorithm for multiarmed bandits with side information with O(K) computational complexity per game round. [sent-18, score-0.903]
7 1 Introduction Multiarmed bandits with side information are an elegant mathematical model for many real-life interactive systems, such as personalized online advertising, personalized medical treatment, and so on. [sent-19, score-0.55]
8 This model is also known as contextual bandits or associative bandits (Kaelbling, 1994, Strehl et al. [sent-20, score-0.718]
9 In multiarmed bandits with side information the learner repeatedly observes states (side information) {s1 , s2 , . [sent-23, score-0.929]
10 } (for example, symptoms of a patient) and has to perform actions (for example, prescribe drugs), such that the expected regret is minimized. [sent-26, score-0.482]
11 Most of the existing analyses of multiarmed bandits with side information have focused on the adversarial (worst-case) model, where the sequence of rewards associated with each state-action pair is chosen by an adversary. [sent-28, score-0.926]
12 We derive a data-dependent analysis for stochastic multiarmed bandits with side information. [sent-30, score-0.866]
13 We restrict ourselves to problems with a finite number of states N and a finite number of actions K and leave the generalization to continuous state and action spaces to future work. [sent-33, score-0.39]
14 1 The result presented in this paper exhibits adaptive dependency on the side information (state identity) that is actually used by the algorithm. [sent-37, score-0.235]
15 This allows us to provide the algorithm with a large amount of side information and lets the algorithm decide which part of this side information is actually relevant to the task. [sent-38, score-0.416]
16 For example, in online advertising we can increase the state resolution and provide the algorithm with the town from which the web page was accessed, but if this refined state information is not used by the algorithm the regret bound will not deteriorate. [sent-39, score-0.616]
17 This can be opposed to existing analysis of adversarial multiarmed bandits, where the regret bound depends on a predefined complexity of the underlying expert class (Beygelzimer et al. [sent-40, score-0.885]
18 Thus, the existing analysis of adversarial multiarmed bandits would either become looser if we add more side information or a-priori limit the usage of the side information through its internal structure. [sent-42, score-1.104]
19 (We note that through the relation between PAC-Bayesian analysis and the analysis of adversarial online learning described in Banerjee (2006) it might be possible to extend our analysis to the adversarial setting, but we leave this research direction to future work. [sent-43, score-0.192]
20 ) The idea of regularization by relevant mutual information goes back to the Information Bottleneck principle in supervised and unsupervised learning (Tishby et al. [sent-44, score-0.208]
21 Tishby and Polani (2010) further suggested to measure the complexity of a policy in reinforcement learning by the mutual information between states and actions used by the policy. [sent-46, score-0.506]
22 We note, however, that our starting point is the regret bound and we derive the regularization term from our analysis without introducing it a-priori. [sent-47, score-0.479]
23 Unlike in VC-dimension-based approaches and their successors, where the complexity is defined for a hypothesis class, in PAC-Bayesian analysis the complexity is defined for individual hypotheses. [sent-52, score-0.183]
24 In supervised learning every hypothesis in a hypothesis class can be evaluated on all the samples, whereas in reinforcement learning rewards of one action cannot be used to evaluate another action. [sent-55, score-0.452]
25 Here, we apply this generalization to multiarmed bandits with side information. [sent-58, score-0.766]
26 We start with definitions in Section 2 and provide our main results in Section 3, which include an instantaneous regret bound and a new algorithm for stochastic multiarmed bandits with side information. [sent-60, score-1.296]
27 We start with the definition of stochastic multiarmed bandits with side information. [sent-64, score-0.817]
28 Let S be a set of |S| = N states and let A be a set of |A| = K actions, such that any action can be performed in any state. [sent-65, score-0.225]
29 Let R(a, s) be the expected reward for performing action a in state s. [sent-67, score-0.247]
30 At each round t of the game the learner is presented a state St drawn i.i.d. according to an unknown distribution p(s). [sent-68, score-0.243]
31 The learner draws an action At according to his choice of a distribution (policy) πt(a|s) and obtains a stochastic reward Rt with expected value R(At, St). [sent-72, score-0.295]
32 For every state s we define the “best” action in that state as a*(s) = arg max_a R(a, s) (if there are multiple “best” actions, one of them is chosen arbitrarily). [sent-117, score-0.241]
33 We then define the expected and empirical regret for performing any other action a in state s as: Δ(a, s) = R(a*(s), s) − R(a, s) and Δ̂_t(a, s) = R̂_t(a*(s), s) − R̂_t(a, s). [sent-118, score-0.524]
34 Let p̂_t(s) = n_t(s)/t be the empirical distribution over states observed up to time t. [sent-119, score-0.18]
35 For any policy ρ(a|s) we define the empirical reward, empirical regret, and expected regret of the policy as: R̂_t(ρ) = Σ_s p̂_t(s) Σ_a ρ(a|s) R̂_t(a, s), Δ̂_t(ρ) = Σ_s p̂_t(s) Σ_a ρ(a|s) Δ̂_t(a, s), and Δ(ρ) = Σ_s p(s) Σ_a ρ(a|s) Δ(a, s). [sent-120, score-0.537]
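As a small illustration (ours, not from the paper), the three policy-level quantities above are the same weighted sum with different tables plugged in; the array names below are hypothetical.

```python
import numpy as np

def policy_average(rho, table, state_dist):
    """Compute sum_s p(s) sum_a rho(a|s) * table[s, a].

    rho:        (N, K) array, rho[s, a] = probability of action a in state s
    table:      (N, K) array of per-(state, action) values, e.g. the empirical
                rewards R_hat_t(a, s), the empirical regrets Delta_hat_t(a, s),
                or the expected regrets Delta(a, s)
    state_dist: (N,) distribution over states, e.g. p_hat_t or the true p
    """
    return float(state_dist @ np.sum(rho * table, axis=1))
```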
36 Denote by a = h(s) the action assigned by hypothesis h to state s. [sent-125, score-0.281]
37 Denote by R(h) = Σ_{s∈S} p(s) R(h(s), s) the expected reward of a hypothesis h. [sent-127, score-0.168]
38 Let h* = arg max_{h∈H} R(h) be the “best” hypothesis (the one that chooses the “best” action in each state). [sent-130, score-0.22]
39 Any policy ρ(a|s) defines a distribution over H: we can draw an action a for each state s according to ρ(a|s) and thus obtain a hypothesis h ∈ H. [sent-133, score-0.401]
40 Finally, let n_h(a) = Σ_{s=1}^N 1{h(s)=a} be the number of states in which action a is played by the hypothesis h. [sent-137, score-0.446]
41 Let A^h = {n_h(a)/N}_{a∈A} be the normalized cardinality profile (histogram) over the actions played by hypothesis h (with respect to the uniform distribution over S). [sent-138, score-0.375]
42 In other words, H(A^h) = −Σ_a (n_h(a)/N) ln(n_h(a)/N) is the entropy of the action choice of hypothesis h (with respect to the uniform distribution over S). [sent-140, score-0.568]
43 Note that the optimal policy ρ*(a|s) (the one that selects the “best” action in each state) is deterministic and we have I_ρ*(S; A) = H(A^{h*}). [sent-141, score-0.214]
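A minimal sketch (ours, not the paper's code) of the two complexity measures used above: the entropy H(A^h) of a deterministic hypothesis' action histogram, and the mutual information I_ρ(S; A) of a stochastic policy under a given state distribution.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def hypothesis_entropy(h, K):
    """H(A^h) for a deterministic hypothesis h (length-N array of action indices)."""
    counts = np.bincount(h, minlength=K).astype(float)
    return entropy(counts / len(h))

def mutual_information(rho, state_dist):
    """I_rho(S; A) = H(A) - H(A|S) for a policy rho[s, a] and a state distribution."""
    marginal = state_dist @ rho                              # rho(a) = sum_s p(s) rho(a|s)
    cond = float(state_dist @ np.apply_along_axis(entropy, 1, rho))
    return entropy(marginal) - cond

# For a deterministic policy H(A|S) = 0, so under the uniform state distribution
# the mutual information equals the entropy of the cardinality profile, as noted above.
```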
44 3 Main Results Our main result is a data- and complexity-dependent regret bound for a general class of prediction strategies of a smoothed exponential form. [sent-142, score-0.459]
45 Let ρ_t(a) be an arbitrary distribution over actions, let ρ_t^exp(a|s) = ρ_t(a) e^{γ_t R̂_t(a,s)} / Z(ρ_t^exp, s) (1), where Z(ρ_t^exp, s) = Σ_a ρ_t(a) e^{γ_t R̂_t(a,s)} is a normalization factor, and let ρ̃_t(a|s) = (1 − K ε_{t+1}) ρ_t^exp(a|s) + ε_{t+1} (2) be a smoothed exponential policy. [sent-143, score-0.179]
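A sketch of the smoothed exponential policy (1)-(2) for a single state; γ_t is assumed to be the exponentiation (learning-rate) parameter, whose symbol is garbled in the extracted text, and all variable names are ours.

```python
import numpy as np

def smoothed_exp_policy(rho_t, R_hat_s, gamma_t, eps_next):
    """rho_t: (K,) marginal over actions; R_hat_s: (K,) reward estimates R_hat_t(a, s)."""
    w = rho_t * np.exp(gamma_t * R_hat_s)
    rho_exp = w / w.sum()                                    # equation (1)
    K = len(w)
    assert K * eps_next <= 1.0, "smoothing requires K * eps_{t+1} <= 1"
    return (1.0 - K * eps_next) * rho_exp + eps_next         # equation (2)
```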
46 The following theorem provides a regret bound for playing ρ̃_t at round t + 1 of the game. [sent-144, score-0.551]
47 It also provides an instantaneous regret bound for any algorithm that plays the policies {ρ̃_1^exp, ρ̃_2^exp, …}. [sent-165, score-0.537]
48 Then, by the union bound, the result holds with probability greater than 1 − δ for all rounds of the game simultaneously. [sent-173, score-0.186]
49 This leads to Algorithm 1 for stochastic multiarmed bandits with side information. [sent-174, score-0.817]
50 Theorem 1 is based on the following regret decomposition and the subsequent theorem and two lemmas that bound the three terms in the decomposition. [sent-176, score-0.515]
51 , π_t} and c, simultaneously for all policies ρ that satisfy (3), with probability greater than 1 − δ: Δ(ρ) − Δ̂_t(ρ) ≤ (1 + c) √( 2(e − 2)(N I_ρ(S; A) + K(ln N + ln K) + ln(2m_t/δ)) / (t ε_t) ), (6) Algorithm 1: Algorithm for stochastic contextual bandits. [sent-182, score-0.79]
52 n(S_t) ← n(S_t) + 1; R̂(A_t, S_t) ← R̂(A_t, S_t) + R_t / ρ̃(A_t|S_t); ρ_{t+1}(a) ← ((N − 1)/N) ρ_t(a) + (1/N) ρ_t(a|S_t); and for all ρ that do not satisfy (3), with the same probability: Δ(ρ) − Δ̂_t(ρ) ≤ 2(N I_ρ(S; A) + K(ln N + ln K) + ln(2m_t/δ)) / t. [sent-187, score-0.596]
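A hedged reconstruction of one round of Algorithm 1, as far as it can be read from the garbled pseudocode above: play the smoothed exponential policy in the observed state, form an importance-weighted reward estimate, and update the running counts and the running marginal over actions. All names are ours and the update details are assumptions, not a verbatim transcription.

```python
import numpy as np

def play_round(state, env_reward, R_hat, n_state, rho_marginal,
               gamma_t, eps_next, rng):
    """One round; the work per round is O(K).

    R_hat:        (N, K) table of importance-weighted reward estimates
    n_state:      (N,) visit counts n(s)
    rho_marginal: (K,) running estimate of the marginal distribution over actions
    """
    N, K = R_hat.shape
    # Smoothed exponential policy for the observed state, equations (1)-(2);
    # assumes K * eps_next <= 1.
    w = rho_marginal * np.exp(gamma_t * R_hat[state])
    policy = (1.0 - K * eps_next) * (w / w.sum()) + eps_next
    action = rng.choice(K, p=policy)
    reward = env_reward(state, action)                       # stochastic reward R_t

    n_state[state] += 1
    # EXP3-style importance-weighted estimate: only the played action is updated.
    R_hat[state, action] += reward / policy[action]
    # Running-average approximation of the marginal over actions (see the
    # running-average discussion below).
    rho_marginal[:] = (1.0 - 1.0 / N) * rho_marginal + (1.0 / N) * policy
    return action, reward
```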
53 For any distribution ρ_t^exp of the form (1), where ρ_t(a) ≥ ε for all a, we have: Δ̂_t(ρ_t^exp) ≤ (1/γ_t) ln(1/ε). [sent-190, score-0.323]
54 Theorem 1 exhibits what we were looking for: the regret of a policy ρ̃_t^exp depends on the trade-off between its complexity, N I_{ρ_t^exp}(S; A), and the empirical regret, which is bounded by (1/γ_t) ln(1/ε_{t+1}). [sent-195, score-0.764]
55 We note that 0 ≤ I_ρt(S; A) ≤ ln K; hence, the result is interesting when N ≫ K, since otherwise the K ln K term in the bound neutralizes the advantage we get from having small mutual information values. [sent-196, score-0.808]
56 We believe that the dependence of the first term of the regret bound (4) on ε_t is an artifact of our crude upper bound on the variance of the sampling process (given in Lemma 3 in the proof of Theorem 2) and that this term should not be in the bound. [sent-198, score-0.629]
57 This is supported by an empirical study of stochastic multiarmed bandits (Seldin et al. [sent-199, score-0.679]
58 With the current bound the best choice for ε_t is ε_t = (Kt)^{-1/3}, which, by integration over the game rounds, yields an O(K^{1/3} t^{2/3}) dependence of the cumulative regret on the number of arms and game rounds. [sent-201, score-0.779]
59 However, if we manage to derive a tighter analysis and remove ε_t from the first term in (4), the best choice of ε_t will be ε_t = (Kt)^{-1/2} and the dependence of the cumulative regret on the number of arms and time horizon will improve to O((Kt)^{1/2}). [sent-202, score-0.611]
60 We note that although the UCB algorithm for stochastic multiarmed bandits (Auer et al. [sent-209, score-0.679]
61 , 2002a) is asymptotically better than the EXP3 algorithm, it is not compatible with PAC-Bayesian analysis and we are not aware of a way to derive a UCB-type algorithm and analysis for multiarmed bandits with side information, whose dependence on the number of states would be better than O(N ln K). [sent-210, score-1.284]
62 Generally, higher values of γ_t decrease the second term of the bound, but also lead to more concentrated policies (conditional distributions) ρ_t^exp(a|s) and thus higher mutual information values I_{ρ_t^exp}(S; A). [sent-214, score-0.184]
63 This can be approximated by taking the value of mutual information from the previous round (or approximation of the value of mutual information from the previous round). [sent-216, score-0.317]
64 By the regret decomposition (5) and Theorem 2, the regret at round t + 1 is minimized by a policy ρ_t(a|s) that minimizes a certain trade-off between the mutual information I_ρ(S; A) and the empirical reward R̂_t(ρ). [sent-219, score-1.318]
65 Running a similar type of iterations in our case would be prohibitively expensive, since they require iteration over all states s ∈ S at each round of the game. [sent-222, score-0.171]
66 We approximate these iterations by approximating the marginal distribution over the actions by a running average: ρ̃_{t+1}^exp(a) = ((N − 1)/N) ρ̃_t^exp(a) + (1/N) ρ_t^exp(a|S_t). [sent-223, score-0.283]
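A small self-contained check (ours) that this running average tracks the exact marginal Σ_s p(s) ρ^exp(a|s) when the conditional policy is held fixed; agreement is up to the averaging noise inherent in the (N−1)/N, 1/N weights.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 100, 20
p = np.full(N, 1.0 / N)                     # uniform distribution over states
rho = rng.dirichlet(np.ones(K), size=N)     # a fixed conditional policy rho(a|s)

running = np.full(K, 1.0 / K)
for _ in range(50_000):
    s = rng.choice(N, p=p)
    running = (1.0 - 1.0 / N) * running + (1.0 / N) * rho[s]

exact = p @ rho                             # the exact marginal over actions
print(np.abs(running - exact).max())        # small relative to the entries of `exact`
```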
67 If these independent multiarmed bandits independently converge to similar strategies, we still get a tighter regret bound. [sent-227, score-0.963]
68 This happens because the corresponding subspace of the hypothesis space is significantly smaller than the total hypothesis space, which enables us to put a higher prior on it (Seldin and Tishby, 2010). [sent-228, score-0.202]
69 We are not aware of algorithms for stochastic multiarmed bandits with side information. [sent-231, score-0.817]
70 The best algorithm known to us for adversarial multiarmed bandits with side information is EXP4. [sent-232, score-0.874]
71 P has O(√(Kt ln |H|)) regret and O(K|H|) complexity per game round. [sent-236, score-0.755]
72 P would have O(√(KtN ln K)) regret and O(K^{N+1}) computational complexity. [sent-238, score-0.642]
73 For hard problems, where all side information has to be used, our regret bound is inferior to the regret bound of Beygelzimer et al. [sent-239, score-1.114]
74 For simple problems the dependence of our regret bound on the number of states is significantly better, up to the point that when the side information is irrelevant for the task we can get an O(√(K ln N)) dependence on the number of states versus O(√(N ln K)) in EXP4. [sent-242, score-1.528]
75 For N ≫ K this leads to tighter regret bounds for small t even despite the “incorrect” dependence on t of our bound, and if we improve the analysis it will lead to tighter regret bounds for all t. [sent-244, score-0.827]
76 As we already said, our algorithm is able to filter relevant information from large amounts of side information automatically, whereas in EXP4. [sent-245, score-0.232]
77 P the usage of side information has to be restricted externally through the construction of the hypothesis class. [sent-246, score-0.331]
78 “Baseline” in the first graph corresponds to playing N independent multiarmed bandits, one in each state. [sent-254, score-0.296]
79 We take N = 100, K = 20, and a uniform distribution over states (p(s) = 0.01). [sent-259, score-0.156]
80 In the first case, the same action is the best in all states (and hence H(A^{h*}) = 0 for the optimal hypothesis h*). [sent-261, score-0.35]
81 In the second case, for the first 33 states the best action is number 1, for the next 33 states the best action is number 2, and for the remaining third of the states the best action is number 3 (thus, depending on the state, one of three actions is the “best” and H(A^{h*}) = ln 3). [sent-262, score-0.851]
82 In the last case, there are 20 groups of 5 states and each of K = 20 actions is the best in exactly one of the 20 groups. [sent-264, score-0.234]
83 For all states, the reward of the best action in a state has a Bernoulli distribution with bias 0.6, [sent-265, score-0.296]
84 and the rewards of all other actions in that state have a Bernoulli distribution with bias 0.5. [sent-266, score-0.239]
85 We run the experiment for T = 4,000,000 rounds and calculate the cumulative regret Δ(t) = Σ_{τ=1}^t Δ(ρ̃_τ^exp) and the instantaneous regret bound given in (4). [sent-268, score-0.98]
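A sketch of the experimental environment described above; the Bernoulli biases 0.6/0.5, the three cases, and N = 100, K = 20 are taken from the text, while the scaffolding and names are ours.

```python
import numpy as np

def make_environment(case, N=100, K=20, seed=0):
    """Return (best_action array, reward sampler) for one of the three test cases."""
    rng = np.random.default_rng(seed)
    if case == "one_best":            # same action best in all states, H(A^{h*}) = 0
        best = np.zeros(N, dtype=int)
    elif case == "three_best":        # thirds of the states, H(A^{h*}) = ln 3
        best = np.repeat([0, 1, 2], [33, 33, 34])
    elif case == "all_best":          # 20 groups of 5 states, one best action per group
        best = np.repeat(np.arange(K), N // K)
    else:
        raise ValueError(case)

    def reward(state, action):
        bias = 0.6 if action == best[state] else 0.5
        return float(rng.random() < bias)   # Bernoulli reward

    return best, reward

# States are drawn uniformly (p(s) = 1/N); the expected per-round regret is
# 0.1 times the probability of playing a suboptimal action in the drawn state.
```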
86 As we can see from the graphs (see Figure 1), the algorithm exhibits sublinear cumulative regret (pay attention to the axes' scales). [sent-270, score-0.451]
87 Furthermore, for simple problems (with small H(A^{h*})) the regret grows more slowly than for complex problems. [sent-271, score-0.344]
88 Figure 1.a shows the performance of an algorithm with the same parameter values that runs N multiarmed bandits, one in each state, independently of the other states. [sent-273, score-0.357]
89 Let V_i(h) = Σ_{j=1}^i E[Z_j(h)^2 | U_1, …, U_{j−1}] be the cumulative variances of the martingales M_i(h). [sent-295, score-0.191]
90 For a distribution ρ over H define M_i(ρ) = E_{ρ(h)}[M_i(h)] and V_t(ρ) = E_{ρ(h)}[V_t(h)] as weighted averages of the martingales and their cumulative variances according to the distribution ρ. [sent-296, score-0.241]
91 Note that M_t(h) = t(Δ(h) − Δ̂_t(h)) are martingales, and their cumulative variances are V_t(h) = Σ_{τ=1}^t E[([R_τ^{h*(S_τ), S_τ} − R_τ^{h(S_τ), S_τ}] − [R(h*) − R(h)])^2 | T_{τ−1}]. [sent-302, score-0.191]
92 In order to apply Theorem 3 we have to derive an upper bound on V_t(ρ_t^exp), a prior µ(h) over H, and calculate (or upper bound) the KL divergence KL(ρ_t^exp‖µ). [sent-303, score-0.165]
93 However, we were not yet able to prove that the probability ρ_t^exp(h) of the remaining hypotheses (those with large Δ(h)) gets sufficiently small (of order Kε_t), so that the weighted cumulative variance would be of order 2Kt. [sent-313, score-0.234]
94 Improving the upper bound on V_t(ρ_t^exp) will improve the regret bound, but for the moment we present the regret bound based on the crude upper bound V_t(ρ_t^exp) ≤ 2t/ε_t. [sent-316, score-1.045]
95 It is possible to define a distribution µ over H that satisfies: µ(h) ≥ e^{−N H(A^h) − K ln N − K ln K}. [sent-319, score-0.621]
96 For the distribution µ that satisfies (8) and any distribution ρ(a|s): KL(ρ‖µ) ≤ N I_ρ(S; A) + K ln N + K ln K. [sent-321, score-0.646]
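A brute-force sanity check (ours) of the prior construction in (8): the weights e^{−N H(A^h) − K ln N − K ln K} sum to at most one over all K^N hypotheses, so a prior µ satisfying the stated lower bound exists. Feasible only for tiny N and K.

```python
import itertools, math
import numpy as np

def check_prior_normalization(N, K):
    """Return (sum_h e^{-N H(A^h)}, e^{K ln N + K ln K}); the first should not exceed the second."""
    total = 0.0
    for h in itertools.product(range(K), repeat=N):
        freq = np.bincount(h, minlength=K) / N
        H = -sum(f * math.log(f) for f in freq if f > 0)
        total += math.exp(-N * H)
    return total, math.exp(K * (math.log(N) + math.log(K)))

lhs, rhs = check_prior_normalization(N=5, K=3)
print(lhs <= rhs)   # True, so mu(h) >= e^{-N H(A^h) - K ln N - K ln K} can be normalized
```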
97 6 Discussion We presented a PAC-Bayesian analysis of stochastic multiarmed bandits with side information. [sent-323, score-0.841]
98 Our analysis provides a data-dependent algorithm and a data-dependent regret analysis for this problem. [sent-324, score-0.392]
99 The selection of task-relevant side information is delegated from the user to the algorithm. [sent-325, score-0.208]
100 PAC-Bayes-Bernstein inequality for martingales and its application to multiarmed bandits. [sent-367, score-0.382]
wordName wordTfidf (topN-words)
[('ah', 0.346), ('regret', 0.344), ('ln', 0.298), ('multiarmed', 0.296), ('bandits', 0.286), ('seldin', 0.272), ('side', 0.184), ('rt', 0.173), ('exp', 0.154), ('st', 0.13), ('action', 0.119), ('states', 0.106), ('actions', 0.104), ('mutual', 0.102), ('hypothesis', 0.101), ('tishby', 0.095), ('policy', 0.095), ('vt', 0.095), ('martingales', 0.086), ('bound', 0.086), ('game', 0.084), ('yevgeny', 0.084), ('ba', 0.083), ('cumulative', 0.08), ('rounds', 0.077), ('nicol', 0.072), ('tt', 0.071), ('beygelzimer', 0.069), ('reward', 0.067), ('round', 0.065), ('kl', 0.063), ('played', 0.063), ('state', 0.061), ('contextual', 0.06), ('auer', 0.06), ('adversarial', 0.06), ('john', 0.058), ('policies', 0.058), ('nh', 0.057), ('theorem', 0.056), ('stochastic', 0.051), ('naftali', 0.051), ('francois', 0.051), ('pt', 0.049), ('instantaneous', 0.049), ('rewards', 0.049), ('nitions', 0.049), ('peter', 0.048), ('et', 0.046), ('reinforcement', 0.046), ('crude', 0.045), ('kt', 0.044), ('zi', 0.044), ('robert', 0.042), ('polani', 0.042), ('nt', 0.041), ('dependence', 0.041), ('advertising', 0.04), ('associative', 0.04), ('tighter', 0.037), ('bandit', 0.037), ('ert', 0.037), ('laviolette', 0.037), ('cured', 0.037), ('arms', 0.036), ('supervised', 0.036), ('langford', 0.036), ('vn', 0.035), ('lemma', 0.035), ('zj', 0.034), ('symptoms', 0.034), ('learner', 0.033), ('mi', 0.032), ('hardest', 0.032), ('leslie', 0.032), ('comments', 0.032), ('ui', 0.03), ('lemmas', 0.029), ('williamson', 0.029), ('complexity', 0.029), ('personalized', 0.028), ('strehl', 0.028), ('upper', 0.027), ('sequence', 0.027), ('exhibits', 0.027), ('uniform', 0.025), ('derive', 0.025), ('variances', 0.025), ('distribution', 0.025), ('martingale', 0.025), ('greater', 0.025), ('best', 0.024), ('ucb', 0.024), ('analysis', 0.024), ('information', 0.024), ('pac', 0.024), ('patients', 0.023), ('mt', 0.023), ('un', 0.022), ('usage', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
Author: Yevgeny Seldin, Peter Auer, John S. Shawe-taylor, Ronald Ortner, François Laviolette
Abstract: We derive an instantaneous (per-round) data-dependent regret bound for stochastic multiarmed bandits with side information (also known as contextual bandits). The scaling of our regret bound with the number of states (contexts) N goes as √(N I_ρt(S; A)), where I_ρt(S; A) is the mutual information between states and actions (the side information) used by the algorithm at round t. If the algorithm uses all the side information, the regret bound scales as √(N ln K), where K is the number of actions (arms). However, if the side information I_ρt(S; A) is not fully used, the regret bound is significantly tighter. In the extreme case, when I_ρt(S; A) = 0, the dependence on the number of states reduces from linear to logarithmic. Our analysis allows us to provide the algorithm with a large amount of side information, lets the algorithm decide which side information is relevant for the task, and penalizes the algorithm only for the side information that it actually uses. We also present an algorithm for multiarmed bandits with side information with O(K) computational complexity per game round.
2 0.29409346 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits
Author: Yasin Abbasi-yadkori, Csaba Szepesvári, David Tax
Abstract: We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem. In particular, we show that a simple modification of Auer’s UCB algorithm (Auer, 2002) achieves with high probability constant regret. More importantly, we modify and, consequently, improve the analysis of the algorithm for the for linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), Li et al. (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. In both cases, the improvement stems from the construction of smaller confidence sets. For their construction we use a novel tail inequality for vector-valued martingales. 1
3 0.26148203 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
Author: Odalric-ambrym Maillard, Daniil Ryabko, Rémi Munos
Abstract: The problem of selecting the right state-representation in a reinforcement learning problem is considered. Several models (functions mapping past observations to a finite set) of the observations are given, and it is known that for at least one of these models the resulting state dynamics are indeed Markovian. Without knowing neither which of the models is the correct one, nor what are the probabilistic characteristics of the resulting MDP, it is required to obtain as much reward as the optimal policy for the correct model (or for the best of the correct models, if there are several). We propose an algorithm that achieves that, with a regret of order T 2/3 where T is the horizon time. 1
4 0.258863 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations
Author: Shie Mannor, Ohad Shamir
Abstract: We consider an adversarial online learning setting where a decision maker can choose an action in every stage of the game. In addition to observing the reward of the chosen action, the decision maker gets side observations on the reward he would have obtained had he chosen some of the other actions. The observation structure is encoded as a graph, where node i is linked to node j if sampling i provides information on the reward of j. This setting naturally interpolates between the well-known “experts” setting, where the decision maker can view all rewards, and the multi-armed bandits setting, where the decision maker can only view the reward of the chosen action. We develop practical algorithms with provable regret guarantees, which depend on non-trivial graph-theoretic properties of the information feedback structure. We also provide partially-matching lower bounds. 1
5 0.22239968 25 nips-2011-Adaptive Hedge
Author: Tim V. Erven, Wouter M. Koolen, Steven D. Rooij, Peter Grünwald
Abstract: Most methods for decision-theoretic online learning are based on the Hedge algorithm, which takes a parameter called the learning rate. In most previous analyses the learning rate was carefully tuned to obtain optimal worst-case performance, leading to suboptimal performance on easy instances, for example when there exists an action that is significantly better than all others. We propose a new way of setting the learning rate, which adapts to the difficulty of the learning problem: in the worst case our procedure still guarantees optimal performance, but on easy instances it achieves much smaller regret. In particular, our adaptive method achieves constant regret in a probabilistic setting, when there exists an action that on average obtains strictly smaller loss than all other actions. We also provide a simulation study comparing our approach to existing methods. 1
6 0.20362526 61 nips-2011-Contextual Gaussian Process Bandit Optimization
7 0.20205085 220 nips-2011-Prediction strategies without loss
8 0.18431653 56 nips-2011-Committing Bandits
9 0.17303969 80 nips-2011-Efficient Online Learning via Randomized Rounding
10 0.14399269 32 nips-2011-An Empirical Evaluation of Thompson Sampling
11 0.14386088 145 nips-2011-Learning Eigenvectors for Free
12 0.14245763 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
13 0.14191304 109 nips-2011-Greedy Model Averaging
14 0.137999 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning
15 0.13393596 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
16 0.13341366 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
17 0.1286989 272 nips-2011-Stochastic convex optimization with bandit feedback
18 0.12698492 97 nips-2011-Finite Time Analysis of Stratified Sampling for Monte Carlo
19 0.11499006 215 nips-2011-Policy Gradient Coagent Networks
20 0.11418192 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
topicId topicWeight
[(0, 0.242), (1, -0.418), (2, 0.066), (3, 0.104), (4, 0.153), (5, -0.026), (6, 0.041), (7, 0.092), (8, 0.05), (9, 0.022), (10, -0.072), (11, -0.004), (12, -0.061), (13, -0.057), (14, -0.002), (15, -0.058), (16, -0.055), (17, -0.077), (18, -0.004), (19, 0.047), (20, -0.021), (21, 0.019), (22, 0.021), (23, 0.146), (24, 0.065), (25, 0.081), (26, 0.061), (27, 0.043), (28, -0.01), (29, -0.006), (30, -0.062), (31, -0.068), (32, -0.048), (33, -0.073), (34, 0.078), (35, -0.044), (36, 0.012), (37, -0.079), (38, 0.048), (39, 0.012), (40, 0.04), (41, -0.046), (42, -0.001), (43, -0.039), (44, 0.035), (45, 0.004), (46, -0.072), (47, -0.024), (48, 0.044), (49, 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.96905833 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
Author: Yevgeny Seldin, Peter Auer, John S. Shawe-taylor, Ronald Ortner, François Laviolette
Abstract: We derive an instantaneous (per-round) data-dependent regret bound for stochastic multiarmed bandits with side information (also known as contextual bandits). The scaling of our regret bound with the number of states (contexts) N goes as √(N I_ρt(S; A)), where I_ρt(S; A) is the mutual information between states and actions (the side information) used by the algorithm at round t. If the algorithm uses all the side information, the regret bound scales as √(N ln K), where K is the number of actions (arms). However, if the side information I_ρt(S; A) is not fully used, the regret bound is significantly tighter. In the extreme case, when I_ρt(S; A) = 0, the dependence on the number of states reduces from linear to logarithmic. Our analysis allows us to provide the algorithm with a large amount of side information, lets the algorithm decide which side information is relevant for the task, and penalizes the algorithm only for the side information that it actually uses. We also present an algorithm for multiarmed bandits with side information with O(K) computational complexity per game round.
2 0.84692323 25 nips-2011-Adaptive Hedge
Author: Tim V. Erven, Wouter M. Koolen, Steven D. Rooij, Peter Grünwald
Abstract: Most methods for decision-theoretic online learning are based on the Hedge algorithm, which takes a parameter called the learning rate. In most previous analyses the learning rate was carefully tuned to obtain optimal worst-case performance, leading to suboptimal performance on easy instances, for example when there exists an action that is significantly better than all others. We propose a new way of setting the learning rate, which adapts to the difficulty of the learning problem: in the worst case our procedure still guarantees optimal performance, but on easy instances it achieves much smaller regret. In particular, our adaptive method achieves constant regret in a probabilistic setting, when there exists an action that on average obtains strictly smaller loss than all other actions. We also provide a simulation study comparing our approach to existing methods. 1
3 0.78479493 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations
Author: Shie Mannor, Ohad Shamir
Abstract: We consider an adversarial online learning setting where a decision maker can choose an action in every stage of the game. In addition to observing the reward of the chosen action, the decision maker gets side observations on the reward he would have obtained had he chosen some of the other actions. The observation structure is encoded as a graph, where node i is linked to node j if sampling i provides information on the reward of j. This setting naturally interpolates between the well-known “experts” setting, where the decision maker can view all rewards, and the multi-armed bandits setting, where the decision maker can only view the reward of the chosen action. We develop practical algorithms with provable regret guarantees, which depend on non-trivial graph-theoretic properties of the information feedback structure. We also provide partially-matching lower bounds. 1
4 0.75247389 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits
Author: Yasin Abbasi-yadkori, Csaba Szepesvári, David Tax
Abstract: We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem. In particular, we show that a simple modification of Auer’s UCB algorithm (Auer, 2002) achieves with high probability constant regret. More importantly, we modify and, consequently, improve the analysis of the algorithm for the for linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), Li et al. (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. In both cases, the improvement stems from the construction of smaller confidence sets. For their construction we use a novel tail inequality for vector-valued martingales. 1
5 0.74575013 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
Author: Odalric-ambrym Maillard, Daniil Ryabko, Rémi Munos
Abstract: The problem of selecting the right state-representation in a reinforcement learning problem is considered. Several models (functions mapping past observations to a finite set) of the observations are given, and it is known that for at least one of these models the resulting state dynamics are indeed Markovian. Without knowing neither which of the models is the correct one, nor what are the probabilistic characteristics of the resulting MDP, it is required to obtain as much reward as the optimal policy for the correct model (or for the best of the correct models, if there are several). We propose an algorithm that achieves that, with a regret of order T 2/3 where T is the horizon time. 1
6 0.70938373 56 nips-2011-Committing Bandits
7 0.67941046 32 nips-2011-An Empirical Evaluation of Thompson Sampling
8 0.67381918 272 nips-2011-Stochastic convex optimization with bandit feedback
9 0.61628157 109 nips-2011-Greedy Model Averaging
10 0.60700542 80 nips-2011-Efficient Online Learning via Randomized Rounding
11 0.60309702 220 nips-2011-Prediction strategies without loss
12 0.59926635 61 nips-2011-Contextual Gaussian Process Bandit Optimization
13 0.59051239 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
14 0.54364043 97 nips-2011-Finite Time Analysis of Stratified Sampling for Monte Carlo
15 0.50840288 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning
16 0.50302154 103 nips-2011-Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss
17 0.45968464 177 nips-2011-Multi-armed bandits on implicit metric spaces
18 0.44641295 305 nips-2011-k-NN Regression Adapts to Local Intrinsic Dimension
19 0.44004208 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
20 0.43761674 145 nips-2011-Learning Eigenvectors for Free
topicId topicWeight
[(0, 0.014), (4, 0.034), (20, 0.055), (26, 0.059), (31, 0.124), (33, 0.019), (43, 0.082), (45, 0.134), (57, 0.021), (74, 0.051), (79, 0.014), (83, 0.041), (87, 0.207), (99, 0.053)]
simIndex simValue paperId paperTitle
1 0.92195112 230 nips-2011-RTRMC: A Riemannian trust-region method for low-rank matrix completion
Author: Nicolas Boumal, Pierre-antoine Absil
Abstract: We consider large matrices of low rank. We address the problem of recovering such matrices when most of the entries are unknown. Matrix completion finds applications in recommender systems. In this setting, the rows of the matrix may correspond to items and the columns may correspond to users. The known entries are the ratings given by users to some items. The aim is to predict the unobserved ratings. This problem is commonly stated in a constrained optimization framework. We follow an approach that exploits the geometry of the low-rank constraint to recast the problem as an unconstrained optimization problem on the Grassmann manifold. We then apply first- and second-order Riemannian trust-region methods to solve it. The cost of each iteration is linear in the number of known entries. Our methods, RTRMC 1 and 2, outperform state-of-the-art algorithms on a wide range of problem instances. 1
2 0.86686766 17 nips-2011-Accelerated Adaptive Markov Chain for Partition Function Computation
Author: Stefano Ermon, Carla P. Gomes, Ashish Sabharwal, Bart Selman
Abstract: We propose a novel Adaptive Markov Chain Monte Carlo algorithm to compute the partition function. In particular, we show how to accelerate a flat histogram sampling technique by significantly reducing the number of “null moves” in the chain, while maintaining asymptotic convergence properties. Our experiments show that our method converges quickly to highly accurate solutions on a range of benchmark instances, outperforming other state-of-the-art methods such as IJGP, TRW, and Gibbs sampling both in run-time and accuracy. We also show how obtaining a so-called density of states distribution allows for efficient weight learning in Markov Logic theories. 1
3 0.86490566 153 nips-2011-Learning large-margin halfspaces with more malicious noise
Author: Phil Long, Rocco Servedio
Abstract: We describe a simple algorithm that runs in time poly(n, 1/γ, 1/ε) and learns an unknown n-dimensional γ-margin halfspace to accuracy 1 − ε in the presence of malicious noise, when the noise rate is allowed to be as high as Θ(εγ log(1/γ)). Previous efficient algorithms could only learn to accuracy ε in the presence of malicious noise of rate at most Θ(εγ). Our algorithm does not work by optimizing a convex loss function. We show that no algorithm for learning γ-margin halfspaces that minimizes a convex proxy for misclassification error can tolerate malicious noise at a rate greater than Θ(εγ); this may partially explain why previous algorithms could not achieve the higher noise tolerance of our new algorithm. 1
same-paper 4 0.8409934 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
Author: Yevgeny Seldin, Peter Auer, John S. Shawe-taylor, Ronald Ortner, François Laviolette
Abstract: We derive an instantaneous (per-round) data-dependent regret bound for stochastic multiarmed bandits with side information (also known as contextual bandits). The scaling of our regret bound with the number of states (contexts) N goes as √(N I_ρt(S; A)), where I_ρt(S; A) is the mutual information between states and actions (the side information) used by the algorithm at round t. If the algorithm uses all the side information, the regret bound scales as √(N ln K), where K is the number of actions (arms). However, if the side information I_ρt(S; A) is not fully used, the regret bound is significantly tighter. In the extreme case, when I_ρt(S; A) = 0, the dependence on the number of states reduces from linear to logarithmic. Our analysis allows us to provide the algorithm with a large amount of side information, lets the algorithm decide which side information is relevant for the task, and penalizes the algorithm only for the side information that it actually uses. We also present an algorithm for multiarmed bandits with side information with O(K) computational complexity per game round.
5 0.74794233 29 nips-2011-Algorithms and hardness results for parallel large margin learning
Author: Phil Long, Rocco Servedio
Abstract: We study the fundamental problem of learning an unknown large-margin halfspace in the context of parallel computation. Our main positive result is a parallel algorithm for learning a large-margin halfspace that is based on interior point methods from convex optimization and fast parallel algorithms for matrix computations. We show that this algorithm learns an unknown γ-margin halfspace over n dimensions using poly(n, 1/γ) processors ˜ and runs in time O(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have Ω(1/γ 2 ) runtime dependence on γ. Our main negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We give an information-theoretic proof that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized: the ability to call the weak learner multiple times in parallel within a single boosting stage does not reduce the overall number of successive stages of boosting that are required. 1
6 0.73128623 180 nips-2011-Multiple Instance Filtering
7 0.7297979 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning
8 0.72397381 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
9 0.72310925 258 nips-2011-Sparse Bayesian Multi-Task Learning
10 0.72066969 229 nips-2011-Query-Aware MCMC
11 0.72059989 25 nips-2011-Adaptive Hedge
12 0.72029024 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
13 0.71873778 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
14 0.71866453 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines
15 0.71851957 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning
16 0.71830195 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
17 0.71694767 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
18 0.71681118 231 nips-2011-Randomized Algorithms for Comparison-based Search
19 0.7156359 271 nips-2011-Statistical Tests for Optimization Efficiency
20 0.71537709 197 nips-2011-On Tracking The Partition Function