nips nips2011 nips2011-210 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yevgeny Seldin, Peter Auer, John S. Shawe-taylor, Ronald Ortner, François Laviolette
Abstract: We derive an instantaneous (per-round) data-dependent regret bound for stochastic multiarmed bandits with side information (also known as contextual bandits). The scaling of our regret bound with the number of states (contexts) N goes as √(N I_ρt(S; A)), where I_ρt(S; A) is the mutual information between states and actions (the side information) used by the algorithm at round t. If the algorithm uses all the side information, the regret bound scales as √(N ln K), where K is the number of actions (arms). However, if the side information I_ρt(S; A) is not fully used, the regret bound is significantly tighter. In the extreme case, when I_ρt(S; A) = 0, the dependence on the number of states reduces from linear to logarithmic. Our analysis allows us to provide the algorithm with a large amount of side information, lets the algorithm decide which side information is relevant for the task, and penalizes the algorithm only for the side information that it actually uses. We also present an algorithm for multiarmed bandits with side information with O(K) computational complexity per game round.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We derive an instantaneous (per-round) data-dependent regret bound for stochastic multiarmed bandits with side information (also known as contextual bandits). [sent-12, score-1.405]
2 The scaling of our regret bound with the number of states (contexts) N goes as √(N I_ρt(S; A)), where I_ρt(S; A) is the mutual information between states and actions (the side information) used by the algorithm at round t. [sent-13, score-1.121]
3 If the algorithm uses all the side information, the regret bound scales as √(N ln K), where K is the number of actions (arms). [sent-14, score-1.016]
4 However, if the side information I⇢t (S; A) is not fully used, the regret bound is significantly tighter. [sent-15, score-0.638]
5 Our analysis allows us to provide the algorithm with a large amount of side information, lets the algorithm decide which side information is relevant for the task, and penalizes the algorithm only for the side information that it actually uses. [sent-17, score-0.624]
6 We also present an algorithm for multiarmed bandits with side information with O(K) computational complexity per game round. [sent-18, score-0.903]
7 1 Introduction Multiarmed bandits with side information are an elegant mathematical model for many real-life interactive systems, such as personalized online advertising, personalized medical treatment, and so on. [sent-19, score-0.55]
8 This model is also known as contextual bandits or associative bandits (Kaelbling, 1994, Strehl et al. [sent-20, score-0.718]
9 In multiarmed bandits with side information the learner repeatedly observes states (side information) {s1 , s2 , . [sent-23, score-0.929]
10 } (for example, symptoms of a patient) and has to perform actions (for example, prescribe drugs), such that the expected regret is minimized. [sent-26, score-0.482]
11 Most of the existing analyses of multiarmed bandits with side information have focused on the adversarial (worst-case) model, where the sequence of rewards associated with each state-action pair is chosen by an adversary. [sent-28, score-0.926]
12 We derive a data-dependent analysis for stochastic multiarmed bandits with side information. [sent-30, score-0.866]
13 We restrict ourselves to problems with a finite number of states N and a finite number of actions K and leave the generalization to continuous state and action spaces to future work. [sent-33, score-0.39]
14 1 The result presented in this paper exhibits adaptive dependency on the side information (state identity) that is actually used by the algorithm. [sent-37, score-0.235]
15 This allows us to provide the algorithm with a large amount of side information and lets the algorithm decide which part of this side information is actually relevant to the task. [sent-38, score-0.416]
16 For example, in online advertising we can increase the state resolution and provide the algorithm with the town from which the web page was accessed, but if this refined state information is not used by the algorithm the regret bound will not deteriorate. [sent-39, score-0.616]
17 This can be opposed to existing analysis of adversarial multiarmed bandits, where the regret bound depends on a predefined complexity of the underlying expert class (Beygelzimer et al. [sent-40, score-0.885]
18 Thus, the existing analysis of adversarial multiarmed bandits would either become looser if we add more side information or a-priori limit the usage of the side information through its internal structure. [sent-42, score-1.104]
19 (We note that through the relation between PAC-Bayesian analysis and the analysis of adversarial online learning described in Banerjee (2006) it might be possible to extend our analysis to the adversarial setting, but we leave this research direction to future work. [sent-43, score-0.192]
20 ) The idea of regularization by relevant mutual information goes back to the Information Bottleneck principle in supervised and unsupervised learning (Tishby et al. [sent-44, score-0.208]
21 Tishby and Polani (2010) further suggested to measure the complexity of a policy in reinforcement learning by the mutual information between states and actions used by the policy. [sent-46, score-0.506]
22 We note, however, that our starting point is the regret bound and we derive the regularization term from our analysis without introducing it a-priori. [sent-47, score-0.479]
23 Unlike in VC-dimension-based approaches and their successors, where the complexity is defined for a hypothesis class, in PAC-Bayesian analysis the complexity is defined for individual hypotheses. [sent-52, score-0.183]
24 In supervised learning every hypothesis in a hypothesis class can be evaluated on all the samples, whereas in reinforcement learning rewards of one action cannot be used to evaluate another action. [sent-55, score-0.452]
25 Here, we apply this generalization to multiarmed bandits with side information. [sent-58, score-0.766]
26 We start with definitions in Section 2 and provide our main results in Section 3, which include an instantaneous regret bound and a new algorithm for stochastic multiarmed bandits with side information. [sent-60, score-1.296]
27 We start with the definition of stochastic multiarmed bandits with side information. [sent-64, score-0.817]
28 Let S be a set of |S| = N states and let A be a set of |A| = K actions, such that any action can be performed in any state. [sent-65, score-0.225]
29 Let R(a, s) be the expected reward for performing action a in state s. [sent-67, score-0.247]
30 At each round t of the game the learner is presented a state St drawn i.i.d. according to an unknown distribution p(s). [sent-68, score-0.243]
31 The learner draws an action At according to his choice of a distribution (policy) πt(a|s) and obtains a stochastic reward Rt with expected value R(At, St). [sent-72, score-0.295]
32 For every state s we define the “best” action in that state as a*(s) = arg max_a R(a, s) (if there are multiple “best” actions, one of them is chosen arbitrarily). [sent-117, score-0.241]
33 We then define the expected and empirical regret for performing any other action a in state s as: Δ(a, s) = R(a*(s), s) − R(a, s) and Δ̂_t(a, s) = R̂_t(a*(s), s) − R̂_t(a, s). [sent-118, score-0.524]
34 Let p̂_t(s) = n_t(s)/t be the empirical distribution over states observed up to time t. [sent-119, score-0.18]
35 For any policy ρ(a|s) we define the empirical reward, empirical regret, and expected regret of the policy as: R̂_t(ρ) = Σ_s p̂_t(s) Σ_a ρ(a|s) R̂_t(a, s), Δ̂_t(ρ) = Σ_s p̂_t(s) Σ_a ρ(a|s) Δ̂_t(a, s), and Δ(ρ) = Σ_s p(s) Σ_a ρ(a|s) Δ(a, s). [sent-120, score-0.537]
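As a small illustration (ours, not from the paper), the three policy-level quantities above are the same weighted sum with different tables plugged in; the array names below are hypothetical.

```python
import numpy as np

def policy_average(rho, table, state_dist):
    """Compute sum_s p(s) sum_a rho(a|s) * table[s, a].

    rho:        (N, K) array, rho[s, a] = probability of action a in state s
    table:      (N, K) array of per-(state, action) values, e.g. the empirical
                rewards R_hat_t(a, s), the empirical regrets Delta_hat_t(a, s),
                or the expected regrets Delta(a, s)
    state_dist: (N,) distribution over states, e.g. p_hat_t or the true p
    """
    return float(state_dist @ np.sum(rho * table, axis=1))
```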
36 Denote by a = h(s) the action assigned by hypothesis h to state s. [sent-125, score-0.281]
37 Denote by R(h) = Σ_{s∈S} p(s) R(h(s), s) the expected reward of a hypothesis h. [sent-127, score-0.168]
38 Let h* = arg max_{h∈H} R(h) be the “best” hypothesis (the one that chooses the “best” action in each state). [sent-130, score-0.22]
39 Any policy ρ(a|s) defines a distribution over H: we can draw an action a for each state s according to ρ(a|s) and thus obtain a hypothesis h ∈ H. [sent-133, score-0.401]
40 Finally, let n_h(a) = Σ_{s=1}^N 1{h(s)=a} be the number of states in which action a is played by the hypothesis h. [sent-137, score-0.446]
41 Let A^h = {n_h(a)/N}_{a∈A} be the normalized cardinality profile (histogram) over the actions played by hypothesis h (with respect to the uniform distribution over S). [sent-138, score-0.375]
42 In other words, H(A^h) = −Σ_a (n_h(a)/N) ln(n_h(a)/N) is the entropy of the action choice of hypothesis h (with respect to the uniform distribution over S). [sent-140, score-0.568]
43 Note that the optimal policy ρ*(a|s) (the one that selects the “best” action in each state) is deterministic and we have I_ρ*(S; A) = H(A^{h*}). [sent-141, score-0.214]
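A minimal sketch (ours, not the paper's code) of the two complexity measures used above: the entropy H(A^h) of a deterministic hypothesis' action histogram, and the mutual information I_ρ(S; A) of a stochastic policy under a given state distribution.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def hypothesis_entropy(h, K):
    """H(A^h) for a deterministic hypothesis h (length-N array of action indices)."""
    counts = np.bincount(h, minlength=K).astype(float)
    return entropy(counts / len(h))

def mutual_information(rho, state_dist):
    """I_rho(S; A) = H(A) - H(A|S) for a policy rho[s, a] and a state distribution."""
    marginal = state_dist @ rho                              # rho(a) = sum_s p(s) rho(a|s)
    cond = float(state_dist @ np.apply_along_axis(entropy, 1, rho))
    return entropy(marginal) - cond

# For a deterministic policy H(A|S) = 0, so under the uniform state distribution
# the mutual information equals the entropy of the cardinality profile, as noted above.
```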
44 3 Main Results Our main result is a data- and complexity-dependent regret bound for a general class of prediction strategies of a smoothed exponential form. [sent-142, score-0.459]
45 Let ρ_t(a) be an arbitrary distribution over actions, let ρ_t^exp(a|s) = ρ_t(a) e^{γ_t R̂_t(a,s)} / Z(ρ_t^exp, s) (1), where Z(ρ_t^exp, s) = Σ_a ρ_t(a) e^{γ_t R̂_t(a,s)} is a normalization factor, and let ρ̃_t(a|s) = (1 − K ε_{t+1}) ρ_t^exp(a|s) + ε_{t+1} (2) be a smoothed exponential policy. [sent-143, score-0.179]
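A sketch of the smoothed exponential policy (1)-(2) for a single state; γ_t is assumed to be the exponentiation (learning-rate) parameter, whose symbol is garbled in the extracted text, and all variable names are ours.

```python
import numpy as np

def smoothed_exp_policy(rho_t, R_hat_s, gamma_t, eps_next):
    """rho_t: (K,) marginal over actions; R_hat_s: (K,) reward estimates R_hat_t(a, s)."""
    w = rho_t * np.exp(gamma_t * R_hat_s)
    rho_exp = w / w.sum()                                    # equation (1)
    K = len(w)
    assert K * eps_next <= 1.0, "smoothing requires K * eps_{t+1} <= 1"
    return (1.0 - K * eps_next) * rho_exp + eps_next         # equation (2)
```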
46 The following theorem provides a regret bound for playing ρ̃_t at round t + 1 of the game. [sent-144, score-0.551]
47 It also provides an instantaneous regret bound for any algorithm that plays the policies {ρ̃_1^exp, ρ̃_2^exp, …}. [sent-165, score-0.537]
48 Then, by the union bound, the result holds with probability greater than 1 − δ for all rounds of the game simultaneously. [sent-173, score-0.186]
49 This leads to Algorithm 1 for stochastic multiarmed bandits with side information. [sent-174, score-0.817]
50 Theorem 1 is based on the following regret decomposition and the subsequent theorem and two lemmas that bound the three terms in the decomposition. [sent-176, score-0.515]
51 , π_t} and c, simultaneously for all policies ρ that satisfy (3), with probability greater than 1 − δ: Δ(ρ) − Δ̂_t(ρ) ≤ (1 + c) √( 2(e − 2)(N I_ρ(S; A) + K(ln N + ln K) + ln(2m_t/δ)) / (t ε_t) ), (6) Algorithm 1: Algorithm for stochastic contextual bandits. [sent-182, score-0.79]
52 n(S_t) ← n(S_t) + 1; R̂(A_t, S_t) ← R̂(A_t, S_t) + R_t / ρ̃(A_t|S_t); ρ_{t+1}(a) ← ((N − 1)/N) ρ_t(a) + (1/N) ρ_t(a|S_t); and for all ρ that do not satisfy (3), with the same probability: Δ(ρ) − Δ̂_t(ρ) ≤ 2(N I_ρ(S; A) + K(ln N + ln K) + ln(2m_t/δ)) / t. [sent-187, score-0.596]
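A hedged reconstruction of one round of Algorithm 1, as far as it can be read from the garbled pseudocode above: play the smoothed exponential policy in the observed state, form an importance-weighted reward estimate, and update the running counts and the running marginal over actions. All names are ours and the update details are assumptions, not a verbatim transcription.

```python
import numpy as np

def play_round(state, env_reward, R_hat, n_state, rho_marginal,
               gamma_t, eps_next, rng):
    """One round; the work per round is O(K).

    R_hat:        (N, K) table of importance-weighted reward estimates
    n_state:      (N,) visit counts n(s)
    rho_marginal: (K,) running estimate of the marginal distribution over actions
    """
    N, K = R_hat.shape
    # Smoothed exponential policy for the observed state, equations (1)-(2);
    # assumes K * eps_next <= 1.
    w = rho_marginal * np.exp(gamma_t * R_hat[state])
    policy = (1.0 - K * eps_next) * (w / w.sum()) + eps_next
    action = rng.choice(K, p=policy)
    reward = env_reward(state, action)                       # stochastic reward R_t

    n_state[state] += 1
    # EXP3-style importance-weighted estimate: only the played action is updated.
    R_hat[state, action] += reward / policy[action]
    # Running-average approximation of the marginal over actions (see the
    # running-average discussion below).
    rho_marginal[:] = (1.0 - 1.0 / N) * rho_marginal + (1.0 / N) * policy
    return action, reward
```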
53 For any distribution ρ_t^exp of the form (1), where ρ_t(a) ≥ ε for all a, we have: Δ̂_t(ρ_t^exp) ≤ (1/γ_t) ln(1/ε). [sent-190, score-0.323]
54 Theorem 1 exhibits what we were looking for: the regret of a policy ρ̃_t^exp depends on the trade-off between its complexity, N I_{ρ_t^exp}(S; A), and the empirical regret, which is bounded by (1/γ_t) ln(1/ε_{t+1}). [sent-195, score-0.764]
55 We note that 0 ≤ I_ρt(S; A) ≤ ln K; hence, the result is interesting when N ≫ K, since otherwise the K ln K term in the bound neutralizes the advantage we get from having small mutual information values. [sent-196, score-0.808]
56 We believe that the dependence of the first term of the regret bound (4) on ε_t is an artifact of our crude upper bound on the variance of the sampling process (given in Lemma 3 in the proof of Theorem 2) and that this term should not be in the bound. [sent-198, score-0.629]
57 This is supported by an empirical study of stochastic multiarmed bandits (Seldin et al. [sent-199, score-0.679]
58 With the current bound the best choice for ε_t is ε_t = (Kt)^{-1/3}, which, by integration over the game rounds, yields an O(K^{1/3} t^{2/3}) dependence of the cumulative regret on the number of arms and game rounds. [sent-201, score-0.779]
59 However, if we manage to derive a tighter analysis and remove ε_t from the first term in (4), the best choice of ε_t will be ε_t = (Kt)^{-1/2} and the dependence of the cumulative regret on the number of arms and time horizon will improve to O((Kt)^{1/2}). [sent-202, score-0.611]
60 We note that although the UCB algorithm for stochastic multiarmed bandits (Auer et al. [sent-209, score-0.679]
61 , 2002a) is asymptotically better than the EXP3 algorithm, it is not compatible with PAC-Bayesian analysis and we are not aware of a way to derive a UCB-type algorithm and analysis for multiarmed bandits with side information, whose dependence on the number of states would be better than O(N ln K). [sent-210, score-1.284]
62 Generally, higher values of γ_t decrease the second term of the bound, but also lead to more concentrated policies (conditional distributions) ρ_t^exp(a|s) and thus higher mutual information values I_{ρ_t^exp}(S; A). [sent-214, score-0.184]
63 This can be approximated by taking the value of mutual information from the previous round (or approximation of the value of mutual information from the previous round). [sent-216, score-0.317]
64 By the regret decomposition (5) and Theorem 2, the regret at round t + 1 is minimized by a policy ρ_t(a|s) that minimizes a certain trade-off between the mutual information I_ρ(S; A) and the empirical reward R̂_t(ρ). [sent-219, score-1.318]
65 Running a similar type of iterations in our case would be prohibitively expensive, since they require iteration over all states s ∈ S at each round of the game. [sent-222, score-0.171]
66 We approximate these iterations by approximating the marginal distribution over the actions by a running average: ρ̃_{t+1}^exp(a) = ((N − 1)/N) ρ̃_t^exp(a) + (1/N) ρ_t^exp(a|S_t). [sent-223, score-0.283]
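A small self-contained check (ours) that this running average tracks the exact marginal Σ_s p(s) ρ^exp(a|s) when the conditional policy is held fixed; agreement is up to the averaging noise inherent in the (N−1)/N, 1/N weights.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 100, 20
p = np.full(N, 1.0 / N)                     # uniform distribution over states
rho = rng.dirichlet(np.ones(K), size=N)     # a fixed conditional policy rho(a|s)

running = np.full(K, 1.0 / K)
for _ in range(50_000):
    s = rng.choice(N, p=p)
    running = (1.0 - 1.0 / N) * running + (1.0 / N) * rho[s]

exact = p @ rho                             # the exact marginal over actions
print(np.abs(running - exact).max())        # small relative to the entries of `exact`
```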
67 If these independent multiarmed bandits independently converge to similar strategies, we still get a tighter regret bound. [sent-227, score-0.963]
68 This happens because the corresponding subspace of the hypothesis space is significantly smaller than the total hypothesis space, which enables us to put a higher prior on it (Seldin and Tishby, 2010). [sent-228, score-0.202]
69 We are not aware of algorithms for stochastic multiarmed bandits with side information. [sent-231, score-0.817]
70 The best algorithm known to us for adversarial multiarmed bandits with side information is EXP4. [sent-232, score-0.874]
71 P has O(√(Kt ln |H|)) regret and O(K|H|) complexity per game round. [sent-236, score-0.755]
72 P would have O(√(KtN ln K)) regret and O(K^{N+1}) computational complexity. [sent-238, score-0.642]
73 For hard problems, where all side information has to be used, our regret bound is inferior to the regret bound of Beygelzimer et al. [sent-239, score-1.114]
74 For simple problems the dependence of our regret bound on the number of states is significantly better, up to the point that when the side information is irrelevant for the task we can get an O(√(K ln N)) dependence on the number of states versus O(√(N ln K)) in EXP4. [sent-242, score-1.528]
75 For N ≫ K this leads to tighter regret bounds for small t even despite the “incorrect” dependence on t of our bound, and if we improve the analysis it will lead to tighter regret bounds for all t. [sent-244, score-0.827]
76 As we already said, our algorithm is able to filter relevant information from large amounts of side information automatically, whereas in EXP4. [sent-245, score-0.232]
77 P the usage of side information has to be restricted externally through the construction of the hypothesis class. [sent-246, score-0.331]
78 “Baseline” in the first graph corresponds to playing N independent multiarmed bandits, one in each state. [sent-254, score-0.296]
79 We take N = 100, K = 20, and a uniform distribution over states (p(s) = 0.01). [sent-259, score-0.156]
80 In the first case, the same action is the best in all states (and hence H(A^{h*}) = 0 for the optimal hypothesis h*). [sent-261, score-0.35]
81 In the second case, for the first 33 states the best action is number 1, for the next 33 states the best action is number 2, and for the remaining third of the states the best action is number 3 (thus, depending on the state, one of three actions is the “best” and H(A^{h*}) = ln 3). [sent-262, score-0.851]
82 In the last case, there are 20 groups of 5 states and each of K = 20 actions is the best in exactly one of the 20 groups. [sent-264, score-0.234]
83 For all states, the reward of the best action in a state has a Bernoulli distribution with bias 0.6, [sent-265, score-0.296]
84 and the rewards of all other actions in that state have a Bernoulli distribution with bias 0.5. [sent-266, score-0.239]
85 We run the experiment for T = 4,000,000 rounds and calculate the cumulative regret Δ(t) = Σ_{τ=1}^t Δ(ρ̃_τ^exp) and the instantaneous regret bound given in (4). [sent-268, score-0.98]
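A sketch of the experimental environment described above; the Bernoulli biases 0.6/0.5, the three cases, and N = 100, K = 20 are taken from the text, while the scaffolding and names are ours.

```python
import numpy as np

def make_environment(case, N=100, K=20, seed=0):
    """Return (best_action array, reward sampler) for one of the three test cases."""
    rng = np.random.default_rng(seed)
    if case == "one_best":            # same action best in all states, H(A^{h*}) = 0
        best = np.zeros(N, dtype=int)
    elif case == "three_best":        # thirds of the states, H(A^{h*}) = ln 3
        best = np.repeat([0, 1, 2], [33, 33, 34])
    elif case == "all_best":          # 20 groups of 5 states, one best action per group
        best = np.repeat(np.arange(K), N // K)
    else:
        raise ValueError(case)

    def reward(state, action):
        bias = 0.6 if action == best[state] else 0.5
        return float(rng.random() < bias)   # Bernoulli reward

    return best, reward

# States are drawn uniformly (p(s) = 1/N); the expected per-round regret is
# 0.1 times the probability of playing a suboptimal action in the drawn state.
```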
86 As we can see from the graphs (see Figure 1), the algorithm exhibits sublinear cumulative regret (pay attention to the axes' scales). [sent-270, score-0.451]
87 Furthermore, for simple problems (with small H(A^{h*})) the regret grows more slowly than for complex problems. [sent-271, score-0.344]
88 Figure 1.a shows the performance of an algorithm with the same parameter values that runs N multiarmed bandits, one in each state, independently of the other states. [sent-273, score-0.357]
89 Let V_i(h) = Σ_{j=1}^i E[Z_j(h)^2 | U_1, …, U_{j−1}] be the cumulative variances of the martingales M_i(h). [sent-295, score-0.191]
90 For a distribution ρ over H define M_i(ρ) = E_{ρ(h)}[M_i(h)] and V_t(ρ) = E_{ρ(h)}[V_t(h)] as weighted averages of the martingales and their cumulative variances according to the distribution ρ. [sent-296, score-0.241]
91 Note that M_t(h) = t(Δ(h) − Δ̂_t(h)) are martingales, and their cumulative variances are V_t(h) = Σ_{τ=1}^t E[([R_τ^{h*(S_τ), S_τ} − R_τ^{h(S_τ), S_τ}] − [R(h*) − R(h)])^2 | T_{τ−1}]. [sent-302, score-0.191]
92 In order to apply Theorem 3 we have to derive an upper bound on V_t(ρ_t^exp), a prior µ(h) over H, and calculate (or upper bound) the KL divergence KL(ρ_t^exp‖µ). [sent-303, score-0.165]
93 However, we were not yet able to prove that the probability ρ_t^exp(h) of the remaining hypotheses (those with large Δ(h)) gets sufficiently small (of order Kε_t), so that the weighted cumulative variance would be of order 2Kt. [sent-313, score-0.234]
94 Improving the upper bound on V_t(ρ_t^exp) will improve the regret bound, but for the moment we present the regret bound based on the crude upper bound V_t(ρ_t^exp) ≤ 2t/ε_t. [sent-316, score-1.045]
95 It is possible to define a distribution µ over H that satisfies: µ(h) ≥ e^{−N H(A^h) − K ln N − K ln K}. [sent-319, score-0.621]
96 For the distribution µ that satisfies (8) and any distribution ρ(a|s): KL(ρ‖µ) ≤ N I_ρ(S; A) + K ln N + K ln K. [sent-321, score-0.646]
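A brute-force sanity check (ours) of the prior construction in (8): the weights e^{−N H(A^h) − K ln N − K ln K} sum to at most one over all K^N hypotheses, so a prior µ satisfying the stated lower bound exists. Feasible only for tiny N and K.

```python
import itertools, math
import numpy as np

def check_prior_normalization(N, K):
    """Return (sum_h e^{-N H(A^h)}, e^{K ln N + K ln K}); the first should not exceed the second."""
    total = 0.0
    for h in itertools.product(range(K), repeat=N):
        freq = np.bincount(h, minlength=K) / N
        H = -sum(f * math.log(f) for f in freq if f > 0)
        total += math.exp(-N * H)
    return total, math.exp(K * (math.log(N) + math.log(K)))

lhs, rhs = check_prior_normalization(N=5, K=3)
print(lhs <= rhs)   # True, so mu(h) >= e^{-N H(A^h) - K ln N - K ln K} can be normalized
```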
97 6 Discussion We presented a PAC-Bayesian analysis of stochastic multiarmed bandits with side information. [sent-323, score-0.841]
98 Our analysis provides a data-dependent algorithm and a data-dependent regret analysis for this problem. [sent-324, score-0.392]
99 The selection of task-relevant side information is delegated from the user to the algorithm. [sent-325, score-0.208]
100 PAC-Bayes-Bernstein inequality for martingales and its application to multiarmed bandits. [sent-367, score-0.382]
wordName wordTfidf (topN-words)
[('ah', 0.346), ('regret', 0.344), ('ln', 0.298), ('multiarmed', 0.296), ('bandits', 0.286), ('seldin', 0.272), ('side', 0.184), ('rt', 0.173), ('exp', 0.154), ('st', 0.13), ('action', 0.119), ('states', 0.106), ('actions', 0.104), ('mutual', 0.102), ('hypothesis', 0.101), ('tishby', 0.095), ('policy', 0.095), ('vt', 0.095), ('martingales', 0.086), ('bound', 0.086), ('game', 0.084), ('yevgeny', 0.084), ('ba', 0.083), ('cumulative', 0.08), ('rounds', 0.077), ('nicol', 0.072), ('tt', 0.071), ('beygelzimer', 0.069), ('reward', 0.067), ('round', 0.065), ('kl', 0.063), ('played', 0.063), ('state', 0.061), ('contextual', 0.06), ('auer', 0.06), ('adversarial', 0.06), ('john', 0.058), ('policies', 0.058), ('nh', 0.057), ('theorem', 0.056), ('stochastic', 0.051), ('naftali', 0.051), ('francois', 0.051), ('pt', 0.049), ('instantaneous', 0.049), ('rewards', 0.049), ('nitions', 0.049), ('peter', 0.048), ('et', 0.046), ('reinforcement', 0.046), ('crude', 0.045), ('kt', 0.044), ('zi', 0.044), ('robert', 0.042), ('polani', 0.042), ('nt', 0.041), ('dependence', 0.041), ('advertising', 0.04), ('associative', 0.04), ('tighter', 0.037), ('bandit', 0.037), ('ert', 0.037), ('laviolette', 0.037), ('cured', 0.037), ('arms', 0.036), ('supervised', 0.036), ('langford', 0.036), ('vn', 0.035), ('lemma', 0.035), ('zj', 0.034), ('symptoms', 0.034), ('learner', 0.033), ('mi', 0.032), ('hardest', 0.032), ('leslie', 0.032), ('comments', 0.032), ('ui', 0.03), ('lemmas', 0.029), ('williamson', 0.029), ('complexity', 0.029), ('personalized', 0.028), ('strehl', 0.028), ('upper', 0.027), ('sequence', 0.027), ('exhibits', 0.027), ('uniform', 0.025), ('derive', 0.025), ('variances', 0.025), ('distribution', 0.025), ('martingale', 0.025), ('greater', 0.025), ('best', 0.024), ('ucb', 0.024), ('analysis', 0.024), ('information', 0.024), ('pac', 0.024), ('patients', 0.023), ('mt', 0.023), ('un', 0.022), ('usage', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
Author: Yevgeny Seldin, Peter Auer, John S. Shawe-taylor, Ronald Ortner, François Laviolette
Abstract: We derive an instantaneous (per-round) data-dependent regret bound for stochastic multiarmed bandits with side information (also known as contextual bandits). The scaling of our regret bound with the number of states (contexts) N goes as √(N I_ρt(S; A)), where I_ρt(S; A) is the mutual information between states and actions (the side information) used by the algorithm at round t. If the algorithm uses all the side information, the regret bound scales as √(N ln K), where K is the number of actions (arms). However, if the side information I_ρt(S; A) is not fully used, the regret bound is significantly tighter. In the extreme case, when I_ρt(S; A) = 0, the dependence on the number of states reduces from linear to logarithmic. Our analysis allows us to provide the algorithm with a large amount of side information, lets the algorithm decide which side information is relevant for the task, and penalizes the algorithm only for the side information that it actually uses. We also present an algorithm for multiarmed bandits with side information with O(K) computational complexity per game round.
2 0.29409346 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits
Author: Yasin Abbasi-yadkori, Csaba Szepesvári, David Tax
Abstract: We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem. In particular, we show that a simple modification of Auer’s UCB algorithm (Auer, 2002) achieves with high probability constant regret. More importantly, we modify and, consequently, improve the analysis of the algorithm for the for linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), Li et al. (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. In both cases, the improvement stems from the construction of smaller confidence sets. For their construction we use a novel tail inequality for vector-valued martingales. 1
3 0.26148203 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
Author: Odalric-ambrym Maillard, Daniil Ryabko, Rémi Munos
Abstract: The problem of selecting the right state-representation in a reinforcement learning problem is considered. Several models (functions mapping past observations to a finite set) of the observations are given, and it is known that for at least one of these models the resulting state dynamics are indeed Markovian. Without knowing neither which of the models is the correct one, nor what are the probabilistic characteristics of the resulting MDP, it is required to obtain as much reward as the optimal policy for the correct model (or for the best of the correct models, if there are several). We propose an algorithm that achieves that, with a regret of order T 2/3 where T is the horizon time. 1
4 0.258863 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations
Author: Shie Mannor, Ohad Shamir
Abstract: We consider an adversarial online learning setting where a decision maker can choose an action in every stage of the game. In addition to observing the reward of the chosen action, the decision maker gets side observations on the reward he would have obtained had he chosen some of the other actions. The observation structure is encoded as a graph, where node i is linked to node j if sampling i provides information on the reward of j. This setting naturally interpolates between the well-known “experts” setting, where the decision maker can view all rewards, and the multi-armed bandits setting, where the decision maker can only view the reward of the chosen action. We develop practical algorithms with provable regret guarantees, which depend on non-trivial graph-theoretic properties of the information feedback structure. We also provide partially-matching lower bounds. 1
5 0.22239968 25 nips-2011-Adaptive Hedge
Author: Tim V. Erven, Wouter M. Koolen, Steven D. Rooij, Peter Grünwald
Abstract: Most methods for decision-theoretic online learning are based on the Hedge algorithm, which takes a parameter called the learning rate. In most previous analyses the learning rate was carefully tuned to obtain optimal worst-case performance, leading to suboptimal performance on easy instances, for example when there exists an action that is significantly better than all others. We propose a new way of setting the learning rate, which adapts to the difficulty of the learning problem: in the worst case our procedure still guarantees optimal performance, but on easy instances it achieves much smaller regret. In particular, our adaptive method achieves constant regret in a probabilistic setting, when there exists an action that on average obtains strictly smaller loss than all other actions. We also provide a simulation study comparing our approach to existing methods. 1
6 0.20362526 61 nips-2011-Contextual Gaussian Process Bandit Optimization
7 0.20205085 220 nips-2011-Prediction strategies without loss
8 0.18431653 56 nips-2011-Committing Bandits
9 0.17303969 80 nips-2011-Efficient Online Learning via Randomized Rounding
10 0.14399269 32 nips-2011-An Empirical Evaluation of Thompson Sampling
11 0.14386088 145 nips-2011-Learning Eigenvectors for Free
12 0.14245763 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
13 0.14191304 109 nips-2011-Greedy Model Averaging
14 0.137999 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning
15 0.13393596 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
16 0.13341366 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
17 0.1286989 272 nips-2011-Stochastic convex optimization with bandit feedback
18 0.12698492 97 nips-2011-Finite Time Analysis of Stratified Sampling for Monte Carlo
19 0.11499006 215 nips-2011-Policy Gradient Coagent Networks
20 0.11418192 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
topicId topicWeight
[(0, 0.242), (1, -0.418), (2, 0.066), (3, 0.104), (4, 0.153), (5, -0.026), (6, 0.041), (7, 0.092), (8, 0.05), (9, 0.022), (10, -0.072), (11, -0.004), (12, -0.061), (13, -0.057), (14, -0.002), (15, -0.058), (16, -0.055), (17, -0.077), (18, -0.004), (19, 0.047), (20, -0.021), (21, 0.019), (22, 0.021), (23, 0.146), (24, 0.065), (25, 0.081), (26, 0.061), (27, 0.043), (28, -0.01), (29, -0.006), (30, -0.062), (31, -0.068), (32, -0.048), (33, -0.073), (34, 0.078), (35, -0.044), (36, 0.012), (37, -0.079), (38, 0.048), (39, 0.012), (40, 0.04), (41, -0.046), (42, -0.001), (43, -0.039), (44, 0.035), (45, 0.004), (46, -0.072), (47, -0.024), (48, 0.044), (49, 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.96905833 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
Author: Yevgeny Seldin, Peter Auer, John S. Shawe-taylor, Ronald Ortner, François Laviolette
Abstract: We derive an instantaneous (per-round) data-dependent regret bound for stochastic multiarmed bandits with side information (also known as contextual bandits). The scaling of our regret bound with the number of states (contexts) N goes as √(N I_ρt(S; A)), where I_ρt(S; A) is the mutual information between states and actions (the side information) used by the algorithm at round t. If the algorithm uses all the side information, the regret bound scales as √(N ln K), where K is the number of actions (arms). However, if the side information I_ρt(S; A) is not fully used, the regret bound is significantly tighter. In the extreme case, when I_ρt(S; A) = 0, the dependence on the number of states reduces from linear to logarithmic. Our analysis allows us to provide the algorithm with a large amount of side information, lets the algorithm decide which side information is relevant for the task, and penalizes the algorithm only for the side information that it actually uses. We also present an algorithm for multiarmed bandits with side information with O(K) computational complexity per game round.
2 0.84692323 25 nips-2011-Adaptive Hedge
Author: Tim V. Erven, Wouter M. Koolen, Steven D. Rooij, Peter Grünwald
Abstract: Most methods for decision-theoretic online learning are based on the Hedge algorithm, which takes a parameter called the learning rate. In most previous analyses the learning rate was carefully tuned to obtain optimal worst-case performance, leading to suboptimal performance on easy instances, for example when there exists an action that is significantly better than all others. We propose a new way of setting the learning rate, which adapts to the difficulty of the learning problem: in the worst case our procedure still guarantees optimal performance, but on easy instances it achieves much smaller regret. In particular, our adaptive method achieves constant regret in a probabilistic setting, when there exists an action that on average obtains strictly smaller loss than all other actions. We also provide a simulation study comparing our approach to existing methods. 1
3 0.78479493 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations
Author: Shie Mannor, Ohad Shamir
Abstract: We consider an adversarial online learning setting where a decision maker can choose an action in every stage of the game. In addition to observing the reward of the chosen action, the decision maker gets side observations on the reward he would have obtained had he chosen some of the other actions. The observation structure is encoded as a graph, where node i is linked to node j if sampling i provides information on the reward of j. This setting naturally interpolates between the well-known “experts” setting, where the decision maker can view all rewards, and the multi-armed bandits setting, where the decision maker can only view the reward of the chosen action. We develop practical algorithms with provable regret guarantees, which depend on non-trivial graph-theoretic properties of the information feedback structure. We also provide partially-matching lower bounds. 1
4 0.75247389 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits
Author: Yasin Abbasi-yadkori, Csaba Szepesvári, David Tax
Abstract: We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem. In particular, we show that a simple modification of Auer’s UCB algorithm (Auer, 2002) achieves with high probability constant regret. More importantly, we modify and, consequently, improve the analysis of the algorithm for the for linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), Li et al. (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. In both cases, the improvement stems from the construction of smaller confidence sets. For their construction we use a novel tail inequality for vector-valued martingales. 1
5 0.74575013 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
Author: Odalric-ambrym Maillard, Daniil Ryabko, Rémi Munos
Abstract: The problem of selecting the right state-representation in a reinforcement learning problem is considered. Several models (functions mapping past observations to a finite set) of the observations are given, and it is known that for at least one of these models the resulting state dynamics are indeed Markovian. Without knowing neither which of the models is the correct one, nor what are the probabilistic characteristics of the resulting MDP, it is required to obtain as much reward as the optimal policy for the correct model (or for the best of the correct models, if there are several). We propose an algorithm that achieves that, with a regret of order T 2/3 where T is the horizon time. 1
6 0.70938373 56 nips-2011-Committing Bandits
7 0.67941046 32 nips-2011-An Empirical Evaluation of Thompson Sampling
8 0.67381918 272 nips-2011-Stochastic convex optimization with bandit feedback
9 0.61628157 109 nips-2011-Greedy Model Averaging
10 0.60700542 80 nips-2011-Efficient Online Learning via Randomized Rounding
11 0.60309702 220 nips-2011-Prediction strategies without loss
12 0.59926635 61 nips-2011-Contextual Gaussian Process Bandit Optimization
13 0.59051239 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
14 0.54364043 97 nips-2011-Finite Time Analysis of Stratified Sampling for Monte Carlo
15 0.50840288 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning
16 0.50302154 103 nips-2011-Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss
17 0.45968464 177 nips-2011-Multi-armed bandits on implicit metric spaces
18 0.44641295 305 nips-2011-k-NN Regression Adapts to Local Intrinsic Dimension
19 0.44004208 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
20 0.43761674 145 nips-2011-Learning Eigenvectors for Free
topicId topicWeight
[(0, 0.014), (4, 0.034), (20, 0.055), (26, 0.059), (31, 0.124), (33, 0.019), (43, 0.082), (45, 0.134), (57, 0.021), (74, 0.051), (79, 0.014), (83, 0.041), (87, 0.207), (99, 0.053)]
simIndex simValue paperId paperTitle
1 0.92195112 230 nips-2011-RTRMC: A Riemannian trust-region method for low-rank matrix completion
Author: Nicolas Boumal, Pierre-antoine Absil
Abstract: We consider large matrices of low rank. We address the problem of recovering such matrices when most of the entries are unknown. Matrix completion finds applications in recommender systems. In this setting, the rows of the matrix may correspond to items and the columns may correspond to users. The known entries are the ratings given by users to some items. The aim is to predict the unobserved ratings. This problem is commonly stated in a constrained optimization framework. We follow an approach that exploits the geometry of the low-rank constraint to recast the problem as an unconstrained optimization problem on the Grassmann manifold. We then apply first- and second-order Riemannian trust-region methods to solve it. The cost of each iteration is linear in the number of known entries. Our methods, RTRMC 1 and 2, outperform state-of-the-art algorithms on a wide range of problem instances. 1
2 0.86686766 17 nips-2011-Accelerated Adaptive Markov Chain for Partition Function Computation
Author: Stefano Ermon, Carla P. Gomes, Ashish Sabharwal, Bart Selman
Abstract: We propose a novel Adaptive Markov Chain Monte Carlo algorithm to compute the partition function. In particular, we show how to accelerate a flat histogram sampling technique by significantly reducing the number of “null moves” in the chain, while maintaining asymptotic convergence properties. Our experiments show that our method converges quickly to highly accurate solutions on a range of benchmark instances, outperforming other state-of-the-art methods such as IJGP, TRW, and Gibbs sampling both in run-time and accuracy. We also show how obtaining a so-called density of states distribution allows for efficient weight learning in Markov Logic theories. 1
3 0.86490566 153 nips-2011-Learning large-margin halfspaces with more malicious noise
Author: Phil Long, Rocco Servedio
Abstract: We describe a simple algorithm that runs in time poly(n, 1/γ, 1/ε) and learns an unknown n-dimensional γ-margin halfspace to accuracy 1 − ε in the presence of malicious noise, when the noise rate is allowed to be as high as Θ(εγ log(1/γ)). Previous efficient algorithms could only learn to accuracy ε in the presence of malicious noise of rate at most Θ(εγ). Our algorithm does not work by optimizing a convex loss function. We show that no algorithm for learning γ-margin halfspaces that minimizes a convex proxy for misclassification error can tolerate malicious noise at a rate greater than Θ(εγ); this may partially explain why previous algorithms could not achieve the higher noise tolerance of our new algorithm. 1
same-paper 4 0.8409934 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
Author: Yevgeny Seldin, Peter Auer, John S. Shawe-taylor, Ronald Ortner, François Laviolette
Abstract: We derive an instantaneous (per-round) data-dependent regret bound for stochastic multiarmed bandits with side information (also known as contextual bandits). The scaling of our regret bound with the number of states (contexts) N goes as √(N I_ρt(S; A)), where I_ρt(S; A) is the mutual information between states and actions (the side information) used by the algorithm at round t. If the algorithm uses all the side information, the regret bound scales as √(N ln K), where K is the number of actions (arms). However, if the side information I_ρt(S; A) is not fully used, the regret bound is significantly tighter. In the extreme case, when I_ρt(S; A) = 0, the dependence on the number of states reduces from linear to logarithmic. Our analysis allows us to provide the algorithm with a large amount of side information, lets the algorithm decide which side information is relevant for the task, and penalizes the algorithm only for the side information that it actually uses. We also present an algorithm for multiarmed bandits with side information with O(K) computational complexity per game round.
5 0.74794233 29 nips-2011-Algorithms and hardness results for parallel large margin learning
Author: Phil Long, Rocco Servedio
Abstract: We study the fundamental problem of learning an unknown large-margin halfspace in the context of parallel computation. Our main positive result is a parallel algorithm for learning a large-margin halfspace that is based on interior point methods from convex optimization and fast parallel algorithms for matrix computations. We show that this algorithm learns an unknown γ-margin halfspace over n dimensions using poly(n, 1/γ) processors ˜ and runs in time O(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have Ω(1/γ 2 ) runtime dependence on γ. Our main negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We give an information-theoretic proof that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized: the ability to call the weak learner multiple times in parallel within a single boosting stage does not reduce the overall number of successive stages of boosting that are required. 1
6 0.73128623 180 nips-2011-Multiple Instance Filtering
7 0.7297979 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning
8 0.72397381 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
9 0.72310925 258 nips-2011-Sparse Bayesian Multi-Task Learning
10 0.72066969 229 nips-2011-Query-Aware MCMC
11 0.72059989 25 nips-2011-Adaptive Hedge
12 0.72029024 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
13 0.71873778 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
14 0.71866453 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines
15 0.71851957 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning
16 0.71830195 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
17 0.71694767 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
18 0.71681118 231 nips-2011-Randomized Algorithms for Comparison-based Search
19 0.7156359 271 nips-2011-Statistical Tests for Optimization Efficiency
20 0.71537709 197 nips-2011-On Tracking The Partition Function