nips nips2011 nips2011-25 knowledge-graph by maker-knowledge-mining

25 nips-2011-Adaptive Hedge


Source: pdf

Author: Tim van Erven, Wouter M. Koolen, Steven de Rooij, Peter Grünwald

Abstract: Most methods for decision-theoretic online learning are based on the Hedge algorithm, which takes a parameter called the learning rate. In most previous analyses the learning rate was carefully tuned to obtain optimal worst-case performance, leading to suboptimal performance on easy instances, for example when there exists an action that is significantly better than all others. We propose a new way of setting the learning rate, which adapts to the difficulty of the learning problem: in the worst case our procedure still guarantees optimal performance, but on easy instances it achieves much smaller regret. In particular, our adaptive method achieves constant regret in a probabilistic setting, when there exists an action that on average obtains strictly smaller loss than all other actions. We also provide a simulation study comparing our approach to existing methods. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In most previous analyses the learning rate was carefully tuned to obtain optimal worst-case performance, leading to suboptimal performance on easy instances, for example when there exists an action that is significantly better than all others. [sent-15, score-0.246]

2 We propose a new way of setting the learning rate, which adapts to the difficulty of the learning problem: in the worst case our procedure still guarantees optimal performance, but on easy instances it achieves much smaller regret. [sent-16, score-0.174]

3 In particular, our adaptive method achieves constant regret in a probabilistic setting, when there exists an action that on average obtains strictly smaller loss than all other actions. [sent-17, score-0.436]

4 In DTOL an agent is given access to a fixed set of K actions, and at the start of each round must make a decision by assigning a probability to every action. [sent-21, score-0.204]

5 Then all actions incur a loss from the range [0, 1], and the agent’s loss is the expected loss of the actions under the probability distribution it produced. [sent-22, score-0.477]

6 Losses add up over rounds and the goal for the agent is to minimize its regret after T rounds, which is the difference in accumulated loss between the agent and the action that has accumulated the least amount of loss. [sent-23, score-0.778]
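To make the DTOL protocol concrete, here is a minimal Python sketch (mine, not the paper's) of the round-by-round interaction with a Hedge-style agent; the loss-generating callback `get_losses`, the dimensions and the learning rate are placeholder assumptions.

```python
import numpy as np

def hedge_dtol(get_losses, K, T, eta):
    """Run the DTOL protocol for T rounds with the Hedge update at learning rate eta."""
    cum_losses = np.zeros(K)   # accumulated loss of every action
    agent_loss = 0.0           # accumulated loss of the agent
    for t in range(T):
        # Hedge weights are proportional to exp(-eta * cumulative loss so far);
        # subtracting the minimum only improves numerical stability.
        w = np.exp(-eta * (cum_losses - cum_losses.min()))
        w /= w.sum()
        losses = get_losses(t)          # loss vector in [0, 1]^K for this round
        agent_loss += w @ losses        # the agent suffers the expected loss under w
        cum_losses += losses
    return agent_loss - cum_losses.min()   # regret after T rounds

# Example: 4 actions, uniform random losses, one standard worst-case tuning eta = sqrt(8 ln(K)/T).
print(hedge_dtol(lambda t: np.random.rand(4), K=4, T=1000, eta=np.sqrt(8 * np.log(4) / 1000)))
```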

7 Different ways of tuning the learning rate have been proposed, which all aim to minimize the regret for the worst possible sequence of losses the actions might incur. [sent-26, score-0.676]

8 If T is known to the agent, then the learning rate may be tuned to achieve worst-case regret bounded by √(T ln(K)/2), which is known to be optimal as T and K become large [4]. [sent-27, score-0.348]

9 Suppose for example that the cumulative loss L∗_T of the best action is known to the agent beforehand. [sent-29, score-0.376]

10 Then, if the learning rate is set appropriately, the regret is bounded by √(2 L∗_T ln(K)) + ln(K) [4], which has the same asymptotics as the previous bound in the worst case (because L∗_T ≤ T) but may be much better when L∗_T turns out to be small. [sent-30, score-0.421]

11 Bounding the regret in terms of L∗_T or VARmax_T is based on the idea that worst-case performance is not the only property of interest: such bounds give essentially the same guarantee in the worst case, but a much better guarantee in a plausible favourable case (when L∗_T or VARmax_T is small). [sent-33, score-0.312]

12 Then in odd rounds the first action gets loss a + ε and the second action gets loss b − ε; in even rounds the actions get losses a − ε and b + ε, respectively. [sent-36, score-0.908]

13 Informally, this seems like a very easy instance of DTOL, because the cumulative losses of the actions diverge and it is easy to see from the losses which action is the best one. [sent-37, score-0.883]

14 In fact, the Follow-the-Leader strategy, which puts all probability mass on the action with smallest cumulative loss, gives a regret of at most 1 in this case — the worst-case bound O(√(L∗_T ln(K))) is very loose by comparison, and so is O(√(VARmax_T ln(K))), which is of the same order √(T ln(K)). [sent-38, score-0.51]
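As a quick illustration of this point (an illustrative simulation, not taken from the paper), the following snippet runs Follow-the-Leader on the alternating-loss instance just described; the values a = 0.3, b = 0.7 and ε = 0.1 are hypothetical.

```python
import numpy as np

a, b, eps, T = 0.3, 0.7, 0.1, 10_000   # hypothetical parameters of the easy instance
cum = np.zeros(2)                      # cumulative losses of the two actions
ftl_loss = 0.0
for t in range(T):
    leader = int(np.argmin(cum))       # Follow-the-Leader: all mass on the current leader
    if t % 2 == 0:                     # the "odd rounds" of the text (1-based)
        losses = np.array([a + eps, b - eps])
    else:
        losses = np.array([a - eps, b + eps])
    ftl_loss += losses[leader]
    cum += losses
print("FTL regret:", ftl_loss - cum.min())   # stays bounded (zero here), consistent with the regret of at most 1 noted above
```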

15 On the other hand, for Follow-the-Leader one cannot guarantee sublinear regret for worst-case instances. [sent-39, score-0.227]

16 (For example, if one out of two actions yields losses 1/2, 0, 1, 0, 1, . . . [sent-40, score-0.322]

17 and the other action yields losses 0, 1, 0, 1, 0, . . . , then the regret of Follow-the-Leader grows linearly in T.) [sent-43, score-0.333]

18 To get the best of both worlds, we introduce an adaptive version of Hedge, called AdaHedge, that automatically adapts to the difficulty of the problem by varying the learning rate appropriately. [sent-47, score-0.114]

19 As a result we obtain constant regret for the simplistic example above and other ‘easy’ instances of DTOL, while at the same time guaranteeing O(√(L∗_T ln(K))) regret in the worst case. [sent-48, score-0.577]

20 We measure the difficulty of the problem in terms of the speed at which the posterior probability of the best action converges to one. [sent-51, score-0.177]

21 In the previous example, this happens at an exponential rate, whereas for worst-case instances the posterior probability of the best action does not converge to one at all. [sent-52, score-0.218]

22 To construct the AdaHedge algorithm, we then add the doubling trick to this idea in Section 3, and analyse its worst-case regret. [sent-54, score-0.206]

23 In Section 4 we show that AdaHedge in fact incurs much smaller regret on easy problems. [sent-55, score-0.291]

24 the agent A is to assign a probability w_t^k to each action k by producing a vector w_t = (w_t^1, . . . [sent-67, score-0.603]

25 . . . , w_t^K) with nonnegative components that sum up to 1. [sent-70, score-0.187]

26 Then every action k incurs a loss ℓ_t^k ∈ [0, 1], which we collect in the loss vector ℓ_t = (ℓ_t^1, . . . [sent-71, score-0.282]

27 . . . , ℓ_t^K), and the loss of the agent is w_t · ℓ_t = ∑_{k=1}^K w_t^k ℓ_t^k. [sent-74, score-0.54]

28 After T rounds action k has accumulated loss L_T^k = ∑_{t=1}^T ℓ_t^k, and the agent’s regret is R_A(T) = ∑_{t=1}^T w_t · ℓ_t − L∗_T, where L∗_T = min_{1≤k≤K} L_T^k is the cumulative loss of the best action. [sent-75, score-0.904]

29 B_t = ∏_{s=1}^t w_s · e^{−η ℓ_s} (2). We will sometimes write w_t(η) and B_t(η) instead of w_t and B_t in order to emphasize the dependence of these quantities on η. [sent-79, score-0.374]

30 The mixability gap measures how closely we approach this ideal. [sent-82, score-0.135]

31 As the same interpretation still holds in the more general DTOL setting of this paper, we can measure the difficulty of the problem, and tune η, in terms of the cumulative mixability gap: ∆_T(η) = ∑_{t=1}^T δ_t(η) = ∑_{t=1}^T w_t(η) · ℓ_t + (1/η) ln B_T(η). [sent-83, score-0.638]
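These definitions translate directly into code. The sketch below (function names are mine; NumPy assumed, losses in [0, 1]) computes the per-round mixability gap δ_t(η) and accumulates ∆_T(η) while running Hedge at a fixed learning rate η.

```python
import numpy as np

def mixability_gap(w, losses, eta):
    """delta_t(eta) = w . losses + (1/eta) * ln(w . exp(-eta * losses))."""
    return w @ losses + np.log(w @ np.exp(-eta * losses)) / eta

def cumulative_mixability_gap(loss_rows, eta):
    """Delta_T(eta): run Hedge at learning rate eta and sum the per-round gaps."""
    T, K = loss_rows.shape
    cum = np.zeros(K)
    delta = 0.0
    for losses in loss_rows:
        w = np.exp(-eta * (cum - cum.min()))   # Hedge weights
        w /= w.sum()
        delta += mixability_gap(w, losses, eta)
        cum += losses
    return delta
```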

32 t=1 We proceed to list some basic properties of the mixability gap. [sent-84, score-0.129]

33 The lower bound follows by applying Jensen’s inequality to the concave function ln, the upper bound from Hoeffding’s bound on the cumulant generating function [4, Lemma A. [sent-88, score-0.157]

34 Further, the cumulative mixability gap ∆_T(η) can be related to L∗_T via the following upper bound, proved in the Additional Material: Lemma 2. ∆_T(η) ≤ (η L∗_T + ln(K))/(e − 1). [sent-90, score-0.246]

35 However, for easy instances of DTOL this inequality is very loose, in which case we can prove substantially better regret bounds. [sent-93, score-0.314]

36 We could now proceed by optimizing the learning rate η given the rather awkward assumption that ∆_T(η) is bounded by a known constant b for all η, which would be the natural counterpart to an analysis that optimizes η when a bound on L∗_T is known. [sent-94, score-0.186]

37 We can then simply run the Hedge algorithm until the smallest T such that ∆_T(η) exceeds an appropriate budget b(η), which we set to b(η) = (1/η + 1/(e − 1)) ln(K). (3) [sent-96, score-0.111]

38 Suppose the agent runs Hedge with learning rate η ∈ (0, 1], and after T rounds has just used up the budget (3), i. [sent-101, score-0.352]

39 Then its regret is bounded by R_Hedge(η)(T) < √((4/(e − 1)) L∗_T ln(K)) + (1/(e − 1)) ln(K) + 1. [sent-104, score-0.268]

40 The cumulative loss of Hedge is bounded by ∑_{t=1}^T w_t · ℓ_t = ∆_T(η) − (1/η) ln B_T < b(η) + η/8 − (1/η) ln B_T ≤ (1/(e − 1)) ln(K) + 1/8 + (2/η) ln(K) + L∗_T, (5) where we have used the bound B_T ≥ (1/K) e^{−η L∗_T}. [sent-106, score-0.951]

41 The AdaHedge Algorithm. We now introduce the AdaHedge algorithm by adding the doubling trick to the analysis of the previous section. [sent-108, score-0.206]

42 The doubling trick divides the rounds into segments i = 1, 2, . . . [sent-109, score-0.389]

43 , and on each segment restarts Hedge with a different learning rate ηi . [sent-112, score-0.141]

44 We monitor ∆t (ηi ), measured only on the losses in the i-th segment, and when it exceeds its budget bi = b(ηi ) a new segment is started. [sent-114, score-0.404]

45 [Algorithm 1, AdaHedge (fragment): when ∆ exceeds its budget, a new segment is started with a smaller learning rate and the weights reset to (1/K, . . . , 1/K); in every round the algorithm outputs the probabilities w, the actions receive losses ℓ_t, and it then updates ∆ ← ∆ + w · ℓ_t + (1/η) ln(w · e^{−η ℓ_t}) and w ← (w^1 e^{−η ℓ_t^1}, . . . , w^K e^{−η ℓ_t^K}) / (w · e^{−η ℓ_t}).] [sent-127, score-0.368]

46 The regret of AdaHedge is determined by the number of segments it creates: the fewer segments there are, the smaller the regret. [sent-130, score-0.391]
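Putting the pieces together, here is a compact Python sketch of AdaHedge as just described: Hedge is restarted whenever the within-segment cumulative mixability gap ∆ exceeds the budget b(η), and the learning rate is divided by φ at every restart. The budget formula follows the reconstruction of (3) above, and φ = 2 and the initial η = 1 are illustrative choices, so treat the constants as assumptions rather than the paper's exact ones.

```python
import numpy as np

def adahedge(loss_rows, phi=2.0):
    """Sketch of AdaHedge: a doubling trick driven by the cumulative mixability gap."""
    T, K = loss_rows.shape
    eta = 1.0                                        # learning rate of the current segment
    budget = (1 / eta + 1 / (np.e - 1)) * np.log(K)  # b(eta), reconstruction of (3)
    w = np.full(K, 1.0 / K)
    delta = 0.0                                      # Delta within the current segment
    agent_loss, cum = 0.0, np.zeros(K)
    for losses in loss_rows:
        if delta > budget:                           # budget depleted: start a new segment
            eta /= phi
            budget = (1 / eta + 1 / (np.e - 1)) * np.log(K)
            w = np.full(K, 1.0 / K)
            delta = 0.0
        agent_loss += w @ losses
        mix = w @ np.exp(-eta * losses)
        delta += w @ losses + np.log(mix) / eta      # mixability gap of this round
        w = w * np.exp(-eta * losses) / mix          # Hedge weight update
        cum += losses
    return agent_loss - cum.min()                    # regret after T rounds
```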

47 Then its regret is bounded by R_AdaHedge(T) < 2 ln(K) (φ^m − 1)/(φ − 1) + m ((1/(e − 1)) ln(K) + 1/8). [sent-133, score-0.268]

48 4 Using (4), one can obtain an upper bound on the number of segments that leads to the following guarantee for AdaHedge: Theorem 5. [sent-137, score-0.123]

49 Then its regret is bounded by R_AdaHedge(T) ≤ φ √(((φ² − 1)/(φ − 1)) (4/(e − 1)) L∗_T ln(K)) + O(ln(L∗_T + 2) ln(K)). For details see the proof in the Additional Material. [sent-139, score-0.29]

50 4 Easy Instances. While the previous sections reassure us that AdaHedge performs well for the worst possible sequence of losses, we are also interested in its behaviour when the losses are not maximally antagonistic. [sent-143, score-0.3]

51 We will characterise such sequences in terms of convergence of the Hedge posterior probability of the best action: w_t^∗(η) = max_{1≤k≤K} w_t^k(η). [sent-144, score-0.465]

52 (Recall that w_t^k is proportional to e^{−η L_{t−1}^k}, so w_t^∗ corresponds to the posterior probability of the action with smallest cumulative loss.) [sent-145, score-0.469]

53 For any t and η ∈ (0, 1] we have δ_t(η) ≤ (e − 2) η (1 − w_t^∗(η)). [sent-148, score-0.187]

54 This lemma, which may be of independent interest, is a variation on Hoeffding’s bound on the cumulant generating function. [sent-149, score-0.262]

55 In fact, if the posterior probabilities w_t^∗ converge to 1 sufficiently quickly, then ∆_T(η) is bounded, as shown by the following lemma. [sent-151, score-0.252]

56 . . . , T, there exists a single action k∗ that achieves minimal cumulative loss L_t^{k∗} = L∗_t, and for k ≠ k∗ the cumulative losses diverge as L_t^k − L∗_t ≥ α t^β. [sent-158, score-0.631]

57 Together with Lemmas 1 and 6, it gives an upper bound on ∆T (η), which may be used to bound the number of segments started by AdaHedge. [sent-161, score-0.203]

58 Let s(m) denote the round in which AdaHedge starts its m-th segment, and let L_r^k(m) = L_{s(m)+r−1}^k − L_{s(m)−1}^k denote the cumulative loss of action k after r rounds of that segment. [sent-163, score-0.371]

59 Suppose there exists a segment m∗ ∈ Z+ started by AdaHedge, such that τ := 8 ln(K) φ^{(m∗−1)(2−1/β)} − 8(e − 2) C_K + 1 ≥ 1, and for some action k∗ the cumulative losses in segment m∗ diverge as L_r^k(m∗) − L_r^{k∗}(m∗) ≥ α r^β (6) for all r ≥ τ and k ≠ k∗. [sent-166, score-0.668]

60 Then AdaHedge starts at most m∗ segments, and hence by Lemma 4 its regret is bounded by a constant: R_AdaHedge(T) = O(1). [sent-167, score-0.291]

61 Suppose the loss vectors ℓ_t are independent random variables such that the expected differences in loss satisfy min_{k ≠ k∗} E[ℓ_t^k − ℓ_t^{k∗}] ≥ 2α for all t ∈ Z+. (7) [sent-172, score-0.144]

62 Then, with probability at least 1 − δ, AdaHedge starts at most m∗ = 1 + ⌈log_φ((K − 1)(e − 2)/(α ln(K)) + ln(2K/(α² δ))/(4α² ln(K)) + 1/(8 ln(K)))⌉ (8) segments and consequently its regret is bounded by a constant: R_AdaHedge(T) = O(K + log(1/δ)). [sent-173, score-0.648]

63 This shows that the probabilistic setting of the theorem is much easier than the worst case, for which only a bound on the regret of order O(√(T ln(K))) is possible, and that AdaHedge automatically adapts to this easier setting. [sent-174, score-0.402]

64 This algorithm is included because it is simple and very effective if the losses are not antagonistic, although as mentioned in the introduction its regret is linear in the worst case. [sent-180, score-0.491]

65 We also include Hedge with a fixed learning rate η = √(2 ln(K)/L∗_T), (9) which achieves the regret bound √(2 L∗_T ln(K)) + ln(K). [sent-182, score-0.329]

66 The common way to apply the doubling trick to L∗_T is to set a budget on L∗_T and multiply it by some constant φ at the start of each new segment, after which η is optimized for the new budget [4, 7]. [sent-185, score-0.427]

67 Instead, we proceed the other way around and with each new segment first divide η by φ = 2 and then calculate the new budget such that (9) holds when ∆t (η) reaches the budget. [sent-186, score-0.204]

68 This way we keep the same invariant (η is never larger than the right-hand side of (9), with equality when the budget is depleted), and the frequency of doubling remains logarithmic in L∗_T with a constant determined by φ, so both approaches are equally valid. [sent-187, score-0.246]
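For comparison, here is a sketch of the 'common' doubling-trick baseline described above: a budget on L∗_T that is multiplied by φ at every restart, with η re-optimized via (9). The initial budget, the cap η ≤ 1 and φ = 2 are illustrative assumptions of mine, not the paper's choices.

```python
import numpy as np

def hedge_doubling_on_lstar(loss_rows, phi=2.0, init_budget=1.0):
    """Baseline sketch: restart Hedge whenever the best action's loss exceeds a growing budget."""
    T, K = loss_rows.shape
    budget = init_budget
    eta = min(1.0, np.sqrt(2 * np.log(K) / budget))  # eq. (9) with L* replaced by the budget
    w = np.full(K, 1.0 / K)
    agent_loss, cum = 0.0, np.zeros(K)
    for losses in loss_rows:
        if cum.min() > budget:                       # best action exceeded the budget
            budget *= phi                            # enlarge the budget ...
            eta = min(1.0, np.sqrt(2 * np.log(K) / budget))  # ... and re-optimize eta
            w = np.full(K, 1.0 / K)                  # restart Hedge for the new segment
        agent_loss += w @ losses
        w = w * np.exp(-eta * losses)
        w /= w.sum()
        cum += losses
    return agent_loss - cum.min()
```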

69 Rather than using the doubling trick, this algorithm, described in [8], changes the learning rate each round as a function of L∗_t. [sent-196, score-0.262]

70 This way there is no need to relearn the weights of the actions in each block, which leads to a better worst-case bound and potentially better performance in practice. [sent-197, score-0.15]

71 Generating the Losses. In both experiments we choose losses in {0, 1}. [sent-200, score-0.213]

72 [Figure 1: Simulation results; panel (b) shows regret under correlated losses, with the number of rounds (up to 10 000) on the horizontal axis.] [sent-206, score-0.426]

73 In the first experiment, all T = 10 000 losses for all K = 4 actions are independent, with distribution depending only on the action: the probabilities of incurring loss 1 are 0. [sent-210, score-0.417]
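A sketch of the loss generator for this experiment follows; the exact per-action probabilities are truncated in the sentence above, so the values in p_loss below are made-up placeholders, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K = 10_000, 4
p_loss = np.array([0.3, 0.4, 0.5, 0.6])                   # hypothetical Bernoulli parameters
loss_rows = (rng.random((T, K)) < p_loss).astype(float)   # independent {0, 1} losses
```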

74 In addition there are dependencies within the loss vectors ℓ_t, between the losses for the K = 2 available actions: each round is hard with probability 0. [sent-218, score-0.366]

75 If round t is hard, then action 1 yields loss 1 with probability 1 − 0.01/t [sent-220, score-0.273]

76 and action 2 yields loss 1 with probability 1 − 0. [sent-221, score-0.207]

77 If the round is easy, then the probabilities are flipped and the actions yield loss 0 with the same probabilities. [sent-223, score-0.27]
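The correlated losses of the second experiment can be sketched in the same way; the probability of a hard round and the constant for action 2 are truncated above, so p_hard and c2 below are placeholders chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 10_000
p_hard, c2 = 0.5, 0.02                      # hypothetical placeholders for the truncated constants
loss_rows = np.zeros((T, 2))
for t in range(1, T + 1):
    hard = rng.random() < p_hard
    p1, p2 = 1 - 0.01 / t, 1 - c2 / t       # P(loss = 1) in a hard round
    if not hard:                            # easy round: the probabilities are flipped
        p1, p2 = 1 - p1, 1 - p2
    loss_rows[t - 1] = [rng.random() < p1, rng.random() < p2]
```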

78 We plot the regret (averaged over repetitions of the experiment) as a function of the number of rounds, for each of the considered algorithms. [sent-227, score-0.254]

79 In the first considered regime, the accumulated losses for each action diverge linearly with high probability, so that the regret of Follow-the-Leader is bounded. [sent-232, score-0.641]

80 Based on Theorem 9 we expect AdaHedge to incur bounded regret also; this is confirmed in Figure 1(a). [sent-233, score-0.296]

81 In fact, if we were to include more rounds, the learning rate would be set to an even smaller value, clearly showing the need to determine the learning rate adaptively. [sent-236, score-0.122]

82 The doubling trick provides one way to adapt the learning rate; indeed, we observe that the regret of Hedge with the doubling trick is initially smaller than the regret of Hedge with fixed learning rate. [sent-237, score-0.866]

83 In the second simulation we investigate the case where the mean cumulative loss of two actions is extremely close — within O(log t) of one another. [sent-242, score-0.299]

84 If the losses of the actions were independent, such a small difference would be dwarfed by random fluctuations in the cumulative losses, which would be of order O(√t). [sent-243, score-0.322]

85 Thus the two actions can only be distinguished because we have made their losses dependent. [sent-244, score-0.322]

86 Depending on the application, this may actually be a more natural scenario than complete independence as in the first simulation; for example, we can think of the losses as mistakes of two binary classifiers, say, two naive Bayes classifiers with different smoothing parameters. [sent-245, score-0.213]

87 In such a scenario, losses will be dependent, and the difference in cumulative loss will be much smaller than O(√t). [sent-246, score-0.375]

88 In the previous experiment, the posterior weights of the actions converged relatively quickly for a large range of learning rates, so that the exact value of the learning rate was most important at the start (e.g. [sent-247, score-0.241]

89 from 3000 rounds onward Hedge with fixed learning rate does not incur much additional regret any more). [sent-249, score-0.432]

90 For any η > 0 and any time t, the function f(ℓ_t) = ln(w_t · e^{−η ℓ_t}) is convex. [sent-257, score-0.447]

91 We need to bound δ_t = w_t(η) · ℓ_t + (1/η) ln(w_t(η) · e^{−η ℓ_t}), which is a convex function of ℓ_t by Lemma 10. [sent-261, score-0.228]

92 As a consequence, its maximum is achieved when ℓ_t lies on the boundary of its domain, such that the losses ℓ_t^k are either 0 or 1 for all k, and in the remainder of the proof we will assume (without loss of generality) that this is the case. [sent-262, score-0.307]

93 Now let α_t = w_t · ℓ_t be the posterior probability of the actions with loss 1. [sent-263, score-0.425]

94 Then δ_t = α_t + (1/η) ln((1 − α_t) + α_t e^{−η}) = α_t + (1/η) ln(1 + α_t (e^{−η} − 1)). [sent-264, score-0.52]

95 Using ln x ≤ x − 1 and e^{−η} ≤ 1 − η + η²/2, we get δ_t ≤ α_t η/2, which is tight for α_t near 0. [sent-265, score-0.26]

96 For α_t near 1, rewrite δ_t = α_t − 1 + (1/η) ln(e^η (1 − α_t) + α_t) and use ln x ≤ x − 1 and e^η ≤ 1 + η + (e − 2)η² for η ≤ 1 to obtain δ_t ≤ (e − 2)(1 − α_t) η. [sent-266, score-0.26]
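As a quick numerical sanity check of the two inequalities just derived (my own check, not part of the paper), one can verify on a grid of (α, η) values that δ = α + (1/η) ln(1 − α + α e^{−η}) never exceeds either bound:

```python
import numpy as np

alphas = np.linspace(0.0, 1.0, 201)
etas = np.linspace(1e-3, 1.0, 200)
A, H = np.meshgrid(alphas, etas)
delta = A + np.log(1 - A + A * np.exp(-H)) / H
assert np.all(delta <= A * H / 2 + 1e-12)                  # the bound that is tight near alpha = 0
assert np.all(delta <= (np.e - 2) * (1 - A) * H + 1e-12)   # the bound that is tight near alpha = 1
print("both bounds hold on the grid")
```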

97 Now, let k∗ be an action such that w_t^{k∗} = w_t^∗. [sent-268, score-0.494]

98 On the other hand, if ℓ_t^{k∗} = 1, then α_t ≥ w_t^∗, so 1 − α_t ≤ 1 − w_t^∗. [sent-270, score-0.187]

99 For hard instances of DTOL, for which the posterior does not converge, it was shown that the regret of AdaHedge is of the optimal order O(√(L∗_T ln(K))); for easy instances, for which the posterior converges sufficiently fast, the regret was bounded by a constant. [sent-274, score-0.666]

100 A starting point might be to consider how fast the posterior probability of the best action converges to one, and plug that into Lemma 6. [sent-278, score-0.177]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('hedge', 0.522), ('adahedge', 0.505), ('ln', 0.26), ('regret', 0.227), ('losses', 0.213), ('wt', 0.187), ('doubling', 0.135), ('dtol', 0.135), ('action', 0.12), ('lk', 0.114), ('actions', 0.109), ('mixability', 0.101), ('varmax', 0.101), ('rounds', 0.101), ('budget', 0.096), ('agent', 0.094), ('cumulative', 0.09), ('lemma', 0.084), ('segments', 0.082), ('bt', 0.081), ('segment', 0.08), ('loss', 0.072), ('trick', 0.071), ('radahedge', 0.067), ('round', 0.066), ('rate', 0.061), ('lt', 0.052), ('worst', 0.051), ('diverge', 0.046), ('easy', 0.046), ('cwi', 0.044), ('depleted', 0.044), ('ptk', 0.044), ('posterior', 0.042), ('bound', 0.041), ('bounded', 0.041), ('instances', 0.041), ('started', 0.039), ('expert', 0.039), ('adapts', 0.036), ('behaviour', 0.036), ('accumulated', 0.035), ('gap', 0.034), ('characterise', 0.034), ('cumulant', 0.034), ('egham', 0.034), ('favourable', 0.034), ('leader', 0.034), ('amsterdam', 0.033), ('netherlands', 0.032), ('simplistic', 0.031), ('culty', 0.03), ('centrum', 0.03), ('golden', 0.03), ('informatica', 0.03), ('wiskunde', 0.03), ('wouter', 0.03), ('freund', 0.029), ('ck', 0.029), ('start', 0.029), ('simulation', 0.028), ('incur', 0.028), ('proceed', 0.028), ('hedging', 0.027), ('repetitions', 0.027), ('tim', 0.026), ('probabilities', 0.023), ('starts', 0.023), ('haussler', 0.023), ('completes', 0.023), ('proof', 0.022), ('suppose', 0.021), ('proved', 0.021), ('advice', 0.021), ('gb', 0.021), ('fixed', 0.021), ('hoeffding', 0.02), ('generalisation', 0.02), ('tuned', 0.019), ('constants', 0.018), ('hazan', 0.018), ('incurs', 0.018), ('park', 0.017), ('schapire', 0.017), ('loose', 0.017), ('adaptive', 0.017), ('rmed', 0.016), ('easier', 0.016), ('plugging', 0.016), ('optimizes', 0.015), ('additional', 0.015), ('probability', 0.015), ('tuning', 0.015), ('prediction', 0.015), ('exceeds', 0.015), ('proportional', 0.015), ('theorem', 0.015), ('never', 0.015), ('wk', 0.015), ('game', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 25 nips-2011-Adaptive Hedge

Author: Tim van Erven, Wouter M. Koolen, Steven de Rooij, Peter Grünwald

Abstract: Most methods for decision-theoretic online learning are based on the Hedge algorithm, which takes a parameter called the learning rate. In most previous analyses the learning rate was carefully tuned to obtain optimal worst-case performance, leading to suboptimal performance on easy instances, for example when there exists an action that is significantly better than all others. We propose a new way of setting the learning rate, which adapts to the difficulty of the learning problem: in the worst case our procedure still guarantees optimal performance, but on easy instances it achieves much smaller regret. In particular, our adaptive method achieves constant regret in a probabilistic setting, when there exists an action that on average obtains strictly smaller loss than all other actions. We also provide a simulation study comparing our approach to existing methods. 1

2 0.22239968 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits

Author: Yevgeny Seldin, Peter Auer, John S. Shawe-taylor, Ronald Ortner, François Laviolette

Abstract: We derive an instantaneous (per-round) data-dependent regret bound for stochastic multiarmed bandits with side information (also known as contextual bandits). The p scaling of our regret bound with the number of states (contexts) N goes as N I⇢t (S; A), where I⇢t (S; A) is the mutual information between states and actions (the side information) used by the algorithm at round t. If the algorithm p uses all the side information, the regret bound scales as N ln K, where K is the number of actions (arms). However, if the side information I⇢t (S; A) is not fully used, the regret bound is significantly tighter. In the extreme case, when I⇢t (S; A) = 0, the dependence on the number of states reduces from linear to logarithmic. Our analysis allows to provide the algorithm large amount of side information, let the algorithm to decide which side information is relevant for the task, and penalize the algorithm only for the side information that it is using de facto. We also present an algorithm for multiarmed bandits with side information with O(K) computational complexity per game round. 1

3 0.18415262 220 nips-2011-Prediction strategies without loss

Author: Michael Kapralov, Rina Panigrahy

Abstract: Consider a sequence of bits where we are trying to predict the next bit from the previous bits. Assume we are allowed to say ‘predict 0’ or ‘predict 1’, and our payoff is +1 if the prediction is correct and −1 otherwise. We will say that at each point in time the loss of an algorithm is the number of wrong predictions minus the number of right predictions so far. In this paper we are interested in algorithms that have essentially zero (expected) loss over any string at any point in time and yet have small regret with respect to always predicting 0 or always predicting 1. For a sequence of length T our algorithm has regret 14 T and loss √ 2 2 T e− T in expectation for all strings. We show that the tradeoff between loss and regret is optimal up to constant factors. Our techniques extend to the general setting of N experts, where the related problem of trading off regret to the best expert for regret to the ’special’ expert has been studied by Even-Dar et al. (COLT’07). We obtain essentially zero loss with respect to the special expert and optimal loss/regret tradeoff, improving upon the results of Even-Dar et al and settling the main question left open in their paper. The strong loss bounds of the algorithm have some surprising consequences. First, we obtain a parameter free algorithm for the experts problem that has optimal regret bounds with respect to k-shifting optima, i.e. bounds with respect to the optimum that is allowed to change arms multiple times. Moreover, for any window of size n the regret of our algorithm to any expert never exceeds O( n(log N + log T )), where N is the number of experts and T is the time horizon, while maintaining the essentially zero loss property. 1

4 0.15549728 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits

Author: Yasin Abbasi-yadkori, Csaba Szepesvári, David Tax

Abstract: We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem. In particular, we show that a simple modification of Auer’s UCB algorithm (Auer, 2002) achieves with high probability constant regret. More importantly, we modify and, consequently, improve the analysis of the algorithm for the for linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), Li et al. (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. In both cases, the improvement stems from the construction of smaller confidence sets. For their construction we use a novel tail inequality for vector-valued martingales. 1

5 0.15253848 305 nips-2011-k-NN Regression Adapts to Local Intrinsic Dimension

Author: Samory Kpotufe

Abstract: Many nonparametric regressors were recently shown to converge at rates that depend only on the intrinsic dimension of data. These regressors thus escape the curse of dimension when high-dimensional data has low intrinsic dimension (e.g. a manifold). We show that k-NN regression is also adaptive to intrinsic dimension. In particular our rates are local to a query x and depend only on the way masses of balls centered at x vary with radius. Furthermore, we show a simple way to choose k = k(x) locally at any x so as to nearly achieve the minimax rate at x in terms of the unknown intrinsic dimension in the vicinity of x. We also establish that the minimax rate does not depend on a particular choice of metric space or distribution, but rather that this minimax rate holds for any metric space and doubling measure. 1

6 0.14066131 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction

7 0.1333701 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations

8 0.12536937 145 nips-2011-Learning Eigenvectors for Free

9 0.1213149 245 nips-2011-Selecting the State-Representation in Reinforcement Learning

10 0.11720444 109 nips-2011-Greedy Model Averaging

11 0.10637868 59 nips-2011-Composite Multiclass Losses

12 0.106226 218 nips-2011-Predicting Dynamic Difficulty

13 0.10610399 56 nips-2011-Committing Bandits

14 0.10420763 80 nips-2011-Efficient Online Learning via Randomized Rounding

15 0.1027749 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning

16 0.10211843 61 nips-2011-Contextual Gaussian Process Bandit Optimization

17 0.090956718 97 nips-2011-Finite Time Analysis of Stratified Sampling for Monte Carlo

18 0.090561517 247 nips-2011-Semantic Labeling of 3D Point Clouds for Indoor Scenes

19 0.085127898 272 nips-2011-Stochastic convex optimization with bandit feedback

20 0.082577504 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.161), (1, -0.25), (2, 0.001), (3, 0.03), (4, 0.189), (5, -0.011), (6, 0.014), (7, 0.065), (8, 0.007), (9, 0.07), (10, -0.058), (11, -0.02), (12, -0.038), (13, -0.102), (14, -0.026), (15, -0.06), (16, -0.032), (17, -0.067), (18, -0.003), (19, 0.062), (20, 0.03), (21, 0.004), (22, 0.066), (23, 0.148), (24, -0.003), (25, 0.126), (26, 0.015), (27, -0.02), (28, -0.071), (29, -0.096), (30, -0.085), (31, -0.016), (32, -0.022), (33, -0.043), (34, 0.231), (35, -0.005), (36, 0.089), (37, -0.11), (38, 0.064), (39, -0.027), (40, 0.044), (41, 0.036), (42, 0.001), (43, -0.031), (44, 0.031), (45, 0.087), (46, -0.031), (47, 0.027), (48, 0.15), (49, 0.054)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96127003 25 nips-2011-Adaptive Hedge

Author: Tim van Erven, Wouter M. Koolen, Steven de Rooij, Peter Grünwald

Abstract: Most methods for decision-theoretic online learning are based on the Hedge algorithm, which takes a parameter called the learning rate. In most previous analyses the learning rate was carefully tuned to obtain optimal worst-case performance, leading to suboptimal performance on easy instances, for example when there exists an action that is significantly better than all others. We propose a new way of setting the learning rate, which adapts to the difficulty of the learning problem: in the worst case our procedure still guarantees optimal performance, but on easy instances it achieves much smaller regret. In particular, our adaptive method achieves constant regret in a probabilistic setting, when there exists an action that on average obtains strictly smaller loss than all other actions. We also provide a simulation study comparing our approach to existing methods. 1

2 0.76865852 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits

Author: Yevgeny Seldin, Peter Auer, John S. Shawe-taylor, Ronald Ortner, François Laviolette

Abstract: We derive an instantaneous (per-round) data-dependent regret bound for stochastic multiarmed bandits with side information (also known as contextual bandits). The p scaling of our regret bound with the number of states (contexts) N goes as N I⇢t (S; A), where I⇢t (S; A) is the mutual information between states and actions (the side information) used by the algorithm at round t. If the algorithm p uses all the side information, the regret bound scales as N ln K, where K is the number of actions (arms). However, if the side information I⇢t (S; A) is not fully used, the regret bound is significantly tighter. In the extreme case, when I⇢t (S; A) = 0, the dependence on the number of states reduces from linear to logarithmic. Our analysis allows to provide the algorithm large amount of side information, let the algorithm to decide which side information is relevant for the task, and penalize the algorithm only for the side information that it is using de facto. We also present an algorithm for multiarmed bandits with side information with O(K) computational complexity per game round. 1

3 0.65894079 109 nips-2011-Greedy Model Averaging

Author: Dong Dai, Tong Zhang

Abstract: This paper considers the problem of combining multiple models to achieve a prediction accuracy not much worse than that of the best single model for least squares regression. It is known that if the models are mis-specified, model averaging is superior to model selection. Specifically, let n be the sample size, then the worst case regret of the former decays at the rate of O(1/n) while the worst √ case regret of the latter decays at the rate of O(1/ n). In the literature, the most important and widely studied model averaging method that achieves the optimal O(1/n) average regret is the exponential weighted model averaging (EWMA) algorithm. However this method suffers from several limitations. The purpose of this paper is to present a new greedy model averaging procedure that improves EWMA. We prove strong theoretical guarantees for the new procedure and illustrate our theoretical results with empirical examples. 1

4 0.63207334 103 nips-2011-Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss

Author: Joseph Keshet, David A. McAllester

Abstract: We consider latent structural versions of probit loss and ramp loss. We show that these surrogate loss functions are consistent in the strong sense that for any feature map (finite or infinite dimensional) they yield predictors approaching the infimum task loss achievable by any linear predictor over the given features. We also give finite sample generalization bounds (convergence rates) for these loss functions. These bounds suggest that probit loss converges more rapidly. However, ramp loss is more easily optimized on a given sample. 1

5 0.52695888 305 nips-2011-k-NN Regression Adapts to Local Intrinsic Dimension

Author: Samory Kpotufe

Abstract: Many nonparametric regressors were recently shown to converge at rates that depend only on the intrinsic dimension of data. These regressors thus escape the curse of dimension when high-dimensional data has low intrinsic dimension (e.g. a manifold). We show that k-NN regression is also adaptive to intrinsic dimension. In particular our rates are local to a query x and depend only on the way masses of balls centered at x vary with radius. Furthermore, we show a simple way to choose k = k(x) locally at any x so as to nearly achieve the minimax rate at x in terms of the unknown intrinsic dimension in the vicinity of x. We also establish that the minimax rate does not depend on a particular choice of metric space or distribution, but rather that this minimax rate holds for any metric space and doubling measure. 1

6 0.51420832 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations

7 0.49947515 272 nips-2011-Stochastic convex optimization with bandit feedback

8 0.49768934 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits

9 0.49253625 220 nips-2011-Prediction strategies without loss

10 0.49151707 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction

11 0.47450182 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning

12 0.47156906 245 nips-2011-Selecting the State-Representation in Reinforcement Learning

13 0.46320438 61 nips-2011-Contextual Gaussian Process Bandit Optimization

14 0.4566296 56 nips-2011-Committing Bandits

15 0.44483283 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs

16 0.42484847 80 nips-2011-Efficient Online Learning via Randomized Rounding

17 0.41701052 218 nips-2011-Predicting Dynamic Difficulty

18 0.40326202 79 nips-2011-Efficient Offline Communication Policies for Factored Multiagent POMDPs

19 0.39635953 32 nips-2011-An Empirical Evaluation of Thompson Sampling

20 0.35718712 145 nips-2011-Learning Eigenvectors for Free


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.017), (4, 0.029), (20, 0.073), (26, 0.027), (31, 0.078), (33, 0.016), (43, 0.092), (45, 0.132), (57, 0.023), (59, 0.232), (74, 0.057), (83, 0.054), (87, 0.01), (99, 0.045)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.79975736 25 nips-2011-Adaptive Hedge

Author: Tim van Erven, Wouter M. Koolen, Steven de Rooij, Peter Grünwald

Abstract: Most methods for decision-theoretic online learning are based on the Hedge algorithm, which takes a parameter called the learning rate. In most previous analyses the learning rate was carefully tuned to obtain optimal worst-case performance, leading to suboptimal performance on easy instances, for example when there exists an action that is significantly better than all others. We propose a new way of setting the learning rate, which adapts to the difficulty of the learning problem: in the worst case our procedure still guarantees optimal performance, but on easy instances it achieves much smaller regret. In particular, our adaptive method achieves constant regret in a probabilistic setting, when there exists an action that on average obtains strictly smaller loss than all other actions. We also provide a simulation study comparing our approach to existing methods. 1

2 0.78643417 65 nips-2011-Convergent Fitted Value Iteration with Linear Function Approximation

Author: Daniel J. Lizotte

Abstract: Fitted value iteration (FVI) with ordinary least squares regression is known to diverge. We present a new method, “Expansion-Constrained Ordinary Least Squares” (ECOLS), that produces a linear approximation but also guarantees convergence when used with FVI. To ensure convergence, we constrain the least squares regression operator to be a non-expansion in the ∞-norm. We show that the space of function approximators that satisfy this constraint is more rich than the space of “averagers,” we prove a minimax property of the ECOLS residual error, and we give an efficient algorithm for computing the coefficients of ECOLS based on constraint generation. We illustrate the algorithmic convergence of FVI with ECOLS in a suite of experiments, and discuss its properties. 1

3 0.77053005 154 nips-2011-Learning person-object interactions for action recognition in still images

Author: Vincent Delaitre, Josef Sivic, Ivan Laptev

Abstract: We investigate a discriminatively trained model of person-object interactions for recognizing common human actions in still images. We build on the locally order-less spatial pyramid bag-of-features model, which was shown to perform extremely well on a range of object, scene and human action recognition tasks. We introduce three principal contributions. First, we replace the standard quantized local HOG/SIFT features with stronger discriminatively trained body part and object detectors. Second, we introduce new person-object interaction features based on spatial co-occurrences of individual body parts and objects. Third, we address the combinatorial problem of a large number of possible interaction pairs and propose a discriminative selection procedure using a linear support vector machine (SVM) with a sparsity inducing regularizer. Learning of action-specific body part and object interactions bypasses the difficult problem of estimating the complete human body pose configuration. Benefits of the proposed model are shown on human action recognition in consumer photographs, outperforming the strong bag-of-features baseline. 1

4 0.65665942 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

Author: Po-ling Loh, Martin J. Wainwright

Abstract: Although the standard formulations of prediction problems involve fully-observed and noiseless data drawn in an i.i.d. manner, many applications involve noisy and/or missing data, possibly involving dependencies. We study these issues in the context of high-dimensional sparse linear regression, and propose novel estimators for the cases of noisy, missing, and/or dependent data. Many standard approaches to noisy or missing data, such as those using the EM algorithm, lead to optimization problems that are inherently non-convex, and it is difficult to establish theoretical guarantees on practical algorithms. While our approach also involves optimizing non-convex programs, we are able to both analyze the statistical error associated with any global optimum, and prove that a simple projected gradient descent algorithm will converge in polynomial time to a small neighborhood of the set of global minimizers. On the statistical side, we provide non-asymptotic bounds that hold with high probability for the cases of noisy, missing, and/or dependent data. On the computational side, we prove that under the same types of conditions required for statistical consistency, the projected gradient descent algorithm will converge at geometric rates to a near-global minimizer. We illustrate these theoretical predictions with simulations, showing agreement with the predicted scalings. 1

5 0.65559596 258 nips-2011-Sparse Bayesian Multi-Task Learning

Author: Shengbo Guo, Onno Zoeter, Cédric Archambeau

Abstract: We propose a new sparse Bayesian model for multi-task regression and classification. The model is able to capture correlations between tasks, or more specifically a low-rank approximation of the covariance matrix, while being sparse in the features. We introduce a general family of group sparsity inducing priors based on matrix-variate Gaussian scale mixtures. We show the amount of sparsity can be learnt from the data by combining an approximate inference approach with type II maximum likelihood estimation of the hyperparameters. Empirical evaluations on data sets from biology and vision demonstrate the applicability of the model, where on both regression and classification tasks it achieves competitive predictive performance compared to previously proposed methods. 1

6 0.65408671 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning

7 0.65302515 186 nips-2011-Noise Thresholds for Spectral Clustering

8 0.65299892 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions

9 0.65006477 227 nips-2011-Pylon Model for Semantic Segmentation

10 0.64967084 195 nips-2011-On Learning Discrete Graphical Models using Greedy Methods

11 0.64966947 180 nips-2011-Multiple Instance Filtering

12 0.64916968 239 nips-2011-Robust Lasso with missing and grossly corrupted observations

13 0.64897943 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits

14 0.64846563 276 nips-2011-Structured sparse coding via lateral inhibition

15 0.64754409 271 nips-2011-Statistical Tests for Optimization Efficiency

16 0.6465103 9 nips-2011-A More Powerful Two-Sample Test in High Dimensions using Random Projection

17 0.64641458 109 nips-2011-Greedy Model Averaging

18 0.64536977 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding

19 0.64525062 80 nips-2011-Efficient Online Learning via Randomized Rounding

20 0.64481837 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data