nips nips2011 nips2011-80 knowledge-graph by maker-knowledge-mining

80 nips-2011-Efficient Online Learning via Randomized Rounding


Source: pdf

Author: Nicolò Cesa-Bianchi, Ohad Shamir

Abstract: Most online algorithms used in machine learning today are based on variants of mirror descent or follow-the-leader. In this paper, we present an online algorithm based on a completely different approach, which combines “random playout” and randomized rounding of loss subgradients. As an application of our approach, we provide the first computationally efficient online algorithm for collaborative filtering with trace-norm constrained matrices. As a second application, we solve an open question linking batch learning and transductive online learning.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Most online algorithms used in machine learning today are based on variants of mirror descent or follow-the-leader. [sent-4, score-0.232]

2 In this paper, we present an online algorithm based on a completely different approach, which combines “random playout” and randomized rounding of loss subgradients. [sent-5, score-0.368]

3 As an application of our approach, we provide the first computationally efficient online algorithm for collaborative filtering with trace-norm constrained matrices. [sent-6, score-0.29]

4 As a second application, we solve an open question linking batch learning and transductive online learning. [sent-7, score-0.425]

5 It computes minimax predictions in the case of known horizon, binary outcomes, and absolute loss. [sent-12, score-0.228]

6 The new algorithm is based on a combination of “random playout” and randomized rounding, which assigns random binary labels to future unseen instances, in a way depending on the loss subgradients. [sent-15, score-0.156]

7 The regret of the R2 Forecaster is determined by the Rademacher complexity of the comparison class. [sent-17, score-0.267]

8 The connection between online learnability and Rademacher complexity has also been explored in [2, 1]. [sent-18, score-0.224]

9 The idea of “random playout”, in the context of online learning, has also been used in [16, 3], but we apply this idea in a different way. [sent-20, score-0.169]

10 We show that the R2 Forecaster can be used to design the first efficient online learning algorithm for collaborative filtering with trace-norm constrained matrices. [sent-21, score-0.25]

11 While this is a well-known setting, a straightforward application of standard online learning approaches, such as mirror descent, appears to give only trivial performance guarantees. [sent-22, score-0.239]

12 Moreover, our regret bound matches the best currently known sample complexity bound in the batch distribution-free setting [21]. [sent-23, score-0.361]

13 As a different application, we consider the relationship between batch learning and transductive online learning. [sent-24, score-0.425]

14 Their main result was that efficient learning in a statistical setting implies efficient learning in the transductive online setting, but at an inferior rate of $T^{3/4}$ (where $T$ is the number of rounds). [sent-26, score-0.405]

15 This shows that efficient batch learning not only implies efficient transductive online learning (the main thesis of [16]), but also that the same rates can be obtained, and for possibly non-binary prediction problems as well. [sent-29, score-0.512]

16 Nevertheless, it seems to be a useful tool in showing that efficient online learnability is possible in various settings, often working in cases where more standard techniques appear to fail. [sent-32, score-0.202]

17 Moreover, we hope the techniques we employ might prove useful in deriving practical online algorithms in other contexts. [sent-33, score-0.169]

18 2 The Minimax Forecaster. We start by introducing the sequential game of prediction with expert advice —see [10]. [sent-34, score-0.206]

19 The game is played between a forecaster and an adversary, and is specified by an outcome space $\mathcal{Y}$, a prediction space $\mathcal{P}$, a nonnegative loss function $\ell : \mathcal{P} \times \mathcal{Y} \to \mathbb{R}$, which measures the discrepancy between the forecaster’s prediction and the outcome, and an expert class $\mathcal{F}$. [sent-35, score-1.021]

20 Here we focus on classes F of static experts, whose prediction at each round t does not depend on the outcome in previous rounds. [sent-36, score-0.293]

21 At each round $t$ of the game, the forecaster outputs a prediction $p_t \in \mathcal{P}$ and simultaneously the adversary reveals an outcome $y_t \in \mathcal{Y}$. [sent-44, score-1.597]

22 The forecaster’s goal is to predict the outcome sequence almost as well as the best expert in the class $\mathcal{F}$, irrespective of the outcome sequence $y = (y_1, y_2, \dots)$. [sent-45, score-0.4]

23 The performance of a forecasting strategy $A$ is measured by the worst-case regret $V_T(A, \mathcal{F}) = \sup_{y \in \mathcal{Y}^T} \big( \sum_{t=1}^{T} \ell(p_t, y_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f_t, y_t) \big)$ (1), viewed as a function of the horizon $T$. [sent-49, score-1.387]

24 To simplify notation, let $L(f, y) = \sum_{t=1}^{T} \ell(f_t, y_t)$. [sent-50, score-0.464]

25 Consider now the special case where the horizon $T$ is fixed and known in advance, the outcome space is $\mathcal{Y} = \{-1, +1\}$, the prediction space is $\mathcal{P} = [-1, +1]$, and the loss is the absolute loss $\ell(p, y) = |p - y|$. [sent-51, score-0.47]
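To make the protocol concrete, here is a minimal simulation harness in Python (all names are ours; `forecaster` and `adversary` are callbacks that see only the past outcomes, matching the simultaneous-move rule above). It is an illustrative sketch, not part of the paper:

```python
def absolute_loss(p, y):
    # The special-case loss: p in [-1, +1], y in {-1, +1}.
    return abs(p - y)

def play_game(forecaster, adversary, experts, T):
    # One run of the prediction-with-expert-advice protocol with static
    # experts. Returns the regret on the realized outcome sequence
    # (the quantity whose supremum over sequences defines Eq. (1)).
    preds, outcomes = [], []
    for t in range(T):
        p_t = forecaster(tuple(outcomes))   # prediction for round t
        y_t = adversary(tuple(outcomes))    # revealed simultaneously
        preds.append(p_t)
        outcomes.append(y_t)
    forecaster_loss = sum(absolute_loss(p, y) for p, y in zip(preds, outcomes))
    best_expert_loss = min(
        sum(absolute_loss(f[t], outcomes[t]) for t in range(T)) for f in experts
    )
    return forecaster_loss - best_expert_loss
```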

26 We will denote the regret in this special case as $V_T^{\mathrm{abs}}(A, \mathcal{F})$. [sent-52, score-0.245]

27 The Minimax Forecaster —which is based on work presented in [9] and [11], see also [10] for an exposition— is derived by an explicit analysis of the minimax regret $\inf_A V_T^{\mathrm{abs}}(A, \mathcal{F})$, where the infimum is over all forecasters $A$ producing at round $t$ a prediction $p_t$ as a function of $p_1, y_1, \dots$ [sent-53, score-1.058]

28 For general online learning problems, the analysis of this quantity is intractable. [sent-57, score-0.169]

29 However, for the specific setting we focus on (absolute loss and binary outcomes), one can get both an explicit expression for the minimax regret, as well as an explicit algorithm, provided $\inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f_t, y_t)$ can be efficiently computed for any sequence $y_1, \dots, y_T$. [sent-58, score-0.917]

30 In a nutshell, the idea is to begin by calculating the optimal prediction in the last round $T$, and then work backwards, calculating the optimal prediction at round $T-1$, $T-2$, etc. [sent-64, score-0.41]

31 Remarkably, the value of $\inf_A V_T(A, \mathcal{F})$ is exactly the Rademacher complexity $R_T(\mathcal{F})$ of the class $\mathcal{F}$, which is known to play a crucial role in understanding the sample complexity in statistical learning [5]. [sent-65, score-0.196]

32 When $R_T(\mathcal{F}) = o(T)$, we get a minimax regret $\inf_A V_T(A, \mathcal{F}) = o(T)$, which implies a vanishing per-round regret. [sent-73, score-0.555]

33 In terms of an explicit algorithm, the optimal prediction pt at round t is given by a complicated-looking recursive expression, involving exponentially many terms. [sent-74, score-0.448]

34 Indeed, for general online learning problems, this is the most one seems able to hope for. [sent-75, score-0.169]

35 However, an apparently little-known fact is that when one deals with a class F of fixed binary sequences as discussed above, then one can write the optimal prediction pt in a much simpler way. [sent-76, score-0.386]

36 Assuming $Y_{t+1}, \dots, Y_T$ are i.i.d. Rademacher random variables, the optimal prediction at round $t$ can be written as $p_t = \mathbb{E}\big[ \inf_{f \in \mathcal{F}} L(f, y_1 \cdots y_{t-1}\,(-1)\,Y_{t+1} \cdots Y_T) - \inf_{f \in \mathcal{F}} L(f, y_1 \cdots y_{t-1}\,1\,Y_{t+1} \cdots Y_T) \big]$. (3) [sent-83, score-0.667]

37 We denote this optimal strategy (for absolute loss and binary outcomes) as the Minimax Forecaster (mf). Algorithm 1 (Minimax Forecaster, mf): for t = 1 to T do: predict $p_t$ as defined in Eq. (3); receive outcome $y_t$ and suffer loss $|p_t - y_t|$; end for. [sent-86, score-0.371]

38 The relevant guarantee for mf is summarized in the following theorem. [sent-87, score-1.193]

39 For any class $\mathcal{F} \subseteq [-1, +1]^T$ of static experts, the regret of the Minimax Forecaster (Algorithm 1) satisfies $V_T^{\mathrm{abs}}(\mathrm{mf}, \mathcal{F}) = R_T(\mathcal{F})$. [sent-89, score-0.398]
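Eq. (3) averages the ERM-value difference over all sign assignments of the unrevealed outcomes, so for toy horizons it can be evaluated exactly by enumeration. The following is a minimal Python sketch of that computation, not the paper's implementation; all names are ours, and the finite tuple-based expert class is a stand-in for any class whose ERM values are computable:

```python
import itertools

def cumulative_loss(f, ys):
    # L(f, y): cumulative absolute loss of the static expert f on outcomes ys.
    return sum(abs(ft - yt) for ft, yt in zip(f, ys))

def best_expert_loss(experts, ys):
    # inf_f L(f, y) over a finite class: the ERM value on the sequence ys.
    return min(cumulative_loss(f, ys) for f in experts)

def minimax_prediction(experts, past, T):
    # Exact Eq. (3): average, over all 2^(T-t) assignments of the future
    # Rademacher variables, of the ERM-value difference between the two
    # possible outcomes at round t. Tractable only for toy horizons.
    t = len(past) + 1
    diffs = [
        best_expert_loss(experts, past + (-1,) + future)
        - best_expert_loss(experts, past + (1,) + future)
        for future in itertools.product((-1, 1), repeat=T - t)
    ]
    return sum(diffs) / len(diffs)

# Toy usage: two static experts, horizon T = 4, prediction for round 1.
experts = [(-1, -1, 1, 1), (1, 1, -1, -1)]
print(minimax_prediction(experts, past=(), T=4))
```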

40 2.1 Making the Minimax Forecaster Efficient. The Minimax Forecaster described above is not computationally efficient, as the computation of $p_t$ requires averaging over exponentially many ERMs. [sent-91, score-0.305]

41 Algorithm 2 (mf*) fragment: for $i = t, \dots, T$, let $Y_i$ be a Rademacher random variable; let $p_t := \inf_{f \in \mathcal{F}} L(f, y_1 \cdots y_{t-1}\,(-1)\,Y_{t+1} \cdots Y_T) - \inf_{f \in \mathcal{F}} L(f, y_1 \cdots y_{t-1}\,1\,Y_{t+1} \cdots Y_T)$. [sent-96, score-0.387]

42 Predict $p_t$, receive outcome $y_t$ and suffer loss $|p_t - y_t|$; end for. Theorem 2. [sent-108, score-1.38]
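A hedged sketch of the single-playout idea behind mf* follows, reusing the toy helpers above: one draw of the future Rademacher variables replaces the full expectation, so each round costs just two ERM computations. Names are ours; this illustrates the idea rather than reproducing the paper's pseudocode exactly:

```python
import random

def cumulative_loss(f, ys):
    return sum(abs(ft - yt) for ft, yt in zip(f, ys))

def best_expert_loss(experts, ys):
    return min(cumulative_loss(f, ys) for f in experts)

def mf_star_prediction(experts, past, T, rng=random):
    # One random playout: draw Y_{t+1}, ..., Y_T once and return the
    # resulting single-sample (unbiased) estimate of Eq. (3).
    t = len(past) + 1
    future = tuple(rng.choice((-1, 1)) for _ in range(T - t))
    return (best_expert_loss(experts, past + (-1,) + future)
            - best_expert_loss(experts, past + (1,) + future))
```

Each round thus needs two ERM computations instead of $2^{T-t}$ enumerations; the price, per Theorem 2 below, is that the regret guarantee holds with high probability or in expectation rather than deterministically.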

43 To prove the second statement, note that $\big|\mathbb{E}[p_t] - y_t\big| = \mathbb{E}\big[ |p_t - y_t| \big]$ for any fixed $y_t \in \{-1, +1\}$ and $p_t$ bounded in $[-1, +1]$, and use Thm. [sent-119, score-0.748]

44 To prove the first statement, note that $|p_t - y_t| - \mathbb{E}_{p_t}\big[ |p_t - y_t| \big]$ for $t = 1, \dots$ [sent-121, score-0.928]

45 The second statement in the theorem bounds the regret only in expectation and is thus weaker than the first one. [sent-128, score-0.266]

46 Depending on the specific learning problem, it might be easier to re-compute the infimum after changing a single point in the outcome sequence, as opposed to computing the infimum over a different outcome sequence in each round. [sent-134, score-0.254]

47 We note that extending the forecaster to other losses or different outcome spaces is not trivial: indeed, the recursive unwinding of the minimax regret term, leading to an explicit expression and an explicit algorithm, does not work as-is for other cases. [sent-136, score-1.107]

48 The algorithm we propose essentially uses the Minimax Forecaster as a subroutine, by feeding it with a carefully chosen sequence of binary values zt , and using predictions ft which are scaled to lie in the interval [−1, +1]. [sent-138, score-0.218]

49 The values of zt are based on a randomized rounding of values in [−1, +1], which depend in turn on the loss subgradient. [sent-139, score-0.268]
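The rounding step can be realized by mapping a value $v \in [-1, +1]$ to a random label in $\{-1, +1\}$ whose expectation is exactly $v$. A minimal sketch of that primitive (our naming; in the R2 Forecaster the rounded values are derived from loss subgradients, as formalized next):

```python
import random

def randomized_round(v, rng=random):
    # Map v in [-1, +1] to z in {-1, +1} with P(z = 1) = (1 + v) / 2,
    # so that E[z] = (1 + v)/2 - (1 - v)/2 = v.
    assert -1.0 <= v <= 1.0
    return 1 if rng.random() < (1.0 + v) / 2.0 else -1
```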

50 Also, we let $\partial_{p_t} \ell(p_t, y_t)$ denote any subgradient of the loss function with respect to the prediction $p_t$. [sent-144, score-0.857]

51 The pseudocode of the R2 Forecaster is presented as Algorithm 3 below, and its regret guarantee is summarized in Thm. [sent-145, score-0.245]

52 For any $\mathcal{F} \subseteq [-b, b]^T$, the regret of the R2 Forecaster (Algorithm 3) satisfies $V_T(\mathrm{R2}, \mathcal{F}) \le \rho\, R_T(\mathcal{F}) + \rho b \big( \sqrt{1/\eta} + 2 \big) \sqrt{2T \ln(2T/\delta)}$ (4) with probability at least $1 - \delta$. [sent-150, score-0.283]

53 The prediction $p_t$ which the algorithm computes is an empirical approximation to $b\,\mathbb{E}_{Y_{t+1}, \dots}$ [sent-151, score-0.334]

54 A larger value of η improves the regret bound, but also increases the runtime of the algorithm. [sent-165, score-0.245]

55 Thus, η provides a trade-off between the computational complexity of the algorithm and its regret guarantee. [sent-166, score-0.267]

56 Algorithm 3 fragment: for t = 1 to T do: $p_t := 0$; for $j = 1$ to $\eta T$ do: for $i = t, \dots$ [sent-171, score-0.168]

57 …, T} that is used to define the prediction $f(\pi_t)$ of expert $f$ at time $t$. [sent-200, score-0.236]

58 4 Application 1: Transductive Online Learning. The first application we consider is a rather straightforward one, in the context of transductive online learning [6]. [sent-203, score-0.359]

59 At each round $t$, the learner must provide a prediction $p_t$ for the label of $x_t$. [sent-211, score-0.943]

60 The true label $y_t$ is then revealed, and the learner incurs a loss $\ell(p_t, y_t)$. [sent-212, score-1.043]

61 The learner’s goal is to minimize the transductive online regret $\sum_{t=1}^{T} \ell(p_t, y_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f(x_t), y_t)$ with respect to a fixed class of predictors $\mathcal{F}$ of the form $\{x \mapsto f(x)\}$. [sent-213, score-1.706]

62 The significance of this result is that efficient batch learning (via empirical risk minimization) implies efficient learning in the transductive online setting. [sent-216, score-0.486]

63 This is an important result, as online learning can be computationally harder than batch learning —see, e.g. [sent-217, score-0.275]

64 This shows that efficient batch learning not only implies efficient transductive online learning (the main thesis of [16]), but also that the same rates can be obtained, and for possibly non-binary prediction problems as well. [sent-222, score-0.512]

65 Formally, at each step t: (1) the adversary chooses and reveals the next element $\pi_t$ of the permutation; (2) the forecaster chooses $p_t \in \mathcal{P}$ and simultaneously the adversary chooses $y_t \in \mathcal{Y}$. [sent-223, score-1.631]

66 Suppose we have a computationally efficient algorithm for empirical risk minimization (with respect to the zero-one loss) over a class F of {0, 1}-valued functions with VC dimension d. [sent-225, score-0.152]

67 Then, in the transductive online model, the efficient randomized forecaster mf* achieves an expected regret of $O(\sqrt{dT})$ with respect to the zero-one loss. [sent-226, score-1.236]

68 Since the set $\{x_1, \dots, x_T\}$ of unlabeled examples is known, we reduce the online transductive model to prediction with expert advice in the setting of Remark 1. [sent-232, score-0.562]

69 When F maps to {0, 1}, and we care about the zero-one loss, we can use the forecaster mf* to compute randomized predictions and apply Thm. [sent-237, score-0.664]

70 2 to bound the expected transductive online regret with $R_T(\mathcal{F})$. [sent-238, score-0.604]
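Concretely, the only interface the reduction needs is an ERM routine that returns the best achievable number of mistakes on the known example set under a candidate labeling. Below is a loose sketch under that assumption; the oracle signature, the `get_label` callback, and all other names are hypothetical, labels are taken in $\{-1, +1\}$, and the randomized-prediction form of mf* is used:

```python
import random

def transductive_mf_star(erm_oracle, xs, get_label, rng=random):
    # erm_oracle(xs, labels) -> min over f in F of the number of mistakes
    # on the labeled sample (hypothetical interface to an ERM routine).
    # get_label(t) -> the adversary's true label in {-1, +1} for round t.
    T = len(xs)
    past, mistakes = [], 0
    for t in range(T):
        future = [rng.choice((-1, 1)) for _ in range(T - t - 1)]
        # Two ERM calls that differ only in the label guessed for x_t.
        score = (erm_oracle(xs, past + [-1] + future)
                 - erm_oracle(xs, past + [1] + future))
        score = max(-1.0, min(1.0, score))  # mistake counts keep this in [-1, 1]
        # Randomized rounding of the score into a concrete binary prediction.
        guess = 1 if rng.random() < (1.0 + score) / 2.0 else -1
        y_t = get_label(t)
        mistakes += int(guess != y_t)
        past.append(y_t)
    return mistakes
```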

71 We close this section by contrasting our results for online transductive learning with those of [7] about standard online learning. [sent-242, score-0.528]

72 If $\mathcal{F}$ contains $\{0, 1\}$-valued functions, then the optimal regret bound for online learning is of order $\sqrt{dT}$, where $d$ is the Littlestone dimension of $\mathcal{F}$. [sent-243, score-0.433]

73 Since the Littlestone dimension of a class is never smaller than its VC dimension, we conclude that online learning is a harder setting than online transductive learning. [sent-244, score-0.605]

74 In collaborative filtering, the learning problem is to predict entries of an unknown m × n matrix based on a subset of its observed entries. [sent-246, score-0.155]

75 Thus, an online adversarial setting, where no distributional assumptions whatsoever are required, seems to be particularly well-suited to this problem domain. [sent-253, score-0.189]

76 In an online setting, at each round $t$ the adversary reveals an index pair $(i_t, j_t)$ and secretly chooses a value $y_t$ for the corresponding matrix entry. [sent-254, score-0.889]

77 After that, the learner selects a prediction pt for that entry. [sent-255, score-0.39]

78 Then $y_t$ is revealed and the learner suffers a loss $\ell(p_t, y_t)$. [sent-256, score-1.065]

79 Hence, the goal of a learner is to minimize the regret with respect to a fixed class $\mathcal{W}$ of prediction matrices, $\sum_{t=1}^{T} \ell(p_t, y_t) - \inf_{W \in \mathcal{W}} \sum_{t=1}^{T} \ell(W_{i_t, j_t}, y_t)$. [sent-257, score-0.795]

80 Following reality, we will assume that the adversary picks a different entry in each round. [sent-258, score-0.771]

81 When the learner’s performance is measured by the regret after all T = mn entries have been predicted, the online collaborative filtering setting reduces to prediction with expert advice as discussed in Remark 1. [sent-259, score-0.79]

82 Many convex learning problems, such as linear and kernel-based predictors, as well as matrix-based predictors, can be learned efficiently both in a stochastic and an online setting, using mirror descent or regularized follow-the-leader methods. [sent-261, score-0.267]

83 In particular, in the case of $\mathcal{W}$ consisting of $m \times n$ matrices with trace-norm at most $r$, standard online regret bounds would scale like $O(r\sqrt{T})$. [sent-263, score-0.439]

84 Since for this norm one typically has $r = O(\sqrt{mn})$, we get a per-round regret guarantee of $O(\sqrt{mn/T})$. [sent-264, score-0.282]
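To spell out why this guarantee is trivial (our one-line elaboration, using the constants above and the fact, repeated later, that $T$ can never exceed the number of entries $mn$):

```latex
\frac{O\big(r\sqrt{T}\big)}{T}
  = O\!\left(\frac{r}{\sqrt{T}}\right)
  = O\!\left(\sqrt{\frac{mn}{T}}\right)
  \quad \text{for } r = O(\sqrt{mn}),
```

which is $o(1)$ only when $T$ grows faster than $mn$, impossible here since there are only $mn$ entries to predict.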

85 On the other hand, based on general techniques developed in [15] and greatly extended in [1], it can be shown that online learnability is information-theoretically possible for such W. [sent-266, score-0.202]

86 Thus, to the best of our knowledge, there is currently no efficient (polynomial-time) online algorithm which attains non-trivial regret. [sent-268, score-0.169]

87 Consider first the transductive online setting, where the set of indices to be predicted is known in advance, and the adversary may only choose the order and values of the entries. [sent-270, score-0.475]

88 It is readily seen that the R2 Forecaster can be applied in this setting, using any convex class W of fixed matrices with bounded entries to compete against, and any convex Lipschitz loss function. [sent-271, score-0.258]

89 The algorithm we propose is very simple: we apply the R2 Forecaster as if we are in a setting with time horizon T = mn, which is played over all entries of the m × n matrix. [sent-276, score-0.168]

90 Consider a (possibly randomized) forecaster $A$ for a class $\mathcal{F}$ whose regret after $T$ steps satisfies $V_T(A, \mathcal{F}) \le G$ with probability at least $1 - \delta > \frac{1}{2}$. [sent-282, score-0.832]

91 Furthermore, suppose the loss function is such that $\inf_{p \in \mathcal{P}} \sup_{y \in \mathcal{Y}} \inf_{p' \in \mathcal{P}} \big( \ell(p, y) - \ell(p', y) \big) \ge 0$. [sent-283, score-0.303]

92 Using this lemma, the following theorem exemplifies how we can obtain a regret guarantee for our algorithm, in the case of W consisting of the convex set of matrices with bounded trace-norm and bounded entries. [sent-289, score-0.343]

93 Also, let $\mathcal{W}$ consist of $n \times n$ matrices with trace-norm at most $r = O(n)$ and entries at most $b = O(1)$, and suppose we apply the R2 Forecaster over time horizon $n^2$ and all entries of the matrix. [sent-293, score-0.193]

94 Then with probability at least $1 - \delta$, after $T$ rounds, the algorithm achieves an average per-round regret of at most $O\big( (n^{3/2} + n\sqrt{\ln(n/\delta)}) / T \big)$ uniformly over $T = 1, \dots, n^2$. [sent-294, score-0.245]

95 In our setting, where the adversary chooses a different entry at each round, [21, Theorem 6] implies that for the class $\mathcal{W}$ of all matrices with trace-norm at most $r = O(n)$, it holds that $R_T(\mathcal{W})/T \le O(n^{3/2}/T)$. [sent-299, score-0.22]

96 By Thm. 3, the regret after $n^2$ rounds is $O\big(n^{3/2} + n\sqrt{\ln(n/\delta)}\big)$ with probability at least $1 - \delta$. [sent-303, score-0.269]

97 Applying Lemma 1, we get that the cumulative regret at the end of any round $T = 1, \dots$ [sent-304, score-0.361]

98 While the regret might seem unusual compared to standard regret bounds (which usually have rates of $1/\sqrt{T}$ for general losses), it is a natural outcome of the non-asymptotic nature of our setting, where $T$ can never be larger than $n^2$. [sent-309, score-0.596]

99 Optimal strategies and minimax lower bounds for online convex games. [sent-324, score-0.353]

100 A stochastic view of optimal regret through minimax duality. [sent-402, score-0.394]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('forecaster', 0.557), ('yt', 0.464), ('pt', 0.265), ('regret', 0.245), ('transductive', 0.19), ('online', 0.169), ('minimax', 0.149), ('inf', 0.122), ('adversary', 0.116), ('outcome', 0.106), ('rt', 0.105), ('mf', 0.1), ('abs', 0.094), ('round', 0.089), ('rademacher', 0.089), ('erent', 0.085), ('collaborative', 0.081), ('vt', 0.08), ('randomized', 0.075), ('expert', 0.073), ('ft', 0.072), ('di', 0.071), ('erm', 0.07), ('prediction', 0.069), ('zt', 0.069), ('batch', 0.066), ('rounding', 0.065), ('loss', 0.059), ('horizon', 0.058), ('learner', 0.056), ('entries', 0.055), ('playout', 0.053), ('ltering', 0.051), ('su', 0.051), ('outcomes', 0.046), ('remark', 0.045), ('mirror', 0.044), ('risk', 0.043), ('colt', 0.042), ('vc', 0.042), ('experts', 0.04), ('computationally', 0.04), ('ln', 0.038), ('mn', 0.037), ('wit', 0.035), ('convex', 0.035), ('abernethy', 0.035), ('forecasting', 0.034), ('mum', 0.034), ('advice', 0.033), ('learnability', 0.033), ('salakhutdinov', 0.033), ('predictions', 0.032), ('game', 0.031), ('chooses', 0.031), ('class', 0.03), ('static', 0.029), ('setting', 0.028), ('shamir', 0.027), ('cumulative', 0.027), ('srebro', 0.027), ('played', 0.027), ('trivial', 0.026), ('appendix', 0.025), ('explicit', 0.025), ('polynomial', 0.025), ('matrices', 0.025), ('absolute', 0.025), ('rounds', 0.024), ('permutation', 0.024), ('erence', 0.024), ('littlestone', 0.024), ('rakhlin', 0.024), ('xt', 0.024), ('sequence', 0.023), ('predictors', 0.022), ('minimizers', 0.022), ('complexity', 0.022), ('binary', 0.022), ('receive', 0.022), ('revealed', 0.022), ('bartlett', 0.022), ('martingale', 0.021), ('lipschitz', 0.021), ('statement', 0.021), ('vanishing', 0.021), ('irrespective', 0.02), ('boolean', 0.02), ('reveals', 0.02), ('minimization', 0.02), ('distributional', 0.02), ('dimension', 0.019), ('opposed', 0.019), ('bounded', 0.019), ('predict', 0.019), ('moreover', 0.019), ('descent', 0.019), ('lemma', 0.018), ('implies', 0.018), ('nips', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 80 nips-2011-Efficient Online Learning via Randomized Rounding

Author: Nicolò Cesa-Bianchi, Ohad Shamir

Abstract: Most online algorithms used in machine learning today are based on variants of mirror descent or follow-the-leader. In this paper, we present an online algorithm based on a completely different approach, which combines “random playout” and randomized rounding of loss subgradients. As an application of our approach, we provide the first computationally efficient online algorithm for collaborative filtering with trace-norm constrained matrices. As a second application, we solve an open question linking batch learning and transductive online learning.

2 0.3138085 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries

Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari

Abstract: Learning theory has largely focused on two main learning scenarios: the classical statistical setting where instances are drawn i.i.d. from a fixed distribution, and the adversarial scenario wherein, at every time step, an adversarially chosen instance is revealed to the player. It can be argued that in the real world neither of these assumptions is reasonable. We define the minimax value of a game where the adversary is restricted in his moves, capturing stochastic and non-stochastic assumptions on data. Building on the sequential symmetrization approach, we define a notion of distribution-dependent Rademacher complexity for the spectrum of problems ranging from i.i.d. to worst-case. The bounds let us immediately deduce variation-type bounds. We study a smoothed online learning scenario and show that exponentially small amount of noise can make function classes with infinite Littlestone dimension learnable.

3 0.25689733 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction

Author: Elad Hazan, Satyen Kale

Abstract: We present an efficient algorithm for the problem of online multiclass prediction with bandit feedback in the fully adversarial setting. We measure its regret with respect to the log-loss defined in [AR09], which is parameterized by a scalar α. We prove that the regret of NEWTRON is $O(\log T)$ when α is a constant that does not vary with horizon T, and at most $O(T^{2/3})$ if α is allowed to increase to infinity with T. For $\alpha = O(\log T)$, the regret is bounded by $O(\sqrt{T})$, thus solving the open problem of [KSST08, AR09]. Our algorithm is based on a novel application of the online Newton method [HAK07]. We test our algorithm and show it to perform well in experiments, even when α is a small constant.

4 0.21773584 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time

Author: Dan Garber, Elad Hazan

Abstract: In recent years semidefinite optimization has become a tool of major importance in various optimization and machine learning problems. In many of these problems the amount of data in practice is so large that there is a constant need for faster algorithms. In this work we present the first sublinear time approximation algorithm for semidefinite programs which we believe may be useful for such problems in which the size of data may cause even linear time algorithms to have prohibitive running times in practice. We present the algorithm and its analysis alongside with some theoretical lower bounds and an improved algorithm for the special problem of supervised learning of a distance metric.

5 0.20507455 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits

Author: Yasin Abbasi-Yadkori, Csaba Szepesvári, David Tax

Abstract: We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem. In particular, we show that a simple modification of Auer’s UCB algorithm (Auer, 2002) achieves with high probability constant regret. More importantly, we modify and, consequently, improve the analysis of the algorithm for the linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), Li et al. (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. In both cases, the improvement stems from the construction of smaller confidence sets. For their construction we use a novel tail inequality for vector-valued martingales.

6 0.19756469 220 nips-2011-Prediction strategies without loss

7 0.17303969 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits

8 0.16847658 202 nips-2011-On the Universality of Online Mirror Descent

9 0.1680641 38 nips-2011-Anatomically Constrained Decoding of Finger Flexion from Electrocorticographic Signals

10 0.15596387 21 nips-2011-Active Learning with a Drifting Distribution

11 0.14083649 145 nips-2011-Learning Eigenvectors for Free

12 0.1382079 245 nips-2011-Selecting the State-Representation in Reinforcement Learning

13 0.12373648 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations

14 0.11973131 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time

15 0.11919926 61 nips-2011-Contextual Gaussian Process Bandit Optimization

16 0.10420763 25 nips-2011-Adaptive Hedge

17 0.10337849 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning

18 0.10152671 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions

19 0.099238336 272 nips-2011-Stochastic convex optimization with bandit feedback

20 0.091026455 162 nips-2011-Lower Bounds for Passive and Active Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.195), (1, -0.332), (2, -0.023), (3, -0.075), (4, 0.272), (5, 0.004), (6, 0.092), (7, -0.004), (8, -0.003), (9, -0.048), (10, 0.147), (11, 0.003), (12, 0.143), (13, 0.033), (14, -0.038), (15, -0.012), (16, 0.037), (17, 0.053), (18, 0.029), (19, -0.005), (20, -0.036), (21, -0.057), (22, -0.019), (23, -0.021), (24, 0.061), (25, 0.082), (26, 0.049), (27, 0.097), (28, -0.025), (29, 0.024), (30, -0.015), (31, 0.029), (32, -0.044), (33, -0.005), (34, 0.016), (35, -0.04), (36, -0.087), (37, 0.067), (38, 0.067), (39, -0.028), (40, 0.02), (41, -0.001), (42, 0.006), (43, 0.0), (44, 0.012), (45, 0.004), (46, -0.056), (47, 0.001), (48, -0.15), (49, -0.051)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96518266 80 nips-2011-Efficient Online Learning via Randomized Rounding

Author: Nicolò Cesa-Bianchi, Ohad Shamir

Abstract: Most online algorithms used in machine learning today are based on variants of mirror descent or follow-the-leader. In this paper, we present an online algorithm based on a completely different approach, which combines “random playout” and randomized rounding of loss subgradients. As an application of our approach, we provide the first computationally efficient online algorithm for collaborative filtering with trace-norm constrained matrices. As a second application, we solve an open question linking batch learning and transductive online learning.

2 0.82061297 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries

Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari

Abstract: Learning theory has largely focused on two main learning scenarios: the classical statistical setting where instances are drawn i.i.d. from a fixed distribution, and the adversarial scenario wherein, at every time step, an adversarially chosen instance is revealed to the player. It can be argued that in the real world neither of these assumptions is reasonable. We define the minimax value of a game where the adversary is restricted in his moves, capturing stochastic and non-stochastic assumptions on data. Building on the sequential symmetrization approach, we define a notion of distribution-dependent Rademacher complexity for the spectrum of problems ranging from i.i.d. to worst-case. The bounds let us immediately deduce variation-type bounds. We study a smoothed online learning scenario and show that exponentially small amount of noise can make function classes with infinite Littlestone dimension learnable.

3 0.82006741 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction

Author: Elad Hazan, Satyen Kale

Abstract: We present an efficient algorithm for the problem of online multiclass prediction with bandit feedback in the fully adversarial setting. We measure its regret with respect to the log-loss defined in [AR09], which is parameterized by a scalar α. We prove that the regret of NEWTRON is $O(\log T)$ when α is a constant that does not vary with horizon T, and at most $O(T^{2/3})$ if α is allowed to increase to infinity with T. For $\alpha = O(\log T)$, the regret is bounded by $O(\sqrt{T})$, thus solving the open problem of [KSST08, AR09]. Our algorithm is based on a novel application of the online Newton method [HAK07]. We test our algorithm and show it to perform well in experiments, even when α is a small constant.

4 0.70922005 21 nips-2011-Active Learning with a Drifting Distribution

Author: Liu Yang

Abstract: We study the problem of active learning in a stream-based setting, allowing the distribution of the examples to change over time. We prove upper bounds on the number of prediction mistakes and number of label requests for established disagreement-based active learning algorithms, both in the realizable case and under Tsybakov noise. We further prove minimax lower bounds for this problem.

5 0.68618166 220 nips-2011-Prediction strategies without loss

Author: Michael Kapralov, Rina Panigrahy

Abstract: Consider a sequence of bits where we are trying to predict the next bit from the previous bits. Assume we are allowed to say ‘predict 0’ or ‘predict 1’, and our payoff is +1 if the prediction is correct and −1 otherwise. We will say that at each point in time the loss of an algorithm is the number of wrong predictions minus the number of right predictions so far. In this paper we are interested in algorithms that have essentially zero (expected) loss over any string at any point in time and yet have small regret with respect to always predicting 0 or always predicting 1. For a sequence of length T our algorithm has regret $14\epsilon T$ and loss $2\sqrt{2T}\,e^{-\epsilon^2 T}$ in expectation for all strings. We show that the tradeoff between loss and regret is optimal up to constant factors. Our techniques extend to the general setting of N experts, where the related problem of trading off regret to the best expert for regret to the ’special’ expert has been studied by Even-Dar et al. (COLT’07). We obtain essentially zero loss with respect to the special expert and optimal loss/regret tradeoff, improving upon the results of Even-Dar et al and settling the main question left open in their paper. The strong loss bounds of the algorithm have some surprising consequences. First, we obtain a parameter free algorithm for the experts problem that has optimal regret bounds with respect to k-shifting optima, i.e. bounds with respect to the optimum that is allowed to change arms multiple times. Moreover, for any window of size n the regret of our algorithm to any expert never exceeds $O(\sqrt{n(\log N + \log T)})$, where N is the number of experts and T is the time horizon, while maintaining the essentially zero loss property.

6 0.67513466 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time

7 0.64559245 145 nips-2011-Learning Eigenvectors for Free

8 0.63857394 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits

9 0.59342289 38 nips-2011-Anatomically Constrained Decoding of Finger Flexion from Electrocorticographic Signals

10 0.56387538 202 nips-2011-On the Universality of Online Mirror Descent

11 0.54418123 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits

12 0.50977963 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time

13 0.49399251 272 nips-2011-Stochastic convex optimization with bandit feedback

14 0.48925126 162 nips-2011-Lower Bounds for Passive and Active Learning

15 0.43626466 25 nips-2011-Adaptive Hedge

16 0.40445226 61 nips-2011-Contextual Gaussian Process Bandit Optimization

17 0.40147504 245 nips-2011-Selecting the State-Representation in Reinforcement Learning

18 0.39318186 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations

19 0.38160959 218 nips-2011-Predicting Dynamic Difficulty

20 0.37494931 225 nips-2011-Probabilistic amplitude and frequency demodulation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.04), (4, 0.098), (20, 0.047), (26, 0.032), (31, 0.068), (43, 0.093), (45, 0.15), (48, 0.034), (52, 0.109), (57, 0.031), (74, 0.054), (83, 0.045), (99, 0.073)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91900551 80 nips-2011-Efficient Online Learning via Randomized Rounding

Author: Nicolò Cesa-Bianchi, Ohad Shamir

Abstract: Most online algorithms used in machine learning today are based on variants of mirror descent or follow-the-leader. In this paper, we present an online algorithm based on a completely different approach, which combines “random playout” and randomized rounding of loss subgradients. As an application of our approach, we provide the first computationally efficient online algorithm for collaborative filtering with trace-norm constrained matrices. As a second application, we solve an open question linking batch learning and transductive online learning.

2 0.91724312 177 nips-2011-Multi-armed bandits on implicit metric spaces

Author: Aleksandrs Slivkins

Abstract: The multi-armed bandit (MAB) setting is a useful abstraction of many online learning tasks which focuses on the trade-off between exploration and exploitation. In this setting, an online algorithm has a fixed set of alternatives (“arms”), and in each round it selects one arm and then observes the corresponding reward. While the case of small number of arms is by now well-understood, a lot of recent work has focused on multi-armed bandits with (infinitely) many arms, where one needs to assume extra structure in order to make the problem tractable. In particular, in the Lipschitz MAB problem there is an underlying similarity metric space, known to the algorithm, such that any two arms that are close in this metric space have similar payoffs. In this paper we consider the more realistic scenario in which the metric space is implicit – it is defined by the available structure but not revealed to the algorithm directly. Specifically, we assume that an algorithm is given a tree-based classification of arms. For any given problem instance such a classification implicitly defines a similarity metric space, but the numerical similarity information is not available to the algorithm. We provide an algorithm for this setting, whose performance guarantees (almost) match the best known guarantees for the corresponding instance of the Lipschitz MAB problem.

3 0.88771981 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries

Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari

Abstract: Learning theory has largely focused on two main learning scenarios: the classical statistical setting where instances are drawn i.i.d. from a fixed distribution, and the adversarial scenario wherein, at every time step, an adversarially chosen instance is revealed to the player. It can be argued that in the real world neither of these assumptions is reasonable. We define the minimax value of a game where the adversary is restricted in his moves, capturing stochastic and non-stochastic assumptions on data. Building on the sequential symmetrization approach, we define a notion of distribution-dependent Rademacher complexity for the spectrum of problems ranging from i.i.d. to worst-case. The bounds let us immediately deduce variation-type bounds. We study a smoothed online learning scenario and show that exponentially small amount of noise can make function classes with infinite Littlestone dimension learnable.

4 0.87763429 29 nips-2011-Algorithms and hardness results for parallel large margin learning

Author: Phil Long, Rocco Servedio

Abstract: We study the fundamental problem of learning an unknown large-margin halfspace in the context of parallel computation. Our main positive result is a parallel algorithm for learning a large-margin halfspace that is based on interior point methods from convex optimization and fast parallel algorithms for matrix computations. We show that this algorithm learns an unknown γ-margin halfspace over n dimensions using poly(n, 1/γ) processors and runs in time $\tilde{O}(1/\gamma) + O(\log n)$. In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have $\Omega(1/\gamma^2)$ runtime dependence on γ. Our main negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We give an information-theoretic proof that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized: the ability to call the weak learner multiple times in parallel within a single boosting stage does not reduce the overall number of successive stages of boosting that are required.

5 0.86695123 236 nips-2011-Regularized Laplacian Estimation and Fast Eigenvector Approximation

Author: Patrick O. Perry, Michael W. Mahoney

Abstract: Recently, Mahoney and Orecchia demonstrated that popular diffusion-based procedures to compute a quick approximation to the first nontrivial eigenvector of a data graph Laplacian exactly solve certain regularized Semi-Definite Programs (SDPs). In this paper, we extend that result by providing a statistical interpretation of their approximation procedure. Our interpretation will be analogous to the manner in which $\ell_2$-regularized or $\ell_1$-regularized $\ell_2$-regression (often called Ridge regression and Lasso regression, respectively) can be interpreted in terms of a Gaussian prior or a Laplace prior, respectively, on the coefficient vector of the regression problem. Our framework will imply that the solutions to the Mahoney-Orecchia regularized SDP can be interpreted as regularized estimates of the pseudoinverse of the graph Laplacian. Conversely, it will imply that the solution to this regularized estimation problem can be computed very quickly by running, e.g., the fast diffusion-based PageRank procedure for computing an approximation to the first nontrivial eigenvector of the graph Laplacian. Empirical results are also provided to illustrate the manner in which approximate eigenvector computation implicitly performs statistical regularization, relative to running the corresponding exact algorithm.

6 0.8628462 198 nips-2011-On U-processes and clustering performance

7 0.862481 186 nips-2011-Noise Thresholds for Spectral Clustering

8 0.86231339 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction

9 0.86204243 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

10 0.8620218 263 nips-2011-Sparse Manifold Clustering and Embedding

11 0.86064017 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions

12 0.85346991 231 nips-2011-Randomized Algorithms for Comparison-based Search

13 0.8511408 114 nips-2011-Hierarchical Multitask Structured Output Learning for Large-scale Sequence Segmentation

14 0.85025865 199 nips-2011-On fast approximate submodular minimization

15 0.84991044 64 nips-2011-Convergent Bounds on the Euclidean Distance

16 0.84869426 106 nips-2011-Generalizing from Several Related Classification Tasks to a New Unlabeled Sample

17 0.84621382 195 nips-2011-On Learning Discrete Graphical Models using Greedy Methods

18 0.84582841 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time

19 0.84578168 202 nips-2011-On the Universality of Online Mirror Descent

20 0.84551817 258 nips-2011-Sparse Bayesian Multi-Task Learning