
202 nips-2011-On the Universality of Online Mirror Descent


Source: pdf

Author: Nati Srebro, Karthik Sridharan, Ambuj Tewari

Abstract: We show that for a general class of convex online learning problems, Mirror Descent can always achieve a (nearly) optimal regret guarantee. 1

Reference: text


Summary: the most important sentences generated by the tf-idf model

sentIndex sentText sentNum sentScore

1 Abstract: We show that for a general class of convex online learning problems, Mirror Descent can always achieve a (nearly) optimal regret guarantee. [sent-5, score-0.59]

2 Mirror Descent is also applicable, and has been analyzed, in a stochastic optimization setting [9] and in an online setting, where it can ensure bounded online regret [20]. [sent-7, score-0.613]

3 In fact, many classical online learning algorithms can be viewed as instantiations or variants of Online Mirror Descent, generally either with the Euclidean geometry (e. [sent-8, score-0.256]

4 More recently, the Online Mirror Descent framework has been applied, with appropriate distance generating functions derived for a variety of new learning problems like multi-task learning and other matrix learning problems [10], online PCA [26] etc. [sent-11, score-0.402]

5 That is, for any convex online learning problem, of a general form (specified in Section 2), if the problem is online learnable, then it is online learnable, with a nearly optimal regret rate, using Online Mirror Descent, with an appropriate distance generating function. [sent-13, score-1.143]

6 Since Mirror Descent is a first-order method and often has simple, computationally efficient update rules, this makes the result especially attractive. [sent-14, score-0.266]

7 Viewing online learning as a sequentially repeated game, this means that Online Mirror Descent is a near optimal strategy, guaranteeing an outcome very close to the value of the game. [sent-15, score-0.252]

8 We then extend the notion of a martingale type of a Banach space to be sensitive to both the constraint set and the data domain, and building on results of [24], we relate the value of the online learning repeated game to this generalized notion of martingale type (Section 4). [sent-17, score-1.171]

9 Finally, again building on and generalizing the work of [16], we show how having appropriate martingale type guarantees the existence of a good uniformly convex function (Section 5), that in turn establishes the desired nearly-optimal guarantee on Online Mirror Descent (Section 6). [sent-18, score-0.725]

10 We mainly build on the analysis of [24], who related the value of the online game to the notion of martingale type of a Banach space and uniform convexity when the constraint set and data domain are dual to each other. [sent-19, score-0.987]

11 Shalev-Shwartz and Singer later observed that the online version of Mirror Descent, again with an ℓp bound and matching ℓq Lipschitz assumption (1 ≤ p ≤ 2, 1/q + 1/p = 1), is also optimal in terms of the worst-case (adversarial) online regret. [sent-23, score-0.452]

12 We emphasize that although in most, if not all, settings known to us these three notions of optimality coincide, here we focus only on the worst-case online regret. [sent-25, score-0.256]

13 Sridharan and Tewari [24] generalized the optimality of online Mirror Descent (w. [sent-26, score-0.256]

14 regret) to scenarios where the learner is constrained to a unit ball of an arbitrary Banach space (not necessarily an ℓp space) and the objective functions have sub-gradients that lie in the dual ball of the space; for reasons that will become clear shortly, we refer to this as the data domain. [sent-29, score-0.484]

15 However, often we encounter problems where the constraint set and data domain are not dual balls, but rather are arbitrary convex subsets. [sent-30, score-0.377]

16 In this paper, we explore this more general, “non-dual”, variant, and show that also in such scenarios online Mirror Descent is (nearly) optimal in terms of the (asymptotic) worst-case online regret. [sent-31, score-0.492]

17 2 Online Convex Learning Problem An online convex learning problem can be viewed as a multi-round repeated game where on round t, the learner first picks a vector (predictor) wt from some fixed set W, which is a closed convex subset of a vector space B. [sent-32, score-0.892]

18 Next, the adversary picks a convex cost function ft : W → R from a class of convex functions F. [sent-33, score-0.624]

19 At the end of the round, the learner pays instantaneous cost ft (wt ). [sent-34, score-0.161]

20 We refer to the strategy used by the learner to pick the wt's as an online learning algorithm. [sent-35, score-0.387]

21 More formally, an online learning algorithm A for the problem is specified by the mapping A : ∪_{n∈N} F^{n−1} → W. [sent-36, score-0.226]

22 Rn(A, f1, . . . , fn) = (1/n) Σ_{t=1}^n ft(A(f1:t−1)) − inf_{w∈W} (1/n) Σ_{t=1}^n ft(w). [sent-43, score-0.387]

23 The goal of the learner (or the online learning algorithm) is to minimize the regret for any n. [sent-44, score-0.391]
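To make the protocol and the regret definition concrete, here is a small sketch under simplifying assumptions: linear costs ft(w) = ⟨xt, w⟩ over the probability simplex W, where the comparator infimum is attained at a coordinate vertex. The names average_regret and play are illustrative, not from the paper.

```python
import numpy as np

def average_regret(play, xs):
    """Average regret R_n of an online algorithm for linear losses
    f_t(w) = <x_t, w> over the probability simplex W.

    `play(history)` is the online algorithm A: it maps the revealed
    prefix x_1, ..., x_{t-1} to the learner's next point w_t in W.
    For linear losses on the simplex, inf_{w in W} sum_t <x_t, w> is
    attained at the coordinate with the smallest cumulative loss."""
    n = len(xs)
    cost = sum(float(xs[t] @ play(xs[:t])) for t in range(n))  # sum_t f_t(w_t)
    best = np.sum(np.asarray(xs), axis=0).min()                # best fixed w in hindsight
    return (cost - best) / n
```

For instance, the uniform strategy play = lambda hist: np.ones(d)/d ignores the history entirely, so its average regret is just the gap between the mean coordinate cost and the best coordinate's average cost.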

24 In this paper, we consider cost function classes F specified by a convex subset X ⊂ B* of the dual space B*. [sent-45, score-0.313]

25 Formally: Vn(F, X, W) = inf_A sup_{f1:n ∈ F(X)} Rn(A, f1:n)  (1). It is well known that the value of the game for all the above sets F is the same. [sent-47, score-0.246]

26 The class Fsup (X ) corresponds to linear prediction with an absolute-difference loss, and thus its value is the best possible guarantee for online supervised learning with this loss. [sent-54, score-0.35]

27 The equality in the above proposition can also be extended to other commonly occurring convex loss function classes like the hinge loss class with some extra constant factors. [sent-57, score-0.326]

28 2 Any convex supervised learning problem can be viewed as linear classification with some convex constraint W on predictors. [sent-59, score-0.503]

29 Further, for any p ∈ [1, 2] let Vp := inf{ V | ∀n ∈ N, Vn(W, X) ≤ V n^{−(1−1/p)} }  (3). Most prior work on online learning and optimization considers the case when W is the unit ball of some Banach space, and X is the unit ball of the dual space, i. [sent-61, score-0.724]

30 In this work, however, we analyze the general problem where X ⊂ B* is not necessarily the dual ball of W. [sent-64, score-0.214]

31 It will be convenient for us to relate the notions of a convex set and a corresponding norm. [sent-65, score-0.25]

32 Throughout this paper, we will require that W and X are convex and centrally symmetric. [sent-70, score-0.245]
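To make the set-to-norm correspondence concrete, here is a minimal sketch (the helper names minkowski_gauge and in_W are assumptions, not from the paper) that computes the Minkowski functional ‖w‖_W = inf{ t > 0 : w/t ∈ W } of a closed, bounded, centrally symmetric convex set from a membership oracle, by bisection.

```python
import numpy as np

def minkowski_gauge(w, in_W, hi=1e6, tol=1e-9):
    """Minkowski functional of a closed, bounded, centrally symmetric
    convex set W containing 0 in its interior, via bisection on t.
    `in_W(v)` is a membership oracle for W; membership of w/t is
    monotone in t, so bisection applies.  `hi` must upper-bound the
    answer."""
    w = np.asarray(w, dtype=float)
    lo = 0.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if in_W(w / mid):   # w/mid in W  <=>  mid >= ||w||_W
            hi = mid
        else:
            lo = mid
    return hi
```

On the ℓ2 unit ball this recovers the Euclidean norm (e.g. the gauge of (3, 4) is 5); central symmetry of W is exactly what makes the resulting functional a genuine norm, as required in item 32.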

33 3 Mirror Descent and Uniform Convexity A key tool in the analysis of mirror descent is the notion of strong convexity, or more generally uniform convexity: Definition 1. [sent-77, score-0.899]

34 To this end define Dp := inf{ sup_{w∈W} Ψ(w) | Ψ : W → R+ is (p/(p−1))-uniformly convex w. [sent-84, score-0.368]

35 As an example, notice that when Ψ(w) = (1/2)‖w‖₂² we get the gradient descent algorithm, and when W is the d-dimensional simplex and Ψ(w) = Σ_{i=1}^d wi log(1/wi) we get the multiplicative weights update algorithm. [sent-88, score-0.36]
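As a hedged sketch of the two instantiations just mentioned (the function name omd_entropy is mine, not the paper's): with the entropic distance generating function on the simplex, the Online Mirror Descent update becomes a multiplicative weights step, while Ψ(w) = (1/2)‖w‖₂² would give (projected) gradient descent instead.

```python
import numpy as np

def omd_entropy(grads, eta):
    """Online Mirror Descent on the d-simplex with the entropic distance
    generating function Psi(w) = sum_i w_i log w_i.  The mirror update,
    w_{t+1} proportional to w_t * exp(-eta * g_t), is exactly the
    multiplicative weights algorithm.  `grads` is the sequence of
    (sub)gradients g_t revealed by the adversary; returns the final
    iterate."""
    d = len(grads[0])
    w = np.ones(d) / d                            # start at the simplex center
    for g in grads:
        w = w * np.exp(-eta * np.asarray(g, dtype=float))
        w = w / w.sum()                           # Bregman projection onto the simplex
    return w
```

For example, one step with gradient (1, 0) and step size η = log 2 halves the weight of the first coordinate before renormalizing, yielding (1/3, 2/3).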

36 Let Ψ : B → R be non-negative and q-uniformly convex w. [sent-90, score-0.214]

37 The Mirror Descent bound suggests that as long as we can find an appropriate function Ψ that is uniformly convex w. [sent-115, score-0.299]

38 ‖·‖_{X*} on W with Ψ ≥ 0; if no q-uniformly convex function exists then Ψ̃q = ∞ is assumed by default. [sent-123, score-0.238]

39 4 Martingale Type and Value In [24], it was shown that the concept of the Martingale type (also sometimes called the Haar type) of a Banach space and optimal rates for the online convex optimization problem, where X and W are duals of each other, are closely related. [sent-126, score-0.512]

40 Also, throughout, (xn)n∈N represents a sequence of mappings where each xn : {±1}^{n−1} → B*. [sent-132, score-0.158]

41 We shall commit to the abuse of notation and use xn(ε) to represent xn(ε) = xn(ε1, . . . , εn−1). [sent-133, score-0.361]

42 Further, for any p ∈ [1, 2] we also define Cp := inf{ C | ∀x0 ∈ B*, ∀(xn)n∈N : sup_{n≥1} E‖x0 + Σ_{i=1}^n εi xi(ε)‖_W^p ≤ C^p ( ‖x0‖_X^p + Σ_n E‖xn(ε)‖_X^p ) }. Cp is useful in determining if the pair (W, X) has Martingale type p. [sent-143, score-0.263]

43 [24, 18] For any W ⊂ B, any X ⊂ B*, and any n ≥ 1, sup_x E‖(1/n) Σ_{i=1}^n εi xi(ε)‖_W ≤ Vn(W, X) ≤ 2 sup_x E‖(1/n) Σ_{i=1}^n εi xi(ε)‖_W, where the supremum is over sequences of mappings (xn)n≥1 with each xn : {±1}^{n−1} → X. [sent-146, score-0.349]

44 Our main interest here will be in establishing that low regret implies Martingale type. [sent-147, score-0.166]

45 To do so, we start with the above theorem to relate the value of the online convex optimization game to the rate of convergence of martingales in the Banach space. [sent-148, score-0.653]

46 In [16], it was shown that a Banach space has Martingale type p (the classical notion) if and only if uniformly convex functions with certain properties exist on that space (w. [sent-154, score-0.358]

47 In this section, we extend this result and show how the Martingale type of a pair (W, X) is related to the existence of certain uniformly convex functions. [sent-158, score-0.409]

48 Specifically, the following theorem shows that the notion of Martingale type of the pair (W, X) is equivalent to the existence of a non-negative function that is uniformly convex w. [sent-159, score-0.457]

49 If, for some p ∈ (1, 2], there exists a constant C > 0 such that for all sequences of mappings (xn)n≥1, where each xn : {±1}^{n−1} → B*, and any x0 ∈ B*: sup_{n≥1} E‖x0 + Σ_{i=1}^n εi xi(ε)‖_W^p ≤ C^p ( ‖x0‖_X^p + Σ_n E[‖xn(ε)‖_X^p] ) (i. [sent-164, score-0.265]

50 (W, X) has Martingale type p), then there exists a convex function Ψ : B → R+ with Ψ(0) = 0 that is q-uniformly convex w. [sent-166, score-0.524]

51 The proof of Lemma 7 goes further and gives a specific uniformly convex function Ψ satisfying the desired requirement (i. [sent-175, score-0.285]

52 establishing Dp ≤ Cp) under the assumptions of the previous lemma: Ψ*(x) := sup_n sup_{(xi)} { E‖x + Σ_{i=1}^n εi xi(ε)‖_W^p − Σ_{i=1}^n E‖xi(ε)‖_X^p }, Ψ̃q := (Ψ*)*. [sent-177, score-0.208]

53 Optimality of Mirror Descent In Section 3, we saw that if we can find an appropriate uniformly convex function to use in the mirror descent algorithm, we can guarantee diminishing regret. [sent-179, score-1.206]

54 Then, in Section 5, we saw how the concept of Martingale type relates to the existence of certain uniformly convex functions. [sent-182, score-0.325]

55 We can now combine these results to show that the mirror descent algorithm is a universal online learning algorithm for convex learning problems. [sent-183, score-1.266]

56 Specifically we show that whenever a problem is online learnable, the mirror descent algorithm can guarantee near-optimal rates: Theorem 9. [sent-184, score-1.108]

57 If for some constant V > 0 and some q ∈ [2, ∞) we have Vn(W, X) ≤ V n^{−1/q} for all n, then for any n > e^{q−1}, there exists a regularizer function Ψ and step-size η such that the regret of the mirror descent algorithm using Ψ against any f1, . [sent-185, score-0.974]

58 Combining the Mirror Descent guarantee in Lemma 2, Lemma 7, and the lower bound in Lemma 5 with p = q/(q−1) − 1/log(n), we get the above statement. [sent-189, score-0.349]

59 The above Theorem tells us that, with appropriate Ψ and learning rate η, mirror descent will obtain regret at most a factor of 6002 log(n) from the best possible worst-case upper bound. [sent-190, score-0.988]

60 This basically tells us that the pair (W, X) is online learnable if and only if we can sandwich a q-uniformly convex norm in between ‖·‖_X and a scaled version of ‖·‖_W (for some q < ∞). [sent-196, score-0.585]

61 Also note that by definition of uniform convexity, if any function Ψ is q-uniformly convex w. [sent-197, score-0.239]

62 some norm ‖·‖ and we have that ‖·‖ ≥ c‖·‖_X, then Ψ(·)/c^q is q-uniformly convex w. [sent-200, score-0.322]
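The scaling observation above can be checked in one line from a standard form of the definition of q-uniform convexity (the paper's Definition 1 is truncated in this extract, so this form is an assumption):

```latex
% Suppose $\Psi$ is $q$-uniformly convex w.r.t.\ $\|\cdot\|$:
% \[ \Psi(\lambda w + (1-\lambda)w') \le \lambda\Psi(w) + (1-\lambda)\Psi(w')
%      - \tfrac{\lambda(1-\lambda)}{q}\,\|w-w'\|^{q}. \]
% If in addition $\|\cdot\| \ge c\,\|\cdot\|_{X}$, then
% $\|w-w'\|^{q} \ge c^{q}\,\|w-w'\|_{X}^{q}$, hence
\[
  \Psi(\lambda w + (1-\lambda)w')
    \le \lambda\Psi(w) + (1-\lambda)\Psi(w')
        - \frac{c^{q}\,\lambda(1-\lambda)}{q}\,\|w-w'\|_{X}^{q},
\]
% i.e.\ $\Psi(\cdot)/c^{q}$ is $q$-uniformly convex w.r.t.\ $\|\cdot\|_{X}$.
```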

63 These two observations together suggest that, given the pair (W, X), what we need to do is find a norm ‖·‖ in between ‖·‖_X and C‖·‖_W (C < ∞; the smaller the C, the better the bound) such that ‖·‖ is q-uniformly convex w. [sent-204, score-0.359]

64 7 Examples We demonstrate our results on several online learning problems, specified by W and X . [sent-207, score-0.226]

65 ℓp non-dual pairs It is usual in the literature to consider the case when W is the unit ball of the ℓp norm in some finite dimension d while X is taken to be the unit ball of the dual norm ℓq, where p, q are Hölder conjugate exponents. [sent-208, score-0.643]

66 Using the machinery developed in this paper, it becomes effortless to consider the non-dual case when W is the unit ball Bp1 of some ℓp1 norm while X is the unit ball Bp2 of ℓp2, for arbitrary p1, p2 in [1, ∞]. [sent-209, score-0.436]

67 On the other hand, by Clarkson's inequality we have that for r ∈ (2, ∞), ψr(w) := (2^r / r) ‖w‖_r^r is r-uniformly convex w. [sent-215, score-0.214]

68 Putting it together we see that for any r ∈ (1, ∞), the function ψr defined above is Q-uniformly convex w. [sent-219, score-0.214]

69 Finally we show that using ψ̃r := d^{Q·max{2/q2 − 2/r, 0}} ψr in the Mirror Descent guarantee of Lemma 2 yields the bound that for any f1, . [sent-223, score-0.266]

70 Ball et al. [3] tightly calculate the constants of strong convexity of squared ℓp norms, establishing the tightness of D2 when p1 = p2. [sent-232, score-0.171]

71 These results again follow using arguments similar to the ℓp case, together with tight constants for the strong convexity parameters of the Schatten norm from [3]. [sent-239, score-0.21]

72 Non-dual group norm pairs in finite dimensions In applications such as multitask learning, group norms such as ‖w‖_{q,1} are often used on matrices w ∈ R^{k×d}, where the (q, 1) norm means taking the ℓ1-norm of the ℓq-norms of the columns of w. [sent-240, score-0.241]

73 Here, it may be quite unnatural to use the dual norm (p, ∞) to define the space X where the data lives. [sent-242, score-0.207]

74 Max Norm Max-norm has been proposed as a convex matrix regularizer for applications such as matrix completion [21]. [sent-245, score-0.328]

75 In the online version of the matrix completion problem at each time step one element of the matrix is revealed, corresponding to X being the set of all matrices with a single element being 1 and the rest 0. [sent-246, score-0.34]

76 Since we need X to be convex, we can take the absolute convex hull of this set and use X to be the unit element-wise ℓ1 ball. [sent-247, score-0.559]

77 As noted in [22] the max-norm ball is equivalent, up to a factor two, to the convex hull of all rank one sign matrices. [sent-251, score-0.411]

78 , wK }), the absolute convex hull of K points w1 , . [sent-256, score-0.296]

79 In this case, the Minkowski norm for this W is given by ‖w‖_W := inf_{α1, . . . , αK : w = Σ_{i=1}^K αi wi} Σ_{i=1}^K |αi|. [sent-260, score-0.179]

80 In this case, for any q ∈ (1, 2], if we define the norm ‖w‖_{W,q} := inf_{α1, . . . , αK : w = Σ_{i=1}^K αi wi} ( Σ_{i=1}^K |αi|^q )^{1/q}, [sent-264, score-0.179]

81 then the function Ψ(w) = (1/(2(q−1))) ‖w‖_{W,q}² is 2-uniformly convex w. [sent-267, score-0.214]

82 For the max norm case, the norm is equivalent to the norm obtained by taking the absolute convex hull of the set of all rank one sign matrices. [sent-272, score-0.62]

83 Hence, using the above proposition and noting that X* is the unit ball of |·|∞, we see that Ψ is 2-uniformly convex w. [sent-274, score-0.41]

84 This matches the stochastic (PAC) learning guarantee [22], and is the first guarantee we are aware of for the max-norm matrix completion problem in the online setting. [sent-278, score-0.573]

85 8 Conclusion and Discussion In this paper we showed that for a general class of convex online learning problems, there always exists a distance generating function Ψ such that Mirror Descent using this function achieves a near-optimal regret guarantee. [sent-279, score-0.703]

86 This shows that a fairly simple first-order method, in which each iteration requires a gradient computation and a prox-map computation, is sufficient for online learning in a very general sense. [sent-280, score-0.266]

87 Furthermore, we know that in terms of worst-case behavior, both in the stochastic and in the online setting, for convex cost functions, the value is unchanged when we pass to the convex hull of a non-convex constraint set [18]. [sent-288, score-0.806]

88 The requirement that the data domain X be convex is perhaps more restrictive, since even with a non-convex data domain, the objective is still convex. [sent-289, score-0.269]

89 However, the set X we consider here is not the entire dual ball and in fact is neither convex nor symmetric. [sent-294, score-0.428]

90 This is general enough for considering supervised learning with an arbitrary convex loss in a worst-case setting, as the sub-gradients in this case exactly correspond to the data points, and so restricting F through its sub-gradients corresponds to restricting the data domain. [sent-300, score-0.283]

91 Even for the statistical learning setting, online methods along with online-to-batch conversion are often preferred due to their efficiency, especially in high-dimensional problems. [sent-303, score-0.452]

92 In fact for ℓp spaces in the dual case, using lower bounds on the sample complexity for statistical learning of these problems, one can show that for large dimensional problems, mirror descent is an optimal procedure even for the statistical learning problem. [sent-304, score-0.925]

93 We would like to consider the question of whether Mirror Descent is optimal for stochastic convex optimization (convex statistical learning) setting [9, 19, 23] in general. [sent-305, score-0.251]

94 Establishing such universality would have significant implications, as it would indicate that any learnable (convex) problem is learnable using a one-pass first-order online method (i. [sent-306, score-0.474]

95 Optimal strategies and minimax lower bounds for online convex games. [sent-315, score-0.44]

96 Information-theoretic lower bounds on the oracle complexity of convex optimization. [sent-320, score-0.214]

97 Mirror descent and nonlinear projected subgradient methods for convex optimization. [sent-331, score-0.48]

98 On cesaro’s convergence of the gradient descent method for finding saddle points of convex-concave functions. [sent-385, score-0.306]

99 Martingales in Banach spaces (in connection with type and cotype). [sent-398, score-0.257]

100 Randomized online PCA algorithms with regret bounds that are logarithmic in the dimension, 2007. [sent-443, score-0.35]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mirror', 0.56), ('descent', 0.266), ('martingale', 0.259), ('online', 0.226), ('convex', 0.214), ('banach', 0.185), ('vn', 0.152), ('flin', 0.145), ('regret', 0.124), ('ft', 0.12), ('cp', 0.12), ('ball', 0.115), ('norm', 0.108), ('flip', 0.103), ('xn', 0.102), ('convexity', 0.102), ('learnable', 0.099), ('dual', 0.099), ('vp', 0.092), ('game', 0.092), ('martingales', 0.085), ('sup', 0.083), ('fsup', 0.083), ('hull', 0.082), ('wt', 0.079), ('amd', 0.078), ('fn', 0.076), ('type', 0.072), ('inf', 0.071), ('schatten', 0.067), ('completion', 0.066), ('minkowski', 0.062), ('pisier', 0.062), ('supw', 0.062), ('generating', 0.058), ('guarantee', 0.056), ('lemma', 0.056), ('mappings', 0.056), ('sridharan', 0.055), ('ambuj', 0.055), ('balls', 0.053), ('srebro', 0.052), ('universality', 0.05), ('nathan', 0.049), ('unit', 0.049), ('mdp', 0.048), ('notion', 0.048), ('uniformly', 0.047), ('guidelines', 0.047), ('certainly', 0.043), ('establishing', 0.042), ('supervised', 0.042), ('ttic', 0.041), ('nemirovski', 0.041), ('learner', 0.041), ('scenarios', 0.04), ('gradient', 0.04), ('existence', 0.039), ('appropriate', 0.038), ('dp', 0.037), ('pair', 0.037), ('stochastic', 0.037), ('relate', 0.036), ('corollary', 0.035), ('games', 0.034), ('constraint', 0.033), ('proposition', 0.032), ('distance', 0.031), ('karthik', 0.031), ('centrally', 0.031), ('domain', 0.031), ('argmin', 0.03), ('optimality', 0.03), ('geometry', 0.03), ('sanghavi', 0.03), ('bs', 0.028), ('commit', 0.028), ('rakhlin', 0.028), ('pradeep', 0.028), ('tightness', 0.027), ('get', 0.027), ('loss', 0.027), ('shall', 0.027), ('exponentiated', 0.026), ('functionals', 0.026), ('repeated', 0.026), ('class', 0.026), ('unavoidable', 0.025), ('saw', 0.025), ('subgradients', 0.025), ('uniform', 0.025), ('functions', 0.025), ('norms', 0.025), ('supremum', 0.025), ('adversary', 0.025), ('requirement', 0.024), ('exists', 0.024), ('matrix', 0.024), ('rn', 0.023), ('extending', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 202 nips-2011-On the Universality of Online Mirror Descent

Author: Nati Srebro, Karthik Sridharan, Ambuj Tewari

Abstract: We show that for a general class of convex online learning problems, Mirror Descent can always achieve a (nearly) optimal regret guarantee. 1

2 0.16847658 80 nips-2011-Efficient Online Learning via Randomized Rounding

Author: Nicolò Cesa-bianchi, Ohad Shamir

Abstract: Most online algorithms used in machine learning today are based on variants of mirror descent or follow-the-leader. In this paper, we present an online algorithm based on a completely different approach, which combines “random playout” and randomized rounding of loss subgradients. As an application of our approach, we provide the first computationally efficient online algorithm for collaborative filtering with trace-norm constrained matrices. As a second application, we solve an open question linking batch learning and transductive online learning. 1

3 0.167999 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning

Author: Eric Moulines, Francis R. Bach

Abstract: We consider the minimization of a convex objective function defined on a Hilbert space, which is only available through unbiased estimates of its gradients. This problem includes standard machine learning algorithms such as kernel logistic regression and least-squares regression, and is commonly referred to as a stochastic approximation problem in the operations research community. We provide a non-asymptotic analysis of the convergence of two well-known algorithms, stochastic gradient descent (a.k.a. Robbins-Monro algorithm) as well as a simple modification where iterates are averaged (a.k.a. Polyak-Ruppert averaging). Our analysis suggests that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate in the strongly convex case, is not robust to the lack of strong convexity or the setting of the proportionality constant. This situation is remedied when using slower decays together with averaging, robustly leading to the optimal rate of convergence. We illustrate our theoretical results with simulations on synthetic and standard datasets.

4 0.15224156 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries

Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari

Abstract: Learning theory has largely focused on two main learning scenarios: the classical statistical setting where instances are drawn i.i.d. from a fixed distribution, and the adversarial scenario wherein, at every time step, an adversarially chosen instance is revealed to the player. It can be argued that in the real world neither of these assumptions is reasonable. We define the minimax value of a game where the adversary is restricted in his moves, capturing stochastic and non-stochastic assumptions on data. Building on the sequential symmetrization approach, we define a notion of distribution-dependent Rademacher complexity for the spectrum of problems ranging from i.i.d. to worst-case. The bounds let us immediately deduce variation-type bounds. We study a smoothed online learning scenario and show that exponentially small amount of noise can make function classes with infinite Littlestone dimension learnable. 1

5 0.13966581 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits

Author: Yasin Abbasi-yadkori, Csaba Szepesvári, David Tax

Abstract: We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem. In particular, we show that a simple modification of Auer’s UCB algorithm (Auer, 2002) achieves with high probability constant regret. More importantly, we modify and, consequently, improve the analysis of the algorithm for the for linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), Li et al. (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. In both cases, the improvement stems from the construction of smaller confidence sets. For their construction we use a novel tail inequality for vector-valued martingales. 1

6 0.12120905 108 nips-2011-Greedy Algorithms for Structurally Constrained High Dimensional Problems

7 0.1177827 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods

8 0.11407811 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction

9 0.11135869 182 nips-2011-Nearest Neighbor based Greedy Coordinate Descent

10 0.10624897 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time

11 0.10353959 220 nips-2011-Prediction strategies without loss

12 0.10308708 152 nips-2011-Learning in Hilbert vs. Banach Spaces: A Measure Embedding Viewpoint

13 0.098819748 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits

14 0.093758866 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning

15 0.091913521 145 nips-2011-Learning Eigenvectors for Free

16 0.090288609 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

17 0.089622669 282 nips-2011-The Fast Convergence of Boosting

18 0.084773988 245 nips-2011-Selecting the State-Representation in Reinforcement Learning

19 0.084245801 289 nips-2011-Trace Lasso: a trace norm regularization for correlated designs

20 0.080908783 270 nips-2011-Statistical Performance of Convex Tensor Decomposition


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.205), (1, -0.183), (2, -0.073), (3, -0.14), (4, 0.082), (5, 0.084), (6, 0.019), (7, 0.041), (8, -0.082), (9, 0.044), (10, 0.089), (11, -0.009), (12, -0.053), (13, -0.034), (14, -0.02), (15, 0.027), (16, -0.022), (17, 0.057), (18, -0.038), (19, 0.089), (20, -0.044), (21, -0.065), (22, -0.104), (23, -0.004), (24, -0.007), (25, 0.112), (26, -0.002), (27, 0.048), (28, -0.066), (29, 0.038), (30, 0.079), (31, -0.059), (32, -0.1), (33, 0.029), (34, 0.104), (35, -0.012), (36, 0.032), (37, -0.009), (38, -0.01), (39, 0.019), (40, -0.038), (41, -0.015), (42, 0.04), (43, 0.021), (44, -0.036), (45, 0.018), (46, -0.024), (47, 0.064), (48, -0.08), (49, -0.132)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95960605 202 nips-2011-On the Universality of Online Mirror Descent

Author: Nati Srebro, Karthik Sridharan, Ambuj Tewari

Abstract: We show that for a general class of convex online learning problems, Mirror Descent can always achieve a (nearly) optimal regret guarantee. 1

2 0.68363923 80 nips-2011-Efficient Online Learning via Randomized Rounding

Author: Nicolò Cesa-bianchi, Ohad Shamir

Abstract: Most online algorithms used in machine learning today are based on variants of mirror descent or follow-the-leader. In this paper, we present an online algorithm based on a completely different approach, which combines “random playout” and randomized rounding of loss subgradients. As an application of our approach, we provide the first computationally efficient online algorithm for collaborative filtering with trace-norm constrained matrices. As a second application, we solve an open question linking batch learning and transductive online learning. 1

3 0.67783314 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time

Author: Elad Hazan, Tomer Koren, Nati Srebro

Abstract: We present an optimization approach for linear SVMs based on a stochastic primal-dual approach, where the primal step is akin to an importance-weighted SGD, and the dual step is a stochastic update on the importance weights. This yields an optimization method with a sublinear dependence on the training set size, and the first method for learning linear SVMs with runtime less then the size of the training set required for learning! 1

4 0.67675829 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods

Author: Andrew Cotter, Ohad Shamir, Nati Srebro, Karthik Sridharan

Abstract: Mini-batch algorithms have been proposed as a way to speed-up stochastic convex optimization problems. We study how such algorithms can be improved using accelerated gradient methods. We provide a novel analysis, which shows how standard gradient methods may sometimes be insufficient to obtain a significant speed-up and propose a novel accelerated gradient algorithm, which deals with this deficiency, enjoys a uniformly superior guarantee and works well in practice. 1

5 0.6527819 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning

Author: Eric Moulines, Francis R. Bach

Abstract: We consider the minimization of a convex objective function defined on a Hilbert space, which is only available through unbiased estimates of its gradients. This problem includes standard machine learning algorithms such as kernel logistic regression and least-squares regression, and is commonly referred to as a stochastic approximation problem in the operations research community. We provide a non-asymptotic analysis of the convergence of two well-known algorithms, stochastic gradient descent (a.k.a. Robbins-Monro algorithm) as well as a simple modification where iterates are averaged (a.k.a. Polyak-Ruppert averaging). Our analysis suggests that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate in the strongly convex case, is not robust to the lack of strong convexity or the setting of the proportionality constant. This situation is remedied when using slower decays together with averaging, robustly leading to the optimal rate of convergence. We illustrate our theoretical results with simulations on synthetic and standard datasets.

6 0.60525095 108 nips-2011-Greedy Algorithms for Structurally Constrained High Dimensional Problems

7 0.56486022 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction

8 0.52229774 218 nips-2011-Predicting Dynamic Difficulty

9 0.51383364 271 nips-2011-Statistical Tests for Optimization Efficiency

10 0.51300371 272 nips-2011-Stochastic convex optimization with bandit feedback

11 0.50897264 121 nips-2011-Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

12 0.48296827 152 nips-2011-Learning in Hilbert vs. Banach Spaces: A Measure Embedding Viewpoint

13 0.47765842 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries

14 0.47688642 25 nips-2011-Adaptive Hedge

15 0.4751299 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits

16 0.4663313 72 nips-2011-Distributed Delayed Stochastic Optimization

17 0.46397424 220 nips-2011-Prediction strategies without loss

18 0.45855844 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits

19 0.45108327 103 nips-2011-Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss

20 0.44050601 297 nips-2011-Universal low-rank matrix recovery from Pauli measurements


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.042), (4, 0.038), (14, 0.164), (20, 0.054), (22, 0.017), (26, 0.032), (31, 0.056), (33, 0.019), (43, 0.088), (45, 0.139), (48, 0.019), (57, 0.027), (74, 0.06), (83, 0.068), (99, 0.086)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.86058211 202 nips-2011-On the Universality of Online Mirror Descent

Author: Nati Srebro, Karthik Sridharan, Ambuj Tewari

Abstract: We show that for a general class of convex online learning problems, Mirror Descent can always achieve a (nearly) optimal regret guarantee.
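One classical instantiation of the Online Mirror Descent family this paper studies is exponentiated gradient: Mirror Descent with the entropy distance-generating function on the probability simplex. A minimal sketch on a toy linear-loss stream (loss matrix, step size, and horizon are illustrative, not from the paper):

```python
import numpy as np

# Exponentiated gradient over d "experts": multiplicative mirror step,
# then the Bregman projection onto the simplex (normalization, for the
# entropy mirror map).
rng = np.random.default_rng(1)
d, T = 10, 5000
eta = np.sqrt(np.log(d) / T)             # standard step-size tuning

losses = rng.uniform(size=(T, d))
losses[:, 3] -= 0.2                      # expert 3 is best on average

w = np.full(d, 1.0 / d)                  # uniform start on the simplex
cum_loss = 0.0
for t in range(T):
    g = losses[t]
    cum_loss += w @ g
    w = w * np.exp(-eta * g)             # mirror (multiplicative) update
    w /= w.sum()                         # project back to the simplex

# Regret against the best fixed expert grows like sqrt(T log d), not T.
regret = cum_loss - losses.sum(axis=0).min()
print(regret)
```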

2 0.85831708 61 nips-2011-Contextual Gaussian Process Bandit Optimization

Author: Andreas Krause, Cheng S. Ong

Abstract: How should we design experiments to maximize performance of a complex system, taking into account uncontrollable environmental conditions? How should we select relevant documents (ads) to display, given information about the user? These tasks can be formalized as contextual bandit problems, where at each round, we receive context (about the experimental conditions, the query), and have to choose an action (parameters, documents). The key challenge is to trade off exploration by gathering data for estimating the mean payoff function over the context-action space, and to exploit by choosing an action deemed optimal based on the gathered data. We model the payoff function as a sample from a Gaussian process defined over the joint context-action space, and develop CGP-UCB, an intuitive upper-confidence style algorithm. We show that by mixing and matching kernels for contexts and actions, CGP-UCB can handle a variety of practical applications. We further provide generic tools for deriving regret bounds when using such composite kernel functions. Lastly, we evaluate our algorithm on two case studies, in the context of automated vaccine design and sensor management. We show that context-sensitive optimization outperforms no or naive use of context.
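The upper-confidence rule at the heart of (C)GP-UCB can be sketched on a one-dimensional action space without context. The kernel, noise level, payoff function, and confidence parameter below are all illustrative assumptions, not the paper's configuration:

```python
import numpy as np

# GP posterior with an RBF kernel, plus a UCB acquisition: at each round,
# query the point maximizing posterior mean + width * posterior std.
def rbf(a, b, ls=0.2):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

def gp_posterior(x_tr, y_tr, x_te, noise=1e-3):
    K = rbf(x_tr, x_tr) + noise * np.eye(len(x_tr))
    Ks = rbf(x_tr, x_te)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_tr))
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = 1.0 - np.sum(v * v, axis=0)    # rbf(x, x) = 1 on the diagonal
    return mu, np.sqrt(np.maximum(var, 0.0))

rng = np.random.default_rng(2)
f = lambda x: np.sin(3 * x) + 0.5 * x    # unknown payoff, observed with noise
grid = np.linspace(0.0, 2.0, 200)

xs, ys = [1.0], [f(1.0) + 0.01 * rng.normal()]
for _ in range(24):
    mu, sd = gp_posterior(np.array(xs), np.array(ys), grid)
    x_next = grid[np.argmax(mu + np.sqrt(2.0) * sd)]   # UCB rule
    xs.append(x_next)
    ys.append(f(x_next) + 0.01 * rng.normal())

# After a few rounds the queried points include near-optimal actions.
x_best = grid[np.argmax(f(grid))]
print(x_best, max(f(np.array(xs))))
```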

3 0.79568338 168 nips-2011-Maximum Margin Multi-Instance Learning

Author: Hua Wang, Heng Huang, Farhad Kamangar, Feiping Nie, Chris H. Ding

Abstract: Multi-instance learning (MIL) considers input as bags of instances, in which labels are assigned to the bags. MIL is useful in many real-world applications. For example, in image categorization semantic meanings (labels) of an image mostly arise from its regions (instances) instead of the entire image (bag). Existing MIL methods typically build their models using the Bag-to-Bag (B2B) distance, which are often computationally expensive and may not truly reflect the semantic similarities. To tackle this, in this paper we approach MIL problems from a new perspective using the Class-to-Bag (C2B) distance, which directly assesses the relationships between the classes and the bags. Taking into account the two major challenges in MIL, high heterogeneity on data and weak label association, we propose a novel Maximum Margin Multi-Instance Learning (M3I) approach to parameterize the C2B distance by introducing the class specific distance metrics and the locally adaptive significance coefficients. We apply our new approach to the automatic image categorization tasks on three (one single-label and two multilabel) benchmark data sets. Extensive experiments have demonstrated promising results that validate the proposed method.
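A bare-bones sketch of the Class-to-Bag idea: score a bag directly against a class via distances from class representatives to the bag's nearest instances, rather than comparing bags pairwise (B2B). Prototypes, dimensions, and data here are hypothetical; the paper's M3I method additionally learns class-specific metrics and significance coefficients, which this sketch omits:

```python
import numpy as np

def c2b_distance(class_protos, bag):
    # For each class prototype, distance to its nearest instance in the
    # bag, averaged over the class's prototypes.
    d = np.linalg.norm(class_protos[:, None, :] - bag[None, :, :], axis=2)
    return d.min(axis=1).mean()

rng = np.random.default_rng(5)
# Two toy classes with prototype regions around 0 and 3 in the plane.
protos = {0: np.zeros((2, 2)), 1: np.full((2, 2), 3.0)}
# A bag mixing label-relevant regions with background clutter, as in
# region-based image categorization.
bag = np.vstack([
    rng.normal(0.0, 0.3, size=(3, 2)),   # regions carrying the class-0 label
    rng.normal(1.5, 0.3, size=(4, 2)),   # background clutter
])

scores = {c: c2b_distance(p, bag) for c, p in protos.items()}
pred = min(scores, key=scores.get)       # predict the closest class
print(scores, pred)
```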

4 0.78995872 186 nips-2011-Noise Thresholds for Spectral Clustering

Author: Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, Aarti Singh

Abstract: Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating our results.
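The perfect-recovery regime the abstract describes can be sketched with spectral bi-partitioning of a noisy two-block similarity matrix. Block sizes, similarity levels, and the noise scale are illustrative choices:

```python
import numpy as np

# Two planted clusters: within-cluster similarity 0.9, across 0.1, plus a
# small symmetric Gaussian perturbation.
rng = np.random.default_rng(3)
n = 40
labels = np.array([0] * 20 + [1] * 20)
W = np.where(labels[:, None] == labels[None, :], 0.9, 0.1)
noise = 0.05 * rng.normal(size=(n, n))
W = W + (noise + noise.T) / 2            # keep the matrix symmetric
np.fill_diagonal(W, 1.0)

# Unnormalized graph Laplacian L = D - W; cluster by the sign pattern of
# the eigenvector of the second-smallest eigenvalue (the Fiedler vector).
L = np.diag(W.sum(axis=1)) - W
eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]
pred = (fiedler > 0).astype(int)

# Fraction of points recovered, up to a global label swap.
agree = max(np.mean(pred == labels), np.mean(pred != labels))
print(agree)
```

On this low-noise instance the sign pattern recovers both clusters exactly; the paper's contribution is characterizing how large the noise may grow before such recovery fails.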

5 0.77342814 29 nips-2011-Algorithms and hardness results for parallel large margin learning

Author: Phil Long, Rocco Servedio

Abstract: We study the fundamental problem of learning an unknown large-margin halfspace in the context of parallel computation. Our main positive result is a parallel algorithm for learning a large-margin halfspace that is based on interior point methods from convex optimization and fast parallel algorithms for matrix computations. We show that this algorithm learns an unknown γ-margin halfspace over n dimensions using poly(n, 1/γ) processors and runs in time Õ(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have Ω(1/γ²) runtime dependence on γ. Our main negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We give an information-theoretic proof that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized: the ability to call the weak learner multiple times in parallel within a single boosting stage does not reduce the overall number of successive stages of boosting that are required.
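The Ω(1/γ²) sequential behavior the abstract contrasts against is exactly the classical perceptron's mistake bound on γ-margin data (Novikoff). A sketch on synthetic unit-norm examples; the data construction is an illustrative stand-in, not the paper's:

```python
import numpy as np

# Hidden unit-norm halfspace u; keep only examples with margin >= gamma.
rng = np.random.default_rng(4)
n, m, gamma = 20, 500, 0.1
u = rng.normal(size=n)
u /= np.linalg.norm(u)

X = rng.normal(size=(4 * m, n))
X /= np.linalg.norm(X, axis=1, keepdims=True)
X = X[np.abs(X @ u) >= gamma][:m]        # gamma-margin, unit-norm examples
y = np.sign(X @ u)

w = np.zeros(n)
mistakes = 0
for _ in range(150):                     # passes until a clean sweep
    errs = 0
    for xi, yi in zip(X, y):
        if yi * (w @ xi) <= 0:
            w += yi * xi                 # perceptron update on a mistake
            mistakes += 1
            errs += 1
    if errs == 0:
        break

# Novikoff's bound: at most 1/gamma^2 mistakes in total.
print(mistakes, int(1 / gamma ** 2))
```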

6 0.77303839 198 nips-2011-On U-processes and clustering performance

7 0.7717787 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions

8 0.77164364 80 nips-2011-Efficient Online Learning via Randomized Rounding

9 0.77130175 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

10 0.76982051 263 nips-2011-Sparse Manifold Clustering and Embedding

11 0.76877618 283 nips-2011-The Fixed Points of Off-Policy TD

12 0.7672478 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries

13 0.76434344 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction

14 0.76351082 271 nips-2011-Statistical Tests for Optimization Efficiency

15 0.75962025 236 nips-2011-Regularized Laplacian Estimation and Fast Eigenvector Approximation

16 0.75932562 265 nips-2011-Sparse recovery by thresholded non-negative least squares

17 0.75859207 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning

18 0.75847054 25 nips-2011-Adaptive Hedge

19 0.75737625 109 nips-2011-Greedy Model Averaging

20 0.75696003 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data