nips nips2011 nips2011-187 knowledge-graph by maker-knowledge-mining

187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning


Source: pdf

Author: Eric Moulines, Francis R. Bach

Abstract: We consider the minimization of a convex objective function defined on a Hilbert space, which is only available through unbiased estimates of its gradients. This problem includes standard machine learning algorithms such as kernel logistic regression and least-squares regression, and is commonly referred to as a stochastic approximation problem in the operations research community. We provide a non-asymptotic analysis of the convergence of two well-known algorithms, stochastic gradient descent (a.k.a. Robbins-Monro algorithm) as well as a simple modification where iterates are averaged (a.k.a. Polyak-Ruppert averaging). Our analysis suggests that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate in the strongly convex case, is not robust to the lack of strong convexity or the setting of the proportionality constant. This situation is remedied when using slower decays together with averaging, robustly leading to the optimal rate of convergence. We illustrate our theoretical results with simulations on synthetic and standard datasets.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We consider the minimization of a convex objective function defined on a Hilbert space, which is only available through unbiased estimates of its gradients. [sent-5, score-0.128]

2 This problem includes standard machine learning algorithms such as kernel logistic regression and least-squares regression, and is commonly referred to as a stochastic approximation problem in the operations research community. [sent-6, score-0.354]

3 We provide a non-asymptotic analysis of the convergence of two well-known algorithms: stochastic gradient descent (a.k.a. the Robbins-Monro algorithm) and a simple modification where iterates are averaged (a.k.a. Polyak-Ruppert averaging). [sent-7, score-0.386]

4 Our analysis suggests that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate in the strongly convex case, is not robust to the lack of strong convexity or the setting of the proportionality constant. [sent-14, score-0.827]

5 This situation is remedied when using slower decays together with averaging, robustly leading to the optimal rate of convergence. [sent-15, score-0.264]

6 1 Introduction The minimization of an objective function which is only available through unbiased estimates of the function values or its gradients is a key methodological problem in many disciplines. [sent-17, score-0.118]

7 Its analysis has been attacked mainly in three communities: stochastic approximation [1, 2, 3, 4, 5, 6], optimization [7, 8], and machine learning [9, 10, 11, 12, 13, 14, 15]. [sent-18, score-0.214]

8 The main algorithms which have emerged are stochastic gradient descent (a.k.a. the Robbins-Monro algorithm) and its averaged variant (a.k.a. Polyak-Ruppert averaging). [sent-19, score-0.322]

9 On the other hand, using slower decays together with averaging robustly leads to optimal convergence behavior (both in terms of rates and constants) [4, 5]. [sent-27, score-0.462]

10 The analysis from the convex optimization and machine learning literatures however has focused on differences between strongly convex and non-strongly convex objectives, with learning rates and roles of averaging being different in these two cases [11, 12, 13, 14, 15]. [sent-28, score-0.719]

11 A key desirable behavior of an optimization method is to be adaptive to the hardness of the problem, and thus one would like a single algorithm to work in all situations, favorable ones such as strongly convex functions and unfavorable ones such as non-strongly convex functions. [sent-29, score-0.33]

12 − We provide a non-asymptotic analysis of Polyak-Ruppert averaging [4, 5], with and without strong convexity (Sections 3. [sent-32, score-0.434]

13 In particular, we show that slower decays of the learning rate, together with averaging, are crucial to robustly obtain fast convergence rates. [sent-35, score-0.204]

14 2 Problem set-up. We consider a sequence of convex differentiable random functions (f_n)_{n≥1} from H to R. [sent-44, score-0.183]

15 We consider the following recursion, starting from θ_0 ∈ H: for all n ≥ 1, θ_n = θ_{n−1} − γ_n f_n'(θ_{n−1}) (1), where (γ_n)_{n≥1} is a deterministic sequence of positive scalars, which we refer to as the learning rate sequence. [sent-45, score-0.747]
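
A minimal sketch of recursion (1) in Python, assuming a user-supplied oracle grad(theta, n) that returns the unbiased gradient estimate f_n'(θ); the function name, the oracle interface, and the schedule γ_n = C·n^(−α) are illustrative choices, not code from the paper.

import numpy as np

def sgd(grad, theta0, n_iter, C=1.0, alpha=0.5):
    # Recursion (1): theta_n = theta_{n-1} - gamma_n * f_n'(theta_{n-1}),
    # with the deterministic learning-rate sequence gamma_n = C * n**(-alpha).
    theta = np.asarray(theta0, dtype=float).copy()
    iterates = []
    for n in range(1, n_iter + 1):
        gamma_n = C * n ** (-alpha)
        theta = theta - gamma_n * np.asarray(grad(theta, n), dtype=float)
        iterates.append(theta.copy())
    return iterates

Returning the whole trajectory is a convenience for the averaging discussed later; only the last iterate is needed for plain stochastic gradient descent.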

16 The function f_n is assumed to be differentiable (see, e.g., [16] for definitions and properties of differentiability for functions defined on Hilbert spaces). [sent-46, score-0.719]

17 Its gradient is an unbiased estimate of the gradient of a certain function f we wish to minimize: (H1) Let (F_n)_{n≥0} be an increasing family of σ-fields. [sent-48, score-0.207]

18 θ_0 is F_0-measurable, and for each θ ∈ H, the random variable f_n'(θ) is square-integrable, F_n-measurable, and for all θ ∈ H and n ≥ 1, E(f_n'(θ) | F_{n−1}) = f'(θ), with probability one. [sent-49, score-0.637]

19 Given only the noisy gradients f_n'(θ_{n−1}), the goal of stochastic approximation is to minimize the function f with respect to θ. [sent-58, score-0.921]

20 Stochastic approximation: in the so-called Robbins-Monro setting, for all θ ∈ H and n ≥ 1, f_n(θ) may be expressed as f_n(θ) = f(θ) + ⟨ε_n, θ⟩, where (ε_n)_{n≥1} is a square-integrable martingale difference sequence. [sent-61, score-1.274]

21 Learning from i.i.d. observations: for all θ ∈ H and n ≥ 1, f_n(θ) = ℓ(θ, z_n), where (z_n) is an i.i.d. sequence of observations and ℓ is a loss function. [sent-67, score-0.677]

22 Classical examples are least-squares or logistic regression (linear, or non-linear through kernel methods [18, 19]), where f_n(θ) = (1/2)(⟨x_n, θ⟩ − y_n)², or f_n(θ) = log[1 + exp(−y_n ⟨x_n, θ⟩)], for x_n ∈ H and y_n ∈ R (or y_n ∈ {−1, 1} for logistic regression). [sent-72, score-1.712]
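
A hedged sketch of the corresponding stochastic gradients for a single observation (x_n, y_n); the helper names are illustrative, not from the paper.

import numpy as np

def least_squares_grad(theta, x_n, y_n):
    # Gradient of f_n(theta) = 0.5 * (<x_n, theta> - y_n)^2.
    return (x_n @ theta - y_n) * x_n

def logistic_grad(theta, x_n, y_n):
    # Gradient of f_n(theta) = log(1 + exp(-y_n * <x_n, theta>)), with y_n in {-1, +1}.
    margin = y_n * (x_n @ theta)
    return -y_n * x_n / (1.0 + np.exp(margin))

Either function can be wrapped as grad(theta, n) over a data stream and fed to the SGD sketch above.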

23 Throughout this paper, unless otherwise stated, we assume that each function f_n is convex and smooth, following the traditional definition of smoothness from the optimization literature, i.e., Lipschitz-continuity of the gradients. [sent-73, score-0.78]

24 However, we make two slightly different assumptions: (H2), in which θ ↦ f_n'(θ) is Lipschitz-continuous in quadratic mean (conditionally on F_{n−1}), and a strengthening of this assumption, (H2'), in which θ ↦ f_n'(θ) is almost surely Lipschitz-continuous. [sent-78, score-0.763]

25 (H2) For each n ≥ 1, the function f_n is almost surely convex and differentiable, and: for all θ_1, θ_2 ∈ H, E(‖f_n'(θ_1) − f_n'(θ_2)‖² | F_{n−1}) ≤ L² ‖θ_1 − θ_2‖², with probability one. (3) [sent-79, score-2.0]

26 (H2') For each n ≥ 1, the function f_n is almost surely convex and differentiable with Lipschitz-continuous gradient f_n', with constant L, that is: for all θ_1, θ_2 ∈ H, ‖f_n'(θ_1) − f_n'(θ_2)‖ ≤ L ‖θ_1 − θ_2‖, with probability one. (4) [sent-82, score-2.809]

27 If f_n is twice differentiable, this corresponds to having the operator norm of the Hessian operator of f_n bounded by L. [sent-85, score-1.421]

28 For least-squares or logistic regression, if we assume that (E‖x_n‖⁴)^{1/4} ≤ R for all n ∈ N, then we may take L = R² (or even L = R²/4 for logistic regression) for assumption (H2), while for assumption (H2'), we need an almost sure bound ‖x_n‖ ≤ R. [sent-86, score-0.414]

29 In the context of machine learning (least-squares or logistic regression), assumption (H3) is satisfied as soon as µ‖θ‖² is used as an additional regularizer. [sent-90, score-0.139]

30 For non-strongly convex losses such as the logistic loss, f can never be strongly convex unless we restrict the domain of θ (which we do in Section 3.2). [sent-97, score-0.376]

31 As an alternative to restricting the domain, replacing the logistic loss u ↦ log(1 + e^{−u}) by u ↦ log(1 + e^{−u}) + ε u²/2, for some small ε > 0, makes it strongly convex in low-dimensional settings. [sent-99, score-0.275]
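
For concreteness, a sketch of this modified scalar loss; the value of ε is a free choice here and the function name is illustrative.

import numpy as np

def regularized_logistic_loss(u, eps=1e-3):
    # u -> log(1 + exp(-u)) + eps * u**2 / 2.
    # Its second derivative in u is sigma(u) * (1 - sigma(u)) + eps >= eps,
    # where sigma is the sigmoid, so the scalar loss is eps-strongly convex in u.
    return np.logaddexp(0.0, -u) + 0.5 * eps * u ** 2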

32 By strong convexity of f , if we assume (H3), then f attains its global minimum at a unique vector θ∗ ∈ H such that f ′ (θ∗ ) = 0. [sent-100, score-0.239]

33 Moreover, we make the following assumption (in the context of stochastic approximation, it corresponds to E(‖ε_n‖² | F_{n−1}) ≤ σ²): (H4) There exists σ² ∈ R_+ such that for all n ≥ 1, E(‖f_n'(θ*)‖² | F_{n−1}) ≤ σ², with probability one. [sent-101, score-0.808]

34 3.1 Stochastic gradient descent. Before stating our first theorem (see proof in [23]), we introduce the following family of functions ϕ_β : R_+ \ {0} → R given by: ϕ_β(t) = (t^β − 1)/β if β ≠ 0, and ϕ_β(t) = log t if β = 0. [sent-105, score-0.263]
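
A direct transcription of this family as a small helper, useful when evaluating the bounds numerically; a sketch only.

import math

def phi(beta, t):
    # phi_beta(t) = (t**beta - 1) / beta for beta != 0, and log(t) for beta == 0.
    if beta == 0:
        return math.log(t)
    return (t ** beta - 1.0) / beta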

35 Theorem 1 (Stochastic gradient descent, strong convexity): Assume (H1, H2, H3, H4). [sent-108, score-0.158]

36 Using this deterministic recursion, we then derive bounds using classical techniques from stochastic approximation [2], but in a non-asymptotic way, by deriving explicit upper-bounds. [sent-116, score-0.272]

37 Previous results from the stochastic approximation literature have focused mainly on almost sure convergence of the sequence of iterates. [sent-120, score-0.341]

38 (6) is an equality for quadratic functions fn , the result in Eq. [sent-129, score-0.674]

39 For α < 1, the asymptotic term in the bound is 4Cσ . [sent-135, score-0.132]

40 Therefore, for α = 1, the choice of C is critical, as already noticed by [8]: too small a C leads to convergence at an arbitrarily slow rate of the form n^{−µC/2}, while too large a C leads to explosion due to the initial condition. [sent-138, score-0.204]

41 Note that for α < 1, this catastrophic term is in front of a sub-exponentially decaying factor, so its effect is mitigated once the term in n^{1−α} takes over ϕ_{1−2α}(n) and the transient term stops increasing. [sent-144, score-0.172]

42 Note finally that the asymptotic convergence rate in O(n^{−1}) matches the optimal asymptotic minimax rate for stochastic approximation [24, 25]. [sent-147, score-0.602]

43 3.2 Bounded gradients. In some cases, such as logistic regression, we also have a uniform upper bound on the gradients, i.e., assumption (H5) below. [sent-150, score-0.18]

44 (H5) For each n ≥ 1, almost surely, the function f_n is convex, differentiable, and has gradients uniformly bounded by B on the ball of center 0 and radius D, i.e., ‖f_n'(θ)‖ ≤ B whenever ‖θ‖ ≤ D. [sent-153, score-0.885]

45 Note that no function may be strongly convex and Lipschitz-continuous (i.e., with uniformly bounded gradients) over the whole space H. [sent-156, score-0.186]

46 The next theorem shows that a slight modification of the recursion in Eq. (1) yields simpler bounds. [sent-160, score-0.16]

47 We get simpler bounds than the ones obtained in Theorem 1, recovering a result which already appeared in a simplified form in [8] (see proof in [23]): Theorem 2 (Stochastic gradient descent, strong convexity, bounded gradients). Assume (H1, H3, H5). [sent-161, score-0.279]

48 Denote δ_n = E‖θ_n − θ*‖², where θ_n ∈ H is the n-th iterate of the following recursion: for all n ≥ 1, θ_n = Π_D[θ_{n−1} − γ_n f_n'(θ_{n−1})], (7) where Π_D is the orthogonal projection operator onto the ball {θ : ‖θ‖ ≤ D}. [sent-162, score-0.688]
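
A minimal sketch of the projected recursion (7), reusing the illustrative gradient oracle from the earlier sketch.

import numpy as np

def project_ball(theta, D):
    # Orthogonal projection of theta onto the ball {theta : ||theta|| <= D}.
    norm = np.linalg.norm(theta)
    return theta if norm <= D else theta * (D / norm)

def projected_sgd(grad, theta0, n_iter, D, C=1.0, alpha=0.5):
    # Recursion (7): theta_n = Pi_D[theta_{n-1} - gamma_n * f_n'(theta_{n-1})].
    theta = np.asarray(theta0, dtype=float).copy()
    for n in range(1, n_iter + 1):
        gamma_n = C * n ** (-alpha)
        theta = project_ball(theta - gamma_n * np.asarray(grad(theta, n), dtype=float), D)
    return theta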

49 The proof follows the same lines as for Theorem 1, but with the deterministic recursion δ_n ≤ (1 − 2µγ_n) δ_{n−1} + B² γ_n². [sent-165, score-0.131]

50 That is, for all θ_1, θ_2 ∈ H and for all n ≥ 1, ‖f_n''(θ_1) − f_n''(θ_2)‖ ≤ M ‖θ_1 − θ_2‖, where ‖·‖ is the operator norm. [sent-172, score-1.325]

51 (H7) There exists τ ∈ R_+ such that for each n ≥ 1, E(‖f_n'(θ*)‖⁴ | F_{n−1}) ≤ τ⁴ almost surely. [sent-175, score-0.667]

52 Moreover, there exists a nonnegative self-adjoint operator Σ such that for all n, E(f_n'(θ*) ⊗ f_n'(θ*) | F_{n−1}) ≼ Σ almost surely. [sent-176, score-0.688]

53 The operator Σ (which always exists as soon as τ is finite) is here to characterize precisely the variance term, which will be independent of the learning rate sequence (γ_n), as we now show: Theorem 3 (Averaging, strong convexity): Assume (H1, H2', H3, H4, H6, H7). [sent-177, score-0.227]

54 The idea is to start from the recursion in Eq. (1), write it as f_n'(θ_{n−1}) = (1/γ_n)(θ_{n−1} − θ_n), and notice that (a) f_n'(θ_{n−1}) ≈ f_n'(θ*) + f''(θ*)(θ_{n−1} − θ*), and (b) f_n'(θ*) has zero mean and behaves like an i.i.d. sequence. [sent-181, score-2.568]

55 There is no sub-exponential forgetting of initial conditions, but rather a decay at rate O(n^{−2}) (last two lines of the bound). [sent-188, score-0.124]

56 This is a known problem which may slow down the convergence, a common practice being to start averaging after a certain number of iterations [2]. [sent-190, score-0.283]
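
A sketch of Polyak-Ruppert averaging over the SGD trajectory, with an optional burn-in index after which averaging starts, in line with the practice mentioned above; the helper name and the n/10 choice in the usage line are illustrative.

import numpy as np

def polyak_ruppert_average(iterates, burn_in=0):
    # Averaged iterate: mean of the iterates after the burn-in.
    # Discarding an initial segment mitigates the slow forgetting of the
    # initial condition discussed above.
    tail = np.asarray(iterates[burn_in:], dtype=float)
    return tail.mean(axis=0)

# Example usage with the sgd sketch above:
# theta_bar = polyak_ruppert_average(sgd(grad, theta0, n_iter), burn_in=n_iter // 10)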

57 Moreover, the constant A may be large when LC is large; thus, the catastrophic terms are more problematic than for stochastic gradient descent, because they do not appear in front of sub-exponentially decaying terms (see [23]). [sent-191, score-0.332]

58 Thus, averaging allows us to go from the slow rate O(n^{−α}) to the optimal rate O(n^{−1}). [sent-195, score-0.44]

59 When M = 0 (quadratic functions), the leading term has rate O(n^{−1}) for all α ∈ (0, 1) (with an additional contribution of the first term in the second line). [sent-197, score-0.176]

60 We get a simpler bound by directly averaging the bound in Theorem 1, which leads to an unchanged rate of n^{−1}, i.e., averaging is not key for α = 1. [sent-199, score-0.417]

61 That is, averaging is not key for α = 1, and does not solve the robustness problem related to the choice of C or the lack of strong convexity. [sent-201, score-0.353]

62 Moreover, as noticed in the stochastic approximation literature [4], in the context of learning from i.i.d. observations … [sent-204, score-0.217]

63 4 Non-strongly convex objectives. In this section, we do not assume that the function f is strongly convex, but we replace (H3) by: (H8) The function f attains its global minimum at a certain θ* ∈ H (which may not be unique). [sent-216, score-0.243]

64 Not assuming strong convexity is essential in practice to make sure that algorithms are robust and adaptive to the hardness of the learning or optimization problem (much like gradient descent is). [sent-219, score-0.496]

65 For α ∈ (1/2, 2/3), we get convergence at rate O(n^{−α/2}), while for α ∈ (2/3, 1), we get a convergence rate of O(n^{α−1}). [sent-224, score-0.336]

66 The rate of convergence changes at α = 2/3, where we get our best rate O(n^{−1/3}), which does not match the minimax rate of O(n^{−1/2}) for stochastic approximation in the non-strongly convex case [25]. [sent-226, score-0.671]

67 These rates for stochastic gradient descent without strong convexity assumptions are new and we conjecture that they are asymptotically minimax optimal (for stochastic gradient descent, not for stochastic approximation). [sent-227, score-1.062]

68 If we further assume that all gradients are bounded by B (that is, we take D = ∞ in (H5)), then we have the following theorem, which allows α ∈ (1/3, 1/2) with rate O(n^{−3α/2+1/2}): Theorem 5 (Stochastic gradient descent, no strong convexity, bounded gradients). Assume (H1, H2', H5, H8). [sent-229, score-0.42]

69 If α = 1/2, then we only have convergence under LC < 1/4 (as in Theorem 4), with a potentially slow rate, while for α > 1/2, we have a rate of O(n^{−α}), with otherwise similar behavior to the strongly convex case with no bounded gradients. [sent-234, score-0.437]

70 Here, averaging has allowed the rate to go from O(max{n^{α−1}, n^{−α/2}}) to O(n^{−α}). [sent-235, score-0.296]

71 For least-squares regression with kernels, where f_n(θ) = (1/2)(y_n − ⟨θ, Φ(x_n)⟩)², with Φ(x_n) being the feature map associated with a reproducing kernel Hilbert space H with a universal kernel [28], we need that x ↦ E(Y | X = x) is a function within the RKHS. [sent-236, score-0.734]

72 [Figure 2: Robustness to wrong constants for γ_n = C n^{−α}. Panels for α = 1/2 and α = 1 plot log[f(θ_n) − f*] versus log(n) for SGD and averaging (ave) with C ∈ {1/5, 1, 5}.] [sent-241, score-4.528]

73 With the bounded gradient assumption (and in fact without smoothness), we obtain the minimax asymptotic rate O(n^{−1/2}) up to logarithmic terms [25] for α = 1/2; for α < 1/2 we get the rate O(n^{−α}), while for α > 1/2 we get O(n^{α−1}). [sent-247, score-0.458]

74 Here, averaging has also allowed us to extend the range of α which ensures convergence to α ∈ (0, 1). [sent-248, score-0.215]

75 For all q > 1, we have a convex function with Lipschitz-continuous gradient with constant L = q(q−1). [sent-251, score-0.191]

76 It is strongly convex around the origin for q ∈ (1, 2], but its second derivative vanishes for q > 2. [sent-252, score-0.186]

77 In Figure 1, we plot in log-log scale the average of f(θ_n) − f(θ*) over 100 replications of the stochastic approximation problem (with i.i.d. noise). [sent-253, score-0.25]

78 For q = 2 (left plot), where we locally have a strongly convex case, all learning rates lead to good estimation with decay proportional to α (as shown in Theorem 1), while for the averaging case, all reach the exact same convergence rate (as shown in Theorem 3). [sent-257, score-0.619]

79 However, for q = 4, where strong convexity does not hold (right plot), α = 1 is still the fastest without averaging but becomes the slowest after averaging; on the contrary, illustrating Section 4, slower decays (such as α = 1/2) lead to faster convergence when averaging is used. [sent-258, score-0.595]

80 Note that for α = 1/2 (left plot), the 3 curves for stochastic gradient descent end up being aligned and equally spaced, corroborating a rate proportional to C (see Theorem 1). [sent-266, score-0.433]

81 Moreover, when averaging for α = 1/2, the error ends up … [sent-267, score-0.433]

84 [Figure 3: Comparison on non-strongly convex logistic regression problems, with the learning rate selected after n/10 iterations. Both panels plot log[f(θ_n) − f*] versus log(n) for SGD and averaging (ave) with α ∈ {1/3, 1/2, 2/3, 1}.] [sent-272, score-0.322]

85 For α = 1 (right plot), if C is too small, convergence is very slow (and not at the rate n^{−1}), as already observed (see, e.g., [8]). [sent-278, score-0.185]

86 6 Conclusion. In this paper, we have provided a non-asymptotic analysis of stochastic gradient descent, as well as its averaged version, for various learning rate sequences of the form γ_n = Cn^{−α} (see the summary of results in Table 1). [sent-291, score-0.229]

87 Following earlier work from the optimization, machine learning and stochastic approximation literatures, our analysis highlights that α = 1 is not robust to the choice of C and to the actual difficulty of the problem (strongly convex or not). [sent-292, score-0.32]

88 However, when using averaging with α ∈ (1/2, 1), we get close-to-optimal rates of convergence in both the strongly convex and the non-strongly convex situations. [sent-293, score-0.545]

89 Moreover, we highlight the fact that problems with bounded gradients have better behaviors, i.e., logistic regression is easier to optimize than least-squares regression. [sent-294, score-0.136]

90 That is, logistic regression is easier to optimize than least-squares regression. [sent-296, score-0.136]

91 Second, we have focused on differentiable objectives, but the extension to objective functions with a differentiable stochastic part and a non-differentiable deterministic part (in the line of [14]) would allow an extension to sparse methods. [sent-298, score-0.361]

92 Table 1: Summary of results for stochastic gradient descent (SGD) and Polyak-Ruppert averaging (Aver.). [sent-305, score-0.322]

93 We provide their rates of convergence of the form n^{−β} corresponding to learning rate sequences γ_n = Cn^{−α}, where β is shown as a function of α. [sent-306, score-0.188]

94 For each method, we list the main assumptions (µ: strong convexity, L: bounded Hessian, B: bounded gradients). [sent-307, score-0.183]

95 General bounds and finite-time improvement for stochastic approximation algorithms. [sent-314, score-0.224]

96 An estimate for the rate of convergence of recurrent robust identification algorithms. [sent-328, score-0.171]

97 Dual averaging methods for regularized stochastic learning and online optimization. [sent-393, score-0.363]

98 Non-asymptotic analysis of stochastic approximation algorithms for machine learning. [sent-444, score-0.193]

99 Information-theoretic lower bounds on the oracle complexity of convex optimization, 2010. [sent-460, score-0.132]

100 Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. [sent-467, score-0.148]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('fn', 0.637), ('ave', 0.402), ('sgd', 0.343), ('averaging', 0.215), ('convexity', 0.151), ('stochastic', 0.148), ('recursion', 0.102), ('convex', 0.101), ('gradients', 0.091), ('gradient', 0.09), ('logistic', 0.089), ('cn', 0.087), ('strongly', 0.085), ('descent', 0.084), ('differentiable', 0.082), ('rate', 0.081), ('strong', 0.068), ('asymptotic', 0.068), ('convergence', 0.064), ('surely', 0.059), ('theorem', 0.058), ('decays', 0.058), ('operator', 0.051), ('hilbert', 0.051), ('proportionality', 0.048), ('regression', 0.047), ('minimax', 0.047), ('bounded', 0.045), ('moreover', 0.045), ('catastrophic', 0.045), ('approximation', 0.045), ('xn', 0.044), ('leading', 0.043), ('forgetting', 0.043), ('robustly', 0.043), ('rates', 0.043), ('lc', 0.041), ('slow', 0.04), ('robustness', 0.04), ('duxbury', 0.04), ('slower', 0.039), ('bound', 0.038), ('objectives', 0.037), ('quadratic', 0.037), ('explosion', 0.035), ('sure', 0.034), ('constants', 0.033), ('literatures', 0.032), ('hessian', 0.032), ('log', 0.031), ('bounds', 0.031), ('replications', 0.03), ('sierra', 0.03), ('proportional', 0.03), ('lack', 0.03), ('almost', 0.03), ('deterministic', 0.029), ('alpha', 0.029), ('yn', 0.028), ('iterations', 0.028), ('unbiased', 0.027), ('francis', 0.027), ('decaying', 0.027), ('soon', 0.027), ('plot', 0.027), ('robust', 0.026), ('term', 0.026), ('exp', 0.025), ('assumptions', 0.025), ('kernel', 0.025), ('wrong', 0.025), ('noticed', 0.024), ('situations', 0.023), ('lipschitz', 0.023), ('get', 0.023), ('bach', 0.023), ('assumption', 0.023), ('hardness', 0.022), ('simpler', 0.022), ('front', 0.022), ('satis', 0.022), ('synthetic', 0.022), ('optimization', 0.021), ('iterates', 0.021), ('paris', 0.021), ('smoothness', 0.021), ('potentially', 0.021), ('replaces', 0.02), ('simulations', 0.02), ('zn', 0.02), ('convergent', 0.02), ('attains', 0.02), ('srebro', 0.02), ('asymptotically', 0.02), ('focused', 0.02), ('behaves', 0.02), ('france', 0.02), ('inverse', 0.019), ('suggesting', 0.019), ('explicit', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning

Author: Eric Moulines, Francis R. Bach

Abstract: We consider the minimization of a convex objective function defined on a Hilbert space, which is only available through unbiased estimates of its gradients. This problem includes standard machine learning algorithms such as kernel logistic regression and least-squares regression, and is commonly referred to as a stochastic approximation problem in the operations research community. We provide a non-asymptotic analysis of the convergence of two well-known algorithms, stochastic gradient descent (a.k.a. Robbins-Monro algorithm) as well as a simple modification where iterates are averaged (a.k.a. Polyak-Ruppert averaging). Our analysis suggests that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate in the strongly convex case, is not robust to the lack of strong convexity or the setting of the proportionality constant. This situation is remedied when using slower decays together with averaging, robustly leading to the optimal rate of convergence. We illustrate our theoretical results with simulations on synthetic and standard datasets.

2 0.24250574 271 nips-2011-Statistical Tests for Optimization Efficiency

Author: Levi Boyles, Anoop Korattikara, Deva Ramanan, Max Welling

Abstract: Learning problems, such as logistic regression, are typically formulated as pure optimization problems defined on some loss function. We argue that this view ignores the fact that the loss function depends on stochastically generated data which in turn determines an intrinsic scale of precision for statistical estimation. By considering the statistical properties of the update variables used during the optimization (e.g. gradients), we can construct frequentist hypothesis tests to determine the reliability of these updates. We utilize subsets of the data for computing updates, and use the hypothesis tests for determining when the batch-size needs to be increased. This provides computational benefits and avoids overfitting by stopping when the batch-size has become equal to size of the full dataset. Moreover, the proposed algorithms depend on a single interpretable parameter – the probability for an update to be in the wrong direction – which is set to a single value across all algorithms and datasets. In this paper, we illustrate these ideas on three L1 regularized coordinate descent algorithms: L1 -regularized L2 -loss SVMs, L1 -regularized logistic regression, and the Lasso, but we emphasize that the underlying methods are much more generally applicable. 1

3 0.17817393 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods

Author: Andrew Cotter, Ohad Shamir, Nati Srebro, Karthik Sridharan

Abstract: Mini-batch algorithms have been proposed as a way to speed-up stochastic convex optimization problems. We study how such algorithms can be improved using accelerated gradient methods. We provide a novel analysis, which shows how standard gradient methods may sometimes be insufficient to obtain a significant speed-up and propose a novel accelerated gradient algorithm, which deals with this deficiency, enjoys a uniformly superior guarantee and works well in practice. 1

4 0.167999 202 nips-2011-On the Universality of Online Mirror Descent

Author: Nati Srebro, Karthik Sridharan, Ambuj Tewari

Abstract: We show that for a general class of convex online learning problems, Mirror Descent can always achieve a (nearly) optimal regret guarantee. 1

5 0.13495931 305 nips-2011-k-NN Regression Adapts to Local Intrinsic Dimension

Author: Samory Kpotufe

Abstract: Many nonparametric regressors were recently shown to converge at rates that depend only on the intrinsic dimension of data. These regressors thus escape the curse of dimension when high-dimensional data has low intrinsic dimension (e.g. a manifold). We show that k-NN regression is also adaptive to intrinsic dimension. In particular our rates are local to a query x and depend only on the way masses of balls centered at x vary with radius. Furthermore, we show a simple way to choose k = k(x) locally at any x so as to nearly achieve the minimax rate at x in terms of the unknown intrinsic dimension in the vicinity of x. We also establish that the minimax rate does not depend on a particular choice of metric space or distribution, but rather that this minimax rate holds for any metric space and doubling measure. 1

6 0.1244158 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time

7 0.12169828 121 nips-2011-Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

8 0.11761653 72 nips-2011-Distributed Delayed Stochastic Optimization

9 0.10852741 162 nips-2011-Lower Bounds for Passive and Active Learning

10 0.091985658 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization

11 0.076321073 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

12 0.069205582 282 nips-2011-The Fast Convergence of Boosting

13 0.066303812 109 nips-2011-Greedy Model Averaging

14 0.065712109 195 nips-2011-On Learning Discrete Graphical Models using Greedy Methods

15 0.062427439 4 nips-2011-A Convergence Analysis of Log-Linear Training

16 0.061080117 207 nips-2011-Optimal learning rates for least squares SVMs using Gaussian kernels

17 0.055179965 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation

18 0.054900281 182 nips-2011-Nearest Neighbor based Greedy Coordinate Descent

19 0.054149568 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes

20 0.050705884 59 nips-2011-Composite Multiclass Losses


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.165), (1, -0.076), (2, -0.033), (3, -0.126), (4, -0.032), (5, 0.074), (6, -0.012), (7, -0.034), (8, -0.141), (9, 0.073), (10, 0.034), (11, -0.069), (12, -0.097), (13, -0.137), (14, -0.03), (15, 0.194), (16, -0.031), (17, 0.098), (18, -0.021), (19, 0.205), (20, -0.123), (21, -0.034), (22, -0.142), (23, -0.008), (24, -0.081), (25, 0.153), (26, 0.006), (27, 0.091), (28, -0.029), (29, 0.059), (30, 0.055), (31, 0.041), (32, -0.031), (33, -0.087), (34, 0.045), (35, -0.063), (36, -0.048), (37, 0.085), (38, 0.052), (39, 0.063), (40, 0.115), (41, 0.117), (42, -0.004), (43, 0.003), (44, -0.026), (45, -0.043), (46, 0.085), (47, -0.042), (48, -0.037), (49, -0.004)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95287079 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning

Author: Eric Moulines, Francis R. Bach

Abstract: We consider the minimization of a convex objective function defined on a Hilbert space, which is only available through unbiased estimates of its gradients. This problem includes standard machine learning algorithms such as kernel logistic regression and least-squares regression, and is commonly referred to as a stochastic approximation problem in the operations research community. We provide a non-asymptotic analysis of the convergence of two well-known algorithms, stochastic gradient descent (a.k.a. Robbins-Monro algorithm) as well as a simple modification where iterates are averaged (a.k.a. Polyak-Ruppert averaging). Our analysis suggests that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate in the strongly convex case, is not robust to the lack of strong convexity or the setting of the proportionality constant. This situation is remedied when using slower decays together with averaging, robustly leading to the optimal rate of convergence. We illustrate our theoretical results with simulations on synthetic and standard datasets.

2 0.79232824 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods

Author: Andrew Cotter, Ohad Shamir, Nati Srebro, Karthik Sridharan

Abstract: Mini-batch algorithms have been proposed as a way to speed-up stochastic convex optimization problems. We study how such algorithms can be improved using accelerated gradient methods. We provide a novel analysis, which shows how standard gradient methods may sometimes be insufficient to obtain a significant speed-up and propose a novel accelerated gradient algorithm, which deals with this deficiency, enjoys a uniformly superior guarantee and works well in practice. 1

3 0.74301225 271 nips-2011-Statistical Tests for Optimization Efficiency

Author: Levi Boyles, Anoop Korattikara, Deva Ramanan, Max Welling

Abstract: Learning problems, such as logistic regression, are typically formulated as pure optimization problems defined on some loss function. We argue that this view ignores the fact that the loss function depends on stochastically generated data which in turn determines an intrinsic scale of precision for statistical estimation. By considering the statistical properties of the update variables used during the optimization (e.g. gradients), we can construct frequentist hypothesis tests to determine the reliability of these updates. We utilize subsets of the data for computing updates, and use the hypothesis tests for determining when the batch-size needs to be increased. This provides computational benefits and avoids overfitting by stopping when the batch-size has become equal to size of the full dataset. Moreover, the proposed algorithms depend on a single interpretable parameter – the probability for an update to be in the wrong direction – which is set to a single value across all algorithms and datasets. In this paper, we illustrate these ideas on three L1 regularized coordinate descent algorithms: L1 -regularized L2 -loss SVMs, L1 -regularized logistic regression, and the Lasso, but we emphasize that the underlying methods are much more generally applicable. 1

4 0.7141484 121 nips-2011-Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

Author: Benjamin Recht, Christopher Re, Stephen Wright, Feng Niu

Abstract: Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve stateof-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performancedestroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented without any locking. We present an update scheme called H OGWILD ! which allows processors access to shared memory with the possibility of overwriting each other’s work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then H OGWILD ! achieves a nearly optimal rate of convergence. We demonstrate experimentally that H OGWILD ! outperforms alternative schemes that use locking by an order of magnitude. 1

5 0.67002797 72 nips-2011-Distributed Delayed Stochastic Optimization

Author: Alekh Agarwal, John C. Duchi

Abstract: We analyze the convergence of gradient-based optimization algorithms whose updates depend on delayed stochastic gradient information. The main application of our results is to the development of distributed minimization algorithms where a master node performs parameter updates while worker nodes compute stochastic gradients based on local information in parallel, which may give rise to delays due to asynchrony. Our main contribution is to show that for smooth stochastic problems, the delays are asymptotically negligible. In application to distributed optimization, we show n-node architectures whose optimization error in stochastic problems—in √ spite of asynchronous delays—scales asymptotically as O(1/ nT ), which is known to be optimal even in the absence of delays. 1

6 0.62292176 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time

7 0.61250335 202 nips-2011-On the Universality of Online Mirror Descent

8 0.50333464 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization

9 0.43876675 305 nips-2011-k-NN Regression Adapts to Local Intrinsic Dimension

10 0.42974299 182 nips-2011-Nearest Neighbor based Greedy Coordinate Descent

11 0.38386601 207 nips-2011-Optimal learning rates for least squares SVMs using Gaussian kernels

12 0.36869159 4 nips-2011-A Convergence Analysis of Log-Linear Training

13 0.36236501 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation

14 0.34456357 208 nips-2011-Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness

15 0.34237215 109 nips-2011-Greedy Model Averaging

16 0.34225014 195 nips-2011-On Learning Discrete Graphical Models using Greedy Methods

17 0.33970422 77 nips-2011-Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression

18 0.33876681 148 nips-2011-Learning Probabilistic Non-Linear Latent Variable Models for Tracking Complex Activities

19 0.33572888 69 nips-2011-Differentially Private M-Estimators

20 0.32949564 80 nips-2011-Efficient Online Learning via Randomized Rounding


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.044), (4, 0.034), (20, 0.022), (26, 0.042), (31, 0.056), (33, 0.024), (43, 0.099), (45, 0.189), (48, 0.01), (57, 0.031), (74, 0.057), (82, 0.148), (83, 0.035), (99, 0.115)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.94812542 59 nips-2011-Composite Multiclass Losses

Author: Elodie Vernet, Mark D. Reid, Robert C. Williamson

Abstract: We consider loss functions for multiclass prediction problems. We show when a multiclass loss can be expressed as a “proper composite loss”, which is the composition of a proper loss and a link function. We extend existing results for binary losses to multiclass losses. We determine the stationarity condition, Bregman representation, order-sensitivity, existence and uniqueness of the composite representation for multiclass losses. We subsume existing results on “classification calibration” by relating it to properness and show that the simple integral representation for binary proper losses can not be extended to multiclass losses. 1

same-paper 2 0.90904689 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning

Author: Eric Moulines, Francis R. Bach

Abstract: We consider the minimization of a convex objective function defined on a Hilbert space, which is only available through unbiased estimates of its gradients. This problem includes standard machine learning algorithms such as kernel logistic regression and least-squares regression, and is commonly referred to as a stochastic approximation problem in the operations research community. We provide a non-asymptotic analysis of the convergence of two well-known algorithms, stochastic gradient descent (a.k.a. Robbins-Monro algorithm) as well as a simple modification where iterates are averaged (a.k.a. Polyak-Ruppert averaging). Our analysis suggests that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate in the strongly convex case, is not robust to the lack of strong convexity or the setting of the proportionality constant. This situation is remedied when using slower decays together with averaging, robustly leading to the optimal rate of convergence. We illustrate our theoretical results with simulations on synthetic and standard datasets.

3 0.89293021 172 nips-2011-Minimax Localization of Structural Information in Large Noisy Matrices

Author: Mladen Kolar, Sivaraman Balakrishnan, Alessandro Rinaldo, Aarti Singh

Abstract: We consider the problem of identifying a sparse set of relevant columns and rows in a large data matrix with highly corrupted entries. This problem of identifying groups from a collection of bipartite variables such as proteins and drugs, biological species and gene sequences, malware and signatures, etc is commonly referred to as biclustering or co-clustering. Despite its great practical relevance, and although several ad-hoc methods are available for biclustering, theoretical analysis of the problem is largely non-existent. The problem we consider is also closely related to structured multiple hypothesis testing, an area of statistics that has recently witnessed a flurry of activity. We make the following contributions 1. We prove lower bounds on the minimum signal strength needed for successful recovery of a bicluster as a function of the noise variance, size of the matrix and bicluster of interest. 2. We show that a combinatorial procedure based on the scan statistic achieves this optimal limit. 3. We characterize the SNR required by several computationally tractable procedures for biclustering including element-wise thresholding, column/row average thresholding and a convex relaxation approach to sparse singular vector decomposition. 1

4 0.85345793 186 nips-2011-Noise Thresholds for Spectral Clustering

Author: Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, Aarti Singh

Abstract: Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating our results. 1

5 0.84510565 202 nips-2011-On the Universality of Online Mirror Descent

Author: Nati Srebro, Karthik Sridharan, Ambuj Tewari

Abstract: We show that for a general class of convex online learning problems, Mirror Descent can always achieve a (nearly) optimal regret guarantee. 1

6 0.84146398 80 nips-2011-Efficient Online Learning via Randomized Rounding

7 0.8408668 29 nips-2011-Algorithms and hardness results for parallel large margin learning

8 0.83828807 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time

9 0.8372581 198 nips-2011-On U-processes and clustering performance

10 0.8372016 263 nips-2011-Sparse Manifold Clustering and Embedding

11 0.8371762 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions

12 0.83594042 251 nips-2011-Shaping Level Sets with Submodular Functions

13 0.83389759 271 nips-2011-Statistical Tests for Optimization Efficiency

14 0.83386832 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing

15 0.82977462 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods

16 0.82822937 4 nips-2011-A Convergence Analysis of Log-Linear Training

17 0.82797146 109 nips-2011-Greedy Model Averaging

18 0.82697153 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model

19 0.82656336 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction

20 0.82589495 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity