nips nips2012 nips2012-325 knowledge-graph by maker-knowledge-mining

325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions


Source: pdf

Author: Alekh Agarwal, Sahand Negahban, Martin J. Wainwright

Abstract: We develop and analyze stochastic optimization algorithms for problems in which the expected loss is strongly convex, and the optimum is (approximately) sparse. Previous approaches are able to exploit only one of these two structures, yielding an O(d/T) convergence rate for strongly convex objectives in d dimensions and an O(√(s(log d)/T)) convergence rate when the optimum is s-sparse. Our algorithm is based on successively solving a series of ℓ1-regularized optimization problems using Nesterov’s dual averaging algorithm. We establish that the error of our solution after T iterations is at most O(s(log d)/T), with natural extensions to approximate sparsity. Our results apply to locally Lipschitz losses including the logistic, exponential, hinge and least-squares losses. By recourse to statistical minimax results, we show that our convergence rates are optimal up to constants. The effectiveness of our approach is also confirmed in numerical simulations where we compare to several baselines on a least-squares regression problem.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We develop and analyze stochastic optimization algorithms for problems in which the expected loss is strongly convex, and the optimum is (approximately) sparse. [sent-9, score-0.461]

2 Previous approaches are able to exploit only one of these two structures, yielding an O(d/T) convergence rate for strongly convex objectives in d dimensions and an O(√(s(log d)/T)) convergence rate when the optimum is s-sparse. [sent-10, score-0.679]

3 Our algorithm is based on successively solving a series of ℓ1 -regularized optimization problems using Nesterov’s dual averaging algorithm. [sent-11, score-0.218]

4 By recourse to statistical minimax results, we show that our convergence rates are optimal up to constants. [sent-14, score-0.282]

5 The empirical efficiency of these methods is backed with strong theoretical guarantees on their convergence rates, which depend on various structural properties of the objective function. [sent-19, score-0.22]

6 More precisely, for an objective function that is strongly convex, stochastic gradient descent enjoys a convergence rate ranging from O(1/T), when feature vectors are extremely sparse, to O(d/T), when feature vectors are dense [9, 14, 10]. [sent-20, score-0.58]

7 This strong convexity condition is satisfied for many common machine learning problems, including boosting, least squares regression, SVMs and generalized linear models among others. [sent-21, score-0.176]

8 It has been shown [15, 19] that when the optimal solution θ∗ is s-sparse, appropriate versions of the mirror descent algorithm converge at a rate O(s√((log d)/T)). [sent-26, score-0.215]

9 [20] exploit the smoothness of common loss functions, and obtain improved rates of the form O(η√((s log d)/T)), where η is the noise variance. [sent-28, score-0.359]

10 While the log d scaling makes these methods attractive in high dimensions, their scaling with respect to the iterations T is relatively slow—namely, O(1/√T) as opposed to O(1/T) for strongly convex problems. [sent-29, score-0.434]

11 Many optimization problems encountered in practice exhibit both features: the objective function is strongly convex, and the optimum is (approximately) sparse. [sent-30, score-0.307]

12 This fact leads to the natural question: is it possible to design algorithms for stochastic optimization that enjoy the best features of both types of structure? [sent-31, score-0.198]

13 The main contribution of this paper is to answer this question in the affirmative, and to analyze a new algorithm that has convergence rate O((s log d)/T) for a strongly convex problem with an s-sparse optimum in d dimensions. [sent-33, score-0.57]

14 Our analysis also yields optimal rates when the optimum is only approximately sparse. [sent-35, score-0.268]

15 The algorithm proposed in this paper builds off recent work on multi-step methods for strongly convex problems [11, 10, 12], but involves some new ingredients so as to obtain optimal rates for statistical problems with sparse optima. [sent-36, score-0.337]

16 In particular, we form a sequence of objective functions by decreasing the amount of regularization as the optimization algorithm proceeds, which is quite natural from a statistical viewpoint. [sent-37, score-0.233]

17 Numerical simulations confirm our theoretical predictions regarding the convergence rate of the algorithm, and also establish its superiority compared to regularized dual averaging [22] and stochastic gradient descent algorithms. [sent-40, score-0.749]

18 2 Problem set-up and algorithm description Given a subset Ω ⊆ R^d and a random variable Z taking values in a space Z, we consider an optimization problem of the form θ∗ ∈ arg min_{θ∈Ω} E[L(θ; Z)], (1) where L : Ω × Z → R is a given loss function. [sent-43, score-0.175]

19 As is standard in stochastic optimization, we do not have direct access to the expected loss function L(θ) := E[L(θ; Z)], nor to its subgradients. [sent-44, score-0.22]

20 Rather, for a given query point θ ∈ Ω, we observe a stochastic subgradient, meaning a random vector g(θ) ∈ Rd such that E[g(θ)] ∈ ∂L(θ). [sent-45, score-0.194]

21 Algorithm description: In order to solve a sparse version of the problem (1), our strategy is to consider a sequence of regularized problems of the form min L(θ) + λ θ 1 . [sent-47, score-0.213]

22 (2) ′ θ∈Ω Our algorithm involves a sequence of KT different epochs, where the regularization parameter λ > 0 and the constraint set Ω′ ⊂ Ω change from epoch to epoch. [sent-48, score-0.589]

23 We initialize the algorithm in the first epoch with y_1 = 0, and with any radius R_1 that is an upper bound on ‖θ∗‖₁. [sent-50, score-0.541]

24 The norm ‖·‖_p used in defining the constraint set Ω(R_i) is specified by p = 2 log d/(2 log d − 1), a choice that will be clarified momentarily. [sent-51, score-0.296]

25 The goal of the i-th epoch is to update y_i → y_{i+1} in such a way that we are guaranteed that ‖y_{i+1} − θ∗‖₂² ≤ R_{i+1}² for each i = 1, 2, . [sent-52, score-0.663]

26 In order to update y_i → y_{i+1}, we run T_i rounds of the stochastic dual averaging algorithm [17] (henceforth DA) on the regularized objective min_{θ∈Ω(R_i)} L(θ) + λ_i ‖θ‖₁. (3) [sent-57, score-0.659]

27 For each iteration t = 0, . . . , T_i, we let g^t be a stochastic subgradient of L at θ^t, and we let ν^t be any element of the subdifferential of the ℓ1-norm ‖·‖₁ at θ^t. [sent-62, score-0.211]

28 In the stochastic dual averaging updates (4), we use the prox function ψ_{y_i,R_i}(θ) = ‖θ − y_i‖_p² / (2 R_i² (p − 1)), where p = 2 log d/(2 log d − 1). (5) [sent-65, score-0.606]

29 This particular choice of the prox-function and the specific value of p ensure that the function ψ is strongly convex with respect to the ℓ1-norm, and has been previously used for sparse stochastic optimization (see e. [sent-66, score-0.55]

30 Some algebra yields that the update (4) with Ω′ = R^d is equivalent to θ^{t+1} = y_i − R_i² (p − 1) α^{t+1} |μ^{t+1}|^{q−1} sign(μ^{t+1}) / ‖μ^{t+1}‖_q^{q−2}. [sent-70, score-0.155]
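
For concreteness, the inner dual averaging update with this ℓ_p prox-function can be sketched in Python/NumPy as below. This is a sketch only, not the paper's code: the step-size normalization, the handling of the constraint Ω(R_i) (omitted here, i.e. the Ω′ = R^d case), and the choice of y_{i+1} as the epoch average are assumptions, and the names lq_link, da_epoch and subgradient_oracle are illustrative.

    import numpy as np

    def lq_link(v, q):
        """Gradient of 0.5 * ||v||_q^2, i.e. sign(v) * |v|^(q-1) / ||v||_q^(q-2)."""
        norm_q = np.linalg.norm(v, ord=q)
        if norm_q == 0.0:
            return np.zeros_like(v)
        return np.sign(v) * np.abs(v) ** (q - 1) / norm_q ** (q - 2)

    def da_epoch(y_center, radius, lam, num_rounds, alpha, subgradient_oracle):
        """One epoch of dual averaging on L(theta) + lam * ||theta||_1 started
        at the prox-center y_center, using the prox-function of Equation (5)."""
        d = y_center.shape[0]
        p = 2.0 * np.log(d) / (2.0 * np.log(d) - 1.0)
        q = p / (p - 1.0)                      # dual exponent, q = 2 log d
        mu = np.zeros(d)                       # running sum of regularized subgradients
        theta = y_center.copy()
        running_sum = np.zeros(d)
        for t in range(1, num_rounds + 1):
            g = subgradient_oracle(theta)      # stochastic subgradient of L at theta
            nu = np.sign(theta)                # an element of the l1 subdifferential
            mu += g + lam * nu
            step = alpha / np.sqrt(t)          # decaying step; the paper's scaling in R_i may differ
            theta = y_center + radius ** 2 * (p - 1.0) * lq_link(-step * mu, q)
            running_sum += theta
        return running_sum / num_rounds        # epoch average, used as the next prox-center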

31 Algorithm 1 Regularization Annealed epoch Dual AveRaging (RADAR). Require: epoch length schedule {T_i}_{i=1}^{K_T}, initial radius R_1, step-size multiplier α, prox-function ψ, initial prox-center y_1, regularization parameters λ_i. [sent-74, score-0.704]

32 end for; return y_{K_T+1}. Conditions: Having defined our algorithm, we now discuss the conditions on the objective function L(θ) and stochastic gradients that underlie our analysis. [sent-85, score-0.304]
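
The surrounding epoch structure can then be sketched as follows. The halving of the radius mirrors the halving guarantee discussed above; the schedules {T_i} and {λ_i} are taken as inputs (Equation (9) and the setting in Theorem 1 specify them), and both the function name and the inner_solver interface are illustrative rather than from the paper.

    import numpy as np

    def radar(inner_solver, d, R1, epoch_lengths, lambdas):
        """Outer loop of the RADAR scheme (a sketch).

        inner_solver(y, R, lam, T) should run T rounds of dual averaging on
        L(theta) + lam * ||theta||_1 with prox-center y and radius R (for
        example the da_epoch sketch above) and return the new center y_{i+1}.
        """
        y = np.zeros(d)              # prox-center y_1 = 0
        R = float(R1)                # R_1: any upper bound on ||theta*||_1
        for lam, T in zip(lambdas, epoch_lengths):
            y = inner_solver(y, R, lam, T)
            R /= 2.0                 # anneal the radius from epoch to epoch
        return y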

33 As mentioned, our goal is to obtain fast rates for objectives satisfying a local strong convexity condition, defined below. [sent-89, score-0.263]

34 L(θ̃) ≥ L(θ) + ⟨∇L(θ), θ̃ − θ⟩ + (γ/2) ‖θ̃ − θ‖₂². (7) Some of our results regarding stochastic optimization from a finite sample will use a weaker form of the assumption, called local RSC, exploited in our recent work on statistics and optimization [1, 13]. [sent-92, score-0.293]

35 Our final assumption is a tail condition on the error in stochastic gradients: e(θ) := g(θ) − E[g(θ)]. [sent-93, score-0.236]

36 Common choices for the loss function L(θ; z) are the hinge loss max(0, 1 − y⟨θ, x⟩) or the logistic loss log(1 + exp(−y⟨θ, x⟩)). [sent-101, score-0.32]
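
As an illustration, stochastic subgradients of these two losses on a single labelled sample (x, y) with y ∈ {−1, +1} can be computed as in the following sketch (generic code, not taken from the paper):

    import numpy as np

    def logistic_subgradient(theta, x, y):
        """Gradient of log(1 + exp(-y * <theta, x>)) with respect to theta."""
        margin = y * np.dot(x, theta)
        return -y * x / (1.0 + np.exp(margin))

    def hinge_subgradient(theta, x, y):
        """A subgradient of max(0, 1 - y * <theta, x>) with respect to theta."""
        if y * np.dot(x, theta) < 1.0:
            return -y * x
        return np.zeros_like(x)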

37 More generally, tail conditions on the marginal distribution of each coordinate of x ensure that G = O(√(log d)) is valid with high probability. [sent-106, score-0.193]

38 For a constant ω > 0 governing the error probability in our results, we also define ω_i² = ω² + 24 log i at epoch i. [sent-128, score-0.704]

39 Our results assume that we run Algorithm 1 with T_i ≥ c_1 [ (s²/(γ² R_i²)) ((G² + σ²) log d + ω_i² σ²) + log d ], (9) where c_1 is a universal constant. [sent-129, score-0.348]

40 For a total of T iterations in Algorithm 1, we state our results for the parameter θ_T = y_{K_T+1}, where K_T is the last epoch completed in T iterations. [sent-130, score-0.57]

41 1 Main theorem and some remarks We start with our main result which shows an overall convergence rate of O(1/T ) after T iterations. [sent-132, score-0.276]

42 This O(1/T ) convergence is analogous to earlier work on multi-step methods for strongly convex 4 objectives [11, 12, 10]. [sent-133, score-0.295]

43 This quantity ε²(θ∗; S), defined in Equation (10), captures the degree of sparsity in the optimum θ∗; for instance, ε²(θ∗; S) = 0 if and only if θ∗ is supported on S. [sent-138, score-0.167]

44 Suppose the expected loss L satisfies Assumptions 1-3 with parameters G(R) ≡ G, γ and σ(R) ≡ σ, and we perform updates (4) with epoch lengths (9) and regularization parameters λ_i² = (R_i γ)/(s √T_i) √((G² + σ²) log d + ω_i² σ² log d). [sent-141, score-1.007]

45 For any subset S ⊆ {1, . . . , d} of cardinality s and any T ≥ 2κ_T, there is a universal constant c_0 such that with probability at least 1 − 6 exp(−ω²/12) we have ‖θ_T − θ∗‖₂² ≤ (c_3 s κ_T)/(γ² T) [ (G² + σ²) log d + σ² (ω² + log(κ_T/log d)) ] + ε²(θ∗; S). (13) [sent-145, score-0.388]

46 Consequently, the theorem predicts a convergence rate of O(1/(γ² T)), which is the best possible under our assumptions. [sent-146, score-0.387]

47 We note that for an approximately sparse θ∗, Theorem 1 guarantees convergence only to a tolerance ε²(θ∗; S) due to the error terms arising out of the approximate sparsity. [sent-148, score-0.251]

48 One approach to minimizing the objective (1) is to perform stochastic gradient descent on it, which has a convergence rate of O((G² + σ²)/(γ² T)) [10, 14], where ‖∇L(θ)‖₂ ≤ G and E[exp(‖e(θ)‖₂²/σ²)] ≤ exp(1). [sent-152, score-0.546]
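
For reference, such an SGD baseline for a γ-strongly convex expected loss can be sketched as follows; the 1/(γt) step size and the running average of iterates are the classical textbook choices and may differ in detail from the baseline used in the experiments.

    import numpy as np

    def sgd_strongly_convex(subgradient_oracle, d, gamma, num_iters):
        """Plain SGD with step sizes 1/(gamma * t): exploits strong convexity
        but not sparsity, matching the O((G^2 + sigma^2)/(gamma^2 T)) regime."""
        theta = np.zeros(d)
        theta_avg = np.zeros(d)
        for t in range(1, num_iters + 1):
            g = subgradient_oracle(theta)
            theta = theta - g / (gamma * t)
            theta_avg += (theta - theta_avg) / t   # running average of iterates
        return theta_avg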

49 An alternative is to perform mirror descent [15, 19] or regularized dual averaging [22] using the same prox-function as Algorithm 1 but without breaking it up into epochs. [sent-154, score-0.35]

50 As mentioned in the introduction, this single-step method fails to exploit the strong convexity of our problem and obtains an inferior convergence rate of O(s√((log d)/T)) [19, 22, 7]. [sent-155, score-0.528]

51 A proposal closer to our approach is to minimize the regularized objective (3), but with a fixed value of λ instead of the decreasing schedule of λi used in Theorem 1. [sent-156, score-0.215]

52 However, with this fixed setting of λ, the initial epochs tend to be much longer than needed for halving the error. [sent-158, score-0.257]

53 2 Some illustrative corollaries We now present some consequences of Theorem 1 by making specific assumptions regarding the sparsity of θ∗ . [sent-162, score-0.156]

54 To facilitate comparison with minimax lower bounds, we set ω 2 = δ log d in the corollaries. [sent-168, score-0.188]

55 The first part of the corollary follows directly from Theorem 1 by noting that ε²(θ∗; S) = 0 under our assumptions. [sent-171, score-0.195]

56 Note that as q ranges over the interval [0, 1], reflecting the degree of sparsity, the convergence rate ranges from O(1/T) (for q = 0, corresponding to exact sparsity) to O(1/√T) (for q = 1). [sent-172, score-0.177]

57 This is a rather interesting trade-off, showing in a precise sense how convergence rates vary quantitatively as a function of the underlying sparsity. [sent-173, score-0.202]

58 Concretely, ignoring factors of O(log T), we get a parameter θ_T having error at most O(s log d/(γ² T)), with an error probability decaying to zero with d. [sent-175, score-0.244]

59 Moreover, in doing so our algorithm only goes over at most T data samples, as each stochastic gradient can be evaluated with one fresh data sample drawn from the underlying distribution. [sent-176, score-0.199]

60 Since the statistical minimax lower bounds [13, 21] demonstrate that this is the smallest possible error that any method can attain from T samples, our method is statistically optimal in the scaling of the estimation error with the number of samples. [sent-177, score-0.246]

61 We also observe that it is easy to instead set ω² = δ log T if an error probability decaying with T is desired, incurring at most additional log T factors in the error bound. [sent-178, score-0.44]

62 Stochastic optimization over finite pools: A common setting for the application of stochastic optimization methods in machine learning is when one has a finite pool of examples, say {Z_1, . . . , Z_n}. [sent-181, score-0.25]

63 The objective (1) then takes the form θ∗ = arg min_{θ∈Ω} (1/n) Σ_{i=1}^n L(θ; Z_i). (15) In this setting, a stochastic gradient g(θ) can be obtained by drawing a sample Z_j uniformly at random, with replacement, from the pool {Z_1, . . . , Z_n}. [sent-184, score-0.314]
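
A minimal sketch of such a pool-based stochastic gradient oracle (sampling uniformly with replacement) is given below; loss_gradient stands for the per-sample gradient of L(θ; Z_j) and is an assumed helper, not part of the paper.

    import numpy as np

    def make_pool_oracle(pool, loss_gradient, rng=None):
        """Return g(theta): the gradient of L(theta; Z_j) for an index j drawn
        uniformly at random (with replacement) from the finite pool, so that
        E[g(theta)] is a gradient of the empirical objective (15)."""
        rng = np.random.default_rng() if rng is None else rng
        def oracle(theta):
            j = rng.integers(len(pool))
            return loss_gradient(theta, pool[j])
        return oracle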

64 In high-dimensional problems where d ≫ n, the sample loss is not strongly convex. [sent-188, score-0.168]

65 We will present this corollary only for settings where θ∗ is exactly sparse, and also specialize to the logistic loss L(θ; (x, y)) = log(1 + exp(−y⟨θ, x⟩)) to illustrate the key aspects of the result. [sent-190, score-0.159]

66 We observe that the bound only holds when the number of samples n in the objective (15) is large enough, which is necessary for the restricted form of the LSC condition to hold with non-trivial parameters in the finite sample setting. [sent-201, score-0.289]

67 A modified method with constant epoch lengths: Algorithm 1 as described is efficient and simple to implement. [sent-202, score-0.508]

68 However, the convergence results critically rely on the epoch lengths T_i being set appropriately in a doubling manner. [sent-203, score-0.659]

69 This could be problematic in practice, where it might be tricky to know when an epoch should be terminated. [sent-204, score-0.508]

70 Following Juditsky and Nesterov [11], we next demonstrate how a variant of our algorithm with constant epoch lengths enjoys similar rates of convergence. [sent-205, score-0.73]

71 The key challenge here is that, unlike the previous set-up [11], our objective function changes at each epoch, which leads to significant technical difficulties. [sent-206, score-0.574]

72 At a very coarse level, if we have a total budget of T iterations, then this version of our algorithm allows us to set the epoch lengths to O(log T ), and guarantees convergence rates that are O((log T )/T ). [sent-207, score-0.839]

73 Suppose we run Algorithm 1 for a total of T iterations with epoch length Ti ≡ T log d/κT and with parameters as in Equation 12. [sent-214, score-0.76]

74 Assuming that this setting ensures T_i = O(log d), for any set S, with probability at least 1 − 3 exp(−ω²/12), ‖θ_T − θ∗‖₂² = O( s κ_T [ (G² + σ²) log d + (ω² + log(κ_T/log d)) σ² ] (log d)/T ). [sent-215, score-0.444]

75 The theorem shows that up to logarithmic factors in T , setting the epoch lengths optimally is not critical. [sent-216, score-0.736]

76 Specifically, we generate samples (x_t, y_t) with each coordinate of x_t distributed as Unif[−B, B] and y_t = ⟨θ∗, x_t⟩ + w_t. [sent-220, score-0.21]

77 Given an iterate θt , we generate a stochastic gradient of the expected loss (1) at (xt , yt ). [sent-223, score-0.329]
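
A sketch of this simulation oracle is shown below; the Gaussian noise w_t and the default constants are assumptions made only for illustration.

    import numpy as np

    def make_stream_oracle(theta_star, B=1.0, noise_std=0.5, rng=None):
        """Each call draws a fresh sample x_t ~ Unif[-B, B]^d with
        y_t = <theta*, x_t> + w_t, and returns x_t * (<x_t, theta> - y_t),
        an unbiased estimate of the gradient of the expected least-squares
        loss at the current iterate theta."""
        rng = np.random.default_rng() if rng is None else rng
        d = theta_star.shape[0]
        def oracle(theta):
            x = rng.uniform(-B, B, size=d)
            y = x @ theta_star + noise_std * rng.normal()
            return (x @ theta - y) * x
        return oracle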

78 Our first set of results evaluates Algorithm 1 against other stochastic optimization baselines, assuming complete knowledge of problem parameters. [sent-225, score-0.236]

79 Specifically, epoch i is terminated once ‖y_{i+1} − θ∗‖₂ ≤ ‖y_i − θ∗‖₂/2. [sent-226, score-0.612]
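
This stopping rule is an oracle criterion (it requires knowing θ∗) used only to put the methods on equal footing in the simulations; as a sketch:

    import numpy as np

    def epoch_done(y_next, y_curr, theta_star):
        """End epoch i once the distance to theta* has been halved."""
        return np.linalg.norm(y_next - theta_star) <= 0.5 * np.linalg.norm(y_curr - theta_star)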

80 The first baseline is the regularized dual averaging (RDA) algorithm [22], applied to the regularized objective (3) with λ = 4η√((log d)/T), which is the statistically optimal regularization parameter with T samples. [sent-229, score-0.622]

81 We use the same prox-function ψ(θ) = ‖θ‖_p²/(2(p − 1)), so that the theory for RDA predicts a convergence rate of O(s√((log d)/T)) [22]. [sent-230, score-0.325]

82 Our second baseline is the stochastic gradient (SGD) algorithm which exploits the strong convexity but not the sparsity of the problem (1). [sent-231, score-0.405]

83 Our epochs are again determined by halving of the squared ℓp -error measured relative to θ∗ . [sent-239, score-0.223]

84 Finally, we also evaluate the version of our algorithm with constant epoch lengths that we analyzed in Theorem 2 (henceforth RADAR-CONST), using epochs of length log(T ). [sent-240, score-0.84]
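
A constant-length schedule in the spirit of RADAR-CONST can be sketched as follows; the exact constant in front of log T (and any dependence on log d, cf. Theorem 2) is left unspecified and would be an assumption.

    import math

    def constant_epoch_schedule(total_iters):
        """RADAR-CONST style schedule: every epoch has the same length,
        roughly log(total_iters), so no per-epoch doubling is needed."""
        epoch_len = max(1, int(math.ceil(math.log(total_iters))))
        num_epochs = max(1, total_iters // epoch_len)
        return [epoch_len] * num_epochs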

85 Even though the RADAR-CONST method does not use the knowledge of θ∗ to set epochs, all three methods exhibit the same eventual convergence rates, with RADAR (set with optimal epoch lengths) performing the best, as expected. [sent-242, score-0.657]

86 Although RADAR-CONST is very slow in initial iterations, its convergence rate remains competitive with EDA (even though EDA does exploit knowledge of θ∗ ), but is worse than RADAR as expected. [sent-243, score-0.255]

87 Although the optimal epoch length setting is not too critical for our approach, better data-dependent empirical rules for determining epoch lengths remain an interesting question for future research. [sent-245, score-1.227]

88 A comparison of RADAR with other stochastic optimization algorithms for d = 40000 and s = ⌈log d⌉. [sent-254, score-0.198]

89 The left plot compares RADAR with the RDA and SGD algorithms, neither of which exploits both the sparsity and the strong convexity structures simultaneously. [sent-255, score-0.206]

90 5 Discussion In this paper we present an algorithm that is able to take advantage of the strong convexity and sparsity conditions that are satisfied by many common problems in machine learning. [sent-258, score-0.251]

91 Our algorithm is simple and efficient to implement, and for a d-dimensional objective with an s-sparse optimum, it achieves the minimax-optimal convergence rate O(s log d/T). [sent-259, score-0.391]

92 We also demonstrate optimal convergence rates for problems that have weakly sparse optima, with implications for problems such as sparse linear regression and sparse logistic regression. [sent-260, score-0.462]

93 It would be interesting to study similar developments for other algorithms such as mirror descent or Nesterov’s accelerated gradient methods, leading to multi-step variants of those methods with optimal convergence rates in our setting. [sent-262, score-0.402]

94 1 To clarify, the epoch lengths in RADAR-CONST are set large enough to guarantee that we can attain an overall error bound of O(1/T ), meaning that the initial epochs for RADAR-CONST are much longer than for RADAR. [sent-265, score-1.031]

95 Thus, after roughly 500 iterations, RADAR-CONST has done only 2 epochs and operates with a crude constraint set Ω(R_1/4). [sent-266, score-0.161]

96 During epoch i, the step size scales proportionally to R_i/√t, where t is the iteration number within the epoch; hence the relatively large initial steps in an epoch can take us to a bad solution even when we start with a good solution y_i when R_i is large. [sent-267, score-1.154]

97 Fast global convergence rates of gradient methods for high-dimensional statistical recovery. [sent-275, score-0.255]

98 Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. [sent-339, score-0.186]

99 Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, part ii: shrinking procedures and optimal algorithms. [sent-352, score-0.521]

100 Dual averaging methods for regularized stochastic learning and online optimization. [sent-408, score-0.312]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('epoch', 0.508), ('radar', 0.273), ('ti', 0.189), ('eda', 0.186), ('ri', 0.177), ('epochs', 0.161), ('lsc', 0.151), ('log', 0.148), ('stochastic', 0.146), ('kt', 0.145), ('lengths', 0.129), ('rq', 0.12), ('lipschitz', 0.111), ('convergence', 0.109), ('yi', 0.104), ('optimum', 0.095), ('juditsky', 0.094), ('strongly', 0.094), ('rates', 0.093), ('rda', 0.09), ('averaging', 0.089), ('convexity', 0.089), ('bq', 0.086), ('nesterov', 0.086), ('negahban', 0.078), ('dual', 0.077), ('regularized', 0.077), ('loss', 0.074), ('sparsity', 0.072), ('rate', 0.068), ('objective', 0.066), ('subgradient', 0.065), ('mirror', 0.063), ('halving', 0.062), ('rsc', 0.062), ('iterations', 0.062), ('theorem', 0.062), ('exp', 0.06), ('sgd', 0.058), ('logistic', 0.058), ('convex', 0.056), ('yt', 0.056), ('ykt', 0.055), ('radii', 0.055), ('sparse', 0.054), ('gradient', 0.053), ('henceforth', 0.053), ('universal', 0.052), ('optimization', 0.052), ('update', 0.051), ('xt', 0.049), ('min', 0.049), ('error', 0.048), ('meaning', 0.048), ('regularization', 0.048), ('gradients', 0.047), ('da', 0.047), ('corollary', 0.047), ('rd', 0.046), ('strong', 0.045), ('conditions', 0.045), ('descent', 0.044), ('exploit', 0.044), ('losses', 0.044), ('simulations', 0.043), ('regarding', 0.043), ('agarwal', 0.042), ('length', 0.042), ('condition', 0.042), ('prox', 0.042), ('assumptions', 0.041), ('cardinality', 0.04), ('sb', 0.04), ('optimal', 0.04), ('hinge', 0.04), ('minimax', 0.04), ('approximately', 0.04), ('recovery', 0.039), ('locally', 0.039), ('composite', 0.039), ('baselines', 0.038), ('schedule', 0.038), ('overall', 0.037), ('logarithmic', 0.037), ('lan', 0.037), ('nemirovski', 0.037), ('scaling', 0.037), ('objectives', 0.036), ('optima', 0.036), ('eecs', 0.035), ('url', 0.035), ('satis', 0.035), ('xiao', 0.034), ('srebro', 0.034), ('decreasing', 0.034), ('initial', 0.034), ('br', 0.034), ('bound', 0.033), ('attain', 0.033), ('sequence', 0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

Author: Alekh Agarwal, Sahand Negahban, Martin J. Wainwright

Abstract: We develop and analyze stochastic optimization algorithms for problems in which the expected loss is strongly convex, and the optimum is (approximately) sparse. Previous approaches are able to exploit only one of these two structures, yielding an O(d/T) convergence rate for strongly convex objectives in d dimensions and an O(√(s(log d)/T)) convergence rate when the optimum is s-sparse. Our algorithm is based on successively solving a series of ℓ1-regularized optimization problems using Nesterov’s dual averaging algorithm. We establish that the error of our solution after T iterations is at most O(s(log d)/T), with natural extensions to approximate sparsity. Our results apply to locally Lipschitz losses including the logistic, exponential, hinge and least-squares losses. By recourse to statistical minimax results, we show that our convergence rates are optimal up to constants. The effectiveness of our approach is also confirmed in numerical simulations where we compare to several baselines on a least-squares regression problem.

2 0.22249562 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

Author: Andre Wibisono, Martin J. Wainwright, Michael I. Jordan, John C. Duchi

Abstract: We consider derivative-free algorithms for stochastic optimization problems that use only noisy function values rather than gradients, analyzing their finite-sample convergence rates. We show that if pairs of function values are available, algorithms that use gradient estimates based on random perturbations suffer a factor of at most √d in convergence rate over traditional stochastic gradient methods, where d is the problem dimension. We complement our algorithmic development with information-theoretic lower bounds on the minimax convergence rate of such problems, which show that our bounds are sharp with respect to all problem-dependent quantities: they cannot be improved by more than constant factors. 1

3 0.16807932 324 nips-2012-Stochastic Gradient Descent with Only One Projection

Author: Mehrdad Mahdavi, Tianbao Yang, Rong Jin, Shenghuo Zhu, Jinfeng Yi

Abstract: Although many variants of stochastic gradient descent have been proposed for large-scale convex optimization, most of them require projecting the solution at each iteration to ensure that the obtained solution stays within the feasible domain. For complex domains (e.g., positive semidefinite cone), the projection step can be computationally expensive, making stochastic gradient descent unattractive for large-scale optimization problems. We address this limitation by developing novel stochastic optimization algorithms that do not need intermediate projections. Instead, only one projection at the last iteration is needed to obtain a feasible solution in the given domain. Our theoretical analysis shows that with a high probability, the proposed algorithms achieve an O(1/√T) convergence rate for general convex optimization, and an O(ln T /T) rate for strongly convex optimization under mild conditions about the domain and the objective function. 1

4 0.14952594 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization

Author: Xi Chen, Qihang Lin, Javier Pena

Abstract: This paper considers a wide spectrum of regularized stochastic optimization problems where both the loss function and regularizer can be non-smooth. We develop a novel algorithm based on the regularized dual averaging (RDA) method, that can simultaneously achieve the optimal convergence rates for both convex and strongly convex loss. In particular, for strongly convex loss, it achieves the optimal rate of O(1/N + 1/N²) for N iterations, which improves the rate O((log N)/N) for previous regularized dual averaging algorithms. In addition, our method constructs the final solution directly from the proximal mapping instead of averaging all previous iterates. For widely used sparsity-inducing regularizers (e.g., the ℓ1-norm), it has the advantage of encouraging sparser solutions. We further develop a multistage extension using the proposed algorithm as a subroutine, which achieves the uniformly-optimal rate O(1/N + exp{−N}) for strongly convex loss. 1

5 0.13177027 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization

Author: Brendan Mcmahan, Matthew Streeter

Abstract: Some of the most compelling applications of online convex optimization, including online prediction and classification, are unconstrained: the natural feasible set is Rn. Existing algorithms fail to achieve sub-linear regret in this setting unless constraints on the comparator point x̊ are known in advance. We present algorithms that, without such prior knowledge, offer near-optimal regret bounds with respect to any choice of x̊. In particular, regret with respect to x̊ = 0 is constant. We then prove lower bounds showing that our guarantees are near-optimal in this setting. 1

6 0.12625009 84 nips-2012-Convergence Rate Analysis of MAP Coordinate Minimization Algorithms

7 0.11853036 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation

8 0.11372777 73 nips-2012-Coding efficiency and detectability of rate fluctuations with non-Poisson neuronal firing

9 0.10704451 16 nips-2012-A Polynomial-time Form of Robust Regression

10 0.10217512 290 nips-2012-Recovery of Sparse Probability Measures via Convex Programming

11 0.10216817 143 nips-2012-Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins

12 0.098911889 206 nips-2012-Majorization for CRFs and Latent Likelihoods

13 0.098774247 292 nips-2012-Regularized Off-Policy TD-Learning

14 0.097240724 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

15 0.097008251 275 nips-2012-Privacy Aware Learning

16 0.091597915 20 nips-2012-A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets

17 0.089795731 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

18 0.088622406 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

19 0.08601445 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance

20 0.085797898 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.261), (1, 0.013), (2, 0.174), (3, 0.019), (4, 0.101), (5, 0.063), (6, -0.024), (7, -0.021), (8, 0.002), (9, -0.009), (10, 0.046), (11, 0.072), (12, -0.057), (13, 0.049), (14, 0.006), (15, 0.102), (16, -0.095), (17, -0.112), (18, 0.049), (19, -0.012), (20, 0.02), (21, -0.043), (22, 0.044), (23, 0.022), (24, -0.083), (25, -0.093), (26, 0.01), (27, -0.008), (28, -0.112), (29, -0.001), (30, 0.129), (31, -0.104), (32, -0.056), (33, -0.094), (34, -0.064), (35, 0.055), (36, 0.064), (37, -0.014), (38, 0.103), (39, 0.016), (40, 0.006), (41, -0.132), (42, 0.06), (43, -0.026), (44, -0.066), (45, -0.049), (46, -0.005), (47, 0.012), (48, 0.085), (49, 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95836687 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

Author: Alekh Agarwal, Sahand Negahban, Martin J. Wainwright

Abstract: We develop and analyze stochastic optimization algorithms for problems in which the expected loss is strongly convex, and the optimum is (approximately) sparse. Previous approaches are able to exploit only one of these two structures, yielding an O(d/T) convergence rate for strongly convex objectives in d dimensions and an O(√(s(log d)/T)) convergence rate when the optimum is s-sparse. Our algorithm is based on successively solving a series of ℓ1-regularized optimization problems using Nesterov’s dual averaging algorithm. We establish that the error of our solution after T iterations is at most O(s(log d)/T), with natural extensions to approximate sparsity. Our results apply to locally Lipschitz losses including the logistic, exponential, hinge and least-squares losses. By recourse to statistical minimax results, we show that our convergence rates are optimal up to constants. The effectiveness of our approach is also confirmed in numerical simulations where we compare to several baselines on a least-squares regression problem.

2 0.91343784 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization

Author: Xi Chen, Qihang Lin, Javier Pena

Abstract: This paper considers a wide spectrum of regularized stochastic optimization problems where both the loss function and regularizer can be non-smooth. We develop a novel algorithm based on the regularized dual averaging (RDA) method, that can simultaneously achieve the optimal convergence rates for both convex and strongly convex loss. In particular, for strongly convex loss, it achieves the optimal rate of O(1/N + 1/N²) for N iterations, which improves the rate O((log N)/N) for previous regularized dual averaging algorithms. In addition, our method constructs the final solution directly from the proximal mapping instead of averaging all previous iterates. For widely used sparsity-inducing regularizers (e.g., the ℓ1-norm), it has the advantage of encouraging sparser solutions. We further develop a multistage extension using the proposed algorithm as a subroutine, which achieves the uniformly-optimal rate O(1/N + exp{−N}) for strongly convex loss. 1

3 0.9008252 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

Author: Andre Wibisono, Martin J. Wainwright, Michael I. Jordan, John C. Duchi

Abstract: We consider derivative-free algorithms for stochastic optimization problems that use only noisy function values rather than gradients, analyzing their finite-sample convergence rates. We show that if pairs of function values are available, algorithms that use gradient estimates based on random perturbations suffer a factor of at most √d in convergence rate over traditional stochastic gradient methods, where d is the problem dimension. We complement our algorithmic development with information-theoretic lower bounds on the minimax convergence rate of such problems, which show that our bounds are sharp with respect to all problem-dependent quantities: they cannot be improved by more than constant factors. 1

4 0.79189068 20 nips-2012-A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets

Author: Nicolas L. Roux, Mark Schmidt, Francis R. Bach

Abstract: We propose a new stochastic gradient method for optimizing the sum of a finite set of smooth functions, where the sum is strongly convex. While standard stochastic gradient methods converge at sublinear rates for this problem, the proposed method incorporates a memory of previous gradient values in order to achieve a linear convergence rate. In a machine learning context, numerical experiments indicate that the new algorithm can dramatically outperform standard algorithms, both in terms of optimizing the training error and reducing the test error quickly. 1

5 0.76372552 76 nips-2012-Communication-Efficient Algorithms for Statistical Optimization

Author: Yuchen Zhang, Martin J. Wainwright, John C. Duchi

Abstract: We study two communication-efficient algorithms for distributed statistical optimization on large-scale data. The first algorithm is an averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves mean-squared error that decays as O(N^{-1} + (N/m)^{-2}). Whenever m ≤ √N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of the bootstrap. Requiring only a single round of communication, it has mean-squared error that decays as O(N^{-1} + (N/m)^{-3}), and so is more robust to the amount of parallelization. We complement our theoretical results with experiments on large-scale problems from the internet search domain. In particular, we show that our methods efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which consists of N ≈ 2.4 × 10^8 samples and d ≥ 700,000 dimensions. 1

6 0.74966842 285 nips-2012-Query Complexity of Derivative-Free Optimization

7 0.70757705 161 nips-2012-Interpreting prediction markets: a stochastic approach

8 0.6946615 217 nips-2012-Mixability in Statistical Learning

9 0.63836098 131 nips-2012-Feature Clustering for Accelerating Parallel Coordinate Descent

10 0.63740963 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization

11 0.61746061 16 nips-2012-A Polynomial-time Form of Robust Regression

12 0.61192864 324 nips-2012-Stochastic Gradient Descent with Only One Projection

13 0.59999967 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach

14 0.58954883 206 nips-2012-Majorization for CRFs and Latent Likelihoods

15 0.58763617 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation

16 0.56854767 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs

17 0.55555141 275 nips-2012-Privacy Aware Learning

18 0.55332088 292 nips-2012-Regularized Off-Policy TD-Learning

19 0.53640324 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance

20 0.53420454 84 nips-2012-Convergence Rate Analysis of MAP Coordinate Minimization Algorithms


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.038), (21, 0.028), (26, 0.132), (38, 0.212), (42, 0.105), (54, 0.017), (55, 0.02), (74, 0.055), (76, 0.17), (80, 0.083), (92, 0.053)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92215943 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

Author: Alekh Agarwal, Sahand Negahban, Martin J. Wainwright

Abstract: We develop and analyze stochastic optimization algorithms for problems in which the expected loss is strongly convex, and the optimum is (approximately) sparse. Previous approaches are able to exploit only one of these two structures, yielding an O(d/T) convergence rate for strongly convex objectives in d dimensions and an O(√(s(log d)/T)) convergence rate when the optimum is s-sparse. Our algorithm is based on successively solving a series of ℓ1-regularized optimization problems using Nesterov’s dual averaging algorithm. We establish that the error of our solution after T iterations is at most O(s(log d)/T), with natural extensions to approximate sparsity. Our results apply to locally Lipschitz losses including the logistic, exponential, hinge and least-squares losses. By recourse to statistical minimax results, we show that our convergence rates are optimal up to constants. The effectiveness of our approach is also confirmed in numerical simulations where we compare to several baselines on a least-squares regression problem.

2 0.91080737 363 nips-2012-Wavelet based multi-scale shape features on arbitrary surfaces for cortical thickness discrimination

Author: Won H. Kim, Deepti Pachauri, Charles Hatt, Moo. K. Chung, Sterling Johnson, Vikas Singh

Abstract: Hypothesis testing on signals defined on surfaces (such as the cortical surface) is a fundamental component of a variety of studies in Neuroscience. The goal here is to identify regions that exhibit changes as a function of the clinical condition under study. As the clinical questions of interest move towards identifying very early signs of diseases, the corresponding statistical differences at the group level invariably become weaker and increasingly hard to identify. Indeed, after a multiple comparisons correction is adopted (to account for correlated statistical tests over all surface points), very few regions may survive. In contrast to hypothesis tests on point-wise measurements, in this paper, we make the case for performing statistical analysis on multi-scale shape descriptors that characterize the local topological context of the signal around each surface vertex. Our descriptors are based on recent results from harmonic analysis, that show how wavelet theory extends to non-Euclidean settings (i.e., irregular weighted graphs). We provide strong evidence that these descriptors successfully pick up group-wise differences, where traditional methods either fail or yield unsatisfactory results. Other than this primary application, we show how the framework allows performing cortical surface smoothing in the native space without mapping to a unit sphere. 1

3 0.90111947 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound

Author: Chi Jin, Liwei Wang

Abstract: Margin is one of the most important concepts in machine learning. Previous margin bounds, both for SVM and for boosting, are dimensionality independent. A major advantage of this dimensionality independency is that it can explain the excellent performance of SVM whose feature spaces are often of high or infinite dimension. In this paper we address the problem whether such dimensionality independency is intrinsic for the margin bounds. We prove a dimensionality dependent PAC-Bayes margin bound. The bound is monotone increasing with respect to the dimension when keeping all other factors fixed. We show that our bound is strictly sharper than a previously well-known PAC-Bayes margin bound if the feature space is of finite dimension; and the two bounds tend to be equivalent as the dimension goes to infinity. In addition, we show that the VC bound for linear classifiers can be recovered from our bound under mild conditions. We conduct extensive experiments on benchmark datasets and find that the new bound is useful for model selection and is usually significantly sharper than the dimensionality independent PAC-Bayes margin bound as well as the VC bound for linear classifiers.

4 0.90079868 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method

Author: Nikhil Bhat, Vivek Farias, Ciamac C. Moallemi

Abstract: This paper presents a novel non-parametric approximate dynamic programming (ADP) algorithm that enjoys graceful approximation and sample complexity guarantees. In particular, we establish both theoretically and computationally that our proposal can serve as a viable alternative to state-of-the-art parametric ADP algorithms, freeing the designer from carefully specifying an approximation architecture. We accomplish this by developing a kernel-based mathematical program for ADP. Via a computational study on a controlled queueing network, we show that our procedure is competitive with parametric ADP approaches. 1

5 0.89709496 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

Author: Andre Wibisono, Martin J. Wainwright, Michael I. Jordan, John C. Duchi

Abstract: We consider derivative-free algorithms for stochastic optimization problems that use only noisy function values rather than gradients, analyzing their finite-sample convergence rates. We show that if pairs of function values are available, algorithms that use gradient estimates based on random perturbations suffer a factor of at most √d in convergence rate over traditional stochastic gradient methods, where d is the problem dimension. We complement our algorithmic development with information-theoretic lower bounds on the minimax convergence rate of such problems, which show that our bounds are sharp with respect to all problem-dependent quantities: they cannot be improved by more than constant factors. 1

6 0.89395297 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization

7 0.89357293 275 nips-2012-Privacy Aware Learning

8 0.89333022 285 nips-2012-Query Complexity of Derivative-Free Optimization

9 0.89023799 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications

10 0.88958794 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification

11 0.88754249 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

12 0.88730645 358 nips-2012-Value Pursuit Iteration

13 0.88489777 179 nips-2012-Learning Manifolds with K-Means and K-Flats

14 0.88473809 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models

15 0.8845498 34 nips-2012-Active Learning of Multi-Index Function Models

16 0.88393182 236 nips-2012-Near-Optimal MAP Inference for Determinantal Point Processes

17 0.88202959 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

18 0.88153619 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback

19 0.88128304 227 nips-2012-Multiclass Learning with Simplex Coding

20 0.88097262 361 nips-2012-Volume Regularization for Binary Classification