nips nips2012 nips2012-20 knowledge-graph by maker-knowledge-mining

20 nips-2012-A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets


Source: pdf

Author: Nicolas L. Roux, Mark Schmidt, Francis R. Bach

Abstract: We propose a new stochastic gradient method for optimizing the sum of a finite set of smooth functions, where the sum is strongly convex. While standard stochastic gradient methods converge at sublinear rates for this problem, the proposed method incorporates a memory of previous gradient values in order to achieve a linear convergence rate. In a machine learning context, numerical experiments indicate that the new algorithm can dramatically outperform standard algorithms, both in terms of optimizing the training error and reducing the test error quickly. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We propose a new stochastic gradient method for optimizing the sum of a finite set of smooth functions, where the sum is strongly convex. [sent-6, score-0.15]

2 While standard stochastic gradient methods converge at sublinear rates for this problem, the proposed method incorporates a memory of previous gradient values in order to achieve a linear convergence rate. [sent-7, score-0.366]

3 $\min_{x \in \mathbb{R}^p} g(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x)$. (1) In this work, we focus on such finite training data problems, where each $f_i$ is smooth and the average function $g$ is strongly convex. [sent-14, score-0.154]

4 As an example, in the case of $\ell_2$-regularized logistic regression we have $f_i(x) := \frac{\lambda}{2}\|x\|^2 + \log(1 + \exp(-b_i a_i^T x))$, where $a_i \in \mathbb{R}^p$ and $b_i \in \{-1, 1\}$ are the training examples associated with a binary classification problem and $\lambda$ is a regularization parameter. [sent-15, score-0.188]
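
As a concrete illustration of the per-example functions above, here is a minimal sketch of the $\ell_2$-regularized logistic loss and its gradient; the function and argument names are ours, not the paper's.

```python
import numpy as np

def f_i(x, a_i, b_i, lam):
    """Per-example loss: (lam/2)*||x||^2 + log(1 + exp(-b_i * a_i.x))."""
    margin = b_i * a_i.dot(x)
    # logaddexp(0, -m) computes log(1 + exp(-m)) without overflow
    return 0.5 * lam * x.dot(x) + np.logaddexp(0.0, -margin)

def f_i_grad(x, a_i, b_i, lam):
    """Gradient of the per-example loss with respect to x."""
    margin = b_i * a_i.dot(x)
    sigma = 1.0 / (1.0 + np.exp(-margin))      # sigmoid of the margin
    # d/dm log(1 + exp(-m)) = -(1 - sigmoid(m)), and dm/dx = b_i * a_i
    return lam * x - (1.0 - sigma) * b_i * a_i
```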

5 The standard full gradient (FG) method, which dates back to [4], uses iterations of the form $x^{k+1} = x^k - \alpha_k g'(x^k) = x^k - \frac{\alpha_k}{n} \sum_{i=1}^{n} f_i'(x^k)$. (3) [sent-18, score-0.457]

6 Using $x^*$ to denote the unique minimizer of $g$, the FG method with a constant step size achieves a linear convergence rate: $g(x^k) - g(x^*) = O(\rho^k)$, for some $\rho < 1$ which depends on the condition number of $g$ [5, Theorem 2. [sent-19, score-0.207]

7 Despite the fast convergence rate of the FG method, it can be unappealing when n is large because its iteration cost scales linearly in n. [sent-23, score-0.256]

8 The basic SG method for optimizing (1) uses iterations of the form $x^{k+1} = x^k - \alpha_k f_{i_k}'(x^k)$, (4) where $\alpha_k$ is a step size and a training example $i_k$ is selected uniformly from the set $\{1, \dots, n\}$. [sent-25, score-0.443]
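
For reference, a minimal sketch of the plain SG iteration (4); grad_fi(i, x) is assumed to return $f_i'(x)$, and alphas is an assumed decreasing step-size sequence (both names are ours).

```python
import numpy as np

def sg(grad_fi, x0, n, alphas, num_iters, seed=0):
    """Basic stochastic gradient iteration (4): x <- x - alpha_k * f'_{i_k}(x)."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for k in range(num_iters):
        i_k = int(rng.integers(n))   # example sampled uniformly from {0, ..., n-1}
        x = x - alphas[k] * grad_fi(i_k, x)
    return x
```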

9 Under certain assumptions this convergence rate is optimal for strongly-convex optimization in a model of computation where the algorithm only accesses the function through unbiased measurements of its objective and gradient (see [6, 7, 8]). [sent-30, score-0.315]

10 Thus, we cannot hope to obtain a better convergence rate if the algorithm only relies on unbiased gradient measurements. [sent-31, score-0.268]

11 Nevertheless, by using the stronger assumption that the functions are sampled from a finite dataset, in this paper we show that we can achieve an exponential convergence rate while preserving the iteration cost of SG methods. [sent-32, score-0.182]

12 The SAG method uses iterations of the form $x^{k+1} = x^k - \frac{\alpha_k}{n} \sum_{i=1}^{n} y_i^k$, (5) where at each iteration a random training example $i_k$ is selected and we set $y_i^k = f_i'(x^k)$ if $i = i_k$, and $y_i^k = y_i^{k-1}$ otherwise. [sent-34, score-0.565]
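
A minimal sketch of the SAG update (5), keeping a memory $y_i$ of the most recent gradient seen for each example and maintaining their sum incrementally so that each iteration touches a single gradient. The names are ours, and this is the naive O(np)-memory version; an O(n)-storage variant for linearly parameterized losses is sketched later.

```python
import numpy as np

def sag(grad_fi, x0, n, alpha, num_iters, seed=0):
    """Stochastic average gradient, iteration (5)."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    y = np.zeros((n, x.size))    # y[i] = most recent f_i'(x) computed for example i
    d = np.zeros(x.size)         # running sum of the y[i]
    for _ in range(num_iters):
        i = int(rng.integers(n))
        g = grad_fi(i, x)
        d += g - y[i]            # refresh the sum with the new gradient of example i
        y[i] = g
        x = x - (alpha / n) * d  # x <- x - (alpha/n) * sum_i y[i]
    return x
```

As in the text, each iteration computes only one gradient, and the cost of the update does not grow with n because the sum d is maintained incrementally.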

13 But, like the SG method, each iteration only computes the gradient with respect to a single training example and the cost of the iterations is independent of n. [sent-36, score-0.252]

14 Despite the low cost of the SAG iterations, in this paper we show that the SAG iterations have a linear convergence rate, like the FG method. [sent-37, score-0.224]

15 That is, by having access to ik and by keeping a memory of the most recent gradient value computed for each training example i, this iteration achieves a faster convergence rate than is possible for standard SG methods. [sent-38, score-0.383]

16 Further, in terms of effective passes through the data, we also show that for certain problems the convergence rate of SAG is faster than is possible for the standard FG method. [sent-39, score-0.323]

17 Note that a linear convergence rate for the training cost does not translate into a similar rate for the testing cost, and an appealing property of SG methods is that they achieve the optimal O(1/k) rate for the testing cost as long as every datapoint is seen only once. [sent-41, score-0.62]

18 In this context, the analysis of SG methods only applies to the training cost and, although our analysis also focuses on the training cost, in our experiments the SAG method typically reached the optimal testing cost faster than both FG and SG methods. [sent-43, score-0.265]

19 However, despite 60 years of extensive research on SG methods, with most of the applications focusing on finite datasets, we are not aware of any other SG method that achieves a linear convergence rate while preserving the iteration cost of standard SG methods. [sent-45, score-0.328]

20 Section 3 states the (standard) assumptions underlying our analysis and gives the main technical results; we first give a slow linear convergence rate that applies for any problem, and then give a very fast linear convergence rate that applies when n is sufficiently large. [sent-46, score-0.366]

21 Section 4 discusses practical implementation issues, including how to reduce the storage cost from O(np) to O(n) when each fi only depends on a linear combination of x. [sent-47, score-0.183]

22 Section 5 presents a numerical comparison of an implementation based on SAG to SG and FG methods, indicating that the method may be very useful for problems where we can only afford to do a few passes through a data set. [sent-48, score-0.161]

23 Momentum: SG methods that incorporate a momentum term use iterations of the form $x^{k+1} = x^k - \alpha_k f_{i_k}'(x^k) + \beta_k (x^k - x^{k-1})$; see [10]. [sent-51, score-0.392]
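
For comparison with the SAG sketch above, a minimal sketch of this SG-with-momentum update using a constant $\beta$; the names are ours.

```python
import numpy as np

def sg_momentum(grad_fi, x0, n, alphas, beta, num_iters, seed=0):
    """SG with a momentum term: x <- x - alpha_k * f'_{i_k}(x) + beta * (x - x_prev)."""
    rng = np.random.default_rng(seed)
    x_prev = x0.copy()
    x = x0.copy()
    for k in range(num_iters):
        i = int(rng.integers(n))
        x_new = x - alphas[k] * grad_fi(i, x) + beta * (x - x_prev)
        x_prev, x = x, x_new
    return x
```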

24 It is common to set all $\beta_k = \beta$ for some constant $\beta$, and in this case we can rewrite the SG with momentum method as $x^{k+1} = x^k - \sum_{j=1}^{k} \alpha_j \beta^{k-j} f_{i_j}'(x^j)$. [sent-52, score-0.294]

25 We can re-write the SAG updates (5) in a similar form as $x^{k+1} = x^k - \sum_{j=1}^{k} \alpha_k S(j, i_{1:k}) f_{i_j}'(x^j)$, (6) where the selection function $S(j, i_{1:k})$ is equal to $1/n$ if $j$ corresponds to the last iteration where $j = i_k$ and is set to 0 otherwise. [sent-53, score-0.234]

26 Thus, momentum uses a geometric weighting of previous gradients while the SAG iterations select and average the most recent evaluation of each previous gradient. [sent-54, score-0.183]

27 While momentum can lead to improved practical performance, it still requires the use of a decreasing sequence of step sizes and is not known to lead to a faster convergence rate. [sent-55, score-0.244]

28 Gradient Averaging: Closely related to momentum is using the sample average of all previous gradients, $x^{k+1} = x^k - \frac{\alpha_k}{k} \sum_{j=1}^{k} f_{i_j}'(x^j)$, which is similar to the SAG iteration in the form (5) but where all previous gradients are used. [sent-56, score-0.296]

29 This approach is used in the dual averaging method [11], and while this averaging procedure leads to convergence for a constant step size and can improve the constants in the convergence rate [12], it does not improve on the O(1/k) rate. [sent-57, score-0.495]

30 Iterate Averaging: Rather than averaging the gradients, some authors use the basic SG iteration but take an average over xk values. [sent-58, score-0.241]

31 With a suitable choice of step-sizes, this gives the same asymptotic efficiency as Newton-like second-order SG methods and also leads to increased robustness of the convergence rate to the exact sequence of step sizes [13]. [sent-59, score-0.282]

32 The epoch SG method uses averaging to obtain the O(1/k) rate even for non-smooth objectives [15]. [sent-63, score-0.174]

33 However, the convergence rates of these averaging methods remain sublinear. [sent-64, score-0.178]

34 Alternately, if we split the convergence rate into a deterministic and stochastic part, these methods can improve the dependency on the deterministic part [19, 12]. [sent-68, score-0.243]

35 Constant step size: If the SG iterations are used with a constant step size (rather than a decreasing sequence), then the convergence rate of the method can be split into two parts [21, Proposition 2. [sent-70, score-0.42]

36 Thus, with a constant step size the SG iterations have a linear convergence rate up to some tolerance, and in general after this point the iterations do not make further progress. [sent-72, score-0.443]

37 Indeed, convergence of the basic SG method with a constant step size has only been shown under extremely strong assumptions about the relationship between the functions fi [22]. [sent-73, score-0.334]

38 This contrasts with the method we present in this work which converges to the optimal solution using a constant step size and does so with a linear rate (without additional assumptions). [sent-74, score-0.217]

39 Accelerated methods: Accelerated SG methods, which despite their name are not related to the aforementioned AFG method, take advantage of the fast convergence rate of SG methods with a constant step size. [sent-75, score-0.277]

40 In particular, accelerated SG methods use a constant step size by default, and only decrease the step size on iterations where the inner-product between successive gradient estimates is negative [23, 24]. [sent-76, score-0.348]

41 This leads to convergence of the method and allows it to potentially achieve periods of linear convergence where the step size stays constant. [sent-77, score-0.306]

42 However, the overall convergence rate of the method remains sublinear. [sent-78, score-0.209]

43 Hybrid Methods: Some authors have proposed variants of the SG method for problems of the form (1) that seek to gradually transform the iterates into the FG method in order to achieve a linear convergence rate. [sent-79, score-0.196]

44 Bertsekas proposes to go through the data cyclically with a specialized weighting that allows the method to achieve a linear convergence rate for strongly-convex quadratic functions [25]. [sent-80, score-0.241]

45 However, the weighting is numerically unstable and the linear convergence rate treats full passes through the data as iterations. [sent-81, score-0.287]

46 A related strategy is to group the fi functions into ‘batches’ of increasing size and perform SG iterations on the batches [26]. [sent-82, score-0.236]

47 In both cases, the iterations that achieve the linear rate have a cost that is not independent of n, as opposed to SAG. [sent-83, score-0.24]

48 This method is identical to the SAG iteration (5), but uses a cyclic choice of ik rather than sampling the ik values. [sent-86, score-0.16]

49 are only able to show that the convergence rate is linear for strongly-convex quadratic functions (without deriving an explicit rate), and their analysis treats full passes through the data as iterations. [sent-89, score-0.274]

50 Further, as our analysis and experiments show, when the number of training examples is sufficiently large, the SAG iterations achieve a linear convergence rate under a much larger set of step sizes than the IAG method. [sent-91, score-0.385]

51 This leads to more robustness to the selection of the step size and also, if suitably chosen, leads to a faster convergence rate and improved practical performance. [sent-92, score-0.299]

52 3 Convergence Analysis In our analysis we assume that each function $f_i$ in (1) is differentiable and that each gradient $f_i'$ is Lipschitz-continuous with constant $L$, meaning that for all $x$ and $y$ in $\mathbb{R}^p$ we have $\|f_i'(x) - f_i'(y)\| \le L \|x - y\|$. [sent-95, score-0.541]

53 This is a fairly weak assumption on the $f_i$ functions, and in cases where the $f_i$ are twice-differentiable it is equivalent to saying that the eigenvalues of the Hessians of each $f_i$ are bounded above by $L$. [sent-96, score-0.336]

54 In addition, we also assume that the average function $g = \frac{1}{n} \sum_{i=1}^{n} f_i$ is strongly convex with constant $\mu > 0$, meaning that the function $x \mapsto g(x) - \frac{\mu}{2}\|x\|^2$ is convex. [sent-97, score-0.14]

55 Our convergence results assume that we initialize $y_i^0$ to a zero vector for all $i$, and our results depend on the variance of the gradient norms at the optimum $x^*$, denoted by $\sigma^2 = \frac{1}{n} \sum_i \|f_i'(x^*)\|^2$. [sent-102, score-0.312]

56 We first consider the convergence rate of the method when using a constant step size of $\alpha_k = \frac{1}{2nL}$, which is similar to the step size needed for convergence of the IAG method in practice. [sent-104, score-0.476]

57 Proposition 1: With a constant step size of $\alpha_k = \frac{1}{2nL}$, the SAG iterations satisfy for $k \ge 1$: $\mathbb{E}\,\|x^k - x^*\|^2 \le \left(1 - \frac{\mu}{8Ln}\right)^{k} \left(3\,\|x^0 - x^*\|^2 + \frac{9\sigma^2}{4L^2}\right)$. [sent-105, score-0.327]

58 Note that the SAG iterations also trivially obtain the O(1/k) rate achieved by SG methods, since $\left(1 - \frac{\mu}{8Ln}\right)^{k} \le \exp\!\left(-\frac{k\mu}{8Ln}\right) \le \frac{8Ln}{k\mu} = O(n/k)$, albeit with a constant which is proportional to n. [sent-107, score-0.204]
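
Spelled out, the chain behind this O(n/k) claim uses only the elementary bounds $1 - t \le e^{-t}$ and $e^{-z} \le 1/z$ for $z > 0$:

```latex
\left(1 - \frac{\mu}{8Ln}\right)^{k}
  \;\le\; \exp\!\left(-\frac{k\mu}{8Ln}\right)
  \;\le\; \frac{8Ln}{k\mu}
  \;=\; O(n/k).
```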

59 Despite this constant, they are advantageous over SG methods in later iterations because they obtain an exponential convergence rate as in FG methods. [sent-108, score-0.284]

60 We also note that an exponential convergence rate is obtained for any constant step size smaller than $\frac{1}{2nL}$. [sent-109, score-0.271]

61 In terms of passes through the data, the rate in Proposition 1 is similar to that achieved by the basic FG method. [sent-110, score-0.196]

62 Proposition 2: If $n \ge \frac{8L}{\mu}$, with a step size of $\alpha_k = \frac{1}{2n\mu}$, the SAG iterations satisfy for $k \ge n$: $\mathbb{E}\big[g(x^k) - g(x^*)\big] \le C \left(1 - \frac{1}{8n}\right)^{k}$, with $C = \frac{16L}{3n}\,\|x^0 - x^*\|^2 + \frac{4\sigma^2}{3n\mu}\left(8 \log\!\left(1 + \frac{\mu n}{4L}\right) + 1\right)$. [sent-112, score-0.146]

63 We state this result for $k \ge n$ because we assume that the first n iterations of the algorithm use an SG method and that we initialize the subsequent SAG iterations with the average of the iterates, which leads to an O((log n)/k) rate. [sent-113, score-0.213]

64 In contrast, using the SAG iterations from the beginning gives the same rate but with a constant proportional to n. [sent-114, score-0.204]

65 It is interesting to compare this convergence rate with the known convergence rates of first-order methods [5, see §2]. [sent-117, score-0.316]

66 9996, and the ‘optimal’ AFG method has a faster rate of $(1 - \sqrt{\mu/L}) = 0.$ [sent-120, score-0.142]

67 In contrast, running n iterations of SAG has a much faster rate of $(1 - 1/(8n))^{n} = 0.$ [sent-122, score-0.202]
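
As a quick sanity check on this comparison (the exact truncated decimal is not recoverable from this extract), the per-pass SAG factor is bounded by a constant for every n, again using $1 - t \le e^{-t}$:

```latex
\left(1 - \frac{1}{8n}\right)^{n} \;\le\; \exp\!\left(-\frac{n}{8n}\right) \;=\; e^{-1/8} \;\approx\; 0.8825,
```

which is far below the FG factor of roughly 0.9996 quoted above.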

68 Even though n appears in the convergence rate, if we perform n iterations of SAG (i. [sent-127, score-0.179]

69 Further, while the step size in Proposition 2 depends on $\mu$ and $n$, we can obtain the same convergence rate by using a step size as large as $\alpha_k = \frac{1}{16L}$. [sent-131, score-0.303]

70 Thus, for certain problems the SAG iterations can tolerate a much larger step size, which leads to increased robustness to the selection of the step size. [sent-134, score-0.175]

71 While we have stated Proposition 1 in terms of the iterates and Proposition 2 in terms of the function values, the rates obtained on iterates and function values are equivalent because, by the Lipschitz and strong-convexity assumptions, we have $\frac{\mu}{2}\|x^k - x^*\|^2 \le g(x^k) - g(x^*) \le \frac{L}{2}\|x^k - x^*\|^2$. [sent-136, score-0.395]

72 Structured gradients: For many problems the storage cost of O(np) for the $y_i^k$ vectors is prohibitive, but we can often use structure in the $f_i$ to reduce this cost. [sent-138, score-0.209]

73 For example, many loss functions $f_i$ take the form $f_i(a_i^T x)$ for a vector $a_i$. [sent-139, score-0.26]

74 Since $a_i$ is constant, for these problems we only need to store a scalar per example. (Footnote 1: While it may appear suboptimal to not use the gradients computed during the n iterations of stochastic gradient descent, using them only improves the bound by a constant.) [sent-140, score-0.24]

75 (Footnote 2: Note that L in the SAG rates is based on the $f_i$ functions, while in the FG methods it is based on $g$, which can be much smaller.) [sent-141, score-0.152]

76 Specifically, we need to store only the scalar $f_i'(u^k_i)$ for $u^k_i = a_i^T x^k$, rather than the full gradient $a_i\, f_i'(u^k_i)$, reducing the storage cost to O(n). [sent-142, score-0.483]

77 Step-size re-weighting: On early iterations of the SAG algorithm, when most $y_i^k$ are set to the uninformative zero vector, rather than dividing $\alpha_k$ in (5) by n we found it was more effective to divide by m, the number of unique $i_k$ values that we have sampled so far (which converges to n). [sent-147, score-0.188]
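
A minimal sketch combining the two implementation notes above: the O(n) gradient memory for losses of the form $f_i(x) = \ell(a_i^T x) + \frac{\lambda}{2}\|x\|^2$, and the divide-by-m re-weighting. The names are ours, and applying the $\lambda x$ term exactly at every step rather than storing it in the memory is our implementation choice, not spelled out in this extract.

```python
import numpy as np

def sag_linear(A, b, lam, loss_grad_scalar, alpha, num_iters, seed=0):
    """SAG with O(n) gradient memory for linearly parameterized losses.

    A: (n, p) data matrix, b: (n,) labels.
    loss_grad_scalar(u, b_i): derivative of the data-fitting loss w.r.t. u = a_i.x
    (for the logistic loss: -b_i * (1 - sigmoid(b_i * u))).
    """
    rng = np.random.default_rng(seed)
    n, p = A.shape
    x = np.zeros(p)
    s = np.zeros(n)                    # stored scalar derivative for each example
    d = np.zeros(p)                    # d = sum_i s[i] * A[i]
    seen = np.zeros(n, dtype=bool)
    m = 0                              # number of distinct examples sampled so far
    for _ in range(num_iters):
        i = int(rng.integers(n))
        if not seen[i]:
            seen[i] = True
            m += 1
        g_new = loss_grad_scalar(A[i].dot(x), b[i])
        d += (g_new - s[i]) * A[i]     # refresh the sum with O(p) work
        s[i] = g_new
        # divide by m rather than n on early iterations, per the re-weighting above
        x = x - alpha * (d / m + lam * x)
    return x
```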

78 These step sizes seem to work even when the additional assumption of Proposition 2 is not satisfied, and we conjecture that the convergence rates under these step sizes are much faster than the rate obtained in Proposition 1 for the general case. [sent-154, score-0.372]

79 Line search: Since L is generally not known, we experimented with a basic line-search, where we start with an initial estimate $L_0$, and we double this estimate whenever we do not satisfy the instantiated Lipschitz inequality $f_{i_k}\!\left(x^k - \tfrac{1}{L^k} f_{i_k}'(x^k)\right) \le f_{i_k}(x^k) - \tfrac{1}{2L^k} \|f_{i_k}'(x^k)\|^2$. [sent-155, score-0.179]
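
A minimal sketch of this doubling line search; once the estimate exceeds the true Lipschitz constant of $f_i'$ the inequality holds and the loop stops. How the estimate is carried over or decreased between iterations is not specified in this extract, so we assume the caller simply reuses the returned value.

```python
import numpy as np

def lipschitz_line_search(f_i, grad_f_i, x, L_est):
    """Double L until f_i(x - (1/L) f_i'(x)) <= f_i(x) - (1/(2L)) ||f_i'(x)||^2."""
    g = grad_f_i(x)
    g_sq = g.dot(g)
    fx = f_i(x)
    L = L_est
    while f_i(x - g / L) > fx - g_sq / (2.0 * L):
        L *= 2.0
    return L
```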

80 – AFG: Nesterov’s accelerated full gradient method [16], where iterations of (3) with a fixed step size are interleaved with an extrapolation step, and we use an adaptive line-search based on [27]. [sent-161, score-0.271]

81 – Pegasos: The state-of-the-art SG method described by iteration (4) with a step size of $\alpha_k = 1/(\mu k)$ and a projection step onto a norm-ball known to contain the optimal solution [28]. [sent-164, score-0.164]

82 – ESG: The epoch SG method of [15], which runs SG with a constant step size and averaging in a series of epochs, and is optimal for non-smooth stochastic strongly-convex optimization. [sent-166, score-0.23]

83 – SAG: The proposed stochastic average gradient method described by iteration (5) using the modifications discussed in the previous section. [sent-192, score-0.164]

84 In all the experiments, we measure the training and testing costs as a function of the number of effective passes through the data, measured as the number of fi evaluations divided by n. [sent-195, score-0.307]

85 If we can afford to do many passes through the data (say, several hundred), then an FG method should be used. [sent-200, score-0.161]

86 We expect that the SAG iterations will be most useful between these two extremes, where we can afford to do more than one pass through the data but cannot afford to do enough passes to warrant using FG algorithms like L-BFGS. [sent-201, score-0.296]

87 We plot the training and testing costs of the different methods for 30 effective passes through the data in Figure 1. [sent-207, score-0.21]

88 , the covertype data), with a carefully-chosen step-size the SG methods do substantially better than FG methods on the first few passes through the data (e. [sent-220, score-0.183]

89 In contrast, FG methods are not sensitive to the step size and because of their steady progress we also see that FG methods slowly catch up to the SG methods and eventually (or will eventually) pass them (e. [sent-223, score-0.153]

90 In some cases (protein and covertype), the significant speed-up observed for SAG in reaching low training costs also translates to reaching the optimal testing cost more quickly than the other methods. [sent-229, score-0.139]

91 In a learning context, where each function $f_i$ is the loss associated with a single data point, L is equal to the largest value of the loss second derivative $\xi$ (1 for the square loss, 1/4 for the logistic loss) times $R^2$, where R is the uniform bound on the norm of each data point. [sent-234, score-0.184]
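
A small helper for this bound in the logistic case, where $\xi = 1/4$ and $R^2$ is the largest squared data norm; adding $\lambda$ for an $\ell_2$ regularizer folded into each $f_i$ is our addition, not spelled out in this sentence.

```python
import numpy as np

def lipschitz_bound_logistic(A, lam=0.0):
    """Upper bound on L for l2-regularized logistic losses: 0.25 * max_i ||a_i||^2 + lam."""
    R_sq = np.max(np.sum(A * A, axis=1))   # R^2, the largest squared row norm
    return 0.25 * R_sq + lam
```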

92 Training cost vs. testing cost: The theoretical contribution of this work is limited to the convergence rate of the training cost. [sent-241, score-0.25]

93 However, as shown in our experiments, the testing cost of the SAG iterates often reaches its minimum more quickly than that of existing SG methods, and we could expect to improve the constant in the O(1/k) convergence rate, as is the case with online second-order methods [32]. [sent-246, score-0.252]

94 Step-size selection and termination criteria: The three major disadvantages of SG methods are: (i) the slow convergence rate, (ii) deciding when to terminate the algorithm, and (iii) choosing the step size while running the algorithm. [sent-247, score-0.199]

95 This paper showed that the SAG iterations achieve a much faster convergence rate, but the SAG iterations may also be advantageous in terms of tuning step sizes and designing termination criteria. [sent-248, score-0.397]

96 In particular, the SAG iterations suggest a natural termination criterion; since the average of the $y_i^k$ variables converges to $g'(x^k)$ as $x^k - x^{k-1}$ converges to zero, we can use $\frac{1}{n}\sum_i y_i^k$ as an approximation of the optimality of $x^k$. [sent-249, score-0.309]
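
The suggested test is a one-liner on the running gradient memory; the tolerance name is ours.

```python
import numpy as np

def sag_should_stop(d, n, tol=1e-6):
    """Stop when ||(1/n) * sum_i y_i||, an approximation of ||g'(x^k)||, falls below tol.
    Here d is the running sum of the y_i vectors maintained by the SAG update."""
    return np.linalg.norm(d / n) <= tol
```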

97 Further, while SG methods require specifying a sequence of step sizes and misspecifying this sequence can have a disastrous effect on the convergence rate [7, §2. [sent-250, score-0.267]

98 1], our theory shows that the SAG iterations achieve a linear convergence rate for any sufficiently small constant step size and our experiments indicate that a simple line-search gives strong performance. [sent-251, score-0.462]

99 A convergent incremental gradient method with a constant step size. [sent-311, score-0.189]

100 A method for unconstrained convex minimization problem with the rate of convergence O(1/k2 ). [sent-345, score-0.223]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('sag', 0.671), ('sg', 0.522), ('fg', 0.261), ('xk', 0.153), ('iag', 0.127), ('afg', 0.112), ('fi', 0.112), ('convergence', 0.093), ('passes', 0.091), ('rate', 0.09), ('iterations', 0.086), ('fik', 0.082), ('esg', 0.081), ('nosg', 0.081), ('gradient', 0.065), ('pegasos', 0.064), ('rda', 0.059), ('momentum', 0.056), ('ik', 0.053), ('covertype', 0.046), ('steepest', 0.046), ('cost', 0.045), ('stochastic', 0.045), ('averaging', 0.045), ('afford', 0.044), ('proposition', 0.043), ('bfgs', 0.042), ('ls', 0.04), ('testing', 0.039), ('step', 0.037), ('blatt', 0.035), ('accelerated', 0.034), ('incremental', 0.033), ('sizes', 0.032), ('logistic', 0.032), ('iterates', 0.032), ('pass', 0.031), ('fij', 0.031), ('gradients', 0.028), ('ens', 0.028), ('sierra', 0.028), ('training', 0.028), ('constant', 0.028), ('iteration', 0.028), ('faster', 0.026), ('storage', 0.026), ('method', 0.026), ('yi', 0.026), ('protein', 0.025), ('rates', 0.025), ('size', 0.023), ('effective', 0.023), ('des', 0.022), ('siam', 0.021), ('paris', 0.021), ('accesses', 0.02), ('loss', 0.02), ('lk', 0.02), ('unbiased', 0.02), ('achieve', 0.019), ('nicolas', 0.019), ('schmidt', 0.019), ('rp', 0.019), ('termination', 0.018), ('inria', 0.018), ('minus', 0.018), ('regularized', 0.017), ('steady', 0.017), ('material', 0.017), ('ai', 0.016), ('supplementary', 0.016), ('extensive', 0.016), ('aware', 0.016), ('optimum', 0.016), ('substantially', 0.016), ('france', 0.016), ('basic', 0.015), ('hybrid', 0.015), ('leads', 0.015), ('francis', 0.015), ('methods', 0.015), ('roux', 0.015), ('batches', 0.015), ('convex', 0.014), ('smooth', 0.014), ('kdd', 0.014), ('costs', 0.014), ('appealing', 0.014), ('mark', 0.014), ('despite', 0.014), ('lipschitz', 0.014), ('nemirovski', 0.014), ('optimization', 0.014), ('accelerate', 0.013), ('deciding', 0.013), ('weighting', 0.013), ('optimal', 0.013), ('regularizer', 0.013), ('epoch', 0.013), ('sublinear', 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 20 nips-2012-A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets

Author: Nicolas L. Roux, Mark Schmidt, Francis R. Bach

Abstract: We propose a new stochastic gradient method for optimizing the sum of a finite set of smooth functions, where the sum is strongly convex. While standard stochastic gradient methods converge at sublinear rates for this problem, the proposed method incorporates a memory of previous gradient values in order to achieve a linear convergence rate. In a machine learning context, numerical experiments indicate that the new algorithm can dramatically outperform standard algorithms, both in terms of optimizing the training error and reducing the test error quickly. 1

2 0.16352074 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

Author: Andre Wibisono, Martin J. Wainwright, Michael I. Jordan, John C. Duchi

Abstract: We consider derivative-free algorithms for stochastic optimization problems that use only noisy function values rather than gradients, analyzing their finite-sample convergence rates. We show that if pairs of function values are available, algorithms that √ gradient estimates based on random perturbations suffer a factor use of at most d in convergence rate over traditional stochastic gradient methods, where d is the problem dimension. We complement our algorithmic development with information-theoretic lower bounds on the minimax convergence rate of such problems, which show that our bounds are sharp with respect to all problemdependent quantities: they cannot be improved by more than constant factors. 1

3 0.14907669 300 nips-2012-Scalable nonconvex inexact proximal splitting

Author: Suvrit Sra

Abstract: We study a class of large-scale, nonsmooth, and nonconvex optimization problems. In particular, we focus on nonconvex problems with composite objectives. This class includes the extensively studied class of convex composite objective problems as a subclass. To solve composite nonconvex problems we introduce a powerful new framework based on asymptotically nonvanishing errors, avoiding the common stronger assumption of vanishing errors. Within our new framework we derive both batch and incremental proximal splitting algorithms. To our knowledge, our work is first to develop and analyze incremental nonconvex proximalsplitting algorithms, even if we were to disregard the ability to handle nonvanishing errors. We illustrate one instance of our general framework by showing an application to large-scale nonsmooth matrix factorization. 1

4 0.10719752 350 nips-2012-Trajectory-Based Short-Sighted Probabilistic Planning

Author: Felipe Trevizan, Manuela Veloso

Abstract: Probabilistic planning captures the uncertainty of plan execution by probabilistically modeling the effects of actions in the environment, and therefore the probability of reaching different states from a given state and action. In order to compute a solution for a probabilistic planning problem, planners need to manage the uncertainty associated with the different paths from the initial state to a goal state. Several approaches to manage uncertainty were proposed, e.g., consider all paths at once, perform determinization of actions, and sampling. In this paper, we introduce trajectory-based short-sighted Stochastic Shortest Path Problems (SSPs), a novel approach to manage uncertainty for probabilistic planning problems in which states reachable with low probability are substituted by artificial goals that heuristically estimate their cost to reach a goal state. We also extend the theoretical results of Short-Sighted Probabilistic Planner (SSiPP) [1] by proving that SSiPP always finishes and is asymptotically optimal under sufficient conditions on the structure of short-sighted SSPs. We empirically compare SSiPP using trajectorybased short-sighted SSPs with the winners of the previous probabilistic planning competitions and other state-of-the-art planners in the triangle tireworld problems. Trajectory-based SSiPP outperforms all the competitors and is the only planner able to scale up to problem number 60, a problem in which the optimal solution contains approximately 1070 states. 1

5 0.10384099 282 nips-2012-Proximal Newton-type methods for convex optimization

Author: Jason Lee, Yuekai Sun, Michael Saunders

Abstract: We seek to solve convex optimization problems in composite form: minimize f (x) := g(x) + h(x), n x∈R where g is convex and continuously differentiable and h : Rn → R is a convex but not necessarily differentiable function whose proximal mapping can be evaluated efficiently. We derive a generalization of Newton-type methods to handle such convex but nonsmooth objective functions. We prove such methods are globally convergent and achieve superlinear rates of convergence in the vicinity of an optimal solution. We also demonstrate the performance of these methods using problems of relevance in machine learning and statistics. 1

6 0.091597915 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

7 0.079066306 120 nips-2012-Exact and Stable Recovery of Sequences of Signals with Sparse Increments via Differential 1-Minimization

8 0.074140303 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models

9 0.071881168 27 nips-2012-A quasi-Newton proximal splitting method

10 0.066654488 145 nips-2012-Gradient Weights help Nonparametric Regressors

11 0.059087403 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization

12 0.055557806 84 nips-2012-Convergence Rate Analysis of MAP Coordinate Minimization Algorithms

13 0.054530594 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation

14 0.048537821 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach

15 0.048260447 324 nips-2012-Stochastic Gradient Descent with Only One Projection

16 0.042273026 340 nips-2012-The representer theorem for Hilbert spaces: a necessary and sufficient condition

17 0.037526127 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

18 0.037308607 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes

19 0.036842719 106 nips-2012-Dynamical And-Or Graph Learning for Object Shape Modeling and Detection

20 0.036085822 143 nips-2012-Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.111), (1, 0.007), (2, 0.088), (3, -0.042), (4, 0.054), (5, 0.069), (6, 0.005), (7, -0.065), (8, 0.107), (9, 0.024), (10, -0.041), (11, 0.069), (12, 0.026), (13, -0.001), (14, -0.061), (15, -0.021), (16, -0.086), (17, -0.027), (18, -0.022), (19, 0.029), (20, -0.014), (21, -0.048), (22, 0.036), (23, 0.006), (24, -0.047), (25, -0.059), (26, -0.018), (27, 0.001), (28, -0.026), (29, -0.004), (30, 0.09), (31, -0.112), (32, -0.002), (33, -0.017), (34, -0.034), (35, 0.019), (36, 0.091), (37, 0.021), (38, 0.017), (39, 0.042), (40, 0.022), (41, -0.095), (42, -0.013), (43, -0.103), (44, -0.01), (45, 0.039), (46, -0.075), (47, 0.06), (48, 0.085), (49, 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9188717 20 nips-2012-A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets

Author: Nicolas L. Roux, Mark Schmidt, Francis R. Bach

Abstract: We propose a new stochastic gradient method for optimizing the sum of a finite set of smooth functions, where the sum is strongly convex. While standard stochastic gradient methods converge at sublinear rates for this problem, the proposed method incorporates a memory of previous gradient values in order to achieve a linear convergence rate. In a machine learning context, numerical experiments indicate that the new algorithm can dramatically outperform standard algorithms, both in terms of optimizing the training error and reducing the test error quickly. 1

2 0.67364556 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization

Author: Xi Chen, Qihang Lin, Javier Pena

Abstract: This paper considers a wide spectrum of regularized stochastic optimization problems where both the loss function and regularizer can be non-smooth. We develop a novel algorithm based on the regularized dual averaging (RDA) method, that can simultaneously achieve the optimal convergence rates for both convex and strongly convex loss. In particular, for strongly convex loss, it achieves the opti1 1 mal rate of O( N + N 2 ) for N iterations, which improves the rate O( log N ) for preN vious regularized dual averaging algorithms. In addition, our method constructs the final solution directly from the proximal mapping instead of averaging of all previous iterates. For widely used sparsity-inducing regularizers (e.g., 1 -norm), it has the advantage of encouraging sparser solutions. We further develop a multistage extension using the proposed algorithm as a subroutine, which achieves the 1 uniformly-optimal rate O( N + exp{−N }) for strongly convex loss. 1

3 0.66263121 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

Author: Andre Wibisono, Martin J. Wainwright, Michael I. Jordan, John C. Duchi

Abstract: We consider derivative-free algorithms for stochastic optimization problems that use only noisy function values rather than gradients, analyzing their finite-sample convergence rates. We show that if pairs of function values are available, algorithms that √ gradient estimates based on random perturbations suffer a factor use of at most d in convergence rate over traditional stochastic gradient methods, where d is the problem dimension. We complement our algorithmic development with information-theoretic lower bounds on the minimax convergence rate of such problems, which show that our bounds are sharp with respect to all problemdependent quantities: they cannot be improved by more than constant factors. 1

4 0.65766716 282 nips-2012-Proximal Newton-type methods for convex optimization

Author: Jason Lee, Yuekai Sun, Michael Saunders

Abstract: We seek to solve convex optimization problems in composite form: minimize f (x) := g(x) + h(x), n x∈R where g is convex and continuously differentiable and h : Rn → R is a convex but not necessarily differentiable function whose proximal mapping can be evaluated efficiently. We derive a generalization of Newton-type methods to handle such convex but nonsmooth objective functions. We prove such methods are globally convergent and achieve superlinear rates of convergence in the vicinity of an optimal solution. We also demonstrate the performance of these methods using problems of relevance in machine learning and statistics. 1

5 0.64457071 300 nips-2012-Scalable nonconvex inexact proximal splitting

Author: Suvrit Sra

Abstract: We study a class of large-scale, nonsmooth, and nonconvex optimization problems. In particular, we focus on nonconvex problems with composite objectives. This class includes the extensively studied class of convex composite objective problems as a subclass. To solve composite nonconvex problems we introduce a powerful new framework based on asymptotically nonvanishing errors, avoiding the common stronger assumption of vanishing errors. Within our new framework we derive both batch and incremental proximal splitting algorithms. To our knowledge, our work is first to develop and analyze incremental nonconvex proximalsplitting algorithms, even if we were to disregard the ability to handle nonvanishing errors. We illustrate one instance of our general framework by showing an application to large-scale nonsmooth matrix factorization. 1

6 0.63023448 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

7 0.59912747 27 nips-2012-A quasi-Newton proximal splitting method

8 0.52224249 285 nips-2012-Query Complexity of Derivative-Free Optimization

9 0.49462521 76 nips-2012-Communication-Efficient Algorithms for Statistical Optimization

10 0.48577139 120 nips-2012-Exact and Stable Recovery of Sequences of Signals with Sparse Increments via Differential 1-Minimization

11 0.46713921 217 nips-2012-Mixability in Statistical Learning

12 0.43815827 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization

13 0.43258959 161 nips-2012-Interpreting prediction markets: a stochastic approach

14 0.42482737 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models

15 0.4167845 290 nips-2012-Recovery of Sparse Probability Measures via Convex Programming

16 0.40763941 16 nips-2012-A Polynomial-time Form of Robust Regression

17 0.40529847 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach

18 0.38194928 170 nips-2012-Large Scale Distributed Deep Networks

19 0.37819827 131 nips-2012-Feature Clustering for Accelerating Parallel Coordinate Descent

20 0.37742996 324 nips-2012-Stochastic Gradient Descent with Only One Projection


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.036), (21, 0.034), (38, 0.156), (39, 0.011), (42, 0.071), (53, 0.01), (54, 0.027), (55, 0.034), (66, 0.234), (74, 0.025), (76, 0.12), (80, 0.078), (92, 0.043)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.77286696 20 nips-2012-A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets

Author: Nicolas L. Roux, Mark Schmidt, Francis R. Bach

Abstract: We propose a new stochastic gradient method for optimizing the sum of a finite set of smooth functions, where the sum is strongly convex. While standard stochastic gradient methods converge at sublinear rates for this problem, the proposed method incorporates a memory of previous gradient values in order to achieve a linear convergence rate. In a machine learning context, numerical experiments indicate that the new algorithm can dramatically outperform standard algorithms, both in terms of optimizing the training error and reducing the test error quickly. 1

2 0.69000411 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound

Author: Chi Jin, Liwei Wang

Abstract: Margin is one of the most important concepts in machine learning. Previous margin bounds, both for SVM and for boosting, are dimensionality independent. A major advantage of this dimensionality independency is that it can explain the excellent performance of SVM whose feature spaces are often of high or infinite dimension. In this paper we address the problem whether such dimensionality independency is intrinsic for the margin bounds. We prove a dimensionality dependent PAC-Bayes margin bound. The bound is monotone increasing with respect to the dimension when keeping all other factors fixed. We show that our bound is strictly sharper than a previously well-known PAC-Bayes margin bound if the feature space is of finite dimension; and the two bounds tend to be equivalent as the dimension goes to infinity. In addition, we show that the VC bound for linear classifiers can be recovered from our bound under mild conditions. We conduct extensive experiments on benchmark datasets and find that the new bound is useful for model selection and is usually significantly sharper than the dimensionality independent PAC-Bayes margin bound as well as the VC bound for linear classifiers.

3 0.68648577 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes

Author: Jake Bouvrie, Jean-jeacques Slotine

Abstract: To learn reliable rules that can generalize to novel situations, the brain must be capable of imposing some form of regularization. Here we suggest, through theoretical and computational arguments, that the combination of noise with synchronization provides a plausible mechanism for regularization in the nervous system. The functional role of regularization is considered in a general context in which coupled computational systems receive inputs corrupted by correlated noise. Noise on the inputs is shown to impose regularization, and when synchronization upstream induces time-varying correlations across noise variables, the degree of regularization can be calibrated over time. The resulting qualitative behavior matches experimental data from visual cortex. 1

4 0.6859318 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method

Author: Nikhil Bhat, Vivek Farias, Ciamac C. Moallemi

Abstract: This paper presents a novel non-parametric approximate dynamic programming (ADP) algorithm that enjoys graceful approximation and sample complexity guarantees. In particular, we establish both theoretically and computationally that our proposal can serve as a viable alternative to state-of-the-art parametric ADP algorithms, freeing the designer from carefully specifying an approximation architecture. We accomplish this by developing a kernel-based mathematical program for ADP. Via a computational study on a controlled queueing network, we show that our procedure is competitive with parametric ADP approaches. 1

5 0.68528211 292 nips-2012-Regularized Off-Policy TD-Learning

Author: Bo Liu, Sridhar Mahadevan, Ji Liu

Abstract: We present a novel l1 regularized off-policy convergent TD-learning method (termed RO-TD), which is able to learn sparse representations of value functions with low computational complexity. The algorithmic framework underlying ROTD integrates two key ideas: off-policy convergent gradient TD methods, such as TDC, and a convex-concave saddle-point formulation of non-smooth convex optimization, which enables first-order solvers and feature selection using online convex regularization. A detailed theoretical and experimental analysis of RO-TD is presented. A variety of experiments are presented to illustrate the off-policy convergence, sparse feature selection capability and low computational cost of the RO-TD algorithm. 1

6 0.68503511 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization

7 0.6845839 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification

8 0.68427759 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

9 0.68384796 227 nips-2012-Multiclass Learning with Simplex Coding

10 0.68351221 275 nips-2012-Privacy Aware Learning

11 0.6830619 358 nips-2012-Value Pursuit Iteration

12 0.6821844 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

13 0.68147302 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models

14 0.68060726 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

15 0.68030798 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

16 0.68020844 236 nips-2012-Near-Optimal MAP Inference for Determinantal Point Processes

17 0.68001342 120 nips-2012-Exact and Stable Recovery of Sequences of Signals with Sparse Increments via Differential 1-Minimization

18 0.6798377 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

19 0.67975575 179 nips-2012-Learning Manifolds with K-Means and K-Flats

20 0.6789341 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications