nips nips2011 nips2011-46 knowledge-graph by maker-knowledge-mining

46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods


Source: pdf

Author: Andrew Cotter, Ohad Shamir, Nati Srebro, Karthik Sridharan

Abstract: Mini-batch algorithms have been proposed as a way to speed-up stochastic convex optimization problems. We study how such algorithms can be improved using accelerated gradient methods. We provide a novel analysis, which shows how standard gradient methods may sometimes be insufficient to obtain a significant speed-up and propose a novel accelerated gradient algorithm, which deals with this deficiency, enjoys a uniformly superior guarantee and works well in practice. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: Mini-batch algorithms have been proposed as a way to speed-up stochastic convex optimization problems. [sent-5, score-0.172]

2 We study how such algorithms can be improved using accelerated gradient methods. [sent-6, score-0.382]

3 We provide a novel analysis, which shows how standard gradient methods may sometimes be insufficient to obtain a significant speed-up and propose a novel accelerated gradient algorithm, which deals with this deficiency, enjoys a uniformly superior guarantee and works well in practice. [sent-7, score-0.604]

4 1 Introduction We consider a stochastic convex optimization problem of the form min_{w∈W} L(w), where L(w) = E_z[ℓ(w, z)], based on an empirical sample of instances z1, . . . , zm. [sent-8, score-0.2]

5 We assume that W is a convex subset of some Hilbert space (which in this paper, we will take to be Euclidean space), and ℓ is non-negative, convex and smooth in its first argument (i.e. H-smooth). [sent-12, score-0.129]

6 In recent years, there has been much interest in developing efficient first-order stochastic optimization methods for these problems, such as mirror descent [2, 6] and dual averaging [9, 16]. [sent-16, score-0.251]

7 These methods are characterized by incremental updates based on subgradients ∂ℓ(w, zi) of individual instances, and enjoy the advantages of being highly scalable and simple to implement. [sent-17, score-0.08]

8 A popular way to speed-up these algorithms, especially in a parallel setting, is via mini-batching, where the incremental update is performed on an average of the subgradients with respect to several instances at a time, rather than a single instance (i.e. an update based on (1/b) Σ_t ∂ℓ(w, zt) over a mini-batch of b instances). [sent-19, score-0.133]

9 The gradient computations for each mini-batch can be parallelized, allowing these methods to perform faster in a distributed framework (see for instance [11]). [sent-22, score-0.181]

10 Recently, [10] has shown that a mini-batching distributed framework is capable of attaining asymptotically optimal speed-up in general (see also [1]). [sent-23, score-0.072]

11 A parallel development has been the popularization of accelerated gradient descent methods [7, 8, 15, 5]. [sent-24, score-0.537]

12 In a deterministic optimization setting and for general smooth convex functions, these methods enjoy a rate of O(1/n²) (where n is the number of iterations) as opposed to O(1/n) using standard methods. [sent-25, score-0.185]

13 However, in a stochastic setting (which is the relevant one for learning problems), the rates of both approaches have an O(1/√n) dominant term in general, so the benefit of using accelerated methods for learning problems is not obvious. [sent-26, score-0.383]

14 Algorithm 1 (Stochastic Gradient Descent with Mini-Batching, SGD). Parameters: step size η, mini-batch size b. [sent-27, score-0.062]

15 Input: sample z1, . . . , zm. Initialize w1 = 0. For i = 1 to n = m/b: let ℓ̂i(wi) = (1/b) Σ_{t=b(i−1)+1}^{bi} ℓ(wi, zt); set w′_{i+1} := wi − η∇ℓ̂i(wi); set w_{i+1} := P_W(w′_{i+1}). End for. Return w̄ = (1/n) Σ_{i=1}^{n} wi. Algorithm 2 (Accelerated Gradient Method, AG). Parameters: step sizes (γi, βi), mini-batch size b. Input: sample z1, . . . , zm. [sent-31, score-0.875]
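To make the update above concrete, here is a minimal NumPy sketch of Algorithm 1 (mini-batch SGD with iterate averaging). The gradient oracle `grad_loss`, the projection `project`, the data container `Z` and the dimension argument are placeholders introduced only for illustration; the AG variant (Algorithm 2) is not sketched because its (γi, βi) update rules are not reproduced here.

```python
import numpy as np

def minibatch_sgd(grad_loss, project, Z, eta, b, dim):
    """Sketch of Algorithm 1: mini-batch SGD with iterate averaging.

    grad_loss(w, batch): average subgradient (1/b) * sum_t grad l(w, z_t) over a batch.
    project(w):          Euclidean projection of w onto the convex domain W.
    Z:                   sequence of m = n*b samples z_1, ..., z_m.
    """
    n = len(Z) // b                      # number of iterations n = m / b
    w = np.zeros(dim)                    # w_1 = 0
    w_sum = np.zeros(dim)
    for i in range(n):
        w_sum += w                       # the returned average is over w_1, ..., w_n
        batch = Z[i * b:(i + 1) * b]     # z_{b(i-1)+1}, ..., z_{bi}
        w = project(w - eta * grad_loss(w, batch))   # gradient step, then projection
    return w_sum / n                     # w-bar = (1/n) * sum_i w_i
```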

16 The main resulting message is that by using an appropriate accelerated method, we obtain significantly better stochastic optimization algorithms in terms of convergence speed. [sent-35, score-0.398]

17 Moreover, in certain regimes acceleration is actually necessary in order to allow significant speedups. [sent-36, score-0.114]

18 The potential benefit of acceleration to mini-batching has been briefly noted in [4], but here we study this issue in much more depth. [sent-37, score-0.034]

19 In particular, we make the following contributions: • We develop novel convergence bounds for the standard gradient method, which refine the results of [10, 4] by being dependent on L(w⋆), the expected loss of the best predictor in our class. [sent-38, score-0.289]

20 For example, we show that in the regime where the desired suboptimality is comparable to or larger than L(w⋆), including in the separable case L(w⋆) = 0, mini-batching does not lead to significant speed-ups with standard gradient methods. [sent-39, score-0.395]

21 • We develop a novel variant of the stochastic accelerated gradient method [5], which is optimized for a mini-batch framework and implicitly adaptive to L(w⋆). [sent-40, score-0.512]

22 • We provide an analysis of our accelerated algorithm, refining the analysis of [5] by being dependent on L(w⋆), and show how it always allows for significant speedups via mini-batching, in contrast to standard gradient methods. [sent-41, score-0.44]

23 2 Preliminaries As discussed in the introduction, we focus on a stochastic convex optimization problem, where we wish to minimize L(w) = E_z[ℓ(w, z)] over some convex domain W, using an i.i.d. sample z1, . . . , zm. [sent-44, score-0.223]

24 Throughout this paper we assume that the instantaneous loss ℓ(·, z) is convex, non-negative and H-smooth for each z ∈ Z. [sent-51, score-0.044]

25 We discuss two stochastic optimization approaches to deal with this problem: stochastic gradient descent (SGD), and accelerated gradient methods (AG). [sent-53, score-0.097]

26 The stochastic gradient descent algorithm (which in more general settings is known as mirror descent, e.g. [2, 6]). [sent-56, score-0.097]

27 For the SGD algorithm, we have E[L(w̄)] − L(w⋆) ≤ O(√(1/m) + b/m), whereas for an accelerated gradient algorithm, we have [5] E[L(w_n^ag)] − L(w⋆) ≤ O(√(1/m) + b²/m²). [sent-64, score-0.96]

28 Thus, as long as b = o(√m), both methods allow us to use a large mini-batch size b without significantly degrading the performance of either method. [sent-65, score-0.031]

29 This allows the number of iterations n = m/b to be smaller, potentially resulting in faster convergence speed. [sent-66, score-0.026]

30 However, these bounds do not show that accelerated methods have a significant advantage over the SGD algorithm, at least when b = o(√m), since both have the same first-order term 1/√m. [sent-67, score-0.029]

31 To understand the differences between these two methods better, we will need a more refined analysis, to which we now turn. [sent-68, score-0.045]

32 3 Convergence Guarantees The following theorems provide a refined convergence guarantee for the SGD algorithm and the AG algorithm, which improves on the analysis of [10, 4, 5] by being explicitly dependent on L(w⋆), the expected loss of the best predictor w⋆ in W. [sent-69, score-0.124]

33 Theorem 1. For the Stochastic Gradient Descent algorithm with an appropriately chosen step size η (depending on b, n, H, D and L(w⋆)), assuming L(0) ≤ HD², we get that E[L(w̄)] − L(w⋆) ≤ √(HD²L(w⋆)/(2bn)) + 2HD²/n + 9HD²/(bn). Theorem 2. [sent-73, score-0.135]

34 While L(w⋆) may not be known in advance, it does have the practical implication that choosing γi ∝ i^p for some p < 1 (as opposed to just choosing γi ∝ i as in [5]) might yield superior results. [sent-75, score-0.11]

35 This self-bounding property has been used in [14] in the online setting and in [13] in the stochastic setting to get better (faster) rates of convergence for non-negative smooth losses. [sent-77, score-0.291]

36 The implication of this observation is that for any w ∈ W, ‖∇L(w)‖² ≤ 4HL(w) and, for all z ∈ Z, ‖∇ℓ(w, z)‖² ≤ 4Hℓ(w, z). [sent-78, score-0.045]
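For intuition, here is a short derivation of the self-bounding property in the unconstrained case: applying H-smoothness at the point w − ∇f(w)/H and using non-negativity of f gives a constant of 2H, which in particular implies the 4H version quoted above.

```latex
% Self-bounding property of a non-negative H-smooth function f (unconstrained case):
% instantiate the smoothness upper bound at y = w - (1/H) \nabla f(w) and use f >= 0.
\[
0 \;\le\; f\!\Big(w - \tfrac{1}{H}\nabla f(w)\Big)
  \;\le\; f(w) - \tfrac{1}{H}\|\nabla f(w)\|^2 + \tfrac{H}{2}\Big\|\tfrac{1}{H}\nabla f(w)\Big\|^2
  \;=\; f(w) - \tfrac{1}{2H}\|\nabla f(w)\|^2,
\]
\[
\text{hence}\qquad \|\nabla f(w)\|^2 \;\le\; 2H\, f(w) \;\le\; 4H\, f(w).
\]
```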

37 The proof for the stochastic gradient descent bound is mainly based on the proof techniques in [5] and its extension to the mini-batch case in [10]. [sent-80, score-0.353]

38 In our analysis we further apply the self-bounding property to (4) and get that E[(1/n) Σ_{i=1}^{n} L(wi)] − L(w⋆) ≤ (16Hη/(b(n−1))) Σ_{i=1}^{n−1} E[L(wi)] + D²/(2η(n−1)); rearranging and setting η appropriately gives the final bound. [sent-83, score-0.103]

39 The proof of the accelerated method starts in a similar way as in [5]. [sent-85, score-0.251]

40 Furthermore, each E[‖∇L(wi) − ∇ℓ̂i(wi)‖²] is assumed to be bounded by some constant, and thus leads to the final bound provided in [5] by setting γ appropriately. [sent-87, score-0.35]

41 Hence we unroll the recursion to M steps and use this inequality for the remaining sum. [sent-90, score-0.079]

42 4 Optimizing with Mini-Batches To compare our two theorems and understand their implications, it will be convenient to treat H and D as constants, and focus on the more interesting parameters of sample size m, minibatch size b, and optimal expected loss L(w⋆). [sent-92, score-0.464]

43 Also, we will ignore the logarithmic factor in Theorem 2, since we will mostly be interested in significant (i.e. [sent-93, score-0.029]

44 polynomial) differences between the two algorithms, and it is quite possible that this logarithmic factor is merely an artifact of our analysis. [sent-95, score-0.029]

45 Using m = nb, we get that the bound for the SGD algorithm is E[L(w̄)] − L(w⋆) ≤ Õ(√(L(w⋆)/(bn)) + 1/n) = Õ(√(L(w⋆)/m) + b/m),   (5) and the bound for the accelerated gradient method we propose is E[L(w_n^ag)] − L(w⋆) ≤ Õ(√(L(w⋆)/(bn)) + 1/(√b·n) + 1/n²) = Õ(√(L(w⋆)/m) + √b/m + b²/m²).   (6) [sent-96, score-1.181]
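As a rough numerical illustration of (5) and (6) (treating H, D and all constants and log factors inside the Õ as 1, an assumption made purely for illustration), the following sketch shows how the two bounds behave as b grows for a fixed budget m:

```python
import math

def sgd_bound(m, b, L_star):
    # right-hand side of (5) with constants suppressed: sqrt(L*/m) + b/m
    return math.sqrt(L_star / m) + b / m

def ag_bound(m, b, L_star):
    # right-hand side of (6) with constants suppressed: sqrt(L*/m) + sqrt(b)/m + b^2/m^2
    return math.sqrt(L_star / m) + math.sqrt(b) / m + (b / m) ** 2

m, L_star = 10**6, 0.0                 # separable case L(w*) = 0
for b in (1, 10**2, 10**3, 10**4):
    print(b, sgd_bound(m, b, L_star), ag_bound(m, b, L_star))
# With L(w*) = 0, the SGD bound grows linearly in b, while the AG bound grows
# only like sqrt(b)/m until b approaches m^(2/3).
```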

46 To understand the implications of these bounds, we follow the approach described in [3, 12] to analyze large-scale learning algorithms. [sent-97, score-0.27]

47 First, we fix a desired suboptimality parameter ε, which measures how close to L(w⋆) we want to get. [sent-98, score-0.168]

48 Then, we assume that both algorithms are run until the suboptimality of their outputs is at most ε. [sent-99, score-0.207]

49 Our goal is to understand the runtime each algorithm needs, until attaining suboptimality ε, as a function of L(w⋆), ε, and b. [sent-100, score-0.404]

50 To measure this runtime, we need to discern two settings here: a parallel setting, where we assume that the mini-batch gradient computations are performed in parallel, and a serial setting, where the gradient computations are performed one after the other. [sent-101, score-0.504]

51 In a parallel setting, we can take the number of iterations n as a rough measure of the runtime (note that in both algorithms, the runtime of a single iteration is comparable). [sent-102, score-0.285]

52 In a serial setting, the relevant parameter is m, the number of data accesses. [sent-103, score-0.098]

53 To analyze the dependence on m and n, we upper bound (5) and (6) by ε, and invert them to get the bounds on m and n. [sent-104, score-0.117]

54 Ignoring logarithmic factors, for the SGD algorithm we get n ≤ (1/ε)·(L(w⋆)/(εb) + 1) and m ≤ (1/ε)·(L(w⋆)/ε + b),   (7) and for the AG algorithm we get n ≤ (1/ε)·(L(w⋆)/(εb) + 1/√b + √ε) and m ≤ (1/ε)·(L(w⋆)/ε + √b + b√ε).   (8) [sent-105, score-0.119]
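The same comparison can be made in terms of the parallel-runtime bounds (7) and (8); the sketch below (again with constants and log factors dropped, an assumption for illustration only) computes the implied speedup over b = 1 in the regime ε ≈ L(w⋆), where acceleration matters most:

```python
import math

def n_sgd(eps, b, L_star):
    # (7) with constants suppressed: n <= (1/eps) * (L*/(eps*b) + 1)
    return (1.0 / eps) * (L_star / (eps * b) + 1.0)

def n_ag(eps, b, L_star):
    # (8) with constants suppressed: n <= (1/eps) * (L*/(eps*b) + 1/sqrt(b) + sqrt(eps))
    return (1.0 / eps) * (L_star / (eps * b) + 1.0 / math.sqrt(b) + math.sqrt(eps))

eps, L_star = 1e-3, 1e-3               # regime eps ~ L(w*)
for b in (1, 10, 100, 1000):
    sgd_speedup = n_sgd(eps, 1, L_star) / n_sgd(eps, b, L_star)
    ag_speedup = n_ag(eps, 1, L_star) / n_ag(eps, b, L_star)
    print(b, round(sgd_speedup, 1), round(ag_speedup, 1))
# The SGD speedup saturates at a constant once b exceeds L(w*)/eps, whereas the
# AG speedup keeps growing roughly like sqrt(b).
```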

55 First, let us compare the performance of these two algorithms in the parallel setting, where the relevant parameter to measure runtime is n. [sent-106, score-0.179]

56 Analyzing which of the terms in each bound dominates, we get that for the SGD algorithm there are 2 regimes, while for the AG algorithm there are 2-3 regimes, depending on the relationship between L(w⋆) and ε. [sent-107, score-0.138]

57 However, in the AG algorithm, this linear speedup regime holds for much larger minibatch sizes. [sent-109, score-0.417]

58 Even beyond the linear speedup regime, the AG algorithm still maintains a √b speedup, for the reasonable case where ε ≥ L(w⋆)². [sent-110, score-0.07]

59 Finally, in all regimes, the runtime bound of the AG algorithm is equal to or significantly smaller than that of the SGD algorithm. [sent-111, score-0.149]

60 We now turn to discuss the serial setting, where the runtime is measured in terms of m. [sent-112, score-0.204]

61 Inspecting (7) and (8), we see that a larger mini-batch size b actually requires m to increase for both algorithms. [sent-113, score-0.061]

62 This is to be expected, since mini-batching does not lead to large gains in a serial setting. [sent-114, score-0.098]

63 However, using mini-batching in a serial setting might still be beneficial for implementation reasons, resulting in constant-factor improvements in runtime (e.g. [sent-115, score-0.239]

64 saving overhead and loop control, and via pipelining, concurrent memory accesses, etc.). [sent-117, score-0.023]

65 In that case, we can at least ask what is the largest mini-batch size that won’t degrade the runtime guarantee by more than a constant. [sent-119, score-0.17]

66 Using our bounds, the mini-batch size b for the SGD algorithm can scale as much as L(w⋆)/ε, vs. [sent-120, score-0.045]

67 Finally, an interesting point is that the AG algorithm is sometimes actually necessary to obtain significant speed-ups via a mini-batch framework (according to our bounds). [sent-122, score-0.03]

68 Based on the table above, this happens when the desired suboptimality ε is not much bigger than L(w⋆), i.e. [sent-123, score-0.168]

69 This includes the “separable” case, L(w⋆) = 0, and in general a regime where the “estimation error” ε and “approximation error” L(w⋆) are roughly the same, an arguably very relevant regime in machine learning. [sent-126, score-0.062]

70 So with SGD we get no non-constant parallel speedup. [sent-128, score-0.118]

71 However, with AG, we still enjoy a speedup of at least Θ(√b), all the way up to mini-batch size b = m^(2/3). [sent-129, score-0.149]

72 We used the smoothed hinge loss ℓ(w; x, y), defined as 0 if yw⊤x ≥ 1, as 1/2 − yw⊤x if yw⊤x ≤ 0, and as (1 − yw⊤x)²/2 otherwise. [sent-131, score-0.139]
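In code, the standard smoothed hinge loss and its gradient look as follows (this is the usual textbook form, assumed here to match the definition above; the exact constants of the paper's variant may differ). The snippet also numerically checks the self-bounding inequality from Section 3, which holds with H = 1 for unit-norm inputs.

```python
import numpy as np

def smoothed_hinge(w, x, y):
    """Standard smoothed hinge: 0 if margin >= 1, 0.5 - margin if margin <= 0,
    0.5 * (1 - margin)^2 in between (1-smooth for ||x|| <= 1)."""
    margin = y * np.dot(w, x)
    if margin >= 1.0:
        return 0.0
    if margin <= 0.0:
        return 0.5 - margin
    return 0.5 * (1.0 - margin) ** 2

def smoothed_hinge_grad(w, x, y):
    margin = y * np.dot(w, x)
    if margin >= 1.0:
        return np.zeros_like(w)
    if margin <= 0.0:
        return -y * x
    return -(1.0 - margin) * y * x

# Sanity check of the self-bounding property ||grad||^2 <= 4 * H * loss (H = 1 here).
rng = np.random.default_rng(0)
for _ in range(1000):
    w = rng.normal(size=5)
    x = rng.normal(size=5); x /= np.linalg.norm(x)
    y = rng.choice([-1.0, 1.0])
    g = smoothed_hinge_grad(w, x, y)
    assert np.dot(g, g) <= 4.0 * smoothed_hinge(w, x, y) + 1e-12
```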

73 Figure 1: Left: Test smoothed hinge loss, as a function of p, after training using the AG algorithm on 6361 examples from astro-physics, for various batch sizes. [sent-154, score-0.095]

74 The circled points are the theoretically-derived values p = ln b/(2 ln(n − 1)) (see Theorem 2). [sent-157, score-0.031]
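The circled values can be reproduced directly from the formula; a small sketch follows (the batch sizes below are illustrative choices, not necessarily the exact grid used in the figure, and m = 6361 is taken from the caption above):

```python
import math

m = 6361                               # astro-physics training examples (Figure 1)
for b in (1, 16, 64, 256):             # illustrative mini-batch sizes
    n = m // b                         # n = m/b iterations
    p = math.log(b) / (2.0 * math.log(n - 1))
    print(b, n, round(p, 3))
# For these sizes p stays below 1, matching the p < 1 regime suggested earlier in the text.
```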

75 The validation set was used to determine the step sizes η and γi . [sent-161, score-0.048]

76 In the implementation, we neglected the projection step, as we found it does not significantly affect performance when the stepsizes are properly selected. [sent-163, score-0.023]

77 In our first set of experiments, we attempted to determine the relationship between the performance of the AG algorithm and the p parameter, which determines the rate of increase of the step sizes γi . [sent-164, score-0.048]

78 Furthermore, there appears to be a weak trend towards higher p performing better for larger minibatch sizes b, which corresponds neatly with our theoretical predictions. [sent-168, score-0.356]

79 To do so, we varied the minibatch size b while holding the total amount of training data (m = nb) fixed. [sent-170, score-0.316]

80 When L(w⋆) > 0 (top row of Figure 5), the total sample size m is high and the suboptimality ε is low (red and black plots), we see that for small minibatch size, both methods do not degrade as we increase b, corresponding to a linear parallel speedup. [sent-171, score-0.59]

81 In fact, SGD is actually overall better, but as b increases, its performance degrades more quickly, eventually performing worse than AG. [sent-172, score-0.03]

82 That is, even in the least favorable scenario for AG (high L(w⋆) and small ε, see the tables in Sec. [sent-173, score-0.035]

83 4), it does give benefits with large enough minibatch sizes. [sent-174, score-0.285]

84 Further, we see that once the suboptimality ε is roughly equal to L(w⋆), AG significantly outperforms SGD, even with small minibatches, agreeing with theory. [sent-175, score-0.168]

85 6 Summary In this paper, we presented novel contributions to the theory of first order stochastic convex optimization (Theorems 1 and 2, generalizing results of [4] and [5] to be sensitive to L(w⋆)). [sent-177, score-0.205]

86 Figure 2: Test loss on astro-physics and CCAT as a function of mini-batch size b (in log-scale), where the total amount of training data m = nb is held fixed. [sent-208, score-0.126]

87 Solid lines and dashed lines are for SGD and AG respectively (for AG, we used p = ln b/(2 ln(n − 1)) as in Theorem 2). [sent-209, score-0.031]

88 The upper row shows the smoothed hinge loss on the test set, using the original (uncensored) data. [sent-210, score-0.139]

89 The bottom rows show the smoothed hinge loss and misclassification rate on the test set, using the modified data where L(w⋆) = 0. [sent-211, score-0.139]

90 A remaining open practical and theoretical question is whether the bound of Theorem 2 is tight. [sent-214, score-0.043]

91 Following [5], the bound is tight for b = 1 and b → ∞, i.e. [sent-215, score-0.043]

92 Mirror descent and nonlinear projected subgradient methods for convex optimization. [sent-227, score-0.156]

93 A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). [sent-254, score-0.077]

94 Dual averaging methods for regularized stochastic learning and online optimization. [sent-303, score-0.123]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ag', 0.578), ('wi', 0.327), ('sgd', 0.304), ('minibatch', 0.285), ('accelerated', 0.251), ('suboptimality', 0.168), ('md', 0.136), ('gradient', 0.131), ('wn', 0.126), ('runtime', 0.106), ('serial', 0.098), ('stochastic', 0.097), ('bn', 0.09), ('descent', 0.082), ('yw', 0.079), ('zt', 0.076), ('parallel', 0.073), ('speedup', 0.07), ('ccat', 0.068), ('di', 0.066), ('zm', 0.066), ('regime', 0.062), ('erent', 0.062), ('technological', 0.056), ('toyota', 0.056), ('unroll', 0.052), ('convex', 0.051), ('nb', 0.051), ('regimes', 0.05), ('smoothed', 0.05), ('pw', 0.05), ('sizes', 0.048), ('enjoy', 0.048), ('mirror', 0.048), ('chicago', 0.047), ('attaining', 0.046), ('cotter', 0.046), ('erences', 0.046), ('ib', 0.046), ('hinge', 0.045), ('implication', 0.045), ('get', 0.045), ('understand', 0.045), ('loss', 0.044), ('bound', 0.043), ('ip', 0.04), ('shamir', 0.04), ('karthik', 0.039), ('till', 0.039), ('srebro', 0.039), ('manipulations', 0.036), ('tables', 0.035), ('theorem', 0.035), ('setting', 0.035), ('sridharan', 0.034), ('acceleration', 0.034), ('separable', 0.034), ('novel', 0.033), ('dekel', 0.033), ('degrade', 0.033), ('bd', 0.032), ('ez', 0.032), ('speedups', 0.032), ('subgradients', 0.032), ('ln', 0.031), ('size', 0.031), ('actually', 0.03), ('bi', 0.029), ('logarithmic', 0.029), ('bounds', 0.029), ('theorems', 0.028), ('instances', 0.028), ('cant', 0.027), ('recursion', 0.027), ('smooth', 0.027), ('bene', 0.026), ('misclassi', 0.026), ('distributed', 0.026), ('online', 0.026), ('favorable', 0.026), ('convergence', 0.026), ('dependent', 0.026), ('superior', 0.025), ('computations', 0.024), ('signi', 0.024), ('optimization', 0.024), ('subgradient', 0.023), ('nal', 0.023), ('summarized', 0.023), ('appropriately', 0.023), ('re', 0.023), ('institute', 0.023), ('accesses', 0.023), ('discern', 0.023), ('ectively', 0.023), ('minibatches', 0.023), ('neatly', 0.023), ('stepsizes', 0.023), ('tradeo', 0.023), ('uncensored', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods

Author: Andrew Cotter, Ohad Shamir, Nati Srebro, Karthik Sridharan

Abstract: Mini-batch algorithms have been proposed as a way to speed-up stochastic convex optimization problems. We study how such algorithms can be improved using accelerated gradient methods. We provide a novel analysis, which shows how standard gradient methods may sometimes be insufficient to obtain a significant speed-up and propose a novel accelerated gradient algorithm, which deals with this deficiency, enjoys a uniformly superior guarantee and works well in practice. 1

2 0.21607222 271 nips-2011-Statistical Tests for Optimization Efficiency

Author: Levi Boyles, Anoop Korattikara, Deva Ramanan, Max Welling

Abstract: Learning problems, such as logistic regression, are typically formulated as pure optimization problems defined on some loss function. We argue that this view ignores the fact that the loss function depends on stochastically generated data which in turn determines an intrinsic scale of precision for statistical estimation. By considering the statistical properties of the update variables used during the optimization (e.g. gradients), we can construct frequentist hypothesis tests to determine the reliability of these updates. We utilize subsets of the data for computing updates, and use the hypothesis tests for determining when the batch-size needs to be increased. This provides computational benefits and avoids overfitting by stopping when the batch-size has become equal to size of the full dataset. Moreover, the proposed algorithms depend on a single interpretable parameter – the probability for an update to be in the wrong direction – which is set to a single value across all algorithms and datasets. In this paper, we illustrate these ideas on three L1 regularized coordinate descent algorithms: L1 -regularized L2 -loss SVMs, L1 -regularized logistic regression, and the Lasso, but we emphasize that the underlying methods are much more generally applicable. 1

3 0.18674001 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time

Author: Elad Hazan, Tomer Koren, Nati Srebro

Abstract: We present an optimization approach for linear SVMs based on a stochastic primal-dual approach, where the primal step is akin to an importance-weighted SGD, and the dual step is a stochastic update on the importance weights. This yields an optimization method with a sublinear dependence on the training set size, and the first method for learning linear SVMs with runtime less then the size of the training set required for learning! 1

4 0.17817393 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning

Author: Eric Moulines, Francis R. Bach

Abstract: We consider the minimization of a convex objective function defined on a Hilbert space, which is only available through unbiased estimates of its gradients. This problem includes standard machine learning algorithms such as kernel logistic regression and least-squares regression, and is commonly referred to as a stochastic approximation problem in the operations research community. We provide a non-asymptotic analysis of the convergence of two well-known algorithms, stochastic gradient descent (a.k.a. Robbins-Monro algorithm) as well as a simple modification where iterates are averaged (a.k.a. Polyak-Ruppert averaging). Our analysis suggests that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate in the strongly convex case, is not robust to the lack of strong convexity or the setting of the proportionality constant. This situation is remedied when using slower decays together with averaging, robustly leading to the optimal rate of convergence. We illustrate our theoretical results with simulations on synthetic and standard datasets.

5 0.14266963 121 nips-2011-Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

Author: Benjamin Recht, Christopher Re, Stephen Wright, Feng Niu

Abstract: Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve stateof-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performancedestroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented without any locking. We present an update scheme called H OGWILD ! which allows processors access to shared memory with the possibility of overwriting each other’s work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then H OGWILD ! achieves a nearly optimal rate of convergence. We demonstrate experimentally that H OGWILD ! outperforms alternative schemes that use locking by an order of magnitude. 1

6 0.13979286 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization

7 0.1177827 202 nips-2011-On the Universality of Online Mirror Descent

8 0.11154678 299 nips-2011-Variance Penalizing AdaBoost

9 0.10879168 72 nips-2011-Distributed Delayed Stochastic Optimization

10 0.10225647 127 nips-2011-Image Parsing with Stochastic Scene Grammar

11 0.091031224 87 nips-2011-Energetically Optimal Action Potentials

12 0.082175605 258 nips-2011-Sparse Bayesian Multi-Task Learning

13 0.077239551 217 nips-2011-Practical Variational Inference for Neural Networks

14 0.074109666 233 nips-2011-Rapid Deformable Object Detection using Dual-Tree Branch-and-Bound

15 0.072701424 29 nips-2011-Algorithms and hardness results for parallel large margin learning

16 0.07241524 91 nips-2011-Exploiting spatial overlap to efficiently compute appearance distances between image windows

17 0.071789339 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

18 0.066881642 78 nips-2011-Efficient Methods for Overlapping Group Lasso

19 0.064938441 178 nips-2011-Multiclass Boosting: Theory and Algorithms

20 0.062626071 198 nips-2011-On U-processes and clustering performance


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.167), (1, -0.052), (2, -0.059), (3, -0.087), (4, -0.005), (5, 0.089), (6, -0.015), (7, -0.022), (8, -0.202), (9, 0.093), (10, 0.083), (11, -0.071), (12, -0.159), (13, -0.177), (14, -0.058), (15, 0.213), (16, -0.086), (17, 0.068), (18, 0.001), (19, 0.119), (20, -0.066), (21, -0.022), (22, -0.158), (23, -0.016), (24, 0.013), (25, 0.176), (26, 0.036), (27, 0.158), (28, 0.005), (29, 0.059), (30, -0.042), (31, -0.008), (32, -0.005), (33, -0.042), (34, -0.065), (35, 0.077), (36, -0.094), (37, 0.014), (38, 0.05), (39, 0.014), (40, 0.032), (41, 0.008), (42, 0.015), (43, -0.006), (44, -0.1), (45, -0.061), (46, -0.017), (47, 0.061), (48, -0.112), (49, -0.042)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95485669 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods

Author: Andrew Cotter, Ohad Shamir, Nati Srebro, Karthik Sridharan

Abstract: Mini-batch algorithms have been proposed as a way to speed-up stochastic convex optimization problems. We study how such algorithms can be improved using accelerated gradient methods. We provide a novel analysis, which shows how standard gradient methods may sometimes be insufficient to obtain a significant speed-up and propose a novel accelerated gradient algorithm, which deals with this deficiency, enjoys a uniformly superior guarantee and works well in practice. 1

2 0.75787741 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning

Author: Eric Moulines, Francis R. Bach

Abstract: We consider the minimization of a convex objective function defined on a Hilbert space, which is only available through unbiased estimates of its gradients. This problem includes standard machine learning algorithms such as kernel logistic regression and least-squares regression, and is commonly referred to as a stochastic approximation problem in the operations research community. We provide a non-asymptotic analysis of the convergence of two well-known algorithms, stochastic gradient descent (a.k.a. Robbins-Monro algorithm) as well as a simple modification where iterates are averaged (a.k.a. Polyak-Ruppert averaging). Our analysis suggests that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate in the strongly convex case, is not robust to the lack of strong convexity or the setting of the proportionality constant. This situation is remedied when using slower decays together with averaging, robustly leading to the optimal rate of convergence. We illustrate our theoretical results with simulations on synthetic and standard datasets.

3 0.72725362 121 nips-2011-Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

Author: Benjamin Recht, Christopher Re, Stephen Wright, Feng Niu

Abstract: Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve stateof-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performancedestroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented without any locking. We present an update scheme called H OGWILD ! which allows processors access to shared memory with the possibility of overwriting each other’s work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then H OGWILD ! achieves a nearly optimal rate of convergence. We demonstrate experimentally that H OGWILD ! outperforms alternative schemes that use locking by an order of magnitude. 1

4 0.71207452 271 nips-2011-Statistical Tests for Optimization Efficiency

Author: Levi Boyles, Anoop Korattikara, Deva Ramanan, Max Welling

Abstract: Learning problems, such as logistic regression, are typically formulated as pure optimization problems defined on some loss function. We argue that this view ignores the fact that the loss function depends on stochastically generated data which in turn determines an intrinsic scale of precision for statistical estimation. By considering the statistical properties of the update variables used during the optimization (e.g. gradients), we can construct frequentist hypothesis tests to determine the reliability of these updates. We utilize subsets of the data for computing updates, and use the hypothesis tests for determining when the batch-size needs to be increased. This provides computational benefits and avoids overfitting by stopping when the batch-size has become equal to size of the full dataset. Moreover, the proposed algorithms depend on a single interpretable parameter – the probability for an update to be in the wrong direction – which is set to a single value across all algorithms and datasets. In this paper, we illustrate these ideas on three L1 regularized coordinate descent algorithms: L1 -regularized L2 -loss SVMs, L1 -regularized logistic regression, and the Lasso, but we emphasize that the underlying methods are much more generally applicable. 1

5 0.66476238 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time

Author: Elad Hazan, Tomer Koren, Nati Srebro

Abstract: We present an optimization approach for linear SVMs based on a stochastic primal-dual approach, where the primal step is akin to an importance-weighted SGD, and the dual step is a stochastic update on the importance weights. This yields an optimization method with a sublinear dependence on the training set size, and the first method for learning linear SVMs with runtime less then the size of the training set required for learning! 1

6 0.60011798 202 nips-2011-On the Universality of Online Mirror Descent

7 0.59990925 72 nips-2011-Distributed Delayed Stochastic Optimization

8 0.48525998 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization

9 0.42580742 299 nips-2011-Variance Penalizing AdaBoost

10 0.35933027 51 nips-2011-Clustered Multi-Task Learning Via Alternating Structure Optimization

11 0.3530097 78 nips-2011-Efficient Methods for Overlapping Group Lasso

12 0.32900754 282 nips-2011-The Fast Convergence of Boosting

13 0.32838899 80 nips-2011-Efficient Online Learning via Randomized Rounding

14 0.3277179 95 nips-2011-Fast and Accurate k-means For Large Datasets

15 0.32742804 4 nips-2011-A Convergence Analysis of Log-Linear Training

16 0.31851724 103 nips-2011-Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss

17 0.30763406 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation

18 0.3065069 255 nips-2011-Simultaneous Sampling and Multi-Structure Fitting with Adaptive Reversible Jump MCMC

19 0.29614258 182 nips-2011-Nearest Neighbor based Greedy Coordinate Descent

20 0.29523039 109 nips-2011-Greedy Model Averaging


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.027), (4, 0.023), (20, 0.016), (26, 0.013), (31, 0.042), (33, 0.025), (43, 0.06), (45, 0.124), (57, 0.03), (74, 0.043), (83, 0.029), (86, 0.012), (99, 0.453)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.97273445 248 nips-2011-Semi-supervised Regression via Parallel Field Regularization

Author: Binbin Lin, Chiyuan Zhang, Xiaofei He

Abstract: This paper studies the problem of semi-supervised learning from the vector field perspective. Many of the existing work use the graph Laplacian to ensure the smoothness of the prediction function on the data manifold. However, beyond smoothness, it is suggested by recent theoretical work that we should ensure second order smoothness for achieving faster rates of convergence for semisupervised regression problems. To achieve this goal, we show that the second order smoothness measures the linearity of the function, and the gradient field of a linear function has to be a parallel vector field. Consequently, we propose to find a function which minimizes the empirical error, and simultaneously requires its gradient field to be as parallel as possible. We give a continuous objective function on the manifold and discuss how to discretize it by using random points. The discretized optimization problem turns out to be a sparse linear system which can be solved very efficiently. The experimental results have demonstrated the effectiveness of our proposed approach. 1

2 0.9516567 54 nips-2011-Co-regularized Multi-view Spectral Clustering

Author: Abhishek Kumar, Piyush Rai, Hal Daume

Abstract: In many clustering problems, we have access to multiple views of the data each of which could be individually used for clustering. Exploiting information from multiple views, one can hope to find a clustering that is more accurate than the ones obtained using the individual views. Often these different views admit same underlying clustering of the data, so we can approach this problem by looking for clusterings that are consistent across the views, i.e., corresponding data points in each view should have same cluster membership. We propose a spectral clustering framework that achieves this goal by co-regularizing the clustering hypotheses, and propose two co-regularization schemes to accomplish this. Experimental comparisons with a number of baselines on two synthetic and three real-world datasets establish the efficacy of our proposed approaches.

3 0.95130229 50 nips-2011-Budgeted Optimization with Concurrent Stochastic-Duration Experiments

Author: Javad Azimi, Alan Fern, Xiaoli Z. Fern

Abstract: Budgeted optimization involves optimizing an unknown function that is costly to evaluate by requesting a limited number of function evaluations at intelligently selected inputs. Typical problem formulations assume that experiments are selected one at a time with a limited total number of experiments, which fail to capture important aspects of many real-world problems. This paper defines a novel problem formulation with the following important extensions: 1) allowing for concurrent experiments; 2) allowing for stochastic experiment durations; and 3) placing constraints on both the total number of experiments and the total experimental time. We develop both offline and online algorithms for selecting concurrent experiments in this new setting and provide experimental results on a number of optimization benchmarks. The results show that our algorithms produce highly effective schedules compared to natural baselines. 1

4 0.90139014 8 nips-2011-A Model for Temporal Dependencies in Event Streams

Author: Asela Gunawardana, Christopher Meek, Puyang Xu

Abstract: We introduce the Piecewise-Constant Conditional Intensity Model, a model for learning temporal dependencies in event streams. We describe a closed-form Bayesian approach to learning these models, and describe an importance sampling algorithm for forecasting future events using these models, using a proposal distribution based on Poisson superposition. We then use synthetic data, supercomputer event logs, and web search query logs to illustrate that our learning algorithm can efficiently learn nonlinear temporal dependencies, and that our importance sampling algorithm can effectively forecast future events. 1

5 0.88368404 270 nips-2011-Statistical Performance of Convex Tensor Decomposition

Author: Ryota Tomioka, Taiji Suzuki, Kohei Hayashi, Hisashi Kashima

Abstract: We analyze the statistical performance of a recently proposed convex tensor decomposition algorithm. Conventionally tensor decomposition has been formulated as non-convex optimization problems, which hindered the analysis of their performance. We show under some conditions that the mean squared error of the convex method scales linearly with the quantity we call the normalized rank of the true tensor. The current analysis naturally extends the analysis of convex low-rank matrix estimation to tensors. Furthermore, we show through numerical experiments that our theory can precisely predict the scaling behaviour in practice.

same-paper 6 0.86902612 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods

7 0.6275391 186 nips-2011-Noise Thresholds for Spectral Clustering

8 0.61655247 263 nips-2011-Sparse Manifold Clustering and Embedding

9 0.6032607 102 nips-2011-Generalised Coupled Tensor Factorisation

10 0.60268128 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators

11 0.59916478 121 nips-2011-Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

12 0.59737289 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration

13 0.59111774 29 nips-2011-Algorithms and hardness results for parallel large margin learning

14 0.5839318 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation

15 0.5738185 72 nips-2011-Distributed Delayed Stochastic Optimization

16 0.57307541 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time

17 0.56555045 198 nips-2011-On U-processes and clustering performance

18 0.56439203 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes

19 0.56004417 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning

20 0.55872953 157 nips-2011-Learning to Search Efficiently in High Dimensions