nips nips2011 nips2011-121 knowledge-graph by maker-knowledge-mining

121 nips-2011-Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent


Source: pdf

Author: Benjamin Recht, Christopher Re, Stephen Wright, Feng Niu

Abstract: Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented without any locking. We present an update scheme called HOGWILD! which allows processors access to shared memory with the possibility of overwriting each other’s work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then HOGWILD! achieves a nearly optimal rate of convergence. We demonstrate experimentally that HOGWILD! outperforms alternative schemes that use locking by an order of magnitude.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. [sent-12, score-0.291]

2 which allows processors access to shared memory with the possibility of overwriting each other’s work. [sent-15, score-0.347]

3 We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then HOGWILD! [sent-16, score-0.137]

4 outperforms alternative schemes that use locking by an order of magnitude. [sent-19, score-0.211]

5 Nevertheless, the recent emergence of inexpensive multicore processors and mammoth, web-scale data sets has motivated researchers to develop several clever parallelization schemes for SGD [4, 10, 12, 16, 27]. [sent-22, score-0.459]

6 In contrast, a typical MapReduce setup will read incoming data at rates less than tens of MB/s due to frequent checkpointing for fault tolerance. [sent-34, score-0.165]

7 The high rates achievable by multicore systems move the bottlenecks in parallel computation to synchronization (or locking) amongst the processors [2, 13]. [sent-35, score-0.527]

8 Thus, to enable scalable data analysis on a multicore machine, any performant solution must minimize the overhead of locking. [sent-36, score-0.176]

9 In this work, we propose a simple strategy for eliminating the overhead associated with locking: run SGD in parallel without locks, a strategy that we call HOGWILD! [sent-37, score-0.181]

10 Processors are allowed equal access to shared memory and are able to update individual components of memory at will. [sent-40, score-0.465]

11 Such a lock-free scheme might appear doomed to fail as processors could overwrite each other’s progress. [sent-41, score-0.257]

12 We demonstrate both theoretically and experimentally a near linear speedup with the number of processors on commonly occurring sparse learning problems. [sent-43, score-0.47]

13 In Section 2, we formalize a notion of sparsity that is sufficient to guarantee such a speedup and provide canonical examples of sparse machine learning problems in classification, collaborative filtering, and graph cuts. [sent-44, score-0.272]

14 We demonstrate that robust 1/k convergence rates are possible with constant stepsize schemes that implement an exponential back-off in the constant over time. [sent-47, score-0.228]

15 We show that all methods that propose memory locking are significantly slower than their respective lock-free counterparts on a variety of machine learning applications. [sent-51, score-0.269]

16 Each e is a small subset of {1, . . . , n} and xe denotes the values of the vector x on the coordinates indexed by e. [sent-57, score-0.214]

17 The key observation that underlies our lock-free approach is that the natural cost functions associated with many machine learning problems of interest are sparse in the sense that |E| and n are both very large but each individual fe acts only on a very small number of components of x. [sent-58, score-0.175]

18 That is, each subvector xe contains just a few components of x. [sent-59, score-0.281]

19 The cost function (1) induces a hypergraph G = (V, E) whose nodes are the individual components of x. [sent-60, score-0.121]

20 Each subvector xe induces an edge e ∈ E in the graph, consisting of some subset of nodes. [sent-61, score-0.274]
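As a rough illustration of this structure (a sketch, not taken from the paper; the names SparseTerm and SeparableCost are hypothetical), a separable cost f(x) = Σe fe(xe) can be represented by storing, for each term, only the indices of the components it touches; these index sets are exactly the hyperedges of the induced graph G.

#include <cstddef>
#include <vector>

// Hypothetical representation of one term fe: it only needs the indices of the
// few components of x that it touches (the hyperedge e), plus whatever per-term
// data defines fe (omitted here).
struct SparseTerm {
    std::vector<std::size_t> coords;  // the hyperedge e, a small subset of {0, ..., n-1}
};

// A separable cost f(x) = sum_e fe(xe) is then just a list of such terms; the
// hypergraph G = (V, E) has one node per component of x and one edge per term.
struct SeparableCost {
    std::size_t n;                    // dimension of the decision variable x
    std::vector<SparseTerm> terms;    // the edge set E
};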

21 In the matrix completion problem, we are provided entries of a low-rank, nr × nc matrix Z from the index set E. [sent-77, score-0.168]

22 A popular heuristic recovers the estimate of Z as a product LR* of factors obtained from the following minimization: minimize_(L,R) Σ_{(u,v)∈E} (Lu Rv* − Zuv)^2 + (µ/2)||L||_F^2 + (µ/2)||R||_F^2, (4) where L is nr × r, R is nc × r and Lu (resp. [sent-80, score-0.116]
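A minimal sketch of the corresponding stochastic gradient step for one observed entry (u, v) is given below; the function name, the row-major storage, and the way the µ-regularizer is apportioned to a single sampled term are illustrative assumptions, not the authors' code. Only row u of L and row v of R are read or written, which is what makes this problem sparse in the sense above.

#include <cstddef>
#include <vector>

// One SGD step on the factorization objective (4) for a single observed entry
// (u, v, Zuv).  L is nr x r and R is nc x r, stored row-major; only row u of L
// and row v of R are touched.  The per-step share of the mu-regularizer is a
// simplification for illustration.
void sgd_step_entry(std::vector<double>& L, std::vector<double>& R,
                    std::size_t r, std::size_t u, std::size_t v,
                    double Zuv, double mu, double gamma) {
    double* Lu = &L[u * r];
    double* Rv = &R[v * r];
    double err = -Zuv;                                  // residual  Lu . Rv - Zuv
    for (std::size_t k = 0; k < r; ++k) err += Lu[k] * Rv[k];
    for (std::size_t k = 0; k < r; ++k) {
        double lu = Lu[k], rv = Rv[k];
        Lu[k] -= gamma * (2.0 * err * rv + mu * lu);    // gradient of the sampled term w.r.t. Lu
        Rv[k] -= gamma * (2.0 * err * lu + mu * rv);    // gradient of the sampled term w.r.t. Rv
    }
}

Sampling entries (u, v) uniformly from E and applying this step from several threads, with the componentwise writes made atomic as discussed later in this summary, gives a lock-free scheme of the kind the paper analyzes.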

23 Here, two-way cuts use D = 2, but multiway-cuts with tens of thousands of classes also arise in entity resolution problems [18]. [sent-97, score-0.123]

24 Several works (e.g., [7]) propose to minimize the cost function minimize_x Σ_{(u,v)∈E} wuv ||xu − xv||_1 subject to xv ∈ SD for v = 1, . . . , n. (5) [sent-101, score-0.298]

25 In all three of the preceding examples, the number of components involved in a particular term fe is a small fraction of the total number of entries. [sent-105, score-0.138]

26 This follows from a coupon collector argument [8]. [sent-119, score-0.15]

27 We now describe a simple protocol that achieves a linear speedup in the number of processors when Ω, Δ, and ρ are relatively small. [sent-123, score-0.479]

28 We assume a shared memory model with p processors. [sent-126, score-0.118]

29 Each processor can read x, and can contribute an update vector to x. [sent-128, score-0.148]

30 Algorithm 1 (HOGWILD! update for individual processors): 1: loop; 2: sample e uniformly at random from E; 3: read the current state xe and evaluate Ge(xe); 4: for v ∈ e do xv ← xv − γ Gev(xe); 5: end loop. [sent-129, score-0.651]

31 The vector x is stored in shared memory, and we assume that the componentwise addition operation is atomic, that is, xv ← xv + a can be performed atomically by any processor for a scalar a and v ∈ {1, . . . , n}. [sent-130, score-0.369]

32 This operation does not require a separate locking structure on most modern hardware: such an operation is a single atomic instruction on GPUs and DSPs, and it can be implemented via a compare-and-exchange operation on a general purpose multicore processor like the Intel Nehalem. [sent-134, score-0.477]
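As a rough sketch of such an atomic componentwise addition in C++ (an illustration of the idea, not the authors' implementation), xv ← xv + a can be written as a compare-and-exchange loop on a std::atomic<double>:

#include <atomic>

// Atomically perform xv <- xv + a on a shared double using a
// compare-and-exchange loop; no separate lock structure is needed.
inline void atomic_add(std::atomic<double>& xv, double a) {
    double old_val = xv.load(std::memory_order_relaxed);
    // The exchange fails (and refreshes old_val) if another processor wrote
    // to xv in the meantime; we simply retry with the new value.
    while (!xv.compare_exchange_weak(old_val, old_val + a,
                                     std::memory_order_relaxed)) {
        // old_val now holds the current value of xv; loop and try again.
    }
}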

33 In contrast, the operation of updating many components at once requires an auxiliary locking structure. [sent-135, score-0.224]

34 Let Ge (xe ) denote a gradient or subgradient of the function fe multiplied by |E|. [sent-137, score-0.209]

35 In Algorithm 1, each processor samples a term e ∈ E uniformly at random, computes the gradient of fe at xe, and then writes xv ← xv − γ Gev(xe) for each v ∈ e. [sent-141, score-0.718]
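A self-contained sketch of this worker loop is given below for a toy separable least-squares cost; the Term struct, the thread setup, and all names are illustrative assumptions rather than the authors' code, and the point is only the shape of the lock-free loop (the gradient is scaled by |E| as described above).

#include <atomic>
#include <cstddef>
#include <functional>
#include <random>
#include <thread>
#include <vector>

// One sparse term fe(xe) = (a_e . xe - b_e)^2: it touches only the
// coordinates listed in coords (the hyperedge e).
struct Term {
    std::vector<std::size_t> coords;
    std::vector<double> a;
    double b;
};

// Atomic componentwise add xv <- xv + delta (compare-and-exchange loop, as sketched above).
static void atomic_add(std::atomic<double>& xv, double delta) {
    double old_val = xv.load(std::memory_order_relaxed);
    while (!xv.compare_exchange_weak(old_val, old_val + delta,
                                     std::memory_order_relaxed)) {}
}

// Algorithm 1 (sketch): each worker repeatedly samples a term e uniformly at
// random, reads xe, evaluates the scaled gradient Ge(xe), and writes
// xv <- xv - gamma * Gev(xe) for every v in e, without any locking.
static void hogwild_worker(std::vector<std::atomic<double>>& x,
                           const std::vector<Term>& terms,
                           double gamma, std::size_t num_updates,
                           unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, terms.size() - 1);
    for (std::size_t t = 0; t < num_updates; ++t) {
        const Term& e = terms[pick(rng)];
        // Read the current state xe (possibly stale relative to other workers).
        double residual = -e.b;
        for (std::size_t k = 0; k < e.coords.size(); ++k)
            residual += e.a[k] * x[e.coords[k]].load(std::memory_order_relaxed);
        // Scale by |E| so the sampled gradient is an unbiased estimate of the full gradient.
        double scale = 2.0 * residual * static_cast<double>(terms.size());
        for (std::size_t k = 0; k < e.coords.size(); ++k)
            atomic_add(x[e.coords[k]], -gamma * scale * e.a[k]);
    }
}

// Launch p workers that share the same decision variable x.
void hogwild_run(std::vector<std::atomic<double>>& x,
                 const std::vector<Term>& terms,
                 double gamma, std::size_t updates_per_worker, unsigned p) {
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < p; ++i)
        workers.emplace_back(hogwild_worker, std::ref(x), std::cref(terms),
                             gamma, updates_per_worker, 1234u + i);
    for (auto& w : workers) w.join();
}

A caller would allocate the shared variable once, e.g. std::vector<std::atomic<double>> x(n), initialize it, and pass it to hogwild_run together with the term list, the stepsize, and the number of worker threads p.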

36 We assume that the stepsize γ is a fixed constant. [sent-142, score-0.137]

37 Note that the processor modifies only the variables indexed by e, leaving all of the components in ¬e (i.e., not in e) alone. [sent-143, score-0.125]

38 Even though the processors have no knowledge as to whether any of the other processors have modified x, we define xj to be the state of the decision variable x after j updates have been performed. [sent-146, score-0.486]

39 Since two processors can write to x at the same time, we need to be a bit careful with this definition, but we simply break ties at random. [sent-147, score-0.229]

40 Note that xj is generally updated with a stale gradient, which is based on a value of x read many clock cycles earlier. [sent-148, score-0.149]

41 We use xk(j) to denote the value of the decision variable used to compute the gradient or subgradient that yields the state xj . [sent-149, score-0.137]

42 Moreover, we show that if the hypergraph induced by f is isotropic and sparse, then this algorithm converges in nearly the same number of gradient steps as its serial counterpart. [sent-151, score-0.401]

43 Since we are running in parallel and without locks, this means that we get a nearly linear speedup in terms of the number of processors. [sent-152, score-0.362]

44 4 Fast Rates for Lock-Free Parallelism To state our theoretical results, we must describe several quantities that are important in the analysis of our parallel stochastic gradient descent scheme. [sent-153, score-0.294]

45 (Indeed, when γc > 1, even the ordinary gradient descent algorithms will diverge.) [sent-167, score-0.144]

46 In the case that τ = 0, this reduces to precisely the rate achieved by the serial SGD protocol. [sent-177, score-0.158]

47 In our setting, τ is proportional to the number of processors, and hence as long as the number of processors is less than n^(1/4), we get nearly the same recursion as in the linear rate. [sent-179, score-0.28]

48 Note that up to the log(1/ε) term in (12), our analysis nearly provides a 1/k rate of convergence for a constant stepsize SGD scheme, both in the serial and parallel cases. [sent-180, score-0.453]

49 In contrast, Nemirovski et al demonstrate that when the stepsize is inversely proportional to the iteration counter, an overestimate of c can result in exponential slow-down [21]! [sent-182, score-0.164]

50 We note that a 1/k rate can be achieved by a slightly more complicated protocol where the stepsize is slowly decreased after a large number of iterations. [sent-184, score-0.183]

51 Suppose we run Algorithm 1 for a fixed number of gradient updates K with stepsize γ < 1/c. [sent-185, score-0.276]

52 Then, we wait for the threads to coalesce, reduce γ by a constant factor β ∈ (0, 1), and run for K iterations. [sent-186, score-0.196]

53 In some sense, this piecewise constant stepsize protocol approximates a 1/k diminishing stepsize. [sent-188, score-0.183]
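A sketch of this backoff protocol as a driver loop follows; the names run_epoch, beta, and rounds are hypothetical, and run_epoch stands in for launching the lock-free workers (e.g. something like the hogwild_run sketch above) for K updates.

#include <cstddef>
#include <cstdio>

// Stand-in for running Algorithm 1 for K updates at stepsize gamma; here only
// a stub so the sketch compiles.
static void run_epoch(double gamma, std::size_t K) {
    std::printf("epoch: gamma=%g, K=%zu\n", gamma, K);
}

// Backoff protocol: run for K gradient updates with a fixed stepsize gamma < 1/c,
// wait for the threads to coalesce, shrink gamma by a constant factor beta in
// (0, 1), and run again.  This piecewise-constant schedule approximates a 1/k
// diminishing stepsize.
void backoff_schedule(double gamma, double beta, std::size_t K, std::size_t rounds) {
    for (std::size_t r = 0; r < rounds; ++r) {
        run_epoch(gamma, K);   // all threads join before the stepsize changes
        gamma *= beta;
    }
}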

54 Always working with small stepsizes allows us to avoid the possible exponential slow-downs that occur with standard diminishing stepsize schemes. [sent-190, score-0.17]

55 5 Related Work Most schemes for parallelizing stochastic gradient descent are variants of ideas presented in the seminal text by Bertsekas and Tsitsiklis [4]. [sent-191, score-0.294]

56 For instance, in this text, they describe using stale gradient updates computed across many computers in a master-worker setting and describe settings where different processors control access to particular components of the decision variable. [sent-192, score-0.435]

57 Recently, a variety of parallel schemes have been proposed in a variety of contexts. [sent-195, score-0.246]

58 In MapReduce settings, Zinkevich et al proposed running many instances of stochastic gradient descent on different machines and averaging their output [27]. [sent-196, score-0.255]

59 Though the authors claim this method can reduce both the variance of their estimate and the overall bias, we show in our experiments that for the sorts of problems we are concerned with, this method does not outperform a serial scheme. [sent-197, score-0.158]

60 Schemes involving the averaging of gradients via a distributed protocol have also been proposed by several authors [10, 12]. [sent-198, score-0.153]

61 Such schemes are difficult to implement efficiently on multicore machines as they require massive communication overhead. [sent-247, score-0.132]

62 Distributed averaging of gradients requires message passing between the cores, and the cores need to synchronize frequently in order to compute reasonable gradient averages. [sent-248, score-0.266]

63 In this scheme, the processors are ordered and each updates the decision variable in order. [sent-250, score-0.257]

64 When the time required to lock memory for writing is dwarfed by the gradient computation time, this method results in a linear speedup, as the errors induced by the lag in the gradients are not too severe. [sent-251, score-0.255]

65 However, we note that in many applications of interest in machine learning, gradient computation time is incredibly fast, and we now demonstrate that in a variety of applications, HOGWILD! [sent-252, score-0.148]

66 One notable change in RR from the Vowpal Wabbit software release is that we optimized RR’s locking and signaling mechanisms to use spinlocks and busy waits (there is no need for generic signaling to implement round robin). [sent-258, score-0.15]

67 We verified that this optimization results in nearly an order of magnitude increase in wall clock speed for all problems that we discuss. [sent-259, score-0.143]

68 Our experiments demonstrate that even this fine-grained locking induces undesirable slow-downs. [sent-264, score-0.15]

69 This allows us to read data from the RAID at a rate of nearly 1 GB/s. [sent-272, score-0.169]

70 All of the experiments use a constant stepsize which is diminished by a factor at the end of each pass over the training set. [sent-273, score-0.137]

71 We swapped the training set and the test set for our experiments to demonstrate the scalability of the parallel multicore algorithms. [sent-282, score-0.239]

72 is able to achieve a factor of 3 speedup while RR gets worse as more threads are added. [sent-289, score-0.37]

73 Indeed, for fast gradients, RR is worse than a serial implementation. [sent-290, score-0.158]

74 In Figure 3(b), we display the train error of the ensemble average across parallel threads at the end of each pass over the data. [sent-292, score-0.273]

75 We note that the threads only communicate at the very end of the computation, but we want to demonstrate the effect of parallelization on train error. [sent-293, score-0.203]

76 Each of the parallel threads touches every data example in each pass. [sent-294, score-0.273]

77 Thus, the 10 thread run does 10x more gradient computations than the serial version. [sent-295, score-0.326]

78 Here, the error is the same whether we run in serial or with ten instances. [sent-296, score-0.188]

79 We conclude that on this problem, there is no advantage to running in parallel with this averaging scheme. [sent-297, score-0.148]

80 Similar to the other experiments with quickly computable gradients, RR does not show any improvement over a serial approach. [sent-314, score-0.158]

81 In fact, with 10 threads, RR is 12% slower than serial on KDD Cup and 62% slower on Netflix. [sent-315, score-0.158]

82 We computed a two-way cut of the abdomen data set [1]. [sent-319, score-0.117]

83 speeds up the cut problem by more than a factor of 4 with 10 threads, while RR is twice as slow as the serial version. [sent-324, score-0.244]

84 Our second graph cut problem sought a multi-way cut to determine entity recognition in a large database of web data. [sent-325, score-0.196]

85 We created a data set of clean entity lists from the DBLife website and of entity mentions from the DBLife Web Crawl [11]. [sent-326, score-0.141]

86 In this problem each stochastic gradient step must compute a Euclidean projection onto a simplex of dimension 18,167. [sent-328, score-0.152]
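For concreteness, a standard sort-based routine for the Euclidean projection onto the unit simplex is sketched below; this is a generic method assumed here for illustration rather than taken from the paper, and the simplex used in the experiment may be scaled differently.

#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Euclidean projection of v onto the probability simplex
// { x : x_i >= 0, sum_i x_i = 1 }, via the usual sort-and-threshold method.
std::vector<double> project_to_simplex(std::vector<double> v) {
    std::vector<double> u(v);
    std::sort(u.begin(), u.end(), std::greater<double>());
    double cumsum = 0.0, theta = 0.0;
    for (std::size_t j = 0; j < u.size(); ++j) {
        cumsum += u[j];
        double t = (cumsum - 1.0) / static_cast<double>(j + 1);
        if (u[j] - t > 0.0) theta = t;   // keep the threshold for the largest valid index
    }
    for (double& vi : v) vi = std::max(vi - theta, 0.0);
    return v;
}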

87 As a result, the individual stochastic gradient steps are quite slow. [sent-329, score-0.152]

88 Since the gradients are slow, RR is able to achieve a parallel speedup for this problem; however, the speedup with ten processors is only by a factor of 5. [sent-335, score-0.81]

89 In all three cases, massive speedup is achieved via parallelism. [sent-347, score-0.204]

90 (c) Speedup achieved over serial method for various levels of delays (measured in nanoseconds). [sent-349, score-0.158]

91 As we saw with the DBLIFE data set, the RR method does get a nearly linear speedup when the gradient computation is slow. [sent-351, score-0.364]

92 To answer this question, we ran the RCV1 experiment again and introduced an artificial delay at the end of each gradient computation to simulate a slow gradient. [sent-354, score-0.201]

93 In Figure 3(c), we plot the wall clock time required to solve the SVM problem as we vary the delay for both the RR and H OGWILD ! [sent-355, score-0.139]

94 At this rate, one is only able to compute about a million stochastic gradients per hour, so the gradient computations must be very labor intensive in order for the RR method to be competitive. [sent-362, score-0.218]

95 Moreover, our algorithms allow parallel speedup even when the gradients are computationally intensive. [sent-367, score-0.377]

96 For future work, it would be of interest to enumerate structures that allow for parallel gradient computations with no collisions at all. [sent-373, score-0.216]

97 Acknowledgements BR is generously supported by ONR award N00014-11-1-0723 and NSF award CCF-1139953. [sent-376, score-0.138]

98 FA8750-09-C-0181, the NSF CAREER award under IIS-1054009, ONR award N000141210041, and gifts or research awards from Google, LogicBlox, and Johnson Controls, Inc. [sent-378, score-0.117]

99 SJW is generously supported by NSF awards DMS-0914524 and DMS-0906818 and DOE award DE-SC0002283. [sent-379, score-0.123]

100 : A lock-free approach to parallelizing stochastic gradient descent. [sent-527, score-0.198]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ogwild', 0.567), ('processors', 0.229), ('xe', 0.214), ('rr', 0.209), ('speedup', 0.204), ('sgd', 0.186), ('threads', 0.166), ('serial', 0.158), ('locking', 0.15), ('stepsize', 0.137), ('mapreduce', 0.133), ('aig', 0.132), ('multicore', 0.132), ('gradient', 0.109), ('parallel', 0.107), ('xv', 0.104), ('fe', 0.1), ('hogwild', 0.095), ('jumbo', 0.095), ('processor', 0.087), ('hypergraph', 0.083), ('memory', 0.08), ('abdomen', 0.076), ('dblife', 0.076), ('nr', 0.075), ('ge', 0.068), ('gradients', 0.066), ('schemes', 0.061), ('read', 0.061), ('speedups', 0.058), ('clock', 0.057), ('minimizex', 0.057), ('raid', 0.057), ('entity', 0.055), ('completion', 0.052), ('kdd', 0.052), ('nearly', 0.051), ('cores', 0.05), ('generously', 0.05), ('locks', 0.05), ('delay', 0.047), ('recht', 0.047), ('protocol', 0.046), ('parallelizing', 0.046), ('slow', 0.045), ('award', 0.044), ('overhead', 0.044), ('fault', 0.043), ('epoch', 0.043), ('stochastic', 0.043), ('averaging', 0.041), ('rv', 0.041), ('nc', 0.041), ('cut', 0.041), ('variety', 0.039), ('shared', 0.038), ('dremel', 0.038), ('gev', 0.038), ('nanoseconds', 0.038), ('zuv', 0.038), ('components', 0.038), ('parallelization', 0.037), ('nemirovski', 0.037), ('cuts', 0.037), ('sparse', 0.037), ('operation', 0.036), ('wall', 0.035), ('bertsekas', 0.035), ('descent', 0.035), ('wuv', 0.033), ('stepsizes', 0.033), ('vowpal', 0.033), ('wabbit', 0.033), ('splits', 0.033), ('svm', 0.031), ('graph', 0.031), ('tens', 0.031), ('stale', 0.031), ('niu', 0.031), ('mentions', 0.031), ('run', 0.03), ('rates', 0.03), ('awards', 0.029), ('thread', 0.029), ('string', 0.029), ('subvector', 0.029), ('robin', 0.029), ('synchronization', 0.029), ('afrl', 0.029), ('belmont', 0.029), ('scheme', 0.028), ('lu', 0.028), ('decision', 0.028), ('web', 0.028), ('zinkevich', 0.027), ('asynchronous', 0.027), ('coded', 0.027), ('al', 0.027), ('tsitsiklis', 0.026), ('parallelized', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 121 nips-2011-Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

Author: Benjamin Recht, Christopher Re, Stephen Wright, Feng Niu

Abstract: Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented without any locking. We present an update scheme called HOGWILD! which allows processors access to shared memory with the possibility of overwriting each other’s work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then HOGWILD! achieves a nearly optimal rate of convergence. We demonstrate experimentally that HOGWILD! outperforms alternative schemes that use locking by an order of magnitude.

2 0.15591924 271 nips-2011-Statistical Tests for Optimization Efficiency

Author: Levi Boyles, Anoop Korattikara, Deva Ramanan, Max Welling

Abstract: Learning problems, such as logistic regression, are typically formulated as pure optimization problems defined on some loss function. We argue that this view ignores the fact that the loss function depends on stochastically generated data which in turn determines an intrinsic scale of precision for statistical estimation. By considering the statistical properties of the update variables used during the optimization (e.g. gradients), we can construct frequentist hypothesis tests to determine the reliability of these updates. We utilize subsets of the data for computing updates, and use the hypothesis tests for determining when the batch-size needs to be increased. This provides computational benefits and avoids overfitting by stopping when the batch-size has become equal to size of the full dataset. Moreover, the proposed algorithms depend on a single interpretable parameter – the probability for an update to be in the wrong direction – which is set to a single value across all algorithms and datasets. In this paper, we illustrate these ideas on three L1 regularized coordinate descent algorithms: L1 -regularized L2 -loss SVMs, L1 -regularized logistic regression, and the Lasso, but we emphasize that the underlying methods are much more generally applicable. 1

3 0.14266963 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods

Author: Andrew Cotter, Ohad Shamir, Nati Srebro, Karthik Sridharan

Abstract: Mini-batch algorithms have been proposed as a way to speed-up stochastic convex optimization problems. We study how such algorithms can be improved using accelerated gradient methods. We provide a novel analysis, which shows how standard gradient methods may sometimes be insufficient to obtain a significant speed-up and propose a novel accelerated gradient algorithm, which deals with this deficiency, enjoys a uniformly superior guarantee and works well in practice. 1

4 0.14081709 72 nips-2011-Distributed Delayed Stochastic Optimization

Author: Alekh Agarwal, John C. Duchi

Abstract: We analyze the convergence of gradient-based optimization algorithms whose updates depend on delayed stochastic gradient information. The main application of our results is to the development of distributed minimization algorithms where a master node performs parameter updates while worker nodes compute stochastic gradients based on local information in parallel, which may give rise to delays due to asynchrony. Our main contribution is to show that for smooth stochastic problems, the delays are asymptotically negligible. In application to distributed optimization, we show n-node architectures whose optimization error in stochastic problems—in spite of asynchronous delays—scales asymptotically as O(1/√(nT)), which is known to be optimal even in the absence of delays.

5 0.12169828 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning

Author: Eric Moulines, Francis R. Bach

Abstract: We consider the minimization of a convex objective function defined on a Hilbert space, which is only available through unbiased estimates of its gradients. This problem includes standard machine learning algorithms such as kernel logistic regression and least-squares regression, and is commonly referred to as a stochastic approximation problem in the operations research community. We provide a non-asymptotic analysis of the convergence of two well-known algorithms, stochastic gradient descent (a.k.a. Robbins-Monro algorithm) as well as a simple modification where iterates are averaged (a.k.a. Polyak-Ruppert averaging). Our analysis suggests that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate in the strongly convex case, is not robust to the lack of strong convexity or the setting of the proportionality constant. This situation is remedied when using slower decays together with averaging, robustly leading to the optimal rate of convergence. We illustrate our theoretical results with simulations on synthetic and standard datasets.

6 0.11732866 29 nips-2011-Algorithms and hardness results for parallel large margin learning

7 0.082061604 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time

8 0.061982248 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition

9 0.058425233 8 nips-2011-A Model for Temporal Dependencies in Event Streams

10 0.053508032 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

11 0.050558619 211 nips-2011-Penalty Decomposition Methods for Rank Minimization

12 0.048298668 165 nips-2011-Matrix Completion for Multi-label Image Classification

13 0.047929224 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization

14 0.047389142 202 nips-2011-On the Universality of Online Mirror Descent

15 0.045632135 229 nips-2011-Query-Aware MCMC

16 0.045558482 55 nips-2011-Collective Graphical Models

17 0.045054857 95 nips-2011-Fast and Accurate k-means For Large Datasets

18 0.044958789 233 nips-2011-Rapid Deformable Object Detection using Dual-Tree Branch-and-Bound

19 0.0444928 261 nips-2011-Sparse Filtering

20 0.043647416 119 nips-2011-Higher-Order Correlation Clustering for Image Segmentation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.151), (1, -0.014), (2, -0.039), (3, -0.05), (4, -0.037), (5, 0.077), (6, -0.042), (7, -0.02), (8, -0.118), (9, 0.018), (10, 0.068), (11, -0.056), (12, -0.088), (13, -0.128), (14, 0.001), (15, 0.145), (16, -0.038), (17, 0.056), (18, -0.023), (19, 0.056), (20, -0.043), (21, -0.012), (22, -0.092), (23, -0.123), (24, -0.032), (25, 0.072), (26, -0.015), (27, 0.047), (28, 0.03), (29, 0.091), (30, -0.017), (31, -0.042), (32, -0.047), (33, -0.077), (34, -0.094), (35, 0.025), (36, -0.013), (37, 0.033), (38, 0.039), (39, -0.033), (40, 0.036), (41, 0.108), (42, -0.073), (43, -0.009), (44, -0.08), (45, -0.025), (46, 0.023), (47, -0.001), (48, 0.053), (49, -0.043)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91457295 121 nips-2011-Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

Author: Benjamin Recht, Christopher Re, Stephen Wright, Feng Niu

Abstract: Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented without any locking. We present an update scheme called HOGWILD! which allows processors access to shared memory with the possibility of overwriting each other’s work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then HOGWILD! achieves a nearly optimal rate of convergence. We demonstrate experimentally that HOGWILD! outperforms alternative schemes that use locking by an order of magnitude.

2 0.79125524 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods

Author: Andrew Cotter, Ohad Shamir, Nati Srebro, Karthik Sridharan

Abstract: Mini-batch algorithms have been proposed as a way to speed-up stochastic convex optimization problems. We study how such algorithms can be improved using accelerated gradient methods. We provide a novel analysis, which shows how standard gradient methods may sometimes be insufficient to obtain a significant speed-up and propose a novel accelerated gradient algorithm, which deals with this deficiency, enjoys a uniformly superior guarantee and works well in practice. 1

3 0.74057692 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning

Author: Eric Moulines, Francis R. Bach

Abstract: We consider the minimization of a convex objective function defined on a Hilbert space, which is only available through unbiased estimates of its gradients. This problem includes standard machine learning algorithms such as kernel logistic regression and least-squares regression, and is commonly referred to as a stochastic approximation problem in the operations research community. We provide a non-asymptotic analysis of the convergence of two well-known algorithms, stochastic gradient descent (a.k.a. Robbins-Monro algorithm) as well as a simple modification where iterates are averaged (a.k.a. Polyak-Ruppert averaging). Our analysis suggests that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate in the strongly convex case, is not robust to the lack of strong convexity or the setting of the proportionality constant. This situation is remedied when using slower decays together with averaging, robustly leading to the optimal rate of convergence. We illustrate our theoretical results with simulations on synthetic and standard datasets.

4 0.73368543 271 nips-2011-Statistical Tests for Optimization Efficiency

Author: Levi Boyles, Anoop Korattikara, Deva Ramanan, Max Welling

Abstract: Learning problems, such as logistic regression, are typically formulated as pure optimization problems defined on some loss function. We argue that this view ignores the fact that the loss function depends on stochastically generated data which in turn determines an intrinsic scale of precision for statistical estimation. By considering the statistical properties of the update variables used during the optimization (e.g. gradients), we can construct frequentist hypothesis tests to determine the reliability of these updates. We utilize subsets of the data for computing updates, and use the hypothesis tests for determining when the batch-size needs to be increased. This provides computational benefits and avoids overfitting by stopping when the batch-size has become equal to size of the full dataset. Moreover, the proposed algorithms depend on a single interpretable parameter – the probability for an update to be in the wrong direction – which is set to a single value across all algorithms and datasets. In this paper, we illustrate these ideas on three L1 regularized coordinate descent algorithms: L1 -regularized L2 -loss SVMs, L1 -regularized logistic regression, and the Lasso, but we emphasize that the underlying methods are much more generally applicable. 1

5 0.70706242 72 nips-2011-Distributed Delayed Stochastic Optimization

Author: Alekh Agarwal, John C. Duchi

Abstract: We analyze the convergence of gradient-based optimization algorithms whose updates depend on delayed stochastic gradient information. The main application of our results is to the development of distributed minimization algorithms where a master node performs parameter updates while worker nodes compute stochastic gradients based on local information in parallel, which may give rise to delays due to asynchrony. Our main contribution is to show that for smooth stochastic problems, the delays are asymptotically negligible. In application to distributed optimization, we show n-node architectures whose optimization error in stochastic problems—in spite of asynchronous delays—scales asymptotically as O(1/√(nT)), which is known to be optimal even in the absence of delays.

6 0.61466426 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time

7 0.54495615 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization

8 0.49639481 202 nips-2011-On the Universality of Online Mirror Descent

9 0.47067863 192 nips-2011-Nonstandard Interpretations of Probabilistic Programs for Efficient Inference

10 0.42360771 95 nips-2011-Fast and Accurate k-means For Large Datasets

11 0.40376008 4 nips-2011-A Convergence Analysis of Log-Linear Training

12 0.40251592 235 nips-2011-Recovering Intrinsic Images with a Global Sparsity Prior on Reflectance

13 0.36474717 29 nips-2011-Algorithms and hardness results for parallel large margin learning

14 0.36199331 255 nips-2011-Simultaneous Sampling and Multi-Structure Fitting with Adaptive Reversible Jump MCMC

15 0.35882005 148 nips-2011-Learning Probabilistic Non-Linear Latent Variable Models for Tracking Complex Activities

16 0.3467052 51 nips-2011-Clustered Multi-Task Learning Via Alternating Structure Optimization

17 0.34592411 165 nips-2011-Matrix Completion for Multi-label Image Classification

18 0.34475186 141 nips-2011-Large-Scale Category Structure Aware Image Categorization

19 0.34384084 282 nips-2011-The Fast Convergence of Boosting

20 0.34364542 55 nips-2011-Collective Graphical Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.035), (4, 0.033), (9, 0.013), (20, 0.043), (26, 0.027), (31, 0.056), (33, 0.024), (43, 0.044), (45, 0.114), (56, 0.279), (57, 0.047), (65, 0.012), (74, 0.045), (83, 0.032), (86, 0.023), (99, 0.107)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.83151847 34 nips-2011-An Unsupervised Decontamination Procedure For Improving The Reliability Of Human Judgments

Author: Michael C. Mozer, Benjamin Link, Harold Pashler

Abstract: Psychologists have long been struck by individuals’ limitations in expressing their internal sensations, impressions, and evaluations via rating scales. Instead of using an absolute scale, individuals rely on reference points from recent experience. This relativity of judgment limits the informativeness of responses on surveys, questionnaires, and evaluation forms. Fortunately, the cognitive processes that map stimuli to responses are not simply noisy, but rather are influenced by recent experience in a lawful manner. We explore techniques to remove sequential dependencies, and thereby decontaminate a series of ratings to obtain more meaningful human judgments. In our formulation, the problem is to infer latent (subjective) impressions from a sequence of stimulus labels (e.g., movie names) and responses. We describe an unsupervised approach that simultaneously recovers the impressions and parameters of a contamination model that predicts how recent judgments affect the current response. We test our iterated impression inference, or I3 , algorithm in three domains: rating the gap between dots, the desirability of a movie based on an advertisement, and the morality of an action. We demonstrate significant objective improvements in the quality of the recovered impressions. 1

2 0.76520056 60 nips-2011-Confidence Sets for Network Structure

Author: David S. Choi, Patrick J. Wolfe, Edoardo M. Airoldi

Abstract: Latent variable models are frequently used to identify structure in dichotomous network data, in part because they give rise to a Bernoulli product likelihood that is both well understood and consistent with the notion of exchangeable random graphs. In this article we propose conservative confidence sets that hold with respect to these underlying Bernoulli parameters as a function of any given partition of network nodes, enabling us to assess estimates of residual network structure, that is, structure that cannot be explained by known covariates and thus cannot be easily verified by manual inspection. We demonstrate the proposed methodology by analyzing student friendship networks from the National Longitudinal Survey of Adolescent Health that include race, gender, and school year as covariates. We employ a stochastic expectation-maximization algorithm to fit a logistic regression model that includes these explanatory variables as well as a latent stochastic blockmodel component and additional node-specific effects. Although maximumlikelihood estimates do not appear consistent in this context, we are able to evaluate confidence sets as a function of different blockmodel partitions, which enables us to qualitatively assess the significance of estimated residual network structure relative to a baseline, which models covariates but lacks block structure. 1

same-paper 3 0.76359957 121 nips-2011-Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

Author: Benjamin Recht, Christopher Re, Stephen Wright, Feng Niu

Abstract: Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented without any locking. We present an update scheme called HOGWILD! which allows processors access to shared memory with the possibility of overwriting each other’s work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then HOGWILD! achieves a nearly optimal rate of convergence. We demonstrate experimentally that HOGWILD! outperforms alternative schemes that use locking by an order of magnitude.

4 0.66635728 67 nips-2011-Data Skeletonization via Reeb Graphs

Author: Xiaoyin Ge, Issam I. Safa, Mikhail Belkin, Yusu Wang

Abstract: Recovering hidden structure from complex and noisy non-linear data is one of the most fundamental problems in machine learning and statistical inference. While such data is often high-dimensional, it is of interest to approximate it with a lowdimensional or even one-dimensional space, since many important aspects of data are often intrinsically low-dimensional. Furthermore, there are many scenarios where the underlying structure is graph-like, e.g, river/road networks or various trajectories. In this paper, we develop a framework to extract, as well as to simplify, a one-dimensional ”skeleton” from unorganized data using the Reeb graph. Our algorithm is very simple, does not require complex optimizations and can be easily applied to unorganized high-dimensional data such as point clouds or proximity graphs. It can also represent arbitrary graph structures in the data. We also give theoretical results to justify our method. We provide a number of experiments to demonstrate the effectiveness and generality of our algorithm, including comparisons to existing methods, such as principal curves. We believe that the simplicity and practicality of our algorithm will help to promote skeleton graphs as a data analysis tool for a broad range of applications.

5 0.55383074 270 nips-2011-Statistical Performance of Convex Tensor Decomposition

Author: Ryota Tomioka, Taiji Suzuki, Kohei Hayashi, Hisashi Kashima

Abstract: We analyze the statistical performance of a recently proposed convex tensor decomposition algorithm. Conventionally tensor decomposition has been formulated as non-convex optimization problems, which hindered the analysis of their performance. We show under some conditions that the mean squared error of the convex method scales linearly with the quantity we call the normalized rank of the true tensor. The current analysis naturally extends the analysis of convex low-rank matrix estimation to tensors. Furthermore, we show through numerical experiments that our theory can precisely predict the scaling behaviour in practice.

6 0.55115509 186 nips-2011-Noise Thresholds for Spectral Clustering

7 0.55042881 263 nips-2011-Sparse Manifold Clustering and Embedding

8 0.54679877 122 nips-2011-How Do Humans Teach: On Curriculum Learning and Teaching Dimension

9 0.54432607 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods

10 0.54013914 72 nips-2011-Distributed Delayed Stochastic Optimization

11 0.53751391 54 nips-2011-Co-regularized Multi-view Spectral Clustering

12 0.53673345 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration

13 0.5350346 29 nips-2011-Algorithms and hardness results for parallel large margin learning

14 0.53306198 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions

15 0.53255248 157 nips-2011-Learning to Search Efficiently in High Dimensions

16 0.53184664 280 nips-2011-Testing a Bayesian Measure of Representativeness Using a Large Image Database

17 0.52620643 271 nips-2011-Statistical Tests for Optimization Efficiency

18 0.52494961 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time

19 0.52429134 164 nips-2011-Manifold Precis: An Annealing Technique for Diverse Sampling of Manifolds

20 0.52388972 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes