
121 nips-2011-Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent


Source: pdf

Author: Benjamin Recht, Christopher Ré, Stephen Wright, Feng Niu

Abstract: Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented without any locking. We present an update scheme called HOGWILD! which allows processors access to shared memory with the possibility of overwriting each other's work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then HOGWILD! achieves a nearly optimal rate of convergence. We demonstrate experimentally that HOGWILD! outperforms alternative schemes that use locking by an order of magnitude.
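To make the update scheme concrete, below is a minimal sketch of a Hogwild!-style lock-free SGD loop in Python on a synthetic sparse least-squares problem. All names, sizes, and constants here (worker, N_FEATURES, STEP_SIZE, the number of workers, and so on) are illustrative assumptions, not details taken from the paper; the point of the sketch is only that each worker updates a shared decision variable without acquiring any lock, touching just the few coordinates its sampled example involves.

# Illustrative sketch of Hogwild!-style lock-free parallel SGD.
# Problem setup, constants, and names are assumptions, not the paper's experiments.
import numpy as np
from multiprocessing import Array, Process

N_FEATURES = 1000          # dimension of the shared decision variable x
N_EXAMPLES = 10000         # number of synthetic training examples
NNZ_PER_EXAMPLE = 10       # each example touches only a few coordinates (sparsity)
STEP_SIZE = 0.01
N_WORKERS = 4


def worker(shared_x, indices, values, targets):
    """Run SGD on the shared decision variable with no locking.

    Each update reads and writes only the coordinates that the sampled example
    touches, so concurrent overwrites are rare when the problem is sparse --
    the regime the Hogwild! analysis covers.
    """
    x = np.frombuffer(shared_x)                   # view the shared buffer: no copy, no lock
    rng = np.random.default_rng()
    for _ in range(len(targets)):
        i = rng.integers(len(targets))            # sample an example uniformly
        idx, a = indices[i], values[i]            # its sparse support and values
        residual = a @ x[idx] - targets[i]        # least-squares prediction error
        x[idx] -= STEP_SIZE * residual * a        # update only the touched coordinates


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic sparse data: each example touches NNZ_PER_EXAMPLE random coordinates.
    indices = [rng.choice(N_FEATURES, NNZ_PER_EXAMPLE, replace=False)
               for _ in range(N_EXAMPLES)]
    values = [rng.standard_normal(NNZ_PER_EXAMPLE) for _ in range(N_EXAMPLES)]
    x_true = rng.standard_normal(N_FEATURES)
    targets = [values[i] @ x_true[indices[i]] for i in range(N_EXAMPLES)]

    # Shared decision variable; lock=False means no synchronization primitive
    # is created at all -- the workers write to it concurrently.
    shared_x = Array("d", N_FEATURES, lock=False)

    procs = [Process(target=worker, args=(shared_x, indices, values, targets))
             for _ in range(N_WORKERS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

    x = np.frombuffer(shared_x)
    print("squared distance to ground truth:", float(np.sum((x - x_true) ** 2)))

With a lock-free shared vector like this, each SGD step is an unsynchronized read-modify-write of a handful of entries; the paper (and the accompanying technical report [22]) gives the sparsity conditions under which such updates still converge at a nearly optimal rate.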


reference text

[1] Max-flow problem instances in vision. From http://vision.csd.uwo.ca/data/maxflow/.

[2] K. Asanovic et al. The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, Electrical Engineering and Computer Sciences, University of California at Berkeley, 2006.

[3] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 2nd edition, 1999.

[4] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Athena Scientific, Belmont, MA, 1997.

[5] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, 2008.

[6] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9):1124–1137, 2004.

[7] G. Călinescu, H. Karloff, and Y. Rabani. An improved approximation algorithm for multiway cut. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pages 48–52, 1998.

[8] E. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.

[9] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[10] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using minibatches. Technical report, Microsoft Research, 2011.

[11] A. Doan. http://dblife.cs.wisc.edu.

[12] J. Duchi, A. Agarwal, and M. J. Wainwright. Distributed dual averaging in networks. In Advances in Neural Information Processing Systems, 2010.

[13] S. H. Fuller and L. I. Millett, editors. The Future of Computing Performance: Game Over or Next Level. Committee on Sustaining Growth in Computing Performance. The National Academies Press, Washington, D.C., 2011.

[14] T. Joachims. Training linear SVMs in linear time. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), 2006.

[15] J. Langford. https://github.com/JohnLangford/vowpal_wabbit/wiki.

[16] J. Langford, A. J. Smola, and M. Zinkevich. Slow learners are fast. In Advances in Neural Information Processing Systems, 2009.

[17] J. Lee, B. Recht, N. Srebro, R. R. Salakhutdinov, and J. A. Tropp. Practical large-scale optimization for max-norm regularization. In Advances in Neural Information Processing Systems, 2010.

[18] T. Lee, Z. Wang, H. Wang, and S. Hwang. Web scale entity resolution using relational evidence. Technical report, Microsoft Research, 2011. Available at http://research.microsoft.com/apps/pubs/default.aspx?id=145839.

[19] D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.

[20] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: Interactive analysis of web-scale datasets. In Proceedings of VLDB, 2010.

[21] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

[22] F. Niu, B. Recht, C. Ré, and S. J. Wright. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. Technical report, 2011. arxiv.org/abs/1106.5730.

[23] B. Recht, M. Fazel, and P. Parrilo. Guaranteed minimum rank solutions of matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.

[24] S. Shalev-Shwartz and N. Srebro. SVM optimization: Inverse dependence on training set size. In Proceedings of the 25th International Conference on Machine Learning, 2008.

[25] N. Srebro, J. Rennie, and T. Jaakkola. Maximum margin matrix factorization. In Advances in Neural Information Processing Systems, 2004.

[26] J. Tsitsiklis, D. P. Bertsekas, and M. Athans. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Transactions on Automatic Control, 31(9):803–812, 1986.

[27] M. Zinkevich, M. Weimer, A. Smola, and L. Li. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, 2010.