nips nips2010 nips2010-202 nips2010-202-reference knowledge-graph by maker-knowledge-mining

202 nips-2010-Parallelized Stochastic Gradient Descent

Source: pdf

Author: Martin Zinkevich, Markus Weimer, Lihong Li, Alex J. Smola

Abstract: With the increase in available data parallel machine learning has become an increasingly pressing problem. In this paper we present the ﬁrst parallel stochastic gradient descent algorithm including a detailed analysis and experimental evidence. Unlike prior work on parallel optimization algorithms [5, 7] our variant comes with parallel acceleration guarantees and it poses no overly tight latency constraints, which might only be available in the multicore setting. Our analysis introduces a novel proof technique — contractive mappings to quantify the speed of convergence of parameter distributions to their asymptotic limits. As a side effect this answers the question of how quickly stochastic gradient descent algorithms reach the asymptotically normal regime [1, 8]. 1

reference text

[1] Shun-ichi Amari. A theory of adaptive pattern classiﬁers. IEEE Transactions on Electronic Computers, 16:299–307, 1967.

[2] L. Bottou and O. Bosquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, 2008.

[3] C.T. Chu, S.K. Kim, Y. A. Lin, Y. Y. Yu, G. Bradski, A. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In B. Sch¨ lkopf, J. Platt, and T. Hofmann, editors, Advances o in Neural Information Processing Systems 19, 2007.

[4] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. In Conference on Computational Learning Theory, 2010.

[5] J. Langford, A.J. Smola, and M. Zinkevich. Slow learners are fast. In Neural Information Processing Systems, 2009.

[6] J. Langford, A.J. Smola, and M. Zinkevich. Slow learners are fast. arXiv:0911.0491, 2009.

[7] G. Mann, R. McDonald, M. Mohri, N. Silberman, and D. Walker. Efﬁcient large-scale distributed training of conditional maximum entropy models. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1231–1239. 2009.

[8] N. Murata, S. Yoshizawa, and S. Amari. Network information criterion — determining the number of hidden units for artiﬁcial neural network models. IEEE Transactions on Neural Networks, 5:865–872, 1994.

[9] Choon Hui Teo, S. V. N. Vishwanthan, Alex J. Smola, and Quoc V. Le. Bundle methods for regularized risk minimization. J. Mach. Learn. Res., 11:311–365, January 2010.

[10] U. von Luxburg and O. Bousquet. Distance-based classiﬁcation with lipschitz functions. Journal of Machine Learning Research, 5:669–695, 2004.

[11] M. Zinkevich. Online convex programming and generalised inﬁnitesimal gradient ascent. In Proc. Intl. Conf. Machine Learning, pages 928–936, 2003. 9