jmlr jmlr2013 jmlr2013-25 jmlr2013-25-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Authors: Yuchen Zhang, John C. Duchi, Martin J. Wainwright
Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves mean-squared error (MSE) that decays as O(N^{-1} + (N/m)^{-2}). Whenever m ≤ √N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O(N^{-1} + (N/m)^{-3}), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O(N^{-1} + (N/m)^{-3/2}), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 10^8 samples and d ≈ 740,000 covariates.
Keywords: distributed learning, stochastic optimization, averaging, subsampling
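The two one-shot procedures described in the abstract are simple enough to sketch concretely. Below is a minimal Python illustration (assuming NumPy) of the averaging method and of a subsampled, bias-corrected variant in the spirit of the paper's second algorithm. The function names (avgm, savgm, local_least_squares), the least-squares objective, and the subsampling rate r are illustrative choices for this sketch, not code or settings from the paper.

```python
import numpy as np

def local_least_squares(X, y):
    """Local empirical risk minimizer: ordinary least squares on one machine's subset."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

def avgm(X, y, m):
    """Average-mixture (AVGM) sketch: split the N samples into m equal subsets,
    solve each local problem separately, and average the m local estimates."""
    blocks = np.array_split(np.arange(X.shape[0]), m)
    thetas = [local_least_squares(X[b], y[b]) for b in blocks]
    return np.mean(thetas, axis=0)

def savgm(X, y, m, r=0.1, seed=0):
    """Subsampled average-mixture sketch: each machine also solves the problem on a
    subsample of size ceil(r * n), and the two averages are combined as
    (theta_bar - r * theta_bar_sub) / (1 - r) to cancel the leading bias term.
    The rate r and this simple combination are illustrative assumptions here."""
    rng = np.random.default_rng(seed)
    blocks = np.array_split(np.arange(X.shape[0]), m)
    thetas, thetas_sub = [], []
    for b in blocks:
        thetas.append(local_least_squares(X[b], y[b]))
        sub = rng.choice(b, size=int(np.ceil(r * len(b))), replace=False)
        thetas_sub.append(local_least_squares(X[sub], y[sub]))
    return (np.mean(thetas, axis=0) - r * np.mean(thetas_sub, axis=0)) / (1.0 - r)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    N, d, m = 20_000, 5, 20
    theta_star = rng.normal(size=d)
    X = rng.normal(size=(N, d))
    y = X @ theta_star + 0.5 * rng.normal(size=N)
    for name, est in [("AVGM", avgm(X, y, m)), ("SAVGM", savgm(X, y, m))]:
        print(name, "MSE:", np.mean((est - theta_star) ** 2))
```

For least squares the local solutions are already unbiased, so the subsampled correction changes little in this toy run; the improvement from O(N^{-1} + (N/m)^{-2}) to O(N^{-1} + (N/m)^{-3}) discussed in the abstract concerns general (typically biased) local M-estimators such as regularized logistic regression.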
References:
A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems 24, 2011.
A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-theoretic lower bounds on the oracle complexity of convex optimization. IEEE Transactions on Information Theory, 58(5):3235–3249, May 2012.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27, 2011.
R. Chen, A. Gittens, and J. A. Tropp. The masked sample covariance estimator: an analysis using matrix concentration inequalities. Information and Inference, to appear, 2012.
A. de Acosta. Inequalities for B-valued random vectors with applications to the strong law of large numbers. The Annals of Probability, 9:157–161, 1981.
O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13:165–202, 2012.
J. C. Duchi, A. Agarwal, and M. J. Wainwright. Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012a.
J. C. Duchi, P. L. Bartlett, and M. J. Wainwright. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 22(2):674–701, 2012b.
B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, 1993.
P. Hall. The Bootstrap and Edgeworth Expansion. Springer, 1992.
E. Hazan, A. Kalai, S. Kale, and A. Agarwal. Logarithmic regret algorithms for online convex optimization. In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory, 2006.
B. Johansson, M. Rabi, and M. Johansson. A randomized incremental subgradient method for distributed optimization in networked systems. SIAM Journal on Optimization, 20(3):1157–1170, 2009.
R. W. Keener. Theoretical Statistics: Topics for a Core Course. Springer, 2010.
M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer, 1991.
E. L. Lehmann and G. Casella. Theory of Point Estimation, Second Edition. Springer, 1998.
G. Mann, R. McDonald, M. Mohri, N. Silberman, and D. Walker. Efficient large-scale distributed training of conditional maximum entropy models. In Advances in Neural Information Processing Systems 22, pages 1231–1239, 2009.
C. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
R. McDonald, K. Hall, and G. Mann. Distributed training strategies for the structured perceptron. In North American Chapter of the Association for Computational Linguistics (NAACL), 2010.
A. Nedić and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54:48–61, 2009.
A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 2006.
D. N. Politis, J. P. Romano, and M. Wolf. Subsampling. Springer, 1999.
B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
S. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning, 2012.
S. S. Ram, A. Nedić, and V. V. Veeravalli. Distributed stochastic subgradient projection algorithms for convex optimization. Journal of Optimization Theory and Applications, 147(3):516–545, 2010.
B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: a lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems 24, 2011.
H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. arXiv preprint arXiv:1209.1873, 2012.
G. Sun. KDD Cup track 2 soso.com ads prediction challenge, 2012. URL http://www.kddcup2012.org/c/kddcup2012-track2. Accessed August 1, 2012.
A. W. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 1998. ISBN 0-521-49603-9.
Y. Zhang, J. C. Duchi, and M. J. Wainwright. Divide and conquer kernel ridge regression. In Proceedings of the Twenty Sixth Annual Conference on Computational Learning Theory, Princeton, NJ, July 2013.
M. A. Zinkevich, A. Smola, M. Weimer, and L. Li. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems 23, 2010.