
76 nips-2012-Communication-Efficient Algorithms for Statistical Optimization


Source: pdf

Author: Yuchen Zhang, Martin J. Wainwright, John C. Duchi

Abstract: We study two communication-efficient algorithms for distributed statistical optimization on large-scale data. The first algorithm is an averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves mean-squared error that decays as O(N^{-1} + (N/m)^{-2}). Whenever m ≤ √N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of the bootstrap. Requiring only a single round of communication, it has mean-squared error that decays as O(N^{-1} + (N/m)^{-3}), and so is more robust to the amount of parallelization. We complement our theoretical results with experiments on large-scale problems from the internet search domain. In particular, we show that our methods efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which consists of N ≈ 2.4 × 10^8 samples and d ≥ 700,000 dimensions.
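To make the first algorithm concrete, the sketch below illustrates the average-mixture idea described in the abstract on a toy least-squares problem: split the N samples across m machines, solve a local empirical-risk minimization on each split, and average the m local estimates. The least-squares loss, the synthetic data, and the use of scipy's L-BFGS solver are illustrative assumptions, not details taken from the paper; the paper's second algorithm additionally applies a bootstrap-based correction on each machine, which is not shown here.

```python
# Minimal sketch of the one-shot averaging ("average mixture") idea:
# partition the data over m machines, minimize locally, average the results.
# Loss, data, and solver choice are assumptions for illustration only.
import numpy as np
from scipy.optimize import minimize

def local_erm(X, y, d):
    """Solve the local least-squares problem on one machine's data."""
    loss = lambda theta: 0.5 * np.mean((X @ theta - y) ** 2)
    grad = lambda theta: X.T @ (X @ theta - y) / len(y)
    res = minimize(loss, np.zeros(d), jac=grad, method="L-BFGS-B")
    return res.x

def avgm(X, y, m):
    """Average mixture: split data into m pieces, solve locally, average."""
    d = X.shape[1]
    X_splits = np.array_split(X, m)
    y_splits = np.array_split(y, m)
    local_estimates = [local_erm(Xi, yi, d) for Xi, yi in zip(X_splits, y_splits)]
    return np.mean(local_estimates, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, d, m = 10000, 5, 10
    theta_star = rng.normal(size=d)
    X = rng.normal(size=(N, d))
    y = X @ theta_star + 0.1 * rng.normal(size=N)
    theta_hat = avgm(X, y, m)
    print("estimation error:", np.linalg.norm(theta_hat - theta_star))
```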


reference text

[1] A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-theoretic lower bounds on the oracle complexity of convex optimization. IEEE Transactions on Information Theory, 58(5):3235–3249, May 2012.

[2] A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems 24, 2011.

[3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[4] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using minibatches. Journal of Machine Learning Research, 13:165–202, 2012.

[5] J. C. Duchi, A. Agarwal, and M. J. Wainwright. Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.

[6] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, 1993.

[7] P. Hall. The Bootstrap and Edgeworth Expansion. Springer, 1992.

[8] E. Hazan, A. Kalai, S. Kale, and A. Agarwal. Logarithmic regret algorithms for online convex optimization. In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory, 2006.

[9] B. Johansson, M. Rabi, and M. Johansson. A randomized incremental subgradient method for distributed optimization in networked systems. SIAM Journal on Optimization, 20(3):1157–1170, 2009.

[10] G. Mann, R. McDonald, M. Mohri, N. Silberman, and D. Walker. Efficient large-scale distributed training of conditional maximum entropy models. In Advances in Neural Information Processing Systems 22, pages 1231–1239, 2009.

[11] C. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

[12] R. McDonald, K. Hall, and G. Mann. Distributed training strategies for the structured perceptron. In North American Chapter of the Association for Computational Linguistics (NAACL), 2010.

[13] A. Nedić and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54:48–61, 2009.

[14] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

[15] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 2006.

[16] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.

[17] S. S. Ram, A. Nedić, and V. V. Veeravalli. Distributed stochastic subgradient projection algorithms for convex optimization. Journal of Optimization Theory and Applications, 147(3):516–545, 2010.

[18] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: a lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems 24, 2011.

[19] H. Robbins. Asymptotically subminimax solutions of compound statistical decision problems. In Proceedings of the 2nd Berkeley Symposium on Mathematical Statistics and Probability, pages 131–148, 1951.

[20] G. Sun. KDD cup track 2 soso.com ads prediction challenge, 2012. Accessed August 1, 2012.

[21] A. W. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 1998.

[22] Y. Zhang, J. C. Duchi, and M. J. Wainwright. Communication-efficient algorithms for statistical optimization. arXiv:1209.4129 [stat.ML], 2012.

[23] M. A. Zinkevich, A. Smola, M. Weimer, and L. Li. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems 23, 2010.