jmlr jmlr2012 jmlr2012-85 jmlr2012-85-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao
Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem.
Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization
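The abstract only sketches the method in words; below is a minimal, single-process simulation of the mini-batch conversion it describes, assuming plain online gradient descent as the serial base algorithm and a synthetic squared-loss stream. All names and parameters here (simulate_dmb, k, b, eta, the toy loss) are illustrative choices and not taken from the paper, and the sketch deliberately ignores the communication latencies and the more general update templates that the paper's analysis covers.

# A rough, single-process simulation of the distributed mini-batch idea,
# assuming online gradient descent as the serial base algorithm and a toy
# squared loss. Function and parameter names are illustrative, not from the paper.
import numpy as np

def grad_squared_loss(w, x, y):
    # Gradient of the loss 0.5 * (w.x - y)^2 with respect to w.
    return (w @ x - y) * x

def simulate_dmb(stream, dim, k=4, b=64, eta=0.1):
    # Predict with the current w on every arriving input, buffer the input,
    # and once b inputs have arrived, let each of the k simulated workers
    # average the gradients of its slice (all evaluated at the same frozen w),
    # average across workers, and apply a single gradient step.
    w = np.zeros(dim)
    cumulative_loss = 0.0
    batch = []
    for x, y in stream:
        cumulative_loss += 0.5 * (w @ x - y) ** 2  # loss suffered before any update
        batch.append((x, y))
        if len(batch) == b:
            worker_grads = [
                np.mean([grad_squared_loss(w, xi, yi) for xi, yi in batch[j::k]], axis=0)
                for j in range(k)
            ]
            w -= eta * np.mean(worker_grads, axis=0)
            batch = []
    return w, cumulative_loss

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 5
    w_star = rng.normal(size=dim)
    stream = []
    for _ in range(10_000):
        x = rng.normal(size=dim)
        stream.append((x, x @ w_star + 0.01 * rng.normal()))
    w_hat, loss = simulate_dmb(stream, dim=dim)
    print("cumulative squared loss:", round(float(loss), 2))

The point of the simulation is the data flow: every worker evaluates gradients at the same frozen predictor within a batch, and the averaged gradient drives exactly one serial-style update per b inputs, so the structure of the underlying serial algorithm is left unchanged.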
References:
J. Abernethy, A. Agarwal, A. Rakhlin, and P. L. Bartlett. A stochastic view of optimal regret through minimax duality. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT), 2009.
D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice Hall, 1989.
J. R. Birge and F. Louveaux. Introduction to Stochastic Programming. Springer-Verlag, New York, 1997.
N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
G. Chen and M. Teboulle. Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM Journal on Optimization, 3(3):538–543, August 1993.
A. Cotter, O. Shamir, N. Srebro, and K. Sridharan. Better mini-batch algorithms via accelerated gradient methods. In Advances in Neural Information Processing Systems 24, 2011.
O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 713–720, 2011.
J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2873–2898, 2009.
J. Duchi, A. Agarwal, and M. Wainwright. Distributed dual averaging in networks. In Advances in Neural Information Processing Systems (NIPS), pages 550–558, 2010.
S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization. Technical report, Department of Industrial and System Engineering, University of Florida, Gainesville, FL, 2010.
K. Gimpel, D. Das, and N. A. Smith. Distributed asynchronous online learning for natural language processing. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, pages 213–222, 2010.
R. Goebel and R. T. Rockafellar. Local strong convexity and local Lipschitz continuity of the gradient of convex functions. Journal of Convex Analysis, 15(2):263–270, 2008.
J. L. Gustafson. Reevaluating Amdahl's Law. Communications of the ACM, 31(5):532–533, 1988.
E. Harrington, R. Herbrich, J. Kivinen, J. Platt, and R. C. Williamson. Online Bayes point machines. In Proceedings of the Seventh Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 241–252, 2003.
C. Hu, J. T. Kwok, and W. Pan. Accelerated gradient methods for stochastic optimization and online learning. In Y. Bengio, D. Schuurmans, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 781–789, 2009.
A. Juditsky, A. Nemirovski, and C. Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 1:1–42, 2011.
G. Lan. An optimal method for stochastic composite optimization. Technical report, Georgia Institute of Technology, 2009.
G. Lan, Z. Lu, and R. D. C. Monteiro. Primal-dual first-order methods with O(1/ε) iteration-complexity for cone programming. Mathematical Programming, 126:1–29, 2011.
J. Langford, A. J. Smola, and M. Zinkevich. Slow learners are fast. In Advances in Neural Information Processing Systems (NIPS) 22, pages 2331–2339, 2009.
A. Nedić and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
A. Nedić, D. P. Bertsekas, and V. S. Borkar. Distributed asynchronous incremental subgradient methods. In D. Butnariu, Y. Censor, and S. Reich, editors, Inherently Parallel Algorithms in Feasibility and Optimization and Their Applications, Studies in Computational Mathematics, pages 381–407. Elsevier, 2001.
A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Series in Discrete Mathematics. Wiley-Interscience, 1983.
A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, Boston, 2004.
Y. Nesterov. Smooth minimization of nonsmooth functions. Mathematical Programming, 103:127–152, 2005.
Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259, August 2009.
Y. Nesterov and J.-Ph. Vial. Confidence level solutions for stochastic programming. Automatica, 44(6):1559–1568, 2008.
B. T. Polyak. Introduction to Optimization. Translations Series in Mathematics and Engineering. Optimization Software, Inc., New York, 1987.
R. T. Rockafellar and R. J-B Wets. On the interchange of subdifferentiation and conditional expectation for convex functionals. Stochastics: An International Journal of Probability and Stochastic Processes, 7(3):173–182, 1982.
S. Shalev-Shwartz and A. Tewari. Stochastic methods for ℓ1-regularized loss minimization. Journal of Machine Learning Research, 12:1865–1892, 2011.
S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning (ICML), pages 807–814, 2007.
P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. Submitted to SIAM Journal on Optimization, 2008.
J. N. Tsitsiklis, D. P. Bertsekas, and M. Athans. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Transactions on Automatic Control, 31(9):803–812, 1986.
R. J-B Wets. Stochastic programming. In G. Nemhauser and A. Rinnooy Kan, editors, Handbook for Operations Research and Management Sciences, volume 1, pages 573–629. Elsevier Science Publishers, Amsterdam, The Netherlands, 1989.
L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.
M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML), pages 928–936, Washington DC, 2003.
M. Zinkevich, M. Weimer, A. Smola, and L. Li. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems (NIPS), pages 2595–2603, 2010.