jmlr jmlr2010 jmlr2010-18 jmlr2010-18-reference knowledge-graph by maker-knowledge-mining

18 jmlr-2010-Bundle Methods for Regularized Risk Minimization

Source: pdf

Author: Choon Hui Teo, S.V.N. Vishwanthan, Alex J. Smola, Quoc V. Le

Abstract: A wide variety of machine learning problems can be described as minimizing a regularized risk functional, with different algorithms using different notions of risk and different regularizers. Examples include linear Support Vector Machines (SVMs), Gaussian Processes, Logistic Regression, Conditional Random Fields (CRFs), and Lasso amongst others. This paper describes the theory and implementation of a scalable and modular convex solver which solves all these estimation problems. It can be parallelized on a cluster of workstations, allows for data-locality, and can deal with regularizers such as L1 and L2 penalties. In addition to the uniﬁed framework we present tight convergence bounds, which show that our algorithm converges in O(1/ε) steps to ε precision for general convex problems and in O(log(1/ε)) steps for continuously differentiable problems. We demonstrate the performance of our general purpose solver on a variety of publicly available data sets. Keywords: optimization, subgradient methods, cutting plane method, bundle methods, regularized risk minimization, parallel optimization ∗. Also at Canberra Research Laboratory, NICTA. c 2010 Choon Hui Teo, S.V. N. Vishwanthan, Alex J. Smola and Quoc V. Le. T EO , V ISHWANATHAN , S MOLA AND L E

reference text

N. Abe, J. Takeuchi, and M. K. Warmuth. Polynomial Learnability of Stochastic Rules with Respect to the KL-Divergence and Quadratic Distance. IEICE Transactions on Information and Systems, 84(3):299–316, 2001. G. Amdahl. Validity of the single processor approach to achieving large-scale computing capabilities. In AFIPS Conference Proceedings, volume 30, pages 483–485, 1967. M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock. Gene ontology: tool for the uniﬁcation of biology. the gene ontology consortium. Nature Genetics, 25:25–29, 2000. G. Bakir, T. Hofmann, B. Sch¨ lkopf, A. J. Smola, B. Taskar, and S. V. N. Vishwanathan. Predicting o Structured Data. MIT Press, Cambridge, Massachusetts, 2007. O. E. Barndorff-Nielsen. Information and Exponential Families in Statistical Theory. John Wiley and Sons, New York, 1978. J. Basilico and T. Hofmann. Unifying collaborative and content-based ﬁltering. In Proceedings of the International Conference on Machine Learning, pages 65–72, New York, NY, 2004. ACM Press. A. Belloni. Introduction to bundle methods. Technical report, Operation Research Center, M.I.T., 2005. S. Belongie, J. Malik, and J. Puzicha. Matching shapes. In Eighth IEEE International Conference on Computer Vision, Vancouver, Canada, July 2001. K. P. Bennett and O. L. Mangasarian. Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods Software, 1:23–34, 1992. S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, England, 2004. J. S. Breese, D. Heckerman, and C. Kardie. Empirical analysis of predictive algorithms for collaborative ﬁltering. In Proceedings of the 14th Conference on Uncertainty in Artiﬁcial Intelligence, pages 43–52, 1998. T. Caetano, L. Cheng, Q. V. Le, and A. J. Smola. Learning graph matching. In Proceedings of the 11th International Conference On Computer Vision, pages 1–8, Los Alamitos, CA, 2007. IEEE Computer Society. T. Caetano, J. J. McAuley, L. Cheng, Q. V. Le, and A. J. Smola. Learning graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008. submitted. L. Cai and T. Hofmann. Hierarchical document categorization with support vector machines. In Proceedings of the Thirteenth ACM conference on Information and knowledge management, pages 78–87, New York, NY, USA, 2004. ACM Press. 360 B UNDLE M ETHODS FOR R EGULARIZED R ISK M INIMIZATION E. Candes and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005. C. C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/˜cjlin/libsvm. O. Chapelle. Training a support vector machine in the primal. Neural Computation, 19(5):1155– 1178, 2007. M. Collins, R. E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. In Proceedings of the 13th Annual Conference on Computational Learning Theory, pages 158–169. Morgan Kaufmann, San Francisco, 2000. C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20(3):273–297, 1995. R. Cowell, A. Dawid, S. Lauritzen, and D. Spiegelhalter. Probabilistic Networks and Expert Sytems. Springer, New York, 1999. K. Crammer and Y. Singer. Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3:951–991, January 2003. K. Crammer and Y. Singer. Online ranking by projecting. Neural Computation, 17(1):145–175, 2005. N. A. C. Cressie. Statistics for Spatial Data. John Wiley and Sons, New York, 1993. L. Fahrmeir and G. Tutz. Multivariate Statistical Modelling Based on Generalized Linear Models. Springer, 1994. R.-E. Fan, J.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: a library for large linear classiﬁcation. Journal of Machine Learning Research, 9:1871–1874, August 2008. S. Fine and K. Scheinberg. Efﬁcient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243–264, Dec 2001. V. Franc and S. Sonnenburg. Optimized cutting plane algorithm for support vector machines. In A. McCallum and S. Roweis, editors, Proceedings of the International Conference on Machine Learning, pages 320–327. Omnipress, 2008. A. Frangioni. Dual-Ascent Methods and Multicommodity Flow Problems. PhD thesis, Dipartimento di Informatica, Universita’ di Pisa, 1997. TD 5/97. W. Gropp, E. Lusk, and R. Thakur. Using MPI-2. MIT Press, 1999. R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In A. J. Smola, P. L. Bartlett, B. Sch¨ lkopf, and D. Schuurmans, editors, Advances in Large Margin o Classiﬁers, pages 115–132, Cambridge, MA, 2000. MIT Press. J. B. Hiriart-Urruty and C. Lemar´ chal. Convex Analysis and Minimization Algorithms, I and II, e volume 305 and 306. Springer-Verlag, 1993. 361 T EO , V ISHWANATHAN , S MOLA AND L E C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In W. Cohen, A. McCallum, and S. Roweis, editors, icml, pages 408–415. ACM, 2008. K. Jarvelin and J. Kekalainen. IR evaluation methods for retrieving highly relevant documents. In ACM Special Interest Group in Information Retrieval (SIGIR), pages 41–48, New York, 2002. ACM. T. Joachims. A support vector method for multivariate performance measures. In Proceedings of the International Conference on Machine Learning, pages 377–384, San Francisco, California, 2005. Morgan Kaufmann Publishers. T. Joachims. Training linear SVMs in linear time. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD). ACM, 2006. T. Joachims. Making large-scale SVM learning practical. In B. Sch¨ lkopf, C. J. C. Burges, and o A. J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 169–184, Cambridge, MA, 1999. MIT Press. T. Joachims, T. Finley, and C.-N. Yu. Cutting-plane traning of structural SVMs. Machine Learning, 76(1), 2009. R. Jonker and A. Volgenant. A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing, 38:325–340, 1987. M. I. Jordan. An Introduction to Probabilistic Graphical Models. MIT Press, 2002. To Appear. R. M. Karp. An algorithm to solve the m × n assignment problem in expected time O(mn log n). Networks, 10(2):143–152, 1980. S. S. Keerthi and D. DeCoste. A modiﬁed ﬁnite Newton method for fast solution of large scale linear SVMs. Journal of Machine Learning Research, 6:341–361, 2005. J. E. Kelly. The cutting-plane method for solving convex programs. Journal of the Society for Industrial and Applied Mathematics, 8(4):703–712, December 1960. K. C. Kiwiel. An aggregate subgradient method for nonsmooth convex minimization. Mathematical Programming, 27:320–341, 1983. K. C. Kiwiel. Proximity control in bundle methods for convex nondifferentiable minimization. Mathematical Programming, 46:105–122, 1990. R. Koenker. Quantile Regression. Cambridge University Press, 2005. H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83–97, 1955. J. D. Lafferty, A. McCallum, and F. Pereira. Conditional random ﬁelds: probabilistic modeling for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning, volume 18, pages 282–289, San Francisco, CA, 2001. Morgan Kaufmann. 362 B UNDLE M ETHODS FOR R EGULARIZED R ISK M INIMIZATION Q. V. Le and A. J. Smola. Direct optimization of ranking measures. Journal of Machine Learning Research, 2007. submitted. C. Lemar´ chal, A. Nemirovskii, and Y. Nesterov. New variants of bundle methods. Mathematical e Programming, 69:111–147, 1995. C.-J. Lin, R. C. Weng, and S. S. Keerthi. Trust region newton method for large-scale logistic regression. Journal of Machine Learning Research, 9:627–650, April 2008. L. Lukˇan and J. Vlˇ ek. A bundle-Newton method for nonsmooth unconstrained minimization. s c Mathematical Programming, 83(3):373–391, 1998. O. L. Mangasarian. Linear and nonlinear separation of patterns by linear programming. Operations Research, 13:444–452, 1965. D. McAllester. Generalization bounds and consistency for structured labeling. In Predicting Structured Data, Cambridge, Massachusetts, 2007. MIT Press. T. P. Minka. A comparison of numerical optimizers for logistic regression. Technical report, Microsoft Research, 2007. K.-R. M¨ ller, A. J. Smola, G. R¨ tsch, B. Sch¨ lkopf, J. Kohlmorgen, and V. Vapnik. Predicting time u a o series with support vector machines. In W. Gerstner, A. Germond, M. Hasler, and J.-D. Nicoud, editors, Artiﬁcial Neural Networks ICANN’97, volume 1327 of Lecture Notes in Comput. Sci., pages 999–1004, Berlin, 1997. Springer-Verlag. J. Munkres. Algorithms for the assignment and transportation problems. Journal of SIAM, 5(1): 32–38, 1957. A. Nedich and D. P Bertsekas. Convergence rate of incremental subgradient algorithms. In S. Uryasev and P. M. Pardalos, editors, Stochastic Optimization: Algorithms and Applications, pages 263–304. Kluwer Academic Publishers, 2000. J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research. Springer, 1999. J. B. Orlin and Y. Lee. Quickmatch: a very fast algorithm for the assignment problem. Working Paper 3547-93, Sloan School of Management, Massachusetts Institute of Technology, Cambridge, MA, March 1993. A. Rahimi and B. Recht. Random features for large-scale kernel machines. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20. MIT Press, Cambridge, MA, 2008. N. Ratliff, J. Bagnell, and M. Zinkevich. Maximum margin planning. In Proceedings of the International Conference on Machine Learning, July 2006. N. Ratliff, J. Bagnell, and M. Zinkevich. (online) subgradient methods for structured prediction. In Eleventh International Conference on Artiﬁcial Intelligence and Statistics (AIStats), March 2007. 363 T EO , V ISHWANATHAN , S MOLA AND L E G. R¨ tsch, S. Sonnenburg, J. Srinivasan, H. Witte, K.-R. M¨ ller, R. J. Sommer, and B. Sch¨ lkopf. a u o Improving the Caenorhabditis elegans genome annotation using machine learning. PLoS Computational Biology, 3(2):e20 doi:10.1371/journal.pcbi.0030020, 2007. B. Sch¨ lkopf, J. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support o of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001. H. Schramm and J. Zowe. A version of the bundle idea for minimizing a nonsmooth function: conceptual idea, convergence analysis, numerical results. SIAM Journal of Optimization, 2:121– 152, 1992. F. Sha and F. Pereira. Shallow parsing with conditional random ﬁelds. In Proceedings of HLTNAACL, pages 213–220, Edmonton, Canada, 2003. Association for Computational Linguistics. S. Shalev-Schwartz and N. Srebro. SVM optimization: inverse dependence on training set size. In W. Cohen, A. McCallum, and S. Roweis, editors, Proceedings of the International Conference on Machine Learning. Omnipress, 2008. S. Shalev-Shwartz and Y. Singer. Online learning meets optimization in the dual. In H.U. Simon and G. Lugosi, editors, Computational Learning Theory (COLT), LNCS. Springer, 2006. extended version. S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: primal estimated sub-gradient solver for SVM. In Proceedings of the International Conference on Machine Learning, 2007. Q. Shi, L. Wang, L. Cheng, and A. J. Smola. Discriminative human action segmentation and recognition using semi-markov model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, 2008. V. Sindhwani and S. S. Keerthi. Large scale semi-supervised linear SVMs. In SIGIR ’06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 477–484, New York, NY, USA, 2006. ACM Press. A. J. Smola, S. V. N. Vishwanathan, and Q. V. Le. Bundle methods for machine learning. In D. Koller and Y. Singer, editors, Advances in Neural Information Processing Systems 20, Cambridge MA, 2007. MIT Press. I. Takeuchi, Q. V. Le, T. Sears, and A. J. Smola. Nonparametric quantile estimation. Journal of Machine Learning Research, 7, 2006. B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In S. Thrun, L. Saul, and B. Sch¨ lkopf, editors, Advances in Neural Information Processing Systems 16, pages 25–32, o Cambridge, MA, 2004. MIT Press. C.-H. Teo, Q. V. Le, A. J. Smola, and S. V. N. Vishwanathan. A scalable modular convex solver for regularized risk minimization. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD). ACM, 2007. R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Section B (Statistical Methodology), 58:267–288, 1996. 364 B UNDLE M ETHODS FOR R EGULARIZED R ISK M INIMIZATION I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453–1484, 2005. V. Vapnik, S. Golowich, and A. J. Smola. Support vector method for function approximation, regression estimation, and signal processing. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 281–287, Cambridge, MA, 1997. MIT Press. E. Voorhees. Overview of the TREC 2001 question answering track. In TREC, 2001. G. Wahba. Support vector machines, reproducing kernel Hilbert spaces and the randomized GACV. Technical Report 984, Department of Statistics, University of Wisconsin, Madison, 1997. M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Technical Report 649, UC Berkeley, Department of Statistics, September 2003. C. K. I. Williams. Prediction with Gaussian processes: from linear regression to linear prediction and beyond. In M. I. Jordan, editor, Learning and Inference in Graphical Models, pages 599–621. Kluwer Academic, 1998. J. Yu, S. V. N. Vishwanathan, S. G¨ nter, and N. N. Schraudolph. A quasi-Newton approach to u nonsmooth convex optimization. In A. McCallum and S. Roweis, editors, ICML, pages 1216– 1223. Omnipress, 2008. T. Zhang. Sequential greedy approximation for certain convex optimization problems. IEEE Transactions on Information Theory, 49(3):682–691, 2003. 365