jmlr2012-98-reference: Regularized Bundle Methods for Convex and Non-Convex Risks (knowledge graph by maker-knowledge-mining)
Source: pdf
Authors: Trinh Minh Tri Do, Thierry Artières
Abstract: Machine learning is most often cast as an optimization problem. Ideally, one expects a convex objective function, so that efficient convex optimizers with strong guarantees, such as the absence of local optima, can be used. Yet non-convexity is very frequent in practice, and it may sometimes be inappropriate to seek convexity at any price. Alternatively, one can decide not to limit a priori the modeling expressivity to models whose learning can be solved by convex optimization, and rely instead on non-convex optimization algorithms. The main motivation of this work is to provide efficient and scalable algorithms for non-convex optimization. We focus on regularized unconstrained optimization problems, which cover a large number of modern machine learning problems such as logistic regression, conditional random fields, and large margin estimation. We propose a novel algorithm for minimizing a regularized objective that is able to handle convex and non-convex, smooth and non-smooth risks. The algorithm is based on the cutting-plane technique and on the idea of exploiting the regularization term in the objective function. It may be thought of as a limited-memory extension of convex regularized bundle methods that also handles non-convex risks. When the risk is convex, the algorithm is proved to converge to a stationary solution with accuracy ε at a rate O(1/λε), where λ is the regularization parameter of the objective function, under the assumption of a Lipschitz empirical risk. When the risk is non-convex, such a proof is more difficult and requires a stronger, more debatable assumption. We nonetheless provide experimental results on artificial test problems, and on five standard and difficult machine learning problems cast as convex and non-convex optimization problems, showing that our algorithm compares well in practice with state-of-the-art optimization algorithms.
Keywords: optimization, non-convex, non-smooth, cutting plane, bundle method, regularized risk
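To make the abstract's central objects concrete, the sketch below implements the classic convex regularized bundle (cutting-plane) iteration that the paper extends, in the spirit of Teo et al. (2007) and Smola et al. (2008): minimize J(w) = (λ/2)‖w‖² + R(w) by building a piecewise-linear lower bound on the risk R from subgradient planes and re-minimizing the regularized model via its simplex-constrained dual QP. This is a minimal illustrative sketch, not the authors' limited-memory non-convex algorithm; the names bmrm_sketch and risk_and_subgrad, the SLSQP dual solve, and the toy hinge-loss usage are all assumptions introduced here for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def bmrm_sketch(risk_and_subgrad, dim, lam, max_iters=100, tol=1e-4):
    """Cutting-plane minimization of J(w) = (lam/2)||w||^2 + R(w).

    risk_and_subgrad(w) is a user-supplied oracle returning
    (R(w), a subgradient of R at w).  Each iteration adds the plane
    R(w) >= a.w + b (tight at the current iterate), then minimizes the
    regularizer plus the piecewise-linear lower bound on R.
    """
    w = np.zeros(dim)
    planes_a, planes_b = [], []
    best = np.inf                         # best true objective seen so far
    for _ in range(max_iters):
        r, a = risk_and_subgrad(w)
        best = min(best, 0.5 * lam * (w @ w) + r)
        planes_a.append(a)
        planes_b.append(r - a @ w)
        A, b = np.array(planes_a), np.array(planes_b)

        # Dual of  min_w (lam/2)||w||^2 + max_i (a_i.w + b_i):
        #   max_{alpha >= 0, sum(alpha) = 1}  -||A^T alpha||^2/(2 lam) + b.alpha
        def neg_dual(alpha):
            v = A.T @ alpha
            return v @ v / (2.0 * lam) - b @ alpha

        n = len(planes_b)
        res = minimize(neg_dual, np.full(n, 1.0 / n), method="SLSQP",
                       bounds=[(0.0, 1.0)] * n,
                       constraints={"type": "eq",
                                    "fun": lambda al: al.sum() - 1.0})
        w = -(A.T @ res.x) / lam          # recover the primal point

        # Stop when the gap between the best objective value and the
        # cutting-plane model's optimum falls below the tolerance.
        model = 0.5 * lam * (w @ w) + np.max(A @ w + b)
        if best - model < tol:
            break
    return w

# Toy usage on assumed synthetic data: L2-regularized hinge-loss risk,
# i.e., a plain linear SVM, one of the convex instances the paper covers.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X @ np.ones(5) + 0.1 * rng.normal(size=200))

def hinge(w):
    margins = 1.0 - y * (X @ w)
    active = margins > 0
    risk = np.mean(np.maximum(margins, 0.0))
    grad = -(X[active].T @ y[active]) / len(y)   # a valid subgradient
    return risk, grad

w_star = bmrm_sketch(hinge, dim=5, lam=0.1)
```

Because the risk here is convex, the model gap upper-bounds the true suboptimality, which is what drives the O(1/λε) rate cited in the abstract; for non-convex risks that guarantee no longer holds as-is, which is precisely the difficulty the paper addresses.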
References:
A. Argyriou, R. Hauser, C. A. Micchelli, and M. Pontil. A DC-programming algorithm for kernel selection. In Proceedings of the Twenty-third International Conference on Machine Learning, pages 41–48, 2006.
Y. Bengio and Y. LeCun. Scaling Learning Algorithms Towards AI. MIT Press, 2007.
D. P. Bertsekas, A. Nedić, and A. E. Ozdaglar. Convex Analysis and Optimization. Athena Scientific, Belmont, MA, 2003.
L. Bottou. Stochastic gradient descent examples on toy problems, 2008. URL http://leon.bottou.org/projects/sgd.
O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, Cambridge, MA, 2006. URL http://www.kyb.tuebingen.mpg.de/ssl-book.
R. Collobert, F. Sinz, J. Weston, and L. Bottou. Trading convexity for scalability. In Proceedings of the Twenty-third International Conference on Machine Learning, pages 201–208. ACM Press, 2006.
T. M. T. Do and T. Artières. A fast method for training linear SVM in the primal. In Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases, Part I, pages 272–287, Berlin, Heidelberg, 2008. Springer-Verlag.
T. M. T. Do and T. Artières. Large margin training for hidden Markov models with partially observed states. In Léon Bottou and Michael Littman, editors, Proceedings of the Twenty-sixth International Conference on Machine Learning, pages 265–272, Montreal, June 2009. Omnipress.
T. M. T. Do and T. Artières. Neural conditional random fields. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010.
T. M. T. Do and T. Artières. Regularized bundle methods for convex and non-convex risks: Additional material to the JMLR paper. Technical report, LIP6, Pierre and Marie Curie University, Paris, 2012. URL http://www.lip6.fr/lip6/reports/2012/lip6-2012-001.pdf.
V. Franc and S. Sonnenburg. Optimized cutting plane algorithm for support vector machines. In Proceedings of the Twenty-fifth International Conference on Machine Learning, pages 320–327, New York, NY, USA, 2008. ACM.
M. Gaudioso and M. F. Monaco. Variants to the cutting plane approach for convex nondifferentiable optimization. Optimization, 25:65–75, 1992.
M. Haarala, K. Miettinen, and M. M. Mäkelä. New limited memory bundle method for large-scale nonsmooth optimization. Optimization Methods and Software, 19(6):673–692, December 2004.
G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, July 2006.
R. Horst and N. V. Thoai. DC programming: Overview. Journal of Optimization Theory and Applications, 103(1):1–43, 1999.
J. Hu, S. G. Lim, and M. K. Brown. Writer independent on-line handwriting recognition using an HMM approach. Pattern Recognition, 33(1):133–147, 2000.
K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In Proceedings of the Twelfth International Conference on Computer Vision. IEEE, 2009.
H. Jiang and X. Li. Incorporating training errors for large margin HMMs under semi-definite programming framework. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, volume 4, pages IV–629. IEEE, 2007.
T. Joachims. Transductive inference for text classification using support vector machines. In Proceedings of the Sixteenth International Conference on Machine Learning, pages 200–209, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.
T. Joachims. Training linear SVMs in linear time. In Proceedings of the Twelfth International Conference on Knowledge Discovery and Data Mining, pages 217–226, 2006.
T. Joachims, T. Finley, and C.-N. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27–59, 2009.
R. H. Kassel. A Comparison of Approaches to On-line Handwritten Character Recognition. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 1995.
K. C. Kiwiel. An aggregate subgradient method for nonsmooth convex minimization. Mathematical Programming, 27:320–341, 1983.
K. C. Kiwiel. Methods of Descent for Nondifferentiable Optimization. Springer-Verlag, 1985.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
L. Lukšan and J. Vlček. Introduction to nonsmooth analysis: Theory and algorithms. Technical report, Università degli Studi di Bergamo, 2000.
M. M. Mäkelä. Survey of bundle methods for nonsmooth optimization. Optimization Methods and Software, 17:1–29, 2002.
M. M. Mäkelä and P. Neittaanmäki. Nonsmooth Optimization: Analysis and Algorithms with Applications to Optimal Control. World Scientific Publishing Co., 1992.
L. R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1990.
H. Schramm and J. Zowe. A version of the bundle idea for minimizing a nonsmooth function: Conceptual idea, convergence analysis, numerical results. SIAM Journal on Optimization, 2:121–152, 1992.
F. Sha and L. K. Saul. Large margin hidden Markov models for automatic speech recognition. Advances in Neural Information Processing Systems, 19:1249, 2007.
S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the Twenty-fourth International Conference on Machine Learning, pages 807–814, New York, USA, 2007. ACM Press.
A. Smola, S. V. N. Vishwanathan, and Q. V. Le. Bundle methods for machine learning. Advances in Neural Information Processing Systems, 20:1377–1384, 2008.
B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. Advances in Neural Information Processing Systems, 16, 2004.
C. H. Teo, Q. V. Le, A. Smola, and S. V. N. Vishwanathan. A scalable modular convex solver for regularized risk minimization. In Proceedings of the Thirteenth International Conference on Knowledge Discovery and Data Mining, pages 727–736, New York, USA, 2007. ACM.
M. Weimer, A. Karatzoglou, Q. V. Le, and A. Smola. CoFiRank: Maximum margin matrix factorization for collaborative ranking. Advances in Neural Information Processing Systems, 20, 2008.
L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum margin clustering. Advances in Neural Information Processing Systems, 17:1537–1544, 2004.
L. Xu, D. Wilkinson, F. Southey, and D. Schuurmans. Discriminative unsupervised learning of structured predictors. In Proceedings of the Twenty-third International Conference on Machine Learning, pages 1057–1064. ACM, 2006.
Z. Xu, R. Jin, J. Zhu, I. King, and M. Lyu. Efficient convex relaxation for transductive support vector machine. Advances in Neural Information Processing Systems, 20:1641–1648, 2008.
D. Yu and L. Deng. Large-margin discriminative training of hidden Markov models for speech recognition. In Proceedings of the International Conference on Semantic Computing, pages 429–438, Washington, DC, USA, 2007. IEEE Computer Society.
A. L. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 15(4):915–936, 2003.
B. Zhao, F. Wang, and C. Zhang. Efficient multiclass maximum margin clustering. In Proceedings of the Twenty-fifth International Conference on Machine Learning, pages 1248–1255, Helsinki, Finland, 2008.