NIPS 2009, Paper 180
Author: Gert R. Lanckriet, Bharath K. Sriperumbudur
Abstract: The concave-convex procedure (CCCP) is a majorization-minimization algorithm that solves d.c. (difference of convex functions) programs as a sequence of convex programs. In machine learning, CCCP is extensively used in many learning algorithms, such as sparse support vector machines (SVMs), transductive SVMs, and sparse principal component analysis. Though widely used in many applications, the convergence behavior of CCCP has not received much specific attention. Yuille and Rangarajan analyzed its convergence in their original paper; however, we believe the analysis is incomplete. Although the convergence of CCCP can be derived from the convergence of the d.c. algorithm (DCA), that proof is more specialized and technical than is actually required for the specific case of CCCP. In this paper, we follow a different line of reasoning and show how Zangwill’s global convergence theory of iterative algorithms provides a natural framework to prove the convergence of CCCP, allowing a more elegant and simpler proof. This underlines Zangwill’s theory as a powerful and general framework for dealing with the convergence issues of iterative algorithms, having also been used to prove the convergence of algorithms such as expectation-maximization and generalized alternating minimization. We provide a rigorous analysis of the convergence of CCCP by addressing two questions: (i) When does CCCP find a local minimum or a stationary point of the d.c. program under consideration? (ii) When does the sequence generated by CCCP converge? We also present an open problem on the issue of local convergence of CCCP.
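To make the procedure concrete, below is a minimal sketch of CCCP on a toy one-dimensional d.c. program, writing the objective as f(x) = u(x) - v(x) with u and v convex and differentiable: at each step the concave part -v is replaced by its tangent at the current iterate, and the resulting convex surrogate is minimized. The example objective x^4 - 2x^2 and the use of scipy.optimize.minimize for the convex subproblem are illustrative assumptions, not taken from the paper.

import numpy as np
from scipy.optimize import minimize

def cccp(u, grad_v, x0, max_iter=100, tol=1e-8):
    # Minimize f(x) = u(x) - v(x), with u and v convex, by repeatedly
    # majorizing -v(x) with its tangent at the current iterate x_k and
    # solving the convex surrogate problem
    #     x_{k+1} = argmin_x  u(x) - grad_v(x_k)^T x.
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    for _ in range(max_iter):
        g = grad_v(x)                               # linearization of v at x_k
        res = minimize(lambda y: u(y) - g @ y, x)   # convex subproblem
        x_new = np.atleast_1d(res.x)
        if np.linalg.norm(x_new - x) < tol:         # iterates have stabilized
            return x_new
        x = x_new
    return x

# Toy d.c. decomposition (assumed for illustration):
#     f(x) = x^4 - 2*x^2,  u(x) = x^4,  v(x) = 2*x^2.
u = lambda x: np.sum(x ** 4)
grad_v = lambda x: 4.0 * x
print(cccp(u, grad_v, x0=[3.0]))   # converges to a stationary point near 1.0

Each surrogate upper-bounds f and touches it at x_k, so the objective values are non-increasing along the iterates; this monotone-descent property is the starting point for the convergence questions the paper addresses.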
[1] D. Böhning and B. G. Lindsay. Monotonicity of quadratic-approximation algorithms. Annals of the Institute of Statistical Mathematics, 40(4):641–663, 1988.
[2] J. F. Bonnans, J. C. Gilbert, C. Lemaréchal, and C. A. Sagastizábal. Numerical Optimization: Theoretical and Practical Aspects. Springer-Verlag, 2006.
[3] P. S. Bradley and O. L. Mangasarian. Feature selection via concave minimization and support vector machines. In Proc. 15th International Conf. on Machine Learning, pages 82–90. Morgan Kaufmann, San Francisco, CA, 1998.
[4] E. J. Candès, M. Wakin, and S. Boyd. Enhancing sparsity by reweighted ℓ1 minimization. J. Fourier Anal. Appl., 2007. To appear.
[5] R. Collobert, F. Sinz, J. Weston, and L. Bottou. Large scale transductive SVMs. Journal of Machine Learning Research, 7:1687–1712, 2006.
[6] J. de Leeuw. Applications of convex analysis to multidimensional scaling. In J. R. Barra, F. Brodeau, G. Romier, and B. Van Cutsem, editors, Recent Developments in Statistics, pages 133–146, Amsterdam, The Netherlands, 1977. North Holland Publishing Company.
[7] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. B, 39:1–38, 1977.
[8] T. Pham Dinh and L. T. Hoai An. Convex analysis approach to d.c. programming: Theory, algorithms and applications. Acta Mathematica Vietnamica, 22(1):289–355, 1997.
[9] T. Pham Dinh and L. T. Hoai An. D.c. optimization algorithms for solving the trust region subproblem. SIAM Journal on Optimization, 8:476–505, 1998.
[10] C. B. Do, Q. V. Le, C. H. Teo, O. Chapelle, and A. J. Smola. Tighter bounds for structured estimation. In Advances in Neural Information Processing Systems 21, 2009. To appear.
[11] G. Fung and O. L. Mangasarian. Semi-supervised support vector machines for unlabeled data classification. Optimization Methods and Software, 15:29–44, 2001.
[12] A. Gunawardana and W. Byrne. Convergence theorems for generalized alternating minimization procedures. Journal of Machine Learning Research, 6:2049–2073, 2005.
[13] W. J. Heiser. Correspondence analysis with least absolute residuals. Comput. Stat. Data Analysis, 5:337–356, 1987.
[14] P. J. Huber. Robust Statistics. John Wiley, New York, 1981.
[15] D. R. Hunter and K. Lange. A tutorial on MM algorithms. The American Statistician, 58:30–37, 2004.
[16] D. R. Hunter and R. Li. Variable selection using MM algorithms. Annals of Statistics, 33:1617–1642, 2005.
[17] K. Lange, D. R. Hunter, and I. Yang. Optimization transfer using surrogate objective functions (with discussion). Journal of Computational and Graphical Statistics, 9(1):1–59, 2000.
[18] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 556–562. MIT Press, Cambridge, 2001.
[19] X.-L. Meng. Discussion on “optimization transfer using surrogate objective functions”. Journal of Computational and Graphical Statistics, 9(1):35–43, 2000.
[20] R. R. Meyer. Sufficient conditions for the convergence of monotonic mathematical programming algorithms. Journal of Computer and System Sciences, 12:108–121, 1976.
[21] M. Minoux. Mathematical Programming: Theory and Algorithms. John Wiley & Sons Ltd., 1986.
[22] J. Neumann, C. Schnörr, and G. Steidl. Combined SVM-based feature selection and classification. Machine Learning, 61:129–150, 2005.
[23] J. M. Ortega and W. C. Rheinboldt. Iterative Solution of Nonlinear Equations in Several Variables. Academic Press, New York, 1970.
[24] R. Salakhutdinov, S. Roweis, and Z. Ghahramani. On the convergence of bound optimization algorithms. In Proc. 19th Conference on Uncertainty in Artificial Intelligence, pages 509–516, 2003.
[25] F. Sha, Y. Lin, L. K. Saul, and D. D. Lee. Multiplicative updates for nonnegative quadratic programming. Neural Computation, 19:2004–2031, 2007.
[26] A. J. Smola, S. V. N. Vishwanathan, and T. Hofmann. Kernel methods for missing variables. In Proc. of the Tenth International Workshop on Artificial Intelligence and Statistics, 2005.
[27] B. K. Sriperumbudur, D. A. Torres, and G. R. G. Lanckriet. Sparse eigen methods by d.c. programming. In Proc. of the 24th Annual International Conference on Machine Learning, 2007.
[28] L. Wang, X. Shen, and W. Pan. On transductive support vector machines. In J. Verducci, X. Shen, and J. Lafferty, editors, Prediction and Discovery. American Mathematical Society, 2007.
[29] C. F. J. Wu. On the convergence properties of the EM algorithm. Annals of Statistics, 11(1):95–103, 1983.
[30] A. L. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 15:915–936, 2003.
[31] W. I. Zangwill. Nonlinear Programming: A Unified Approach. Prentice-Hall, Englewood Cliffs, N.J., 1969.