NIPS 2010, paper 70
Authors: Armand Joulin, Jean Ponce, Francis R. Bach
Abstract: Dimensionality reduction is commonly used in the setting of multi-label supervised classification to control the learning capacity and to provide a meaningful representation of the data. We introduce a simple forward probabilistic model which is a multinomial extension of reduced rank regression, and show that this model provides a probabilistic interpretation of discriminative clustering methods, with added benefits in terms of the number of hyperparameters and of optimization. While the expectation-maximization (EM) algorithm is commonly used to learn these probabilistic models, it usually leads to local maxima because it relies on a non-convex cost function. To avoid this problem, we introduce a local approximation of this cost function, which in turn leads to a quadratic non-convex optimization problem over a product of simplices. To maximize these quadratic functions, we propose an efficient algorithm based on convex relaxations and low-rank representations of the data, capable of handling large-scale problems. Experiments on text document classification show that the new model outperforms other supervised dimensionality reduction methods, while simulations on unsupervised clustering show that our probabilistic formulation has better properties than existing discriminative clustering methods.
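To make the geometry concrete: optimizing a quadratic over a product of simplices means each block of variables must stay non-negative and sum to one, so the basic building block of any first-order scheme is Euclidean projection onto the simplex. The sketch below is a generic illustration of that subproblem, assuming a standard sort-based projection and a plain projected-gradient ascent loop on a single simplex; it is not the paper's convex relaxation, and the function names are hypothetical. Note that maximizing a quadratic over a simplex (a "standard QP") is NP-hard in general, so local ascent of this kind only finds a stationary point.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1},
    using the classical sort-and-threshold algorithm (O(n log n))."""
    u = np.sort(v)[::-1]                    # sort descending
    css = np.cumsum(u) - 1.0                # cumulative sums minus the target total
    ind = np.arange(1, len(v) + 1)
    cond = u - css / ind > 0                # coordinates kept positive
    rho = ind[cond][-1]                     # last index satisfying the condition
    theta = css[cond][-1] / rho             # shift that enforces sum-to-one
    return np.maximum(v - theta, 0.0)

def local_maximize_quadratic(A, x0, step=0.05, iters=500):
    """Projected gradient ascent on x -> x^T A x over the simplex.
    A local heuristic only: non-convex maximization, no global guarantee."""
    x = project_simplex(np.asarray(x0, dtype=float))
    sym = A + A.T                           # gradient of x^T A x
    for _ in range(iters):
        x = project_simplex(x + step * sym @ x)
    return x

# With A = I, the maximizers are the simplex vertices; ascent from an
# interior point drifts to the vertex of the largest initial coordinate.
x = local_maximize_quadratic(np.eye(3), [0.6, 0.3, 0.1])
```

For a product of K simplices, as in the abstract's formulation, the same projection is simply applied row-wise to a K-row matrix of variables after each gradient step.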