Source: pdf
Author: Danial Lashkari, Polina Golland
Abstract: Clustering is often formulated as the maximum likelihood estimation of a mixture model that explains the data. The EM algorithm, widely used to solve the resulting optimization problem, is inherently a gradient-descent method and is sensitive to initialization: the resulting solution is a local optimum in the neighborhood of the initial guess. This sensitivity to initialization presents a significant challenge in clustering large data sets into many clusters. In this paper, we present a different approach to approximate mixture fitting for clustering. We introduce an exemplar-based likelihood function that approximates the exact likelihood. This formulation leads to a convex minimization problem and an efficient algorithm with guaranteed convergence to the globally optimal solution. The resulting clustering can be thought of as a probabilistic mapping of the data points to the set of exemplars that minimizes the average distance and the information-theoretic cost of mapping. We present experimental results illustrating the performance of our algorithm and comparing it with the conventional approach to mixture model clustering.
[1] J. Puzicha, T. Hofmann, and J. M. Buhmann, “Theory of proximity based clustering: Structure detection by optimization,” Pattern Recognition, Vol. 33, No. 4, pp. 617–634, 2000.
[2] A. Y. Ng, M. I. Jordan, and Y. Weiss, “On Spectral Clustering: Analysis and an Algorithm,” Advances in Neural Information Processing Systems, Vol. 14, pp. 849–856, 2001.
[3] I. Csiszár and P. Shields, “Information Theory and Statistics: A Tutorial,” Foundations and Trends in Communications and Information Theory, Vol. 1, No. 4, pp. 417–528, 2004.
[4] M. Meilă and D. Heckerman, “An Experimental Comparison of Model-Based Clustering Methods,” Machine Learning, Vol. 42, No. 1–2, pp. 9–29, 2001.
[5] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
[6] B. J. Frey and D. Dueck, “Clustering by Passing Messages Between Data Points,” Science, Vol. 315, No. 5814, pp. 972–976, 2007.
[7] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, “Clustering with Bregman Divergences,” Journal of Machine Learning Research, Vol. 6, No. 6, pp. 1705–1749, 2005.
[8] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley, New York, 1991.
[9] K. Rose, “Deterministic Annealing for Clustering, Compression, Classification, Regression, and Related Optimization Problems,” Proceedings of the IEEE, Vol. 86, No. 11, pp. 2210–2239, 1998.
[10] R. E. Blahut, “Computation of Channel Capacity and Rate-Distortion Functions,” IEEE Transactions on Information Theory, Vol. IT-18, No. 4, pp. 460–473, 1972.
[11] P. M. Pardalos and J. B. Rosen, “Methods for Global Concave Minimization: A Bibliographic Survey,” SIAM Review, Vol. 28, No. 3, pp. 367–379, 1986.