jmlr2005-19-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, Suvrit Sra
Abstract: Several large-scale data mining applications, such as text categorization and gene expression analysis, involve high-dimensional data that is also inherently directional in nature. Often such data is L2 normalized so that it lies on the surface of a unit hypersphere. Popular models such as (mixtures of) multivariate Gaussians are inadequate for characterizing such data. This paper proposes a generative mixture-model approach to clustering directional data based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. In particular, we derive and analyze two variants of the Expectation Maximization (EM) framework for estimating the mean and concentration parameters of this mixture. Numerical estimation of the concentration parameters is non-trivial in high dimensions since it involves functional inversion of ratios of Bessel functions. We also formulate two clustering algorithms corresponding to the variants of EM that we derive. Our approach provides a theoretical basis for the use of cosine similarity that has been widely employed by the information retrieval community, and obtains the spherical kmeans algorithm (kmeans with cosine similarity) as a special case of both variants. Empirical results on clustering of high-dimensional text and gene-expression data based on a mixture of vMF distributions show that the ability to estimate the concentration parameter for each vMF component, which is not present in existing approaches, yields superior results, especially for difficult clustering tasks in high-dimensional spaces.

Keywords: clustering, directional distributions, mixtures, von Mises-Fisher, expectation maximization, maximum likelihood, high dimensional data

© 2005 Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh and Suvrit Sra.
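The abstract's two key computational points can be sketched in a few lines: the closed-form approximation to the concentration MLE, kappa_hat = r_bar(d - r_bar^2)/(1 - r_bar^2), which is the approximation derived in this paper to avoid inverting the ratio of Bessel functions, and spherical kmeans (kmeans with cosine similarity), which the paper obtains as a special case of vMF EM. The function names and the simple kmeans loop below are illustrative choices, not the authors' reference implementation.

```python
import numpy as np

def kappa_approx(r_bar, d):
    """Closed-form approximation to the vMF concentration MLE,
    kappa_hat = r_bar * (d - r_bar**2) / (1 - r_bar**2),
    which sidesteps functional inversion of the Bessel-function ratio."""
    return r_bar * (d - r_bar**2) / (1 - r_bar**2)

def estimate_kappa(xs):
    """Estimate kappa for unit-norm rows xs via the mean resultant length."""
    n, d = xs.shape
    r_bar = np.linalg.norm(xs.sum(axis=0)) / n  # mean resultant length in [0, 1)
    return kappa_approx(r_bar, d)

def spherical_kmeans(xs, k, init=None, iters=20, seed=0):
    """kmeans with cosine similarity: the hard-assignment special case of
    vMF mixture EM described in the abstract."""
    rng = np.random.default_rng(seed)
    if init is None:
        init = rng.choice(len(xs), size=k, replace=False)
    mus = xs[np.asarray(init)].copy()
    for _ in range(iters):
        labels = (xs @ mus.T).argmax(axis=1)    # nearest center by cosine
        for j in range(k):
            members = xs[labels == j]
            if len(members):
                m = members.sum(axis=0)
                mus[j] = m / np.linalg.norm(m)  # mean direction, back on sphere
    return labels, mus
```

On L2-normalized data the dot product xs @ mus.T is exactly cosine similarity, and each center update is the normalized mean direction of its cluster, mirroring the M-step for the vMF mean parameter.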
References (Clustering with von Mises-Fisher Distributions):

M. Abramowitz and I. A. Stegun, editors. Handbook of Mathematical Functions. Dover Publications, New York, 1974.
S. I. Amari. Information geometry of the EM and em algorithms for neural networks. Neural Networks, 8(9):1379–1408, 1995.
A. Banerjee and J. Ghosh. Frequency sensitive competitive learning for clustering on high-dimensional hyperspheres. In Proceedings of the International Joint Conference on Neural Networks, pages 1590–1595, May 2002.
A. Banerjee and J. Ghosh. Frequency sensitive competitive learning for scalable balanced clustering on high-dimensional hyperspheres. IEEE Transactions on Neural Networks, 15(3):702–719, May 2004.
A. Banerjee and J. Langford. An objective evaluation criterion for clustering. In Proceedings of the 10th International Conference on Knowledge Discovery and Data Mining (KDD), pages 515–520, 2004.
A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh. Clustering with Bregman divergences. In Proceedings of the 4th SIAM International Conference on Data Mining, pages 234–245, 2004.
J. Bilmes. A gentle tutorial on the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report ICSI-TR-97-021, International Computer Science Institute, Berkeley, CA, 1997.
A. Blum and J. Langford. PAC-MDL bounds. In Proceedings of the 16th Annual Conference on Learning Theory (COLT), 2003.
P. S. Bradley, K. P. Bennett, and A. Demiriz. Constrained k-means clustering. Technical report, Microsoft Research, May 2000.
D. Coleman, X. Dong, J. Hardin, D. Rocke, and D. Woodruff. Some computational issues in cluster analysis with no a priori metric. Computational Statistics and Data Analysis, 31:1–11, 1999.
M. Collins. The EM algorithm. In fulfillment of the Written Preliminary Exam II requirement, September 1997.
T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, 1991.
S. Dasgupta. Learning mixtures of Gaussians. In IEEE Symposium on Foundations of Computer Science, 1999.
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.
I. S. Dhillon, J. Fan, and Y. Guan. Efficient clustering of very large document collections. In R. Grossman, C. Kamath, V. Kumar, and R. Namburu, editors, Data Mining for Scientific and Engineering Applications. Kluwer Academic Publishers, 2001.
I. S. Dhillon, Y. Guan, and J. Kogan. Iterative clustering of high dimensional text data augmented by local search. In Proceedings of the 2002 IEEE International Conference on Data Mining, 2002a.
I. S. Dhillon, E. M. Marcotte, and U. Roshan. Diametrical clustering for identifying anti-correlated gene clusters. Technical Report TR 2002-49, Department of Computer Sciences, University of Texas, September 2002b.
I. S. Dhillon and D. S. Modha. Concept decompositions for large sparse text data using clustering. Machine Learning, 42(1):143–175, 2001.
I. S. Dhillon and S. Sra. Modeling data using directional distributions. Technical Report TR-03-06, Department of Computer Sciences, University of Texas at Austin, Austin, TX, 2003.
B. E. Dom. An information-theoretic external cluster-validity measure. Technical Report RJ 10219, IBM Research, 2001.
M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, 95:14863–14868, 1998.
N. I. Fisher. Statistical Analysis of Circular Data. Cambridge University Press, 1996.
J. Ghosh. Scalable clustering methods for data mining. In Nong Ye, editor, Handbook of Data Mining, pages 247–277. Lawrence Erlbaum, 2003.
GoMiner03. http://discover.nci.nih.gov/gominer/.
T. R. Hughes, M. J. Marton, A. R. Jones, C. J. Roberts, R. Stoughton, C. D. Armour, H. A. Bennett, E. Coffey, H. Dai, D. D. Shoemaker, D. Gachotte, K. Chakraburtty, J. Simon, M. Bard, and S. H. Friend. Functional discovery via a compendium of expression profiles. Cell, 102:109–126, 2000.
P. Indyk. A sublinear-time approximation scheme for clustering in metric spaces. In 40th Symposium on Foundations of Computer Science, 1999.
T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In M. J. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems, volume 11, pages 487–493. MIT Press, 1999.
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, New Jersey, 1988.
R. Kannan, S. Vempala, and A. Vetta. On clusterings: good, bad and spectral. In 41st Annual IEEE Symposium on Foundations of Computer Science, pages 367–377, 2000.
M. Kearns, Y. Mansour, and A. Ng. An information-theoretic analysis of hard and soft assignment methods for clustering. In Proceedings of the 13th Annual Conference on Uncertainty in Artificial Intelligence (UAI), 1997.
I. Lee, S. V. Date, A. T. Adai, and E. M. Marcotte. A probabilistic functional network of yeast genes. Science, 306(5701):1555–1558, 2004.
K. V. Mardia. Characteristics of directional distributions. In Statistical Distributions in Scientific Work, volume 3, pages 365–385. Reidel, Dordrecht, 1975.
K. V. Mardia and P. Jupp. Directional Statistics. John Wiley and Sons Ltd., 2nd edition, 2000.
G. J. McLachlan. The classification and mixture maximum likelihood approaches to cluster analysis. In P. R. Krishnaiah and L. N. Kanal, editors, Handbook of Statistics, volume 2, pages 199–208. North-Holland, 1982.
G. J. McLachlan and T. Krishnan. The EM Algorithm and Extensions. Wiley-Interscience, 1997.
G. J. McLachlan and D. Peel. Finite Mixture Models. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics Section. John Wiley & Sons, 2000.
M. Meilă. Comparing clusterings by the variation of information. In COLT, 2003.
J. A. Mooney, P. J. Helms, and I. T. Jolliffe. Fitting mixtures of von Mises distributions: a case study involving sudden infant death syndrome. Computational Statistics & Data Analysis, 41:505–513, 2003.
R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan, editor, Learning in Graphical Models, pages 355–368. MIT Press, 1998.
20 Newsgroups. http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html.
K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3):103–134, 2000.
D. Peel, W. J. Whiten, and G. J. McLachlan. Fitting mixtures of Kent distributions to aid in joint set identification. Journal of the American Statistical Association, 96:56–63, 2001.
J. H. Piater. Visual Feature Learning. PhD thesis, University of Massachusetts, June 2001.
C. R. Rao. Linear Statistical Inference and its Applications. Wiley, New York, 2nd edition, 1973.
E. Rasmussen. Clustering algorithms. In W. Frakes and R. Baeza-Yates, editors, Information Retrieval: Data Structures and Algorithms, pages 419–442. Prentice Hall, New Jersey, 1992.
G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523, 1988.
G. Salton and M. J. McGill. Introduction to Modern Retrieval. McGraw-Hill Book Company, 1983.
B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International World Wide Web Conference (WWW10), pages 285–295, 2001.
B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 2001.
E. Segal, A. Battle, and D. Koller. Decomposing gene expression into cellular processes. In Proceedings of the 8th Pacific Symposium on Biocomputing (PSB), 2003.
R. Sharan and R. Shamir. CLICK: A clustering algorithm with applications to gene expression analysis. In Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB), pages 307–316. AAAI Press, 2000.
K. Shimizu and K. Iida. Pearson type VII distributions on spheres. Communications in Statistics: Theory & Methods, 31(4):513–526, 2002.
J. Sinkkonen and S. Kaski. Clustering based on conditional distributions in an auxiliary space. Neural Computation, 14:217–239, 2001.
P. Smyth. Clustering sequences with hidden Markov models. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems, volume 9, pages 648–654. MIT Press, 1997.
M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In KDD Workshop on Text Mining, 2000.
A. Strehl, J. Ghosh, and R. Mooney. Impact of similarity measures on web-page clustering. In Proceedings of the 17th National Conference on Artificial Intelligence: Workshop of AI for Web Search (AAAI 2000), pages 58–64. AAAI, July 2000.
C. S. Wallace and D. L. Dowe. MML clustering of multi-state, Poisson, von Mises circular and Gaussian distributions. Statistics and Computing, 10(1):73–83, January 2000.
G. N. Watson. A Treatise on the Theory of Bessel Functions. Cambridge University Press, 2nd edition, 1995.
A. T. A. Wood. Simulation of the von Mises Fisher distribution. Communications in Statistics: Simulation and Computation, 23:157–164, 1994.
Y. Zhao and G. Karypis. Empirical and theoretical comparisons of selected criterion functions for document clustering. Machine Learning, 55(3):311–331, June 2004.
S. Zhong and J. Ghosh. A unified framework for model-based clustering. Journal of Machine Learning Research, 4:1001–1037, November 2003a.
S. Zhong and J. Ghosh. A comparative study of generative models for document clustering. In Workshop on Clustering High Dimensional Data: Third SIAM Conference on Data Mining, April 2003b.