jmlr jmlr2005 jmlr2005-20 jmlr2005-20-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, Joydeep Ghosh
Abstract: A wide variety of distortion functions, such as squared Euclidean distance, Mahalanobis distance, Itakura-Saito distance and relative entropy, have been used for clustering. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroid-based parametric clustering approaches, such as classical kmeans, the Linde-Buzo-Gray (LBG) algorithm and information-theoretic clustering, which arise by special choices of the Bregman divergence. The algorithms maintain the simplicity and scalability of the classical kmeans algorithm, while generalizing the method to a large class of clustering loss functions. This is achieved by first posing the hard clustering problem in terms of minimizing the loss in Bregman information, a quantity motivated by rate distortion theory, and then deriving an iterative algorithm that monotonically decreases this loss. In addition, we show that there is a bijection between regular exponential families and a large class of Bregman divergences, that we call regular Bregman divergences. This result enables the development of an alternative interpretation of an efficient EM scheme for learning mixtures of exponential family distributions, and leads to a simple soft clustering algorithm for regular Bregman divergences. Finally, we discuss the connection between rate distortion theory and Bregman clustering and present an information theoretic analysis of Bregman clustering algorithms in terms of a trade-off between compression and loss in Bregman information.
Keywords: clustering, Bregman divergences, Bregman information, exponential families, expectation maximization, information theory
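The hard clustering algorithm described in the abstract keeps the two-step structure of kmeans: assign each point to the representative with the smallest Bregman divergence, then recompute each representative as the arithmetic mean of its assigned points, which the paper shows is the optimal representative for every Bregman divergence. The following is a minimal Python sketch of that scheme; the function names (bregman_kmeans, squared_euclidean, kl_divergence) and the NumPy-based setup are illustrative assumptions, not code from the paper.

```python
# Minimal sketch of Bregman hard clustering in the style of kmeans.
# Assumptions: NumPy is available; the divergence functions and names below
# are illustrative choices, not taken from the paper.
import numpy as np

def squared_euclidean(x, mu):
    # Bregman divergence generated by phi(x) = ||x||^2 (classical kmeans).
    return np.sum((x - mu) ** 2, axis=-1)

def kl_divergence(x, mu, eps=1e-12):
    # Relative entropy, the Bregman divergence generated by the negative
    # Shannon entropy; rows of x and mu are assumed to be probability vectors.
    return np.sum(x * (np.log(x + eps) - np.log(mu + eps)), axis=-1)

def bregman_kmeans(X, k, divergence, n_iter=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize representatives with k randomly chosen data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    assign = np.full(len(X), -1)
    for _ in range(n_iter):
        # Assignment step: nearest representative under the chosen divergence.
        dists = np.stack([divergence(X, c) for c in centroids], axis=1)
        new_assign = dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break
        assign = new_assign
        # Re-estimation step: the arithmetic mean of each cluster is optimal
        # for any Bregman divergence, so the update is divergence-independent.
        for j in range(k):
            members = X[assign == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, assign

# Example usage on random simplex data (assumed, for illustration only):
# X = np.random.default_rng(1).random((200, 5))
# X /= X.sum(axis=1, keepdims=True)
# centroids, labels = bregman_kmeans(X, k=3, divergence=kl_divergence)
```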
N. I. Akhiezer. The Classical Moment Problem and Some Related Questions in Analysis. Hafner Publishing Company, 1965.
S. Amari. Information geometry of the EM and em algorithms for neural networks. Neural Networks, 8(9):1379–1408, 1995.
S. Amari and H. Nagaoka. Methods of Information Geometry. American Mathematical Society, 2001.
S. Arimoto. An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Transactions on Information Theory, 18:14–20, 1972.
K. S. Azoury and M. K. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3):211–246, 2001.
A. Banerjee, I. Dhillon, J. Ghosh, and S. Merugu. An information theoretic analysis of maximum likelihood mixture estimation for exponential families. In Proc. 21st International Conference on Machine Learning (ICML), 2004a.
A. Banerjee, X. Guo, and H. Wang. Optimal Bregman prediction and Jensen's equality. In Proc. International Symposium on Information Theory (ISIT), 2004b.
A. Banerjee, X. Guo, and H. Wang. On the optimality of conditional expectation as a Bregman predictor. IEEE Transactions on Information Theory, 51(7):2664–2669, July 2005.
O. Barndorff-Nielsen. Information and Exponential Families in Statistical Theory. Wiley Publishers, 1978.
C. Berg, J. Christensen, and P. Ressel. Harmonic Analysis on Semigroups: Theory of Positive Definite and Related Functions. Springer-Verlag, 1984.
T. Berger. Rate Distortion Theory: A Mathematical Basis for Data Compression. Prentice-Hall, 1971.
T. Berger and J. D. Gibson. Lossy source coding. IEEE Transactions on Information Theory, 44(6):2691–2723, 1998.
J. Bilmes. A gentle tutorial on the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report ICSI-TR-97-02, University of California, Berkeley, 1997.
R. E. Blahut. Computation of channel capacity and rate-distortion functions. IEEE Transactions on Information Theory, 18:460–473, 1972.
L. M. Bregman. The relaxation method of finding the common points of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7:200–217, 1967.
A. Buzo, A. H. Gray, R. M. Gray, and J. D. Markel. Speech coding based upon vector quantization. IEEE Transactions on Acoustics, Speech and Signal Processing, 28(5):562–574, 1980.
Y. Censor and S. Zenios. Parallel Optimization: Theory, Algorithms, and Applications. Oxford University Press, 1998.
M. Collins. The EM algorithm. Technical report, Department of Computer and Information Science, University of Pennsylvania, 1997.
M. Collins, S. Dasgupta, and R. Schapire. A generalization of principal component analysis to the exponential family. In Proc. of the 14th Annual Conference on Neural Information Processing Systems (NIPS), 2001.
T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, 1991.
I. Csiszár. On the computation of rate distortion functions. IEEE Transactions on Information Theory, IT-20:122–124, 1974.
I. Csiszár. Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems. The Annals of Statistics, 19(4):2032–2066, 1991.
I. Csiszár. Generalized projections for non-negative functions. Acta Mathematica Hungarica, 68(1-2):161–185, 1995.
I. Csiszár and G. Tusnády. Information geometry and alternating minimization procedures. Statistics and Decisions, Supplement Issue, 1(1):205–237, 1984.
A. Devinatz. The representation of functions as Laplace-Stieltjes integrals. Duke Mathematical Journal, 24:481–498, 1955.
I. Dhillon, S. Mallela, and R. Kumar. A divisive information-theoretic feature clustering algorithm for text classification. Journal of Machine Learning Research, 3(4):1265–1287, 2003.
R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons, 2001.
W. Ehm, M. G. Genton, and T. Gneiting. Stationary covariances associated with exponentially convex functions. Bernoulli, 9(4):607–615, 2003.
J. Forster and M. K. Warmuth. Relative expected instantaneous loss bounds. In Proc. of the 13th Annual Conference on Computational Learning Theory (COLT), pages 90–99, 2000.
A. Gersho and R. M. Gray. Vector Quantization and Signal Compression. Kluwer Academic Publishers, 1992.
R. M. Gray and D. L. Neuhoff. Quantization. IEEE Transactions on Information Theory, 44(6):2325–2383, 1998.
P. D. Grünwald and A. Dawid. Game theory, maximum entropy, minimum discrepancy, and robust Bayesian decision theory. Annals of Statistics, 32(4), 2004.
P. D. Grünwald and P. Vitányi. Kolmogorov complexity and information theory with an interpretation in terms of questions and answers. Journal of Logic, Language and Information, 12(4):497–529, 2003.
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, New Jersey, 1988.
D. Kazakos and P. P. Kazakos. Spectral distance measures between Gaussian processes. IEEE Transactions on Automatic Control, 25(5):950–959, 1980.
Y. Linde, A. Buzo, and R. M. Gray. An algorithm for vector quantizer design. IEEE Transactions on Communications, 28(1):84–95, 1980.
J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proc. of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pages 281–297, 1967.
G. J. McLachlan and T. Krishnan. The EM Algorithm and Extensions. Wiley-Interscience, 1996.
D. Modha and S. Spangler. Feature weighting in k-means clustering. Machine Learning, 52(3):217–237, 2003.
K. Nigam, A. K. McCallum, S. Thrun, and T. M. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3):103–134, 2000.
M. Palus. On entropy rates of dynamical systems and Gaussian processes. Physics Letters A, 227(5-6):301–308, 1997.
S. D. Pietra, V. D. Pietra, and J. Lafferty. Duality and auxiliary functions for Bregman distances. Technical Report CMU-CS-01-109, School of Computer Science, Carnegie Mellon University, 2001.
C. R. Rao. Diversity and dissimilarity coefficients: A unified approach. Journal of Theoretical Population Biology, 21:24–43, 1982.
R. Redner and H. Walker. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2):195–239, 1984.
R. T. Rockafellar. Convex Analysis. Princeton Landmarks in Mathematics. Princeton University Press, 1970.
K. Rose. A mapping approach to rate-distortion computation and analysis. IEEE Transactions on Information Theory, 40(6):1939–1952, 1994.
N. Slonim and Y. Weiss. Maximum likelihood and the information bottleneck. In Proc. 16th Annual Conference on Neural Information Processing Systems (NIPS), pages 335–342, 2002.
A. Strehl and J. Ghosh. Cluster ensembles – a knowledge reuse framework for combining partitionings. Journal of Machine Learning Research, 3(3):583–617, 2002.
N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. In Proc. of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368–377, 1999.
M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Technical Report TR 649, Dept. of Statistics, University of California at Berkeley, 2003.
S. Wang and D. Schuurmans. Learning continuous latent variable models with Bregman divergences. In Proc. IEEE International Conference on Algorithmic Learning Theory, 2003.