
142 nips-2002-Maximum Likelihood and the Information Bottleneck


Source: pdf

Authors: Noam Slonim, Yair Weiss

Abstract: The information bottleneck (IB) method is an information-theoretic formulation for clustering problems. Given a joint distribution p(X,Y), this method constructs a new variable T that defines partitions over the values of X that are informative about Y. Maximum likelihood (ML) of mixture models is a standard statistical approach to clustering problems. In this paper, we ask: how are the two methods related? We define a simple mapping between the IB problem and the ML problem for the multinomial mixture model. We show that under this mapping the two problems are strongly related. In fact, for a uniform input distribution over X or for a large sample size, the problems are mathematically equivalent. Specifically, in these cases, every fixed point of the IB-functional defines a fixed point of the (log) likelihood and vice versa. Moreover, the values of the functionals at the fixed points are equal under simple transformations. As a result, in these cases, every algorithm that solves one of the problems induces a solution for the other.
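
For orientation, a minimal sketch of the two objective functions the abstract relates, written in standard notation (T denotes the cluster variable, beta the IB trade-off parameter, pi_t and beta(y|t) the mixture prior and component multinomials, and n(x,y) the observed counts; these symbols are conventional, not quoted from the paper):

The IB functional, minimized over the soft partition p(t|x):

\mathcal{L}_{\mathrm{IB}} \;=\; I(T;X) \;-\; \beta\, I(T;Y)

The log-likelihood of the multinomial mixture model, maximized over \{\pi_t\} and \{\beta(y \mid t)\}:

\ell \;=\; \sum_{x} \log \sum_{t} \pi_t \prod_{y} \beta(y \mid t)^{\,n(x,y)}

The paper's mapping identifies fixed points of the first objective with fixed points of the second when the input distribution over X is uniform or the sample size is large.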


reference text

[1] N. Tishby, F. Pereira, and W. Bialek. The Information Bottleneck method. In Proc. 37th Allerton Conference on Communication and Computation, 1999.

[2] N. Slonim. The Information Bottleneck: theory and applications. Ph.D. thesis, The Hebrew University, 2002.

[3] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, New York, 1991.

[4] T. Hofmann, J. Puzicha, and M. I. Jordan. Learning from dyadic data. In Proc. of NIPS-11, 1998.

[5] J. Puzicha, T. Hofmann, and J. M. Buhmann. Histogram clustering for unsupervised segmentation and image retrieval. Pattern Recognition Letters, 20(9):899-909, 1999.

[6] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum Likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, vol. 39, pp. 1-38, 1977.

[7] R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (editor), Learning in Graphical Models, pp. 355-368, 1998.

[8] L. Hermes, T. Zöller, and J. M. Buhmann. Parametric distributional clustering for image segmentation. In Proc. of the European Conference on Computer Vision (ECCV), 2002.

[9] K. Lang. Learning to filter netnews. In Proc. of the 12th Int. Conf. on Machine Learning, 1995.

[10] N. Slonim, N. Friedman, and N. Tishby. Unsupervised document classification using sequential information maximization. In Proc. of SIGIR-25, 2002.

[11] N. Friedman, O. Mosenzon, N. Slonim, and N. Tishby. Multivariate Information Bottleneck. In Proc. of UAI-17, 2001.

§ The KL divergence with respect to a family of distributions is defined as the minimum over all the members of that family. Therefore, here, both arguments of the KL are changing during the process, and the distributions involved in the minimization are over all three random variables.
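
For clarity, the standard definition the note above relies on (our phrasing, not quoted from the paper): the KL divergence from a distribution p to a family of distributions \mathcal{F} is

D(p \,\|\, \mathcal{F}) \;=\; \min_{q \in \mathcal{F}} D(p \,\|\, q),

so when the family itself depends on the current solution, both arguments of the KL change from one iteration to the next.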