Source: pdf
Author: Gal Chechik, Naftali Tishby
Abstract: The problem of extracting the relevant aspects of data, in the face of multiple conflicting structures, is inherent to the modeling of complex data. Extracting the structure in one random variable that is relevant for another variable has recently been addressed via the information bottleneck method [15]. However, such auxiliary relevance variables often contain more information than is actually required, because of structures that are irrelevant for the task; indeed, in many cases it is easier to specify what is irrelevant for the task at hand than what is relevant. The identification of relevant structures can therefore be considerably improved by also minimizing the information about another, irrelevant, variable. In this paper we give a general formulation of this problem and derive its formal, as well as algorithmic, solution. Its operation is demonstrated on a synthetic example and on two real-world problems, in the context of text categorization and face images. While the original information bottleneck problem is related to rate distortion theory, with the distortion measure replaced by the relevant information, extracting relevant features while removing irrelevant ones is related to rate distortion with side information [17].
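The trade-off the abstract describes has a compact variational form. What follows is a minimal sketch under assumed notation (the abstract fixes none of these symbols): X is the observed data, T its compressed representation, Y+ the relevant variable, Y- the irrelevant one, and beta, gamma are Lagrange trade-off parameters.

% Assumed notation for illustration:
% X: observed data; T: its compressed representation;
% Y+: relevant variable; Y-: irrelevant (side) variable.
% beta trades compression against relevant information;
% gamma weights the penalty on irrelevant information.
\[
  \min_{p(t \mid x)} \; \mathcal{L}
  \;=\; I(X;T) \;-\; \beta \bigl[\, I(T;Y^{+}) \;-\; \gamma\, I(T;Y^{-}) \,\bigr]
\]

Setting gamma = 0 recovers the original information bottleneck functional of [15]; the extra term penalizing I(T;Y-) is what connects the problem to rate distortion with side information [17].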
[1] L.D. Baker and A.K. McCallum. Distributional clustering of words for text classification. In Proc. of SIGIR, 1998.
[2] T.M. Cover and J.A. Thomas. Elements of Information Theory. John Wiley & Sons, New York, 1991.
[3] I. Csiszár and J. Körner. Information Theory: Coding Theorems for Discrete Memoryless Systems. Academic Press, New York, 2nd edition, 1997.
[4] S. Dumais and H. Chen. Hierarchical classification of web content. In Proc. of SIGIR, pages 256–263, 2000.
[5] A. Globerson and N. Tishby. Sufficient dimensionality reduction. J. Mach. Learn. Res., 2003.
[6] T. Hofmann. Probabilistic latent semantic indexing. In Proc. of SIGIR, pages 50–57, 1999.
[7] K. Lang. Learning to filter netnews. In Proc. of the 12th Int. Conf. on Machine Learning, 1995.
[8] N. Friedman, O. Mosenzon, N. Slonim, and N. Tishby. Multivariate information bottleneck. In Proc. of UAI, pages 152–161, 2001.
[9] F.C. Pereira, N. Tishby, and L. Lee. Distributional clustering of English words. In Meeting of the Association for Computational Linguistics, pages 183–190, 1993.
[10] E. Schneidman, N. Slonim, N. Tishby, R. de Ruyter van Steveninck, and W. Bialek. Analyzing neural codes using the information bottleneck method. Technical report, The Hebrew University, 2002.
[11] J. Sinkkonen and S. Kaski. Clustering based on conditional distributions in an auxiliary space. Neural Computation, 14:217–239, 2002.
[12] N. Slonim, N. Friedman, and N. Tishby. Unsupervised document classification using sequential information maximization. In Proc. of SIGIR, pages 129–136, 2002.
[13] N. Slonim and N. Tishby. Agglomerative information bottleneck. In Advances in Neural Information Processing Systems (NIPS), 1999.
[14] N. Slonim and N. Tishby. Document clustering using word clusters via the information bottleneck method. In Proc. of SIGIR, pages 208–215, 2000.
[15] N. Tishby, F.C. Pereira, and W. Bialek. The information bottleneck method. In Proc. of the 37th Allerton Conference on Communication, Control, and Computing, 1999.
[16] A. Vinokourov and M. Girolami. A probabilistic framework for the hierarchic organization and classification of document collections. J. Intell. Inf. Syst., 18(2–3):153–172, 2002.
[17] A. Wyner and J. Ziv. The rate distortion function for source coding with side information at the decoder. IEEE Trans. Information Theory, 22(1):1–10, 1976.