
JMLR 2005-39: Information Bottleneck for Gaussian Variables


Source: pdf

Author: Gal Chechik, Amir Globerson, Naftali Tishby, Yair Weiss

Abstract: The problem of extracting the relevant aspects of data was previously addressed through the information bottleneck (IB) method, by (soft) clustering one variable while preserving information about another (relevance) variable. The current work extends these ideas to obtain continuous representations that preserve relevant information, rather than discrete clusters, for the special case of multivariate Gaussian variables. While the general continuous IB problem is difficult to solve, we provide an analytic solution for the optimal representation and tradeoff between compression and relevance for this important case. The obtained optimal representation is a noisy linear projection to eigenvectors of the normalized regression matrix Σ_{x|y} Σ_x^{-1}, which is also the basis obtained in canonical correlation analysis. However, in Gaussian IB, the compression tradeoff parameter uniquely determines the dimension, as well as the scale of each eigenvector, through a cascade of structural phase transitions. This introduces a novel interpretation where solutions of different ranks lie on a continuum parametrized by the compression level. Our analysis also provides a complete analytic expression of the preserved information as a function of the compression (the “information curve”), in terms of the eigenvalue spectrum of the data. As in the discrete case, the information curve is concave and smooth, though it is made of different analytic segments for each optimal dimension. Finally, we show how the algorithmic theory developed in the IB framework provides an iterative algorithm for obtaining the optimal Gaussian projections. Keywords: information bottleneck, Gaussian processes, dimensionality reduction, canonical correlation analysis
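The abstract's characterization of the optimal projection can be made concrete with a short numerical sketch. The snippet below is an illustration, not the authors' code: it estimates the Gaussian covariances from samples, forms the normalized regression matrix Σ_{x|y} Σ_x^{-1}, and takes its eigenvectors, the same basis canonical correlation analysis recovers. The function name gib_basis, the sample-based covariance estimates, and the critical-tradeoff rule β_i^c = 1/(1 − λ_i) used to decide how many directions a given β retains are assumptions of this sketch and should be checked against the paper's exact statements.

# Minimal sketch (not the paper's reference implementation) of the Gaussian IB basis:
# eigenvectors of the normalized regression matrix Sigma_{x|y} Sigma_x^{-1}.
import numpy as np

def gib_basis(X, Y, beta):
    """Return the retained projection directions at tradeoff beta (sketch only).
    X: (n, dx) samples, Y: (n, dy) samples."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = X.T @ X / n                      # Sigma_x
    Syy = Y.T @ Y / n                      # Sigma_y
    Sxy = X.T @ Y / n                      # Sigma_xy
    # Conditional covariance of X given Y (jointly Gaussian case).
    Sx_given_y = Sxx - Sxy @ np.linalg.solve(Syy, Sxy.T)
    # Normalized regression matrix from the abstract: Sigma_{x|y} Sigma_x^{-1}.
    M = Sx_given_y @ np.linalg.inv(Sxx)
    # Left eigenvectors of M (rows of the projection) = right eigenvectors of M.T.
    eigvals, eigvecs = np.linalg.eig(M.T)
    order = np.argsort(eigvals.real)       # small eigenvalue = more relevant direction
    eigvals = eigvals.real[order]
    eigvecs = eigvecs.real[:, order]
    # Assumed phase-transition rule: direction i enters once beta > 1 / (1 - lambda_i).
    beta_crit = 1.0 / np.maximum(1.0 - eigvals, 1e-12)
    keep = beta > beta_crit
    return eigvecs[:, keep], eigvals[keep]

# Tiny usage example on synthetic, weakly coupled Gaussian vectors.
rng = np.random.default_rng(0)
Z = rng.normal(size=(2000, 2))
X = Z @ rng.normal(size=(2, 5)) + 0.5 * rng.normal(size=(2000, 5))
Y = Z @ rng.normal(size=(2, 3)) + 0.5 * rng.normal(size=(2000, 3))
V, lams = gib_basis(X, Y, beta=10.0)
print("retained dimensions:", V.shape[1], "eigenvalues:", np.round(lams, 3))

Sorting by ascending eigenvalue puts the most relevant directions first, since a small eigenvalue of Σ_{x|y} Σ_x^{-1} corresponds to a large canonical correlation with Y; raising β past successive critical values then adds dimensions one at a time, which is the cascade of structural phase transitions described in the abstract.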


reference text

H. Antti, E. Holmes, and J. Nicholson. Multivariate solutions to metabonomic profiling and functional genomics. Part 2: chemometric analysis, 2002. http://www.acc.umu.se/~tnkjtg/chemometrics/editorial/oct2002.
S. Becker. Mutual information maximization: Models of cortical self-organization. Network: Computation in Neural Systems, pages 7–31, 1996.
S. Becker and G. E. Hinton. A self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355(6356):161–163, 1992.
T. Berger and R. Zamir. A semi-continuous version of the Berger-Yeung problem. IEEE Transactions on Information Theory, pages 1520–1526, 1999.
M. Borga. Canonical correlation: A tutorial. http://people.imt.liu.se/magnus/cca, January 2001.
M. Borga, H. Knutsson, and T. Landelius. Learning canonical correlations. In Proceedings of the 10th Scandinavian Conference on Image Analysis, Lappeenranta, Finland, June 1997. SCIA.
G. Chechik and N. Tishby. Extracting relevant structures with side information. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, Vancouver, Canada, 2002.
T. M. Cover and J. A. Thomas. The Elements of Information Theory. Plenum Press, New York, 1991.
J. W. Demmel. Applied Numerical Linear Algebra. Society for Industrial and Applied Mathematics, Philadelphia, 1997. ISBN 0-89871-389-7.
A. G. Dimitrov and J. P. Miller. Neural coding and decoding: Communication channels and quantization. Network: Computation in Neural Systems, 12(4):441–472, 2001. URL citeseer.nj.nec.com/dimitrov01neural.html.
N. Friedman, O. Mosenzon, N. Slonim, and N. Tishby. Multivariate information bottleneck. In J. S. Breese and D. Koller, editors, Uncertainty in Artificial Intelligence: Proceedings of the Seventeenth Conference (UAI-2001), pages 152–161, San Francisco, CA, 2001. Morgan Kaufmann.
O. Friman, M. Borga, P. Lundberg, and H. Knutsson. Adaptive analysis of fMRI data. NeuroImage, 19(3):837–845, 2003.
M. Gastpar, B. Rimoldi, and M. Vetterli. To code, or not to code: Lossy source-channel communication revisited. IEEE Transactions on Information Theory, 49(5):1147–1158, 2003.
R. Gilad-Bachrach, A. Navot, and N. Tishby. An information theoretic tradeoff between complexity and accuracy. In Proceedings of COLT, Washington, DC, 2003.
A. Globerson and N. Tishby. Sufficient dimensionality reduction. Journal of Machine Learning Research, 3:1307–1331, 2003. ISSN 1533-7928.
A. Globerson and N. Tishby. Tbd. Technical report, Hebrew University, January 2004.
G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, 1989.
V. K. Goyal. Theoretical foundations of transform coding. IEEE Signal Processing Magazine, 18(5):9–21, 2001.
H. Hotelling. The most predictable criterion. Journal of Educational Psychology, 26:139–142, 1935.
A. Jennings and G. W. Stewart. Simultaneous iteration for partial eigensolution of real matrices. J. Inst. Math. Appl., 15:351–361, 1975.
S. M. Kay. Fundamentals of Statistical Signal Processing, Volume I: Estimation Theory. Prentice-Hall, 1993.
R. Linsker. Self-organization in a perceptual network. Computer, 21(3):105–117, 1988.
R. Linsker. Local synaptic learning rules suffice to maximize mutual information in a linear network. Neural Computation, 4:691–702, 1992.
J. R. Magnus and H. Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics. John Wiley and Sons, 1988.
S. Mika, G. Ratsch, J. Weston, B. Scholkopf, A. Smola, and K. Muller. Invariant feature extraction and classification in kernel spaces. In S. A. Solla, T. K. Leen, and K. R. Muller, editors, Advances in Neural Information Processing Systems 12, pages 526–532, Vancouver, Canada, 2000.
S. S. Pradhan. On rate-distortion function of Gaussian sources with memory with side information at the decoder. Technical report, Berkeley, 1998.
S. S. Pradhan, J. Chou, and K. Ramchandran. Duality between source coding and channel coding and its extension to the side information case. IEEE Transactions on Information Theory, 49(5):1181–1203, 2003.
C. E. Shannon. Coding theorems for a discrete source with a fidelity criterion. In Institute for Radio Engineers, International Convention Record, volume 7, part 4, pages 142–163, New York, NY, USA, March 1959.
J. Sinkkonen and S. Kaski. Clustering based on conditional distributions in an auxiliary space. Neural Computation, 14:217–239, 2001.
N. Slonim. Information Bottleneck Theory and Applications. PhD thesis, Hebrew University of Jerusalem, 2003.
N. Slonim and N. Tishby. Document clustering using word clusters via the information bottleneck method. In N. J. Belkin, P. Ingwersen, and M.-K. Leong, editors, Research and Development in Information Retrieval (SIGIR-00), pages 208–215. ACM Press, New York, 2000.
N. Slonim and Y. Weiss. Maximum likelihood and the information bottleneck. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, Vancouver, Canada, 2002.
B. Thompson. Canonical Correlation Analysis: Uses and Interpretation, volume 47. Sage Publications, Thousand Oaks, CA, 1984.
N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. In Proceedings of the 37th Allerton Conference on Communication and Computation, 1999.
H. von Storch and F. W. Zwiers. Statistical Analysis in Climate Research. Cambridge University Press, 1999.
A. D. Wyner. On source coding with side information at the decoder. IEEE Transactions on Information Theory, IT-21:294–300, 1975.
A. D. Wyner. The rate distortion function for source coding with side information at the decoder II: General sources. IEEE Transactions on Information Theory, IT-38:60–80, 1978.