jmlr jmlr2010 jmlr2010-89 jmlr2010-89-reference knowledge-graph by maker-knowledge-mining

89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond


Source: pdf

Author: Yevgeny Seldin, Naftali Tishby

Abstract: We derive PAC-Bayesian generalization bounds for supervised and unsupervised learning models based on clustering, such as co-clustering, matrix tri-factorization, graphical models, graph clustering, and pairwise clustering.1 We begin with the analysis of co-clustering, which is a widely used approach to the analysis of data matrices. We distinguish among two tasks in matrix data analysis: discriminative prediction of the missing entries in data matrices and estimation of the joint probability distribution of row and column variables in co-occurrence matrices. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two tasks. The analysis yields regularization terms that were absent in the previous formulations of co-clustering. The bounds suggest that the expected performance of co-clustering is governed by a trade-off between its empirical performance and the mutual information preserved by the cluster variables on row and column IDs. We derive an iterative projection algorithm for finding a local optimum of this trade-off for discriminative prediction tasks. This algorithm achieved stateof-the-art performance in the MovieLens collaborative filtering task. Our co-clustering model can also be seen as matrix tri-factorization and the results provide generalization bounds, regularization terms, and new algorithms for this form of matrix factorization. The analysis of co-clustering is extended to tree-shaped graphical models, which can be used to analyze high dimensional tensors. According to the bounds, the generalization abilities of treeshaped graphical models depend on a trade-off between their empirical data fit and the mutual information that is propagated up the tree levels. We also formulate weighted graph clustering as a prediction problem: given a subset of edge weights we analyze the ability of graph clustering to predict the remaining edge weights. The analysis of co-clustering easily


reference text

Orly Alter, Patrick O. Brown, and David Botstein. Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms. In Proceedings of the National Academy of Science, 2003. 3639 S ELDIN AND T ISHBY Amiran Ambroladze, Emilio Parrado-Hern´ ndez, and John Shawe-Taylor. Tighter PAC-Bayes a bounds. In Advances in Neural Information Processing Systems (NIPS), 2007. Surugu Arimoto. An algorithm for computing the capacity of discrete memoryless channel. IEEE Transactions on Information Theory, 18, 1972. Jean-Yves Audibert and Olivier Bousquet. Combining PAC-Bayesian and generic chaining bounds. Journal of Machine Learning Research, 2007. Arindam Banerjee. On Bayesian bounds. In Proceedings of the International Conference on Machine Learning (ICML), 2006. Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, Srujana Merugu, and Dhamendra S. Modha. A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. Journal of Machine Learning Research, 8, 2007. Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. In Proceedings of the International Conference on Computational Learning Theory (COLT), 2001. Peter L. Bartlett, St´ phane Boucheron, and G´ bor Lugosi. Model selection and error estimation. e a Machine Learning, 2001. Peter L. Bartlett, Michael Collins, Ben Taskar, and David McAllester. Exponentiated gradient algorithms for large-margin structured classification. In Advances in Neural Information Processing Systems (NIPS), 2005. Shai Ben-David and Ulrike von Luxburg. Relating clustering stability to properties of cluster boundaries. In Proceedings of the International Conference on Computational Learning Theory (COLT), 2008. Shai Ben-David, Ulrike von Luxburg, and David P´ l. A sober look on clustering stability. In a Advances in Neural Information Processing Systems (NIPS), 2006. Richard E. Blahut. Computation of channel capacity and rate distortion functions. IEEE Transactions on Information Theory, 18, 1972. Gilles Blanchard and Francois Fleuret. Occam’s hammer. In Proceedings of the International ¸ Conference on Computational Learning Theory (COLT), 2007. David Blei and John Lafferty. Topic models. In A. Srivastava and M. Sahami, editors, Text Mining: Theory and Applications. Taylor and Francis, 2009. St´ phane Boucheron, G´ bor Lugosi, and Olivier Bousquet. Theory of classification: a survey of e a recent advances. ESAIM: Probability and Statistics, 2005. Olivier Catoni. PAC-Bayesian supervised classification: The thermodynamics of statistical learning. IMS Lecture Notes Monograph Series, 56, 2007. Yizong Cheng and George M. Church. Biclustering of expression data. In Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB), 2000. 3640 PAC-BAYESIAN A NALYSIS OF C O - CLUSTERING AND B EYOND Hyuk Cho and Inderjit S. Dhillon. Co-clustering of human cancer microarrays using minimum sumsquared residue co-clustering. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 5(3), 2008. Hyuk Cho, Inderjit S. Dhillon, Yuqiang Guan, and Suvrit Sra. Minimum sum-squared residue coclustering of gene expression data. In Proceedings of the Fourth SIAM International Conference on Data Mining, 2004. Thomas M. Cover. Admissibility properties of Gilberts encoding for unknown source probabilities. IEEE Transactions on Information Theory, 18, 1972. Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 1991. Koby Crammer, Mehryar Mohri, and Fernando Pereira. Gaussian margin machines. In Proceedings on the International Conference on Artificial Intelligence and Statistics (AISTATS), 2009. Philip Derbeko, Ran El-Yaniv, and Ron Meir. Explicit learning curves for transduction and application to clustering and compression algorithms. Journal of Artificial Intelligence Research, 22, 2004. Luc Devroye, L´ szl´ Gy¨ rfi, and G´ bor Lugosi. A Probabilistic Theory of Pattern Recognition. a o o a Springer, 1996. Inderjit S. Dhillon, Subramanyam Mallela, and Dharmendra S. Modha. Information-theoretic coclustering. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (ACM SIGKDD), 2003. Chris Ding, Tao Li, Wei Peng, and Haesun Park. Orthogonal nonnegative matrix tri-factorizations for clustering. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (ACM SIGKDD), 2006. Miroslav Dud´k, Steven J. Phillips, and Robert E. Schapire. Maximum entropy density estimation ı with generalized regularization and an application to species distribution modeling. Journal of Machine Learning Research, 8, 2007. Ran El-Yaniv and Oren Souroujon. Iterative double clustering for unsupervised and semi-supervised learning. In Advances in Neural Information Processing Systems (NIPS), 2001. Dayane Freitag. Trained named entity recognition using distributional clusters. In Proceedings of EMNLP, 2004. Thomas George and Srujana Merugu. A scalable collaborative filtering framework based on coclustering. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM05), 2005. Pascal Germain, Alexandre Lacasse, Francois Laviolette, and Mario Marchand. PAC-Bayesian ¸ learning of linear classifiers. In Proceedings of the International Conference on Machine Learning (ICML), 2009. Edgar N. Gilbert. Codes based on inaccurate source probabilities. IEEE Transactions on Information Theory, 17(3), 1971. 3641 S ELDIN AND T ISHBY Gene H. Golub and Charles F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 3rd edition, 1996. Peter Gr¨ nwald. The Minimum Description Length Principle. MIT Press, 2007. u Isabelle Guyon, Ulrike von Luxburg, and Robert C. Williamson. Clustering: Science or art? towards principled approaches. nips workshop, 2009. John A. Hartigan. Direct clustering of a data matrix. Journal of the American Statistical Association, 67(337), 1972. Jonathan Herlocker, Joseph Konstan, Loren Terveen, and John Riedl. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems, 22(1), 2004. Matthew Higgs and John Shawe-Taylor. A PAC-Bayes bound for tailored density estimation. In Proceedings of the International Conference on Algorithmic Learning Theory (ALT), 2010. Edwin Thompson Jaynes. Information theory and statistical mechanics. Physical Review, 106, 1957. Michael Kearns, Yishay Mansour, Andrew Ng, and Dana Ron. An experimental and theoretical comparison of model selection methods. Machine Learning, 1997. Yong-Deok Kim and Seungjin Choi. Nonnegative Tucker decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007. Yuval Kluger, Ronen Basri, Joseph T. Chang, and Mark Gerstein. Spectral biclustering of microarray data: Coclustering genes and conditions. Genome Research, 2003. Vladimir Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 2001. Rafail E. Krichevskiy. Laplaces law of succession and universal encoding. IEEE Transactions on Information Theory, 44(1), 1998. Eyal Krupka. Generalization from Observed to Unobserved Features. PhD thesis, The Hebrew University of Jerusalem, 2008. Eyal Krupka and Naftali Tishby. Generalization in clustering with unobserved features. In Advances in Neural Information Processing Systems (NIPS), 2005. Eyal Krupka and Naftali Tishby. Generalization from observed to unobserved features by clustering. Journal of Machine Learning Research, 9, 2008. Tilman Lange, Volker Roth, Mikio L. Braun, and Joachim M. Buhmann. Stability based validation of clustering solutions. Neural Computation, 2004. John Langford. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 2005. John Langford and John Shawe-Taylor. PAC-Bayes & margins. In Advances in Neural Information Processing Systems (NIPS), 2002. 3642 PAC-BAYESIAN A NALYSIS OF C O - CLUSTERING AND B EYOND Danial Lashkari and Polina Golland. Co-clustering with generative models. Technical report, MITCSAIL-TR-2009-054, 2009. Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401, 1999. Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems (NIPS), 2001. Guy Lever, Francois Laviolette, and John Shawe-Taylor. Distribution-dependent PAC-Bayes priors. ¸ In Proceedings of the International Conference on Algorithmic Learning Theory (ALT), 2010. Hang Li and Naoki Abe. Word clustering and disambiguation based on co-occurrence data. In Proceedings of the 17th International Conference on Computational Linguistics, 1998. Sara C. Madeira and Arlindo L. Oliveira. Biclustering algorithms for biological data analysis: A survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 1(1), 2004. Yishay Mansour and David McAllester. Generalization bounds for decision trees. In Proceedings of the International Conference on Computational Learning Theory (COLT), 2000. Andreas Maurer. A note on the PAC-Bayesian theorem. www.arxiv.org, 2004. David McAllester. Some PAC-Bayesian theorems. In Proceedings of the International Conference on Computational Learning Theory (COLT), 1998. David McAllester. PAC-Bayesian model averaging. In Proceedings of the International Conference on Computational Learning Theory (COLT), 1999. David McAllester. Simplified PAC-Bayesian margin bounds. In Proceedings of the International Conference on Computational Learning Theory (COLT), 2003. David McAllester. Generalization bounds and consistency for structured labeling. In G¨ khan o Bakir, Thomas Hofmann, Bernhard Sch¨ lkopf, Alexander Smola, Ben Taskar, and S.V.N. Visho wanathan, editors, Predicting Structured Data. The MIT Press, 2007. Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems (NIPS), 2001. Liam Paninski. Estimation of entropy and mutual information. Neural Computation, 2003. Liam Paninski. Variational minimax estimation of discrete distributions under kl loss. In Advances in Neural Information Processing Systems (NIPS), 2004. Richard Rohwer and Dayne Freitag. Towards full automation of lexicon construction. In Dan Moldovan and Roxana Girju, editors, HLT-NAACL 2004: Workshop on Computational Lexical Semantics, 2004. Ruslan Salakhutdinov and Andriy Mnih. Bayesian probabilistic matrix factorization using Markov chain monte carlo. In Proceedings of the International Conference on Machine Learning (ICML), 2008. 3643 S ELDIN AND T ISHBY Matthias Seeger. PAC-Bayesian generalization error bounds for Gaussian process classification. Journal of Machine Learning Research, 2002. Matthias Seeger. Bayesian Gaussian Process Models: PAC-Bayesian Generalization Error Bounds and Sparse Approximations. PhD thesis, University of Edinburgh, 2003. Yevgeny Seldin. A PAC-Bayesian Approach to Structure Learning. PhD thesis, The Hebrew University of Jerusalem, 2009. Yevgeny Seldin. A PAC-Bayesian analysis of graph clustering and pairwise clustering. Technical report, http://arxiv.org/abs/1009.0499, 2010. Yevgeny Seldin and Naftali Tishby. Multi-classification by categorical features via clustering. In Proceedings of the International Conference on Machine Learning (ICML), 2008. Yevgeny Seldin and Naftali Tishby. PAC-Bayesian generalization bound for density estimation with application to co-clustering. In Proceedings on the International Conference on Artificial Intelligence and Statistics (AISTATS), 2009. Yevgeny Seldin, Noam Slonim, and Naftali Tishby. Information bottleneck for non co-occurrence data. In Advances in Neural Information Processing Systems (NIPS), 2007. M. Mahdi Shafiei and Evangelos E. Milios. Model-based overlapping co-clustering. In Proceeding of SIAM Conference on Data Mining, 2006. Ohad Shamir and Naftali Tishby. On the reliability of clustering stability in the large sample regime. In Advances in Neural Information Processing Systems (NIPS), 2009. Hanhuai Shan and Arindam Banerjee. Bayesian co-clustering. In IEEE International Conference on Data Mining (ICDM), 2008. John Shawe-Taylor and David Hardoon. Pac-bayes analysis of maximum entropy classification. In Proceedings on the International Conference on Artificial Intelligence and Statistics (AISTATS), 2009. John Shawe-Taylor and Robert C. Williamson. A PAC analysis of a Bayesian estimator. In Proceedings of the International Conference on Computational Learning Theory (COLT), 1997. John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson, and Martin Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5), 1998. Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 2000. Noam Slonim. The Information Bottleneck: Theory and Applications. PhD thesis, The Hebrew University of Jerusalem, 2002. Noam Slonim and Naftali Tishby. Document clustering using word clusters via the information bottleneck method. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2000. 3644 PAC-BAYESIAN A NALYSIS OF C O - CLUSTERING AND B EYOND Noam Slonim, Nir Friedman, and Naftali Tishby. Unsupervised document classification using sequential information maximization. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2002. Noam Slonim, Gurinder Singh Atwal, Gasper Tracik, and William Bialek. Information-based clustering. Proceedings of the National Academy of Science, 102(51), 2005. Noam Slonim, Nir Friedman, and Naftali Tishby. Multivariate information bottleneck. Neural Computation, 18, 2006. Nathan Srebro. Learning with Matrix Factorizations. PhD thesis, MIT, 2004. Nathan Srebro, Noga Alon, and Tommi S. Jaakkola. Generalization error bounds for collaborative prediction with low-rank matrices. In Advances in Neural Information Processing Systems (NIPS), 2005a. Nathan Srebro, Jason Rennie, and Tommi Jaakkola. Maximum margin matrix factorization. In Advances in Neural Information Processing Systems (NIPS), 2005b. Mark Steyvers and Tom Griffiths. Probabilistic topic models. In T. Landauer, D. McNamara, S. Dennis, and W. Kintsch, editors, Latent Semantic Analysis: A Road to Meaning. Laurence Erlbaum, 2006. Gilbert Strang. Introduction to linear algebra. Wellesley-Cambridge Press, 4th edition, 2009. Ilya Sutskever, Ruslan Salakhutdinov, and Josh B. Tenenbaum. Modelling relational data using Bayesian clustered tensor factorization. In Advances in Neural Information Processing Systems (NIPS), 2009. Hiroya Takamura and Yuji Matsumoto. Co-clustering for text categorization. Information Processing Society of Japan Journal, 2003. Naftali Tishby, Fernando Pereira, and William Bialek. The information bottleneck method. In Allerton Conference on Communication, Control and Computation, 1999. Leslie G. Valiant. A theory of the learnable. Communications of the Association for Computing Machinery, 27(11), 1984. Vladimir N. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998. Vladimir N. Vapnik and Alexey Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Soviet Math. Dokl., 9, 1968. Vladimir N. Vapnik and Alexey Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2), 1971. Ulrike von Luxburg and Shai Ben-David. Towards a statistical theory of clustering. In PASCAL Workshop on Statistics and Optimization of Clustering, 2005. Pu Wang, Carlotta Domeniconi, and Kathryn Blackmond Laskey. Latent Dirichlet Bayesian coclustering. In Proceedings of European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), 2009. 3645 S ELDIN AND T ISHBY Elad Yom-Tov and Noam Slonim. Parallel pairwise clustering. In SIAM International Conference on Data Mining (SDM), 2009. Jiho Yoo and Seungjin Choi. Probabilistic matrix tri-factorization. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2009a. Jiho Yoo and Seungjin Choi. Weighted nonnegative matrix co-tri-factorization for collaborative prediction. In Proceedings of the Asian Conference on Machine Learning (ACML), 2009b. 3646