
38 jmlr-2008-Generalization from Observed to Unobserved Features by Clustering


Source: pdf

Author: Eyal Krupka, Naftali Tishby

Abstract: We argue that when objects are characterized by many attributes, clustering them on the basis of a random subset of these attributes can capture information on the unobserved attributes as well. Moreover, we show that under mild technical conditions, clustering the objects on the basis of such a random subset performs almost as well as clustering with the full attribute set. We prove finite-sample generalization theorems for this novel learning scheme, extending analogous results from the supervised learning setting. We use our framework to analyze generalization to unobserved features of two well-known clustering algorithms: k-means and the maximum likelihood multinomial mixture model. The scheme is demonstrated for collaborative filtering of users with movie ratings as attributes and for document clustering with words as attributes. Keywords: clustering, unobserved features, learning theory, generalization in clustering, information bottleneck
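The scheme described in the abstract can be illustrated with a minimal sketch (not taken from the paper: subset_clustering_demo, the 50/50 feature split, and within-cluster variance as a quality proxy are our own illustrative choices, and the paper's formal quantities and bounds are not reproduced here). The idea: run k-means once on a random half of the features and once on all features, then compare how tight the resulting clusters are on the features the first clustering never observed.

import numpy as np
from sklearn.cluster import KMeans

def subset_clustering_demo(X, n_clusters=5, n_observed=None, seed=0):
    """Cluster on a random feature subset and compare the within-cluster
    variance of the *unobserved* features against clustering on the full
    feature set. Illustrative only."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    if n_observed is None:
        n_observed = d // 2  # assumption: observe a random half of the features
    observed = rng.choice(d, size=n_observed, replace=False)
    unobserved = np.setdiff1d(np.arange(d), observed)

    def mean_within_cluster_var(labels, cols):
        # Size-weighted average variance of the given columns within each
        # cluster: lower means the clustering is more informative about them.
        total = 0.0
        for c in np.unique(labels):
            block = X[labels == c][:, cols]
            total += block.shape[0] * block.var(axis=0).mean()
        return total / n

    labels_subset = KMeans(n_clusters=n_clusters, n_init=10,
                           random_state=seed).fit_predict(X[:, observed])
    labels_full = KMeans(n_clusters=n_clusters, n_init=10,
                         random_state=seed).fit_predict(X)

    return (mean_within_cluster_var(labels_subset, unobserved),
            mean_within_cluster_var(labels_full, unobserved))

Usage: v_subset, v_full = subset_clustering_demo(X, n_clusters=10). If the paper's claim holds for the data at hand (many correlated attributes), v_subset should be close to v_full, even though the first clustering never saw the unobserved columns.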


Reference text

R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6, 2005.

J. Blitzer, A. Globerson, and F. Pereira. Distributed latent variable models of lexical co-occurrences. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2005.

J. S. Breese, D. Heckerman, and K. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Conference on Uncertainty in Artificial Intelligence (UAI), 1998.

R. Caruana and V. R. de Sa. Promoting poor features to supervisors: Some inputs work better as outputs. In Advances in Neural Information Processing Systems (NIPS), 1997.

G. Chechik and N. Tishby. Extracting relevant structures with side information. In Advances in Neural Information Processing Systems (NIPS), 2002.

T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley Interscience, 1991.

A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 1977.

G. Elidan and N. Friedman. The information bottleneck EM algorithm. In Conference on Uncertainty in Artificial Intelligence (UAI), 2003.

N. Friedman, O. Mosenzon, N. Slonim, and N. Tishby. Multivariate information bottleneck. In Conference on Uncertainty in Artificial Intelligence (UAI), 2001.

A. Globerson and S. Roweis. Nightmare at test time: robust learning by feature deletion. In International Conference on Machine Learning (ICML), 2006.

D. Gondek and T. Hofmann. Non-redundant data clustering. In Fourth IEEE International Conference on Data Mining (ICDM). IEEE Computer Society, 2004.

T. S. Han. Nonnegative entropy measures of multivariate symmetric correlations. Information and Control, 36:133–156, 1978.

A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264–323, September 1999.

J. Kleinberg. An impossibility theorem for clustering. In Advances in Neural Information Processing Systems (NIPS), 2002.

E. Krupka and N. Tishby. Generalization in clustering with unobserved features. In Advances in Neural Information Processing Systems (NIPS), 2005.

E. Krupka and N. Tishby. Incorporating prior knowledge on features into learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2007.

E. Krupka, A. Navot, and N. Tishby. Learning to select features using their properties. Journal of Machine Learning Research, submitted.

K. Lang. Learning to filter netnews. In Proc. 12th International Conference on Machine Learning, pages 331–339, 1995.

S. P. Lloyd. Least squares quantization in PCM. Technical report, Bell Laboratories, 1957. Published in 1982 in IEEE Transactions on Information Theory, 28:128–137.

J. MacQueen. Some methods for classification and analysis of multivariate observations. In Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967.

B. Marlin. Collaborative filtering: A machine learning perspective. Master's thesis, University of Toronto, 2004.

L. Paninski. Estimation of entropy and mutual information. Neural Computation, 15:1191–1253, 2003.

J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

M. Seeger. Learning with labeled and unlabeled data. Technical report, University of Edinburgh, 2002.

M. Szummer and T. Jaakkola. Information regularization with partially labeled data. In Advances in Neural Information Processing Systems (NIPS), 2003.

N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method. In Proc. 37th Allerton Conference on Communication, Control, and Computing, 1999.

L. H. Ungar and D. P. Foster. Clustering methods for collaborative filtering. In Workshop on Recommender Systems at the 15th National Conference on Artificial Intelligence, 1998.

V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.

U. von Luxburg and S. Ben-David. Towards a statistical theory of clustering. In PASCAL Workshop on Statistics and Optimization of Clustering, 2005.