
nips-2007-58: Consistent Minimization of Clustering Objective Functions


Source: pdf

Author: Ulrike von Luxburg, Stefanie Jegelka, Michael Kaufmann, Sébastien Bubeck

Abstract: Clustering is often formulated as a discrete optimization problem. The objective is to find, among all partitions of the data set, the best one according to some quality measure. However, in the statistical setting where we assume that the finite data set has been sampled from some underlying space, the goal is not to find the best partition of the given sample, but to approximate the true partition of the underlying space. We argue that the discrete optimization approach usually does not achieve this goal. As an alternative, we suggest the paradigm of “nearest neighbor clustering”. Instead of selecting the best out of all partitions of the sample, it only considers partitions in some restricted function class. Using tools from statistical learning theory we prove that nearest neighbor clustering is statistically consistent. Moreover, its worst case complexity is polynomial by construction, and it can be implemented with small average case complexity using branch and bound.
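To make the restricted-search idea in the abstract concrete, here is a minimal sketch in Python/NumPy. Everything beyond what the abstract states is an assumption for illustration: the within-cluster sum of squares is used as the quality measure, the seed points are drawn uniformly at random, the search over seed labelings is done by brute force rather than the branch-and-bound scheme the paper mentions, and the function name `nearest_neighbor_clustering` is hypothetical.

```python
# Sketch of the "nearest neighbor clustering" paradigm described in the abstract:
# only partitions that are constant on the Voronoi cells of a few seed points are
# considered, and the best one according to a clustering objective is returned.
# Assumptions (not from the paper): WSS objective, random seeds, brute-force search.
import itertools
import numpy as np


def nearest_neighbor_clustering(X, n_clusters=2, n_seeds=8, rng=None):
    """Return a labeling of X chosen from the restricted class of partitions
    induced by labelings of a few randomly chosen seed points."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    seeds = rng.choice(n, size=n_seeds, replace=False)

    # Each point inherits the label of its nearest seed, so a partition of the
    # whole sample is fully determined by the labels assigned to the seeds.
    dists = np.linalg.norm(X[:, None, :] - X[seeds][None, :, :], axis=2)
    nearest_seed = dists.argmin(axis=1)

    def objective(labels):
        # Within-cluster sum of squares (assumed quality measure).
        total = 0.0
        for k in range(n_clusters):
            points = X[labels == k]
            if len(points):
                total += ((points - points.mean(axis=0)) ** 2).sum()
        return total

    best_labels, best_value = None, np.inf
    # Brute-force search over all K^m seed labelings; the paper's branch-and-bound
    # procedure prunes this search, which the sketch omits for simplicity.
    for seed_labels in itertools.product(range(n_clusters), repeat=n_seeds):
        labels = np.asarray(seed_labels)[nearest_seed]
        value = objective(labels)
        if value < best_value:
            best_labels, best_value = labels, value
    return best_labels, best_value


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
    labels, value = nearest_neighbor_clustering(X, n_clusters=2, n_seeds=6, rng=1)
    print("objective value:", value)
```

Because only K^m candidate partitions are examined (m seeds, K clusters), the worst-case cost is polynomial in the sample size for fixed m and K, which is the complexity property the abstract refers to.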


reference text

M. Brusco and S. Stahl. Branch-and-Bound Applications in Combinatorial Data Analysis. Springer, 2005.
S. Bubeck and U. von Luxburg. Overfitting of clustering and how to avoid it. Preprint, 2007.
Data repository by G. Rätsch. http://ida.first.fraunhofer.de/projects/bench/benchmarks.htm.
Data repository by M. Newman. http://www-personal.umich.edu/~mejn/netdata/.
Data repository by UCI. http://www.ics.uci.edu/~mlearn/MLRepository.html.
Data repository COSIN. http://151.100.123.37/data.html.
L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.
J. Fritz. Distribution-free exponential error bound for nearest neighbor pattern classification. IEEE Trans. Inf. Th., 21(5):552–557, 1975.
R. Guimerà, L. Danon, A. Díaz-Guilera, F. Giralt, and A. Arenas. Self-similar community structure in a network of human interactions. Phys. Rev. E, 68(6):065103, 2003.
I. Gutman and W. Xiao. Generalized inverse of the Laplacian matrix and some applications. Bulletin de l'Académie Serbe des Sciences et des Arts (Cl. Math. Natur.), 129:15–23, 2004.
H. Jeong, S. Mason, A. Barabasi, and Z. Oltvai. Centrality and lethality of protein networks. Nature, 411:41–42, 2001.
D. Pollard. Strong consistency of k-means clustering. Annals of Statistics, 9(1):135–140, 1981.
P. Spellman, G. Sherlock, M. Zhang, V. Iyer, M. Anders, M. Eisen, P. Brown, D. Botstein, and B. Futcher. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell, 9(12):3273–97, 1998.
K. Tsuda, H. Shin, and B. Schölkopf. Fast protein classification with multiple networks. Bioinformatics, 21(Supplement 1):ii59–ii65, 2005.
V. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4), 2007.
U. von Luxburg, S. Bubeck, S. Jegelka, and M. Kaufmann. Supplementary material to "Consistent minimization of clustering objective functions", 2007. http://www.tuebingen.mpg.de/~ule.
U. von Luxburg, M. Belkin, and O. Bousquet. Consistency of spectral clustering. Annals of Statistics, to appear.
D. Watts and S. Strogatz. Collective dynamics of small world networks. Nature, 393:440–442, 1998.