nips nips2008 nips2008-218 nips2008-218-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ling Huang, Donghui Yan, Nina Taft, Michael I. Jordan
Abstract: Spectral clustering is useful for a wide-ranging set of applications in areas such as biological data analysis, image processing and data mining. However, the computational and/or communication resources required by the method in processing large-scale data are often prohibitively high, and practitioners are often required to perturb the original data in various ways (quantization, downsampling, etc.) before invoking a spectral algorithm. In this paper, we use stochastic perturbation theory to study the effects of data perturbation on the performance of spectral clustering. We show that the error under perturbation of spectral clustering is closely related to the perturbation of the eigenvectors of the Laplacian matrix. From this result we derive approximate upper bounds on the clustering error. We show that this bound is tight empirically across a wide range of problems, suggesting that it can be used in practical settings to determine the amount of data reduction allowed in order to meet a specification of permitted loss in clustering performance.
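The following is a minimal sketch, in Python, of the kind of experiment the abstract and Figure 7 describe: cluster the original data, cluster a b-bit quantized copy, and compare the resulting assignments. It assumes a Gaussian (RBF) affinity, the symmetric normalized Laplacian, and a synthetic two-blob data set; the function names, bandwidth, and bit depths are illustrative choices, not taken from the paper.

# Illustrative sketch (not the authors' code or experimental pipeline):
# spectral clustering on original vs. b-bit quantized data, compared by
# mis-clustering rate. Assumes RBF affinity and the normalized Laplacian.
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(X, k=2, sigma_rbf=0.5):
    """Cluster rows of X using the bottom-k eigenvectors of the normalized Laplacian."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma_rbf ** 2))          # affinity matrix
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(len(X)) - (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]
    _, vecs = eigh(L)                                        # eigenvalues ascending
    V = vecs[:, :k]                                          # bottom-k eigenvectors
    V = V / np.linalg.norm(V, axis=1, keepdims=True)         # row-normalize
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(V)

def quantize(X, bits):
    """Uniform quantization of data already normalized to [0, 1]."""
    levels = 2 ** bits - 1
    return np.round(X * levels) / levels

def misclustering_rate(a, b):
    """Fraction of disagreements between two 2-way labelings, up to label swap."""
    return min(np.mean(a != b), np.mean(a != 1 - b))

rng = np.random.default_rng(0)
# Two Gaussian blobs in [0, 1]^2 as a stand-in for the simulated data sets.
X = np.clip(np.vstack([rng.normal(0.3, 0.05, (200, 2)),
                       rng.normal(0.7, 0.05, (200, 2))]), 0.0, 1.0)
labels_orig = spectral_clustering(X)
for bits in [2, 4, 6, 8]:
    labels_quant = spectral_clustering(quantize(X, bits))
    print(f"{bits} bits: mis-clustering rate vs. unquantized data = "
          f"{misclustering_rate(labels_orig, labels_quant):.3f}")

As in the paper's experiments, coarser quantization (fewer bits) perturbs the data more and should generally produce a larger disagreement with the unquantized clustering.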
[1] L. Bottou and O. Bousquet, “The tradeoffs of large scale learning,” in Advances in Neural Information Processing Systems 20, 2007.
[2] A. Silberstein, G. Puggioni, A. Gelfand, K. Munagala, and J. Yang, “Suppression and failures in sensor networks: A Bayesian approach,” in Proceedings of VLDB, 2007.
[3] X. Nguyen, M. J. Wainwright, and M. I. Jordan, “Nonparametric decentralized detection using kernel methods,” IEEE Transactions on Signal Processing, vol. 53, no. 11, pp. 4053–4066, 2005.

[Figure 7 (plots omitted): Upper bounds of clustering error on approximate data obtained from quantization, as a function of the number of bits. Each panel plots the upper bound and the test value of the mis-clustering rate against the number of quantization bits, with the corresponding data perturbation error σ shown above the x-axis. Panels: (a) Sin-sep, (b) Concentric circle, and (c) Gaussian simulated data sets (1000 samples; 2, 2, and 10 features, respectively); (d) Statlog image segmentation data (2310 samples, 19 features); (e) Handwritten pen-digits data (10992 samples, 16 features); (f) Wine data (178 samples, 13 features); (g) Iris data (150 samples, 4 features); (h) Wisconsin breast cancer data (569 samples, 30 features); (i) Waveform data (5000 samples, 21 features). Error bars are derived from 25 replications. In the experiments, all data values are normalized to the range [0, 1]; for data sets with more than two clusters, two of them are chosen for the experiments.]
[4] L. Huang, X. Nguyen, M. Garofalakis, A. D. Joseph, M. I. Jordan, and N. Taft, “In-network PCA and anomaly detection,” in Advances in Neural Information Processing Systems (NIPS), 2006.
[5] R. Kannan, S. Vempala, and A. Vetta, “On clusterings: Good, bad and spectral,” Journal of the ACM, vol. 51, no. 3, pp. 497–515, 2004.
[6] A. Y. Ng, M. Jordan, and Y. Weiss, “On spectral clustering: Analysis and an algorithm,” in Advances in Neural Information Processing Systems (NIPS), 2002.
[7] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888–905, 2000.
[8] U. von Luxburg, M. Belkin, and O. Bousquet, “Consistency of spectral clustering,” Annals of Statistics, vol. 36, no. 2, pp. 555–586, 2008.
[9] P. Drineas and M. W. Mahoney, “On the Nyström method for approximating a Gram matrix for improved kernel-based learning,” in Proceedings of COLT, 2005, pp. 323–337.
[10] C. Fowlkes, S. Belongie, F. Chung, and J. Malik, “Spectral grouping using the Nyström method,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 2, 2004.
[11] G. Cormode and M. Garofalakis, “Sketching streams through the net: Distributed approximate query tracking,” in Proceedings of VLDB, 2005, pp. 13–24.
[12] D. Kushnir, M. Galun, and A. Brandt, “Fast multiscale clustering and manifold identification,” Pattern Recognition, vol. 39, no. 10, pp. 1876–1891, 2006.
[13] G. W. Stewart and J.-G. Sun, Matrix Perturbation Theory. Academic Press, 1990.
[14] A. Asuncion and D. Newman, “UCI Machine Learning Repository, Department of Information and Computer Science,” 2007, http://www.ics.uci.edu/~mlearn/MLRepository.html.