Source: pdf
Author: Ran El-Yaniv, Oren Souroujon
Abstract: We present a powerful meta-clustering technique called Iterative Double Clustering (IDC). IDC is a natural extension of the recent Double Clustering (DC) method of Slonim and Tishby, which exhibited impressive performance on text categorization tasks [12]. Using synthetically generated data, we empirically find that whenever the DC procedure succeeds in recovering some of the structure hidden in the data, the extended IDC procedure can incrementally compute a significantly more accurate classification. IDC is especially advantageous when the data exhibits high attribute noise. Our simulation results also show the effectiveness of IDC on text categorization problems. Surprisingly, this unsupervised procedure can be competitive with a (supervised) SVM trained on a small training set. Finally, we propose a simple and natural extension of IDC to semi-supervised and transductive learning, where we are given both labeled and unlabeled examples.
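The abstract describes the IDC loop only at a high level. As a rough illustration (not the authors' implementation), the sketch below alternates the two clustering stages of DC: words are first clustered by their distribution over the current document partition, documents are then re-represented by their counts over word clusters and reclustered, and the loop repeats. Plain k-means stands in for the information-bottleneck clustering of [8, 10], and all names and parameter values (n_word_clusters, n_doc_clusters, n_iterations) are illustrative.

import numpy as np
from sklearn.cluster import KMeans

def idc(counts, n_word_clusters=20, n_doc_clusters=4, n_iterations=15):
    """Sketch of the IDC loop (k-means substituted for information-bottleneck
    clustering). counts: (n_docs, n_words) term-count numpy array.
    Returns one cluster label per document."""
    doc_labels = None
    for _ in range(n_iterations):
        if doc_labels is None:
            # First DC round: cluster words by their distribution over documents.
            word_profiles = counts.T.astype(float)
        else:
            # Later rounds: cluster words by their distribution over the
            # document clusters found in the previous iteration.
            word_profiles = np.stack(
                [counts[doc_labels == c].sum(axis=0) for c in range(n_doc_clusters)],
                axis=1).astype(float)
        word_profiles /= word_profiles.sum(axis=1, keepdims=True) + 1e-12
        word_labels = KMeans(n_clusters=n_word_clusters,
                             n_init=10).fit_predict(word_profiles)

        # Second stage: represent each document by its counts over the word
        # clusters, then cluster the documents in that reduced space.
        doc_profiles = np.stack(
            [counts[:, word_labels == k].sum(axis=1) for k in range(n_word_clusters)],
            axis=1).astype(float)
        doc_profiles /= doc_profiles.sum(axis=1, keepdims=True) + 1e-12
        doc_labels = KMeans(n_clusters=n_doc_clusters,
                            n_init=10).fit_predict(doc_profiles)
    return doc_labels

For example, idc(term_count_matrix) returns a cluster label per document; the IDC-15 and IDC-1 variants in Figure 2 presumably correspond to running this loop for 15 and 1 iterations, respectively.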
[1] The 20 Newsgroups data set. http://www.ai.mit.edu/~jrennie/20Newsgroups/.
[2] LIBSVM. http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
[3] T.M. Cover and J.A. Thomas. Elements of Information Theory. John Wiley & Sons, Inc., 1991.
[4] R. El-Yaniv, S. Fine, and N. Tishby. Agnostic classification of Markovian sequences. In NIPS97, 1997.
[5] I.D. Guedalia, M. London, and M. Werman. A method for on-line clustering of nonstationary data. Neural Computation, 11:521–540, 1999.
[6] A.K. Jain and R.C. Dubes. Algorithms for Clustering Data. Prentice-Hall, New Jersey, 1988.
[7] J. Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, 1991.
[8] N. Tishby, F.C. Pereira, and W. Bialek. The information bottleneck method. In Proceedings of the 37th Allerton Conference on Communication, Control, and Computing, 1999.
[9] K. Rose. Deterministic annealing for clustering, compression, classification, regression and related optimization problems. Proceedings of the IEEE, 86(11):2210–2238, 1998.
[10] N. Slonim and N. Tishby. Agglomerative information bottleneck. In NIPS99, 1999.
[11] N. Slonim and N. Tishby. The power of word clustering for text classification. To appear in the European Colloquium on IR Research, ECIR, 2001.
[12] N. Slonim and N. Tishby. Document clustering using word clusters via the information bottleneck method. In ACM SIGIR 2000, 2000.

[Figure 2 appeared here: five accuracy plots, panels (a)–(e), with x-axes Feature Noise Amplitude, IDC Iteration, Number of Feature Clusters, Training Set Size, and Test Set Size.]

Figure 2: (a) Average accuracy over 10 trials for different amplitudes of proportional feature noise. Data set: a synthetically generated sample of 200 500-dimensional elements in 4 classes. (b) A trace of a single IDC run; the x-axis is the number of IDC iterations and the y-axis is the accuracy achieved at each iteration. Data set: a synthetically generated sample of 500 400-dimensional elements in 5 classes; noise: proportional feature noise with α = 1.0. (c) Average accuracy (10 trials) for different numbers of feature clusters. Data set: NG4. (d) Average accuracy (10 trials) of transductive categorization of 5 newsgroups. Sample size: 80 documents per class; the x-axis is training set size. The upper curve shows transductive IDC-15 and the lower curve transductive IDC-1. (e) Average accuracy (10 trials) of transductive categorization of 5 newsgroups. Sample size: a constant training set of 50 documents from each class; the x-axis counts the number of unlabeled samples to be categorized. The upper curve is transductive IDC-15 and the lower curve transductive IDC-1. Each error bar (in all graphs) specifies one standard deviation.