nips nips2001 nips2001-100 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ran El-Yaniv, Oren Souroujon
Abstract: We present a powerful meta-clustering technique called Iterative Double Clustering (IDC). The IDC method is a natural extension of the recent Double Clustering (DC) method of Slonim and Tishby that exhibited impressive performance on text categorization tasks [12]. Using synthetically generated data we empirically find that whenever the DC procedure is successful in recovering some of the structure hidden in the data, the extended IDC procedure can incrementally compute a significantly more accurate classification. IDC is especially advantageous when the data exhibits high attribute noise. Our simulation results also show the effectiveness of IDC in text categorization problems. Surprisingly, this unsupervised procedure can be competitive with a (supervised) SVM trained with a small training set. Finally, we propose a simple and natural extension of IDC for semi-supervised and transductive learning where we are given both labeled and unlabeled examples. 1
Reference: text
sentIndex sentText sentNum sentScore
1 The IDC method is a natural extension of the recent Double Clustering (DC) method of Slonim and Tishby that exhibited impressive performance on text categorization tasks [12]. [sent-5, score-0.141]
2 Using synthetically generated data we empirically find that whenever the DC procedure is successful in recovering some of the structure hidden in the data, the extended IDC procedure can incrementally compute a significantly more accurate classification. [sent-6, score-0.165]
3 Our simulation results also show the effectiveness of IDC in text categorization problems. [sent-8, score-0.111]
4 Surprisingly, this unsupervised procedure can be competitive with a (supervised) SVM trained with a small training set. [sent-9, score-0.092]
5 Finally, we propose a simple and natural extension of IDC for semi-supervised and transductive learning where we are given both labeled and unlabeled examples. [sent-10, score-0.28]
6 1 Introduction Data clustering is a fundamental and challenging routine in information processing and pattern recognition. [sent-11, score-0.178]
7 Informally, when we cluster a set of elements we attempt to partition it into subsets such that points in the same subset are more “similar” to each other than to points in other subsets. [sent-12, score-0.101]
8 Typical clustering algorithms depend on a choice of similarity measure between data points [6], and a “correct” clustering result depends on an appropriate choice of that measure. [sent-13, score-0.432]
9 For instance, consider a hypothetical data set containing articles by two authors, such that half of the articles by each author discuss one topic and the other half discuss another topic. [sent-15, score-0.096]
10 When asked to cluster this set into two sub-clusters, one cannot successfully achieve the task without knowing the goal: Are we interested in clusters that reflect writing style or semantics? [sent-17, score-0.174]
11 Therefore, without a suitable target at hand and a principled method for choosing a similarity measure suitable for the target, it can be meaningless to interpret clustering results. [sent-18, score-0.213]
12 The information bottleneck (IB) method of Tishby, Pereira and Bialek [8] is a recent framework that can sometimes provide an elegant solution to this problematic “metric selection” aspect of data clustering (see Section 2). [sent-19, score-0.24]
13 The original IB method generates soft clustering assignments for the data. [sent-20, score-0.178]
14 Employing a hard version of IB clustering, the same authors introduced an effective two-stage clustering procedure called Double Clustering (DC) [12]. [sent-22, score-0.235]
15 An experimental study of DC on text categorization tasks [12] showed a consistent advantage of DC over other clustering methods. [sent-23, score-0.289]
16 IDC performs iterations of DC; whenever the first DC iteration succeeds in extracting meaningful structure from the data, the subsequent iterations can continually improve the clustering quality. [sent-26, score-0.318]
17 Not only can IDC dramatically outperform DC whenever the data is noisy; our experiments also indicate that IDC attains impressive results on text categorization tasks. [sent-29, score-0.263]
18 In particular, we show that our unsupervised IDC procedure is competitive with an SVM (and Naive Bayes) trained over a small training set. [sent-30, score-0.112]
19 We also propose a natural extension of IDC for transductive and semi-supervised learning. [sent-31, score-0.150]
20 Our preliminary empirical results indicate that our transductive IDC can yield effective text categorization. [sent-32, score-0.174]
21 For each element x = (f1, . . . , fd) ∈ X we consider the empirical conditional distribution {p(fi|x)} of features given x, where p(fi|x) = fi / Σ_{i=1}^{d} fi. [sent-38, score-0.106]
22 For instance, X can be a set of documents, each of which is represented as a vector of word-features where fi is the frequency of the ith word (in some fixed word enumeration). [sent-39, score-0.105]
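To make this distributional representation concrete, here is a minimal sketch (our own illustration, not code from the paper; NumPy and all variable names are assumptions) that turns a document-by-word count matrix into the row distributions p(fi|x):

```python
import numpy as np

def feature_conditionals(counts):
    """Convert a (documents x words) count matrix into rows p(f_i | x).

    Each row is divided by its total count, so every element x becomes an
    empirical distribution over its features.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0.0] = 1.0   # guard against empty documents
    return counts / totals

# Example: 3 documents over a 4-word vocabulary; each output row sums to 1.
X = np.array([[2, 0, 1, 1],
              [0, 3, 0, 3],
              [1, 1, 1, 1]])
print(feature_conditionals(X))
```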
23 Thus, we represent each element as a distribution over its features, and are interested in a partition of the data based on these feature conditional distributions. [sent-40, score-0.099]
24 Given a predetermined number of clusters, a straightforward approach to cluster the data using the above “distributional representation” would be to choose some (dis)similarity measure for distributions (e.g. [sent-41, score-0.145]
25 based on some Lp norm, or a statistical measure such as the KL-divergence) and employ some “plug-in” clustering algorithm based on this measure. [sent-43, score-0.214]
26 This way, T can direct us to extract a meaningful clustering of S, where the meaning is determined by the target T. [sent-53, score-0.194]
27 Specifically, the DC method in some cases obtained accuracy close to that of a naive Bayes classifier trained over a small sample [12]. [sent-55, score-0.243]
28 The IB method attempts to compute p(s̃|s), a “soft” assignment of a data point s to clusters s̃, so as to minimize I(S, S̃) − βI(S̃, T), given the Markov condition T → S → S̃. [sent-56, score-0.126]
29 As shown in [8], this minimization yields a system of coupled equations for the clustering mapping p(s̃|s) in terms of the cluster representations p(t|s̃) and the cluster weights p(s̃). [sent-60, score-0.322]
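As a rough illustration of the quantity being traded off, the following sketch (our own, not taken from [8]; the function and variable names are illustrative) evaluates I(S, S̃) − βI(S̃, T) for a given joint p(s, t), a soft assignment p(s̃|s), and a trade-off parameter β:

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) in nats for a joint distribution given as a 2-D array."""
    p_xy = p_xy / p_xy.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float((p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])).sum())

def ib_objective(p_st, p_cluster_given_s, beta):
    """I(S;S~) - beta * I(S~;T) for a soft assignment p(s~|s) of rows to clusters."""
    p_s = p_st.sum(axis=1)                               # marginal p(s)
    joint_s_cluster = p_cluster_given_s * p_s[:, None]   # joint p(s, s~)
    joint_cluster_t = p_cluster_given_s.T @ p_st         # joint p(s~, t) via the Markov chain
    return mutual_information(joint_s_cluster) - beta * mutual_information(joint_cluster_t)
```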
30 They also devised a greedy agglomerative clustering algorithm that starts with the trivial clustering, where each data point is a single cluster; then, at each step, the algorithm merges the two clusters that minimize the loss of mutual information I(S̃, T). [sent-64, score-0.395]
31 This agglomerative algorithm is of course only locally optimal, since at each step it greedily merges the two most similar clusters. [sent-67, score-0.091]
32 The IB method can be viewed as a meta-clustering procedure that, given observations of the variables S and T (via their empirical co-occurrence samples p(s, t)), attempts to cluster s-elements represented as distributions over t-elements. [sent-69, score-0.164]
33 Using the merging cost of Equation (1) one can approximate IB clustering based on other “plug-in” vectorial clustering routines applied within the simplex containing the s-elements' distributional representations. [sent-70, score-0.406]
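Equation (1) itself is not reproduced in this extract; in the hard/agglomerative setting the merging cost is usually written as a weighted Jensen-Shannon divergence between the two candidate clusters' conditional distributions. The sketch below follows that standard formulation (our own reading; all names are illustrative):

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / q[mask])).sum())

def merge_cost(p_t_given_c1, p_t_given_c2, w1, w2):
    """Loss of I(S~;T) incurred by merging two clusters.

    w1, w2 are the cluster weights p(s~1), p(s~2); the cost is
    (w1 + w2) * JS_{pi1,pi2}[p(t|s~1), p(t|s~2)] with pi_i = w_i / (w1 + w2).
    """
    pi1, pi2 = w1 / (w1 + w2), w2 / (w1 + w2)
    p_bar = pi1 * p_t_given_c1 + pi2 * p_t_given_c2
    js = pi1 * kl(p_t_given_c1, p_bar) + pi2 * kl(p_t_given_c2, p_bar)
    return (w1 + w2) * js
```

An agglomerative pass would evaluate this cost for every cluster pair and greedily merge the cheapest one, which is exactly the locally optimal behavior noted above.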
34 DC [12] is a two-stage procedure where during the first stage we IB-cluster features represented as distributions over elements, thus generating feature clusters. [sent-71, score-0.189]
35 During the second stage we IB-cluster elements represented as distributions over the feature clusters (a more formal description follows). [sent-72, score-0.267]
36 For instance, in a document clustering domain, in the first stage we cluster words as distributions over documents to obtain word clusters. [sent-73, score-0.483]
37 Then in the second stage we cluster documents as distributions over word clusters, to obtain document clusters. [sent-74, score-0.305]
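A minimal sketch of this two-stage structure, using scikit-learn's KMeans as a generic plug-in clusterer in place of the IB routine (so it mirrors the shape of DC rather than reproducing the procedure of [12]; all names are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def double_cluster(counts, n_word_clusters, n_doc_clusters, seed=0):
    """Two-stage DC-style sketch on a (documents x words) count matrix."""
    counts = np.asarray(counts, dtype=float)

    # Stage 1: cluster words, each represented as a distribution over documents.
    p_doc_given_word = counts.T / np.maximum(counts.sum(axis=0), 1e-12)[:, None]
    word_labels = KMeans(n_word_clusters, n_init=10,
                         random_state=seed).fit_predict(p_doc_given_word)

    # Collapse word counts into word-cluster pseudo-features.
    pseudo = np.zeros((counts.shape[0], n_word_clusters))
    for j, c in enumerate(word_labels):
        pseudo[:, c] += counts[:, j]

    # Stage 2: cluster documents as distributions over the word clusters.
    p_cluster_given_doc = pseudo / np.maximum(pseudo.sum(axis=1, keepdims=True), 1e-12)
    return KMeans(n_doc_clusters, n_init=10,
                  random_state=seed).fit_predict(p_cluster_given_doc)
```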
38 Intuitively, the first stage of DC generates coarser pseudo-features (i.e. [sent-75, score-0.084]
39 feature centroids), which can reduce the noise and sparseness that might be exhibited in the original feature values. [sent-77, score-0.139]
40 Then, in the second stage, elements are clustered as distributions over the “distilled” pseudo-features, which can therefore yield more accurate element clusters. [sent-78, score-0.105]
41 As reported in [12], this DC two-stage procedure outperforms various other clustering approaches as well as DC variants applied with other dissimilarity measures (such as the variational distance) different from the optimal JS-divergence of Equation (1). [sent-79, score-0.268]
42 It is most striking that in some cases, the accuracy achieved by DC was close to that achieved by a supervised Naive Bayes classifier. [sent-80, score-0.178]
43 3 Iterative Double Clustering (IDC) Denote by IB_N(T|S) the clustering result, into N clusters, of the IB hard clustering procedure when the data is S and the target variable is T (see Section 2). [sent-81, score-0.437]
44 For instance, if T represents documents and S represents words, the application of IB_N(T = documents | S = words) will cluster the words, represented as distributions over the documents, into N clusters. [sent-82, score-0.235]
45 Using the notation of our problem setup, with X denoting the data and F denoting the features, Figure 1 provides pseudo-code for the IDC meta-clustering algorithm, which clusters X into N_X clusters. [sent-83, score-0.126]
46 The code of Figure 1 requires specifying k, the number of IDC iterations to run, N_X, the number of element clusters (e.g. [sent-85, score-0.157]
47 the desired number of document clusters), and N_F, the number of feature clusters to use during each iteration. [sent-87, score-0.181]
48 The “hard” IB-clustering originally presented by [12] uses an agglomerative procedure as its underlying clustering algorithm (see Section 2). [sent-97, score-0.264]
49 The “soft” IB [8] applies a deterministic annealing clustering [9] as its underlying procedure. [sent-98, score-0.197]
50 As already discussed, the IB method can be viewed as meta-clustering which can employ many vectorial clustering routines. [sent-99, score-0.205]
51 We implemented IDC using several routines, including agglomerative clustering and deterministic annealing. [sent-100, score-0.220]
[Figure 1: Pseudo-code for IDC. Input: X (input data), N_X (number of element clusters), N_F (number of feature clusters to use), k (number of iterations). Initialize: S ← F, T ← X. Loop k times: N ← N_F; F̃ ← IB_N(T|S); N ← N_X, S ← X, T ← F̃; X̃ ← IB_N(T|S).]
52 Add-C is an online greedy clustering algorithm with linear running time and can be viewed as a simple online approximation of k-means. [sent-103, score-0.178]
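For concreteness, here is a sketch of the IDC control flow summarized in Figure 1, again with a generic plug-in clusterer (scikit-learn's KMeans) standing in for the IB_N step; it mirrors the alternation between feature clustering and element clustering rather than the exact IB procedure, and the helper names are ours:

```python
import numpy as np
from sklearn.cluster import KMeans

def normalize_rows(m):
    return m / np.maximum(m.sum(axis=1, keepdims=True), 1e-12)

def collapse_columns(counts, labels, n_clusters):
    """Sum together the columns of `counts` that share a cluster label."""
    out = np.zeros((counts.shape[0], n_clusters))
    for j, c in enumerate(labels):
        out[:, c] += counts[:, j]
    return out

def iterative_double_cluster(counts, n_x, n_f, k, seed=0):
    """IDC-style sketch: alternate feature and element clustering k times."""
    counts = np.asarray(counts, dtype=float)
    element_labels = None
    for _ in range(k):
        if element_labels is None:
            # First iteration: features as distributions over the raw elements.
            feat_repr = normalize_rows(counts.T)
        else:
            # Later iterations: features as distributions over the element clusters.
            feat_repr = normalize_rows(collapse_columns(counts.T, element_labels, n_x))
        feat_labels = KMeans(n_f, n_init=10, random_state=seed).fit_predict(feat_repr)

        # Elements as distributions over the new feature clusters.
        elem_repr = normalize_rows(collapse_columns(counts, feat_labels, n_f))
        element_labels = KMeans(n_x, n_init=10, random_state=seed).fit_predict(elem_repr)
    return element_labels
```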
53 Following [12] we chose to evaluate the performance of IDC with respect to a labeled data set. [sent-106, score-0.077]
54 In order to better understand the properties of IDC, we first examined it within a controlled setup of synthetically generated data points whose feature values were generated by d-dimensional Gaussian distributions (for d features) of the form N(µ, Σ), where Σ = σ²I, with σ constant. [sent-108, score-0.182]
55 We introduced feature noise by distorting each entry with value v by adding a random sample from N(0, (α · v)²), where α is the “noise amplitude” (resulting negative values were rounded to zero). [sent-111, score-0.103]
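A small sketch of this proportional-noise model (our own, assuming NumPy; the default generator seed is only for reproducibility of the example):

```python
import numpy as np

def add_proportional_noise(values, alpha, rng=None):
    """Distort each entry v by adding a sample from N(0, (alpha * v)^2).

    Negative results are rounded up to zero, as in the setup described above.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    noisy = values + rng.normal(0.0, alpha * np.abs(values))
    return np.maximum(noisy, 0.0)
```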
56 In Figure 2(a), we plot the average accuracy of 10 runs of IDC. [sent-112, score-0.142]
57 When the noise amplitude increases, both IDC and DC deteriorate but the multiple rounds of IDC can better resist the extra noise. [sent-114, score-0.084]
58 After observing the large accuracy gain between DC and IDC at a specific interval of noise amplitude within the feature noise setup, we set the noise amplitude to values in that interval and examined the behavior of the IDC run in more detail. [sent-115, score-0.358]
59 Figure 2(b) shows a typical trace of the accuracy obtained at each of the 20 iterations of an IDC run over noisy data. [sent-116, score-0.192]
60 This learning curve shows a quick improvement in accuracy during the first few rounds, and then reaches a plateau. [sent-117, score-0.162]
61 Following [12] we used the 20 Newsgroups (NG20) [1] data set to evaluate IDC on real, labeled data. [sent-118, score-0.077]
62 The accuracy deteriorates when N_F is too small, and we see a slight negative trend when it increases. [sent-132, score-0.163]
63 Indeed, these results indicate that after a plateau in the range of 10-20 there is a minor negative trend in the accuracy level. [sent-134, score-0.163]
64 Thus, with respect to this data set, the IDC algorithm is not too sensitive to an overestimation of the number N_F of feature clusters. [sent-135, score-0.077]
65 Other experiments over the NG4 data set confirmed the results of [12] that the JS-divergence dissimilarity measure of Equation (1) outperforms other measures, such as the variational distance (L1 norm), the KL-divergence, the square-Euclidean distance and the ‘cosine’ distance. [sent-136, score-0.100]
66 In the next set of experiments we tested IDC’s performance on the same newsgroup subsets used in [12]. [sent-138, score-0.077]
67 Table 1(a) compares the accuracy achieved by DC to that achieved in the last (15th) round of IDC with respect to all data sets described in [12]. [sent-139, score-0.166]
68 In each of the 5 experiments the supervised classifiers were trained using 25 documents per class and tested on 475 documents per class. [sent-142, score-0.302]
69 The input for the unsupervised IDC was 500 unlabeled documents per class. [sent-143, score-0.215]
70 4 Learning from Labeled and Unlabeled Examples In this section, we present a natural extension of IDC for semi-supervised and transductive learning that can utilize both labeled and unlabeled data. [sent-145, score-0.280]
71 In transductive learning, the testing is done on the unlabeled examples in the training data, while in semi-supervised learning the learned classifier is also applied to unseen examples. [sent-146, score-0.197]
[Table 1: rows Binary1, Binary2, Binary3, Multi5 (1–3), Multi10 (1–3), and their Average; columns include DC, IDC-15, and IDC-1 accuracies.]
72 The IDC-15 column shows final accuracy achieved at iteration 15 of IDC; the IDC-1 column shows first iteration accuracy. [sent-194, score-0.254]
73 In all cases the SVM was trained and tested using the same training/test set sizes as described in [11] (25 documents per newsgroup for training and 475 for testing; the number of unlabeled documents fed to IDC was 500 per newsgroup). [sent-197, score-0.386]
74 To motivate the transductive IDC, consider a data set X that has emerged from a statistical mixture which includes several sources (classes). [sent-204, score-0.144]
75 During the first iteration of a standard IDC we cluster the features F so as to preserve I(F, X). [sent-206, score-0.178]
76 In cases where I(X, C) is sufficiently large, we expect that the feature clusters F̃ will preserve some information about C as well. [sent-208, score-0.178]
77 Having some labeled data points available, we may attempt to generate feature clusters F̃ which preserve more information about the class labels. [sent-209, score-0.255]
78 During the first IB-stage of the first IDC iteration, we cluster the features F as distributions over the class labels (given by the labeled data). [sent-211, score-0.183]
79 Then we continue as usual; that is, in the second IB-phase of the first IDC iteration we cluster X, represented as distributions over F̃. [sent-213, score-0.177]
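To make the modified first stage concrete, here is a sketch (our own illustration, with a generic clusterer in place of the IB routine; names are assumptions) of clustering features as distributions over the class labels supplied by the labeled subset:

```python
import numpy as np
from sklearn.cluster import KMeans

def label_conditioned_feature_clusters(labeled_counts, labels, n_classes, n_f, seed=0):
    """Cluster features represented as distributions over class labels.

    labeled_counts: (labeled documents x words) count matrix.
    labels:         class index of each labeled document.
    """
    labeled_counts = np.asarray(labeled_counts, dtype=float)
    class_counts = np.zeros((labeled_counts.shape[1], n_classes))
    for doc, c in enumerate(labels):
        class_counts[:, c] += labeled_counts[doc]
    p_class_given_word = class_counts / np.maximum(
        class_counts.sum(axis=1, keepdims=True), 1e-12)
    return KMeans(n_f, n_init=10, random_state=seed).fit_predict(p_class_given_word)
```

The resulting feature clusters then stand in for the output of the first IB stage of the first IDC iteration, and the (mostly unlabeled) documents are clustered over them as usual in the second stage.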
80 In Figure 2(d) we show the accuracy obtained by DC and IDC in categorizing 5 newsgroups as a function of the training (labeled) set size. [sent-215, score-0.192]
81 For instance, we see that when the algorithm has 10 documents available from each class it can categorize the entire unlabeled set, containing 90 unlabeled documents in each of the classes, with accuracy of about 80%. [sent-216, score-0.524]
82 The benchmark accuracy of IDC with no labeled examples was about 73%. [sent-217, score-0.195]
83 In Figure 2(e) we see the accuracy obtained by DC and transductive IDC trained with a constant set of 50 labeled documents, on different unlabeled (test) sample sizes. [sent-218, score-0.43]
84 The graph shows that the accuracy of DC significantly degrades, while IDC manages to sustain an almost constant high accuracy. [sent-219, score-0.142]
85 First, we present a natural extension of the successful double clustering algorithm of [12]. [sent-221, score-0.258]
86 Second, we applied the unsupervised IDC to text categorization problems, which are typically dealt with by supervised learning algorithms. [sent-223, score-0.171]
87 Our results indicate that it is possible to achieve performance competitive with supervised classifiers trained over small samples. [sent-224, score-0.078]
88 Finally, we present a natural extension of IDC that allows for transductive learning. [sent-225, score-0.15]
89 Our preliminary empirical evaluation of this scheme over text categorization appears to be promising. [sent-226, score-0.128]
90 Finally, we believe it would be of great interest to better understand and characterize the performance of transductive IDC in settings having both labeled and unlabeled data. [sent-230, score-0.25]
91 This research was supported by the Israeli Ministry of Science. References: [1] The 20 Newsgroups data set. [sent-233, score-0.084]
92 Document clustering using word clusters via the information bottleneck method. [sent-298, score-0.346]
93 Data set: A synthetically generated sample of 200 500-dimensional elements in 4 classes. [sent-309, score-0.099]
94 The x-axis is the number of IDC iterations and the y-axis is accuracy achieved in each iteration. [sent-311, score-0.175]
95 Data set: a synthetically generated sample of 500 400-dimensional elements in 5 classes; Noise: proportional feature noise with α = 1.0. [sent-312, score-0.132]
96 (c) Average accuracy (10 trials) for different numbers of feature clusters. [sent-313, score-0.195]
97 (d) Average accuracy of (10 trials of) transductive categorization of 5 newsgroups. [sent-315, score-0.358]
98 Sample size: 80 documents per class, X-axis is training set size. [sent-316, score-0.114]
99 (e) Average accuracy of (10 trials of) transductive categorization of 5 newsgroups. [sent-320, score-0.358]
100 Sample size: constant training set size of 50 documents from each class. [sent-321, score-0.114]
wordName wordTfidf (topN-words)
[('idc', 0.851), ('dc', 0.185), ('clustering', 0.178), ('accuracy', 0.142), ('transductive', 0.12), ('documents', 0.114), ('clusters', 0.102), ('ib', 0.092), ('nf', 0.082), ('unlabeled', 0.077), ('categorization', 0.074), ('cluster', 0.072), ('slonim', 0.07), ('agglomerative', 0.064), ('newsgroup', 0.06), ('tishby', 0.06), ('iteration', 0.056), ('erent', 0.056), ('synthetically', 0.053), ('labeled', 0.053), ('feature', 0.053), ('di', 0.052), ('double', 0.05), ('newsgroups', 0.05), ('ibn', 0.047), ('nb', 0.047), ('naive', 0.043), ('djs', 0.041), ('bottleneck', 0.038), ('classi', 0.037), ('text', 0.037), ('supervised', 0.036), ('stage', 0.034), ('svm', 0.033), ('iterations', 0.033), ('noise', 0.033), ('dkl', 0.032), ('nx', 0.032), ('amplitude', 0.032), ('fi', 0.031), ('hard', 0.031), ('distributions', 0.031), ('extension', 0.03), ('elements', 0.029), ('word', 0.028), ('features', 0.027), ('comp', 0.027), ('guedalia', 0.027), ('merges', 0.027), ('vectorial', 0.027), ('document', 0.026), ('procedure', 0.026), ('rst', 0.025), ('unsupervised', 0.024), ('data', 0.024), ('pseudo', 0.023), ('routines', 0.023), ('attribute', 0.023), ('dissimilarity', 0.023), ('reported', 0.023), ('bayes', 0.023), ('annealing', 0.023), ('preserve', 0.023), ('element', 0.022), ('trials', 0.022), ('setup', 0.021), ('noam', 0.021), ('trend', 0.021), ('ltered', 0.021), ('trained', 0.021), ('competitive', 0.021), ('iterative', 0.02), ('sized', 0.02), ('curve', 0.02), ('divergence', 0.02), ('discusses', 0.019), ('attains', 0.019), ('naftali', 0.019), ('rounds', 0.019), ('deterministic', 0.019), ('represented', 0.018), ('speci', 0.018), ('outperforms', 0.018), ('amplitudes', 0.018), ('pereira', 0.018), ('recovering', 0.018), ('whenever', 0.018), ('measure', 0.018), ('empirical', 0.017), ('similarity', 0.017), ('experiments', 0.017), ('noisy', 0.017), ('inductive', 0.017), ('articles', 0.017), ('instance', 0.017), ('ers', 0.017), ('cally', 0.017), ('sample', 0.017), ('classes', 0.017), ('extract', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 100 nips-2001-Iterative Double Clustering for Unsupervised and Semi-Supervised Learning
Author: Ran El-Yaniv, Oren Souroujon
Abstract: We present a powerful meta-clustering technique called Iterative Double Clustering (IDC). The IDC method is a natural extension of the recent Double Clustering (DC) method of Slonim and Tishby that exhibited impressive performance on text categorization tasks [12]. Using synthetically generated data we empirically find that whenever the DC procedure is successful in recovering some of the structure hidden in the data, the extended IDC procedure can incrementally compute a significantly more accurate classification. IDC is especially advantageous when the data exhibits high attribute noise. Our simulation results also show the effectiveness of IDC in text categorization problems. Surprisingly, this unsupervised procedure can be competitive with a (supervised) SVM trained with a small training set. Finally, we propose a simple and natural extension of IDC for semi-supervised and transductive learning where we are given both labeled and unlabeled examples. 1
2 0.13538875 30 nips-2001-Agglomerative Multivariate Information Bottleneck
Author: Noam Slonim, Nir Friedman, Naftali Tishby
Abstract: The information bottleneck method is an unsupervised model independent data organization technique. Given a joint distribution peA, B), this method constructs a new variable T that extracts partitions, or clusters, over the values of A that are informative about B. In a recent paper, we introduced a general principled framework for multivariate extensions of the information bottleneck method that allows us to consider multiple systems of data partitions that are inter-related. In this paper, we present a new family of simple agglomerative algorithms to construct such systems of inter-related clusters. We analyze the behavior of these algorithms and apply them to several real-life datasets.
3 0.13057679 171 nips-2001-Spectral Relaxation for K-means Clustering
Author: Hongyuan Zha, Xiaofeng He, Chris Ding, Ming Gu, Horst D. Simon
Abstract: The popular K-means clustering partitions a data set by minimizing a sum-of-squares cost function. A coordinate descend method is then used to find local minima. In this paper we show that the minimization can be reformulated as a trace maximization problem associated with the Gram matrix of the data vectors. Furthermore, we show that a relaxed version of the trace maximization problem possesses global optimal solutions which can be obtained by computing a partial eigendecomposition of the Gram matrix, and the cluster assignment for each data vectors can be found by computing a pivoted QR decomposition of the eigenvector matrix. As a by-product we also derive a lower bound for the minimum of the sum-of-squares cost function. 1
4 0.10389309 135 nips-2001-On Spectral Clustering: Analysis and an algorithm
Author: Andrew Y. Ng, Michael I. Jordan, Yair Weiss
Abstract: Despite many empirical successes of spectral clustering methodsalgorithms that cluster points using eigenvectors of matrices derived from the data- there are several unresolved issues. First, there are a wide variety of algorithms that use the eigenvectors in slightly different ways. Second, many of these algorithms have no proof that they will actually compute a reasonable clustering. In this paper, we present a simple spectral clustering algorithm that can be implemented using a few lines of Matlab. Using tools from matrix perturbation theory, we analyze the algorithm, and give conditions under which it can be expected to do well. We also show surprisingly good experimental results on a number of challenging clustering problems. 1
5 0.088345587 144 nips-2001-Partially labeled classification with Markov random walks
Author: Martin Szummer, Tommi Jaakkola
Abstract: To classify a large number of unlabeled examples we combine a limited number of labeled examples with a Markov random walk representation over the unlabeled examples. The random walk representation exploits any low dimensional structure in the data in a robust, probabilistic manner. We develop and compare several estimation criteria/algorithms suited to this representation. This includes in particular multi-way classification with an average margin criterion which permits a closed form solution. The time scale of the random walk regularizes the representation and can be set through a margin-based criterion favoring unambiguous classification. We also extend this basic regularization by adapting time scales for individual examples. We demonstrate the approach on synthetic examples and on text classification problems.
6 0.083104074 185 nips-2001-The Method of Quantum Clustering
7 0.076406285 132 nips-2001-Novel iteration schemes for the Cluster Variation Method
8 0.075474486 24 nips-2001-Active Information Retrieval
9 0.063478738 170 nips-2001-Spectral Kernel Methods for Clustering
10 0.060339808 33 nips-2001-An Efficient Clustering Algorithm Using Stochastic Association Model and Its Implementation Using Nanostructures
11 0.059722949 107 nips-2001-Latent Dirichlet Allocation
12 0.055841818 90 nips-2001-Hyperbolic Self-Organizing Maps for Semantic Navigation
13 0.053861119 53 nips-2001-Constructing Distributed Representations Using Additive Clustering
14 0.051423576 58 nips-2001-Covariance Kernels from Bayesian Generative Models
15 0.046700276 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine
16 0.046246868 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimentions
17 0.045683265 194 nips-2001-Using Vocabulary Knowledge in Bayesian Multinomial Estimation
18 0.045651134 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing
19 0.044954192 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
20 0.044413619 46 nips-2001-Categorization by Learning and Combining Object Parts
topicId topicWeight
[(0, -0.142), (1, 0.053), (2, -0.018), (3, -0.066), (4, 0.028), (5, -0.074), (6, -0.085), (7, -0.032), (8, -0.03), (9, -0.05), (10, 0.177), (11, -0.103), (12, -0.137), (13, 0.078), (14, -0.035), (15, 0.011), (16, 0.03), (17, -0.077), (18, -0.03), (19, -0.057), (20, 0.01), (21, -0.008), (22, -0.008), (23, -0.048), (24, 0.245), (25, 0.084), (26, 0.115), (27, 0.113), (28, 0.052), (29, -0.007), (30, -0.036), (31, -0.073), (32, -0.051), (33, 0.026), (34, -0.038), (35, -0.033), (36, -0.058), (37, -0.089), (38, 0.169), (39, -0.046), (40, 0.098), (41, 0.056), (42, 0.074), (43, 0.024), (44, 0.014), (45, -0.061), (46, 0.014), (47, -0.026), (48, -0.054), (49, 0.107)]
simIndex simValue paperId paperTitle
same-paper 1 0.93920231 100 nips-2001-Iterative Double Clustering for Unsupervised and Semi-Supervised Learning
Author: Ran El-Yaniv, Oren Souroujon
Abstract: We present a powerful meta-clustering technique called Iterative Double Clustering (IDC). The IDC method is a natural extension of the recent Double Clustering (DC) method of Slonim and Tishby that exhibited impressive performance on text categorization tasks [12]. Using synthetically generated data we empirically find that whenever the DC procedure is successful in recovering some of the structure hidden in the data, the extended IDC procedure can incrementally compute a significantly more accurate classification. IDC is especially advantageous when the data exhibits high attribute noise. Our simulation results also show the effectiveness of IDC in text categorization problems. Surprisingly, this unsupervised procedure can be competitive with a (supervised) SVM trained with a small training set. Finally, we propose a simple and natural extension of IDC for semi-supervised and transductive learning where we are given both labeled and unlabeled examples. 1
2 0.72293973 30 nips-2001-Agglomerative Multivariate Information Bottleneck
Author: Noam Slonim, Nir Friedman, Naftali Tishby
Abstract: The information bottleneck method is an unsupervised model independent data organization technique. Given a joint distribution peA, B), this method constructs a new variable T that extracts partitions, or clusters, over the values of A that are informative about B. In a recent paper, we introduced a general principled framework for multivariate extensions of the information bottleneck method that allows us to consider multiple systems of data partitions that are inter-related. In this paper, we present a new family of simple agglomerative algorithms to construct such systems of inter-related clusters. We analyze the behavior of these algorithms and apply them to several real-life datasets.
3 0.5478242 185 nips-2001-The Method of Quantum Clustering
Author: David Horn, Assaf Gottlieb
Abstract: We propose a novel clustering method that is an extension of ideas inherent to scale-space clustering and support-vector clustering. Like the latter, it associates every data point with a vector in Hilbert space, and like the former it puts emphasis on their total sum, that is equal to the scalespace probability function. The novelty of our approach is the study of an operator in Hilbert space, represented by the Schr¨ dinger equation of o which the probability function is a solution. This Schr¨ dinger equation o contains a potential function that can be derived analytically from the probability function. We associate minima of the potential with cluster centers. The method has one variable parameter, the scale of its Gaussian kernel. We demonstrate its applicability on known data sets. By limiting the evaluation of the Schr¨ dinger potential to the locations of data points, o we can apply this method to problems in high dimensions.
4 0.52805018 171 nips-2001-Spectral Relaxation for K-means Clustering
Author: Hongyuan Zha, Xiaofeng He, Chris Ding, Ming Gu, Horst D. Simon
Abstract: The popular K-means clustering partitions a data set by minimizing a sum-of-squares cost function. A coordinate descend method is then used to find local minima. In this paper we show that the minimization can be reformulated as a trace maximization problem associated with the Gram matrix of the data vectors. Furthermore, we show that a relaxed version of the trace maximization problem possesses global optimal solutions which can be obtained by computing a partial eigendecomposition of the Gram matrix, and the cluster assignment for each data vectors can be found by computing a pivoted QR decomposition of the eigenvector matrix. As a by-product we also derive a lower bound for the minimum of the sum-of-squares cost function. 1
5 0.50156891 53 nips-2001-Constructing Distributed Representations Using Additive Clustering
Author: Wheeler Ruml
Abstract: If the promise of computational modeling is to be fully realized in higherlevel cognitive domains such as language processing, principled methods must be developed to construct the semantic representations used in such models. In this paper, we propose the use of an established formalism from mathematical psychology, additive clustering, as a means of automatically constructing binary representations for objects using only pairwise similarity data. However, existing methods for the unsupervised learning of additive clustering models do not scale well to large problems. We present a new algorithm for additive clustering, based on a novel heuristic technique for combinatorial optimization. The algorithm is simpler than previous formulations and makes fewer independence assumptions. Extensive empirical tests on both human and synthetic data suggest that it is more effective than previous methods and that it also scales better to larger problems. By making additive clustering practical, we take a significant step toward scaling connectionist models beyond hand-coded examples.
6 0.48551303 135 nips-2001-On Spectral Clustering: Analysis and an algorithm
7 0.44799936 90 nips-2001-Hyperbolic Self-Organizing Maps for Semantic Navigation
8 0.40349919 194 nips-2001-Using Vocabulary Knowledge in Bayesian Multinomial Estimation
9 0.40159348 132 nips-2001-Novel iteration schemes for the Cluster Variation Method
10 0.39911258 107 nips-2001-Latent Dirichlet Allocation
11 0.39396635 144 nips-2001-Partially labeled classification with Markov random walks
12 0.39156261 33 nips-2001-An Efficient Clustering Algorithm Using Stochastic Association Model and Its Implementation Using Nanostructures
13 0.3152861 24 nips-2001-Active Information Retrieval
14 0.29158592 159 nips-2001-Reducing multiclass to binary by coupling probability estimates
15 0.27806208 26 nips-2001-Active Portfolio-Management based on Error Correction Neural Networks
16 0.26447469 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimentions
17 0.25740862 44 nips-2001-Blind Source Separation via Multinode Sparse Representation
18 0.25291812 149 nips-2001-Probabilistic Abstraction Hierarchies
19 0.2477096 99 nips-2001-Intransitive Likelihood-Ratio Classifiers
20 0.23830974 25 nips-2001-Active Learning in the Drug Discovery Process
topicId topicWeight
[(14, 0.025), (15, 0.035), (17, 0.014), (19, 0.022), (20, 0.012), (27, 0.107), (30, 0.107), (38, 0.026), (59, 0.024), (72, 0.079), (79, 0.034), (83, 0.027), (91, 0.19), (98, 0.017), (99, 0.161)]
simIndex simValue paperId paperTitle
1 0.91652864 185 nips-2001-The Method of Quantum Clustering
Author: David Horn, Assaf Gottlieb
Abstract: We propose a novel clustering method that is an extension of ideas inherent to scale-space clustering and support-vector clustering. Like the latter, it associates every data point with a vector in Hilbert space, and like the former it puts emphasis on their total sum, that is equal to the scalespace probability function. The novelty of our approach is the study of an operator in Hilbert space, represented by the Schr¨ dinger equation of o which the probability function is a solution. This Schr¨ dinger equation o contains a potential function that can be derived analytically from the probability function. We associate minima of the potential with cluster centers. The method has one variable parameter, the scale of its Gaussian kernel. We demonstrate its applicability on known data sets. By limiting the evaluation of the Schr¨ dinger potential to the locations of data points, o we can apply this method to problems in high dimensions.
same-paper 2 0.91271996 100 nips-2001-Iterative Double Clustering for Unsupervised and Semi-Supervised Learning
Author: Ran El-Yaniv, Oren Souroujon
Abstract: We present a powerful meta-clustering technique called Iterative Double Clustering (IDC). The IDC method is a natural extension of the recent Double Clustering (DC) method of Slonim and Tishby that exhibited impressive performance on text categorization tasks [12]. Using synthetically generated data we empirically find that whenever the DC procedure is successful in recovering some of the structure hidden in the data, the extended IDC procedure can incrementally compute a significantly more accurate classification. IDC is especially advantageous when the data exhibits high attribute noise. Our simulation results also show the effectiveness of IDC in text categorization problems. Surprisingly, this unsupervised procedure can be competitive with a (supervised) SVM trained with a small training set. Finally, we propose a simple and natural extension of IDC for semi-supervised and transductive learning where we are given both labeled and unlabeled examples. 1
3 0.82009363 182 nips-2001-The Fidelity of Local Ordinal Encoding
Author: Javid Sadr, Sayan Mukherjee, Keith Thoresz, Pawan Sinha
Abstract: A key question in neuroscience is how to encode sensory stimuli such as images and sounds. Motivated by studies of response properties of neurons in the early cortical areas, we propose an encoding scheme that dispenses with absolute measures of signal intensity or contrast and uses, instead, only local ordinal measures. In this scheme, the structure of a signal is represented by a set of equalities and inequalities across adjacent regions. In this paper, we focus on characterizing the fidelity of this representation strategy. We develop a regularization approach for image reconstruction from ordinal measures and thereby demonstrate that the ordinal representation scheme can faithfully encode signal structure. We also present a neurally plausible implementation of this computation that uses only local update rules. The results highlight the robustness and generalization ability of local ordinal encodings for the task of pattern classification. 1
4 0.81904554 150 nips-2001-Probabilistic Inference of Hand Motion from Neural Activity in Motor Cortex
Author: Yun Gao, Michael J. Black, Elie Bienenstock, Shy Shoham, John P. Donoghue
Abstract: Statistical learning and probabilistic inference techniques are used to infer the hand position of a subject from multi-electrode recordings of neural activity in motor cortex. First, an array of electrodes provides training data of neural firing conditioned on hand kinematics. We learn a nonparametric representation of this firing activity using a Bayesian model and rigorously compare it with previous models using cross-validation. Second, we infer a posterior probability distribution over hand motion conditioned on a sequence of neural test data using Bayesian inference. The learned firing models of multiple cells are used to define a nonGaussian likelihood term which is combined with a prior probability for the kinematics. A particle filtering method is used to represent, update, and propagate the posterior distribution over time. The approach is compared with traditional linear filtering methods; the results suggest that it may be appropriate for neural prosthetic applications.
5 0.81475282 161 nips-2001-Reinforcement Learning with Long Short-Term Memory
Author: Bram Bakker
Abstract: This paper presents reinforcement learning with a Long ShortTerm Memory recurrent neural network: RL-LSTM. Model-free RL-LSTM using Advantage(,x) learning and directed exploration can solve non-Markovian tasks with long-term dependencies between relevant events. This is demonstrated in a T-maze task, as well as in a difficult variation of the pole balancing task. 1
6 0.81376052 162 nips-2001-Relative Density Nets: A New Way to Combine Backpropagation with HMM's
7 0.81369877 160 nips-2001-Reinforcement Learning and Time Perception -- a Model of Animal Experiments
8 0.81069016 123 nips-2001-Modeling Temporal Structure in Classical Conditioning
9 0.8102057 102 nips-2001-KLD-Sampling: Adaptive Particle Filters
10 0.80920756 149 nips-2001-Probabilistic Abstraction Hierarchies
11 0.80856752 46 nips-2001-Categorization by Learning and Combining Object Parts
12 0.80653179 56 nips-2001-Convolution Kernels for Natural Language
14 0.80623955 143 nips-2001-PAC Generalization Bounds for Co-training
15 0.80616891 183 nips-2001-The Infinite Hidden Markov Model
16 0.80450225 104 nips-2001-Kernel Logistic Regression and the Import Vector Machine
17 0.80401194 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes
18 0.8037076 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data
19 0.80224991 39 nips-2001-Audio-Visual Sound Separation Via Hidden Markov Models
20 0.80209291 95 nips-2001-Infinite Mixtures of Gaussian Process Experts