nips nips2001 nips2001-171 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Hongyuan Zha, Xiaofeng He, Chris Ding, Ming Gu, Horst D. Simon
Abstract: The popular K-means clustering partitions a data set by minimizing a sum-of-squares cost function. A coordinate descent method is then used to find local minima. In this paper we show that the minimization can be reformulated as a trace maximization problem associated with the Gram matrix of the data vectors. Furthermore, we show that a relaxed version of the trace maximization problem possesses global optimal solutions which can be obtained by computing a partial eigendecomposition of the Gram matrix, and the cluster assignment for each data vector can be found by computing a pivoted QR decomposition of the eigenvector matrix. As a by-product we also derive a lower bound for the minimum of the sum-of-squares cost function. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract The popular K-means clustering partitions a data set by minimizing a sum-of-squares cost function. [sent-11, score-0.306]
2 A coordinate descent method is then used to find local minima. [sent-12, score-0.06]
3 In this paper we show that the minimization can be reformulated as a trace maximization problem associated with the Gram matrix of the data vectors. [sent-13, score-0.584]
4 As a by-product we also derive a lower bound for the minimum of the sum-of-squares cost function. [sent-15, score-0.12]
5 1 Introduction K-means is a very popular method for general clustering [6]. [sent-16, score-0.232]
6 Despite the popularity of K-means clustering, one of its major drawbacks is that the coordinate descent search method is prone to local minima. [sent-18, score-0.06]
7 Much research has been done on computing refined initial points and adding explicit constraints to the sum-of-squares cost function for K-means clustering so that the search can converge to better local minima [1, 2]. [sent-19, score-0.306]
8 In this paper we tackle the problem from a different angle: we find an equivalent formulation of the sum-of-squares minimization as a trace maximization problem with special constraints; relaxing the constraints leads to a maximization problem that possesses optimal global solutions. [sent-20, score-0.582]
9 As a by-product we also have an easily computable lower bound for the minimum of the sum-of-squares cost function. [sent-21, score-0.12]
10 Our work is inspired by [9, 3], where the connection to the Gram matrix and the extension of the K-means method to general Mercer kernels were investigated. [sent-22, score-0.161]
11 The rest of the paper is organized as follows: in section 2, we derive the equivalent trace maximization formulation and discuss its spectral relaxation. [sent-23, score-0.487]
12 In section 3, we discuss how to assign cluster membership using pivoted QR decomposition, taking into account the special structure of the partial eigenvector matrix. [sent-24, score-0.672]
13 Finally, in section 4, we illustrate the performance of the clustering algorithms using document clustering as an example. [sent-25, score-0.548]
14 The Frobenius norm of a matrix is $\|A\|_F = \sqrt{\mathrm{trace}(A^T A)}$. [sent-32, score-0.239]
15 2 Spectral Relaxation Given a set of m-dimensional data vectors $a_i$, $i = 1, \ldots, n$, [sent-34, score-0.077]
16 we form the m-by-n data matrix $A = [a_1, \ldots, a_n]$. [sent-37, score-0.161]
17 A partition $\Pi$ of the data vectors can be written as $A E = [A_1, \ldots, A_k]$ (1), where E is a permutation matrix and $A_i$ is m-by-$s_i$. [sent-38, score-0.194]
18 For a given partition $\Pi$ in (1), the associated sum-of-squares cost function is defined as $\mathrm{ss}(\Pi) = \sum_{i=1}^{k}\sum_{s=1}^{s_i} \|a_s^{(i)} - m_i\|^2$, where $m_i = \sum_{s=1}^{s_i} a_s^{(i)} / s_i$, [sent-41, score-0.141]
19 i.e., $m_i$ is the mean vector of the data vectors in cluster i. [sent-43, score-0.425]
20 Let the n-by-k orthonormal matrix X be $X = \mathrm{diag}(e_{s_1}/\sqrt{s_1}, \ldots, e_{s_k}/\sqrt{s_k})$ (2), where $e_{s_i}$ denotes the vector of all ones of length $s_i$. [sent-46, score-0.204]
21 The sum-of-squares cost function can now be written as $\mathrm{ss}(\Pi) = \mathrm{trace}(A^T A) - \mathrm{trace}(X^T A^T A X)$, and its minimization is equivalent to $\max\{\mathrm{trace}(X^T A^T A X) : X \text{ of the form in (2)}\}$. [sent-47, score-0.074]
22 Writing $X = [x_1, \ldots, x_k]$, it is easy to see that $\mathrm{trace}(X^T A^T A X) = \sum_{i=1}^{k} x_i^T A^T A x_i / (x_i^T x_i) = \sum_{i=1}^{k} \|A x_i\|^2 / \|x_i\|^2$. Using the partition in (1), the right-hand side of the above can be written as a weighted sum of the squared Euclidean norms of the mean vectors of the clusters. [sent-62, score-0.067]
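As a quick numerical check of this identity, here is a minimal numpy sketch of our own (not part of the paper; all variable names are ours) that compares the sum-of-squares cost of an arbitrary partition with trace(A^T A) - trace(X^T A^T A X), where X is built from the normalized cluster indicators:

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, k = 5, 20, 3
    A = rng.standard_normal((m, n))              # columns of A are the data vectors
    labels = rng.permutation(np.arange(n) % k)   # an arbitrary partition Pi

    # direct sum-of-squares cost: squared distances to each cluster mean
    ss = sum(np.sum((A[:, labels == i]
                     - A[:, labels == i].mean(axis=1, keepdims=True)) ** 2)
             for i in range(k))

    # trace formulation: column i of X is e/sqrt(s_i) on the rows of cluster i
    X = np.zeros((n, k))
    for i in range(k):
        idx = np.flatnonzero(labels == i)
        X[idx, i] = 1.0 / np.sqrt(len(idx))
    G = A.T @ A                                  # Gram matrix
    assert np.isclose(ss, np.trace(G) - np.trace(X.T @ G @ X))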
23 If we consider the elements of the Gram matrix AT A as measuring similarity between data vectors, then we have shown that Euclidean distance leads to Euclidean inner-product similarity. [sent-64, score-0.161]
24 Ignoring the special structure of X and letting it be an arbitrary orthonormal matrix, we obtain the relaxed maximization problem $\max_{X^T X = I_k} \mathrm{trace}(X^T A^T A X)$ (3). It turns out that the above trace maximization problem has a closed-form solution. [sent-66, score-0.664]
25 (Ky Fan) Let H be a symmetric matrix with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$ [sent-68, score-0.207]
26 and corresponding eigenvectors $U = [u_1, \ldots, u_n]$; then $\lambda_1 + \cdots + \lambda_k = \max\{\mathrm{trace}(X^T H X) : X^T X = I_k\}$, with the optimum attained at $X = [u_1, \ldots, u_k]Q$ for an arbitrary k-by-k orthogonal matrix Q. [sent-71, score-0.086]
27 It follows from the above theorem that we need to compute the largest k eigenvectors of the Gram matrix AT A. [sent-83, score-0.372]
28 As a by-product, we have $\min_{\Pi} \mathrm{ss}(\Pi) \ge \mathrm{trace}(A^T A) - \max_{X^T X = I_k} \mathrm{trace}(X^T A^T A X) = \sum_{i=k+1}^{\min\{m,n\}} \sigma_i^2(A)$ (4), where $\sigma_i(A)$ is the i-th largest singular value of A. [sent-84, score-0.137]
29 This gives a lower bound for the minimum of the sum-of-squares cost function. [sent-85, score-0.12]
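In code, the bound (4) needs only the singular values of A; the following short sketch (ours, not the authors') computes it:

    import numpy as np

    def kmeans_lower_bound(A, k):
        """Bound (4): min_Pi ss(Pi) >= sum of the squared singular values of A beyond the k-th."""
        sigma = np.linalg.svd(A, compute_uv=False)   # singular values in descending order
        return float(np.sum(sigma[k:] ** 2))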
30 Note that $\|A_i - m_i e^T\|_F^2 = \frac{1}{2 s_i} \sum_{a_j \in A_i} \sum_{a_{j'} \in A_i} \|a_j - a_{j'}\|^2$. [sent-90, score-0.058]
31 Let $W = (\|a_i - a_j\|^2)_{i,j=1}^{n}$ and $x_i = [x_{ij}]_{j=1}^{n}$ with $x_{ij} = 1$ if $a_j \in A_i$ and $x_{ij} = 0$ otherwise; then $\mathrm{ss}(\Pi) = \frac{1}{2} \sum_{i=1}^{k} \frac{x_i^T W x_i}{x_i^T x_i} \ge \frac{1}{2} \min_{Z^T Z = I_k} \mathrm{trace}(Z^T W Z) = \frac{1}{2} \sum_{i=n-k+1}^{n} \lambda_i(W)$. [sent-91, score-0.213]
32 Unfortunately, some of the smallest eigenvalues of W can be negative. [sent-92, score-0.046]
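For comparison, this distance-matrix bound can be sketched as follows (our own code, following the reconstruction above; as noted, it can be weak because the smallest eigenvalues of W may be negative):

    import numpy as np

    def distance_matrix_bound(A, k):
        """(1/2) * sum of the k smallest eigenvalues of W, where W_ij = ||a_i - a_j||^2."""
        sq = np.sum(A ** 2, axis=0)
        W = sq[:, None] + sq[None, :] - 2.0 * (A.T @ A)   # pairwise squared distances
        lam = np.linalg.eigvalsh(W)                       # eigenvalues in ascending order
        return 0.5 * float(np.sum(lam[:k]))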
33 Let $X_k$ be the n-by-k matrix consisting of the k largest eigenvectors of $A^T A$. [sent-93, score-0.331]
34 Each row of $X_k$ corresponds to a data vector, and the above process can be considered as transforming the original data vectors, which live in an m-dimensional space, to new data vectors which now live in a k-dimensional space. [sent-94, score-0.256]
35 One might be tempted to compute the cluster assignment by applying the ordinary K-means method to those data vectors in the reduced-dimension space. [sent-95, score-0.47]
36 In the next section, we discuss an alternative that takes into account the structure of the eigenvector matrix $X_k$ [5]. [sent-96, score-0.302]
37 The similarity of the projection process to principal component analysis is deceiving: the goal here is not to reconstruct the data matrix using a low-rank approximation but rather to capture its cluster structure. [sent-98, score-0.451]
38 3 Cluster Assignment Using Pivoted QR Decomposition Without loss of generality, let us assume that the best partition of the data vectors in A that minimizes $\mathrm{ss}(\Pi)$ is given by $A = [A_1, \ldots, A_k]$, each submatrix $A_i$ corresponding to a cluster. [sent-99, score-0.144]
39 Now write the Gram matrix of A as $A^T A = \mathrm{diag}(A_1^T A_1, \ldots, A_k^T A_k) + E =: B + E$. [sent-100, score-0.161]
40 If the overlaps among the clusters represented by the submatrices $A_i$ are small, then the norm of E will be small compared with the block-diagonal matrix B in the above equation. [sent-101, score-0.351]
41 Let the largest eigenvector of $A_i^T A_i$ be $y_i$ with $\|y_i\| = 1$ and $A_i^T A_i y_i = \mu_i y_i$, $i = 1, \ldots, k$; then the columns of the block-diagonal matrix $Y_k = \mathrm{diag}(y_1, \ldots, y_k)$ are eigenvectors of B. [sent-102, score-0.442]
42 Let the eigenvalues and eigenvectors of $A^T A$ be $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$ and $x_1, \ldots, x_n$, respectively. [sent-106, score-0.132]
43 If $\|E\|$ is small relative to the relevant eigenvalue gap, then $X_k = [x_1, \ldots, x_k] = Y_k V + O(\|E\|)$, where V is a k-by-k orthogonal matrix. [sent-127, score-0.068]
44 Ignoring the $O(\|E\|)$ term, we see that the columns of $X_k^T$ belonging to cluster i are the multiples $y_{i1} v_i, \ldots, y_{i s_i} v_i$ of a single vector $v_i$, where we have used $y_i^T = [y_{i1}, \ldots, y_{i s_i}]$ and $v_1, \ldots, v_k$ denote the columns of $V^T$. [sent-128, score-0.58]
45 A key observation is that all the $v_i$ are orthogonal to each other: once we have selected a $v_i$, we can jump to other clusters by looking at the orthogonal complement of $v_i$. Also notice that $\|y_i\| = 1$, so the elements of $y_i$ cannot all be small. [sent-136, score-0.188]
46 A robust implementation of the above idea can be obtained as follows: we pick the column of $X_k^T$ that has the largest norm; say it belongs to cluster i. We then orthogonalize the rest of the columns of $X_k^T$ against this column. [sent-137, score-0.477]
47 For the columns belonging to cluster i the residual vectors will have small norm, and for the other columns the residual vectors will tend not to be small. [sent-138, score-0.655]
48 We then pick another vector with the largest residual norm, and orthogonalize the other residual vectors against this residual vector. [sent-139, score-0.494]
49 The process can be carried out for k steps, and it turns out to be exactly the QR decomposition with column pivoting applied to $X_k^T$ [4], [sent-140, score-0.204]
50 i.e., we find a permutation matrix P such that $X_k^T P = Q R = Q [R_{11}, R_{12}]$, where Q is a k-by-k orthogonal matrix and $R_{11}$ is a k-by-k upper triangular matrix. [sent-142, score-0.279]
51 We then compute the matrix $\hat{R} = R_{11}^{-1} [R_{11}, R_{12}] P^T = [I_k, R_{11}^{-1} R_{12}] P^T$. [sent-143, score-0.202]
52 Then the cluster membership of each data vector is determined by the row index of the largest element in absolute value of the corresponding column of $\hat{R}$. [sent-144, score-0.54]
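A compact sketch of the whole assignment step (our own numpy/scipy code, not the authors' implementation; function and variable names are ours): compute the k largest eigenvectors of A^T A from the SVD of A, apply QR with column pivoting to X_k^T, form R_hat = R_11^{-1}[R_11, R_12]P^T, and read off the cluster of each data vector from the largest absolute entry in its column of R_hat.

    import numpy as np
    from scipy.linalg import qr, solve_triangular

    def spectral_qr_clusters(A, k):
        # k largest eigenvectors of A^T A = top-k right singular vectors of A
        _, _, Vt = np.linalg.svd(A, full_matrices=False)
        Xk = Vt[:k].T                                   # n-by-k eigenvector matrix
        # QR with column pivoting: Xk.T[:, piv] = Q @ R, with R = [R11, R12]
        _, R, piv = qr(Xk.T, pivoting=True)
        R11, R12 = R[:, :k], R[:, k:]
        # Rhat = R11^{-1} [R11, R12] P^T  (undo the column permutation)
        Rhat_piv = solve_triangular(R11, np.hstack([R11, R12]))
        Rhat = np.empty_like(Rhat_piv)
        Rhat[:, piv] = Rhat_piv
        # cluster of data vector j = row index of largest |entry| in column j of Rhat
        return np.argmax(np.abs(Rhat), axis=0)

On an m-by-n data matrix A, spectral_qr_clusters(A, k) returns one label in {0, ..., k-1} per column; the labels are only defined up to a permutation.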
53 REMARK: Sometimes it may be advantageous to include more than k eigenvectors to form $X_s^T$ with s > k. [sent-145, score-0.086]
54 We can still use QR decomposition with column pivoting to select k columns of $X_s^T$ to form an s-by-k matrix, say $\hat{X}$. [sent-146, score-0.26]
55 Then for each column z of $X_s^T$ we compute the least squares solution $t^* = \arg\min_{t \in \mathbb{R}^k} \|z - \hat{X} t\|$. [sent-147, score-0.103]
56 Then the cluster membership of z is determined by the row index of the largest element in absolute value of $t^*$. [sent-148, score-0.478]
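A sketch of this variant (again our own code, not from the paper): select k columns of X_s^T by pivoted QR and then assign every column of X_s^T by least squares against the selected s-by-k block.

    import numpy as np
    from scipy.linalg import qr

    def assign_with_extra_eigenvectors(A, k, s):
        _, _, Vt = np.linalg.svd(A, full_matrices=False)
        XsT = Vt[:s]                                    # s-by-n matrix Xs^T
        _, _, piv = qr(XsT, pivoting=True)
        Xhat = XsT[:, piv[:k]]                          # the k selected columns (s-by-k)
        # t* = argmin_t ||z - Xhat t|| for every column z of Xs^T, solved in one call
        T, *_ = np.linalg.lstsq(Xhat, XsT, rcond=None)  # T is k-by-n
        return np.argmax(np.abs(T), axis=0)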
57 4 Experimental Results In this section we present our experimental results on clustering a dataset of newsgroup articles submitted to 20 newsgroups. [sent-149, score-0.526]
58 1 This dataset contains about 20,000 articles (email messages) evenly divided among the 20 newsgroups. [sent-150, score-0.091]
59 We list the names of the news groups together with the associated group labels. [sent-151, score-0.215]
60 1The newsgroup dataset together with the bow toolkit for processing it can be downloaded from http://www. [sent-152, score-0.431]
61 Figure 1: Clustering accuracy for five newsgroups NG2/NG9/NG10/NG15/NG18: p-QR vs. p-Kmeans. [sent-167, score-0.201]
62 We used the bow toolkit to construct the term-document matrix for this dataset; specifically, we used the tokenization option so that the UseNet headers are stripped, and we also applied stemming [8]. [sent-204, score-0.341]
63 1) We used the tf.idf weighting scheme; 2) we deleted words that appear too few times; 3) we normalized each document vector to have unit Euclidean length. [sent-206, score-0.084]
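The paper builds the term-document matrix with the bow toolkit; a rough modern stand-in for steps 1)-3) (an assumed scikit-learn sketch, not the authors' pipeline, and without the header stripping and stemming mentioned above) looks like:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["graphics card driver question", "motorcycle helmet advice",
            "baseball season opener", "graphics rendering question"]  # stand-in articles
    # 1) tf.idf weighting, 2) drop words appearing in too few documents (min_df),
    # 3) unit Euclidean length per document (norm='l2')
    vectorizer = TfidfVectorizer(min_df=2, norm="l2")
    A = vectorizer.fit_transform(docs).toarray().T      # term-by-document matrix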
64 For both K-means methods, we start with a set of cluster centers chosen randomly from the (projected) data vectors, and we also make sure that the same random set is used for both for comparison. [sent-208, score-0.29]
65 To assess the quality of a clustering algorithm, we take advantage of the fact that the news group data are already labeled and we measure the performance by the accuracy of the clustering algorithm against the document category labels [10]. [sent-209, score-0.867]
66 In particular, for the k-cluster case, we compute a k-by-k confusion matrix $C = [c_{ij}]$ with $c_{ij}$ the number of documents in cluster i that belong to newsgroup category j. [sent-210, score-1.033]
67 It is actually quite subtle to compute the accuracy using the confusion matrix because we do not know which cluster matches which newsgroup category. [sent-211, score-0.843]
68 An optimal way is to solve the maximization problem $\max\{\mathrm{trace}(C P) : P \text{ is a permutation matrix}\}$, and divide the maximum by the total number of documents to get the accuracy. [sent-212, score-0.155]
69 In all our experiments, we used a greedy algorithm to compute a sub-optimal solution. [sent-214, score-0.041]
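The exact maximum of trace(CP) over permutation matrices can also be obtained with the Hungarian algorithm; a small sketch of the accuracy computation (our code, using scipy's linear_sum_assignment rather than the paper's greedy heuristic):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def clustering_accuracy(pred, true, k):
        """max_P trace(C P) / n, with C_ij = #documents in cluster i whose true label is j."""
        C = np.zeros((k, k), dtype=int)
        for i, j in zip(pred, true):
            C[i, j] += 1
        rows, cols = linear_sum_assignment(-C)   # maximize the total matched count
        return C[rows, cols].sum() / len(pred)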
70 Table 1: Comparison of p-QR, p-Kmeans, and K-means for two-way clustering of the newsgroup pairs NG1/NG2, NG2/NG3, NG8/NG9, NG10/NG11, NG1/NG15, and NG18/NG19. [sent-215, score-0.232]
71 We chose 50 random document vectors each from two newsgroups. [sent-289, score-0.161]
72 We tested 100 runs for each pair of newsgroups, and list the means and standard deviations in Table 1. [sent-290, score-0.11]
73 The two clustering algorithms p-QR and p-Kmeans are comparable to each other, and both are better, sometimes substantially better, than K-means. [sent-291, score-0.308]
74 In this example, we consider k-way clustering with k = 5 and k = 10. [sent-293, score-0.232]
75 Three newsgroup sets are chosen, with 50 and 100 random samples from each newsgroup as indicated in parentheses. [sent-294, score-0.368]
76 Again, 100 runs are used for each test and the means and standard deviations are listed in Table 2. [sent-295, score-0.06]
77 Moreover, in Figure 1, we also plot the accuracy for the 100 runs for the test NG2/NG9/NG10/NG15/NG18 (50). [sent-296, score-0.16]
78 Both p-QR and p-Kmeans perform better than K-means. [sent-297, score-0.076]
79 For newsgroup sets with small overlaps, p-QR performs better than p-Kmeans. [sent-298, score-0.241]
80 This might be explained by the fact that p-QR explores the special structure of the eigenvector matrix and is therefore more efficient. [sent-299, score-0.344]
81 As a less thorough comparison with the information bottleneck method used in [10]: there, for 15 runs of NG2/NG9/NG10/NG15/NG18 (100), the mean accuracy was 56. [sent-300, score-0.199]
82 For 15 runs of the 10-newsgroup set with 50 samples, the mean accuracy was 35. [sent-303, score-0.363]
83 We only list a typical sample from NG2/NG9/NG10/NG15/NG18 (50). [sent-308, score-0.05]
84 The column with "NG labels" indicates clustering using the newsgroup labels and by definition has 100% accuracy. [sent-309, score-0.551]
85 It is quite clear that the newsgroup categories are not completely captured by the sum-of-squares cost function, because p-QR and "NG labels" both have higher accuracy but also larger sum-of-squares values. [sent-310, score-0.339]
86 Interestingly, it seems that p-QR captures some of this information about the newsgroup categories. [sent-311, score-0.245]
87 Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. [sent-368, score-0.09]
88 Document clustering using word clusters via the information bottleneck method. [sent-385, score-0.323]
wordName wordTfidf (topN-words)
[('trace', 0.318), ('cluster', 0.29), ('clustering', 0.232), ('newsgroup', 0.203), ('xt', 0.163), ('matrix', 0.161), ('qr', 0.146), ('eigenvector', 0.141), ('pivoted', 0.137), ('isi', 0.136), ('ai', 0.133), ('si', 0.126), ('gram', 0.125), ('news', 0.119), ('ss', 0.108), ('ax', 0.106), ('maximization', 0.105), ('membership', 0.104), ('ee', 0.103), ('iiai', 0.103), ('ssi', 0.103), ('ii', 0.102), ('newsgroups', 0.101), ('accuracy', 0.1), ('toolkit', 0.09), ('kmeans', 0.09), ('bow', 0.09), ('berkeley', 0.088), ('residual', 0.088), ('eigenvectors', 0.086), ('largest', 0.084), ('document', 0.084), ('aj', 0.083), ('norm', 0.078), ('vectors', 0.077), ('lsi', 0.076), ('han', 0.076), ('cost', 0.074), ('decomposition', 0.073), ('xs', 0.072), ('aet', 0.069), ('axi', 0.069), ('ding', 0.069), ('eai', 0.069), ('hongyuan', 0.069), ('horst', 0.069), ('iieii', 0.069), ('ila', 0.069), ('ming', 0.069), ('orthogonalize', 0.069), ('pivoting', 0.069), ('xiaofeng', 0.069), ('xtx', 0.069), ('zha', 0.069), ('euclidean', 0.068), ('orthogonal', 0.068), ('partition', 0.067), ('spectral', 0.064), ('assignment', 0.062), ('column', 0.062), ('runs', 0.06), ('descend', 0.06), ('xij', 0.06), ('overlaps', 0.06), ('mi', 0.058), ('ng', 0.058), ('mercer', 0.056), ('columns', 0.056), ('possesses', 0.054), ('bradley', 0.054), ('labels', 0.054), ('max', 0.053), ('clusters', 0.052), ('live', 0.051), ('cij', 0.051), ('list', 0.05), ('permutation', 0.05), ('dataset', 0.048), ('confusion', 0.048), ('min', 0.047), ('bound', 0.046), ('group', 0.046), ('eigenvalues', 0.046), ('gu', 0.045), ('ul', 0.043), ('pennsylvania', 0.043), ('orthonormal', 0.043), ('articles', 0.043), ('ignoring', 0.042), ('hat', 0.042), ('rj', 0.042), ('chris', 0.042), ('compute', 0.041), ('uc', 0.04), ('relaxed', 0.04), ('xi', 0.04), ('ak', 0.039), ('bottleneck', 0.039), ('relaxation', 0.039), ('vi', 0.038)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 171 nips-2001-Spectral Relaxation for K-means Clustering
Author: Hongyuan Zha, Xiaofeng He, Chris Ding, Ming Gu, Horst D. Simon
Abstract: The popular K-means clustering partitions a data set by minimizing a sum-of-squares cost function. A coordinate descend method is then used to find local minima. In this paper we show that the minimization can be reformulated as a trace maximization problem associated with the Gram matrix of the data vectors. Furthermore, we show that a relaxed version of the trace maximization problem possesses global optimal solutions which can be obtained by computing a partial eigendecomposition of the Gram matrix, and the cluster assignment for each data vectors can be found by computing a pivoted QR decomposition of the eigenvector matrix. As a by-product we also derive a lower bound for the minimum of the sum-of-squares cost function. 1
2 0.30701569 135 nips-2001-On Spectral Clustering: Analysis and an algorithm
Author: Andrew Y. Ng, Michael I. Jordan, Yair Weiss
Abstract: Despite many empirical successes of spectral clustering methodsalgorithms that cluster points using eigenvectors of matrices derived from the data- there are several unresolved issues. First, there are a wide variety of algorithms that use the eigenvectors in slightly different ways. Second, many of these algorithms have no proof that they will actually compute a reasonable clustering. In this paper, we present a simple spectral clustering algorithm that can be implemented using a few lines of Matlab. Using tools from matrix perturbation theory, we analyze the algorithm, and give conditions under which it can be expected to do well. We also show surprisingly good experimental results on a number of challenging clustering problems. 1
3 0.19826205 170 nips-2001-Spectral Kernel Methods for Clustering
Author: Nello Cristianini, John Shawe-Taylor, Jaz S. Kandola
Abstract: In this paper we introduce new algorithms for unsupervised learning based on the use of a kernel matrix. All the information required by such algorithms is contained in the eigenvectors of the matrix or of closely related matrices. We use two different but related cost functions, the Alignment and the 'cut cost'. The first one is discussed in a companion paper [3], the second one is based on graph theoretic concepts. Both functions measure the level of clustering of a labeled dataset, or the correlation between data clusters and labels. We state the problem of unsupervised learning as assigning labels so as to optimize these cost functions. We show how the optimal solution can be approximated by slightly relaxing the corresponding optimization problem, and how this corresponds to using eigenvector information. The resulting simple algorithms are tested on real world data with positive results. 1
4 0.13777491 164 nips-2001-Sampling Techniques for Kernel Methods
Author: Dimitris Achlioptas, Frank Mcsherry, Bernhard Schölkopf
Abstract: We propose randomized techniques for speeding up Kernel Principal Component Analysis on three levels: sampling and quantization of the Gram matrix in training, randomized rounding in evaluating the kernel expansions, and random projections in evaluating the kernel itself. In all three cases, we give sharp bounds on the accuracy of the obtained approximations. Rather intriguingly, all three techniques can be viewed as instantiations of the following idea: replace the kernel function by a “randomized kernel” which behaves like in expectation.
5 0.13544728 35 nips-2001-Analysis of Sparse Bayesian Learning
Author: Anita C. Faul, Michael E. Tipping
Abstract: The recent introduction of the 'relevance vector machine' has effectively demonstrated how sparsity may be obtained in generalised linear models within a Bayesian framework. Using a particular form of Gaussian parameter prior, 'learning' is the maximisation, with respect to hyperparameters, of the marginal likelihood of the data. This paper studies the properties of that objective function, and demonstrates that conditioned on an individual hyperparameter, the marginal likelihood has a unique maximum which is computable in closed form. It is further shown that if a derived 'sparsity criterion' is satisfied, this maximum is exactly equivalent to 'pruning' the corresponding parameter from the model. 1
6 0.13205059 185 nips-2001-The Method of Quantum Clustering
7 0.13134669 136 nips-2001-On the Concentration of Spectral Properties
8 0.13057679 100 nips-2001-Iterative Double Clustering for Unsupervised and Semi-Supervised Learning
9 0.12759559 24 nips-2001-Active Information Retrieval
10 0.12112883 132 nips-2001-Novel iteration schemes for the Cluster Variation Method
11 0.11539267 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
12 0.10977282 58 nips-2001-Covariance Kernels from Bayesian Generative Models
13 0.09640383 105 nips-2001-Kernel Machines and Boolean Functions
14 0.094928563 44 nips-2001-Blind Source Separation via Multinode Sparse Representation
15 0.090810284 53 nips-2001-Constructing Distributed Representations Using Additive Clustering
16 0.089664884 89 nips-2001-Grouping with Bias
17 0.081967615 33 nips-2001-An Efficient Clustering Algorithm Using Stochastic Association Model and Its Implementation Using Nanostructures
18 0.081136473 127 nips-2001-Multi Dimensional ICA to Separate Correlated Sources
19 0.07803335 134 nips-2001-On Kernel-Target Alignment
20 0.073450893 194 nips-2001-Using Vocabulary Knowledge in Bayesian Multinomial Estimation
topicId topicWeight
[(0, -0.235), (1, 0.089), (2, 0.001), (3, -0.304), (4, 0.208), (5, -0.091), (6, -0.095), (7, 0.028), (8, -0.026), (9, -0.059), (10, 0.188), (11, -0.051), (12, -0.126), (13, 0.209), (14, -0.059), (15, -0.021), (16, 0.031), (17, -0.053), (18, 0.056), (19, -0.118), (20, 0.072), (21, -0.036), (22, 0.142), (23, 0.033), (24, 0.177), (25, -0.032), (26, 0.016), (27, 0.036), (28, -0.039), (29, -0.054), (30, 0.013), (31, -0.021), (32, 0.035), (33, -0.143), (34, 0.013), (35, 0.056), (36, -0.007), (37, -0.046), (38, 0.023), (39, -0.021), (40, 0.044), (41, -0.019), (42, -0.018), (43, -0.1), (44, -0.037), (45, -0.012), (46, 0.004), (47, -0.049), (48, -0.07), (49, -0.051)]
simIndex simValue paperId paperTitle
same-paper 1 0.97456974 171 nips-2001-Spectral Relaxation for K-means Clustering
Author: Hongyuan Zha, Xiaofeng He, Chris Ding, Ming Gu, Horst D. Simon
Abstract: The popular K-means clustering partitions a data set by minimizing a sum-of-squares cost function. A coordinate descend method is then used to find local minima. In this paper we show that the minimization can be reformulated as a trace maximization problem associated with the Gram matrix of the data vectors. Furthermore, we show that a relaxed version of the trace maximization problem possesses global optimal solutions which can be obtained by computing a partial eigendecomposition of the Gram matrix, and the cluster assignment for each data vectors can be found by computing a pivoted QR decomposition of the eigenvector matrix. As a by-product we also derive a lower bound for the minimum of the sum-of-squares cost function. 1
2 0.88727397 135 nips-2001-On Spectral Clustering: Analysis and an algorithm
Author: Andrew Y. Ng, Michael I. Jordan, Yair Weiss
Abstract: Despite many empirical successes of spectral clustering methodsalgorithms that cluster points using eigenvectors of matrices derived from the data- there are several unresolved issues. First, there are a wide variety of algorithms that use the eigenvectors in slightly different ways. Second, many of these algorithms have no proof that they will actually compute a reasonable clustering. In this paper, we present a simple spectral clustering algorithm that can be implemented using a few lines of Matlab. Using tools from matrix perturbation theory, we analyze the algorithm, and give conditions under which it can be expected to do well. We also show surprisingly good experimental results on a number of challenging clustering problems. 1
3 0.6222418 100 nips-2001-Iterative Double Clustering for Unsupervised and Semi-Supervised Learning
Author: Ran El-Yaniv, Oren Souroujon
Abstract: We present a powerful meta-clustering technique called Iterative Double Clustering (IDC). The IDC method is a natural extension of the recent Double Clustering (DC) method of Slonim and Tishby that exhibited impressive performance on text categorization tasks [12]. Using synthetically generated data we empirically find that whenever the DC procedure is successful in recovering some of the structure hidden in the data, the extended IDC procedure can incrementally compute a significantly more accurate classification. IDC is especially advantageous when the data exhibits high attribute noise. Our simulation results also show the effectiveness of IDC in text categorization problems. Surprisingly, this unsupervised procedure can be competitive with a (supervised) SVM trained with a small training set. Finally, we propose a simple and natural extension of IDC for semi-supervised and transductive learning where we are given both labeled and unlabeled examples. 1
4 0.58378834 185 nips-2001-The Method of Quantum Clustering
Author: David Horn, Assaf Gottlieb
Abstract: We propose a novel clustering method that is an extension of ideas inherent to scale-space clustering and support-vector clustering. Like the latter, it associates every data point with a vector in Hilbert space, and like the former it puts emphasis on their total sum, that is equal to the scalespace probability function. The novelty of our approach is the study of an operator in Hilbert space, represented by the Schr¨ dinger equation of o which the probability function is a solution. This Schr¨ dinger equation o contains a potential function that can be derived analytically from the probability function. We associate minima of the potential with cluster centers. The method has one variable parameter, the scale of its Gaussian kernel. We demonstrate its applicability on known data sets. By limiting the evaluation of the Schr¨ dinger potential to the locations of data points, o we can apply this method to problems in high dimensions.
5 0.49287224 53 nips-2001-Constructing Distributed Representations Using Additive Clustering
Author: Wheeler Ruml
Abstract: If the promise of computational modeling is to be fully realized in higherlevel cognitive domains such as language processing, principled methods must be developed to construct the semantic representations used in such models. In this paper, we propose the use of an established formalism from mathematical psychology, additive clustering, as a means of automatically constructing binary representations for objects using only pairwise similarity data. However, existing methods for the unsupervised learning of additive clustering models do not scale well to large problems. We present a new algorithm for additive clustering, based on a novel heuristic technique for combinatorial optimization. The algorithm is simpler than previous formulations and makes fewer independence assumptions. Extensive empirical tests on both human and synthetic data suggest that it is more effective than previous methods and that it also scales better to larger problems. By making additive clustering practical, we take a significant step toward scaling connectionist models beyond hand-coded examples.
6 0.49063396 24 nips-2001-Active Information Retrieval
7 0.49062887 132 nips-2001-Novel iteration schemes for the Cluster Variation Method
8 0.46409357 120 nips-2001-Minimax Probability Machine
9 0.4566358 30 nips-2001-Agglomerative Multivariate Information Bottleneck
10 0.44837409 136 nips-2001-On the Concentration of Spectral Properties
11 0.4478569 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
12 0.43793091 170 nips-2001-Spectral Kernel Methods for Clustering
13 0.42542744 35 nips-2001-Analysis of Sparse Bayesian Learning
14 0.41563654 44 nips-2001-Blind Source Separation via Multinode Sparse Representation
15 0.38675955 106 nips-2001-Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering
16 0.38392207 33 nips-2001-An Efficient Clustering Algorithm Using Stochastic Association Model and Its Implementation Using Nanostructures
17 0.37208107 180 nips-2001-The Concave-Convex Procedure (CCCP)
18 0.36307997 164 nips-2001-Sampling Techniques for Kernel Methods
19 0.35982475 154 nips-2001-Products of Gaussians
20 0.31100675 156 nips-2001-Rao-Blackwellised Particle Filtering via Data Augmentation
topicId topicWeight
[(14, 0.05), (17, 0.055), (19, 0.042), (27, 0.107), (30, 0.062), (38, 0.038), (59, 0.041), (72, 0.08), (79, 0.074), (83, 0.011), (91, 0.118), (98, 0.257)]
simIndex simValue paperId paperTitle
same-paper 1 0.84580207 171 nips-2001-Spectral Relaxation for K-means Clustering
Author: Hongyuan Zha, Xiaofeng He, Chris Ding, Ming Gu, Horst D. Simon
Abstract: The popular K-means clustering partitions a data set by minimizing a sum-of-squares cost function. A coordinate descend method is then used to find local minima. In this paper we show that the minimization can be reformulated as a trace maximization problem associated with the Gram matrix of the data vectors. Furthermore, we show that a relaxed version of the trace maximization problem possesses global optimal solutions which can be obtained by computing a partial eigendecomposition of the Gram matrix, and the cluster assignment for each data vectors can be found by computing a pivoted QR decomposition of the eigenvector matrix. As a by-product we also derive a lower bound for the minimum of the sum-of-squares cost function. 1
2 0.7227304 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
Author: Marzia Polito, Pietro Perona
Abstract: Locally Linear Embedding (LLE) is an elegant nonlinear dimensionality-reduction technique recently introduced by Roweis and Saul [2]. It fails when the data is divided into separate groups. We study a variant of LLE that can simultaneously group the data and calculate local embedding of each group. An estimate for the upper bound on the intrinsic dimension of the data set is obtained automatically. 1
3 0.63446152 13 nips-2001-A Natural Policy Gradient
Author: Sham M. Kakade
Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1
4 0.62296969 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
Author: Mário Figueiredo
Abstract: In this paper we introduce a new sparseness inducing prior which does not involve any (hyper)parameters that need to be adjusted or estimated. Although other applications are possible, we focus here on supervised learning problems: regression and classification. Experiments with several publicly available benchmark data sets show that the proposed approach yields state-of-the-art performance. In particular, our method outperforms support vector machines and performs competitively with the best alternative techniques, both in terms of error rates and sparseness, although it involves no tuning or adjusting of sparsenesscontrolling hyper-parameters.
Author: Gregory Z. Grudic, Lyle H. Ungar
Abstract: We address two open theoretical questions in Policy Gradient Reinforcement Learning. The first concerns the efficacy of using function approximation to represent the state action value function, . Theory is presented showing that linear function approximation representations of can degrade the rate of convergence of performance gradient estimates by a factor of relative to when no function approximation of is used, where is the number of possible actions and is the number of basis functions in the function approximation representation. The second concerns the use of a bias term in estimating the state action value function. Theory is presented showing that a non-zero bias term can improve the rate of convergence of performance gradient estimates by , where is the number of possible actions. Experimental evidence is presented showing that these theoretical results lead to significant improvement in the convergence properties of Policy Gradient Reinforcement Learning algorithms.
6 0.61786735 95 nips-2001-Infinite Mixtures of Gaussian Process Experts
7 0.61767042 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes
8 0.61702174 132 nips-2001-Novel iteration schemes for the Cluster Variation Method
9 0.61679494 27 nips-2001-Activity Driven Adaptive Stochastic Resonance
10 0.61678457 36 nips-2001-Approximate Dynamic Programming via Linear Programming
11 0.6146909 121 nips-2001-Model-Free Least-Squares Policy Iteration
12 0.61110705 58 nips-2001-Covariance Kernels from Bayesian Generative Models
13 0.61031735 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data
14 0.60965049 89 nips-2001-Grouping with Bias
15 0.60897017 59 nips-2001-Direct value-approximation for factored MDPs
16 0.60756534 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines
17 0.60682356 162 nips-2001-Relative Density Nets: A New Way to Combine Backpropagation with HMM's
18 0.60638303 161 nips-2001-Reinforcement Learning with Long Short-Term Memory
19 0.60603243 155 nips-2001-Quantizing Density Estimators
20 0.60584581 56 nips-2001-Convolution Kernels for Natural Language