nips nips2001 nips2001-135 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Andrew Y. Ng, Michael I. Jordan, Yair Weiss
Abstract: Despite many empirical successes of spectral clustering methods (algorithms that cluster points using eigenvectors of matrices derived from the data), there are several unresolved issues. First, there are a wide variety of algorithms that use the eigenvectors in slightly different ways. Second, many of these algorithms have no proof that they will actually compute a reasonable clustering. In this paper, we present a simple spectral clustering algorithm that can be implemented using a few lines of Matlab. Using tools from matrix perturbation theory, we analyze the algorithm, and give conditions under which it can be expected to do well. We also show surprisingly good experimental results on a number of challenging clustering problems.
Reference: text
sentIndex sentText sentNum sentScore
1 On Spectral Clustering: Analysis and an algorithm Andrew Y. [sent-1, score-0.054]
2 Abstract Despite many empirical successes of spectral clustering methods (algorithms that cluster points using eigenvectors of matrices derived from the data), there are several unresolved issues. [sent-19, score-1.074]
3 First, there are a wide variety of algorithms that use the eigenvectors in slightly different ways. [sent-20, score-0.358]
4 In this paper, we present a simple spectral clustering algorithm that can be implemented using a few lines of Matlab. [sent-22, score-0.318]
5 Using tools from matrix perturbation theory, we analyze the algorithm, and give conditions under which it can be expected to do well. [sent-23, score-0.163]
6 We also show surprisingly good experimental results on a number of challenging clustering problems. [sent-24, score-0.132]
7 1 Introduction The task of finding good clusters has been the focus of considerable research in machine learning and pattern recognition. [sent-25, score-0.448]
8 For clustering points in R^n (a main application focus of this paper), one standard approach is based on generative models, in which algorithms such as EM are used to learn a mixture density. [sent-26, score-0.281]
9 Second, the log likelihood can have many local minima and therefore multiple restarts are required to find a good solution using iterative algorithms. [sent-31, score-0.127]
10 A promising alternative that has recently emerged in a number of fields is to use spectral methods for clustering. [sent-33, score-0.132]
11 Here, one uses the top eigenvectors of a matrix derived from the distance between points. [sent-34, score-0.392]
12 But despite their empirical successes, different authors still disagree on exactly which eigenvectors to use and how to derive clusters from them (see [11] for a review). [sent-36, score-0.81]
13 Also, the analysis of these algorithms, which we briefly review below, has tended to focus on simplified algorithms that only use one eigenvector at a time. [sent-37, score-0.13]
14 One line of analysis makes the link to spectral graph partitioning, in which the second eigenvector of a graph's Laplacian is used to define a semi-optimal cut. [sent-38, score-0.36]
15 Here, the eigenvector is seen as solving a relaxation of an NP-hard discrete graph partitioning problem [3], and it can be shown that cuts based on the second eigenvector give a guaranteed approximation to the optimal cut [9, 3]. [sent-39, score-0.407]
16 This analysis can be extended to clustering by building a weighted graph in which nodes correspond to datapoints and edges are related to the distance between the points. [sent-40, score-0.248]
17 Since the majority of analyses in spectral graph partitioning appear to deal with partitioning the graph into exactly two parts, these methods are then typically applied recursively to find k clusters (e. [sent-41, score-1.017]
18 Experimentally it has been observed that using more eigenvectors and directly computing a k-way partitioning is better (e. [sent-44, score-0.427]
19 Here, we build upon the recent work of Weiss [11] and Meila and Shi [6], who analyzed algorithms that use k eigenvectors simultaneously in simple settings. [sent-47, score-0.358]
20 We propose a particular manner to use the k eigenvectors simultaneously, and give conditions under which the algorithm can be expected to do well. [sent-48, score-0.407]
21 Given a set of points S = {s_1, ..., s_n} in R^l that we want to cluster into k subsets: 1. [sent-52, score-0.285]
22 Form the affinity matrix A ∈ R^{n×n} defined by A_ij = exp(-||s_i - s_j||² / (2σ²)) for i ≠ j, and A_ii = 0. [sent-53, score-0.248]
23 Define D to be the diagonal matrix whose (i, i)-element is the sum of A's i-th row, and construct the matrix L = D^{-1/2} A D^{-1/2}. [sent-55, score-0.186]
24 Find x_1, x_2, ..., x_k, the k largest eigenvectors of L (chosen to be orthogonal to each other in the case of repeated eigenvalues). [sent-60, score-0.491]
25 Form the matrix X = [x_1 x_2 ... x_k] ∈ R^{n×k} by stacking the eigenvectors in columns. [sent-63, score-0.461]
26 Form the matrix Y from X by renormalizing each of X's rows to have unit length (i.e., Y_ij = X_ij / (Σ_j X_ij²)^{1/2}). [sent-65, score-0.303]
27 Treating each row of Y as a point in R^k, cluster them into k clusters via K-means or any other algorithm (that attempts to minimize distortion). [sent-70, score-0.832]
28 Finally, assign the original point s_i to cluster j if and only if row i of the matrix Y was assigned to cluster j. [sent-72, score-0.723]
29 Here, the scaling parameter σ² controls how rapidly the affinity A_ij falls off with the distance between s_i and s_j, and we will later describe a method for choosing it automatically. [sent-73, score-0.091]
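The paper notes that the algorithm fits in a few lines of Matlab; the following is a minimal NumPy sketch of the six steps described above. The function names (spectral_cluster, kmeans), the plain Lloyd's iteration with random initialization, and the fixed iteration count are illustrative assumptions of this note, not code from the paper.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns a cluster label for each row of `points`."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def spectral_cluster(S, k, sigma):
    """Steps 1-6: affinity, normalized L, top-k eigenvectors, row renormalization, K-means."""
    sq_dists = ((S[:, None, :] - S[None, :, :]) ** 2).sum(axis=2)
    A = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)                              # A_ii = 0
    d = A.sum(axis=1)
    D_inv_sqrt = 1.0 / np.sqrt(d)
    L = D_inv_sqrt[:, None] * A * D_inv_sqrt[None, :]     # L = D^{-1/2} A D^{-1/2}
    _, eigvecs = np.linalg.eigh(L)                        # eigenvalues returned in ascending order
    X = eigvecs[:, -k:]                                   # k largest eigenvectors as columns
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)      # renormalize each row to unit length
    labels = kmeans(Y, k)                                 # cluster the rows of Y
    return labels, Y
```

Assigning the original point s_i to the cluster of row i then reproduces step 6; the sketch returns Y as well so that later snippets in these notes can reuse it.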
30 At first sight, this algorithm seems to make little sense. [sent-77, score-0.054]
31 The natural clusters in R² do not correspond to convex regions, and K-means run directly finds the unsatisfactory clustering in Figure 1i. [sent-80, score-0.717]
32 But once we map the points to R^k (Y's rows), they form tight clusters (Figure 1h) from which our method obtains the good clustering shown in Figure 1e. [sent-81, score-0.746]
33 We note that the clusters in Figure 1h lie at 90° to each other relative to the origin (cf. [sent-82, score-0.465]
34 Readers familiar with spectral graph theory [3] may be more familiar with the Laplacian I - L. [sent-84, score-0.291]
35 But as replacing L with I - L would complicate our later discussion, and only changes the eigenvalues (from λ_i to 1 - λ_i) and not the eigenvectors, we instead use L. [sent-85, score-0.129]
36 3 Analysis of algorithm 3.1 Informal discussion: The "ideal" case To understand the algorithm, it is instructive to consider its behavior in the "ideal" case in which all points in different clusters are infinitely far apart. [sent-87, score-0.626]
37 Moving the clusters "infinitely" far apart corresponds to zeroing all the elements A_ij corresponding to points s_i and s_j in different clusters. [sent-90, score-0.523]
38 Here, Â^(ii) = A^(ii) ∈ R^{n_i × n_i} is the matrix of "intra-cluster" affinities for cluster i. [sent-95, score-0.358]
39 For future use, also define d^(i) ∈ R^{n_i} to be the vector containing D^(ii)'s diagonal elements, and d ∈ R^n to contain D's diagonal elements. [sent-96, score-0.134]
40 Since L̂ is block diagonal, its eigenvalues and eigenvectors are the union of the eigenvalues and eigenvectors of its blocks (the latter padded appropriately with zeros). [sent-98, score-0.834]
41 It is straightforward to show that L̂^(ii) has a strictly positive principal eigenvector x_1^(i) ∈ R^{n_i} with eigenvalue 1. [sent-99, score-0.212]
42 Also, since A^(ii)_jk > 0 (j ≠ k), the next eigenvalue is strictly less than 1. [sent-100, score-0.121]
43 Thus, stacking L̂'s first three eigenvectors in columns, we obtain X̂ = [x_1^(1) 0 0; 0 x_1^(2) 0; 0 0 x_1^(3)] ∈ R^{n×3}, a block matrix with one principal eigenvector per cluster block. [sent-104, score-0.448]
44 Since 1 is a repeated eigenvalue of L̂, we could just as easily have picked any other 3 orthogonal vectors spanning the same subspace as X̂'s columns, and defined them to be our first 3 eigenvectors. [sent-106, score-0.269]
45 That is, X̂ could have been replaced by X̂R for any orthogonal matrix R ∈ R^{3×3} (R^T R = R R^T = I). [sent-107, score-0.172]
46 Instead, what we can reasonably hope to guarantee about the algorithm will be arrived at not by considering the (unstable) individual columns of X, but instead the subspace spanned by the columns of X, which can be considerably more stable. [sent-109, score-0.247]
47 Next, when we renormalize each of X̂'s rows to have unit length, we obtain Ŷ (equation (3)), where we use Y^(i) ∈ R^{n_i × k} to denote the i-th subblock of Ŷ. [sent-110, score-0.363]
48 Letting ŷ_j^(i) denote the j-th row of Y^(i), we therefore see that ŷ_j^(i) is the i-th row of the orthogonal matrix R. [sent-111, score-0.332]
49 Proposition 1 Let A's off-diagonal blocks A^(ij), i ≠ j, be zero. [sent-113, score-0.067]
50 Then there exist k orthogonal vectors r_1, ..., r_k (r_i^T r_j = 1 if i = j, 0 otherwise) so that Y's rows satisfy y_j^(i) = r_i (equation (4)) for all i = 1, ..., k and j = 1, ..., n_i. [sent-118, score-0.195]
51 In other words, there are k mutually orthogonal points on the surface of the unit k-sphere around which Y's rows will cluster. [sent-125, score-0.487]
52 Moreover, these clusters correspond exactly to the true clustering of the original data. [sent-126, score-0.625]
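As a quick numerical illustration of Proposition 1 (a hypothetical example, not from the paper), one can generate three widely separated blobs, run the sketch given earlier, and check that the rows of Y split into mutually orthogonal groups; the blob centers and the value of sigma are arbitrary choices.

```python
# Three well-separated 2-D blobs approximate the "ideal" block-diagonal case.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0]])
S = np.concatenate([rng.normal(loc=c, scale=0.1, size=(30, 2)) for c in centers])
labels, Y = spectral_cluster(S, k=3, sigma=0.3)
gram = Y @ Y.T        # entries near 1 for rows in the same cluster, near 0 across clusters
```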
53 3.2 The general case In the general case, A's off-diagonal blocks are non-zero, but we still hope to recover guarantees similar to Proposition 1. [sent-128, score-0.067]
54 Viewing E = A - Â as a perturbation to the "ideal" Â that results in A = Â + E, we ask: When can we expect the resulting rows of Y to cluster similarly to the rows of Ŷ? [sent-129, score-0.731]
55 Specifically, when will the eigenvectors of L, which we now view as a perturbed version of L̂, be "close" to those of L̂? [sent-130, score-0.319]
56 Matrix perturbation theory [10] indicates that the stability of the eigenvectors of a matrix is determined by the eigengap. [sent-131, score-0.448]
57 More precisely, the subspace spanned by L's first 3 eigenvectors will be stable to small changes to L if and only if the eigengap δ = |λ_3 - λ_4|, the difference between the 3rd and 4th eigenvalues of L, is large. [sent-132, score-0.596]
58 As discussed previously, the eigenvalues of L̂ are the union of the eigenvalues of L̂^(11), L̂^(22), and L̂^(33), and λ_3 = 1. [sent-133, score-0.223]
59 Letting λ_j^(i) be the j-th largest eigenvalue of L̂^(ii), we therefore see that λ_4 = max_i λ_2^(i). [sent-134, score-0.157]
60 Hence, the assumption that |λ_3 - λ_4| be large is exactly the assumption that max_i λ_2^(i) be bounded away from 1. [sent-135, score-0.293]
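A small helper (an assumption of these notes, not code from the paper) makes the quantity discussed above concrete: given the matrix L built in the earlier sketch and the number of clusters k, it returns the eigengap λ_k - λ_{k+1} (for the three-cluster example above, λ_3 - λ_4), which the analysis wants to be large.

```python
def eigengap(L, k):
    """Return lambda_k - lambda_{k+1}, with eigenvalues sorted in decreasing order."""
    eigvals = np.sort(np.linalg.eigvalsh(L))[::-1]
    return eigvals[k - 1] - eigvals[k]
```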
61 Note that λ_2^(i) depends only on L̂^(ii), which in turn depends only on Â^(ii) = A^(ii), the matrix of intra-cluster similarities for cluster S_i. The assumption on λ_2^(i) has a very natural interpretation in the context of clustering. [sent-141, score-0.481]
62 Informally, it captures the idea that if we want an algorithm to find the clusters S_1, S_2 and S_3, then we require that each of these sets S_i really look like a "tight" cluster. [sent-142, score-0.51]
63 Otherwise, if (say) S_1 itself looks like two well-separated clusters, then S_1 ∪ S_2 ∪ S_3 looks like (at least) four clusters, and it would be unreasonable to expect an algorithm to correctly guess what partition of the four clusters into three subsets we had in mind. [sent-150, score-0.57]
64 This connection between the eigengap and the cohesiveness of the individual clusters can be formalized in a number of ways. [sent-151, score-0.519]
65 Define the Cheeger constant [3] of the cluster S_i to be h(S_i) = min_I (Σ_{j∈I, k∉I} A_jk) / min(Σ_{j∈I} d_j^(i), Σ_{j∉I} d_j^(i)), where the minimum is over subsets I of S_i's points. [sent-153, score-0.285]
66 A standard result in spectral graph theory relates the Cheeger constant to the second eigenvalue of L̂^(ii), and shows that Assumption A1 holds whenever each h(S_i) is sufficiently large. [sent-167, score-0.215]
67 Recall that d_j^(i) = Σ_k A^(ii)_jk characterizes how "well connected" or how "similar" point j is to the other points in the same cluster. [sent-169, score-0.153]
68 The term in the min_I{·} characterizes how well (I, Ī) partitions S_i into two subsets, and the minimum over I picks out the best such partition. [sent-170, score-0.086]
69 Specifically, if there is a partition of S_i's points so that the weight of the edges across the partition is small, and so that each of the partitions has moderately large "volume" (sum of d_j^(i)'s), then the Cheeger constant will be small. [sent-171, score-0.288]
70 Thus, the assumption that the Cheeger constants h(S_i) be large is exactly that the clusters S_i be hard to split into two subsets. [sent-172, score-0.583]
71 We can also relate the eigengap to the mixing time of a random walk (as in [6]) defined on the points of a cluster, in which the chance of transitioning from point i to j is proportional to A_ij, so that we tend to jump to nearby points. [sent-173, score-0.378]
72 Assumption A1 is equivalent to assuming that, for such a walk defined on the points of any one of the clusters S_i, the corresponding transition matrix has second eigenvalue at most 1 - δ. [sent-174, score-0.81]
73 The mixing time of a random walk is governed by the second eigenvalue; thus, this assumption is exactly that the walks mix rapidly. [sent-175, score-0.29]
74 Intuitively, this will be true for tight (or at least fairly "well connected") clusters, and untrue if a cluster consists of two well-separated sets of points so that the random walk takes a long time to transition from one half of the cluster to the other. [sent-176, score-0.857]
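This per-cluster condition can be probed numerically with a sketch like the one below (the helper name, the boolean-mask interface, and the use of a dense eigensolver are illustrative assumptions): build the random-walk transition matrix restricted to one cluster's points and inspect its second-largest eigenvalue modulus, which the discussion above asks to be bounded away from 1.

```python
def second_walk_eigenvalue(A, in_cluster):
    """Second-largest eigenvalue modulus of the random walk restricted to one cluster.

    `A` is the full affinity matrix; `in_cluster` is a boolean mask over the points.
    """
    A_ii = A[np.ix_(in_cluster, in_cluster)]
    P = A_ii / A_ii.sum(axis=1, keepdims=True)      # row-stochastic transition matrix
    mags = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
    return mags[1]
```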
75 Assumption A1 can also be related to the existence of multiple paths between any two points in the same cluster. [sent-177, score-0.11]
76 Assumption A2 (equation (6)) requires that, for every pair of distinct clusters i_1, i_2 ∈ {1, ..., k}, the sum over j ∈ S_{i_1} and k ∈ S_{i_2} of the affinities A_jk, normalized by d_j d_k, be small. To gain intuition about this, consider the case of two "dense" clusters i_1 and i_2 of size O(n) each. [sent-183, score-0.447]
77 Since d_j measures how "connected" point j is to other points in the same cluster, it will be d_j = O(n) in this case, so the sum, which is over O(n²) terms, is in turn divided by d_j d_k = O(n²). [sent-184, score-0.32]
78 Thus, as long as the individual A_jk's are small, the sum will also be small, and the assumption will hold with small ε_1. [sent-185, score-0.123]
79 Whereas d_j measures how connected s_j ∈ S_i is to the rest of S_i, Σ_{k∉S_i} A_jk measures how connected s_j is to points in other clusters. [sent-186, score-0.443]
80 The next assumption is that all points must be more connected to points in the same cluster than to points in other clusters; specifically, that the ratio between these two quantities be small. [sent-187, score-0.852]
81 Assumption A3 requires that, for all i = 1, ..., k and j ∈ S_i, this ratio be small (equation (7)). For intuition about this assumption, again consider the case of densely connected clusters (as we did previously). [sent-192, score-0.527]
82 Here, the quantity in parentheses on the right hand side is O(1), so this becomes equivalent to demanding that the following ratio be small: (Σ_{k∉S_i} A_jk)/d_j = (Σ_{k∉S_i} A_jk)/(Σ_{k∈S_i} A_jk) = O(ε_2). [sent-193, score-0.056]
83 The last assumption, required for all i = 1, ..., k, is a fairly benign one: that no points in a cluster be "too much less" connected than other points in the same cluster. [sent-203, score-0.742]
84 Theorem 2 states that if δ > (2 + √2)ε, then there exist k orthogonal vectors r_1, ..., r_k [sent-206, score-0.099]
85 (r_i^T r_j = 1 if i = j, 0 otherwise) so that Y's rows satisfy equation (8). Thus, the rows of Y will form tight clusters around k well-separated points (at 90° from each other) on the surface of the k-sphere according to their "true" cluster S_i. [sent-209, score-1.389]
86 4 Experiments To test our algorithm, we applied it to seven clustering problems. [sent-210, score-0.132]
87 Note that whereas σ² was previously described as a human-specified parameter, the analysis also suggests a particularly simple way of choosing it automatically: For the right σ², Theorem 2 predicts that the rows of Y will form k "tight" clusters on the surface of the k-sphere. [sent-211, score-0.656]
88 Thus, we simply search over σ², and pick the value that, after clustering Y's rows, gives the tightest (smallest distortion) clusters. [sent-212, score-0.163]
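A hedged sketch of this search: try a grid of candidate values (the grid itself is an arbitrary choice, not specified in the text) and keep the sigma whose embedding Y yields the smallest K-means distortion, reusing the spectral_cluster helper from the earlier sketch in these notes.

```python
def choose_sigma(S, k, candidates):
    """Pick the sigma whose rows of Y cluster most tightly (smallest distortion)."""
    best_sigma, best_cost = None, np.inf
    for sigma in candidates:
        labels, Y = spectral_cluster(S, k, sigma)
        cost = 0.0
        for j in range(k):
            members = Y[labels == j]
            if len(members):
                cost += ((members - members.mean(axis=0)) ** 2).sum()
        if cost < best_cost:
            best_sigma, best_cost = sigma, cost
    return best_sigma

# Example usage with an arbitrary candidate grid:
# sigma = choose_sigma(S, k=3, candidates=np.linspace(0.1, 2.0, 20))
```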
89 K-means in Step 5 of the algorithm was also inexpensively initialized using the prior knowledge that the clusters are about 90° apart. [sent-213, score-0.467]
90 Giving the algorithm only the coordinates of the points and k, the different clusters found are shown in the Figure via the different symbols (and colors, where available). [sent-215, score-0.577]
91 The results are surprisingly good: Even for clusters that do not form convex regions or that are not cleanly separated (such as in Figure 1g), the algorithm reliably finds clusterings consistent with what a human would have chosen. [sent-216, score-0.67]
92 We note that there are other, related algorithms that can give good results on a subset of these problems, but we are aware of no equally simple algorithm that can give results comparable to these. [sent-217, score-0.161]
93 For example, we noted earlier how K-means easily fails when clusters do not correspond to convex regions (Figure 1i). [sent-218, score-0.516]
94 Another alternative may be a simple "connected components" algorithm that, for a threshold τ, draws an edge between points s_i and s_j whenever ||s_i - s_j||_2 ≤ τ, and takes the resulting connected components to be the clusters. [sent-219, score-0.278]
95 Here, τ is a parameter that can (say) be optimized to obtain the desired number of clusters k. [sent-220, score-0.413]
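For comparison, here is a minimal union-find sketch of that baseline; the helper name, the O(n²) double loop, and the choice of plain Euclidean distance are illustrative assumptions.

```python
def connected_component_clusters(S, tau):
    """Link points within distance tau and return connected components as cluster labels."""
    n = len(S)
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]       # path compression
            a = parent[a]
        return a

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(S[i] - S[j]) <= tau:
                parent[find(i)] = find(j)
    roots = np.array([find(i) for i in range(n)])
    _, labels = np.unique(roots, return_inverse=True)
    return labels
```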
96 Their method is similar to ours, except for the seemingly cosmetic difference that they normalize A's rows to sum to 1 and use its eigenvectors instead of L's, and do not renormalize the rows of X to unit length. [sent-226, score-0.806]
97 A refinement of our analysis suggests that this method might be susceptible to bad clusterings when the degree to which different clusters are connected (Σ_j d_j^(i)) varies substantially across clusters. [sent-227, score-0.589]
98 Briefly, we let the first cluster centroid be a randomly chosen row of Y, and then repeatedly choose as the next centroid the row of Y that is closest to being 90° from all the centroids (formally, from the worst-case centroid) already picked. [sent-228, score-0.593]
99 The resulting K-means was run only once (no restarts) to give the results presented. [sent-229, score-0.065]
100 K-means with the more conventional random initialization and a small number of restarts also gave identical results. [sent-230, score-0.084]
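A sketch of that centroid initialization, under the same assumptions as the earlier code (Y has unit-length rows, so the absolute dot product is the |cosine| of the angle between rows); the helper name is illustrative.

```python
def orthogonal_init(Y, k, seed=0):
    """Greedy seeding: start from a random row of Y, then repeatedly add the row
    most orthogonal to its worst-case (most aligned) already-chosen centroid."""
    rng = np.random.default_rng(seed)
    centroids = [Y[rng.integers(len(Y))]]
    for _ in range(k - 1):
        C = np.stack(centroids)
        worst_alignment = np.abs(Y @ C.T).max(axis=1)   # closeness to nearest chosen centroid
        centroids.append(Y[worst_alignment.argmin()])   # pick the row closest to 90 degrees away
    return np.stack(centroids)
```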
wordName wordTfidf (topN-words)
[('clusters', 0.413), ('eigenvectors', 0.319), ('cluster', 0.285), ('si', 0.259), ('rows', 0.195), ('jrni', 0.142), ('clustering', 0.132), ('spectral', 0.132), ('assumption', 0.123), ('ajk', 0.123), ('aij', 0.117), ('connected', 0.114), ('points', 0.11), ('partitioning', 0.108), ('cheeger', 0.106), ('eigengap', 0.106), ('dj', 0.105), ('orthogonal', 0.099), ('eigenvalues', 0.094), ('tight', 0.091), ('eigenvector', 0.091), ('sj', 0.09), ('eigenvalue', 0.086), ('walk', 0.086), ('restarts', 0.084), ('graph', 0.083), ('row', 0.08), ('ii', 0.077), ('shi', 0.074), ('centroid', 0.074), ('matrix', 0.073), ('xk', 0.071), ('lrii', 0.071), ('maxi', 0.071), ('stacking', 0.071), ('meila', 0.07), ('blocks', 0.067), ('successes', 0.062), ('dii', 0.062), ('clusterings', 0.062), ('renormalize', 0.062), ('columns', 0.058), ('perturbation', 0.056), ('affinity', 0.056), ('jk', 0.056), ('algorithm', 0.054), ('define', 0.054), ('cs', 0.053), ('lh', 0.052), ('dk', 0.052), ('rk', 0.052), ('ideal', 0.052), ('partition', 0.052), ('subsets', 0.051), ('infinitely', 0.049), ('laplacian', 0.049), ('specifically', 0.048), ('surface', 0.048), ('exactly', 0.047), ('distortion', 0.045), ('characterizes', 0.043), ('partitions', 0.043), ('find', 0.043), ('al', 0.042), ('defined', 0.042), ('subspace', 0.042), ('letting', 0.04), ('diagonal', 0.04), ('convex', 0.039), ('algorithms', 0.039), ('proposition', 0.039), ('weiss', 0.039), ('fl', 0.038), ('finds', 0.038), ('familiar', 0.038), ('berkeley', 0.036), ('union', 0.035), ('spanned', 0.035), ('later', 0.035), ('unit', 0.035), ('strictly', 0.035), ('considerable', 0.035), ('give', 0.034), ('mixing', 0.034), ('sl', 0.034), ('il', 0.034), ('separated', 0.033), ('correspond', 0.033), ('regions', 0.031), ('run', 0.031), ('tightest', 0.031), ('lei', 0.031), ('disagree', 0.031), ('moderately', 0.031), ('ang', 0.031), ('yij', 0.031), ('subtlety', 0.031), ('mmi', 0.031), ('unsatisfactory', 0.031), ('caution', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 135 nips-2001-On Spectral Clustering: Analysis and an algorithm
Author: Andrew Y. Ng, Michael I. Jordan, Yair Weiss
Abstract: Despite many empirical successes of spectral clustering methods (algorithms that cluster points using eigenvectors of matrices derived from the data), there are several unresolved issues. First, there are a wide variety of algorithms that use the eigenvectors in slightly different ways. Second, many of these algorithms have no proof that they will actually compute a reasonable clustering. In this paper, we present a simple spectral clustering algorithm that can be implemented using a few lines of Matlab. Using tools from matrix perturbation theory, we analyze the algorithm, and give conditions under which it can be expected to do well. We also show surprisingly good experimental results on a number of challenging clustering problems. 1
2 0.30701569 171 nips-2001-Spectral Relaxation for K-means Clustering
Author: Hongyuan Zha, Xiaofeng He, Chris Ding, Ming Gu, Horst D. Simon
Abstract: The popular K-means clustering partitions a data set by minimizing a sum-of-squares cost function. A coordinate descend method is then used to find local minima. In this paper we show that the minimization can be reformulated as a trace maximization problem associated with the Gram matrix of the data vectors. Furthermore, we show that a relaxed version of the trace maximization problem possesses global optimal solutions which can be obtained by computing a partial eigendecomposition of the Gram matrix, and the cluster assignment for each data vectors can be found by computing a pivoted QR decomposition of the eigenvector matrix. As a by-product we also derive a lower bound for the minimum of the sum-of-squares cost function. 1
3 0.2295994 132 nips-2001-Novel iteration schemes for the Cluster Variation Method
Author: Hilbert J. Kappen, Wim Wiegerinck
Abstract: The Cluster Variation method is a class of approximation methods containing the Bethe and Kikuchi approximations as special cases. We derive two novel iteration schemes for the Cluster Variation Method. One is a fixed point iteration scheme which gives a significant improvement over loopy BP, mean field and TAP methods on directed graphical models. The other is a gradient based method, that is guaranteed to converge and is shown to give useful results on random graphs with mild frustration. We conclude that the methods are of significant practical value for large inference problems. 1
4 0.18506859 24 nips-2001-Active Information Retrieval
Author: Tommi Jaakkola, Hava T. Siegelmann
Abstract: In classical large information retrieval systems, the system responds to a user initiated query with a list of results ranked by relevance. The users may further refine their query as needed. This process may result in a lengthy correspondence without conclusion. We propose an alternative active learning approach, where the system responds to the initial user's query by successively probing the user for distinctions at multiple levels of abstraction. The system's initiated queries are optimized for speedy recovery and the user is permitted to respond with multiple selections or may reject the query. The information is in each case unambiguously incorporated by the system and the subsequent queries are adjusted to minimize the need for further exchange. The system's initiated queries are subject to resource constraints pertaining to the amount of information that can be presented to the user per iteration. 1
5 0.18333249 170 nips-2001-Spectral Kernel Methods for Clustering
Author: Nello Cristianini, John Shawe-Taylor, Jaz S. Kandola
Abstract: In this paper we introduce new algorithms for unsupervised learning based on the use of a kernel matrix. All the information required by such algorithms is contained in the eigenvectors of the matrix or of closely related matrices. We use two different but related cost functions, the Alignment and the 'cut cost'. The first one is discussed in a companion paper [3], the second one is based on graph theoretic concepts. Both functions measure the level of clustering of a labeled dataset, or the correlation between data clusters and labels. We state the problem of unsupervised learning as assigning labels so as to optimize these cost functions. We show how the optimal solution can be approximated by slightly relaxing the corresponding optimization problem, and how this corresponds to using eigenvector information. The resulting simple algorithms are tested on real world data with positive results. 1
6 0.17467719 136 nips-2001-On the Concentration of Spectral Properties
7 0.16463836 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
8 0.14452419 185 nips-2001-The Method of Quantum Clustering
9 0.13835773 106 nips-2001-Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering
10 0.13525736 35 nips-2001-Analysis of Sparse Bayesian Learning
11 0.13319279 164 nips-2001-Sampling Techniques for Kernel Methods
12 0.12590016 89 nips-2001-Grouping with Bias
13 0.10389309 100 nips-2001-Iterative Double Clustering for Unsupervised and Semi-Supervised Learning
14 0.10183514 127 nips-2001-Multi Dimensional ICA to Separate Correlated Sources
15 0.096430153 58 nips-2001-Covariance Kernels from Bayesian Generative Models
16 0.090887517 53 nips-2001-Constructing Distributed Representations Using Additive Clustering
17 0.089142807 144 nips-2001-Partially labeled classification with Markov random walks
18 0.082392596 30 nips-2001-Agglomerative Multivariate Information Bottleneck
19 0.082013227 44 nips-2001-Blind Source Separation via Multinode Sparse Representation
20 0.080192253 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family
topicId topicWeight
[(0, -0.237), (1, 0.072), (2, 0.0), (3, -0.338), (4, 0.206), (5, -0.1), (6, -0.139), (7, -0.008), (8, 0.014), (9, -0.065), (10, 0.227), (11, -0.015), (12, -0.132), (13, 0.22), (14, -0.035), (15, -0.033), (16, -0.038), (17, -0.142), (18, 0.059), (19, -0.1), (20, 0.048), (21, -0.132), (22, 0.129), (23, 0.04), (24, 0.116), (25, 0.056), (26, -0.023), (27, 0.02), (28, -0.082), (29, -0.105), (30, 0.016), (31, -0.012), (32, 0.1), (33, -0.082), (34, 0.012), (35, 0.016), (36, 0.02), (37, -0.048), (38, -0.046), (39, 0.014), (40, -0.018), (41, 0.065), (42, -0.032), (43, -0.015), (44, 0.026), (45, 0.078), (46, -0.029), (47, -0.018), (48, -0.047), (49, 0.01)]
simIndex simValue paperId paperTitle
same-paper 1 0.98287565 135 nips-2001-On Spectral Clustering: Analysis and an algorithm
Author: Andrew Y. Ng, Michael I. Jordan, Yair Weiss
Abstract: Despite many empirical successes of spectral clustering methods (algorithms that cluster points using eigenvectors of matrices derived from the data), there are several unresolved issues. First, there are a wide variety of algorithms that use the eigenvectors in slightly different ways. Second, many of these algorithms have no proof that they will actually compute a reasonable clustering. In this paper, we present a simple spectral clustering algorithm that can be implemented using a few lines of Matlab. Using tools from matrix perturbation theory, we analyze the algorithm, and give conditions under which it can be expected to do well. We also show surprisingly good experimental results on a number of challenging clustering problems. 1
2 0.89939386 171 nips-2001-Spectral Relaxation for K-means Clustering
Author: Hongyuan Zha, Xiaofeng He, Chris Ding, Ming Gu, Horst D. Simon
Abstract: The popular K-means clustering partitions a data set by minimizing a sum-of-squares cost function. A coordinate descend method is then used to find local minima. In this paper we show that the minimization can be reformulated as a trace maximization problem associated with the Gram matrix of the data vectors. Furthermore, we show that a relaxed version of the trace maximization problem possesses global optimal solutions which can be obtained by computing a partial eigendecomposition of the Gram matrix, and the cluster assignment for each data vectors can be found by computing a pivoted QR decomposition of the eigenvector matrix. As a by-product we also derive a lower bound for the minimum of the sum-of-squares cost function. 1
3 0.6109609 185 nips-2001-The Method of Quantum Clustering
Author: David Horn, Assaf Gottlieb
Abstract: We propose a novel clustering method that is an extension of ideas inherent to scale-space clustering and support-vector clustering. Like the latter, it associates every data point with a vector in Hilbert space, and like the former it puts emphasis on their total sum, that is equal to the scale-space probability function. The novelty of our approach is the study of an operator in Hilbert space, represented by the Schrödinger equation of which the probability function is a solution. This Schrödinger equation contains a potential function that can be derived analytically from the probability function. We associate minima of the potential with cluster centers. The method has one variable parameter, the scale of its Gaussian kernel. We demonstrate its applicability on known data sets. By limiting the evaluation of the Schrödinger potential to the locations of data points, we can apply this method to problems in high dimensions.
4 0.56501061 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
Author: Marzia Polito, Pietro Perona
Abstract: Locally Linear Embedding (LLE) is an elegant nonlinear dimensionality-reduction technique recently introduced by Roweis and Saul [2]. It fails when the data is divided into separate groups. We study a variant of LLE that can simultaneously group the data and calculate local embedding of each group. An estimate for the upper bound on the intrinsic dimension of the data set is obtained automatically. 1
5 0.5639782 100 nips-2001-Iterative Double Clustering for Unsupervised and Semi-Supervised Learning
Author: Ran El-Yaniv, Oren Souroujon
Abstract: We present a powerful meta-clustering technique called Iterative Double Clustering (IDC). The IDC method is a natural extension of the recent Double Clustering (DC) method of Slonim and Tishby that exhibited impressive performance on text categorization tasks [12]. Using synthetically generated data we empirically find that whenever the DC procedure is successful in recovering some of the structure hidden in the data, the extended IDC procedure can incrementally compute a significantly more accurate classification. IDC is especially advantageous when the data exhibits high attribute noise. Our simulation results also show the effectiveness of IDC in text categorization problems. Surprisingly, this unsupervised procedure can be competitive with a (supervised) SVM trained with a small training set. Finally, we propose a simple and natural extension of IDC for semi-supervised and transductive learning where we are given both labeled and unlabeled examples. 1
6 0.54645097 132 nips-2001-Novel iteration schemes for the Cluster Variation Method
7 0.53818172 53 nips-2001-Constructing Distributed Representations Using Additive Clustering
8 0.5115236 136 nips-2001-On the Concentration of Spectral Properties
9 0.51014215 106 nips-2001-Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering
10 0.47192425 24 nips-2001-Active Information Retrieval
11 0.45830747 35 nips-2001-Analysis of Sparse Bayesian Learning
12 0.41457391 44 nips-2001-Blind Source Separation via Multinode Sparse Representation
13 0.3787787 170 nips-2001-Spectral Kernel Methods for Clustering
14 0.37231559 30 nips-2001-Agglomerative Multivariate Information Bottleneck
15 0.37011823 89 nips-2001-Grouping with Bias
16 0.33727071 120 nips-2001-Minimax Probability Machine
17 0.33311608 154 nips-2001-Products of Gaussians
18 0.32980448 94 nips-2001-Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM
19 0.32244352 164 nips-2001-Sampling Techniques for Kernel Methods
20 0.31534404 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family
topicId topicWeight
[(14, 0.042), (17, 0.09), (19, 0.016), (20, 0.02), (27, 0.166), (30, 0.059), (38, 0.02), (53, 0.137), (59, 0.037), (72, 0.097), (79, 0.054), (83, 0.017), (91, 0.185)]
simIndex simValue paperId paperTitle
same-paper 1 0.93404913 135 nips-2001-On Spectral Clustering: Analysis and an algorithm
Author: Andrew Y. Ng, Michael I. Jordan, Yair Weiss
Abstract: Despite many empirical successes of spectral clustering methods (algorithms that cluster points using eigenvectors of matrices derived from the data), there are several unresolved issues. First, there are a wide variety of algorithms that use the eigenvectors in slightly different ways. Second, many of these algorithms have no proof that they will actually compute a reasonable clustering. In this paper, we present a simple spectral clustering algorithm that can be implemented using a few lines of Matlab. Using tools from matrix perturbation theory, we analyze the algorithm, and give conditions under which it can be expected to do well. We also show surprisingly good experimental results on a number of challenging clustering problems. 1
2 0.8851099 13 nips-2001-A Natural Policy Gradient
Author: Sham M. Kakade
Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1
3 0.88377702 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
Author: Marzia Polito, Pietro Perona
Abstract: Locally Linear Embedding (LLE) is an elegant nonlinear dimensionality-reduction technique recently introduced by Roweis and Saul [2]. It fails when the data is divided into separate groups. We study a variant of LLE that can simultaneously group the data and calculate local embedding of each group. An estimate for the upper bound on the intrinsic dimension of the data set is obtained automatically. 1
Author: Gregory Z. Grudic, Lyle H. Ungar
Abstract: We address two open theoretical questions in Policy Gradient Reinforcement Learning. The first concerns the efficacy of using function approximation to represent the state action value function, . Theory is presented showing that linear function approximation representations of can degrade the rate of convergence of performance gradient estimates by a factor of relative to when no function approximation of is used, where is the number of possible actions and is the number of basis functions in the function approximation representation. The second concerns the use of a bias term in estimating the state action value function. Theory is presented showing that a non-zero bias term can improve the rate of convergence of performance gradient estimates by , where is the number of possible actions. Experimental evidence is presented showing that these theoretical results lead to significant improvement in the convergence properties of Policy Gradient Reinforcement Learning algorithms.
5 0.87363827 59 nips-2001-Direct value-approximation for factored MDPs
Author: Dale Schuurmans, Relu Patrascu
Abstract: We present a simple approach for computing reasonable policies for factored Markov decision processes (MDPs), when the optimal value function can be approximated by a compact linear form. Our method is based on solving a single linear program that approximates the best linear fit to the optimal value function. By applying an efficient constraint generation procedure we obtain an iterative solution method that tackles concise linear programs. This direct linear programming approach experimentally yields a significant reduction in computation time over approximate value- and policy-iteration methods (sometimes reducing several hours to a few seconds). However, the quality of the solutions produced by linear programming is weaker-usually about twice the approximation error for the same approximating class. Nevertheless, the speed advantage allows one to use larger approximation classes to achieve similar error in reasonable time. 1
6 0.87241244 132 nips-2001-Novel iteration schemes for the Cluster Variation Method
7 0.87127924 121 nips-2001-Model-Free Least-Squares Policy Iteration
8 0.87097025 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data
9 0.86996126 36 nips-2001-Approximate Dynamic Programming via Linear Programming
10 0.86971223 127 nips-2001-Multi Dimensional ICA to Separate Correlated Sources
11 0.86937559 89 nips-2001-Grouping with Bias
12 0.8681829 44 nips-2001-Blind Source Separation via Multinode Sparse Representation
13 0.86348832 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning
14 0.86060464 143 nips-2001-PAC Generalization Bounds for Co-training
15 0.86018825 95 nips-2001-Infinite Mixtures of Gaussian Process Experts
16 0.85813081 58 nips-2001-Covariance Kernels from Bayesian Generative Models
17 0.85579318 61 nips-2001-Distribution of Mutual Information
18 0.85474157 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes
19 0.85043138 84 nips-2001-Global Coordination of Local Linear Models
20 0.85020602 8 nips-2001-A General Greedy Approximation Algorithm with Applications