nips nips2004 nips2004-103 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ulrike V. Luxburg, Olivier Bousquet, Mikhail Belkin
Abstract: An important aspect of clustering algorithms is whether the partitions constructed on finite samples converge to a useful clustering of the whole data space as the sample size increases. This paper investigates this question for normalized and unnormalized versions of the popular spectral clustering algorithm. Surprisingly, the convergence of unnormalized spectral clustering is more difficult to handle than the normalized case. Even though recently some first results on the convergence of normalized spectral clustering have been obtained, for the unnormalized case we have to develop a completely new approach combining tools from numerical integration, spectral and perturbation theory, and probability. It turns out that while in the normalized case, spectral clustering usually converges to a nice partition of the data space, in the unnormalized case the same only holds under strong additional assumptions which are not always satisfied. We conclude that our analysis gives strong evidence for the superiority of normalized spectral clustering. It also provides a basis for future exploration of other Laplacian-based methods. 1
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract An important aspect of clustering algorithms is whether the partitions constructed on finite samples converge to a useful clustering of the whole data space as the sample size increases. [sent-9, score-0.738]
2 This paper investigates this question for normalized and unnormalized versions of the popular spectral clustering algorithm. [sent-10, score-1.003]
3 Surprisingly, the convergence of unnormalized spectral clustering is more difficult to handle than the normalized case. [sent-11, score-1.124]
4 Even though recently some first results on the convergence of normalized spectral clustering have been obtained, for the unnormalized case we have to develop a completely new approach combining tools from numerical integration, spectral and perturbation theory, and probability. [sent-12, score-1.505]
5 It turns out that while in the normalized case, spectral clustering usually converges to a nice partition of the data space, in the unnormalized case the same only holds under strong additional assumptions which are not always satisfied. [sent-13, score-1.415]
6 We conclude that our analysis gives strong evidence for the superiority of normalized spectral clustering. [sent-14, score-0.445]
7 In this paper we study the consistency of spectral clustering, which is an important property in the general framework of statistical pattern recognition. [sent-19, score-0.359]
8 Surprisingly, relatively little research into consistency of clustering algorithms has been done so far, exceptions being only k-centers (Pollard, 1981) and linkage algorithms (Hartigan, 1985). [sent-22, score-0.295]
9 While finite-sample properties of spectral clustering have been studied from a theoretical point of view (Spielman and Teng, 1996; Guattery and Miller, 1998; Kannan et al. [sent-23, score-0.552]
10 In this paper we develop a new strategy to prove convergence results for spectral clustering algorithms. [sent-26, score-0.743]
11 Unlike our previous attempts this strategy allows to obtain results for both normalized and unnormalized spectral clustering. [sent-27, score-0.734]
12 (2004), which had been proved with different and more restrictive methods, and, in brief, states that usually normalized spectral clustering converges. [sent-29, score-0.652]
13 Our second result concerns the case of unnormalized spectral clustering, for which no convergence properties had been known so far. [sent-31, score-0.78]
14 This case is much more difficult to treat than the normalized case, as the limit operators have a more complicated form. [sent-32, score-0.547]
15 We show that unnormalized spectral clustering also converges, but only under strong additional assumptions. [sent-33, score-0.915]
16 It turns out that in case of convergence, the structure of the clustering constructed on finite samples is conserved in the limit process. [sent-37, score-0.447]
17 From this we can conclude that if convergence takes place, then the limit clustering presents an intuitively appealing partition of the data space. [sent-38, score-0.699]
18 There is an ongoing debate on the advantages of the normalized versus unnormalized graph Laplacians for spectral clustering. [sent-44, score-0.734]
19 It has been found empirically that the normalized version performs as well or better than the unnormalized version (e. [sent-45, score-0.426]
20 Normalized spectral clustering is a well-behaved algorithm which always converges to a sensible limit clustering. [sent-50, score-0.845]
21 Unnormalized spectral clustering on the other hand should be treated with care as consistency can only be asserted under strong assumptions which are not always satisfied and, moreover, are difficult to check in practice. [sent-51, score-0.752]
22 2 Graph Laplacians and spectral clustering on finite samples In the following we denote by σ(T ) the spectrum of a linear operator, by C(X ) the space of continuous functions on X with infinity norm, and by rg(d) the range of a function d ∈ C(X ). [sent-52, score-0.667]
23 The unnormalized Laplacian matrix is defined −1/2 −1/2 as Ln := Dn − Kn , and two common ways of normalizing it are L′ := Dn Ln Dn n ′′ −1 or Ln := Dn Ln . [sent-61, score-0.326]
24 normalized) spectral clustering partitions the sample points Xi into two groups according to whether the i-th coordinate of the second eigenvector is larger or smaller than a certain threshold b ∈ I Often, instead of considering R. [sent-67, score-0.677]
25 only the second eigenvector, one uses the first r eigenvectors (for some small number r) simultaneously to obtain a partition into several sets. [sent-68, score-0.272]
26 For an overview of different spectral clustering algorithms see for example Weiss (1999). [sent-69, score-0.552]
27 Let us start with the first question raised in the introduction: does the spectral clustering constructed on a finite sample converge to a partition of the whole data space if the sample size increases? [sent-75, score-0.971]
28 In the normalized case, convergence results have recently been obtained in von Luxburg et al. [sent-76, score-0.319]
29 However, those methods were specifically designed for the normalized Laplacian and cannot be used in the unnormalized case. [sent-78, score-0.426]
30 n Theorem 1 (Convergence of normalized spectral clustering) Under the general assumptions, if the first r eigenvalues of the limit operator U ′ have multiplicity 1, then the same holds for the first r eigenvalues of L′ for sufficiently large n. [sent-81, score-1.241]
31 In this case, the first r n eigenvalues of L′ converge to the first r eigenvalues of U ′ , and the corresponding eigenvecn tors converge almost surely. [sent-82, score-0.512]
32 The partitions constructed by normalized spectral clustering from the first r eigenvectors on finite samples converge almost surely to a limit partition of the whole data space. [sent-83, score-1.316]
33 In this case, the first r eigenvalues of n Ln converge to the first r eigenvalues of U , and the the corresponding eigenvectors converge almost surely. [sent-85, score-0.591]
34 The partitions constructed by unnormalized spectral clustering from the first r eigenvectors on finite samples converge almost surely to a limit partition of the whole data space. [sent-86, score-1.542]
35 On the first glance, this theorem looks very similar to Theorem 1: if the general assumptions are satisfied and the first eigenvalues are “nice”, then unnormalized spectral clustering converges. [sent-87, score-1.148]
36 In Theorem 1 we only require the eigenvalues to have multiplicity 1 (and in fact, if the multiplicity is larger than 1 we can still prove convergence of eigenspaces instead of eigenvectors). [sent-89, score-0.674]
37 In the proof this is needed to ensure that the eigenvalue λ is isolated in the spectrum of U , which is a fundamental requirement for applying perturbation theory to the convergence of eigenvectors. [sent-91, score-0.561]
38 The reason why this condition appears in the unnormalized case but not in the normalized case lies in the structure of the respective limit operators, which, surprisingly, is more complicated in the unnormalized case than in the normalized one. [sent-93, score-1.0]
39 This means that there actually exist situations in which Theorem 2 cannot be applied, and hence unnormalized spectral clustering might not converge. [sent-95, score-0.878]
40 Now we want to turn to the second question raised in the introduction: In case of convergence, is the limit clustering a reasonable clustering of the whole data space? [sent-96, score-0.721]
41 To answer this question we analyze the structure of the limit operators (for simplicity we state this for the unnormalized case only). [sent-97, score-0.798]
42 In a similar way, the limit operator U can be decomposed into a matrix of operators Uij : C(Xj ) → C(Xi ). [sent-103, score-0.618]
43 This is a very strong result as it means that for every given partition of X , the structure of the operators is preserved in the limit process. [sent-108, score-0.645]
44 Theorem 3 (Structure of the limit operators) Let X = ∪k Xi be a partition of the data i=1 space. [sent-109, score-0.309]
45 Then under the general assumptions, n Lij,n converges ′ compactly to Uij a. [sent-111, score-0.257]
46 In Meila and Shi (2001) it has been established that normalized spectral clustering tries to find a partition such that a random walk on the sample points tends to stay within each of the partition sets Xi instead of jumping between them. [sent-116, score-1.038]
47 With the help of Theorem 3, the same can now be said for the normalized limit partition, and this can also be extended to the unnormalized case. [sent-117, score-0.574]
48 The operators U ′ and U can be interpreted as diffusion operators on the data space. [sent-118, score-0.65]
49 The limit clusterings try to find a partition such that the diffusion tends to stay within the sets Xi instead of jumping between them. [sent-119, score-0.42]
50 In particular, the limit partition segments the data space into sets such that the similarity within the sets is high and the similarity between the sets is low, which intuitively is what clustering is supposed to do. [sent-120, score-0.637]
51 To be able to define convergence of linear operators, all operators have to act on the same space. [sent-125, score-0.445]
52 In 1 Step 2 we show that the interesting eigenvalues and eigenvectors of n Ln and Un are in a one-to-one relationship. [sent-127, score-0.273]
53 Then we will prove that the Un converge in a strong sense to some limit operator U on C(X ) in Step 3. [sent-128, score-0.479]
54 As we can show in Step 4, this convergence implies the convergence of eigenvalues and eigenvectors of Un . [sent-129, score-0.565]
55 Step 1: Construction of the operators Un on C(X ). [sent-131, score-0.299]
56 Corresponding to the matrices Dn and Kn we introduce the following multiplication and integral operators on C(X ): Mdn f (x) := dn (x)f (x) and Md f (x) := d(x)f (x) Sn f (x) := and Sf (x) := s(x, y)f (y)dPn (y) s(x, y)f (y)dP (y). [sent-133, score-0.491]
57 Hence the function dn and the 1 operators Mdn and Sn are the counterparts of the discrete degrees n di and the matrices 1 1 n Dn and n Kn . [sent-138, score-0.512]
58 The operators corresponding to the unnormalized Laplacians n Ln = 1 n (Dn − Kn ) and its limit operator are Un f (x) := Mdn f (x) − Sn f (x) and U f (x) := Md f (x) − Sf (x). [sent-141, score-0.944]
59 The spectrum of Un consists of rg(dn ), plus some isolated eigenvalues with finite multiplicity. [sent-144, score-0.347]
60 If f ∈ C(X ) is an eigenfunction of Un with arbitrary eigenvalue λ, then the vector 1 v ∈ I n with vi = f (Xi ) is an eigenvector of the matrix n Ln with eigenvalue λ. [sent-147, score-0.356]
61 If v is an eigenvector of the matrix n Ln with eigenvalue λ ∈ rg(dn ), then the 1 function f (x) = n ( j s(x, Xj )vj )/(dn (x) − λ) is the unique eigenfunction of Un with eigenvalue λ satisfying f (Xi ) = vi . [sent-149, score-0.356]
62 It is well-known that the (essential) spectrum of a multiplication operator coincides with the range of the multiplier function. [sent-151, score-0.295]
63 Moreover, the spectrum of a sum of a bounded operator with a compact operator contains the essential spectrum of the bounded operator. [sent-152, score-0.61]
64 Additionally, it can only contain some isolated eigenvalues with finite multiplicity (e. [sent-153, score-0.402]
65 The proofs of the other parts of this proposition can be obtained by elementary shuffling of eigenvalue equations and will be skipped. [sent-158, score-0.258]
66 Recall that the operators Un are random operators as they depend on the given sample points X1 , . [sent-161, score-0.63]
67 A sequence (Sn )n of linear operators on E is called collectively compact if the set n Sn B is relatively compact in E (with respect to the norm topology). [sent-178, score-0.605]
68 A sequence of operators converges collectively compactly if it converges pointwise and if there exists some N ∈ I such that N the operators (Sn − S)n>N are collectively compact. [sent-179, score-1.293]
69 A sequence of operators converges compactly if it converges pointwise and if for every sequence (xn )n in B, the sequence (S−Sn )xn is relatively compact. [sent-180, score-0.819]
70 A sequence (xn )n in E converges up to a change of sign to x ∈ E if there exists a sequence (an )n of signs an ∈ {−1, +1} such that the sequence (an xn )n converges to x. [sent-182, score-0.405]
71 As the limit operator S is compact, to prove that (Sn − S)n are collectively compact a. [sent-194, score-0.553]
72 Both operator norm convergence and collectively compact convergence imply compact convergence (cf. [sent-204, score-0.884]
73 Moreover, it is easy to see that the sum of two compactly converging operators converges compactly. [sent-207, score-0.556]
74 1) that compact convergence of operators implies the convergence of eigenvalues and spectral projections in the following way. [sent-210, score-1.147]
75 If λ is an isolated eigenvalue in σ(U ) with finite multiplicity, then there exists a sequence λn ∈ σ(Un ) of isolated eigenvalues with finite multiplicity such that λn → λ. [sent-211, score-0.685]
76 If the first r eigenvalues of T have multiplicity 1, then the same holds for the first r eigenvalues of Tn for sufficiently large n, and the i-th eigenvalues of Tn converge to the i-th eigenvalue of T . [sent-212, score-0.881]
77 If the multiplicity of the eigenvalues is larger than 1 but finite, then the corresponding eigenspaces converge. [sent-214, score-0.337]
78 Note that for eigenvalues which are not isolated in the spectrum, convergence cannot be asserted, and the same holds for the corresponding eigenvectors (e. [sent-215, score-0.557]
79 Hence an eigenvalue λ ∈ σ(U ) is isolated in the spectrum iff λ ∈ rg(d) holds, in which case convergence holds as stated above. [sent-220, score-0.525]
80 In the first two steps we transferred the 1 problem of the convergence of the eigenvectors of n Ln to the convergence of eigenfunctions of Un . [sent-223, score-0.437]
81 In Step 3 we showed that Un converges compactly to the limit operator U , which according to Step 4 implies the convergence of the eigenfunctions of Un . [sent-224, score-0.756]
82 In terms 1 of the eigenvectors of n Ln this means the following: if λ denotes the j-th eigenvalue 1 of U with eigenfunction f ∈ C(X ) and λn the j-th eigenvalue of n Ln with eigenvector vn = (vn,1 , . [sent-225, score-0.467]
83 As spectral clustering is constructed from the coordinates of the eigenvectors, this leads to the convergence of spectral clustering in the unnormalized case. [sent-233, score-1.607]
84 Here the limit operator is U ′ f (x) := (I − S ′ )f (x) := f (x) − (s(x, y)/ d(x)d(y) )f (y)dP (y). [sent-236, score-0.319]
85 The main difference to the unnormalized case is that the operator Md in U gets replaced by the identity operator I in U ′ . [sent-237, score-0.668]
86 This simplifies matters as one can easily express the spectrum of (I − S ′ ) via the spectrum of the compact operator S ′ . [sent-238, score-0.439]
87 From a different point of view, consider the identity operator as the operator of multiplication by the constant one function 1. [sent-239, score-0.375]
88 Finally, note that it is also possible to prove more general versions of Theorems 1 and 2 where the eigenvalues have finite multiplicity larger than 1. [sent-241, score-0.353]
89 Instead of the convergence of the eigenvectors we then obtain the convergence of the projections on the eigenspaces. [sent-242, score-0.403]
90 In this situation, spectral clustering recovers the ideal clustering. [sent-246, score-0.552]
91 If there exists no ideal clustering, but there exists a partition such that the off-diagonal operators are “small” and the diagonal operators are “large”, then it can be seen by perturbation theory arguments that spectral clustering will find such a partition. [sent-247, score-1.47]
92 The off-diagonal operators can be interpreted as diffusion operators between different clusters (note that even in the unnormalized case, the multiplication operator only appears in the diagonal operators). [sent-248, score-1.204]
93 Hence, constructing a clustering with small off-diagonal operators corresponds to a partition such that few diffusion between the clusters takes place. [sent-249, score-0.756]
94 Note that in applications of spectral clustering, we do not know the limit operator U and hence cannot test whether its relevant eigenvalues are in its essential spectrum rg(d) or not. [sent-262, score-0.88]
95 If, for some special reason, one really wants to use unnormalized spectral clustering, one should at least estimate the critical region rg(d) by [mini di /n, maxi di /n] and check 1 whether the relevant eigenvalues of n Ln are inside or close to this interval or not. [sent-263, score-0.936]
96 5 Conclusions We have shown that under standard assumptions, normalized spectral clustering always converges to a limit partition of the whole data space which depends only on the probability distribution P and the similarity function s. [sent-266, score-1.16]
97 For unnormalized spectral clustering, this can only be guaranteed under the strong additional assumption that the first eigenvalues of the Laplacian do not fall inside the range of the degree function. [sent-267, score-0.857]
98 Therefore, in the light of our theoretical analysis we assert that the normalized version of spectral clustering should be preferred in practice. [sent-271, score-0.652]
99 An improved spectral bisection algorithm and its application to dynamic load balancing. [sent-346, score-0.308]
100 On the convergence of spectral clustering on random samples: the normalized case. [sent-361, score-0.798]
wordName wordTfidf (topN-words)
[('unnormalized', 0.326), ('spectral', 0.308), ('operators', 0.299), ('clustering', 0.244), ('rg', 0.232), ('un', 0.217), ('operator', 0.171), ('eigenvalues', 0.162), ('partition', 0.161), ('dn', 0.159), ('uij', 0.15), ('limit', 0.148), ('multiplicity', 0.146), ('convergence', 0.146), ('sn', 0.143), ('ln', 0.14), ('compactly', 0.136), ('mdn', 0.129), ('eigenvalue', 0.127), ('converges', 0.121), ('eigenvectors', 0.111), ('collectively', 0.103), ('normalized', 0.1), ('isolated', 0.094), ('spectrum', 0.091), ('proposition', 0.089), ('compact', 0.086), ('luxburg', 0.08), ('converge', 0.078), ('perturbation', 0.073), ('von', 0.073), ('kn', 0.067), ('xi', 0.065), ('theorem', 0.065), ('eigenfunction', 0.058), ('xj', 0.058), ('chatelin', 0.055), ('dpn', 0.055), ('di', 0.054), ('md', 0.053), ('diffusion', 0.052), ('laplacians', 0.051), ('consistency', 0.051), ('theorems', 0.051), ('partitions', 0.049), ('pointwise', 0.049), ('sij', 0.048), ('satis', 0.047), ('nite', 0.046), ('prove', 0.045), ('eigenvector', 0.044), ('laplacian', 0.044), ('holds', 0.044), ('sf', 0.044), ('assumptions', 0.043), ('proofs', 0.042), ('similarity', 0.042), ('meila', 0.041), ('xn', 0.039), ('strong', 0.037), ('asserted', 0.037), ('driessche', 0.037), ('guattery', 0.037), ('kato', 0.037), ('sanity', 0.037), ('spielman', 0.037), ('vaart', 0.037), ('bousquet', 0.037), ('whole', 0.036), ('dp', 0.035), ('eigenfunctions', 0.034), ('multiplication', 0.033), ('jumping', 0.032), ('uii', 0.032), ('dpj', 0.032), ('tors', 0.032), ('sample', 0.032), ('check', 0.032), ('belkin', 0.031), ('nice', 0.031), ('constructed', 0.031), ('exists', 0.031), ('sequence', 0.031), ('proof', 0.03), ('eigenspaces', 0.029), ('kannan', 0.029), ('pn', 0.029), ('clusterings', 0.027), ('surely', 0.026), ('question', 0.025), ('raised', 0.024), ('sensible', 0.024), ('zhou', 0.024), ('degree', 0.024), ('diagonal', 0.024), ('fi', 0.024), ('samples', 0.024), ('chicago', 0.023), ('iff', 0.023), ('tn', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 103 nips-2004-Limits of Spectral Clustering
Author: Ulrike V. Luxburg, Olivier Bousquet, Mikhail Belkin
Abstract: An important aspect of clustering algorithms is whether the partitions constructed on finite samples converge to a useful clustering of the whole data space as the sample size increases. This paper investigates this question for normalized and unnormalized versions of the popular spectral clustering algorithm. Surprisingly, the convergence of unnormalized spectral clustering is more difficult to handle than the normalized case. Even though recently some first results on the convergence of normalized spectral clustering have been obtained, for the unnormalized case we have to develop a completely new approach combining tools from numerical integration, spectral and perturbation theory, and probability. It turns out that while in the normalized case, spectral clustering usually converges to a nice partition of the data space, in the unnormalized case the same only holds under strong additional assumptions which are not always satisfied. We conclude that our analysis gives strong evidence for the superiority of normalized spectral clustering. It also provides a basis for future exploration of other Laplacian-based methods. 1
2 0.22374403 161 nips-2004-Self-Tuning Spectral Clustering
Author: Lihi Zelnik-manor, Pietro Perona
Abstract: We study a number of open issues in spectral clustering: (i) Selecting the appropriate scale of analysis, (ii) Handling multi-scale data, (iii) Clustering with irregular background clutter, and, (iv) Finding automatically the number of groups. We first propose that a ‘local’ scale should be used to compute the affinity between each pair of points. This local scaling leads to better clustering especially when the data includes multiple scales and when the clusters are placed within a cluttered background. We further suggest exploiting the structure of the eigenvectors to infer automatically the number of groups. This leads to a new algorithm in which the final randomly initialized k-means stage is eliminated. 1
3 0.16887563 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
Author: Xiaojin Zhu, Jaz Kandola, Zoubin Ghahramani, John D. Lafferty
Abstract: We present an algorithm based on convex optimization for constructing kernels for semi-supervised learning. The kernel matrices are derived from the spectral decomposition of graph Laplacians, and combine labeled and unlabeled data in a systematic fashion. Unlike previous work using diffusion kernels and Gaussian random field kernels, a nonparametric kernel approach is presented that incorporates order constraints during optimization. This results in flexible kernels and avoids the need to choose among different parametric forms. Our approach relies on a quadratically constrained quadratic program (QCQP), and is computationally feasible for large datasets. We evaluate the kernels on real datasets using support vector machines, with encouraging results. 1
4 0.16345897 115 nips-2004-Maximum Margin Clustering
Author: Linli Xu, James Neufeld, Bryce Larson, Dale Schuurmans
Abstract: We propose a new method for clustering based on finding maximum margin hyperplanes through data. By reformulating the problem in terms of the implied equivalence relation matrix, we can pose the problem as a convex integer program. Although this still yields a difficult computational problem, the hard-clustering constraints can be relaxed to a soft-clustering formulation which can be feasibly solved with a semidefinite program. Since our clustering technique only depends on the data through the kernel matrix, we can easily achieve nonlinear clusterings in the same manner as spectral clustering. Experimental results show that our maximum margin clustering technique often obtains more accurate results than conventional clustering methods. The real benefit of our approach, however, is that it leads naturally to a semi-supervised training method for support vector machines. By maximizing the margin simultaneously on labeled and unlabeled training data, we achieve state of the art performance by using a single, integrated learning principle. 1
5 0.14836298 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods
Author: Chakra Chennubhotla, Allan D. Jepson
Abstract: We show how to build hierarchical, reduced-rank representation for large stochastic matrices and use this representation to design an efficient algorithm for computing the largest eigenvalues, and the corresponding eigenvectors. In particular, the eigen problem is first solved at the coarsest level of the representation. The approximate eigen solution is then interpolated over successive levels of the hierarchy. A small number of power iterations are employed at each stage to correct the eigen solution. The typical speedups obtained by a Matlab implementation of our fast eigensolver over a standard sparse matrix eigensolver [13] are at least a factor of ten for large image sizes. The hierarchical representation has proven to be effective in a min-cut based segmentation algorithm that we proposed recently [8]. 1 Spectral Methods Graph-theoretic spectral methods have gained popularity in a variety of application domains: segmenting images [22]; embedding in low-dimensional spaces [4, 5, 8]; and clustering parallel scientific computation tasks [19]. Spectral methods enable the study of properties global to a dataset, using only local (pairwise) similarity or affinity measurements between the data points. The global properties that emerge are best understood in terms of a random walk formulation on the graph. For example, the graph can be partitioned into clusters by analyzing the perturbations to the stationary distribution of a Markovian relaxation process defined in terms of the affinity weights [17, 18, 24, 7]. The Markovian relaxation process need never be explicitly carried out; instead, it can be analytically expressed using the leading order eigenvectors, and eigenvalues, of the Markov transition matrix. In this paper we consider the practical application of spectral methods to large datasets. In particular, the eigen decomposition can be very expensive, on the order of O(n 3 ), where n is the number of nodes in the graph. While it is possible to compute analytically the first eigenvector (see §3 below), the remaining subspace of vectors (necessary for say clustering) has to be explicitly computed. A typical approach to dealing with this difficulty is to first sparsify the links in the graph [22] and then apply an efficient eigensolver [13, 23, 3]. In comparison, we propose in this paper a specialized eigensolver suitable for large stochastic matrices with known stationary distributions. In particular, we exploit the spectral properties of the Markov transition matrix to generate hierarchical, successively lower-ranked approximations to the full transition matrix. The eigen problem is solved directly at the coarsest level of representation. The approximate eigen solution is then interpolated over successive levels of the hierarchy, using a small number of power iterations to correct the solution at each stage. 2 Previous Work One approach to speeding up the eigen decomposition is to use the fact that the columns of the affinity matrix are typically correlated. The idea then is to pick a small number of representative columns to perform eigen decomposition via SVD. For example, in the Nystrom approximation procedure, originally proposed for integral eigenvalue problems, the idea is to randomly pick a small set of m columns; generate the corresponding affinity matrix; solve the eigenproblem and finally extend the solution to the complete graph [9, 10]. The Nystrom method has also been recently applied in the kernel learning methods for fast Gaussian process classification and regression [25]. Other sampling-based approaches include the work reported in [1, 2, 11]. Our starting point is the transition matrix generated from affinity weights and we show how building a representational hierarchy follows naturally from considering the stochastic matrix. A closely related work is the paper by Lin on reduced rank approximations of transition matrices [14]. We differ in how we approximate the transition matrices, in particular our objective function is computationally less expensive to solve. In particular, one of our goals in reducing transition matrices is to develop a fast, specialized eigen solver for spectral clustering. Fast eigensolving is also the goal in ACE [12], where successive levels in the hierarchy can potentially have negative affinities. A graph coarsening process for clustering was also pursued in [21, 3]. 3 Markov Chain Terminology We first provide a brief overview of the Markov chain terminology here (for more details see [17, 15, 6]). We consider an undirected graph G = (V, E) with vertices vi , for i = {1, . . . , n}, and edges ei,j with non-negative weights ai,j . Here the weight ai,j represents the affinity between vertices vi and vj . The affinities are represented by a non-negative, symmetric n × n matrix A having weights ai,j as elements. The degree of a node j is n n defined to be: dj = i=1 ai,j = j=1 aj,i , where we define D = diag(d1 , . . . , dn ). A Markov chain is defined using these affinities by setting a transition probability matrix M = AD −1 , where the columns of M each sum to 1. The transition probability matrix defines the random walk of a particle on the graph G. The random walk need never be explicitly carried out; instead, it can be analytically expressed using the leading order eigenvectors, and eigenvalues, of the Markov transition matrix. Because the stochastic matrices need not be symmetric in general, a direct eigen decomposition step is not preferred for reasons of instability. This problem is easily circumvented by considering a normalized affinity matrix: L = D −1/2 AD−1/2 , which is related to the stochastic matrix by a similarity transformation: L = D −1/2 M D1/2 . Because L is symmetric, it can be diagonalized: L = U ΛU T , where U = [u1 , u2 , · · · , un ] is an orthogonal set of eigenvectors and Λ is a diagonal matrix of eigenvalues [λ1 , λ2 , · · · , λn ] sorted in decreasing order. The eigenvectors have unit length uk = 1 and from the form of A and D it can be shown that the eigenvalues λi ∈ (−1, 1], with at least one eigenvalue equal to one. Without loss of generality, we take λ1 = 1. Because L and M are similar we can perform an eigen decomposition of the Markov transition matrix as: M = D1/2 LD−1/2 = D1/2 U Λ U T D−1/2 . Thus an eigenvector u of L corresponds to an eigenvector D 1/2 u of M with the same eigenvalue λ. The Markovian relaxation process after β iterations, namely M β , can be represented as: M β = D1/2 U Λβ U T D−1/2 . Therefore, a particle undertaking a random walk with an initial distribution p 0 acquires after β steps a distribution p β given by: p β = M β p 0 . Assuming the graph is connected, as β → ∞, the Markov chain approaches a unique n stationary distribution given by π = diag(D)/ i=1 di , and thus, M ∞ = π1T , where 1 is a n-dim column vector of all ones. Observe that π is an eigenvector of M as it is easy to show that M π = π and the corresponding eigenvalue is 1. Next, we show how to generate hierarchical, successively low-ranked approximations for the transition matrix M . 4 Building a Hierarchy of Transition Matrices The goal is to generate a very fast approximation, while simultaneously achieving sufficient accuracy. For notational ease, we think of M as a fine-scale representation and M as some coarse-scale approximation to be derived here. By coarsening M further, we can generate successive levels of the representation hierarchy. We use the stationary distribution π to construct a corresponding coarse-scale stationary distribution δ. As we just discussed a critical property of the fine scale Markov matrix M is that it is similar to the symmetric matrix L and we wish to preserve this property at every level of the representation hierarchy. 4.1 Deriving Coarse-Scale Stationary Distribution We begin by expressing the stationary distribution π as a probabilistic mixture of latent distributions. In matrix notation, we have (1) π = K δ, where δ is an unknown mixture coefficient vector of length m, K is an n × m non-negative n kernel matrix whose columns are latent distributions that each sum to 1: i=1 Ki,j = 1 and m n. It is easy to derive a maximum likelihood approximation of δ using an EM type algorithm [16]. The main step is to find a stationary point δ, λ for the Lagrangian: m n i=1 m Ki,j δj + λ πi ln E≡− j=1 δj − 1 . (2) j=1 An implicit step in this EM procedure is to compute the the ownership probability r i,j of the j th kernel (or node) at the coarse scale for the ith node on the fine scale and is given by ri,j = δj Ki,j . m k=1 δk Ki,k (3) The EM procedure allows for an update of both δ and the latent distributions in the kernel matrix K (see §8.3.1 in [6]). For initialization, δ is taken to be uniform over the coarse-scale states. But in choosing kernels K, we provide a good initialization for the EM procedure. Specifically, the Markov matrix M is diffused using a small number of iterations to get M β . The diffusion causes random walks from neighboring nodes to be less distinguishable. This in turn helps us select a small number of columns of M β in a fast and greedy way to be the kernel matrix K. We defer the exact details on kernel selection to a later section (§4.3). 4.2 Deriving the Coarse-Scale Transition Matrix In order to define M , the coarse-scale transition matrix, we break it down into three steps. First, the Markov chain propagation at the coarse scale can be defined as: q k+1 = M q k , (4) where q is the coarse scale probability distribution after k steps of the random walk. Second, we expand q k into the fine scale using the kernels K resulting in a fine scale probability distribution p k : p k = Kq k . (5) k Finally, we lift p k back into the coarse scale by using the ownership probability of the j th kernel for the ith node on the fine grid: n qjk+1 = ri,j pik i=1 (6) Substituting for Eqs.(3) and (5) in Eq. 6 gives n m qjk+1 = i=1 n Ki,t qtk = ri,j t=1 i=1 δj Ki,j m k=1 δk Ki,k m Ki,t qtk . (7) t=1 We can write the preceding equation in a matrix form: q k+1 = diag( δ ) K T diag K δ −1 Kq k . (8) Comparing this with Eq. 4, we can derive the transition matrix M as: M = diag( δ ) K T diag K δ −1 (9) K. It is easy to see that δ = M δ, so δ is the stationary distribution for M . Following the definition of M , and its stationary distribution δ, we can generate a symmetric coarse scale affinity matrix A given by A = M diag(δ) = diag( δ ) K T diag K δ −1 Kdiag(δ) , (10) where we substitute for the expression M from Eq. 9. The coarse-scale affinity matrix A is then normalized to get: L = D−1/2 AD−1/2 ; D = diag(d1 , d2 , · · · , dm ), (11) where dj is the degree of node j in the coarse-scale graph represented by the matrix A (see §3 for degree definition). Thus, the coarse scale Markov matrix M is precisely similar to a symmetric matrix L. 4.3 Selecting Kernels For demonstration purpose, we present the kernel selection details on the image of an eye shown below. To begin with, a random walk is defined where each pixel in the test image is associated with a vertex of the graph G. The edges in G are defined by the standard 8-neighbourhood of each pixel. For the demonstrations in this paper, the edge weight ai,j between neighbouring pixels xi and xj is given by a function of the difference in the 2 corresponding intensities I(xi ) and I(xj ): ai,j = exp(−(I(xi ) − I(xj ))2 /2σa ), where σa is set according to the median absolute difference |I(xi ) − I(xj )| between neighbours measured over the entire image. The affinity matrix A with the edge weights is then used to generate a Markov transition matrix M . The kernel selection process we use is fast and greedy. First, the fine scale Markov matrix M is diffused to M β using β = 4. The Markov matrix M is sparse as we make the affinity matrix A sparse. Every column in the diffused matrix M β is a potential kernel. To facilitate the selection process, the second step is to rank order the columns of M β based on a probability value in the stationary distribution π. Third, the kernels (i.e. columns of M β ) are picked in such a way that for a kernel Ki all of the neighbours of pixel i which are within the half-height of the the maximum value in the kernel Ki are suppressed from the selection process. Finally, the kernel selection is continued until every pixel in the image is within a half-height of the peak value of at least one kernel. If M is a full matrix, to avoid the expense of computing M β explicitly, random kernel centers can be selected, and only the corresponding columns of M β need be computed. We show results from a three-scale hierarchy on the eye image (below). The image has 25 × 20 pixels but is shown here enlarged for clarity. At the first coarse scale 83 kernels are picked. The kernels each correspond to a different column in the fine scale transition matrix and the pixels giving rise to these kernels are shown numbered on the image. Using these kernels as an initialization, the EM procedure derives a coarse-scale stationary distribution δ 21 14 26 4 (Eq. 2), while simultaneously updating the kernel ma12 27 2 19 trix. Using the newly updated kernel matrix K and the 5 8 13 23 30 18 6 9 derived stationary distribution δ a transition matrix M 28 20 15 32 10 22 is generated (Eq. 9). The coarse scale Markov matrix 24 17 7 is then diffused to M β , again using β = 4. The kernel Coarse Scale 1 Coarse Scale 2 selection algorithm is reapplied, this time picking 32 kernels for the second coarse scale. Larger values of β cause the coarser level to have fewer elements. But the exact number of elements depends on the form of the kernels themselves. For the random experiments that we describe later in §6 we found β = 2 in the first iteration and 4 thereafter causes the number of kernels to be reduced by a factor of roughly 1/3 to 1/4 at each level. 72 28 35 44 51 64 82 4 12 31 56 19 77 36 45 52 65 13 57 23 37 5 40 53 63 73 14 29 6 66 38 74 47 24 7 30 41 54 71 78 58 15 8 20 39 48 59 67 25 68 79 21 16 2 11 26 42 49 55 60 75 32 83 43 9 76 50 17 27 61 33 69 80 3 46 18 70 81 34 10 62 22 1 25 11 1 3 16 31 29 At coarser levels of the hierarchy, we expect the kernels to get less sparse and so will the affinity and the transition matrices. In order to promote sparsity at successive levels of the hierarchy we sparsify A by zeroing out elements associated with “small” transition probabilities in M . However, in the experiments described later in §6, we observe this sparsification step to be not critical. To summarize, we use the stationary distribution π at the fine-scale to derive a transition matrix M , and its stationary distribution δ, at the coarse-scale. The coarse scale transition in turn helps to derive an affinity matrix A and its normalized version L. It is obvious that this procedure can be repeated recursively. We describe next how to use this representation hierarchy for building a fast eigensolver. 5 Fast EigenSolver Our goal in generating a hierarchical representation of a transition matrix is to develop a fast, specialized eigen solver for spectral clustering. To this end, we perform a full eigen decomposition of the normalized affinity matrix only at the coarsest level. As discussed in the previous section, the affinity matrix at the coarsest level is not likely to be sparse, hence it will need a full (as opposed to a sparse) version of an eigen solver. However it is typically the case that e ≤ m n (even in the case of the three-scale hierarchy that we just considered) and hence we expect this step to be the least expensive computationally. The resulting eigenvectors are interpolated to the next lower level of the hierarchy by a process which will be described next. Because the eigen interpolation process between every adjacent pair of scales in the hierarchy is similar, we will assume we have access to the leading eigenvectors U (size: m × e) for the normalized affinity matrix L (size: m × m) and describe how to generate the leading eigenvectors U (size: n × e), and the leading eigenvalues S (size: e × 1), for the fine-scale normalized affinity matrix L (size: n × n). There are several steps to the eigen interpolation process and in the discussion that follows we refer to the lines in the pseudo-code presented below. First, the coarse-scale eigenvectors U can be interpolated using the kernel matrix K to generate U = K U , an approximation for the fine-scale eigenvectors (line 9). Second, interpolation alone is unlikely to set the directions of U exactly aligned with U L , the vectors one would obtain by a direct eigen decomposition of the fine-scale normalized affinity matrix L. We therefore update the directions in U by applying a small number of power iterations with L, as given in lines 13-15. e e function (U, S) = CoarseToFine(L, K, U , S) 1: INPUT 2: L, K ⇐ {L is n × n and K is n × m where m n} e e e e 3: U /S ⇐ {leading coarse-scale eigenvectors/eigenvalues of L. U is of size m × e, e ≤ m} 4: OUTPUT 5: U, S ⇐ {leading fine-scale eigenvectors/eigenvalues of L. U is n × e and S is e × 1.} x 10 0.4 3 0.96 0.94 0.92 0.9 0.35 2.5 Relative Error Absolute Relative Error 0.98 Eigen Value |δλ|λ−1 −3 Eigen Spectrum 1 2 1.5 1 5 10 15 20 Eigen Index (a) 25 30 0.2 0.15 0.1 0.5 0.88 0.3 0.25 0.05 5 10 15 20 Eigen Index (b) 25 30 5 10 15 20 Eigen Index 25 30 (c) Figure 1: Hierarchical eigensolver results. (a) comparing ground truth eigenvalues S L (red circles) with multi-scale eigensolver spectrum S (blue line) (b) Relative absolute error between eigenvalues: |S−SL | (c) Eigenvector mismatch: 1 − diag |U T UL | , between SL eigenvectors U derived by the multi-scale eigensolver and the ground truth U L . Observe the slight mismatch in the last few eigenvectors, but excellent agreement in the leading eigenvectors (see text). 6: CONSTANTS: TOL = 1e-4; POWER ITERS = 50 7: “ ” e 8: TPI = min POWER ITERS, log(e × eps/TOL)/ log(min(S)) {eps: machine accuracy} e 9: U = K U {interpolation from coarse to fine} 10: while not converged do 11: Uold = U {n × e matrix, e n} 12: for i = 1 to TPI do 13: U ⇐ LU 14: end for 15: U ⇐ Gram-Schmidt(U ) {orthogonalize U } 16: Le = U T LU {L may be sparse, but Le need not be.} 17: Ue Se UeT = svd(Le ) {eigenanalysis of Le , which is of size e × e.} 18: U ⇐ U Ue {update the leading eigenvectors of L} 19: S = diag(Se ) {grab the leading eigenvalues of L} T 20: innerProd = 1 − diag( Uold U ) {1 is a e × 1 vector of all ones} 21: converged = max[abs(innerProd)] < TOL 22: end while The number of power iterations TPI can be bounded as discussed next. Suppose v = U c where U is a matrix of true eigenvectors and c is a coefficient vector for an arbitrary vector v. After TPI power iterations v becomes v = U diag(S TPI )c, where S has the exact eigenvalues. In order for the component of a vector v in the direction Ue (the eth column of U ) not to be swamped by other components, we can limit it’s decay after TPI iterations as TPI follows: (S(e)/S(1)) >= e×eps/TOL, where S(e) is the exact eth eigenvalue, S(1) = 1, eps is the machine precision, TOL is requested accuracy. Because we do not have access to the exact value S(e) at the beginning of the interpolation procedure, we estimate it from the coarse eigenvalues S. This leads to a bound on the power iterations TPI, as derived on the line 9 above. Third, the interpolation process and the power iterations need not preserve orthogonality in the eigenvectors in U . We fix this by Gram-Schmidt orthogonalization procedure (line 16). Finally, there is a still a problem with power iterations that needs to be resolved, in that it is very hard to separate nearby eigenvalues. In particular, for the convergence of the power iterations the ratio that matters is between the (e + 1) st and eth eigenvalues. So the idea we pursue is to use the power iterations only to separate the reduced space of eigenvectors (of dimension e) from the orthogonal subspace (of dimension n − e). We then use a full SVD on the reduced space to update the leading eigenvectors U , and eigenvalues S, for the fine-scale (lines 17-20). This idea is similar to computing the Ritz values and Ritz vectors in a Rayleigh-Ritz method. 6 Interpolation Results Our multi-scale decomposition code is in Matlab. For the direct eigen decomposition, we have used the Matlab program svds.m which invokes the compiled ARPACKC routine [13], with a default convergence tolerance of 1e-10. In Fig. 1a we compare the spectrum S obtained from a three-scale decomposition on the eye image (blue line) with the ground truth, which is the spectrum SL resulting from direct eigen decomposition of the fine-scale normalized affinity matrices L (red circles). There is an excellent agreement in the leading eigenvalues. To illustrate this, we show absolute relative error between the spectra: |S−SL | in Fig. 1b. The spectra agree mostly, except for SL the last few eigenvalues. For a quantitative comparison between the eigenvectors, we plot in Fig. 1c the following measure: 1 − diag(|U T UL |), where U is the matrix of eigenvectors obtained by the multi-scale approximation, UL is the ground-truth resulting from a direct eigen decomposition of the fine-scale affinity matrix L and 1 is a vector of all ones. The relative error plot demonstrates a close match, within the tolerance threshold of 1e-4 that we chose for the multi-scale method, in the leading eigenvector directions between the two methods. The relative error is high with the last few eigen vectors, which suggests that the power iterations have not clearly separated them from other directions. So, the strategy we suggest is to pad the required number of leading eigen basis by about 20% before invoking the multi-scale procedure. Obviously, the number of hierarchical stages for the multi-scale procedure must be chosen such that the transition matrix at the coarsest scale can accommodate the slight increase in the subspace dimensions. For lack of space we are omitting extra results (see Ch.8 in [6]). Next we measure the time the hierarchical eigensolver takes to compute the leading eigenbasis for various input sizes, in comparison with the svds.m procedure [13]. We form images of different input sizes by Gaussian smoothing of i.i.d noise. The Gaussian function has a standard deviation of 3 pixels. The edges in graph G are defined by the standard 8-neighbourhood of each pixel. The edge weights between neighbouring pixels are simply given by a function of the difference in the corresponding intensities (see §4.3). The affinity matrix A with the edge weights is then used to generate a Markov transition matrix M . The fast eigensolver is run on ten different instances of the input image of a given size and the average of these times is reported here. For a fair comparison between the two procedures, we set the convergence tolerance value for the svds.m procedure to be 1e-4, the same as the one used for the fast eigensolver. We found the hierarchical representation derived from this tolerance threshold to be sufficiently accurate for a novel min-cut based segmentation results that we reported in [8]. Also, the subspace dimensionality is fixed to be 51 where we expect (and indeed observe) the leading 40 eigenpairs derived from the multi-scale procedure to be accurate. Hence, while invoking svds.m we compute only the leading 41 eigenpairs. In the table shown below, the first column corresponds to the number of nodes in the graph, while the second and third columns report the time taken in seconds by the svds.m procedure and the Matlab implementation of the multi-scale eigensolver respectively. The fourth column reports the speedups of the multi-scale eigensolver over svds.m procedure on a standard desktop (Intel P4, 2.5GHz, 1GB RAM). Lowering the tolerance threshold for svds.m made it faster by about 20 − 30%. Despite this, the multi-scale algorithm clearly outperforms the svds.m procedure. The most expensive step in the multi-scale algorithm is the power iteration required in the last stage, that is interpolating eigenvectors from the first coarse scale to the required fine scale. The complexity is of the order of n × e where e is the subspace dimensionality and n is the size of the graph. Indeed, from the table we can see that the multi-scale procedure is taking time roughly proportional to n. Deviations from the linear trend are observed at specific values of n, which we believe are due to the n 322 632 642 652 1002 1272 1282 1292 1602 2552 2562 2572 5112 5122 5132 6002 7002 8002 svds.m 1.6 10.8 20.5 12.6 44.2 91.1 230.9 96.9 179.3 819.2 2170.8 871.7 7977.2 20269 7887.2 10841.4 15048.8 Multi-Scale 1.5 4.9 5.5 5.1 13.1 20.4 35.2 20.9 34.4 90.3 188.7 93.3 458.8 739.3 461.9 644.2 1162.4 1936.6 Speedup 1.1 2.2 3.7 2.5 3.4 4.5 6.6 4.6 5.2 9.1 11.5 9.3 17.4 27.4 17.1 16.8 12.9 variations in the difficulty of the specific eigenvalue problem (eg. nearly multiple eigenvalues). The hierarchical representation has proven to be effective in a min-cut based segmentation algorithm that we proposed recently [8]. Here we explored the use of random walks and associated spectral embedding techniques for the automatic generation of suitable proposal (source and sink) regions for a min-cut based algorithm. The multiscale algorithm was used to generate the 40 leading eigenvectors of large transition matrices (eg. size 20K × 20K). In terms of future work, it will be useful to compare our work with other approximate methods for SVD such as [23]. Ack: We thank S. Roweis, F. Estrada and M. Sakr for valuable comments. References [1] D. Achlioptas and F. McSherry. Fast Computation of Low-Rank Approximations. STOC, 2001. [2] D. Achlioptas et al Sampling Techniques for Kernel Methods. NIPS, 2001. [3] S. Barnard and H. Simon Fast Multilevel Implementation of Recursive Spectral Bisection for Partitioning Unstructured Problems. PPSC, 627-632. [4] M. Belkin et al Laplacian Eigenmaps and Spectral Techniques for Embedding. NIPS, 2001. [5] M. Brand et al A unifying theorem for spectral embedding and clustering. AI & STATS, 2002. [6] C. Chennubhotla. Spectral Methods for Multi-scale Feature Extraction and Spectral Clustering. http://www.cs.toronto.edu/˜chakra/thesis.pdf Ph.D Thesis, Department of Computer Science, University of Toronto, Canada, 2004. [7] C. Chennubhotla and A. Jepson. Half-Lives of EigenFlows for Spectral Clustering. NIPS, 2002. [8] F. Estrada, A. Jepson and C. Chennubhotla. Spectral Embedding and Min-Cut for Image Segmentation. Manuscript Under Review, 2004. [9] C. Fowlkes et al Efficient spatiotemporal grouping using the Nystrom method. CVPR, 2001. [10] S. Belongie et al Spectral Partitioning with Indefinite Kernels using Nystrom app. ECCV, 2002. [11] A. Frieze et al Fast Monte-Carlo Algorithms for finding low-rank approximations. FOCS, 1998. [12] Y. Koren et al ACE: A Fast Multiscale Eigenvectors Computation for Drawing Huge Graphs IEEE Symp. on InfoVis 2002, pp. 137-144 [13] R. B. Lehoucq, D. C. Sorensen and C. Yang. ARPACK User Guide: Solution of Large Scale Eigenvalue Problems by Implicitly Restarted Arnoldi Methods. SIAM 1998. [14] J. J. Lin. Reduced Rank Approximations of Transition Matrices. AI & STATS, 2002. [15] L. Lova’sz. Random Walks on Graphs: A Survey Combinatorics, 1996, 353–398. [16] G. J. McLachlan et al Mixture Models: Inference and Applications to Clustering. 1988 [17] M. Meila and J. Shi. A random walks view of spectral segmentation. AI & STATS, 2001. [18] A. Ng, M. Jordan and Y. Weiss. On Spectral Clustering: analysis and an algorithm NIPS, 2001. [19] A. Pothen Graph partitioning algorithms with applications to scientific computing. Parallel Numerical Algorithms, D. E. Keyes et al (eds.), Kluwer Academic Press, 1996. [20] G. L. Scott et al Feature grouping by relocalization of eigenvectors of the proximity matrix. BMVC, pg. 103-108, 1990. [21] E. Sharon et al Fast Multiscale Image Segmentation CVPR, I:70-77, 2000. [22] J. Shi and J. Malik. Normalized cuts and image segmentation. PAMI, August, 2000. [23] H. Simon et al Low-Rank Matrix Approximation Using the Lanczos Bidiagonalization Process with Applications SIAM J. of Sci. Comp. 21(6):2257-2274, 2000. [24] N. Tishby et al Data clustering by Markovian Relaxation NIPS, 2001. [25] C. Williams et al Using the Nystrom method to speed up the kernel machines. NIPS, 2001.
6 0.12193424 188 nips-2004-The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space
7 0.11328734 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition
8 0.11144838 168 nips-2004-Semigroup Kernels on Finite Sets
9 0.09804941 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach
10 0.095462784 150 nips-2004-Proximity Graphs for Clustering and Manifold Learning
11 0.092594609 5 nips-2004-A Harmonic Excitation State-Space Approach to Blind Separation of Speech
12 0.090641595 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
13 0.089757532 185 nips-2004-The Convergence of Contrastive Divergences
14 0.089208208 77 nips-2004-Hierarchical Clustering of a Mixture Model
15 0.084255628 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering
16 0.083248921 61 nips-2004-Efficient Out-of-Sample Extension of Dominant-Set Clusters
17 0.081209965 96 nips-2004-Learning, Regularization and Ill-Posed Inverse Problems
18 0.07748913 30 nips-2004-Binet-Cauchy Kernels
19 0.076054089 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning
20 0.072138771 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment
topicId topicWeight
[(0, -0.213), (1, 0.08), (2, -0.048), (3, -0.004), (4, -0.165), (5, -0.153), (6, -0.2), (7, 0.071), (8, -0.214), (9, -0.01), (10, -0.257), (11, 0.026), (12, -0.014), (13, -0.041), (14, 0.021), (15, -0.061), (16, -0.074), (17, 0.032), (18, -0.105), (19, 0.032), (20, -0.056), (21, 0.017), (22, 0.056), (23, -0.046), (24, -0.087), (25, 0.005), (26, 0.059), (27, 0.032), (28, -0.025), (29, 0.03), (30, -0.009), (31, 0.026), (32, -0.107), (33, 0.036), (34, 0.019), (35, -0.025), (36, 0.051), (37, -0.041), (38, -0.067), (39, 0.024), (40, -0.04), (41, -0.014), (42, -0.028), (43, 0.136), (44, -0.011), (45, -0.098), (46, -0.058), (47, -0.002), (48, 0.031), (49, 0.053)]
simIndex simValue paperId paperTitle
same-paper 1 0.98060882 103 nips-2004-Limits of Spectral Clustering
Author: Ulrike V. Luxburg, Olivier Bousquet, Mikhail Belkin
Abstract: An important aspect of clustering algorithms is whether the partitions constructed on finite samples converge to a useful clustering of the whole data space as the sample size increases. This paper investigates this question for normalized and unnormalized versions of the popular spectral clustering algorithm. Surprisingly, the convergence of unnormalized spectral clustering is more difficult to handle than the normalized case. Even though recently some first results on the convergence of normalized spectral clustering have been obtained, for the unnormalized case we have to develop a completely new approach combining tools from numerical integration, spectral and perturbation theory, and probability. It turns out that while in the normalized case, spectral clustering usually converges to a nice partition of the data space, in the unnormalized case the same only holds under strong additional assumptions which are not always satisfied. We conclude that our analysis gives strong evidence for the superiority of normalized spectral clustering. It also provides a basis for future exploration of other Laplacian-based methods. 1
2 0.7917949 161 nips-2004-Self-Tuning Spectral Clustering
Author: Lihi Zelnik-manor, Pietro Perona
Abstract: We study a number of open issues in spectral clustering: (i) Selecting the appropriate scale of analysis, (ii) Handling multi-scale data, (iii) Clustering with irregular background clutter, and, (iv) Finding automatically the number of groups. We first propose that a ‘local’ scale should be used to compute the affinity between each pair of points. This local scaling leads to better clustering especially when the data includes multiple scales and when the clusters are placed within a cluttered background. We further suggest exploiting the structure of the eigenvectors to infer automatically the number of groups. This leads to a new algorithm in which the final randomly initialized k-means stage is eliminated. 1
3 0.74170041 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods
Author: Chakra Chennubhotla, Allan D. Jepson
Abstract: We show how to build hierarchical, reduced-rank representation for large stochastic matrices and use this representation to design an efficient algorithm for computing the largest eigenvalues, and the corresponding eigenvectors. In particular, the eigen problem is first solved at the coarsest level of the representation. The approximate eigen solution is then interpolated over successive levels of the hierarchy. A small number of power iterations are employed at each stage to correct the eigen solution. The typical speedups obtained by a Matlab implementation of our fast eigensolver over a standard sparse matrix eigensolver [13] are at least a factor of ten for large image sizes. The hierarchical representation has proven to be effective in a min-cut based segmentation algorithm that we proposed recently [8]. 1 Spectral Methods Graph-theoretic spectral methods have gained popularity in a variety of application domains: segmenting images [22]; embedding in low-dimensional spaces [4, 5, 8]; and clustering parallel scientific computation tasks [19]. Spectral methods enable the study of properties global to a dataset, using only local (pairwise) similarity or affinity measurements between the data points. The global properties that emerge are best understood in terms of a random walk formulation on the graph. For example, the graph can be partitioned into clusters by analyzing the perturbations to the stationary distribution of a Markovian relaxation process defined in terms of the affinity weights [17, 18, 24, 7]. The Markovian relaxation process need never be explicitly carried out; instead, it can be analytically expressed using the leading order eigenvectors, and eigenvalues, of the Markov transition matrix. In this paper we consider the practical application of spectral methods to large datasets. In particular, the eigen decomposition can be very expensive, on the order of O(n 3 ), where n is the number of nodes in the graph. While it is possible to compute analytically the first eigenvector (see §3 below), the remaining subspace of vectors (necessary for say clustering) has to be explicitly computed. A typical approach to dealing with this difficulty is to first sparsify the links in the graph [22] and then apply an efficient eigensolver [13, 23, 3]. In comparison, we propose in this paper a specialized eigensolver suitable for large stochastic matrices with known stationary distributions. In particular, we exploit the spectral properties of the Markov transition matrix to generate hierarchical, successively lower-ranked approximations to the full transition matrix. The eigen problem is solved directly at the coarsest level of representation. The approximate eigen solution is then interpolated over successive levels of the hierarchy, using a small number of power iterations to correct the solution at each stage. 2 Previous Work One approach to speeding up the eigen decomposition is to use the fact that the columns of the affinity matrix are typically correlated. The idea then is to pick a small number of representative columns to perform eigen decomposition via SVD. For example, in the Nystrom approximation procedure, originally proposed for integral eigenvalue problems, the idea is to randomly pick a small set of m columns; generate the corresponding affinity matrix; solve the eigenproblem and finally extend the solution to the complete graph [9, 10]. The Nystrom method has also been recently applied in the kernel learning methods for fast Gaussian process classification and regression [25]. Other sampling-based approaches include the work reported in [1, 2, 11]. Our starting point is the transition matrix generated from affinity weights and we show how building a representational hierarchy follows naturally from considering the stochastic matrix. A closely related work is the paper by Lin on reduced rank approximations of transition matrices [14]. We differ in how we approximate the transition matrices, in particular our objective function is computationally less expensive to solve. In particular, one of our goals in reducing transition matrices is to develop a fast, specialized eigen solver for spectral clustering. Fast eigensolving is also the goal in ACE [12], where successive levels in the hierarchy can potentially have negative affinities. A graph coarsening process for clustering was also pursued in [21, 3]. 3 Markov Chain Terminology We first provide a brief overview of the Markov chain terminology here (for more details see [17, 15, 6]). We consider an undirected graph G = (V, E) with vertices vi , for i = {1, . . . , n}, and edges ei,j with non-negative weights ai,j . Here the weight ai,j represents the affinity between vertices vi and vj . The affinities are represented by a non-negative, symmetric n × n matrix A having weights ai,j as elements. The degree of a node j is n n defined to be: dj = i=1 ai,j = j=1 aj,i , where we define D = diag(d1 , . . . , dn ). A Markov chain is defined using these affinities by setting a transition probability matrix M = AD −1 , where the columns of M each sum to 1. The transition probability matrix defines the random walk of a particle on the graph G. The random walk need never be explicitly carried out; instead, it can be analytically expressed using the leading order eigenvectors, and eigenvalues, of the Markov transition matrix. Because the stochastic matrices need not be symmetric in general, a direct eigen decomposition step is not preferred for reasons of instability. This problem is easily circumvented by considering a normalized affinity matrix: L = D −1/2 AD−1/2 , which is related to the stochastic matrix by a similarity transformation: L = D −1/2 M D1/2 . Because L is symmetric, it can be diagonalized: L = U ΛU T , where U = [u1 , u2 , · · · , un ] is an orthogonal set of eigenvectors and Λ is a diagonal matrix of eigenvalues [λ1 , λ2 , · · · , λn ] sorted in decreasing order. The eigenvectors have unit length uk = 1 and from the form of A and D it can be shown that the eigenvalues λi ∈ (−1, 1], with at least one eigenvalue equal to one. Without loss of generality, we take λ1 = 1. Because L and M are similar we can perform an eigen decomposition of the Markov transition matrix as: M = D1/2 LD−1/2 = D1/2 U Λ U T D−1/2 . Thus an eigenvector u of L corresponds to an eigenvector D 1/2 u of M with the same eigenvalue λ. The Markovian relaxation process after β iterations, namely M β , can be represented as: M β = D1/2 U Λβ U T D−1/2 . Therefore, a particle undertaking a random walk with an initial distribution p 0 acquires after β steps a distribution p β given by: p β = M β p 0 . Assuming the graph is connected, as β → ∞, the Markov chain approaches a unique n stationary distribution given by π = diag(D)/ i=1 di , and thus, M ∞ = π1T , where 1 is a n-dim column vector of all ones. Observe that π is an eigenvector of M as it is easy to show that M π = π and the corresponding eigenvalue is 1. Next, we show how to generate hierarchical, successively low-ranked approximations for the transition matrix M . 4 Building a Hierarchy of Transition Matrices The goal is to generate a very fast approximation, while simultaneously achieving sufficient accuracy. For notational ease, we think of M as a fine-scale representation and M as some coarse-scale approximation to be derived here. By coarsening M further, we can generate successive levels of the representation hierarchy. We use the stationary distribution π to construct a corresponding coarse-scale stationary distribution δ. As we just discussed a critical property of the fine scale Markov matrix M is that it is similar to the symmetric matrix L and we wish to preserve this property at every level of the representation hierarchy. 4.1 Deriving Coarse-Scale Stationary Distribution We begin by expressing the stationary distribution π as a probabilistic mixture of latent distributions. In matrix notation, we have (1) π = K δ, where δ is an unknown mixture coefficient vector of length m, K is an n × m non-negative n kernel matrix whose columns are latent distributions that each sum to 1: i=1 Ki,j = 1 and m n. It is easy to derive a maximum likelihood approximation of δ using an EM type algorithm [16]. The main step is to find a stationary point δ, λ for the Lagrangian: m n i=1 m Ki,j δj + λ πi ln E≡− j=1 δj − 1 . (2) j=1 An implicit step in this EM procedure is to compute the the ownership probability r i,j of the j th kernel (or node) at the coarse scale for the ith node on the fine scale and is given by ri,j = δj Ki,j . m k=1 δk Ki,k (3) The EM procedure allows for an update of both δ and the latent distributions in the kernel matrix K (see §8.3.1 in [6]). For initialization, δ is taken to be uniform over the coarse-scale states. But in choosing kernels K, we provide a good initialization for the EM procedure. Specifically, the Markov matrix M is diffused using a small number of iterations to get M β . The diffusion causes random walks from neighboring nodes to be less distinguishable. This in turn helps us select a small number of columns of M β in a fast and greedy way to be the kernel matrix K. We defer the exact details on kernel selection to a later section (§4.3). 4.2 Deriving the Coarse-Scale Transition Matrix In order to define M , the coarse-scale transition matrix, we break it down into three steps. First, the Markov chain propagation at the coarse scale can be defined as: q k+1 = M q k , (4) where q is the coarse scale probability distribution after k steps of the random walk. Second, we expand q k into the fine scale using the kernels K resulting in a fine scale probability distribution p k : p k = Kq k . (5) k Finally, we lift p k back into the coarse scale by using the ownership probability of the j th kernel for the ith node on the fine grid: n qjk+1 = ri,j pik i=1 (6) Substituting for Eqs.(3) and (5) in Eq. 6 gives n m qjk+1 = i=1 n Ki,t qtk = ri,j t=1 i=1 δj Ki,j m k=1 δk Ki,k m Ki,t qtk . (7) t=1 We can write the preceding equation in a matrix form: q k+1 = diag( δ ) K T diag K δ −1 Kq k . (8) Comparing this with Eq. 4, we can derive the transition matrix M as: M = diag( δ ) K T diag K δ −1 (9) K. It is easy to see that δ = M δ, so δ is the stationary distribution for M . Following the definition of M , and its stationary distribution δ, we can generate a symmetric coarse scale affinity matrix A given by A = M diag(δ) = diag( δ ) K T diag K δ −1 Kdiag(δ) , (10) where we substitute for the expression M from Eq. 9. The coarse-scale affinity matrix A is then normalized to get: L = D−1/2 AD−1/2 ; D = diag(d1 , d2 , · · · , dm ), (11) where dj is the degree of node j in the coarse-scale graph represented by the matrix A (see §3 for degree definition). Thus, the coarse scale Markov matrix M is precisely similar to a symmetric matrix L. 4.3 Selecting Kernels For demonstration purpose, we present the kernel selection details on the image of an eye shown below. To begin with, a random walk is defined where each pixel in the test image is associated with a vertex of the graph G. The edges in G are defined by the standard 8-neighbourhood of each pixel. For the demonstrations in this paper, the edge weight ai,j between neighbouring pixels xi and xj is given by a function of the difference in the 2 corresponding intensities I(xi ) and I(xj ): ai,j = exp(−(I(xi ) − I(xj ))2 /2σa ), where σa is set according to the median absolute difference |I(xi ) − I(xj )| between neighbours measured over the entire image. The affinity matrix A with the edge weights is then used to generate a Markov transition matrix M . The kernel selection process we use is fast and greedy. First, the fine scale Markov matrix M is diffused to M β using β = 4. The Markov matrix M is sparse as we make the affinity matrix A sparse. Every column in the diffused matrix M β is a potential kernel. To facilitate the selection process, the second step is to rank order the columns of M β based on a probability value in the stationary distribution π. Third, the kernels (i.e. columns of M β ) are picked in such a way that for a kernel Ki all of the neighbours of pixel i which are within the half-height of the the maximum value in the kernel Ki are suppressed from the selection process. Finally, the kernel selection is continued until every pixel in the image is within a half-height of the peak value of at least one kernel. If M is a full matrix, to avoid the expense of computing M β explicitly, random kernel centers can be selected, and only the corresponding columns of M β need be computed. We show results from a three-scale hierarchy on the eye image (below). The image has 25 × 20 pixels but is shown here enlarged for clarity. At the first coarse scale 83 kernels are picked. The kernels each correspond to a different column in the fine scale transition matrix and the pixels giving rise to these kernels are shown numbered on the image. Using these kernels as an initialization, the EM procedure derives a coarse-scale stationary distribution δ 21 14 26 4 (Eq. 2), while simultaneously updating the kernel ma12 27 2 19 trix. Using the newly updated kernel matrix K and the 5 8 13 23 30 18 6 9 derived stationary distribution δ a transition matrix M 28 20 15 32 10 22 is generated (Eq. 9). The coarse scale Markov matrix 24 17 7 is then diffused to M β , again using β = 4. The kernel Coarse Scale 1 Coarse Scale 2 selection algorithm is reapplied, this time picking 32 kernels for the second coarse scale. Larger values of β cause the coarser level to have fewer elements. But the exact number of elements depends on the form of the kernels themselves. For the random experiments that we describe later in §6 we found β = 2 in the first iteration and 4 thereafter causes the number of kernels to be reduced by a factor of roughly 1/3 to 1/4 at each level. 72 28 35 44 51 64 82 4 12 31 56 19 77 36 45 52 65 13 57 23 37 5 40 53 63 73 14 29 6 66 38 74 47 24 7 30 41 54 71 78 58 15 8 20 39 48 59 67 25 68 79 21 16 2 11 26 42 49 55 60 75 32 83 43 9 76 50 17 27 61 33 69 80 3 46 18 70 81 34 10 62 22 1 25 11 1 3 16 31 29 At coarser levels of the hierarchy, we expect the kernels to get less sparse and so will the affinity and the transition matrices. In order to promote sparsity at successive levels of the hierarchy we sparsify A by zeroing out elements associated with “small” transition probabilities in M . However, in the experiments described later in §6, we observe this sparsification step to be not critical. To summarize, we use the stationary distribution π at the fine-scale to derive a transition matrix M , and its stationary distribution δ, at the coarse-scale. The coarse scale transition in turn helps to derive an affinity matrix A and its normalized version L. It is obvious that this procedure can be repeated recursively. We describe next how to use this representation hierarchy for building a fast eigensolver. 5 Fast EigenSolver Our goal in generating a hierarchical representation of a transition matrix is to develop a fast, specialized eigen solver for spectral clustering. To this end, we perform a full eigen decomposition of the normalized affinity matrix only at the coarsest level. As discussed in the previous section, the affinity matrix at the coarsest level is not likely to be sparse, hence it will need a full (as opposed to a sparse) version of an eigen solver. However it is typically the case that e ≤ m n (even in the case of the three-scale hierarchy that we just considered) and hence we expect this step to be the least expensive computationally. The resulting eigenvectors are interpolated to the next lower level of the hierarchy by a process which will be described next. Because the eigen interpolation process between every adjacent pair of scales in the hierarchy is similar, we will assume we have access to the leading eigenvectors U (size: m × e) for the normalized affinity matrix L (size: m × m) and describe how to generate the leading eigenvectors U (size: n × e), and the leading eigenvalues S (size: e × 1), for the fine-scale normalized affinity matrix L (size: n × n). There are several steps to the eigen interpolation process and in the discussion that follows we refer to the lines in the pseudo-code presented below. First, the coarse-scale eigenvectors U can be interpolated using the kernel matrix K to generate U = K U , an approximation for the fine-scale eigenvectors (line 9). Second, interpolation alone is unlikely to set the directions of U exactly aligned with U L , the vectors one would obtain by a direct eigen decomposition of the fine-scale normalized affinity matrix L. We therefore update the directions in U by applying a small number of power iterations with L, as given in lines 13-15. e e function (U, S) = CoarseToFine(L, K, U , S) 1: INPUT 2: L, K ⇐ {L is n × n and K is n × m where m n} e e e e 3: U /S ⇐ {leading coarse-scale eigenvectors/eigenvalues of L. U is of size m × e, e ≤ m} 4: OUTPUT 5: U, S ⇐ {leading fine-scale eigenvectors/eigenvalues of L. U is n × e and S is e × 1.} x 10 0.4 3 0.96 0.94 0.92 0.9 0.35 2.5 Relative Error Absolute Relative Error 0.98 Eigen Value |δλ|λ−1 −3 Eigen Spectrum 1 2 1.5 1 5 10 15 20 Eigen Index (a) 25 30 0.2 0.15 0.1 0.5 0.88 0.3 0.25 0.05 5 10 15 20 Eigen Index (b) 25 30 5 10 15 20 Eigen Index 25 30 (c) Figure 1: Hierarchical eigensolver results. (a) comparing ground truth eigenvalues S L (red circles) with multi-scale eigensolver spectrum S (blue line) (b) Relative absolute error between eigenvalues: |S−SL | (c) Eigenvector mismatch: 1 − diag |U T UL | , between SL eigenvectors U derived by the multi-scale eigensolver and the ground truth U L . Observe the slight mismatch in the last few eigenvectors, but excellent agreement in the leading eigenvectors (see text). 6: CONSTANTS: TOL = 1e-4; POWER ITERS = 50 7: “ ” e 8: TPI = min POWER ITERS, log(e × eps/TOL)/ log(min(S)) {eps: machine accuracy} e 9: U = K U {interpolation from coarse to fine} 10: while not converged do 11: Uold = U {n × e matrix, e n} 12: for i = 1 to TPI do 13: U ⇐ LU 14: end for 15: U ⇐ Gram-Schmidt(U ) {orthogonalize U } 16: Le = U T LU {L may be sparse, but Le need not be.} 17: Ue Se UeT = svd(Le ) {eigenanalysis of Le , which is of size e × e.} 18: U ⇐ U Ue {update the leading eigenvectors of L} 19: S = diag(Se ) {grab the leading eigenvalues of L} T 20: innerProd = 1 − diag( Uold U ) {1 is a e × 1 vector of all ones} 21: converged = max[abs(innerProd)] < TOL 22: end while The number of power iterations TPI can be bounded as discussed next. Suppose v = U c where U is a matrix of true eigenvectors and c is a coefficient vector for an arbitrary vector v. After TPI power iterations v becomes v = U diag(S TPI )c, where S has the exact eigenvalues. In order for the component of a vector v in the direction Ue (the eth column of U ) not to be swamped by other components, we can limit it’s decay after TPI iterations as TPI follows: (S(e)/S(1)) >= e×eps/TOL, where S(e) is the exact eth eigenvalue, S(1) = 1, eps is the machine precision, TOL is requested accuracy. Because we do not have access to the exact value S(e) at the beginning of the interpolation procedure, we estimate it from the coarse eigenvalues S. This leads to a bound on the power iterations TPI, as derived on the line 9 above. Third, the interpolation process and the power iterations need not preserve orthogonality in the eigenvectors in U . We fix this by Gram-Schmidt orthogonalization procedure (line 16). Finally, there is a still a problem with power iterations that needs to be resolved, in that it is very hard to separate nearby eigenvalues. In particular, for the convergence of the power iterations the ratio that matters is between the (e + 1) st and eth eigenvalues. So the idea we pursue is to use the power iterations only to separate the reduced space of eigenvectors (of dimension e) from the orthogonal subspace (of dimension n − e). We then use a full SVD on the reduced space to update the leading eigenvectors U , and eigenvalues S, for the fine-scale (lines 17-20). This idea is similar to computing the Ritz values and Ritz vectors in a Rayleigh-Ritz method. 6 Interpolation Results Our multi-scale decomposition code is in Matlab. For the direct eigen decomposition, we have used the Matlab program svds.m which invokes the compiled ARPACKC routine [13], with a default convergence tolerance of 1e-10. In Fig. 1a we compare the spectrum S obtained from a three-scale decomposition on the eye image (blue line) with the ground truth, which is the spectrum SL resulting from direct eigen decomposition of the fine-scale normalized affinity matrices L (red circles). There is an excellent agreement in the leading eigenvalues. To illustrate this, we show absolute relative error between the spectra: |S−SL | in Fig. 1b. The spectra agree mostly, except for SL the last few eigenvalues. For a quantitative comparison between the eigenvectors, we plot in Fig. 1c the following measure: 1 − diag(|U T UL |), where U is the matrix of eigenvectors obtained by the multi-scale approximation, UL is the ground-truth resulting from a direct eigen decomposition of the fine-scale affinity matrix L and 1 is a vector of all ones. The relative error plot demonstrates a close match, within the tolerance threshold of 1e-4 that we chose for the multi-scale method, in the leading eigenvector directions between the two methods. The relative error is high with the last few eigen vectors, which suggests that the power iterations have not clearly separated them from other directions. So, the strategy we suggest is to pad the required number of leading eigen basis by about 20% before invoking the multi-scale procedure. Obviously, the number of hierarchical stages for the multi-scale procedure must be chosen such that the transition matrix at the coarsest scale can accommodate the slight increase in the subspace dimensions. For lack of space we are omitting extra results (see Ch.8 in [6]). Next we measure the time the hierarchical eigensolver takes to compute the leading eigenbasis for various input sizes, in comparison with the svds.m procedure [13]. We form images of different input sizes by Gaussian smoothing of i.i.d noise. The Gaussian function has a standard deviation of 3 pixels. The edges in graph G are defined by the standard 8-neighbourhood of each pixel. The edge weights between neighbouring pixels are simply given by a function of the difference in the corresponding intensities (see §4.3). The affinity matrix A with the edge weights is then used to generate a Markov transition matrix M . The fast eigensolver is run on ten different instances of the input image of a given size and the average of these times is reported here. For a fair comparison between the two procedures, we set the convergence tolerance value for the svds.m procedure to be 1e-4, the same as the one used for the fast eigensolver. We found the hierarchical representation derived from this tolerance threshold to be sufficiently accurate for a novel min-cut based segmentation results that we reported in [8]. Also, the subspace dimensionality is fixed to be 51 where we expect (and indeed observe) the leading 40 eigenpairs derived from the multi-scale procedure to be accurate. Hence, while invoking svds.m we compute only the leading 41 eigenpairs. In the table shown below, the first column corresponds to the number of nodes in the graph, while the second and third columns report the time taken in seconds by the svds.m procedure and the Matlab implementation of the multi-scale eigensolver respectively. The fourth column reports the speedups of the multi-scale eigensolver over svds.m procedure on a standard desktop (Intel P4, 2.5GHz, 1GB RAM). Lowering the tolerance threshold for svds.m made it faster by about 20 − 30%. Despite this, the multi-scale algorithm clearly outperforms the svds.m procedure. The most expensive step in the multi-scale algorithm is the power iteration required in the last stage, that is interpolating eigenvectors from the first coarse scale to the required fine scale. The complexity is of the order of n × e where e is the subspace dimensionality and n is the size of the graph. Indeed, from the table we can see that the multi-scale procedure is taking time roughly proportional to n. Deviations from the linear trend are observed at specific values of n, which we believe are due to the n 322 632 642 652 1002 1272 1282 1292 1602 2552 2562 2572 5112 5122 5132 6002 7002 8002 svds.m 1.6 10.8 20.5 12.6 44.2 91.1 230.9 96.9 179.3 819.2 2170.8 871.7 7977.2 20269 7887.2 10841.4 15048.8 Multi-Scale 1.5 4.9 5.5 5.1 13.1 20.4 35.2 20.9 34.4 90.3 188.7 93.3 458.8 739.3 461.9 644.2 1162.4 1936.6 Speedup 1.1 2.2 3.7 2.5 3.4 4.5 6.6 4.6 5.2 9.1 11.5 9.3 17.4 27.4 17.1 16.8 12.9 variations in the difficulty of the specific eigenvalue problem (eg. nearly multiple eigenvalues). The hierarchical representation has proven to be effective in a min-cut based segmentation algorithm that we proposed recently [8]. Here we explored the use of random walks and associated spectral embedding techniques for the automatic generation of suitable proposal (source and sink) regions for a min-cut based algorithm. The multiscale algorithm was used to generate the 40 leading eigenvectors of large transition matrices (eg. size 20K × 20K). In terms of future work, it will be useful to compare our work with other approximate methods for SVD such as [23]. Ack: We thank S. Roweis, F. Estrada and M. Sakr for valuable comments. References [1] D. Achlioptas and F. McSherry. Fast Computation of Low-Rank Approximations. STOC, 2001. [2] D. Achlioptas et al Sampling Techniques for Kernel Methods. NIPS, 2001. [3] S. Barnard and H. Simon Fast Multilevel Implementation of Recursive Spectral Bisection for Partitioning Unstructured Problems. PPSC, 627-632. [4] M. Belkin et al Laplacian Eigenmaps and Spectral Techniques for Embedding. NIPS, 2001. [5] M. Brand et al A unifying theorem for spectral embedding and clustering. AI & STATS, 2002. [6] C. Chennubhotla. Spectral Methods for Multi-scale Feature Extraction and Spectral Clustering. http://www.cs.toronto.edu/˜chakra/thesis.pdf Ph.D Thesis, Department of Computer Science, University of Toronto, Canada, 2004. [7] C. Chennubhotla and A. Jepson. Half-Lives of EigenFlows for Spectral Clustering. NIPS, 2002. [8] F. Estrada, A. Jepson and C. Chennubhotla. Spectral Embedding and Min-Cut for Image Segmentation. Manuscript Under Review, 2004. [9] C. Fowlkes et al Efficient spatiotemporal grouping using the Nystrom method. CVPR, 2001. [10] S. Belongie et al Spectral Partitioning with Indefinite Kernels using Nystrom app. ECCV, 2002. [11] A. Frieze et al Fast Monte-Carlo Algorithms for finding low-rank approximations. FOCS, 1998. [12] Y. Koren et al ACE: A Fast Multiscale Eigenvectors Computation for Drawing Huge Graphs IEEE Symp. on InfoVis 2002, pp. 137-144 [13] R. B. Lehoucq, D. C. Sorensen and C. Yang. ARPACK User Guide: Solution of Large Scale Eigenvalue Problems by Implicitly Restarted Arnoldi Methods. SIAM 1998. [14] J. J. Lin. Reduced Rank Approximations of Transition Matrices. AI & STATS, 2002. [15] L. Lova’sz. Random Walks on Graphs: A Survey Combinatorics, 1996, 353–398. [16] G. J. McLachlan et al Mixture Models: Inference and Applications to Clustering. 1988 [17] M. Meila and J. Shi. A random walks view of spectral segmentation. AI & STATS, 2001. [18] A. Ng, M. Jordan and Y. Weiss. On Spectral Clustering: analysis and an algorithm NIPS, 2001. [19] A. Pothen Graph partitioning algorithms with applications to scientific computing. Parallel Numerical Algorithms, D. E. Keyes et al (eds.), Kluwer Academic Press, 1996. [20] G. L. Scott et al Feature grouping by relocalization of eigenvectors of the proximity matrix. BMVC, pg. 103-108, 1990. [21] E. Sharon et al Fast Multiscale Image Segmentation CVPR, I:70-77, 2000. [22] J. Shi and J. Malik. Normalized cuts and image segmentation. PAMI, August, 2000. [23] H. Simon et al Low-Rank Matrix Approximation Using the Lanczos Bidiagonalization Process with Applications SIAM J. of Sci. Comp. 21(6):2257-2274, 2000. [24] N. Tishby et al Data clustering by Markovian Relaxation NIPS, 2001. [25] C. Williams et al Using the Nystrom method to speed up the kernel machines. NIPS, 2001.
4 0.5520103 115 nips-2004-Maximum Margin Clustering
Author: Linli Xu, James Neufeld, Bryce Larson, Dale Schuurmans
Abstract: We propose a new method for clustering based on finding maximum margin hyperplanes through data. By reformulating the problem in terms of the implied equivalence relation matrix, we can pose the problem as a convex integer program. Although this still yields a difficult computational problem, the hard-clustering constraints can be relaxed to a soft-clustering formulation which can be feasibly solved with a semidefinite program. Since our clustering technique only depends on the data through the kernel matrix, we can easily achieve nonlinear clusterings in the same manner as spectral clustering. Experimental results show that our maximum margin clustering technique often obtains more accurate results than conventional clustering methods. The real benefit of our approach, however, is that it leads naturally to a semi-supervised training method for support vector machines. By maximizing the margin simultaneously on labeled and unlabeled training data, we achieve state of the art performance by using a single, integrated learning principle. 1
5 0.54671717 61 nips-2004-Efficient Out-of-Sample Extension of Dominant-Set Clusters
Author: Massimiliano Pavan, Marcello Pelillo
Abstract: Dominant sets are a new graph-theoretic concept that has proven to be relevant in pairwise data clustering problems, such as image segmentation. They generalize the notion of a maximal clique to edgeweighted graphs and have intriguing, non-trivial connections to continuous quadratic optimization and spectral-based grouping. We address the problem of grouping out-of-sample examples after the clustering process has taken place. This may serve either to drastically reduce the computational burden associated to the processing of very large data sets, or to efficiently deal with dynamic situations whereby data sets need to be updated continually. We show that the very notion of a dominant set offers a simple and efficient way of doing this. Numerical experiments on various grouping problems show the effectiveness of the approach. 1
6 0.51212054 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
7 0.50162309 188 nips-2004-The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space
8 0.49157256 150 nips-2004-Proximity Graphs for Clustering and Manifold Learning
9 0.4906041 185 nips-2004-The Convergence of Contrastive Divergences
10 0.45090157 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition
11 0.43041682 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach
12 0.39421856 30 nips-2004-Binet-Cauchy Kernels
13 0.39030188 168 nips-2004-Semigroup Kernels on Finite Sets
14 0.39022154 190 nips-2004-The Rescorla-Wagner Algorithm and Maximum Likelihood Estimation of Causal Parameters
15 0.38657865 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering
16 0.37738869 77 nips-2004-Hierarchical Clustering of a Mixture Model
17 0.3722738 96 nips-2004-Learning, Regularization and Ill-Posed Inverse Problems
18 0.33302456 94 nips-2004-Kernels for Multi--task Learning
19 0.32954481 165 nips-2004-Semi-supervised Learning on Directed Graphs
20 0.32901612 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
topicId topicWeight
[(13, 0.115), (15, 0.126), (26, 0.092), (31, 0.028), (33, 0.174), (35, 0.021), (39, 0.022), (50, 0.076), (76, 0.01), (79, 0.226), (82, 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.86385149 103 nips-2004-Limits of Spectral Clustering
Author: Ulrike V. Luxburg, Olivier Bousquet, Mikhail Belkin
Abstract: An important aspect of clustering algorithms is whether the partitions constructed on finite samples converge to a useful clustering of the whole data space as the sample size increases. This paper investigates this question for normalized and unnormalized versions of the popular spectral clustering algorithm. Surprisingly, the convergence of unnormalized spectral clustering is more difficult to handle than the normalized case. Even though recently some first results on the convergence of normalized spectral clustering have been obtained, for the unnormalized case we have to develop a completely new approach combining tools from numerical integration, spectral and perturbation theory, and probability. It turns out that while in the normalized case, spectral clustering usually converges to a nice partition of the data space, in the unnormalized case the same only holds under strong additional assumptions which are not always satisfied. We conclude that our analysis gives strong evidence for the superiority of normalized spectral clustering. It also provides a basis for future exploration of other Laplacian-based methods. 1
2 0.75349545 69 nips-2004-Fast Rates to Bayes for Kernel Machines
Author: Ingo Steinwart, Clint Scovel
Abstract: We establish learning rates to the Bayes risk for support vector machines (SVMs) with hinge loss. In particular, for SVMs with Gaussian RBF kernels we propose a geometric condition for distributions which can be used to determine approximation properties of these kernels. Finally, we compare our methods with a recent paper of G. Blanchard et al.. 1
3 0.75205439 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer
Abstract: Prediction suffix trees (PST) provide a popular and effective tool for tasks such as compression, classification, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efficient algorithm that neither relies on a-priori assumptions on the shape or maximal depth of the target PST nor does it require any parameters. To our knowledge, this is the first provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any fixed PST determined in hindsight. 1
4 0.74691361 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms
Author: Nicolò Cesa-bianchi, Claudio Gentile, Luca Zaniboni
Abstract: We provide a worst-case analysis of selective sampling algorithms for learning linear threshold functions. The algorithms considered in this paper are Perceptron-like algorithms, i.e., algorithms which can be efficiently run in any reproducing kernel Hilbert space. Our algorithms exploit a simple margin-based randomized rule to decide whether to query the current label. We obtain selective sampling algorithms achieving on average the same bounds as those proven for their deterministic counterparts, but using much fewer labels. We complement our theoretical findings with an empirical comparison on two text categorization tasks. The outcome of these experiments is largely predicted by our theoretical results: Our selective sampling algorithms tend to perform as good as the algorithms receiving the true label after each classification, while observing in practice substantially fewer labels. 1
5 0.74003047 131 nips-2004-Non-Local Manifold Tangent Learning
Author: Yoshua Bengio, Martin Monperrus
Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation suggests to explore non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where a local non-parametric method fails. 1
6 0.73665369 28 nips-2004-Bayesian inference in spiking neurons
7 0.7358135 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
8 0.73235965 116 nips-2004-Message Errors in Belief Propagation
9 0.73120815 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill
10 0.73103303 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons
11 0.73041391 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss
12 0.73002499 65 nips-2004-Exploration-Exploitation Tradeoffs for Experts Algorithms in Reactive Environments
13 0.72961617 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform
14 0.72746652 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices
15 0.72677493 102 nips-2004-Learning first-order Markov models for control
16 0.72672164 70 nips-2004-Following Curved Regularized Optimization Solution Paths
17 0.72665107 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants
18 0.72612429 96 nips-2004-Learning, Regularization and Ill-Posed Inverse Problems
19 0.7259692 151 nips-2004-Rate- and Phase-coded Autoassociative Memory
20 0.72551179 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve