nips nips2003 nips2003-107 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Francis R. Bach, Michael I. Jordan
Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive a new cost function for spectral clustering based on a measure of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing this cost function with respect to the partition leads to a new spectral clustering algorithm. Minimizing with respect to the similarity matrix leads to an algorithm for learning the similarity matrix. We develop a tractable approximation of our cost function that is based on the power method of computing eigenvectors. 1
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters with points in the same cluster having high similarity and points in different clusters having low similarity. [sent-7, score-1.742]
2 In this paper, we derive a new cost function for spectral clustering based on a measure of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. [sent-8, score-1.553]
3 Minimizing this cost function with respect to the partition leads to a new spectral clustering algorithm. [sent-9, score-0.992]
4 Minimizing with respect to the similarity matrix leads to an algorithm for learning the similarity matrix. [sent-10, score-0.893]
5 We develop a tractable approximation of our cost function that is based on the power method of computing eigenvectors. [sent-11, score-0.226]
6 1 Introduction Spectral clustering has many applications in machine learning, exploratory data analysis, computer vision and speech processing. [sent-12, score-0.348]
7 Most techniques explicitly or implicitly assume a metric or a similarity structure over the space of configurations, which is then used by clustering algorithms. [sent-13, score-0.685]
8 Thus, time-consuming manual feature selection and weighting is often a necessary precursor to the use of spectral methods. [sent-15, score-0.344]
9 Several recent papers have considered ways to alleviate this burden by incorporating prior knowledge into the metric, either in the setting of K-means clustering [1, 2] or spectral clustering [3, 4]. [sent-16, score-0.956]
10 In this paper, we consider a complementary approach, providing a general framework for learning the similarity matrix for spectral clustering from examples. [sent-17, score-1.119]
11 We assume that we are given sample data with known partitions and are asked to build similarity matrices that will lead to these partitions when spectral clustering is performed. [sent-18, score-1.197]
12 Another important motivation for our work is the need to develop spectral clustering methods that are robust to irrelevant features. [sent-20, score-0.732]
13 As we show in Section 4.2, the performance of current spectral methods can degrade dramatically in the presence of such irrelevant features. [sent-22, score-0.426]
14 Our work is based on a new cost function J(W, e) that characterizes how close the eigenstructure of a similarity matrix W is to a partition e. [sent-24, score-0.917]
15 As we show in Section 2.3, minimizing J with respect to e leads to a new clustering algorithm that takes the form of a weighted K-means algorithm. [sent-27, score-0.529]
16 Minimizing J with respect to W yields an algorithm for learning the similarity matrix, as we show in Section 4. [sent-28, score-0.385]
17 Section 3 provides foundational material on the approximation of the eigensubspace of a symmetric matrix that is needed for Section 4. [sent-29, score-0.288]
18 Let D denote the diagonal matrix whose i-th diagonal element is the sum of the elements in the i-th row of W, i.e., $D = \operatorname{diag}(W\mathbf{1})$. [sent-32, score-0.2]
19 The classical relaxation of this NP-hard problem [7, 8, 9] leads to an eigenvalue problem. [sent-37, score-0.222]
20 2.1 Normalized cut and graph partitioning. The clustering problem is usually defined in terms of a complete graph with vertices V = {1, . . . , P} [sent-40, score-0.58]
21 and an affinity matrix with weights $W_{pp'}$, for $p, p' \in V$. [sent-43, score-0.132]
22 We wish to find R disjoint clusters $A = (A_r)_{r \in \{1, \dots, R\}}$, where $\cup_r A_r = V$, that optimize a certain cost function. [sent-47, score-0.174]
23 An example of such a function is the R-way normalized cut defined as follows [7, 10]: $C(A, W) = \sum_{r=1}^{R} \sum_{i \in A_r, j \in V \setminus A_r} W_{ij} \,/ \sum_{i \in A_r, j \in V} W_{ij}$. [sent-48, score-0.177]
24 Let $e_r$ be the indicator vector in $\mathbb{R}^P$ for the r-th cluster, i. [sent-49, score-0.182]
25 e., $e_r \in \{0, 1\}^P$ is such that $e_r$ has a nonzero component exactly at points in the r-th cluster. [sent-51, score-0.403]
26 A short calculation reveals that the normalized cut is then equal to $C(e, W) = \sum_{r=1}^{R} e_r^\top (D - W) e_r / (e_r^\top D e_r)$. [sent-53, score-0.359]
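As a concrete illustration of this identity, the R-way normalized cut can be evaluated directly from cluster indicator vectors. The sketch below is a minimal numpy version under our own naming conventions (W is a symmetric nonnegative similarity matrix, labels is an integer vector of cluster assignments); it is an illustration, not code from the paper.

```python
import numpy as np

def normalized_cut(W, labels, R):
    """R-way normalized cut: C(e, W) = sum_r e_r'(D - W) e_r / (e_r' D e_r)."""
    D = np.diag(W.sum(axis=1))
    total = 0.0
    for r in range(R):
        e_r = (labels == r).astype(float)   # indicator vector of the r-th cluster
        total += e_r @ (D - W) @ e_r / (e_r @ D @ e_r)
    return total
```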
27 Proof: The constraint (a) is equivalent to the existence of a matrix $\Lambda \in \mathbb{R}^{R \times R}$ such that $D^{-1/2} Y = (e_1, \dots, e_R)\Lambda = E\Lambda$. [sent-56, score-0.132]
28 The matrix $E^\top D E$ is diagonal, with elements $e_r^\top D e_r$, and is thus positive and invertible. [sent-61, score-0.314]
29 This in turn implies that $\operatorname{tr} Y^\top D^{-1/2} W D^{-1/2} Y = \operatorname{tr} \Lambda^\top E^\top W E \Lambda = \operatorname{tr} E^\top W E \Lambda\Lambda^\top = \operatorname{tr} E^\top W E (E^\top D E)^{-1}$, which is exactly the normalized cut (up to an additive constant). [sent-63, score-0.561]
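The trace expression above can be checked numerically against the normalized cut: writing E = (e_1, ..., e_R), one gets C(e, W) = R − tr E^T W E (E^T D E)^{-1}. A small sanity check, reusing normalized_cut from the sketch above (the data and partition are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
P, R = 12, 3
A = rng.random((P, P))
W = (A + A.T) / 2                      # symmetric nonnegative similarity matrix
labels = np.arange(P) % R              # an arbitrary partition with no empty cluster
D = np.diag(W.sum(axis=1))
E = np.stack([(labels == r).astype(float) for r in range(R)], axis=1)
lhs = normalized_cut(W, labels, R)
rhs = R - np.trace(E.T @ W @ E @ np.linalg.inv(E.T @ D @ E))
assert np.isclose(lhs, rhs)
```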
30 By removing the constraint (a), we obtain a relaxed optimization problem, whose solutions involve the eigenstructure of $D^{-1/2} W D^{-1/2}$ and which leads to the classical lower bound on the optimal normalized cut [8, 9]. [sent-64, score-0.346]
31 The following proposition gives the solution obtained from the relaxation (for the proof, see [11]): Proposition 2: The maximum of $\operatorname{tr} Y^\top D^{-1/2} W D^{-1/2} Y$ over matrices $Y \in \mathbb{R}^{P \times R}$ such that $Y^\top Y = I$ is the sum of the R largest eigenvalues of $D^{-1/2} W D^{-1/2}$. [sent-65, score-0.429]
32 It is attained at all Y of the form $Y = U B_1$ where $U \in \mathbb{R}^{P \times R}$ is any orthonormal basis of the R-th principal subspace of $D^{-1/2} W D^{-1/2}$ and $B_1$ is an arbitrary rotation matrix in $\mathbb{R}^{R \times R}$. [sent-66, score-0.349]
33 The solutions found by this relaxation will not in general be piecewise constant. [sent-67, score-0.179]
34 In order to obtain a piecewise constant solution, we wish to find a piecewise constant matrix that is as close as possible to one of the possible Y obtained from the eigendecomposition. [sent-68, score-0.336]
35 Since such matrices are defined up to a rotation matrix, it makes sense to compare the subspaces spanned by their columns. [sent-69, score-0.186]
36 This cost function characterizes the ability of the matrix W to produce the partition e when using its eigenvectors. [sent-75, score-0.418]
37 Minimizing with respect to e leads to a new clustering algorithm that we now present. [sent-76, score-0.393]
38 Minimizing with respect to the matrix for a given partition e leads to the learning of the similarity matrix, as we show in Section 4. [sent-77, score-0.674]
39 2.3 Minimizing with respect to the partition. In this section, we show that minimizing J(W, e) is equivalent to a weighted K-means algorithm. [sent-79, score-0.302]
40 The following theorem, inspired by the spectral relaxation of K-means presented in [8], shows that the cost function can be interpreted as a weighted distortion measure: Theorem 1: Let W be an affinity matrix and let $U = (u_1, \dots, u_P)$, [sent-80, score-0.844]
41 where $u_p \in \mathbb{R}^R$, be an orthonormal basis of the R-th principal subspace of $D^{-1/2} W D^{-1/2}$. [sent-83, score-0.217]
42 For any partition e ≡ A, we have $J(W, e) = \min_{(\mu_1, \dots, \mu_R) \in \mathbb{R}^{R \times R}} \sum_r \sum_{p \in A_r} d_p \| u_p d_p^{-1/2} - \mu_r \|^2$. [sent-84, score-0.347]
43 Proof: Let $D(\mu, A) = \sum_r \sum_{p \in A_r} d_p \| u_p d_p^{-1/2} - \mu_r \|^2$. [sent-88, score-0.458]
44 $\min_{(\mu_1, \dots, \mu_R) \in \mathbb{R}^{R \times R}} \sum_r \sum_{p \in A_r} d_p \| g_p d_p^{-1} - \mu_r \|^2 + R - \operatorname{tr} D^{-1/2} W D^{-1/2}$. [sent-94, score-0.325]
45 For all p, assign p to $A_r$ where $r = \arg\min_{r'} \| u_p d_p^{-1/2} - \mu_{r'} \|$. Output: partition A, distortion measure $\sum_r \sum_{p \in A_r} d_p \| u_p d_p^{-1/2} - \mu_r \|^2$. Figure 1: Spectral clustering algorithm. [sent-106, score-1.151]
46 $= \sum_p u_p^\top u_p - \sum_r \sum_{p, p' \in A_r} d_p^{1/2} d_{p'}^{1/2} u_p^\top u_{p'} / (e_r^\top D e_r) = R - \sum_r e_r^\top D^{1/2} U^\top U D^{1/2} e_r / (e_r^\top D e_r) = J(W, e)$. This theorem has an immediate algorithmic implication—to minimize the cost function J(W, e) with respect to the partition e, we can use a weighted K-means algorithm. [sent-107, score-1.218]
47 While K-means is often used heuristically as a post-processor for spectral clustering [13], our approach provides a mathematical foundation for the use of K-means, and yields a specific weighted form of K-means that is appropriate for the problem. [sent-109, score-0.714]
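A minimal sketch of the algorithm of Figure 1 is given below, assuming only a symmetric nonnegative similarity matrix W. We use numpy's dense eigendecomposition for the R-th principal subspace and a plain weighted K-means loop; the initialization, empty-cluster handling and stopping criteria are simplified choices of ours, not the paper's. Note that, as Theorem 1 dictates, the weights d_p enter only the cluster means, not the assignment step.

```python
import numpy as np

def spectral_cluster(W, R, n_iter=100, seed=0):
    """Spectral clustering via weighted K-means (sketch of Figure 1)."""
    d = W.sum(axis=1)                                     # degrees d_p
    d_inv_sqrt = 1.0 / np.sqrt(d)
    M = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]     # D^{-1/2} W D^{-1/2}
    _, eigvecs = np.linalg.eigh(M)
    U = eigvecs[:, -R:]                                   # P x R basis of the R-th principal subspace
    points = U * d_inv_sqrt[:, None]                      # rows are u_p d_p^{-1/2}
    rng = np.random.default_rng(seed)
    labels = rng.integers(R, size=W.shape[0])             # crude random initialization
    mu = np.empty((R, R))
    for _ in range(n_iter):
        for r in range(R):                                # weighted cluster means
            mask = labels == r
            if mask.any():
                mu[r] = np.average(points[mask], axis=0, weights=d[mask])
            else:                                         # re-seed an empty cluster
                mu[r] = points[rng.integers(len(points))]
        dists = np.linalg.norm(points[:, None, :] - mu[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)                 # assignment uses plain distances
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    distortion = float(sum(d[p] * np.sum((points[p] - mu[labels[p]]) ** 2) for p in range(len(d))))
    return labels, distortion
```

Usage would look like `labels, dist = spectral_cluster(W, R=2)`.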
48 2.4 Minimizing with respect to the similarity matrix. When the partition e is given, we can consider minimization with respect to W. [sent-111, score-0.716]
49 As we have suggested, intuitively this has the effect of yielding a matrix W such that the result of spectral clustering with that W is as close as possible to e. [sent-112, score-0.814]
50 We now make this notion precise, by showing that the cost function J(W, e) is an upper bound on the distance between the partition e and the result of spectral clustering using the similarity matrix W . [sent-113, score-1.374]
51 The following theorem shows that if we can perform weighted K-means exactly, we obtain a bound on the performance of our spectral clustering algorithm (for a proof, see [11]): Theorem 2: Let $\eta = \max_p D_{pp} / \min_p D_{pp} \geq 1$. [sent-115, score-0.743]
52 3 Approximation of the cost function. In order to minimize the cost function J(W, e) with respect to W, which is the topic of Section 4, we need to optimize a function of the R-th principal subspace of the matrix $D^{-1/2} W D^{-1/2}$. [sent-117, score-0.653]
53 We assume that $|\lambda_R| > |\lambda_{R+1}|$ so that the R-th principal subspace $E_R$ is well defined, with orthogonal projection $\Pi_R$. [sent-122, score-0.264]
54 The same method can be generalized to the computation of dominant eigensubspaces: if V is a matrix in $\mathbb{R}^{P \times R}$, the subspace generated by the R columns of $X^q V$ will tend to the principal eigensubspace of X. [sent-125, score-0.408]
55 Note that since we are interested only in subspaces, and in particular the orthogonal projection operators on those subspaces, we can choose any method for finding an orthonormal basis of $\mathrm{range}(X^q V)$. [sent-126, score-0.199]
56 of X, and we have $\|\Pi_R(X) - \Pi_R\|_2 \leq \eta^q / (1 - \eta^2)^{1/2}$, where $\eta$ denotes the multiplicative eigengap $|\lambda_{R+1}|/|\lambda_R|$. This proposition shows that as q tends to infinity, the range of $X^q V$ will tend to the principal eigensubspace. [sent-131, score-0.165]
57 The rate of convergence is determined by the (multiplicative) eigengap $|\lambda_{R+1}|/|\lambda_R| < 1$: it is usually hard to compute the principal subspace of matrices whose eigengap is close to one. [sent-132, score-0.548]
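A sketch of this subspace approximation follows, using a QR factorization to orthonormalize (our choice; as noted above, any orthonormal basis of range(X^q V) will do). Re-orthonormalizing at every step does not change the subspace in exact arithmetic but keeps the iteration numerically stable.

```python
import numpy as np

def approx_principal_subspace(X, R, q, seed=0):
    """Approximate the R-th principal subspace of a symmetric matrix X by q power iterations."""
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((X.shape[0], R))   # random P x R starting matrix
    for _ in range(q):
        V, _ = np.linalg.qr(X @ V)             # same range as X^q V, kept orthonormal
    Pi = V @ V.T                               # orthogonal projection onto the approximate subspace
    return V, Pi
```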
58 3.2 Potentially hard eigenvalue problems. In most of the literature on spectral clustering, it is taken for granted that the eigenvalue problem is easy to solve. [sent-136, score-0.456]
59 It turns out that in many situations, the (multiplicative) eigengap is very close to one, making the eigenvector computation difficult (examples are given in the next section). [sent-137, score-0.205]
60 The cost function that we use is the average error $F(W, \Pi_0(e)) = \frac{1}{2M} \sum_{m=1}^{M} \| B_m - \Pi_0 \|^2$. [sent-144, score-0.137]
61 This cost function F can be rewritten as the distance between the average of the Bm and Π0 plus the variance of the approximations, thus explicitly penalizing the non-convergence of the power iterations. [sent-145, score-0.191]
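Combining the pieces, the approximate cost can be sketched as below, reusing approx_principal_subspace from the previous sketch. We read B_m as the orthogonal projection produced by the m-th independent power-method run, and we build the target Π_0(e) as the projection onto span{D^{1/2} e_r} (following the proof of Proposition 1); both readings are our assumptions, since the exact definitions are not reproduced here.

```python
import numpy as np

def target_projection(W, labels, R):
    """Pi_0(e): orthogonal projection onto span{D^{1/2} e_r} (our construction)."""
    d = W.sum(axis=1)
    E = np.stack([np.sqrt(d) * (labels == r) for r in range(R)], axis=1)   # P x R
    Q, _ = np.linalg.qr(E)
    return Q @ Q.T

def approx_cost(W, Pi0, R, q, M=5, seed=0):
    """F(W, Pi_0) = (1 / 2M) * sum_m ||B_m - Pi_0||_F^2 over M power-method runs."""
    d = W.sum(axis=1)
    X = W / np.sqrt(np.outer(d, d))                                        # D^{-1/2} W D^{-1/2}
    errs = [np.linalg.norm(approx_principal_subspace(X, R, q, seed=seed + m)[1] - Pi0, 'fro') ** 2
            for m in range(M)]
    return 0.5 * float(np.mean(errs))
```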
62 3.3 Empirical comparisons. In this section, we study the ability of various cost functions to track the gold standard error measure in Eq. [sent-149, score-0.26]
63 (2) as we vary the parameter α in the similarity matrix $W_{pp'} = \exp(-\alpha \|x_p - x_{p'}\|^2)$. [sent-150, score-0.514]
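For reference, this similarity matrix can be built as follows (a tiny sketch; X holds one data point per row):

```python
import numpy as np

def gaussian_similarity(X, alpha):
    """W_{pp'} = exp(-alpha * ||x_p - x_p'||^2)."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-alpha * sq_dists)
```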
64 We study the cost function J(W, e), its approximation based on the power method presented in Section 3, and two existing approaches, one based on a Markov chain interpretation of spectral clustering [15] and one based on the alignment [16] of D−1/2 W D−1/2 and Π0 . [sent-151, score-0.912]
65 We carry out this experiment for the simple clustering example. (Footnote: the matrix $D^{-1/2} W D^{-1/2}$ always has 1 as its largest eigenvalue, with eigenvector $D^{1/2}\mathbf{1}$, and we could consider instead the (R − 1)-st principal subspace of $D^{-1/2} W D^{-1/2} - D^{1/2}\mathbf{1}\mathbf{1}^\top D^{1/2} / (\mathbf{1}^\top D \mathbf{1})$.) [sent-152, score-0.694]
66 Figure 2: Empirical comparison of cost functions (horizontal axes: log(α)). [sent-161, score-0.348]
67 (b) Eigengap of the similarity matrix as a function of α. [sent-163, score-0.469]
68 (c) Gold standard clustering error (solid), spectral cost function J (dotted) and its approximation based on the power method (dashed). [sent-164, score-0.876]
69 (d) Gold standard clustering error (solid), the alignment (dashed), and a Markov-chain-based cost, divided by 16 (dotted). [sent-165, score-0.342]
70 This apparently simple toy example captures much of the core difficulty of spectral clustering—nonlinear separability and thinness/sparsity of clusters (any point has very few near neighbors belonging to the same cluster, so that the weighted graph is sparse). [sent-167, score-0.548]
71 In particular, in Figure 2(b) we plot the eigengap of the similarity matrix as a function of α, noting that at the optimum, this gap is very close to one, and thus the eigenvalue problem is hard to solve. [sent-168, score-0.692]
72 In Figure 2(c) and (d), we plot the four cost functions against the gold standard. [sent-169, score-0.26]
73 5 on a log scale, and as seen in Figure 2(c), the minima of the new cost function and its approximation lie near to this value. [sent-171, score-0.172]
74 As seen in Figure 2(d), on the other hand, the other two cost functions show a poor match to the gold standard, and yield minima far from the optimum. [sent-172, score-0.26]
75 The problem with the alignment and Markov-chain-based cost functions is that these functions essentially measure the distance between the similarity matrix W (or a normalized version of W ) and a matrix T which (after permutation) is block-diagonal with constant blocks. [sent-173, score-0.886]
76 Unfortunately, in examples like the one in Figure 2, the optimal similarity matrix is very far from being block diagonal with constant blocks. [sent-174, score-0.537]
77 In the language of spectral graph partitioning, where we have a weighted graph with weights W , each cluster is a connected but very sparse graph. [sent-177, score-0.582]
78 Thus taking powers can be interpreted as “thickening” the graph to make the clusters more apparent, while not changing the eigenstructure of the matrix (taking powers of symmetric matrices only changes the eigenvalues, not the eigenvectors). [sent-181, score-0.619]
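The parenthetical remark is easy to verify numerically: for a symmetric matrix, an odd power raises the eigenvalues to that power while leaving the eigenvectors, and hence every principal subspace, unchanged. A quick check (our toy example, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
W = A + A.T                                          # symmetric test matrix
vals, vecs = np.linalg.eigh(W)
vals3, vecs3 = np.linalg.eigh(np.linalg.matrix_power(W, 3))
R = 5
assert np.allclose(vals3, vals ** 3)                 # eigenvalues are cubed
proj = vecs[:, -R:] @ vecs[:, -R:].T                 # projection onto the R-th principal subspace of W
proj3 = vecs3[:, -R:] @ vecs3[:, -R:].T              # same subspace for W^3
assert np.allclose(proj, proj3)
```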
79 4 Learning the similarity matrix We now turn to the problem of learning the similarity matrix from data. [sent-182, score-0.938]
80 We assume that we are given one or more sets of data for which the desired clustering is known. [sent-183, score-0.306]
81 The goal is to design a “similarity map,” that is, a mapping from datasets of elements in X to the space of symmetric matrices with nonnegative elements. [sent-184, score-0.213]
82 To turn this into a parametric learning problem, we focus on similarity matrices that are obtained as Gram matrices of a kernel function k(x, y) defined on X×X . [sent-185, score-0.505]
83 Each dataset is segmented, that is, for each n we know the partition en , so that the “target” matrix Π0 (en , α) can be computed for each dataset. [sent-196, score-0.326]
84 The cost function that we use is $H(\alpha) = \frac{1}{N} \sum_n F(W_n(\alpha), \Pi_0(e_n, \alpha)) + C\|\alpha\|_1$. [sent-198, score-0.137]
85 Since the complexity of the cost function increases with q, we start the minimization with small q and gradually increase q up to its maximum value. [sent-201, score-0.17]
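A minimal sketch of the resulting learning loop, reusing target_projection and approx_cost from the sketches above. Our simplifying assumptions: a single training dataset (N = 1), a diagonally scaled Gaussian kernel W_{pp'}(α) = exp(−Σ_d α_d (x_{pd} − x_{p'd})²), a finite-difference gradient in place of the paper's analytical one, and placeholder values for the step size, the penalty C, and the schedule of q.

```python
import numpy as np

def diag_gaussian_similarity(X, alpha):
    """W_{pp'} = exp(-sum_d alpha_d (x_pd - x_p'd)^2), one nonnegative scale per feature."""
    diff2 = (X[:, None, :] - X[None, :, :]) ** 2
    return np.exp(-(diff2 * alpha).sum(axis=-1))

def H(alpha, X, labels, R, q, C=0.01):
    """Regularized objective for one training dataset: F(W(alpha), Pi_0) + C * ||alpha||_1."""
    W = diag_gaussian_similarity(X, alpha)
    Pi0 = target_projection(W, labels, R)
    return approx_cost(W, Pi0, R, q) + C * np.abs(alpha).sum()

def learn_alpha(X, labels, R, alpha0, q_schedule=(5, 20, 50), lr=0.1, steps=30, eps=1e-4):
    """Descend on H with a finite-difference gradient, increasing q gradually."""
    alpha = np.array(alpha0, dtype=float)
    for q in q_schedule:                       # small q first, then refine with larger q
        for _ in range(steps):
            h0 = H(alpha, X, labels, R, q)
            grad = np.zeros_like(alpha)
            for i in range(alpha.size):
                a = alpha.copy()
                a[i] += eps
                grad[i] = (H(a, X, labels, R, q) - h0) / eps
            alpha = np.maximum(alpha - lr * grad, 0.0)   # keep the scales nonnegative
    return alpha
```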
86 In order to cluster previously unseen datasets, we compute the similarity matrix W and use the algorithm of Figure 1. [sent-205, score-0.569]
87 This yields the real number λ such that the weighted distortion obtained after application of the spectral clustering algorithm of Figure 1, with the similarity matrices defined by λα, is minimum. [sent-207, score-1.175]
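One way to implement this tuning, as we read it, is a simple line search over the scale λ: build W(λα) for a grid of λ (the grid below is our choice), run the clustering algorithm of Figure 1, and keep the λ with the smallest weighted distortion. The sketch reuses diag_gaussian_similarity and spectral_cluster from the earlier sketches.

```python
import numpy as np

def tune_scale(X, alpha, R, lambdas=tuple(np.logspace(-1, 1, 9))):
    """Pick the lambda whose similarity matrix W(lambda * alpha) yields the smallest distortion."""
    best_distortion, best_lambda, best_labels = np.inf, None, None
    for lam in lambdas:
        W = diag_gaussian_similarity(X, lam * np.asarray(alpha))
        labels, distortion = spectral_cluster(W, R)
        if distortion < best_distortion:
            best_distortion, best_lambda, best_labels = distortion, lam, labels
    return best_lambda, best_labels
```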
88 4.2 Simulations. We performed simulations on synthetic datasets in two dimensions, where we consider datasets similar to the one in Figure 2, with two rings whose relative distance is constant across samples (but whose relative orientation is random). [sent-209, score-0.234]
89 The goal is thus to learn the diagonal scale $\alpha \in \mathbb{R}^{D+2}$ of a Gaussian kernel that leads to the best clustering on unseen data. [sent-211, score-0.448]
90 We learn α from N sample datasets (N = 1 or 10), and compute the clustering error of our algorithm with and without adaptive tuning of the norm of α during testing (as described in Section 4. [sent-212, score-0.505]
91 Without feature selection, the performance of spectral clustering degrades very rapidly when the number of irrelevant features increases, while our learning approach is very robust, even with only one training dataset. [sent-216, score-0.732]
92 5 Conclusion We have presented two algorithms—one for spectral clustering and one for learning the similarity matrix. [sent-217, score-0.987]
93 These algorithms can be derived as the minimization of a single cost function with respect to its two arguments. [sent-218, score-0.218]
94 This cost function depends directly on the eigenstructure of the similarity matrix. [sent-219, score-0.604]
95 We have shown that it can be approximated efficiently using the power method, yielding a method for learning similarity matrices that can cluster effectively in cases in which non-adaptive approaches fail. [sent-220, score-0.535]
96 Note in particular that our new approach yields a spectral clustering method that is significantly more robust to irrelevant features than current methods. [sent-221, score-0.732]
97 Table 1: Performance on synthetic datasets: clustering errors (multiplied by 100) for method without learning (but with tuning) and for our learning method with and without tuning, with N = 1 or 10 training datasets; D is the number of irrelevant features. [sent-223, score-0.388]
98 The number of points in such datasets can be very large and we have developed efficient implementations of both learning and clustering based on sparsity and low-rank approximations [11]. [sent-247, score-0.464]
99 Distance metric learning, with application to clustering with side-information. [sent-264, score-0.348]
100 Spectral relaxation models and structure analysis for K-way graph clustering and bi-clustering. [sent-325, score-0.49]
wordName wordTfidf (topN-words)
[('spectral', 0.344), ('similarity', 0.337), ('clustering', 0.306), ('ar', 0.291), ('dp', 0.229), ('er', 0.182), ('rp', 0.141), ('cost', 0.137), ('eigengap', 0.135), ('matrix', 0.132), ('fs', 0.131), ('eigenstructure', 0.13), ('relaxation', 0.127), ('rr', 0.123), ('gold', 0.123), ('partition', 0.118), ('rf', 0.103), ('subspaces', 0.102), ('cut', 0.099), ('tr', 0.096), ('proposition', 0.092), ('subspace', 0.089), ('datasets', 0.086), ('der', 0.085), ('matrices', 0.084), ('clusters', 0.083), ('irrelevant', 0.082), ('dpp', 0.078), ('eigensubspace', 0.078), ('wpp', 0.078), ('normalized', 0.078), ('principal', 0.073), ('minimizing', 0.072), ('nity', 0.065), ('weighted', 0.064), ('partitions', 0.063), ('partitioning', 0.061), ('cuts', 0.061), ('cluster', 0.06), ('orthogonal', 0.057), ('bm', 0.057), ('graph', 0.057), ('eigenvalue', 0.056), ('orthonormal', 0.055), ('power', 0.054), ('segmentation', 0.052), ('piecewise', 0.052), ('af', 0.052), ('tuning', 0.052), ('respect', 0.048), ('nips', 0.048), ('projection', 0.045), ('powers', 0.045), ('en', 0.045), ('xp', 0.045), ('bach', 0.045), ('fbach', 0.045), ('ding', 0.045), ('qr', 0.045), ('zha', 0.045), ('symmetric', 0.043), ('operators', 0.042), ('metric', 0.042), ('speech', 0.042), ('differentiable', 0.04), ('unseen', 0.04), ('distortion', 0.04), ('leads', 0.039), ('disjoint', 0.039), ('points', 0.039), ('gu', 0.038), ('vm', 0.038), ('eigenvector', 0.038), ('optimize', 0.037), ('columns', 0.036), ('alignment', 0.036), ('diag', 0.036), ('approximation', 0.035), ('berkeley', 0.034), ('segmented', 0.034), ('acknowledge', 0.034), ('dn', 0.034), ('diagonal', 0.034), ('constant', 0.034), ('minimization', 0.033), ('approximations', 0.033), ('wn', 0.033), ('pn', 0.033), ('close', 0.032), ('norm', 0.032), ('characterizes', 0.031), ('proof', 0.031), ('dataset', 0.031), ('jordan', 0.03), ('eigenvalues', 0.03), ('shi', 0.029), ('theorem', 0.029), ('subsets', 0.029), ('learn', 0.029), ('simulations', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 107 nips-2003-Learning Spectral Clustering
Author: Francis R. Bach, Michael I. Jordan
Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive a new cost function for spectral clustering based on a measure of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing this cost function with respect to the partition leads to a new spectral clustering algorithm. Minimizing with respect to the similarity matrix leads to an algorithm for learning the similarity matrix. We develop a tractable approximation of our cost function that is based on the power method of computing eigenvectors. 1
2 0.23521434 46 nips-2003-Clustering with the Connectivity Kernel
Author: Bernd Fischer, Volker Roth, Joachim M. Buhmann
Abstract: Clustering aims at extracting hidden structure in dataset. While the problem of finding compact clusters has been widely studied in the literature, extracting arbitrarily formed elongated structures is considered a much harder problem. In this paper we present a novel clustering algorithm which tackles the problem by a two step procedure: first the data are transformed in such a way that elongated structures become compact ones. In a second step, these new objects are clustered by optimizing a compactness-based criterion. The advantages of the method over related approaches are threefold: (i) robustness properties of compactness-based criteria naturally transfer to the problem of extracting elongated structures, leading to a model which is highly robust against outlier objects; (ii) the transformed distances induce a Mercer kernel which allows us to formulate a polynomial approximation scheme to the generally N Phard clustering problem; (iii) the new method does not contain free kernel parameters in contrast to methods like spectral clustering or mean-shift clustering. 1
3 0.21521248 152 nips-2003-Pairwise Clustering and Graphical Models
Author: Noam Shental, Assaf Zomet, Tomer Hertz, Yair Weiss
Abstract: Significant progress in clustering has been achieved by algorithms that are based on pairwise affinities between the datapoints. In particular, spectral clustering methods have the advantage of being able to divide arbitrarily shaped clusters and are based on efficient eigenvector calculations. However, spectral methods lack a straightforward probabilistic interpretation which makes it difficult to automatically set parameters using training data. In this paper we use the previously proposed typical cut framework for pairwise clustering. We show an equivalence between calculating the typical cut and inference in an undirected graphical model. We show that for clustering problems with hundreds of datapoints exact inference may still be possible. For more complicated datasets, we show that loopy belief propagation (BP) and generalized belief propagation (GBP) can give excellent results on challenging clustering problems. We also use graphical models to derive a learning algorithm for affinity matrices based on labeled data. 1
4 0.17265552 73 nips-2003-Feature Selection in Clustering Problems
Author: Volker Roth, Tilman Lange
Abstract: A novel approach to combining clustering and feature selection is presented. It implements a wrapper strategy for feature selection, in the sense that the features are directly selected by optimizing the discriminative power of the used partitioning algorithm. On the technical side, we present an efficient optimization algorithm with guaranteed local convergence property. The only free parameter of this method is selected by a resampling-based stability analysis. Experiments with real-world datasets demonstrate that our method is able to infer both meaningful partitions and meaningful subsets of features. 1
5 0.16854426 150 nips-2003-Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering
Author: Yoshua Bengio, Jean-françcois Paiement, Pascal Vincent, Olivier Delalleau, Nicolas L. Roux, Marie Ouimet
Abstract: Several unsupervised learning algorithms based on an eigendecomposition provide either an embedding or a clustering only for given training points, with no straightforward extension for out-of-sample examples short of recomputing eigenvectors. This paper provides a unified framework for extending Local Linear Embedding (LLE), Isomap, Laplacian Eigenmaps, Multi-Dimensional Scaling (for dimensionality reduction) as well as for Spectral Clustering. This framework is based on seeing these algorithms as learning eigenfunctions of a data-dependent kernel. Numerical experiments show that the generalizations performed have a level of error comparable to the variability of the embedding algorithms due to the choice of training data. 1
6 0.16266951 111 nips-2003-Learning the k in k-means
7 0.14125849 87 nips-2003-Identifying Structure across Pre-partitioned Data
8 0.13625133 82 nips-2003-Geometric Clustering Using the Information Bottleneck Method
9 0.12948997 48 nips-2003-Convex Methods for Transduction
10 0.12920998 126 nips-2003-Measure Based Regularization
11 0.11832587 162 nips-2003-Probabilistic Inference of Speech Signals from Phaseless Spectrograms
12 0.10813574 24 nips-2003-An Iterative Improvement Procedure for Hierarchical Clustering
13 0.10742399 128 nips-2003-Minimax Embeddings
14 0.10673018 113 nips-2003-Learning with Local and Global Consistency
15 0.10256885 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms
16 0.094879553 114 nips-2003-Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA
17 0.090047002 190 nips-2003-Unsupervised Color Decomposition Of Histologically Stained Tissue Samples
18 0.085499726 173 nips-2003-Semi-supervised Protein Classification Using Cluster Kernels
19 0.081618354 92 nips-2003-Information Bottleneck for Gaussian Variables
20 0.077406339 174 nips-2003-Semidefinite Relaxations for Approximate Inference on Graphs with Cycles
topicId topicWeight
[(0, -0.27), (1, -0.18), (2, -0.064), (3, 0.116), (4, -0.155), (5, 0.227), (6, -0.156), (7, 0.19), (8, -0.049), (9, 0.057), (10, 0.007), (11, -0.066), (12, 0.014), (13, -0.062), (14, -0.116), (15, 0.018), (16, 0.044), (17, -0.025), (18, -0.07), (19, -0.019), (20, -0.009), (21, -0.124), (22, 0.059), (23, -0.12), (24, -0.112), (25, -0.029), (26, -0.008), (27, 0.127), (28, 0.017), (29, -0.007), (30, 0.042), (31, -0.039), (32, -0.041), (33, -0.091), (34, -0.05), (35, -0.037), (36, 0.02), (37, 0.089), (38, 0.011), (39, 0.064), (40, -0.025), (41, 0.132), (42, 0.024), (43, 0.047), (44, 0.001), (45, 0.001), (46, 0.06), (47, 0.102), (48, 0.047), (49, -0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.98017824 107 nips-2003-Learning Spectral Clustering
Author: Francis R. Bach, Michael I. Jordan
Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive a new cost function for spectral clustering based on a measure of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing this cost function with respect to the partition leads to a new spectral clustering algorithm. Minimizing with respect to the similarity matrix leads to an algorithm for learning the similarity matrix. We develop a tractable approximation of our cost function that is based on the power method of computing eigenvectors. 1
2 0.86749423 46 nips-2003-Clustering with the Connectivity Kernel
Author: Bernd Fischer, Volker Roth, Joachim M. Buhmann
Abstract: Clustering aims at extracting hidden structure in dataset. While the problem of finding compact clusters has been widely studied in the literature, extracting arbitrarily formed elongated structures is considered a much harder problem. In this paper we present a novel clustering algorithm which tackles the problem by a two step procedure: first the data are transformed in such a way that elongated structures become compact ones. In a second step, these new objects are clustered by optimizing a compactness-based criterion. The advantages of the method over related approaches are threefold: (i) robustness properties of compactness-based criteria naturally transfer to the problem of extracting elongated structures, leading to a model which is highly robust against outlier objects; (ii) the transformed distances induce a Mercer kernel which allows us to formulate a polynomial approximation scheme to the generally N Phard clustering problem; (iii) the new method does not contain free kernel parameters in contrast to methods like spectral clustering or mean-shift clustering. 1
3 0.6711064 111 nips-2003-Learning the k in k-means
Author: Greg Hamerly, Charles Elkan
Abstract: When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. In this paper we present an improved algorithm for learning k while clustering. The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian. Two key advantages are that the hypothesis test does not limit the covariance of the data and does not compute a full covariance matrix. Additionally, G-means only requires one intuitive parameter, the standard statistical significance level α. We present results from experiments showing that the algorithm works well, and better than a recent method based on the BIC penalty for model complexity. In these experiments, we show that the BIC is ineffective as a scoring function, since it does not penalize strongly enough the model’s complexity. 1
4 0.64298648 73 nips-2003-Feature Selection in Clustering Problems
Author: Volker Roth, Tilman Lange
Abstract: A novel approach to combining clustering and feature selection is presented. It implements a wrapper strategy for feature selection, in the sense that the features are directly selected by optimizing the discriminative power of the used partitioning algorithm. On the technical side, we present an efficient optimization algorithm with guaranteed local convergence property. The only free parameter of this method is selected by a resampling-based stability analysis. Experiments with real-world datasets demonstrate that our method is able to infer both meaningful partitions and meaningful subsets of features. 1
5 0.60244709 152 nips-2003-Pairwise Clustering and Graphical Models
Author: Noam Shental, Assaf Zomet, Tomer Hertz, Yair Weiss
Abstract: Significant progress in clustering has been achieved by algorithms that are based on pairwise affinities between the datapoints. In particular, spectral clustering methods have the advantage of being able to divide arbitrarily shaped clusters and are based on efficient eigenvector calculations. However, spectral methods lack a straightforward probabilistic interpretation which makes it difficult to automatically set parameters using training data. In this paper we use the previously proposed typical cut framework for pairwise clustering. We show an equivalence between calculating the typical cut and inference in an undirected graphical model. We show that for clustering problems with hundreds of datapoints exact inference may still be possible. For more complicated datasets, we show that loopy belief propagation (BP) and generalized belief propagation (GBP) can give excellent results on challenging clustering problems. We also use graphical models to derive a learning algorithm for affinity matrices based on labeled data. 1
6 0.57949263 150 nips-2003-Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering
7 0.5516457 48 nips-2003-Convex Methods for Transduction
8 0.53375846 87 nips-2003-Identifying Structure across Pre-partitioned Data
9 0.51875657 82 nips-2003-Geometric Clustering Using the Information Bottleneck Method
10 0.49310806 24 nips-2003-An Iterative Improvement Procedure for Hierarchical Clustering
11 0.48517993 71 nips-2003-Fast Embedding of Sparse Similarity Graphs
12 0.46616361 190 nips-2003-Unsupervised Color Decomposition Of Histologically Stained Tissue Samples
13 0.45589083 128 nips-2003-Minimax Embeddings
14 0.42856815 113 nips-2003-Learning with Local and Global Consistency
15 0.42639554 114 nips-2003-Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA
16 0.42383537 126 nips-2003-Measure Based Regularization
17 0.41794586 173 nips-2003-Semi-supervised Protein Classification Using Cluster Kernels
18 0.40070117 86 nips-2003-ICA-based Clustering of Genes from Microarray Expression Data
19 0.37860799 179 nips-2003-Sparse Representation and Its Applications in Blind Source Separation
20 0.37697861 63 nips-2003-Error Bounds for Transductive Learning via Compression and Clustering
topicId topicWeight
[(0, 0.055), (11, 0.02), (29, 0.013), (30, 0.035), (35, 0.052), (53, 0.131), (66, 0.03), (68, 0.032), (69, 0.011), (71, 0.068), (72, 0.129), (76, 0.082), (85, 0.043), (91, 0.163), (99, 0.037)]
simIndex simValue paperId paperTitle
same-paper 1 0.91960102 107 nips-2003-Learning Spectral Clustering
Author: Francis R. Bach, Michael I. Jordan
Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive a new cost function for spectral clustering based on a measure of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing this cost function with respect to the partition leads to a new spectral clustering algorithm. Minimizing with respect to the similarity matrix leads to an algorithm for learning the similarity matrix. We develop a tractable approximation of our cost function that is based on the power method of computing eigenvectors. 1
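As a point of reference for the power-method approximation mentioned at the end of this abstract, the snippet below shows plain power iteration on a normalized affinity matrix D^{-1/2} W D^{-1/2}; it illustrates only the generic eigenvector computation, not the paper's cost function or learning algorithm, and the Gaussian affinity with σ = 1 is an arbitrary choice for the toy example.

```python
import numpy as np

def normalized_affinity(W):
    """D^{-1/2} W D^{-1/2} for a symmetric nonnegative affinity matrix W."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    return W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def power_iteration(M, n_steps=100, seed=0):
    """Approximate the leading eigenvector of M by repeated multiplication."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(M.shape[0])
    for _ in range(n_steps):
        v = M @ v
        v /= np.linalg.norm(v)
    return v

# Toy usage: Gaussian affinities on two well-separated 2-D blobs.
X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5.0])
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-sq / (2.0 * 1.0 ** 2))
v = power_iteration(normalized_affinity(W))
```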
2 0.86293012 145 nips-2003-Online Classification on a Budget
Author: Koby Crammer, Jaz Kandola, Yoram Singer
Abstract: Online algorithms for classification often require vast amounts of memory and computation time when employed in conjunction with kernel functions. In this paper we describe and analyze a simple approach for an on-the-fly reduction of the number of past examples used for prediction. Experiments performed with real datasets show that using the proposed algorithmic approach with a single epoch is competitive with the support vector machine (SVM) although the latter, being a batch algorithm, accesses each training example multiple times. 1 Introduction and Motivation Kernel-based methods are widely being used for data modeling and prediction because of their conceptual simplicity and outstanding performance on many real-world tasks. The support vector machine (SVM) is a well known algorithm for finding kernel-based linear classifiers with maximal margin [7]. The kernel trick can be used to provide an effective method to deal with very high dimensional feature spaces as well as to model complex input phenomena via embedding into inner product spaces. However, despite generalization error being upper bounded by a function of the margin of a linear classifier, it is notoriously difficult to implement such classifiers efficiently. Empirically this often translates into very long training times. A number of alternative algorithms exist for finding a maximal margin hyperplane many of which have been inspired by Rosenblatt’s Perceptron algorithm [6] which is an on-line learning algorithm for linear classifiers. The work on SVMs has inspired a number of modifications and enhancements to the original Perceptron algorithm. These incorporate the notion of margin to the learning and prediction processes whilst exhibiting good empirical performance in practice. Examples of such algorithms include the Relaxed Online Maximum Margin Algorithm (ROMMA) [4], the Approximate Maximal Margin Classification Algorithm (ALMA) [2], and the Margin Infused Relaxed Algorithm (MIRA) [1] which can be used in conjunction with kernel functions. A notable limitation of kernel based methods is their computational complexity since the amount of computer memory that they require to store the so called support patterns grows linearly with the number prediction errors. A number of attempts have been made to speed up the training and testing of SVM’s by enforcing a sparsity condition. In this paper we devise an online algorithm that is not only sparse but also generalizes well. To achieve this goal our algorithm employs an insertion and deletion process. Informally, it can be thought of as revising the weight vector after each example on which a prediction mistake has been made. Once such an event occurs the algorithm adds the new erroneous example (the insertion phase), and then immediately searches for past examples that appear to be redundant given the recent addition (the deletion phase). As we describe later, making this adjustment to the algorithm allows us to modify the standard online proof techniques so as to provide a bound on the total number of examples the algorithm keeps. This paper is organized as follows. In Sec. 2 we formalize the problem setting and provide a brief outline of our method for obtaining a sparse set of support patterns in an online setting. In Sec. 3 we present both theoretical and algorithmic details of our approach and provide a bound on the number of support patterns that constitute the cache. Sec. 
4 provides experimental details, evaluated on three real world datasets, to illustrate the performance and merits of our sparse online algorithm. We end the paper with conclusions and ideas for future work. 2 Problem Setting and Algorithms This work focuses on online additive algorithms for classification tasks. In such problems we are typically given a stream of instance-label pairs (x1 , y1 ), . . . , (xt , yt ), . . .. we assume that each instance is a vector xt ∈ Rn and each label belongs to a finite set Y. In this and the next section we assume that Y = {−1, +1} but relax this assumption in Sec. 4 where we describe experiments with datasets consisting of more than two labels. When dealing with the task of predicting new labels, thresholded linear classifiers of the form h(x) = sign(w · x) are commonly employed. The vector w is typically represented as a weighted linear combination of the examples, namely w = t αt yt xt where αt ≥ 0. The instances for which αt > 0 are referred to as support patterns. Under this assumption, the output of the classifier solely depends on inner-products of the form x · xt the use of kernel functions can easily be employed simply by replacing the standard scalar product with a function K(·, ·) which satisfies Mercer conditions [7]. The resulting classification rule takes the form h(x) = sign(w · x) = sign( t αt yt K(x, xt )). The majority of additive online algorithms for classification, for example the well known Perceptron [6], share a common algorithmic structure. These online algorithms typically work in rounds. On the tth round, an online algorithm receives an instance xt , computes the inner-products st = i 0. The various online algorithms differ in the way the values of the parameters βt , αt and ct are set. A notable example of an online algorithm is the Perceptron algorithm [6] for which we set βt = 0, αt = 1 and ct = 1. More recent algorithms such as the Relaxed Online Maximum Margin Algorithm (ROMMA) [4] the Approximate Maximal Margin Classification Algorithm (ALMA) [2] and the Margin Infused Relaxed Algorithm (MIRA) [1] can also be described in this framework although the constants βt , αt and ct are not as simple as the ones employed by the Perceptron algorithm. An important computational consideration needs to be made when employing kernel functions for machine learning tasks. This is because the amount of memory required to store the so called support patterns grows linearly with the number prediction errors. In Input: Tolerance β. Initialize: Set ∀t αt = 0 , w0 = 0 , C0 = ∅. Loop: For t = 1, 2, . . . , T • Get a new instance xt ∈ Rn . • Predict yt = sign (yt (xt · wt−1 )). ˆ • Get a new label yt . • if yt (xt · wt−1 ) ≤ β update: 1. Insert Ct ← Ct−1 ∪ {t}. 2. Set αt = 1. 3. Compute wt ← wt−1 + yt αt xt . 4. DistillCache(Ct , wt , (α1 , . . . , αt )). Output : H(x) = sign(wT · x). Figure 1: The aggressive Perceptron algorithm with a variable-size cache. this paper we shift the focus to the problem of devising online algorithms which are budget-conscious as they attempt to keep the number of support patterns small. The approach is attractive for at least two reasons. Firstly, both the training time and classification time can be reduced significantly if we store only a fraction of the potential support patterns. Secondly, a classier with a small number of support patterns is intuitively ”simpler”, and hence are likely to exhibit good generalization properties rather than complex classifiers with large numbers of support patterns. 
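The additive online structure described above (a weight vector kept implicitly as a combination of support patterns, kernel evaluations for prediction, and an insertion step whenever the margin is violated) can be sketched as follows; this is a hedged illustration of the general recipe rather than the paper's exact algorithm, and the RBF kernel and default β are illustrative choices.

```python
import numpy as np

def rbf(x, z, gamma=0.5):
    """A standard RBF kernel, used here purely for illustration."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

class AggressiveKernelPerceptron:
    """Online additive classifier: w is an implicit sum of support patterns."""
    def __init__(self, kernel=rbf, beta=0.0):
        self.kernel, self.beta = kernel, beta
        self.support_x, self.support_y, self.alpha = [], [], []

    def score(self, x):
        # f(x) = sum_t alpha_t * y_t * K(x, x_t)
        return sum(a * y * self.kernel(x, s)
                   for a, y, s in zip(self.alpha, self.support_y, self.support_x))

    def predict(self, x):
        return 1 if self.score(x) >= 0 else -1

    def fit_one(self, x, y):
        # Insert x as a support pattern whenever the margin is violated.
        if y * self.score(x) <= self.beta:
            self.support_x.append(x)
            self.support_y.append(y)
            self.alpha.append(1.0)
```

Setting beta=0 recovers the classical Perceptron update; a positive beta gives the aggressive variant that also inserts examples classified correctly but with too small a margin.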
(See for instance [7] for formal results connecting the number of support patterns to the generalization error.) In Sec. 3 we present a formal analysis and Input: C, w, (α1 , . . . , αt ). the algorithmic details of our approach. Loop: Let us now provide a general overview • Choose i ∈ C such that of how to restrict the number of support β ≤ yi (w − αi yi xi ). patterns in an online setting. Denote by Ct the indices of patterns which consti• if no such i exists then return. tute the classification vector wt . That is, • Remove the example i : i ∈ Ct if and only if αi > 0 on round 1. αi = 0. t when xt is received. The online classification algorithms discussed above keep 2. w ← w − αi yi xi . enlarging Ct – once an example is added 3. C ← C/{i} to Ct it will never be deleted. However, Return : C, w, (α1 , . . . , αt ). as the online algorithm receives more examples, the performance of the classifier Figure 2: DistillCache improves, and some of the past examples may have become redundant and hence can be removed. Put another way, old examples may have been inserted into the cache simply due the lack of support patterns in early rounds. As more examples are observed, the old examples maybe replaced with new examples whose location is closer to the decision boundary induced by the online classifier. We thus add a new stage to the online algorithm in which we discard a few old examples from the cache Ct . We suggest a modification of the online algorithm structure as follows. Whenever yt i 0. Then the number of support patterns constituting the cache is at most S ≤ (R2 + 2β)/γ 2 . Proof: The proof of the theorem is based on the mistake bound of the Perceptron algorithm [5]. To prove the theorem we bound wT 2 from above and below and compare the 2 t bounds. Denote by αi the weight of the ith example at the end of round t (after stage 4 of the algorithm). Similarly, we denote by αi to be the weight of the ith example on round ˜t t after stage 3, before calling the DistillCache (Fig. 2) procedure. We analogously ˜ denote by wt and wt the corresponding instantaneous classifiers. First, we derive a lower bound on wT 2 by bounding the term wT · u from below in a recursive manner. T αt yt (xt · u) wT · u = t∈CT T αt = γ S . ≥ γ (1) t∈CT We now turn to upper bound wT 2 . Recall that each example may be added to the cache and removed from the cache a single time. Let us write wT 2 as a telescopic sum, wT 2 = ( wT 2 ˜ − wT 2 ˜ ) + ( wT 2 − wT −1 2 ˜ ) + . . . + ( w1 2 − w0 2 ) . (2) We now consider three different scenarios that may occur for each new example. The first case is when we did not insert the tth example into the cache at all. In this case, ˜ ( wt 2 − wt−1 2 ) = 0. The second scenario is when an example is inserted into the cache and is never discarded in future rounds, thus, ˜ wt 2 = wt−1 + yt xt 2 = wt−1 2 + 2yt (wt−1 · xt ) + xt 2 . Since we inserted (xt , yt ), the condition yt (wt−1 · xt ) ≤ β must hold. Combining this ˜ with the assumption that the examples are enclosed in a ball of radius R we get, ( wt 2 − wt−1 2 ) ≤ 2β + R2 . The last scenario occurs when an example is inserted into the cache on some round t, and is then later on removed from the cache on round t + p for p > 0. As in the previous case we can bound the value of summands in Equ. (2), ˜ ( wt 2 − wt−1 2 ) + ( wt+p 2 ˜ − wt+p 2 ) Input: Tolerance β, Cache Limit n. Initialize: Set ∀t αt = 0 , w0 = 0 , C0 = ∅. Loop: For t = 1, 2, . . . , T • Get a new instance xt ∈ Rn . • Predict yt = sign (yt (xt · wt−1 )). 
ˆ • Get a new label yt . • if yt (xt · wt−1 ) ≤ β update: 1. If |Ct | = n remove one example: (a) Find i = arg maxj∈Ct {yj (wt−1 − αj yj xj )}. (b) Update wt−1 ← wt−1 − αi yi xi . (c) Remove Ct−1 ← Ct−1 /{i} 2. Insert Ct ← Ct−1 ∪ {t}. 3. Set αt = 1. 4. Compute wt ← wt−1 + yt αt xt . Output : H(x) = sign(wT · x). Figure 3: The aggressive Perceptron algorithm with as fixed-size cache. ˜ = 2yt (wt−1 · xt ) + xt 2 − 2yt (wt+p · xt ) + xt ˜ = 2 [yt (wt−1 · xt ) − yt ((wt+p − yt xt ) · xt )] ˜ ≤ 2 [β − yt ((wt+p − yt xt ) · xt )] . 2 ˜ Based on the form of the cache update we know that yt ((wt+p − yt xt ) · xt ) ≥ β, and thus, ˜ ˜ ( wt 2 − wt−1 2 ) + ( wt+p 2 − wt+p 2 ) ≤ 0 . Summarizing all three cases we see that only the examples which persist in the cache contribute a factor of R2 + 2β each to the bound of the telescopic sum of Equ. (2) and the rest of the examples do contribute anything to the bound. Hence, we can bound the norm of wT as follows, wT 2 ≤ S R2 + 2β . (3) We finish up the proof by applying the Cauchy-Swartz inequality and the assumption u = 1. Combining Equ. (1) and Equ. (3) we get, γ 2 S 2 ≤ (wT · u)2 ≤ wT 2 u 2 ≤ S(2β + R2 ) , which gives the desired bound. 4 Experiments In this section we describe the experimental methods that were used to compare the performance of standard online algorithms with the new algorithm described above. We also describe shortly another variant that sets a hard limit on the number of support patterns. The experiments were designed with the aim of trying to answer the following questions. First, what is effect of the number of support patterns on the generalization error (measured in terms of classification accuracy on unseen data), and second, would the algorithm described in Fig. 2 be able to find an optimal cache size that is able to achieve the best generalization performance. To examine each question separately we used a modified version of the algorithm described by Fig. 2 in which we restricted ourselves to have a fixed bounded cache. This modified algorithm (which we refer to as the fixed budget Perceptron) Name mnist letter usps No. of Training Examples 60000 16000 7291 No. of Test Examples 10000 4000 2007 No. of Classes 10 26 10 No. of Attributes 784 16 256 Table 1: Description of the datasets used in experiments. simulates the original Perceptron algorithm with one notable difference. When the number of support patterns exceeds a pre-determined limit, it chooses a support pattern from the cache and discards it. With this modification the number of support patterns can never exceed the pre-determined limit. This modified algorithm is described in Fig. 3. The algorithm deletes the example which seemingly attains the highest margin after the removal of the example itself (line 1(a) in Fig. 3). Despite the simplicity of the original Perceptron algorithm [6] its good generalization performance on many datasets is remarkable. During the last few year a number of other additive online algorithms have been developed [4, 2, 1] that have shown better performance on a number of tasks. In this paper, we have preferred to embed these ideas into another online algorithm and start with a higher baseline performance. We have chosen to use the Margin Infused Relaxed Algorithm (MIRA) as our baseline algorithm since it has exhibited good generalization performance in previous experiments [1] and has the additional advantage that it is designed to solve multiclass classification problem directly without any recourse to performing reductions. 
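To make the fixed-budget variant concrete, the sketch below adds the deletion rule described above to the hypothetical AggressiveKernelPerceptron from the earlier sketch: when the cache is full, the cached example whose margin remains largest after removing its own contribution is evicted. It is an illustration of the stated heuristic, not a reproduction of the authors' code.

```python
class BudgetKernelPerceptron(AggressiveKernelPerceptron):
    """Same aggressive update, but the number of support patterns never
    exceeds a fixed budget: the 'safest-looking' pattern is evicted first."""
    def __init__(self, kernel=rbf, beta=0.0, budget=100):
        super().__init__(kernel, beta)
        self.budget = budget

    def _self_excluded_margin(self, j):
        # Margin of cached example j once its own term is removed from f.
        xj, yj, aj = self.support_x[j], self.support_y[j], self.alpha[j]
        return yj * (self.score(xj) - aj * yj * self.kernel(xj, xj))

    def fit_one(self, x, y):
        if y * self.score(x) > self.beta:
            return                                  # no margin violation, no update
        if len(self.support_x) >= self.budget:
            # Evict the cached example with the largest self-excluded margin,
            # i.e. the one the classifier seems to need least.
            j = max(range(len(self.support_x)), key=self._self_excluded_margin)
            for lst in (self.support_x, self.support_y, self.alpha):
                lst.pop(j)
        self.support_x.append(x)
        self.support_y.append(y)
        self.alpha.append(1.0)
```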
The algorithms were evaluated on three natural datasets: mnist, usps and letter (mnist is available from http://www.research.att.com/˜yann, usps from ftp.kyb.tuebingen.mpg.de, and letter from http://www.ics.uci.edu/˜mlearn/MLRepository.html). The characteristics of these datasets are summarized in Table 1. A comprehensive overview of the performance of various algorithms on these datasets can be found in a recent paper [2]. Since all of the algorithms that we have evaluated are online, it is not implausible for the specific ordering of examples to affect the generalization performance. We thus report results averaged over 11 random permutations for usps and letter and over 5 random permutations for mnist. No free parameter optimization was carried out and instead we simply used the values reported in [1]. More specifically, the margin parameter was set to β = 0.01 for all algorithms and for all datasets. A homogeneous polynomial kernel of degree 9 was used when training on the mnist and usps data sets, and an RBF kernel for the letter data set. (The variance of the RBF kernel was identical to the one used in [1].) We evaluated the performance of four algorithms in total. The first algorithm was the standard MIRA online algorithm, which does not incorporate any budget constraints. The second algorithm is the version of MIRA described in Fig. 3, which uses a fixed limited budget. Here we enumerated the cache size limit in each experiment we performed. The different sizes that we tested are dataset dependent, but for each dataset we evaluated at least 10 different sizes. We would like to note that such an enumeration cannot be done in an online fashion, and the goal of employing the algorithm with a fixed-size cache is to underscore the merit of the truly adaptive algorithm. The third algorithm is the version of MIRA described in Fig. 2 that adapts the cache size during the running of the algorithm. We also report additional results for a multiclass version of the SVM [1]. Whilst this algorithm is not online and during the training process it considers all the examples at once, it serves as our gold-standard algorithm against which we want to compare performance.
[Figure 4 (plot data omitted): Results on the three data sets – mnist (left), usps (center) and letter (right). The rows show, from top to bottom, test error, online training errors, and training margin errors, each plotted against the number of support patterns for the Fixed, Adaptive, SVM and MIRA variants. Each point in a plot designates the test error (y-axis) vs. the number of support patterns used (x-axis). Four algorithms are compared: SVM, MIRA, MIRA with a fixed cache size, and MIRA with a variable cache size.]
Note that for the multiclass SVM we report the results using the best set of parameters, which does not coincide with the set of parameters used for the online training. The results are summarized in Fig. 4. This figure is composed of three different plots organized in columns. Each of these plots corresponds to a different dataset – mnist (left), usps (center) and letter (right). In each of the three plots the x-axis designates the number of support patterns the algorithm uses. The results for the fixed-size cache are connected with a line to emphasize the dependency of performance on the size of the cache. The top row of the three columns shows the generalization error. Thus the y-axis designates the test error of an algorithm on unseen data at the end of training. Looking at the error of the algorithm with a fixed-size cache reveals that there is a broad range of cache sizes for which the algorithm exhibits good performance. In fact, for MNIST and USPS there are sizes for which the test error of the algorithm is better than the SVM's test error. Naturally, we cannot fix the correct size in hindsight, so the question is whether the algorithm with a variable cache size is a viable automatic size-selection method. Analyzing each of the datasets in turn reveals that this is indeed the case – the algorithm obtains a very similar number of support patterns and test error when compared to the SVM method. The results are somewhat less impressive for the letter dataset, which contains fewer examples per class. One possible explanation is that the algorithm had fewer chances to modify and distill the cache. Nonetheless, overall the results are remarkable given that all the online algorithms make a single pass through the data, and the variable-size method finds a very good cache size while remaining comparable to the SVM in terms of performance. The MIRA algorithm, which does not incorporate any form of example insertion or deletion in its algorithmic structure, obtains the poorest level of performance, not only in terms of generalization error but also in terms of the number of support patterns. The plot of online training error against the number of support patterns, in row 2 of Fig. 4, can be considered a good on-the-fly validation of generalization performance. As the plots indicate, for the fixed and adaptive versions of the algorithm, on all the datasets, a low online training error translates into good generalization performance. Comparing the test error plots with the online error plots, we see a close similarity between the qualitative behavior of the two errors. Hence, one can use the online error, which is easy to evaluate, to choose a good cache size for the fixed-size algorithm. The third row gives the online training margin errors, which translate directly to the number of insertions into the cache. Here we see that the good test error and compactness of the algorithm with a variable cache size come with a price. Namely, the algorithm makes significantly more insertions into the cache than the fixed-size version of the algorithm. However, as the upper two sets of plots indicate, the surplus in insertions is later taken care of by excess deletions, and the end result is very good overall performance.
In summary, the online algorithm with a variable cache and the SVM obtain similar levels of generalization and similar numbers of support patterns. While the SVM is still somewhat better in both respects on the letter dataset, the online algorithm is much simpler to implement and performs a single sweep through the training data.
5 Summary
We have described and analyzed a new sparse online algorithm that attempts to deal with the computational problems implicit in classification algorithms such as the SVM. The proposed method was empirically tested, and its performance, in terms of both the size of the resulting classifier and its error rate, is comparable to the SVM. There are a few possible extensions and enhancements. We are currently looking at alternative criteria for the deletion of examples from the cache. For instance, the weight of examples might relay information on their importance for accurate classification. Incorporating prior knowledge into the insertion and deletion scheme might also prove important. We hope that such enhancements will make the proposed approach a viable alternative to the SVM and other batch algorithms. Acknowledgements: The authors would like to thank John Shawe-Taylor for many helpful comments and discussions. This research was partially funded by the EU project KerMIT No. IST-2000-25341.
References
[1] K. Crammer and Y. Singer. Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3:951–991, 2003.
[2] C. Gentile. A new approximate maximal margin classification algorithm. Journal of Machine Learning Research, 2:213–242, 2001.
[3] W. Krauth and M. Mézard. Learning algorithms with optimal stability in neural networks. Journal of Physics A, 20:745, 1987.
[4] Y. Li and P. M. Long. The relaxed online maximum margin algorithm. Machine Learning, 46(1–3):361–387, 2002.
[5] A. B. J. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, volume XII, pages 615–622, 1962.
[6] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958. (Reprinted in Neurocomputing, MIT Press, 1988.)
[7] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.
3 0.84321576 152 nips-2003-Pairwise Clustering and Graphical Models
Author: Noam Shental, Assaf Zomet, Tomer Hertz, Yair Weiss
Abstract: Significant progress in clustering has been achieved by algorithms that are based on pairwise affinities between the datapoints. In particular, spectral clustering methods have the advantage of being able to divide arbitrarily shaped clusters and are based on efficient eigenvector calculations. However, spectral methods lack a straightforward probabilistic interpretation which makes it difficult to automatically set parameters using training data. In this paper we use the previously proposed typical cut framework for pairwise clustering. We show an equivalence between calculating the typical cut and inference in an undirected graphical model. We show that for clustering problems with hundreds of datapoints exact inference may still be possible. For more complicated datasets, we show that loopy belief propagation (BP) and generalized belief propagation (GBP) can give excellent results on challenging clustering problems. We also use graphical models to derive a learning algorithm for affinity matrices based on labeled data. 1
4 0.83596766 81 nips-2003-Geometric Analysis of Constrained Curves
Author: Anuj Srivastava, Washington Mio, Xiuwen Liu, Eric Klassen
Abstract: We present a geometric approach to statistical shape analysis of closed curves in images. The basic idea is to specify a space of closed curves satisfying given constraints, and exploit the differential geometry of this space to solve optimization and inference problems. We demonstrate this approach by: (i) defining and computing statistics of observed shapes, (ii) defining and learning a parametric probability model on shape space, and (iii) designing a binary hypothesis test on this space. 1
5 0.83328855 73 nips-2003-Feature Selection in Clustering Problems
Author: Volker Roth, Tilman Lange
Abstract: A novel approach to combining clustering and feature selection is presented. It implements a wrapper strategy for feature selection, in the sense that the features are directly selected by optimizing the discriminative power of the used partitioning algorithm. On the technical side, we present an efficient optimization algorithm with guaranteed local convergence property. The only free parameter of this method is selected by a resampling-based stability analysis. Experiments with real-world datasets demonstrate that our method is able to infer both meaningful partitions and meaningful subsets of features. 1
6 0.83188242 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems
7 0.83018899 78 nips-2003-Gaussian Processes in Reinforcement Learning
8 0.82744813 79 nips-2003-Gene Expression Clustering with Functional Mixture Models
9 0.82630509 30 nips-2003-Approximability of Probability Distributions
10 0.82405132 55 nips-2003-Distributed Optimization in Adaptive Networks
11 0.821311 98 nips-2003-Kernel Dimensionality Reduction for Supervised Learning
12 0.81936139 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning
13 0.81858742 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model
14 0.81720167 113 nips-2003-Learning with Local and Global Consistency
15 0.81643456 86 nips-2003-ICA-based Clustering of Genes from Microarray Expression Data
16 0.8137877 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data
17 0.81359041 43 nips-2003-Bounded Invariance and the Formation of Place Fields
18 0.81251878 112 nips-2003-Learning to Find Pre-Images
19 0.81103355 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions
20 0.80925429 191 nips-2003-Unsupervised Context Sensitive Language Acquisition from a Large Corpus