nips nips2006 nips2006-7 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mingrui Wu, Bernhard Schölkopf
Abstract: We present a local learning approach for clustering. The basic idea is that a good clustering result should have the property that the cluster label of each data point can be well predicted based on its neighboring data and their cluster labels, using current supervised learning methods. An optimization problem is formulated such that its solution has the above property. Relaxation and eigen-decomposition are applied to solve this optimization problem. We also briefly investigate the parameter selection issue and provide a simple parameter selection method for the proposed algorithm. Experimental results are provided to validate the effectiveness of the proposed approach. 1
Reference: text
sentIndex sentText sentNum sentScore
1 The basic idea is that a good clustering result should have the property that the cluster label of each data point can be well predicted based on its neighboring data and their cluster labels, using current supervised learning methods. [sent-6, score-1.1]
2 We also briefly investigate the parameter selection issue and provide a simple parameter selection method for the proposed algorithm. [sent-9, score-0.156]
3 1 Introduction In the multi-class clustering problem, we are given n data points, x1 , . . . , xn . [sent-11, score-0.419]
4 The goal is to partition the given data xi (1 ≤ i ≤ n) into c clusters, such that different clusters are in some sense “distinct” from each other. [sent-15, score-0.262]
5 Here xi ∈ X ⊆ Rd is the input data and X is the input space. [sent-16, score-0.154]
6 Many clustering algorithms have been proposed, including the traditional k-means algorithm and the currently very popular spectral clustering approach [3, 10]. [sent-19, score-1.022]
7 Recently the spectral clustering approach has attracted increasing attention due to its promising performance and easy implementation. [sent-20, score-0.657]
8 In spectral clustering, the eigenvectors of a matrix are used to reveal the cluster structure in the data. [sent-21, score-0.532]
9 In this paper, we propose a clustering method that also has this characteristic. [sent-22, score-0.396]
10 Namely, the cluster label of each data point should be well estimated based on its neighboring data and their cluster labels, using current supervised learning methods. [sent-24, score-0.601]
11 As will be seen later, the proposed algorithm is also easy to implement while showing better performance than the spectral clustering approach in the experiments. [sent-27, score-0.68]
12 The local learning idea has already been successfully applied in supervised learning problems [1]. [sent-28, score-0.231]
13 Adapting valuable supervised learning ideas for unsupervised learning problems can be fruitful. [sent-30, score-0.162]
14 For example, in [9] the idea of large margin, which has proved effective in supervised learning, is applied to the clustering problem and good results are obtained. [sent-31, score-0.569]
15 The details of our local learning based clustering algorithm are presented in section 3. [sent-34, score-0.483]
16 2 Notations In the following, “neighboring points” or “neighbors” of xi simply refers to the nearest neighbors of xi according to some distance metric. [sent-37, score-0.365]
17 Ni denotes the set of neighboring points of xi , 1 ≤ i ≤ n, not including xi itself. [sent-42, score-0.372]
18 Clustering via Local Learning Local Learning in Supervised Learning In supervised learning algorithms, a model is trained with all the labeled training data and is then used to predict the labels of unseen test data. [sent-47, score-0.189]
19 These algorithms can be called global learning algorithms as the whole training dataset is used for training. [sent-48, score-0.109]
20 In contrast, in local learning algorithms [1], for a given test data point, a model is built only with its neighboring training data, and then the label of the given test point is predicted by this locally learned model. [sent-49, score-0.279]
21 It has been reported that local learning algorithms often outperform global ones [1] as the local models are trained only with the points that are related to the particular test data. [sent-50, score-0.197]
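To make the local learning idea concrete, here is a minimal sketch of the supervised version described above: for each test point, a model is fitted only on its k nearest training neighbors and then used to predict that single point. This is our illustration, not code from [1]; the function name, the choice of ridge regression on one-hot labels, and the parameter values are assumptions.

```python
import numpy as np

def local_learning_predict(X_train, y_train, X_test, k=10, lam=0.1):
    """For each test point, fit a ridge model on its k nearest training
    neighbors (one-vs-all on one-hot labels) and predict from that local model."""
    classes = np.unique(y_train)
    Y = (y_train[:, None] == classes[None, :]).astype(float)  # one-hot labels
    preds = []
    for x in X_test:
        # indices of the k nearest training points (Euclidean distance)
        idx = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
        Xn, Yn = X_train[idx], Y[idx]
        # ridge regression fitted only on the local neighborhood
        W = np.linalg.solve(Xn.T @ Xn + lam * np.eye(Xn.shape[1]), Xn.T @ Yn)
        preds.append(classes[np.argmax(x @ W)])
    return np.array(preds)
```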
22 2 Representation of Clustering Results The procedure of our clustering approach largely follows that of the clustering algorithms proposed in [2, 10]. [sent-53, score-0.792]
23 We also use a Partition Matrix (PM) P = [pil ] ∈ {0, 1}n×c to represent a clustering scheme. [sent-54, score-0.396]
24 Namely pil = 1 if xi (1 ≤ i ≤ n) is assigned to cluster Cl (1 ≤ l ≤ c), otherwise pil = 0. [sent-55, score-0.466]
25 As in [2, 10], instead of computing the PM directly to cluster the given data, we compute a Scaled Partition Matrix (SPM) F defined by: F = P(P⊤P)−1/2 . [sent-57, score-0.187]
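As an aside, the PM-to-SPM step is essentially a column scaling once one notes that P⊤P is the diagonal matrix of cluster sizes; the following small numpy sketch (ours, under the definitions above) builds F from hard cluster assignments.

```python
import numpy as np

def scaled_partition_matrix(assignments, c):
    """assignments: length-n array with values in {0, ..., c-1}.
    Returns the SPM F = P (P^T P)^{-1/2} for the corresponding PM P."""
    n = len(assignments)
    P = np.zeros((n, c))
    P[np.arange(n), assignments] = 1.0           # p_il = 1 iff x_i is assigned to cluster C_l
    sizes = P.sum(axis=0)                        # P^T P is diag(cluster sizes)
    # floor of 1 only guards empty clusters so the sketch stays runnable
    return P / np.sqrt(np.maximum(sizes, 1.0))
```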
26 3 Basic Idea The good performance of local learning methods indicates that the label of a data point can be well estimated based on its neighbors. [sent-69, score-0.174]
27 Details on how to compute oli(xi) will be given later. [sent-75, score-0.343]
28 For the function oli(·), the superscript l indicates that it is for the l-th cluster, and the subscript i means the KM is trained with the neighbors of xi . [sent-76, score-0.579]
29 Hence apart from xi , the training data {(xj , fjl)}xj∈Ni also influence the value of oli(xi). [sent-77, score-0.678]
30 Note that fjl (xj ∈ Ni ) are also variables of the problem (3)–(4). [sent-78, score-0.135]
31 For a data point xi and a cluster Cl , given the values of fjl at xj ∈ Ni , what should be the proper value of fil at xi ? [sent-80, score-0.825]
32 In particular, we can build a KM with the training data {(xj , fjl)}xj∈Ni . [sent-82, score-0.206]
33 As mentioned before, let oli(·) denote the output function of this locally learned KM; then the good performance of local learning methods mentioned above implies that oli(xi) is probably a good guess of fil , i.e. the proper fil should be similar to oli(xi). [sent-83, score-1.807]
34 Therefore, a good SPM F should have the following property: for any xi (1 ≤ i ≤ n) and any cluster Cl (1 ≤ l ≤ c), the value of fil can be well estimated based on the neighbors of xi . [sent-84, score-0.822]
35 That is, fil should be similar to the output of the KM that is trained locally with the data {(xj , fjl)}xj∈Ni . [sent-85, score-0.492]
36 A good clustering method will put the data into well separated clusters. [sent-88, score-0.442]
37 This implies that it is easy to predict the cluster membership of a point based on its neighbors. [sent-89, score-0.197]
38 If, on the other hand, a cluster is split in the middle, then there will be points at the boundary for which it is hard to predict which cluster they belong to. [sent-90, score-0.332]
39 So minimizing the objective function (3) favors the clustering schemes that do not split the same group of data into different clusters. [sent-91, score-0.443]
40 Moreover, it is very difficult to construct local clustering algorithms in the same way as for supervised learning. [sent-92, score-0.555]
41 In [1], a local learning algorithm is obtained by running a standard supervised algorithm on a local training set. [sent-93, score-0.269]
42 4 Computing oli(xi) Having explained the basic idea, now we have to make the problem (3)–(4) more specific to build a concrete clustering algorithm. [sent-97, score-0.764]
43 So we consider, based on xi and {(xj , fjl)}xj∈Ni , how to compute oli(xi) with kernel learning algorithms. [sent-98, score-0.421]
44 In general, any kernel learning algorithm can be applied to compute the coefficients βij. [sent-100, score-0.111]
45 To this end, we adopt the Kernel Ridge Regression (KRR) algorithm [6], with which we can obtain an analytic expression of oli(xi) based on {(xj , fjl)}xj∈Ni . [sent-102, score-0.478]
46 Thus for each xi , we need to solve the following KRR training problem: min_{βil ∈ Rni} λ(βil)⊤Ki βil + ‖Ki βil − fil‖² (6), where βil ∈ Rni is the vector of the expansion coefficients, i.e. βil = [βij] for xj ∈ Ni . [sent-103, score-0.474]
47 Here λ > 0 is the regularization parameter, fil ∈ Rni denotes the vector of fjl for xj ∈ Ni , and Ki ∈ Rni×ni is the kernel matrix over xj ∈ Ni , namely Ki = [K(xu , xv )] for xu , xv ∈ Ni . [sent-105, score-1.193]
48 Solving problem (6) leads to βil = (Ki + λI)−1 fil . [sent-106, score-0.268]
49 Substituting it into (5), we have oli(xi) = ki (Ki + λI)−1 fil (7), where ki ∈ Rni denotes the vector [K(xi , xj )] for xj ∈ Ni . [sent-107, score-0.343]
50 Equation (7) can be written as a linear equation: oli(xi) = αi fil (8), where αi ∈ Rni is computed as αi = ki (Ki + λI)−1 (9). It can be seen that αi is independent of fil and the cluster index l, and it is different for different xi . [sent-108, score-1.345]
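The closed form (7)–(9) is straightforward to implement; below is a minimal sketch (ours) of computing αi for a single xi , assuming a pairwise kernel callable such as the cosine or Gaussian kernels defined later in this section.

```python
import numpy as np

def local_alpha(x_i, neighbors, kernel, lam=0.1):
    """Compute alpha_i = k_i (K_i + lam I)^{-1} as in (9).
    neighbors: array of shape (n_i, d) holding the points in N_i;
    kernel(A, B): pairwise kernel matrix between rows of A and rows of B."""
    Ki = kernel(neighbors, neighbors)            # n_i x n_i kernel matrix over N_i
    ki = kernel(x_i[None, :], neighbors)[0]      # vector [K(x_i, x_j)] for x_j in N_i
    n_i = len(neighbors)
    # K_i + lam I is symmetric, so solving (K_i + lam I) a = k_i gives alpha_i
    # (as a column vector) without forming the inverse explicitly
    return np.linalg.solve(Ki + lam * np.eye(n_i), ki)
```

Stacking these αi vectors row by row, zero-padded outside Ni , is how we read the construction of the sparse matrix A in (10); equation (10) itself is not reproduced in this excerpt.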
51 Similarly to αi , the matrix A is also independent of f l and the cluster index l. [sent-110, score-0.237]
52 6 Discretization: Obtaining the Final Clustering Result According to [10], to get the final clustering result, we need to find a true SPM F which is close to the subspace (16). [sent-116, score-0.396]
53 In this paper, we adopt the approach in [10] to get the final clustering result. [sent-123, score-0.396]
54 7 Comparison with Spectral Clustering Our Local Learning based Clustering Algorithm (LLCA) also uses the eigenvectors of a matrix (T in (13)) to reveal the cluster structure in the data; therefore it can be regarded as belonging to the category of spectral clustering approaches. [sent-125, score-0.886]
55 The matrix whose eigenvectors are used for clustering plays the key role in spectral clustering. [sent-126, score-0.739]
56 In LLCA, this matrix is computed based on the local learning idea: a clustering result is obtained based on whether the label of each point can be well estimated based on its neighbors with a well-established supervised learning algorithm. [sent-127, score-0.763]
57 This is different from the graph partitioning based spectral clustering method. [sent-128, score-0.672]
58 As will be seen later, LLCA and spectral clustering have quite different performance in the experiments. [sent-129, score-0.626]
59 6) are the same as in the spectral clustering approach. [sent-135, score-0.626]
60 According to equation (13), to compute T, we need to compute the matrix A in (10), which in turn requires calculating αi in (9) for each xi . [sent-136, score-0.194]
61 In practice, just like in the spectral clustering method, the number of neighbors ni is usually set to a fixed small value k for all xi in LLCA. [sent-138, score-1.103]
62 We conclude that LLCA is easy to implement and that, in practice, the main computational load is computing the eigenvectors of T; therefore LLCA and the spectral clustering approach have the same order of time complexity in most practical cases. [sent-143, score-0.73]
63 1 4 Experimental Results In this section, we empirically compare LLCA with the spectral clustering approach of [10] as well as with k-means clustering. [sent-144, score-0.626]
64 6), we use the same code contained in the implementation of the spectral clustering algorithm, available at http://www. [sent-147, score-0.626]
65 1 Sometimes we are also interested in a special case: ni = n − 1 for all xi , i.e. Ni contains all the data points except xi . [sent-157, score-0.42]
66 In this case, it can be proved that T = Q⊤Q, where Q = (Diag(B))−1 B with B = I − K(K + λI)−1 , where K is the kernel matrix over all the data points. [sent-160, score-0.18]
67 This is the same as computing the eigenvectors of the non-sparse matrix T. [sent-162, score-0.134]
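Since the ni = n − 1 special case above is given in closed form, it admits a compact sketch (ours). We take the c eigenvectors of T associated with the smallest eigenvalues as the relaxed SPM before discretization; that "smallest eigenvalues" choice is our reading of the minimization and is not stated explicitly in this excerpt.

```python
import numpy as np

def llca_full_neighborhood(K, lam, c):
    """Special case n_i = n - 1: T = Q^T Q with Q = Diag(B)^{-1} B,
    B = I - K (K + lam I)^{-1}, where K is the kernel matrix over all points."""
    n = K.shape[0]
    B = np.eye(n) - K @ np.linalg.inv(K + lam * np.eye(n))
    Q = B / np.diag(B)[:, None]                  # left-multiply by Diag(B)^{-1}
    T = Q.T @ Q
    # relaxed solution: eigenvectors of T for the c smallest eigenvalues
    eigvals, eigvecs = np.linalg.eigh(T)         # eigh returns ascending eigenvalues
    return eigvecs[:, :c]                        # columns span the subspace used for discretization
```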
68 2 Performance Measure In the experiments, we set the number of clusters equal to the number of classes c for all the clustering algorithms. [sent-175, score-0.438]
69 The value calculated in (21) is used as a performance measure for the given clustering result. [sent-183, score-0.42]
70 To compute it for a clustering result, we need to build a permutation mapping function map(·) that maps each cluster index to a true class label. [sent-188, score-0.643]
71 The classification error based on map(·) can then be computed as err = 1 − (1/n) Σi=1..n δ(yi , map(ci )), where yi and ci are the true class label and the obtained cluster index of xi respectively, and δ(x, y) is the delta function that equals 1 if x = y and 0 otherwise. [sent-189, score-0.48]
72 The clustering error is defined as the minimal classification error among all possible permutation mappings. [sent-190, score-0.421]
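Enumerating all permutation mappings costs c!, but the same minimum can be obtained with the Hungarian algorithm on the cluster-versus-class contingency table (valid here since the number of clusters is set equal to the number of classes). A sketch of this standard trick (ours, not from the paper) follows, assuming integer-coded labels in {0, ..., c-1}.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_error(y_true, y_pred, c):
    """Minimal classification error over all mappings of cluster indices to class labels."""
    # contingency[l, j] = number of points in cluster l whose true class is j
    contingency = np.zeros((c, c), dtype=int)
    for ci, yi in zip(y_pred, y_true):
        contingency[ci, yi] += 1
    # maximizing matched points == minimizing the negated contingency cost
    row, col = linear_sum_assignment(-contingency)
    matched = contingency[row, col].sum()
    return 1.0 - matched / len(y_true)
```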
73 3 Parameter Selection In the spectral clustering algorithm, first a graph of n nodes is constructed, each node of which corresponds to a data point; the clustering problem is then converted into a graph partition problem. [sent-193, score-1.138]
74 In the experiments, for the spectral clustering algorithm, a weighted k-nearest neighbor graph is employed, where k is a parameter searched over the grid: k ∈ {5, 10, 20, 40, 80}. [sent-194, score-0.736]
75 On this graph, the edge weight between two connected data points is computed with a kernel function; the following two kernel functions are tried in the experiments. [sent-195, score-0.222]
76 The cosine kernel: K1(xi , xj ) = xi⊤xj / (‖xi‖‖xj‖) (22), and the Gaussian kernel: K2(xi , xj ) = exp(−(1/γ)‖xi − xj‖²) (23). The parameter γ in (23) is searched in γ ∈ {σ0²/16, σ0²/8, σ0²/4, σ0²/2, σ0², 2σ0², 4σ0², 8σ0², 16σ0²}, where σ0 is the mean norm of the given data xi , 1 ≤ i ≤ n. [sent-196, score-1.516]
77 For LLCA, the cosine function (22) and the Gaussian function (23) are also adopted respectively as the kernel function in (5). [sent-197, score-0.367]
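For reference, a direct numpy transcription (ours) of the two kernels (22) and (23); these are also the kind of pairwise-kernel callables assumed by the αi sketch above.

```python
import numpy as np

def cosine_kernel(A, B):
    """K1(x, y) = x.y / (||x|| ||y||), for all pairs of rows of A and B."""
    An = A / np.maximum(np.linalg.norm(A, axis=1, keepdims=True), 1e-12)
    Bn = B / np.maximum(np.linalg.norm(B, axis=1, keepdims=True), 1e-12)
    return An @ Bn.T

def gaussian_kernel(A, B, gamma):
    """K2(x, y) = exp(-||x - y||^2 / gamma), for all pairs of rows of A and B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / gamma)   # clamp tiny negatives from round-off
```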
78 The number of neighbors ni for all xi is set to a single value k. [sent-198, score-0.477]
79 Automatic parameter selection for unsupervised learning is still a difficult problem. [sent-204, score-0.122]
80 For a clustering result obtained with a set of parameters, which in our case consists of k and λ when the cosine kernel (22) is used, or k, γ and λ when the Gaussian kernel (23) is used, we compute its corresponding SPM F and then use the objective value (11) as the evaluation criterion. [sent-206, score-0.728]
81 Namely, the clustering result corresponding to the smallest objective value is finally selected for LLCA. [sent-207, score-0.444]
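The selection rule above amounts to a grid search scored by the objective value; a sketch (ours) is given below. The callables run_llca and llca_objective are hypothetical placeholders for the clustering step and for objective (11), which is not reproduced in this excerpt, and the λ (and γ) grids shown are illustrative only.

```python
import itertools

def select_llca_result(X, c, run_llca, llca_objective,
                       ks=(5, 10, 20, 40, 80),
                       lams=(1e-3, 1e-2, 1e-1, 1.0),
                       gammas=(None,)):
    """Grid search: keep the clustering whose SPM F gives the smallest objective (11).
    run_llca(X, c, k, lam, gamma) -> (labels, F); llca_objective(F, X, k, lam, gamma) -> float.
    Both callables are hypothetical placeholders, not functions from the paper.
    Pass a gamma grid only when the Gaussian kernel is used."""
    best = None
    for k, lam, gamma in itertools.product(ks, lams, gammas):
        labels, F = run_llca(X, c, k=k, lam=lam, gamma=gamma)
        obj = llca_objective(F, X, k=k, lam=lam, gamma=gamma)
        if best is None or obj < best[0]:
            best = (obj, labels, (k, lam, gamma))
    return best   # (smallest objective, selected labels, selected parameters)
```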
82 For simplicity, on each dataset, we will just report the best result of spectral clustering. [sent-208, score-0.254]
83 For LLCA, both the best result (LLCA1) and the one obtained with the above parameter selection method (LLCA2) will be provided. [sent-209, score-0.102]
84 No parameter selection is needed for the k-means algorithm, since the number of clusters is given. [sent-210, score-0.12]
85 The results on News4a and News4b datasets show that different kernels may lead to dramatically different performance for both spectral clustering and LLCA. [sent-213, score-0.668]
86 For spectral clustering, the results on USPS-3568 are also significantly different for different kernels. [sent-214, score-0.23]
87 It can also be observed that different performance measures may result in different performance ranks of the clustering algorithms being investigated. [sent-215, score-0.42]
88 This is reflected by the results on USPS-3568 when the cosine kernel is used and the results on News4b when the Gaussian kernel is used. [sent-216, score-0.284]
89 Despite all these phenomena, we can still see from Table 2 that both LLCA1 and LLCA2 outperform the spectral clustering and the k-means algorithm in most cases. [sent-217, score-0.626]
90 We can also see that LLCA2 fails to find good parameters on News4a and News4b when the Gaussian kernel is used, while in the remaining cases, LLCA2 is either slightly worse than or identical to LLCA1. [sent-218, score-0.111]
91 Analogously to LLCA1, LLCA2 also improves the results of the spectral clustering and the k-means algorithm on most datasets. [sent-219, score-0.626]
92 Finally, it can be seen that the k-means algorithm is worse than spectral clustering, except on USPS-3568 with respect to the clustering error criterion when the cosine kernel is used for spectral clustering. [sent-221, score-1.052]
93 This corroborates the advantage of the popular spectral clustering approach over the traditional k-means algorithm. [sent-222, score-0.647]
94 5 Conclusion We have proposed a local learning approach for clustering, where an optimization problem is formulated leading to a solution with the property that the label of each data point can be well estimated based on its neighboring data and their cluster labels. Table 2: Clustering results. [sent-223, score-0.244]
95 Both the normalized mutual information and the clustering error are provided. [sent-224, score-0.433]
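Both reported measures are easy to reproduce for a given labeling; a small sketch (ours) using scikit-learn's NMI, whose normalization may differ in detail from the paper's definition (21), and reusing the clustering_error sketch given earlier.

```python
from sklearn.metrics import normalized_mutual_info_score

def report_measures(y_true, y_pred, c):
    """Return (NMI, clustering error) for integer-coded true classes and cluster indices."""
    nmi = normalized_mutual_info_score(y_true, y_pred)
    err = clustering_error(y_true, y_pred, c)   # see the Hungarian-algorithm sketch above
    return nmi, err
```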
96 Two kernel functions (22) and (23) are tried for both spectral clustering and LLCA. [sent-225, score-0.737]
97 On each dataset, the best result of the spectral clustering algorithm is reported (Spec-Clst). [sent-226, score-0.65]
98 For LLCA, both the best result (LLCA1) and the one obtained with the parameter selection method described before (LLCA2) are provided. [sent-227, score-0.102]
99 [Table 2, flattened in extraction: columns report NMI and clustering error (%) under the cosine and Gaussian kernels for Spec-Clst, LLCA1, LLCA2 and k-means on each dataset (USPS-3568, ...); the numeric entries are not recoverable here.] [sent-230, score-0.216]
100 We have also provided a parameter selection method for the proposed clustering algorithm. [sent-319, score-0.474]
wordName wordTfidf (topN-words)
[('clustering', 0.396), ('ol', 0.343), ('llca', 0.342), ('fil', 0.268), ('ni', 0.266), ('spectral', 0.23), ('spm', 0.191), ('xj', 0.171), ('cluster', 0.166), ('xi', 0.154), ('fj', 0.135), ('nmi', 0.127), ('ki', 0.117), ('pm', 0.113), ('cl', 0.113), ('cosine', 0.108), ('km', 0.102), ('rni', 0.098), ('umist', 0.098), ('supervised', 0.095), ('kernel', 0.088), ('nh', 0.085), ('pil', 0.073), ('eigenvectors', 0.073), ('nl', 0.072), ('local', 0.064), ('neighboring', 0.064), ('krr', 0.064), ('searched', 0.058), ('tf', 0.058), ('neighbors', 0.057), ('selection', 0.051), ('rc', 0.051), ('discretization', 0.046), ('documents', 0.046), ('equals', 0.044), ('partition', 0.043), ('xv', 0.042), ('datasets', 0.042), ('dataset', 0.042), ('clusters', 0.042), ('label', 0.041), ('locally', 0.041), ('matrix', 0.04), ('aij', 0.039), ('optimization', 0.038), ('xu', 0.038), ('mutual', 0.037), ('rn', 0.034), ('relaxation', 0.033), ('index', 0.031), ('belonging', 0.031), ('easy', 0.031), ('ij', 0.03), ('property', 0.03), ('usps', 0.03), ('proved', 0.029), ('solve', 0.029), ('covering', 0.028), ('orthogonal', 0.027), ('diag', 0.027), ('namely', 0.027), ('scaled', 0.027), ('parameter', 0.027), ('bottou', 0.026), ('idea', 0.026), ('graph', 0.025), ('build', 0.025), ('permutation', 0.025), ('formulated', 0.025), ('trained', 0.025), ('topics', 0.025), ('result', 0.024), ('objective', 0.024), ('calculated', 0.024), ('implement', 0.023), ('reveal', 0.023), ('tried', 0.023), ('training', 0.023), ('data', 0.023), ('mentioned', 0.023), ('learning', 0.023), ('good', 0.023), ('handwritten', 0.022), ('subject', 0.022), ('proper', 0.022), ('af', 0.021), ('capacity', 0.021), ('corroborates', 0.021), ('dover', 0.021), ('larson', 0.021), ('neufeld', 0.021), ('pute', 0.021), ('statements', 0.021), ('strehl', 0.021), ('computing', 0.021), ('unsupervised', 0.021), ('digits', 0.021), ('partitioning', 0.021), ('global', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 7 nips-2006-A Local Learning Approach for Clustering
Author: Mingrui Wu, Bernhard Schölkopf
Abstract: We present a local learning approach for clustering. The basic idea is that a good clustering result should have the property that the cluster label of each data point can be well predicted based on its neighboring data and their cluster labels, using current supervised learning methods. An optimization problem is formulated such that its solution has the above property. Relaxation and eigen-decomposition are applied to solve this optimization problem. We also briefly investigate the parameter selection issue and provide a simple parameter selection method for the proposed algorithm. Experimental results are provided to validate the effectiveness of the proposed approach. 1
2 0.35443521 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
Author: Hamed Valizadegan, Rong Jin
Abstract: Maximum margin clustering was proposed lately and has shown promising performance in recent studies [1, 2]. It extends the theory of support vector machine to unsupervised learning. Despite its good performance, there are three major problems with maximum margin clustering that question its efficiency for real-world applications. First, it is computationally expensive and difficult to scale to large-scale datasets because the number of parameters in maximum margin clustering is quadratic in the number of examples. Second, it requires data preprocessing to ensure that any clustering boundary will pass through the origins, which makes it unsuitable for clustering unbalanced dataset. Third, it is sensitive to the choice of kernel functions, and requires external procedure to determine the appropriate values for the parameters of kernel functions. In this paper, we propose “generalized maximum margin clustering” framework that addresses the above three problems simultaneously. The new framework generalizes the maximum margin clustering algorithm by allowing any clustering boundaries including those not passing through the origins. It significantly improves the computational efficiency by reducing the number of parameters. Furthermore, the new framework is able to automatically determine the appropriate kernel matrix without any labeled data. Finally, we show a formal connection between maximum margin clustering and spectral clustering. We demonstrate the efficiency of the generalized maximum margin clustering algorithm using both synthetic datasets and real datasets from the UCI repository. 1
3 0.29265624 80 nips-2006-Fundamental Limitations of Spectral Clustering
Author: Boaz Nadler, Meirav Galun
Abstract: Spectral clustering methods are common graph-based approaches to clustering of data. Spectral clustering algorithms typically start from local information encoded in a weighted graph on the data and cluster according to the global eigenvectors of the corresponding (normalized) similarity matrix. One contribution of this paper is to present fundamental limitations of this general local to global approach. We show that based only on local information, the normalized cut functional is not a suitable measure for the quality of clustering. Further, even with a suitable similarity measure, we show that the first few eigenvectors of such adjacency matrices cannot successfully cluster datasets that contain structures at different scales of size and density. Based on these findings, a second contribution of this paper is a novel diffusion based measure to evaluate the coherence of individual clusters. Our measure can be used in conjunction with any bottom-up graph-based clustering method, it is scale-free and can determine coherent clusters at all scales. We present both synthetic examples and real image segmentation problems where various spectral clustering algorithms fail. In contrast, using this coherence measure finds the expected clusters at all scales. Keywords: Clustering, kernels, learning theory. 1
4 0.16262867 67 nips-2006-Differential Entropic Clustering of Multivariate Gaussians
Author: Jason V. Davis, Inderjit S. Dhillon
Abstract: Gaussian data is pervasive and many learning algorithms (e.g., k-means) model their inputs as a single sample drawn from a multivariate Gaussian. However, in many real-life settings, each input object is best described by multiple samples drawn from a multivariate Gaussian. Such data can arise, for example, in a movie review database where each movie is rated by several users, or in time-series domains such as sensor networks. Here, each input can be naturally described by both a mean vector and covariance matrix which parameterize the Gaussian distribution. In this paper, we consider the problem of clustering such input objects, each represented as a multivariate Gaussian. We formulate the problem using an information theoretic approach and draw several interesting theoretical connections to Bregman divergences and also Bregman matrix divergences. We evaluate our method across several domains, including synthetic data, sensor network data, and a statistical debugging application. 1
5 0.15760008 70 nips-2006-Doubly Stochastic Normalization for Spectral Clustering
Author: Ron Zass, Amnon Shashua
Abstract: In this paper we focus on the issue of normalization of the affinity matrix in spectral clustering. We show that the difference between N-cuts and Ratio-cuts is in the error measure being used (relative-entropy versus L1 norm) in finding the closest doubly-stochastic matrix to the input affinity matrix. We then develop a scheme for finding the optimal, under Frobenius norm, doubly-stochastic approximation using Von-Neumann’s successive projections lemma. The new normalization scheme is simple and efficient and provides superior clustering performance over many of the standardized tests. 1
6 0.13127518 102 nips-2006-Kernel Maximum Entropy Data Transformation and an Enhanced Spectral Clustering Algorithm
7 0.12986231 116 nips-2006-Learning from Multiple Sources
8 0.12730899 151 nips-2006-On the Relation Between Low Density Separation, Spectral Clustering and Graph Cuts
9 0.12674852 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models
10 0.12555599 65 nips-2006-Denoising and Dimension Reduction in Feature Space
11 0.12207785 100 nips-2006-Information Bottleneck for Non Co-Occurrence Data
12 0.10701397 128 nips-2006-Manifold Denoising
13 0.099202745 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
14 0.093114048 158 nips-2006-PG-means: learning the number of clusters in data
15 0.091814578 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
16 0.091075741 181 nips-2006-Stability of $K$-Means Clustering
17 0.086419985 175 nips-2006-Simplifying Mixture Models through Function Approximation
18 0.085986942 79 nips-2006-Fast Iterative Kernel PCA
19 0.085497141 127 nips-2006-MLLE: Modified Locally Linear Embedding Using Multiple Weights
20 0.078257069 123 nips-2006-Learning with Hypergraphs: Clustering, Classification, and Embedding
topicId topicWeight
[(0, -0.258), (1, 0.156), (2, 0.05), (3, 0.362), (4, 0.086), (5, -0.088), (6, 0.041), (7, 0.14), (8, 0.203), (9, 0.027), (10, -0.043), (11, 0.167), (12, -0.11), (13, 0.081), (14, -0.033), (15, 0.081), (16, 0.041), (17, 0.062), (18, 0.076), (19, -0.016), (20, 0.022), (21, 0.073), (22, -0.009), (23, 0.144), (24, 0.001), (25, 0.049), (26, 0.033), (27, -0.081), (28, -0.039), (29, -0.03), (30, 0.086), (31, -0.015), (32, 0.089), (33, -0.004), (34, -0.013), (35, 0.073), (36, -0.006), (37, -0.05), (38, 0.023), (39, -0.007), (40, -0.005), (41, 0.017), (42, -0.025), (43, 0.074), (44, 0.026), (45, -0.041), (46, -0.052), (47, -0.037), (48, 0.02), (49, -0.037)]
simIndex simValue paperId paperTitle
same-paper 1 0.96932554 7 nips-2006-A Local Learning Approach for Clustering
Author: Mingrui Wu, Bernhard Schölkopf
Abstract: We present a local learning approach for clustering. The basic idea is that a good clustering result should have the property that the cluster label of each data point can be well predicted based on its neighboring data and their cluster labels, using current supervised learning methods. An optimization problem is formulated such that its solution has the above property. Relaxation and eigen-decomposition are applied to solve this optimization problem. We also briefly investigate the parameter selection issue and provide a simple parameter selection method for the proposed algorithm. Experimental results are provided to validate the effectiveness of the proposed approach. 1
2 0.86797184 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
Author: Hamed Valizadegan, Rong Jin
Abstract: Maximum margin clustering was proposed lately and has shown promising performance in recent studies [1, 2]. It extends the theory of support vector machine to unsupervised learning. Despite its good performance, there are three major problems with maximum margin clustering that question its efficiency for real-world applications. First, it is computationally expensive and difficult to scale to large-scale datasets because the number of parameters in maximum margin clustering is quadratic in the number of examples. Second, it requires data preprocessing to ensure that any clustering boundary will pass through the origins, which makes it unsuitable for clustering unbalanced dataset. Third, it is sensitive to the choice of kernel functions, and requires external procedure to determine the appropriate values for the parameters of kernel functions. In this paper, we propose “generalized maximum margin clustering” framework that addresses the above three problems simultaneously. The new framework generalizes the maximum margin clustering algorithm by allowing any clustering boundaries including those not passing through the origins. It significantly improves the computational efficiency by reducing the number of parameters. Furthermore, the new framework is able to automatically determine the appropriate kernel matrix without any labeled data. Finally, we show a formal connection between maximum margin clustering and spectral clustering. We demonstrate the efficiency of the generalized maximum margin clustering algorithm using both synthetic datasets and real datasets from the UCI repository. 1
3 0.86048782 80 nips-2006-Fundamental Limitations of Spectral Clustering
Author: Boaz Nadler, Meirav Galun
Abstract: Spectral clustering methods are common graph-based approaches to clustering of data. Spectral clustering algorithms typically start from local information encoded in a weighted graph on the data and cluster according to the global eigenvectors of the corresponding (normalized) similarity matrix. One contribution of this paper is to present fundamental limitations of this general local to global approach. We show that based only on local information, the normalized cut functional is not a suitable measure for the quality of clustering. Further, even with a suitable similarity measure, we show that the first few eigenvectors of such adjacency matrices cannot successfully cluster datasets that contain structures at different scales of size and density. Based on these findings, a second contribution of this paper is a novel diffusion based measure to evaluate the coherence of individual clusters. Our measure can be used in conjunction with any bottom-up graph-based clustering method, it is scale-free and can determine coherent clusters at all scales. We present both synthetic examples and real image segmentation problems where various spectral clustering algorithms fail. In contrast, using this coherence measure finds the expected clusters at all scales. Keywords: Clustering, kernels, learning theory. 1
4 0.80702007 70 nips-2006-Doubly Stochastic Normalization for Spectral Clustering
Author: Ron Zass, Amnon Shashua
Abstract: In this paper we focus on the issue of normalization of the affinity matrix in spectral clustering. We show that the difference between N-cuts and Ratio-cuts is in the error measure being used (relative-entropy versus L1 norm) in finding the closest doubly-stochastic matrix to the input affinity matrix. We then develop a scheme for finding the optimal, under Frobenius norm, doubly-stochastic approximation using Von-Neumann’s successive projections lemma. The new normalization scheme is simple and efficient and provides superior clustering performance over many of the standardized tests. 1
5 0.71579105 67 nips-2006-Differential Entropic Clustering of Multivariate Gaussians
Author: Jason V. Davis, Inderjit S. Dhillon
Abstract: Gaussian data is pervasive and many learning algorithms (e.g., k-means) model their inputs as a single sample drawn from a multivariate Gaussian. However, in many real-life settings, each input object is best described by multiple samples drawn from a multivariate Gaussian. Such data can arise, for example, in a movie review database where each movie is rated by several users, or in time-series domains such as sensor networks. Here, each input can be naturally described by both a mean vector and covariance matrix which parameterize the Gaussian distribution. In this paper, we consider the problem of clustering such input objects, each represented as a multivariate Gaussian. We formulate the problem using an information theoretic approach and draw several interesting theoretical connections to Bregman divergences and also Bregman matrix divergences. We evaluate our method across several domains, including synthetic data, sensor network data, and a statistical debugging application. 1
6 0.58611667 181 nips-2006-Stability of $K$-Means Clustering
7 0.57685965 123 nips-2006-Learning with Hypergraphs: Clustering, Classification, and Embedding
8 0.54659861 100 nips-2006-Information Bottleneck for Non Co-Occurrence Data
9 0.47748834 151 nips-2006-On the Relation Between Low Density Separation, Spectral Clustering and Graph Cuts
10 0.47566956 102 nips-2006-Kernel Maximum Entropy Data Transformation and an Enhanced Spectral Clustering Algorithm
11 0.4187158 79 nips-2006-Fast Iterative Kernel PCA
12 0.41464552 158 nips-2006-PG-means: learning the number of clusters in data
13 0.38185358 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
14 0.37408763 175 nips-2006-Simplifying Mixture Models through Function Approximation
15 0.3740119 82 nips-2006-Gaussian and Wishart Hyperkernels
16 0.36922973 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space
17 0.36329004 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models
18 0.36191264 153 nips-2006-Online Clustering of Moving Hyperplanes
19 0.35869437 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
20 0.35470334 128 nips-2006-Manifold Denoising
topicId topicWeight
[(1, 0.103), (3, 0.016), (7, 0.065), (9, 0.459), (22, 0.054), (44, 0.066), (57, 0.048), (65, 0.055), (69, 0.034)]
simIndex simValue paperId paperTitle
1 0.96614492 149 nips-2006-Nonnegative Sparse PCA
Author: Ron Zass, Amnon Shashua
Abstract: We describe a nonnegative variant of the ”Sparse PCA” problem. The goal is to create a low dimensional representation from a collection of points which on the one hand maximizes the variance of the projected points and on the other uses only parts of the original coordinates, and thereby creating a sparse representation. What distinguishes our problem from other Sparse PCA formulations is that the projection involves only nonnegative weights of the original coordinates — a desired quality in various fields, including economics, bioinformatics and computer vision. Adding nonnegativity contributes to sparseness, where it enforces a partitioning of the original coordinates among the new axes. We describe a simple yet efficient iterative coordinate-descent type of scheme which converges to a local optimum of our optimization criteria, giving good results on large real world datasets. 1
2 0.96597081 173 nips-2006-Shifting, One-Inclusion Mistake Bounds and Tight Multiclass Expected Risk Bounds
Author: Benjamin I. Rubinstein, Peter L. Bartlett, J. H. Rubinstein
Abstract: Under the prediction model of learning, a prediction strategy is presented with an i.i.d. sample of n − 1 points in X and corresponding labels from a concept f ∈ F, and aims to minimize the worst-case probability of erring on an nth point. By exploiting the structure of F, Haussler et al. achieved a VC(F)/n bound for the natural one-inclusion prediction strategy, improving on bounds implied by PAC-type results by a O(log n) factor. The key data structure in their result is the natural subgraph of the hypercube—the one-inclusion graph; the key step is a d = VC(F) bound on one-inclusion graph density. The first main result of this n n−1 paper is a density bound of n ≤d−1 / ( ≤d ) < d, which positively resolves a conjecture of Kuzmin & Warmuth relating to their unlabeled Peeling compression scheme and also leads to an improved mistake bound for the randomized (deterministic) one-inclusion strategy for all d (for d ≈ Θ(n)). The proof uses a new form of VC-invariant shifting and a group-theoretic symmetrization. Our second main result is a k-class analogue of the d/n mistake bound, replacing the VC-dimension by the Pollard pseudo-dimension and the one-inclusion strategy by its natural hypergraph generalization. This bound on expected risk improves on known PAC-based results by a factor of O(log n) and is shown to be optimal up to a O(log k) factor. The combinatorial technique of shifting takes a central role in understanding the one-inclusion (hyper)graph and is a running theme throughout. 1
3 0.95994723 197 nips-2006-Uncertainty, phase and oscillatory hippocampal recall
Author: Máté Lengyel, Peter Dayan
Abstract: Many neural areas, notably, the hippocampus, show structured, dynamical, population behavior such as coordinated oscillations. It has long been observed that such oscillations provide a substrate for representing analog information in the firing phases of neurons relative to the underlying population rhythm. However, it has become increasingly clear that it is essential for neural populations to represent uncertainty about the information they capture, and the substantial recent work on neural codes for uncertainty has omitted any analysis of oscillatory systems. Here, we observe that, since neurons in an oscillatory network need not only fire once in each cycle (or even at all), uncertainty about the analog quantities each neuron represents by its firing phase might naturally be reported through the degree of concentration of the spikes that it fires. We apply this theory to memory in a model of oscillatory associative recall in hippocampal area CA3. Although it is not well treated in the literature, representing and manipulating uncertainty is fundamental to competent memory; our theory enables us to view CA3 as an effective uncertainty-aware, retrieval system. 1
same-paper 4 0.88573468 7 nips-2006-A Local Learning Approach for Clustering
Author: Mingrui Wu, Bernhard Schölkopf
Abstract: We present a local learning approach for clustering. The basic idea is that a good clustering result should have the property that the cluster label of each data point can be well predicted based on its neighboring data and their cluster labels, using current supervised learning methods. An optimization problem is formulated such that its solution has the above property. Relaxation and eigen-decomposition are applied to solve this optimization problem. We also briefly investigate the parameter selection issue and provide a simple parameter selection method for the proposed algorithm. Experimental results are provided to validate the effectiveness of the proposed approach. 1
5 0.64920855 167 nips-2006-Recursive ICA
Author: Honghao Shan, Lingyun Zhang, Garrison W. Cottrell
Abstract: Independent Component Analysis (ICA) is a popular method for extracting independent features from visual data. However, as a fundamentally linear technique, there is always nonlinear residual redundancy that is not captured by ICA. Hence there have been many attempts to try to create a hierarchical version of ICA, but so far none of the approaches have a natural way to apply them more than once. Here we show that there is a relatively simple technique that transforms the absolute values of the outputs of a previous application of ICA into a normal distribution, to which ICA maybe applied again. This results in a recursive ICA algorithm that may be applied any number of times in order to extract higher order structure from previous layers. 1
6 0.64411688 158 nips-2006-PG-means: learning the number of clusters in data
7 0.61772758 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis
8 0.6070652 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
9 0.60636431 80 nips-2006-Fundamental Limitations of Spectral Clustering
10 0.60087627 70 nips-2006-Doubly Stochastic Normalization for Spectral Clustering
11 0.57683015 187 nips-2006-Temporal Coding using the Response Properties of Spiking Neurons
12 0.56262434 127 nips-2006-MLLE: Modified Locally Linear Embedding Using Multiple Weights
13 0.56246418 107 nips-2006-Large Margin Multi-channel Analog-to-Digital Conversion with Applications to Neural Prosthesis
14 0.56037217 65 nips-2006-Denoising and Dimension Reduction in Feature Space
15 0.55404365 72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model
16 0.55075341 67 nips-2006-Differential Entropic Clustering of Multivariate Gaussians
17 0.54424626 148 nips-2006-Nonlinear physically-based models for decoding motor-cortical population activity
18 0.53317761 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
19 0.53103393 181 nips-2006-Stability of $K$-Means Clustering
20 0.53051871 163 nips-2006-Prediction on a Graph with a Perceptron