nips nips2003 nips2003-46 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Bernd Fischer, Volker Roth, Joachim M. Buhmann
Abstract: Clustering aims at extracting hidden structure in dataset. While the problem of finding compact clusters has been widely studied in the literature, extracting arbitrarily formed elongated structures is considered a much harder problem. In this paper we present a novel clustering algorithm which tackles the problem by a two step procedure: first the data are transformed in such a way that elongated structures become compact ones. In a second step, these new objects are clustered by optimizing a compactness-based criterion. The advantages of the method over related approaches are threefold: (i) robustness properties of compactness-based criteria naturally transfer to the problem of extracting elongated structures, leading to a model which is highly robust against outlier objects; (ii) the transformed distances induce a Mercer kernel which allows us to formulate a polynomial approximation scheme to the generally N Phard clustering problem; (iii) the new method does not contain free kernel parameters in contrast to methods like spectral clustering or mean-shift clustering. 1
Reference: text
sentIndex sentText sentNum sentScore
1 While the problem of finding compact clusters has been widely studied in the literature, extracting arbitrarily formed elongated structures is considered a much harder problem. [sent-7, score-0.487]
2 In this paper we present a novel clustering algorithm which tackles the problem by a two step procedure: first the data are transformed in such a way that elongated structures become compact ones. [sent-8, score-0.722]
3 Among various possible grouping principles, those methods which try to find compact clusters have gained particular importance. [sent-12, score-0.215]
4 Presumingly the most prominent method of this kind is the K-means clustering for vectorial data [6]. [sent-13, score-0.378]
5 Despite the powerful modeling capabilities of compactness-based clustering methods, they mostly fail in finding elongated structures. [sent-14, score-0.584]
6 The fast single linkage algorithm [9] is the most often used algorithm to search for elongated structures, but it is known to be very sensitive to outliers in the dataset. [sent-15, score-0.425]
7 Mean shift clustering [3], another method of this class, is capable of extracting elongated clusters only if all modes of the underlying probability distribution have one single maximum. [sent-16, score-0.867]
8 Furthermore, a suitable kernel bandwidth parameter has to be preselected [2]. [sent-17, score-0.187]
9 Spectral clustering [10] shows good performance in many cases, but the algorithm is only analyzed for special input instances while a complete analysis of the algorithm is still missing. [sent-18, score-0.378]
10 Concerning the preselection of a suitable kernel width, spectral clustering suffers from similar problems as mean shift clustering. [sent-19, score-0.688]
11 In this paper we present an alternative method for clustering elongated structures. [sent-20, score-0.584]
12 We build up on the work on path-based clustering [7]. [sent-22, score-0.378]
13 For a slight modification of the original prob- lem we show that the defined path distance induces a kernel matrix fulfilling Mercers condition. [sent-23, score-0.265]
14 After the computation of the path-based distance, the compactness-based pairwise clustering principle is used to partition the data. [sent-24, score-0.551]
15 While for the general N P-hard pairwise clustering problem no approximation algorithms are known, we present a polynomial time approximation scheme (PTAS) for our special case with path-based distances. [sent-25, score-0.697]
16 In this vector space, pairwise clustering reduces to minimizing the K-means cost function in (n − 1) dimensions [13]. [sent-27, score-0.507]
17 Our experiments suggest that kernel PCA effectively reduces the noise in the data while preserving the coarse cluster structure. [sent-30, score-0.236]
18 Our method is compared to spectral clustering and mean shift clustering on selected artificial datasets. [sent-31, score-0.941]
19 2 Clustering by Connectivity The main idea of our clustering criterion is to transform elongated structures into compact ones in a preprocessing step. [sent-33, score-0.673]
20 Given the transformed data, we then infer a clustering solution by optimizing a compactness based criterion. [sent-34, score-0.471]
21 in the spanning tree approach is the following: while spanning tree algorithms are extremely sensitive to outliers, the two-step procedure may benefit from the statistical robustness of certain compactness based methods. [sent-37, score-0.342]
22 Concerning the general case of datasets which are not given in a vector space, but only characterized by pairwise dissimilarities, the pairwise clustering model has been shown to be robust against outliers in the dataset [12]. [sent-38, score-0.731]
23 The idea of this preprocessing step is to define distances between objects by considering certain paths through the total object set. [sent-43, score-0.294]
24 The natural formalization of such path problems is to represent the objects as a graph: consider a connected graph G = (V, E, d ) with n vertices (the objects) and symmetric nonnegative edge weights dij on the edge (i, j) (the original dissimilarities). [sent-44, score-0.713]
25 In order to make those objects more similar which are connected by “bridges” of other objects, we define for each path p ∈ Pij the effective dissimilarity dp between i and j connected ij by p as the maximum weight on this path, i. [sent-46, score-0.512]
26 The total dissimilarity between vertices i and j is then defined as the minimum of all path-specific effective dissimilarities dp : ij dij := min { max p∈Pij 1≤h≤|p|−1 dp[h]p[h+1] }. [sent-49, score-0.939]
27 If the objects are in the same cluster their pairwise effective dissimilarities will be small (fig. [sent-51, score-0.657]
28 If the two objects belong to two different clusters, however, all paths contain at least one large dissimilarity and the resulting effective dissimilarity will be large (fig. [sent-53, score-0.547]
29 A problem dij dij (a) (b) dij (c) Figure 1: Effective dissimilarities. [sent-57, score-1.395]
30 (a) If objects belong to the same high-density region, d ij is small. [sent-58, score-0.21]
31 (b) If they are in different regions, dij is larger. [sent-59, score-0.465]
32 The reader should notice that the single linkage algorithm does not posses the robustness properties, since it will separate the three most distant outlier objects in example 1(a) from the remaining data, but it will not detect the dominant structure. [sent-63, score-0.406]
33 Summarizing the above model, we formalize the path-based clustering problem as: INPUT: A symmetric (n × n) matrix D = (dij )1≤i,j≤n of nonnegative pairwise dissimilarities between n objects, with zero diagonal elements. [sent-64, score-0.767]
34 QUESTION: Find clusters by minimizing H PC (c; D), where the matrix D represents the effective dissimilarities derived from D by eq. [sent-65, score-0.43]
35 3 The Connectivity Kernel In this section we show that the effective dissimilarities induce a Mercer kernel on the weighted graph G. [sent-67, score-0.419]
36 The Mercer property will then allow us to derive several approximation results for the N P-hard pairwise clustering problem in section 4. [sent-68, score-0.565]
37 A metric D is called ultra-metric if it satisfies the condition dij ≤ max(dik , dkj ) for all distinct i, j, k. [sent-70, score-0.59]
38 The dissimilarities defined by (2) induce an ultra-metric on G. [sent-72, score-0.218]
39 This situation, however, contradicts the above definition (2) of dij : in this case there exists a path from i to j over k, the weakest link of which is shorter than dij . [sent-75, score-1.062]
40 Equation (2) then implies that dij must be smaller or equal to max(dik , dkj ). [sent-76, score-0.554]
41 A metric D is 2 embeddable, if there exists a set of vectors {xi }n , xi ∈ i=1 Rp , p ≤ n − 1 such that for all pairs i, j xi − xj 2 = dij . [sent-78, score-0.698]
42 For the distance matrix D defined in the setting of theorem 1, the matrix S c = − 1 Dc with D c = QDQ is a Gram matrix or Mercer kernel. [sent-94, score-0.189]
43 It contains dot products 2 between a set of vectors {xi }n with squared Euclidean distances xi − xj 2 = dij . [sent-95, score-0.704]
44 (i) Since D is ultra-metric, D is 2 embeddable by lemma 1, and D c is negative (semi)definite by lemma (2). [sent-97, score-0.233]
45 (ii) Since sc is a dotij product between two vectors xi and xj , the squared Euclidean distance between xi and xj 1 is defined by xi − xj 2 = sc + sc − 2sc = − 2 dc + dc − 2dc . [sent-100, score-0.654]
46 With the definition 2 ii jj ij ii jj ij of the centralized distances, it can be seen easily that all but one term, namely the original distance, cancel out: − 1 dc + dc − 2dc = dij . [sent-101, score-0.938]
47 ii jj ij 2 4 Approximation Results Pairwise clustering is known to be N P-hard [1]. [sent-102, score-0.509]
48 To our knowledge there is no polynomial time approximation algorithm known for the general case of pairwise clustering. [sent-103, score-0.261]
49 For our special case in which the data are transformed into effective dissimilarities, however, we now present a polynomial time approximation scheme. [sent-104, score-0.257]
50 Let us first consider the computation of the effective dissimilarities D. [sent-106, score-0.253]
51 Despite the fact that the path-based distance is a minimum over all paths from i to j, the whole distance matrix can be computed in polynomial time. [sent-107, score-0.243]
52 The path-based dissimilarity matrix D defined by equation 2 can be computed in running time O(n2 log n). [sent-109, score-0.218]
53 The computation of the connectivity kernel matrix is an extention of Kruskal’s minimum spanning tree algorithm. [sent-111, score-0.446]
54 In each iteration step the two clusters Ci and Cj are merged with minimal costs dij = minp∈Ci ,q∈Cj dpq where dpq is the edge weight on the input graph. [sent-113, score-0.736]
55 The link dij gives the effective dissimilarity of all objects in Ci to all objects in Cj . [sent-114, score-1.002]
56 To proof this, one can consider the case, where dij is not the effective dissimilarity between Ci and Cj . [sent-115, score-0.674]
57 Then there exists a path over some other cluster Ck , where all objects on this path have a smaller weight, implying the existence of another pair of clusters with smaller merging costs. [sent-116, score-0.55]
58 The running time is O(n2 log n) for the spanning tree algorithm on the the complete input graph and additional O(n2 ) for filling all elements in the matrix D. [sent-117, score-0.214]
59 , xn ∈ Rp }, the task is to partition the vectors in such a way that the squared Euclidean distance to the cluster centroids is minimized. [sent-122, score-0.301]
60 The objective function for K-means is given by H KM (c; X ) = K ν=1 i:ci =ν (xi − y ν )2 where yν = 1 nν j:cj =ν xj (3) Minimizing the K-means objective function for squared Euclidean distances is N P-hard if the dimension of the vectors is growing with n. [sent-123, score-0.203]
61 Using this approximation lemma we are able to proof the existence of a PTAS for pairwise data clustering using the distance defined by (2). [sent-128, score-0.687]
62 By lemma 3 the dissimilarity matrix D can be computed in polynomial time. [sent-132, score-0.34]
63 From the fact that we can define a connectivity kernel matrix we can use kernel PCA [14] to reduce the data dimension. [sent-142, score-0.442]
64 Diagonalization of the centered kernel matrix S c leads to S c = V t ΛV , with an orthogonal matrix V = (v1 , . [sent-144, score-0.225]
65 Embedding the path-based distances into RK by kernel PCA and enumerating 2 over all possible Voronoi partitions yields an O(nK +1 ) algorithm which approximates path-based clustering within a constant factor of 2. [sent-153, score-0.664]
66 The leaves of the tree are the objects of figure 2. [sent-173, score-0.239]
67 For better visualization of the tree structure, the bar diagrams below the tree show the labels of the three cluster 1 It has been shown in [12] that Ward’s method is an optimization heuristics for H P C . [sent-174, score-0.261]
68 (a) (b) (c) Figure 2: Comparison to other clustering methods. [sent-176, score-0.378]
69 (a) Mean shift clustering, (b) Spectral Clustering, (c) Connectivity kernel clustering. [sent-177, score-0.216]
70 (a) Single Linkage, (b) Ward’s method with connectivity kernel, applied to embedded objects in n − 1 dimensions. [sent-182, score-0.361]
71 It is obvious that the main parts of the spiral arms are found, but the objects drawn on the right side are separated from the rest of the cluster. [sent-188, score-0.382]
72 The respective objects are the outliers that are separated in the highest hierarchical levels of the algorithm. [sent-189, score-0.332]
73 We conclude that for small K, single linkage has the tendency to separates single outlier objects from the data. [sent-190, score-0.366]
74 By way of the connectivity kernel we can transform the original dyadic data to n − 1 dimensional vectorial data. [sent-191, score-0.267]
75 To show comparable results for the connectivity kernel, we apply Ward’s method to the embedded vectors. [sent-192, score-0.197]
76 Opposed to the single linkage results, the main structure of the spiral arms has been successfully found in the hierarchy corresponding to the three cluster solution. [sent-194, score-0.448]
77 It should also be noticed that the costs of the three cluster solution are not much larger as the costs of the four cluster solution, indicating that the three cluster solution does not form a distinctly separated hierarchical level. [sent-196, score-0.549]
78 Figure 3(c) demonstrates that more distinctly separated levels can be found after applying kernel PCA and embedding the objects into a low-dimensional space (here 3 dimensions). [sent-197, score-0.401]
79 One can see that the coarse struc- ture of the tree has been preserved, while the costs of cluster solutions for K > 3 have been shrunken towards zero. [sent-199, score-0.24]
80 Now we compare our results to other recently published clustering techniques, that have been designed to extract elongated structures. [sent-201, score-0.618]
81 Mean shift clustering [3] computes a trajectory of vectors towards the gradient of the underlying probability density. [sent-202, score-0.506]
82 Mean shift clustering is only applicable to finite dimensional vector spaces, because it implicitly involves density estimation. [sent-207, score-0.469]
83 A potential shortcoming of mean-shift clustering is the following: if the modes of the distribution have multiple local maxima (as e. [sent-208, score-0.378]
84 in the spiral arm example), there does not exist any kernel bandwidth to successfully separate the data according to the underlying structure. [sent-210, score-0.293]
85 In figure 2(a) the best result for mean shift clustering is drawn. [sent-211, score-0.469]
86 For smaller values of σ the spiral arms are further subdivided into additional clusters, and for a larger bandwidth values, the result becomes more and more similar to compactness-based criteria like K-means. [sent-212, score-0.242]
87 Spectral clustering with a Gaussian kernel is known to be able to separate nested circles, but we observed that it has severe problems to extract the noisy spiral arms, see 2(b). [sent-218, score-0.643]
88 In spectral clustering, the kernel width σ is a free parameter which has to be selected “correctly”. [sent-219, score-0.219]
89 If σ is too large, spectral clustering becomes similar to standard K-means and fails to extract elongated structures. [sent-220, score-0.712]
90 Our approach to clustering with the connectivity kernel, however, could successfully extract the three spiral arms as can be seen in figure 2(c). [sent-222, score-0.734]
91 (b) K-means labels and (c) clustering with connectivity kernel. [sent-227, score-0.52]
92 In a last experiment, we show the advantages of our method compared to a parameter-free compactness criterion (K-means) on the problem of clustering digits ’2’ and ’9’ from the USPS digits dataset. [sent-228, score-0.512]
93 Figure 4 shows the clustering result of our method using the connectivity kernel. [sent-229, score-0.52]
94 Compared to the ground truth solution, path-based clustering succeeded in extracting the elongated structures, resulting in a very small error of only 1. [sent-233, score-0.687]
95 6 Conclusion In this paper we presented a clustering approach, that is based on path-based distances in the input graph. [sent-237, score-0.467]
96 In a first step, elongated structures are transformed into compact ones, which in the second step are partitioned by the compactness-based pairwise clustering method. [sent-238, score-0.851]
97 We showed that the transformed distances induce a Mercer kernel, which in turn allowed us to derive a polynomial time approximation scheme for the generally N P-hard pairwise clustering problem. [sent-239, score-0.818]
98 Compared to related methods like single linkage, mean shift clustering and spectral clustering, our method has been shown to successfully overcome the problem of sensitivity to outlier objects, while being capable of extracting nested elongated structures. [sent-242, score-0.912]
99 Our method does not involve any free kernel parameters, which we consider to be a particular advantage over both mean shift– and spectral clustering. [sent-243, score-0.219]
100 Path-based clustering for grouping of smooth curves and texture segmentation. [sent-287, score-0.425]
wordName wordTfidf (topN-words)
[('dij', 0.465), ('clustering', 0.378), ('elongated', 0.206), ('ptas', 0.201), ('dissimilarities', 0.177), ('objects', 0.164), ('connectivity', 0.142), ('ward', 0.141), ('dissimilarity', 0.133), ('pairwise', 0.129), ('clusters', 0.127), ('kernel', 0.125), ('linkage', 0.124), ('cluster', 0.111), ('spiral', 0.106), ('mercer', 0.099), ('voronoi', 0.097), ('outliers', 0.095), ('spectral', 0.094), ('shift', 0.091), ('dik', 0.089), ('dkj', 0.089), ('distances', 0.089), ('km', 0.088), ('lemma', 0.083), ('dc', 0.081), ('cj', 0.081), ('outlier', 0.078), ('effective', 0.076), ('tree', 0.075), ('polynomial', 0.074), ('arms', 0.074), ('pc', 0.074), ('embeddable', 0.067), ('ci', 0.066), ('extracting', 0.065), ('bandwidth', 0.062), ('semi', 0.059), ('pca', 0.059), ('agglomerative', 0.058), ('approximation', 0.058), ('embedded', 0.055), ('costs', 0.054), ('spanning', 0.054), ('path', 0.051), ('matrix', 0.05), ('transformed', 0.049), ('sc', 0.049), ('bridge', 0.049), ('centralized', 0.049), ('usps', 0.049), ('structures', 0.048), ('grouping', 0.047), ('jj', 0.047), ('embed', 0.047), ('ij', 0.046), ('exists', 0.046), ('digits', 0.045), ('dpq', 0.045), ('nkp', 0.045), ('ostrovsky', 0.045), ('qdq', 0.045), ('compactness', 0.044), ('partition', 0.044), ('euclidean', 0.043), ('dp', 0.042), ('xj', 0.042), ('paths', 0.041), ('induce', 0.041), ('compact', 0.041), ('rp', 0.041), ('robustness', 0.04), ('en', 0.039), ('pij', 0.039), ('distance', 0.039), ('fischer', 0.039), ('enumerating', 0.039), ('mercers', 0.039), ('embedding', 0.039), ('ii', 0.038), ('separated', 0.038), ('truth', 0.038), ('vectors', 0.037), ('metric', 0.036), ('xi', 0.036), ('gure', 0.036), ('centroids', 0.035), ('distinctly', 0.035), ('kmeans', 0.035), ('weakest', 0.035), ('squared', 0.035), ('hierarchical', 0.035), ('running', 0.035), ('extract', 0.034), ('hierarchy', 0.033), ('nonnegative', 0.033), ('zurich', 0.033), ('buhmann', 0.033), ('roth', 0.033), ('partitions', 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 46 nips-2003-Clustering with the Connectivity Kernel
Author: Bernd Fischer, Volker Roth, Joachim M. Buhmann
Abstract: Clustering aims at extracting hidden structure in dataset. While the problem of finding compact clusters has been widely studied in the literature, extracting arbitrarily formed elongated structures is considered a much harder problem. In this paper we present a novel clustering algorithm which tackles the problem by a two step procedure: first the data are transformed in such a way that elongated structures become compact ones. In a second step, these new objects are clustered by optimizing a compactness-based criterion. The advantages of the method over related approaches are threefold: (i) robustness properties of compactness-based criteria naturally transfer to the problem of extracting elongated structures, leading to a model which is highly robust against outlier objects; (ii) the transformed distances induce a Mercer kernel which allows us to formulate a polynomial approximation scheme to the generally N Phard clustering problem; (iii) the new method does not contain free kernel parameters in contrast to methods like spectral clustering or mean-shift clustering. 1
2 0.23521434 107 nips-2003-Learning Spectral Clustering
Author: Francis R. Bach, Michael I. Jordan
Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive a new cost function for spectral clustering based on a measure of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing this cost function with respect to the partition leads to a new spectral clustering algorithm. Minimizing with respect to the similarity matrix leads to an algorithm for learning the similarity matrix. We develop a tractable approximation of our cost function that is based on the power method of computing eigenvectors. 1
3 0.22485489 73 nips-2003-Feature Selection in Clustering Problems
Author: Volker Roth, Tilman Lange
Abstract: A novel approach to combining clustering and feature selection is presented. It implements a wrapper strategy for feature selection, in the sense that the features are directly selected by optimizing the discriminative power of the used partitioning algorithm. On the technical side, we present an efficient optimization algorithm with guaranteed local convergence property. The only free parameter of this method is selected by a resampling-based stability analysis. Experiments with real-world datasets demonstrate that our method is able to infer both meaningful partitions and meaningful subsets of features. 1
4 0.1720601 111 nips-2003-Learning the k in k-means
Author: Greg Hamerly, Charles Elkan
Abstract: When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. In this paper we present an improved algorithm for learning k while clustering. The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian. Two key advantages are that the hypothesis test does not limit the covariance of the data and does not compute a full covariance matrix. Additionally, G-means only requires one intuitive parameter, the standard statistical significance level α. We present results from experiments showing that the algorithm works well, and better than a recent method based on the BIC penalty for model complexity. In these experiments, we show that the BIC is ineffective as a scoring function, since it does not penalize strongly enough the model’s complexity. 1 Introduction and related work Clustering algorithms are useful tools for data mining, compression, probability density estimation, and many other important tasks. However, most clustering algorithms require the user to specify the number of clusters (called k), and it is not always clear what is the best value for k. Figure 1 shows examples where k has been improperly chosen. Choosing k is often an ad hoc decision based on prior knowledge, assumptions, and practical experience. Choosing k is made more difficult when the data has many dimensions, even when clusters are well-separated. Center-based clustering algorithms (in particular k-means and Gaussian expectationmaximization) usually assume that each cluster adheres to a unimodal distribution, such as Gaussian. With these methods, only one center should be used to model each subset of data that follows a unimodal distribution. If multiple centers are used to describe data drawn from one mode, the centers are a needlessly complex description of the data, and in fact the multiple centers capture the truth about the subset less well than one center. In this paper we present a simple algorithm called G-means that discovers an appropriate k using a statistical test for deciding whether to split a k-means center into two centers. We describe examples and present experimental results that show that the new algorithm 0.9 4 0.8 3 0.7 2 0.6 1 0.5 0 0.4 −1 0.3 −2 −3 0.2 0.1 −0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 −4 −3 −2 −1 0 1 2 3 Figure 1: Two clusterings where k was improperly chosen. Dark crosses are k-means centers. On the left, there are too few centers; five should be used. On the right, too many centers are used; one center is sufficient for representing the data. In general, one center should be used to represent one Gaussian cluster. is successful. This technique is useful and applicable for many clustering algorithms other than k-means, but here we consider only the k-means algorithm for simplicity. Several algorithms have been proposed previously to determine k automatically. Like our method, most previous methods are wrappers around k-means or some other clustering algorithm for fixed k. Wrapper methods use splitting and/or merging rules for centers to increase or decrease k as the algorithm proceeds. Pelleg and Moore [14] proposed a regularization framework for learning k, which they call X-means. The algorithm searches over many values of k and scores each clustering model using the so-called Bayesian Information Criterion [10]: BIC(C|X) = L(X|C) − p log n 2 where L(X|C) is the log-likelihood of the dataset X according to model C, p = k(d + 1) is the number of parameters in the model C with dimensionality d and k cluster centers, and n is the number of points in the dataset. X-means chooses the model with the best BIC score on the data. Aside from the BIC, other scoring functions are also available. Bischof et al. [1] use a minimum description length (MDL) framework, where the description length is a measure of how well the data are fit by the model. Their algorithm starts with a large value for k and removes centers (reduces k) whenever that choice reduces the description length. Between steps of reducing k, they use the k-means algorithm to optimize the model fit to the data. With hierarchical clustering algorithms, other methods may be employed to determine the best number of clusters. One is to build a merging tree (“dendrogram”) of the data based on a cluster distance metric, and search for areas of the tree that are stable with respect to inter- and intra-cluster distances [9, Section 5.1]. This method of estimating k is best applied with domain-specific knowledge and human intuition. 2 The Gaussian-means (G-means) algorithm The G-means algorithm starts with a small number of k-means centers, and grows the number of centers. Each iteration of the algorithm splits into two those centers whose data appear not to come from a Gaussian distribution. Between each round of splitting, we run k-means on the entire dataset and all the centers to refine the current solution. We can initialize with just k = 1, or we can choose some larger value of k if we have some prior knowledge about the range of k. G-means repeatedly makes decisions based on a statistical test for the data assigned to each center. If the data currently assigned to a k-means center appear to be Gaussian, then we want to represent that data with only one center. However, if the same data do not appear Algorithm 1 G-means(X, α) 1: Let C be the initial set of centers (usually C ← {¯}). x 2: C ← kmeans(C, X). 3: Let {xi |class(xi ) = j} be the set of datapoints assigned to center cj . 4: Use a statistical test to detect if each {xi |class(xi ) = j} follow a Gaussian distribution (at confidence level α). 5: If the data look Gaussian, keep cj . Otherwise replace cj with two centers. 6: Repeat from step 2 until no more centers are added. to be Gaussian, then we want to use multiple centers to model the data properly. The algorithm will run k-means multiple times (up to k times when finding k centers), so the time complexity is at most O(k) times that of k-means. The k-means algorithm implicitly assumes that the datapoints in each cluster are spherically distributed around the center. Less restrictively, the Gaussian expectation-maximization algorithm assumes that the datapoints in each cluster have a multidimensional Gaussian distribution with a covariance matrix that may or may not be fixed, or shared. The Gaussian distribution test that we present below are valid for either covariance matrix assumption. The test also accounts for the number of datapoints n tested by incorporating n in the calculation of the critical value of the test (see Equation 2). This prevents the G-means algorithm from making bad decisions about clusters with few datapoints. 2.1 Testing clusters for Gaussian fit To specify the G-means algorithm fully we need a test to detect whether the data assigned to a center are sampled from a Gaussian. The alternative hypotheses are • H0 : The data around the center are sampled from a Gaussian. • H1 : The data around the center are not sampled from a Gaussian. If we accept the null hypothesis H0 , then we believe that the one center is sufficient to model its data, and we should not split the cluster into two sub-clusters. If we reject H0 and accept H1 , then we want to split the cluster. The test we use is based on the Anderson-Darling statistic. This one-dimensional test has been shown empirically to be the most powerful normality test that is based on the empirical cumulative distribution function (ECDF). Given a list of values xi that have been converted to mean 0 and variance 1, let x(i) be the ith ordered value. Let zi = F (x(i) ), where F is the N (0, 1) cumulative distribution function. Then the statistic is A2 (Z) = − 1 n n (2i − 1) [log(zi ) + log(1 − zn+1−i )] − n (1) i=1 Stephens [17] showed that for the case where µ and σ are estimated from the data (as in clustering), we must correct the statistic according to A2 (Z) ∗ = A2 (Z)(1 + 4/n − 25/(n2 )) (2) Given a subset of data X in d dimensions that belongs to center c, the hypothesis test proceeds as follows: 1. Choose a significance level α for the test. 2. Initialize two centers, called “children” of c. See the text for good ways to do this. 3. Run k-means on these two centers in X. This can be run to completion, or to some early stopping point if desired. Let c1 , c2 be the child centers chosen by k-means. 4. Let v = c1 − c2 be a d-dimensional vector that connects the two centers. This is the direction that k-means believes to be important for clustering. Then project X onto v: xi = xi , v /||v||2 . X is a 1-dimensional representation of the data projected onto v. Transform X so that it has mean 0 and variance 1. 5. Let zi = F (x(i) ). If A2 (Z) is in the range of non-critical values at confidence ∗ level α, then accept H0 , keep the original center, and discard {c1 , c2 }. Otherwise, reject H0 and keep {c1 , c2 } in place of the original center. A primary contribution of this work is simplifying the test for Gaussian fit by projecting the data to one dimension where the test is simple to apply. The authors of [5] also use this approach for online dimensionality reduction during clustering. The one-dimensional representation of the data allows us to consider only the data along the direction that kmeans has found to be important for separating the data. This is related to the problem of projection pursuit [7], where here k-means searches for a direction in which the data appears non-Gaussian. We must choose the significance level of the test, α, which is the desired probability of making a Type I error (i.e. incorrectly rejecting H0 ). It is appropriate to use a Bonferroni adjustment to reduce the chance of making Type I errors over multiple tests. For example, if we want a 0.01 chance of making a Type I error in 100 tests, we should apply a Bonferroni adjustment to make each test use α = 0.01/100 = 0.0001. To find k final centers the G-means algorithm makes k statistical tests, so the Bonferroni correction does not need to be extreme. In our tests, we always use α = 0.0001. We consider two ways to initialize the two child centers. Both approaches initialize with c ± m, where c is a center and m is chosen. The first method chooses m as a random d-dimensional vector such that ||m|| is small compared to the distortion of the data. A second method finds the main principal component s of the data (having eigenvalue λ), and chooses m = s 2λ/π. This deterministic method places the two centers in their expected locations under H0 . The principal component calculations require O(nd2 + d3 ) time and O(d2 ) space, but since we only want the main principal component, we can use fast methods like the power method, which takes time that is at most linear in the ratio of the two largest eigenvalues [4]. In this paper we use principal-component-based splitting. 2.2 An example Figure 2 shows a run of the G-means algorithm on a synthetic dataset with two true clusters and 1000 points, using α = 0.0001. The critical value for the Anderson-Darling test is 1.8692 for this confidence level. Starting with one center, after one iteration of G-means, we have 2 centers and the A2 statistic is 38.103. This is much larger than the critical value, ∗ so we reject H0 and accept this split. On the next iteration, we split each new center and repeat the statistical test. The A2 values for the two splits are 0.386 and 0.496, both of ∗ which are well below the critical value. Therefore we accept H0 for both tests, and discard these splits. Thus G-means gives a final answer of k = 2. 2.3 Statistical power Figure 3 shows the power of the Anderson-Darling test, as compared to the BIC. Lower is better for both plots. We run 1000 tests for each data point plotted for both plots. In the left 14 14 14 13 13 13 12 12 12 11 11 11 10 10 10 9 9 9 8 8 8 7 7 7 6 6 6 5 5 4 4 0 2 4 6 8 10 12 5 4 0 2 4 6 8 10 12 0 2 4 6 8 10 12 Figure 2: An example of running G-means for three iterations on a 2-dimensional dataset with two true clusters and 1000 points. Starting with one center (left plot), G-means splits into two centers (middle). The test for normality is significant, so G-means rejects H0 and keeps the split. After splitting each center again (right), the test values are not significant, so G-means accepts H0 for both tests and does not accept these splits. The middle plot is the G-means answer. See the text for further details. 1 1 G-means X-means 0.8 P(Type II error) P(Type I error) 0.8 G-means X-means 0.6 0.4 0.2 0.6 0.4 0.2 0 0 0 30 60 90 120 150 number of datapoints 180 210 0 30 60 90 120 150 number of datapoints 180 210 Figure 3: A comparison of the power of the Anderson-Darling test versus the BIC. For the AD test we fix the significance level (α = 0.0001), while the BIC’s significance level depends on n. The left plot shows the probability of incorrectly splitting (Type I error) one true 2-d cluster that is 5% elliptical. The right plot shows the probability of incorrectly not splitting two true clusters separated by 5σ (Type II error). Both plots are functions of n. Both plots show that the BIC overfits (splits clusters) when n is small. plot, for each test we generate n datapoints from a single true Gaussian distribution, and then plot the frequency with which BIC and G-means will choose k = 2 rather than k = 1 (i.e. commit a Type I error). BIC tends to overfit by choosing too many centers when the data is not strictly spherical, while G-means does not. This is consistent with the tests of real-world data in the next section. While G-means commits more Type II errors when n is small, this prevents it from overfitting the data. The BIC can be considered a likelihood ratio test, but with a significance level that cannot be fixed. The significance level instead varies depending on n and ∆k (the change in the number of model parameters between two models). As n or ∆k decrease, the significance level increases (the BIC becomes weaker as a statistical test) [10]. Figure 3 shows this effect for varying n. In [11] the authors show that penalty-based methods require problemspecific tuning and don’t generalize as well as other methods, such as cross validation. 3 Experiments Table 1 shows the results from running G-means and X-means on many large synthetic. On synthetic datasets with spherically distributed clusters, G-means and X-means do equally Table 1: Results for many synthetic datasets. We report distortion relative to the optimum distortion for the correct clustering (closer to one is better), and time is reported relative to k-means run with the correct k. For BIC, larger values are better, but it is clear that finding the correct clustering does not always coincide with finding a larger BIC. Items with a star are where X-means always chose the largest number of centers we allowed. dataset synthetic k=5 synthetic k=20 synthetic k=80 synthetic k=5 synthetic k=20 synthetic k=80 synthetic k=5 synthetic k=20 synthetic k=80 d 2 k found 9.1± 9.9 18.1± 3.2 20.1± 0.6 70.5±11.6 80.0± 0.2 171.7±23.7 5.0± 0.0 *20.0± 0.0 20.0± 0.1 *80.0± 0.0 80.2± 0.5 229.2±36.8 5.0± 0.0 *20.0± 0.0 20.0± 0.0 *80.0± 0.0 80.0± 0.0 171.5±10.9 method G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means 2 2 8 8 8 32 32 32 BIC(×104 ) -0.19±2.70 0.70±0.93 0.21±0.18 14.83±3.50 1.84±0.12 40.16±6.59 -0.74±0.16 -2.28±0.20 -0.18±0.17 14.36±0.21 1.45±0.20 52.28±9.26 -3.36±0.21 -27.92±0.22 -2.73±0.22 -11.13±0.23 -1.10±0.16 11.78±2.74 distortion(× optimal) 0.89± 0.23 0.37± 0.12 0.99± 0.01 9.45±28.02 1.00± 0.01 48.49±70.04 1.00± 0.00 0.47± 0.03 0.99± 0.00 0.47± 0.01 0.99± 0.00 0.57± 0.06 1.00± 0.00 0.76± 0.00 1.00± 0.00 0.76± 0.01 1.00± 0.00 0.84± 0.01 7 7 6 6 5 5 4 4 3 3 2 2 1 time(× k-means) 13.2 2.8 2.1 1.2 2.2 1.8 4.6 11.0 2.6 4.0 2.9 6.5 4.4 29.9 2.3 21.2 2.8 53.3 1 0 0 2 4 6 8 10 12 0 0 2 4 6 8 10 12 Figure 4: 2-d synthetic dataset with 5 true clusters. On the left, G-means correctly chooses 5 centers and deals well with non-spherical data. On the right, the BIC causes X-means to overfit the data, choosing 20 unevenly distributed clusters. well at finding the correct k and maximizing the BIC statistic, so we don’t show these results here. Most real-world data is not spherical, however. The synthetic datasets used here each have 5000 datapoints in d = 2/8/32 dimensions. The true ks are 5, 20, and 80. For each synthetic dataset type, we generate 30 datasets with the true center means chosen uniformly randomly from the unit hypercube, and choosing σ so that no two clusters are closer than 3σ apart. Each cluster is also given a transformation to make it non-spherical, by multiplying the data by a randomly chosen scaling and rotation matrix. We run G-means starting with one center. We allow X-means to search between 2 and 4k centers (where here k is the true number of clusters). The G-means algorithm clearly does better at finding the correct k on non-spherical data. Its results are closer to the true distortions and the correct ks. The BIC statistic that X-means uses has been formulated to maximize the likelihood for spherically-distributed data. Thus it overestimates the number of true clusters in non-spherical data. This is especially evident when the number of points per cluster is small, as in datasets with 80 true clusters. 1 2 2 3 3 4 4 Digit 0 1 Digit 0 5 5 6 6 7 7 8 8 9 9 5 10 15 20 25 30 Cluster 10 20 30 40 50 60 Cluster Figure 5: NIST and Pendigits datasets: correspondence between each digit (row) and each cluster (column) found by G-means. G-means did not have the labels, yet it found meaningful clusters corresponding with the labels. Because of this overestimation, X-means often hits our limit of 4k centers. Figure 4 shows an example of overfitting on a dataset with 5 true clusters. X-means chooses k = 20 while G-means finds all 5 true cluster centers. Also of note is that X-means does not distribute centers evenly among clusters; some clusters receive one center, but others receive many. G-means runs faster than X-means for 8 and 32 dimensions, which we expect, since the kd-tree structures which make X-means fast in low dimensions take time exponential in d, making them slow for more than 8 to 12 dimensions. All our code is written in Matlab; X-means is written in C. 3.1 Discovering true clusters in labeled data We tested these algorithms on two real-world datasets for handwritten digit recognition: the NIST dataset [12] and the Pendigits dataset [2]. The goal is to cluster the data without knowledge of the labels and measure how well the clustering captures the true labels. Both datasets have 10 true classes (digits 0-9). NIST has 60000 training examples and 784 dimensions (28×28 pixels). We use 6000 randomly chosen examples and we reduce the dimension to 50 by random projection (following [3]). The Pendigits dataset has 7984 examples and 16 dimensions; we did not change the data in any way. We cluster each dataset with G-means and X-means, and measure performance by comparing the cluster labels Lc with the true labels Lt . We define the partition quality (PQ) as kt kc kt 2 2 pq = i=1 j=1 p(i, j) i=1 p(i) where kt is the true number of classes, and kc is the number of clusters found by the algorithm. This metric is maximized when Lc induces the same partition of the data as Lt ; in other words, when all points in each cluster have the same true label, and the estimated k is the true k. The p(i, j) term is the frequency-based probability that a datapoint will be labeled i by Lt and j by Lc . This quality is normalized by the sum of true probabilities, squared. This statistic is related to the Rand statistic for comparing partitions [8]. For the NIST dataset, G-means finds 31 clusters in 30 seconds with a PQ score of 0.177. X-means finds 715 clusters in 4149 seconds, and 369 of these clusters contain only one point, indicating an overestimation problem with the BIC. X-means receives a PQ score of 0.024. For the Pendigits dataset, G-means finds 69 clusters in 30 seconds, with a PQ score of 0.196; X-means finds 235 clusters in 287 seconds, with a PQ score of 0.057. Figure 5 shows Hinton diagrams of the G-means clusterings of both datasets, showing that G-means succeeds at identifying the true clusters concisely, without aid of the labels. The confusions between different digits in the NIST dataset (seen in the off-diagonal elements) are common for other researchers using more sophisticated techniques, see [3]. 4 Discussion and conclusions We have introduced the new G-means algorithm for learning k based on a statistical test for determining whether datapoints are a random sample from a Gaussian distribution with arbitrary dimension and covariance matrix. The splitting uses dimension reduction and a powerful test for Gaussian fitness. G-means uses this statistical test as a wrapper around k-means to discover the number of clusters automatically. The only parameter supplied to the algorithm is the significance level of the statistical test, which can easily be set in a standard way. The G-means algorithm takes linear time and space (plus the cost of the splitting heuristic and test) in the number of datapoints and dimension, since k-means is itself linear in time and space. Empirically, the G-means algorithm works well at finding the correct number of clusters and the locations of genuine cluster centers, and we have shown it works well in moderately high dimensions. Clustering in high dimensions has been an open problem for many years. Recent research has shown that it may be preferable to use dimensionality reduction techniques before clustering, and then use a low-dimensional clustering algorithm such as k-means, rather than clustering in the high dimension directly. In [3] the author shows that using a simple, inexpensive linear projection preserves many of the properties of data (such as cluster distances), while making it easier to find the clusters. Thus there is a need for good-quality, fast clustering algorithms for low-dimensional data. Our work is a step in this direction. Additionally, recent image segmentation algorithms such as normalized cut [16, 13] are based on eigenvector computations on distance matrices. These “spectral” clustering algorithms still use k-means as a post-processing step to find the actual segmentation and they require k to be specified. Thus we expect G-means will be useful in combination with spectral clustering. References [1] Horst Bischof, Aleˇ Leonardis, and Alexander Selb. MDL principle for robust vector quantisation. Pattern analysis and applications, 2:59–72, s 1999. [2] C.L. Blake and C.J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/∼mlearn/MLRepository.html. [3] Sanjoy Dasgupta. Experiments with random projection. In Uncertainty in Artificial Intelligence: Proceedings of the Sixteenth Conference (UAI-2000), pages 143–151, San Francisco, CA, 2000. Morgan Kaufmann Publishers. [4] Gianna M. Del Corso. Estimating an eigenvector by the power method with a random start. SIAM Journal on Matrix Analysis and Applications, 18(4):913–937, 1997. [5] Chris Ding, Xiaofeng He, Hongyuan Zha, and Horst Simon. Adaptive dimension reduction for clustering high dimensional data. In Proceedings of the 2nd IEEE International Conference on Data Mining, 2002. [6] Fredrik Farnstrom, James Lewis, and Charles Elkan. Scalability for clustering algorithms revisited. SIGKDD Explorations, 2(1):51–57, 2000. [7] Peter J. Huber. Projection pursuit. Annals of Statistics, 13(2):435–475, June 1985. [8] L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2:193–218, 1985. [9] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264–323, 1999. [10] Robert E. Kass and Larry Wasserman. A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90(431):928–934, 1995. [11] Michael J. Kearns, Yishay Mansour, Andrew Y. Ng, and Dana Ron. An experimental and theoretical comparison of model selection methods. In Computational Learing Theory (COLT), pages 21–30, 1995. [12] Yann LeCun, L´ on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the e IEEE, 86(11):2278–2324, 1998. [13] Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. Neural Information Processing Systems, 14, 2002. [14] Dan Pelleg and Andrew Moore. X-means: Extending K-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conf. on Machine Learning, pages 727–734. Morgan Kaufmann, San Francisco, CA, 2000. [15] Peter Sand and Andrew Moore. Repairing faulty mixture models using density estimation. In Proceedings of the 18th International Conf. on Machine Learning. Morgan Kaufmann, San Francisco, CA, 2001. [16] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000. [17] M. A. Stephens. EDF statistics for goodness of fit and some comparisons. American Statistical Association, 69(347):730–737, September 1974.
5 0.16915534 24 nips-2003-An Iterative Improvement Procedure for Hierarchical Clustering
Author: David Kauchak, Sanjoy Dasgupta
Abstract: We describe a procedure which finds a hierarchical clustering by hillclimbing. The cost function we use is a hierarchical extension of the k-means cost; our local moves are tree restructurings and node reorderings. We show these can be accomplished efficiently, by exploiting special properties of squared Euclidean distances and by using techniques from scheduling algorithms. 1
6 0.16440465 87 nips-2003-Identifying Structure across Pre-partitioned Data
7 0.15104213 152 nips-2003-Pairwise Clustering and Graphical Models
8 0.14945552 150 nips-2003-Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering
9 0.14037846 82 nips-2003-Geometric Clustering Using the Information Bottleneck Method
10 0.09913335 79 nips-2003-Gene Expression Clustering with Functional Mixture Models
11 0.091360606 126 nips-2003-Measure Based Regularization
12 0.089572169 173 nips-2003-Semi-supervised Protein Classification Using Cluster Kernels
13 0.088082828 114 nips-2003-Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA
14 0.086588465 86 nips-2003-ICA-based Clustering of Genes from Microarray Expression Data
15 0.084153108 113 nips-2003-Learning with Local and Global Consistency
16 0.082826473 112 nips-2003-Learning to Find Pre-Images
17 0.082004473 108 nips-2003-Learning a Distance Metric from Relative Comparisons
18 0.079061039 128 nips-2003-Minimax Embeddings
19 0.078747861 71 nips-2003-Fast Embedding of Sparse Similarity Graphs
20 0.078497931 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
topicId topicWeight
[(0, -0.254), (1, -0.185), (2, -0.067), (3, 0.144), (4, -0.167), (5, 0.176), (6, -0.206), (7, 0.221), (8, -0.031), (9, 0.063), (10, 0.007), (11, -0.05), (12, -0.031), (13, -0.082), (14, -0.026), (15, 0.062), (16, 0.013), (17, 0.001), (18, -0.069), (19, 0.006), (20, 0.004), (21, -0.107), (22, 0.048), (23, -0.058), (24, -0.073), (25, -0.058), (26, 0.012), (27, 0.019), (28, 0.059), (29, -0.008), (30, -0.005), (31, -0.046), (32, 0.079), (33, -0.026), (34, -0.09), (35, -0.015), (36, -0.033), (37, 0.028), (38, -0.0), (39, 0.121), (40, -0.069), (41, -0.038), (42, 0.008), (43, 0.025), (44, 0.03), (45, -0.035), (46, 0.153), (47, 0.055), (48, 0.066), (49, -0.049)]
simIndex simValue paperId paperTitle
same-paper 1 0.9661122 46 nips-2003-Clustering with the Connectivity Kernel
Author: Bernd Fischer, Volker Roth, Joachim M. Buhmann
Abstract: Clustering aims at extracting hidden structure in dataset. While the problem of finding compact clusters has been widely studied in the literature, extracting arbitrarily formed elongated structures is considered a much harder problem. In this paper we present a novel clustering algorithm which tackles the problem by a two step procedure: first the data are transformed in such a way that elongated structures become compact ones. In a second step, these new objects are clustered by optimizing a compactness-based criterion. The advantages of the method over related approaches are threefold: (i) robustness properties of compactness-based criteria naturally transfer to the problem of extracting elongated structures, leading to a model which is highly robust against outlier objects; (ii) the transformed distances induce a Mercer kernel which allows us to formulate a polynomial approximation scheme to the generally N Phard clustering problem; (iii) the new method does not contain free kernel parameters in contrast to methods like spectral clustering or mean-shift clustering. 1
2 0.85745597 107 nips-2003-Learning Spectral Clustering
Author: Francis R. Bach, Michael I. Jordan
Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive a new cost function for spectral clustering based on a measure of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing this cost function with respect to the partition leads to a new spectral clustering algorithm. Minimizing with respect to the similarity matrix leads to an algorithm for learning the similarity matrix. We develop a tractable approximation of our cost function that is based on the power method of computing eigenvectors. 1
3 0.76589775 111 nips-2003-Learning the k in k-means
Author: Greg Hamerly, Charles Elkan
Abstract: When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. In this paper we present an improved algorithm for learning k while clustering. The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian. Two key advantages are that the hypothesis test does not limit the covariance of the data and does not compute a full covariance matrix. Additionally, G-means only requires one intuitive parameter, the standard statistical significance level α. We present results from experiments showing that the algorithm works well, and better than a recent method based on the BIC penalty for model complexity. In these experiments, we show that the BIC is ineffective as a scoring function, since it does not penalize strongly enough the model’s complexity. 1 Introduction and related work Clustering algorithms are useful tools for data mining, compression, probability density estimation, and many other important tasks. However, most clustering algorithms require the user to specify the number of clusters (called k), and it is not always clear what is the best value for k. Figure 1 shows examples where k has been improperly chosen. Choosing k is often an ad hoc decision based on prior knowledge, assumptions, and practical experience. Choosing k is made more difficult when the data has many dimensions, even when clusters are well-separated. Center-based clustering algorithms (in particular k-means and Gaussian expectationmaximization) usually assume that each cluster adheres to a unimodal distribution, such as Gaussian. With these methods, only one center should be used to model each subset of data that follows a unimodal distribution. If multiple centers are used to describe data drawn from one mode, the centers are a needlessly complex description of the data, and in fact the multiple centers capture the truth about the subset less well than one center. In this paper we present a simple algorithm called G-means that discovers an appropriate k using a statistical test for deciding whether to split a k-means center into two centers. We describe examples and present experimental results that show that the new algorithm 0.9 4 0.8 3 0.7 2 0.6 1 0.5 0 0.4 −1 0.3 −2 −3 0.2 0.1 −0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 −4 −3 −2 −1 0 1 2 3 Figure 1: Two clusterings where k was improperly chosen. Dark crosses are k-means centers. On the left, there are too few centers; five should be used. On the right, too many centers are used; one center is sufficient for representing the data. In general, one center should be used to represent one Gaussian cluster. is successful. This technique is useful and applicable for many clustering algorithms other than k-means, but here we consider only the k-means algorithm for simplicity. Several algorithms have been proposed previously to determine k automatically. Like our method, most previous methods are wrappers around k-means or some other clustering algorithm for fixed k. Wrapper methods use splitting and/or merging rules for centers to increase or decrease k as the algorithm proceeds. Pelleg and Moore [14] proposed a regularization framework for learning k, which they call X-means. The algorithm searches over many values of k and scores each clustering model using the so-called Bayesian Information Criterion [10]: BIC(C|X) = L(X|C) − p log n 2 where L(X|C) is the log-likelihood of the dataset X according to model C, p = k(d + 1) is the number of parameters in the model C with dimensionality d and k cluster centers, and n is the number of points in the dataset. X-means chooses the model with the best BIC score on the data. Aside from the BIC, other scoring functions are also available. Bischof et al. [1] use a minimum description length (MDL) framework, where the description length is a measure of how well the data are fit by the model. Their algorithm starts with a large value for k and removes centers (reduces k) whenever that choice reduces the description length. Between steps of reducing k, they use the k-means algorithm to optimize the model fit to the data. With hierarchical clustering algorithms, other methods may be employed to determine the best number of clusters. One is to build a merging tree (“dendrogram”) of the data based on a cluster distance metric, and search for areas of the tree that are stable with respect to inter- and intra-cluster distances [9, Section 5.1]. This method of estimating k is best applied with domain-specific knowledge and human intuition. 2 The Gaussian-means (G-means) algorithm The G-means algorithm starts with a small number of k-means centers, and grows the number of centers. Each iteration of the algorithm splits into two those centers whose data appear not to come from a Gaussian distribution. Between each round of splitting, we run k-means on the entire dataset and all the centers to refine the current solution. We can initialize with just k = 1, or we can choose some larger value of k if we have some prior knowledge about the range of k. G-means repeatedly makes decisions based on a statistical test for the data assigned to each center. If the data currently assigned to a k-means center appear to be Gaussian, then we want to represent that data with only one center. However, if the same data do not appear Algorithm 1 G-means(X, α) 1: Let C be the initial set of centers (usually C ← {¯}). x 2: C ← kmeans(C, X). 3: Let {xi |class(xi ) = j} be the set of datapoints assigned to center cj . 4: Use a statistical test to detect if each {xi |class(xi ) = j} follow a Gaussian distribution (at confidence level α). 5: If the data look Gaussian, keep cj . Otherwise replace cj with two centers. 6: Repeat from step 2 until no more centers are added. to be Gaussian, then we want to use multiple centers to model the data properly. The algorithm will run k-means multiple times (up to k times when finding k centers), so the time complexity is at most O(k) times that of k-means. The k-means algorithm implicitly assumes that the datapoints in each cluster are spherically distributed around the center. Less restrictively, the Gaussian expectation-maximization algorithm assumes that the datapoints in each cluster have a multidimensional Gaussian distribution with a covariance matrix that may or may not be fixed, or shared. The Gaussian distribution test that we present below are valid for either covariance matrix assumption. The test also accounts for the number of datapoints n tested by incorporating n in the calculation of the critical value of the test (see Equation 2). This prevents the G-means algorithm from making bad decisions about clusters with few datapoints. 2.1 Testing clusters for Gaussian fit To specify the G-means algorithm fully we need a test to detect whether the data assigned to a center are sampled from a Gaussian. The alternative hypotheses are • H0 : The data around the center are sampled from a Gaussian. • H1 : The data around the center are not sampled from a Gaussian. If we accept the null hypothesis H0 , then we believe that the one center is sufficient to model its data, and we should not split the cluster into two sub-clusters. If we reject H0 and accept H1 , then we want to split the cluster. The test we use is based on the Anderson-Darling statistic. This one-dimensional test has been shown empirically to be the most powerful normality test that is based on the empirical cumulative distribution function (ECDF). Given a list of values xi that have been converted to mean 0 and variance 1, let x(i) be the ith ordered value. Let zi = F (x(i) ), where F is the N (0, 1) cumulative distribution function. Then the statistic is A2 (Z) = − 1 n n (2i − 1) [log(zi ) + log(1 − zn+1−i )] − n (1) i=1 Stephens [17] showed that for the case where µ and σ are estimated from the data (as in clustering), we must correct the statistic according to A2 (Z) ∗ = A2 (Z)(1 + 4/n − 25/(n2 )) (2) Given a subset of data X in d dimensions that belongs to center c, the hypothesis test proceeds as follows: 1. Choose a significance level α for the test. 2. Initialize two centers, called “children” of c. See the text for good ways to do this. 3. Run k-means on these two centers in X. This can be run to completion, or to some early stopping point if desired. Let c1 , c2 be the child centers chosen by k-means. 4. Let v = c1 − c2 be a d-dimensional vector that connects the two centers. This is the direction that k-means believes to be important for clustering. Then project X onto v: xi = xi , v /||v||2 . X is a 1-dimensional representation of the data projected onto v. Transform X so that it has mean 0 and variance 1. 5. Let zi = F (x(i) ). If A2 (Z) is in the range of non-critical values at confidence ∗ level α, then accept H0 , keep the original center, and discard {c1 , c2 }. Otherwise, reject H0 and keep {c1 , c2 } in place of the original center. A primary contribution of this work is simplifying the test for Gaussian fit by projecting the data to one dimension where the test is simple to apply. The authors of [5] also use this approach for online dimensionality reduction during clustering. The one-dimensional representation of the data allows us to consider only the data along the direction that kmeans has found to be important for separating the data. This is related to the problem of projection pursuit [7], where here k-means searches for a direction in which the data appears non-Gaussian. We must choose the significance level of the test, α, which is the desired probability of making a Type I error (i.e. incorrectly rejecting H0 ). It is appropriate to use a Bonferroni adjustment to reduce the chance of making Type I errors over multiple tests. For example, if we want a 0.01 chance of making a Type I error in 100 tests, we should apply a Bonferroni adjustment to make each test use α = 0.01/100 = 0.0001. To find k final centers the G-means algorithm makes k statistical tests, so the Bonferroni correction does not need to be extreme. In our tests, we always use α = 0.0001. We consider two ways to initialize the two child centers. Both approaches initialize with c ± m, where c is a center and m is chosen. The first method chooses m as a random d-dimensional vector such that ||m|| is small compared to the distortion of the data. A second method finds the main principal component s of the data (having eigenvalue λ), and chooses m = s 2λ/π. This deterministic method places the two centers in their expected locations under H0 . The principal component calculations require O(nd2 + d3 ) time and O(d2 ) space, but since we only want the main principal component, we can use fast methods like the power method, which takes time that is at most linear in the ratio of the two largest eigenvalues [4]. In this paper we use principal-component-based splitting. 2.2 An example Figure 2 shows a run of the G-means algorithm on a synthetic dataset with two true clusters and 1000 points, using α = 0.0001. The critical value for the Anderson-Darling test is 1.8692 for this confidence level. Starting with one center, after one iteration of G-means, we have 2 centers and the A2 statistic is 38.103. This is much larger than the critical value, ∗ so we reject H0 and accept this split. On the next iteration, we split each new center and repeat the statistical test. The A2 values for the two splits are 0.386 and 0.496, both of ∗ which are well below the critical value. Therefore we accept H0 for both tests, and discard these splits. Thus G-means gives a final answer of k = 2. 2.3 Statistical power Figure 3 shows the power of the Anderson-Darling test, as compared to the BIC. Lower is better for both plots. We run 1000 tests for each data point plotted for both plots. In the left 14 14 14 13 13 13 12 12 12 11 11 11 10 10 10 9 9 9 8 8 8 7 7 7 6 6 6 5 5 4 4 0 2 4 6 8 10 12 5 4 0 2 4 6 8 10 12 0 2 4 6 8 10 12 Figure 2: An example of running G-means for three iterations on a 2-dimensional dataset with two true clusters and 1000 points. Starting with one center (left plot), G-means splits into two centers (middle). The test for normality is significant, so G-means rejects H0 and keeps the split. After splitting each center again (right), the test values are not significant, so G-means accepts H0 for both tests and does not accept these splits. The middle plot is the G-means answer. See the text for further details. 1 1 G-means X-means 0.8 P(Type II error) P(Type I error) 0.8 G-means X-means 0.6 0.4 0.2 0.6 0.4 0.2 0 0 0 30 60 90 120 150 number of datapoints 180 210 0 30 60 90 120 150 number of datapoints 180 210 Figure 3: A comparison of the power of the Anderson-Darling test versus the BIC. For the AD test we fix the significance level (α = 0.0001), while the BIC’s significance level depends on n. The left plot shows the probability of incorrectly splitting (Type I error) one true 2-d cluster that is 5% elliptical. The right plot shows the probability of incorrectly not splitting two true clusters separated by 5σ (Type II error). Both plots are functions of n. Both plots show that the BIC overfits (splits clusters) when n is small. plot, for each test we generate n datapoints from a single true Gaussian distribution, and then plot the frequency with which BIC and G-means will choose k = 2 rather than k = 1 (i.e. commit a Type I error). BIC tends to overfit by choosing too many centers when the data is not strictly spherical, while G-means does not. This is consistent with the tests of real-world data in the next section. While G-means commits more Type II errors when n is small, this prevents it from overfitting the data. The BIC can be considered a likelihood ratio test, but with a significance level that cannot be fixed. The significance level instead varies depending on n and ∆k (the change in the number of model parameters between two models). As n or ∆k decrease, the significance level increases (the BIC becomes weaker as a statistical test) [10]. Figure 3 shows this effect for varying n. In [11] the authors show that penalty-based methods require problemspecific tuning and don’t generalize as well as other methods, such as cross validation. 3 Experiments Table 1 shows the results from running G-means and X-means on many large synthetic. On synthetic datasets with spherically distributed clusters, G-means and X-means do equally Table 1: Results for many synthetic datasets. We report distortion relative to the optimum distortion for the correct clustering (closer to one is better), and time is reported relative to k-means run with the correct k. For BIC, larger values are better, but it is clear that finding the correct clustering does not always coincide with finding a larger BIC. Items with a star are where X-means always chose the largest number of centers we allowed. dataset synthetic k=5 synthetic k=20 synthetic k=80 synthetic k=5 synthetic k=20 synthetic k=80 synthetic k=5 synthetic k=20 synthetic k=80 d 2 k found 9.1± 9.9 18.1± 3.2 20.1± 0.6 70.5±11.6 80.0± 0.2 171.7±23.7 5.0± 0.0 *20.0± 0.0 20.0± 0.1 *80.0± 0.0 80.2± 0.5 229.2±36.8 5.0± 0.0 *20.0± 0.0 20.0± 0.0 *80.0± 0.0 80.0± 0.0 171.5±10.9 method G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means 2 2 8 8 8 32 32 32 BIC(×104 ) -0.19±2.70 0.70±0.93 0.21±0.18 14.83±3.50 1.84±0.12 40.16±6.59 -0.74±0.16 -2.28±0.20 -0.18±0.17 14.36±0.21 1.45±0.20 52.28±9.26 -3.36±0.21 -27.92±0.22 -2.73±0.22 -11.13±0.23 -1.10±0.16 11.78±2.74 distortion(× optimal) 0.89± 0.23 0.37± 0.12 0.99± 0.01 9.45±28.02 1.00± 0.01 48.49±70.04 1.00± 0.00 0.47± 0.03 0.99± 0.00 0.47± 0.01 0.99± 0.00 0.57± 0.06 1.00± 0.00 0.76± 0.00 1.00± 0.00 0.76± 0.01 1.00± 0.00 0.84± 0.01 7 7 6 6 5 5 4 4 3 3 2 2 1 time(× k-means) 13.2 2.8 2.1 1.2 2.2 1.8 4.6 11.0 2.6 4.0 2.9 6.5 4.4 29.9 2.3 21.2 2.8 53.3 1 0 0 2 4 6 8 10 12 0 0 2 4 6 8 10 12 Figure 4: 2-d synthetic dataset with 5 true clusters. On the left, G-means correctly chooses 5 centers and deals well with non-spherical data. On the right, the BIC causes X-means to overfit the data, choosing 20 unevenly distributed clusters. well at finding the correct k and maximizing the BIC statistic, so we don’t show these results here. Most real-world data is not spherical, however. The synthetic datasets used here each have 5000 datapoints in d = 2/8/32 dimensions. The true ks are 5, 20, and 80. For each synthetic dataset type, we generate 30 datasets with the true center means chosen uniformly randomly from the unit hypercube, and choosing σ so that no two clusters are closer than 3σ apart. Each cluster is also given a transformation to make it non-spherical, by multiplying the data by a randomly chosen scaling and rotation matrix. We run G-means starting with one center. We allow X-means to search between 2 and 4k centers (where here k is the true number of clusters). The G-means algorithm clearly does better at finding the correct k on non-spherical data. Its results are closer to the true distortions and the correct ks. The BIC statistic that X-means uses has been formulated to maximize the likelihood for spherically-distributed data. Thus it overestimates the number of true clusters in non-spherical data. This is especially evident when the number of points per cluster is small, as in datasets with 80 true clusters. 1 2 2 3 3 4 4 Digit 0 1 Digit 0 5 5 6 6 7 7 8 8 9 9 5 10 15 20 25 30 Cluster 10 20 30 40 50 60 Cluster Figure 5: NIST and Pendigits datasets: correspondence between each digit (row) and each cluster (column) found by G-means. G-means did not have the labels, yet it found meaningful clusters corresponding with the labels. Because of this overestimation, X-means often hits our limit of 4k centers. Figure 4 shows an example of overfitting on a dataset with 5 true clusters. X-means chooses k = 20 while G-means finds all 5 true cluster centers. Also of note is that X-means does not distribute centers evenly among clusters; some clusters receive one center, but others receive many. G-means runs faster than X-means for 8 and 32 dimensions, which we expect, since the kd-tree structures which make X-means fast in low dimensions take time exponential in d, making them slow for more than 8 to 12 dimensions. All our code is written in Matlab; X-means is written in C. 3.1 Discovering true clusters in labeled data We tested these algorithms on two real-world datasets for handwritten digit recognition: the NIST dataset [12] and the Pendigits dataset [2]. The goal is to cluster the data without knowledge of the labels and measure how well the clustering captures the true labels. Both datasets have 10 true classes (digits 0-9). NIST has 60000 training examples and 784 dimensions (28×28 pixels). We use 6000 randomly chosen examples and we reduce the dimension to 50 by random projection (following [3]). The Pendigits dataset has 7984 examples and 16 dimensions; we did not change the data in any way. We cluster each dataset with G-means and X-means, and measure performance by comparing the cluster labels Lc with the true labels Lt . We define the partition quality (PQ) as kt kc kt 2 2 pq = i=1 j=1 p(i, j) i=1 p(i) where kt is the true number of classes, and kc is the number of clusters found by the algorithm. This metric is maximized when Lc induces the same partition of the data as Lt ; in other words, when all points in each cluster have the same true label, and the estimated k is the true k. The p(i, j) term is the frequency-based probability that a datapoint will be labeled i by Lt and j by Lc . This quality is normalized by the sum of true probabilities, squared. This statistic is related to the Rand statistic for comparing partitions [8]. For the NIST dataset, G-means finds 31 clusters in 30 seconds with a PQ score of 0.177. X-means finds 715 clusters in 4149 seconds, and 369 of these clusters contain only one point, indicating an overestimation problem with the BIC. X-means receives a PQ score of 0.024. For the Pendigits dataset, G-means finds 69 clusters in 30 seconds, with a PQ score of 0.196; X-means finds 235 clusters in 287 seconds, with a PQ score of 0.057. Figure 5 shows Hinton diagrams of the G-means clusterings of both datasets, showing that G-means succeeds at identifying the true clusters concisely, without aid of the labels. The confusions between different digits in the NIST dataset (seen in the off-diagonal elements) are common for other researchers using more sophisticated techniques, see [3]. 4 Discussion and conclusions We have introduced the new G-means algorithm for learning k based on a statistical test for determining whether datapoints are a random sample from a Gaussian distribution with arbitrary dimension and covariance matrix. The splitting uses dimension reduction and a powerful test for Gaussian fitness. G-means uses this statistical test as a wrapper around k-means to discover the number of clusters automatically. The only parameter supplied to the algorithm is the significance level of the statistical test, which can easily be set in a standard way. The G-means algorithm takes linear time and space (plus the cost of the splitting heuristic and test) in the number of datapoints and dimension, since k-means is itself linear in time and space. Empirically, the G-means algorithm works well at finding the correct number of clusters and the locations of genuine cluster centers, and we have shown it works well in moderately high dimensions. Clustering in high dimensions has been an open problem for many years. Recent research has shown that it may be preferable to use dimensionality reduction techniques before clustering, and then use a low-dimensional clustering algorithm such as k-means, rather than clustering in the high dimension directly. In [3] the author shows that using a simple, inexpensive linear projection preserves many of the properties of data (such as cluster distances), while making it easier to find the clusters. Thus there is a need for good-quality, fast clustering algorithms for low-dimensional data. Our work is a step in this direction. Additionally, recent image segmentation algorithms such as normalized cut [16, 13] are based on eigenvector computations on distance matrices. These “spectral” clustering algorithms still use k-means as a post-processing step to find the actual segmentation and they require k to be specified. Thus we expect G-means will be useful in combination with spectral clustering. References [1] Horst Bischof, Aleˇ Leonardis, and Alexander Selb. MDL principle for robust vector quantisation. Pattern analysis and applications, 2:59–72, s 1999. [2] C.L. Blake and C.J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/∼mlearn/MLRepository.html. [3] Sanjoy Dasgupta. Experiments with random projection. In Uncertainty in Artificial Intelligence: Proceedings of the Sixteenth Conference (UAI-2000), pages 143–151, San Francisco, CA, 2000. Morgan Kaufmann Publishers. [4] Gianna M. Del Corso. Estimating an eigenvector by the power method with a random start. SIAM Journal on Matrix Analysis and Applications, 18(4):913–937, 1997. [5] Chris Ding, Xiaofeng He, Hongyuan Zha, and Horst Simon. Adaptive dimension reduction for clustering high dimensional data. In Proceedings of the 2nd IEEE International Conference on Data Mining, 2002. [6] Fredrik Farnstrom, James Lewis, and Charles Elkan. Scalability for clustering algorithms revisited. SIGKDD Explorations, 2(1):51–57, 2000. [7] Peter J. Huber. Projection pursuit. Annals of Statistics, 13(2):435–475, June 1985. [8] L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2:193–218, 1985. [9] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264–323, 1999. [10] Robert E. Kass and Larry Wasserman. A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90(431):928–934, 1995. [11] Michael J. Kearns, Yishay Mansour, Andrew Y. Ng, and Dana Ron. An experimental and theoretical comparison of model selection methods. In Computational Learing Theory (COLT), pages 21–30, 1995. [12] Yann LeCun, L´ on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the e IEEE, 86(11):2278–2324, 1998. [13] Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. Neural Information Processing Systems, 14, 2002. [14] Dan Pelleg and Andrew Moore. X-means: Extending K-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conf. on Machine Learning, pages 727–734. Morgan Kaufmann, San Francisco, CA, 2000. [15] Peter Sand and Andrew Moore. Repairing faulty mixture models using density estimation. In Proceedings of the 18th International Conf. on Machine Learning. Morgan Kaufmann, San Francisco, CA, 2001. [16] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000. [17] M. A. Stephens. EDF statistics for goodness of fit and some comparisons. American Statistical Association, 69(347):730–737, September 1974.
4 0.73940474 73 nips-2003-Feature Selection in Clustering Problems
Author: Volker Roth, Tilman Lange
Abstract: A novel approach to combining clustering and feature selection is presented. It implements a wrapper strategy for feature selection, in the sense that the features are directly selected by optimizing the discriminative power of the used partitioning algorithm. On the technical side, we present an efficient optimization algorithm with guaranteed local convergence property. The only free parameter of this method is selected by a resampling-based stability analysis. Experiments with real-world datasets demonstrate that our method is able to infer both meaningful partitions and meaningful subsets of features. 1
5 0.68205827 24 nips-2003-An Iterative Improvement Procedure for Hierarchical Clustering
Author: David Kauchak, Sanjoy Dasgupta
Abstract: We describe a procedure which finds a hierarchical clustering by hillclimbing. The cost function we use is a hierarchical extension of the k-means cost; our local moves are tree restructurings and node reorderings. We show these can be accomplished efficiently, by exploiting special properties of squared Euclidean distances and by using techniques from scheduling algorithms. 1
6 0.60463214 152 nips-2003-Pairwise Clustering and Graphical Models
7 0.58159494 87 nips-2003-Identifying Structure across Pre-partitioned Data
8 0.57033551 150 nips-2003-Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering
9 0.56516874 82 nips-2003-Geometric Clustering Using the Information Bottleneck Method
10 0.50111592 71 nips-2003-Fast Embedding of Sparse Similarity Graphs
11 0.47426158 108 nips-2003-Learning a Distance Metric from Relative Comparisons
12 0.43381202 126 nips-2003-Measure Based Regularization
13 0.40082046 48 nips-2003-Convex Methods for Transduction
14 0.38364002 173 nips-2003-Semi-supervised Protein Classification Using Cluster Kernels
15 0.37820771 113 nips-2003-Learning with Local and Global Consistency
16 0.37305528 86 nips-2003-ICA-based Clustering of Genes from Microarray Expression Data
17 0.35570189 79 nips-2003-Gene Expression Clustering with Functional Mixture Models
18 0.32701957 114 nips-2003-Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA
19 0.31508479 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms
20 0.30521485 128 nips-2003-Minimax Embeddings
topicId topicWeight
[(0, 0.035), (5, 0.014), (30, 0.016), (35, 0.039), (53, 0.097), (71, 0.071), (76, 0.034), (85, 0.049), (91, 0.536), (99, 0.016)]
simIndex simValue paperId paperTitle
1 0.99755347 154 nips-2003-Perception of the Structure of the Physical World Using Unknown Multimodal Sensors and Effectors
Author: D. Philipona, J.k. O'regan, J.-p. Nadal, Olivier Coenen
Abstract: Is there a way for an algorithm linked to an unknown body to infer by itself information about this body and the world it is in? Taking the case of space for example, is there a way for this algorithm to realize that its body is in a three dimensional world? Is it possible for this algorithm to discover how to move in a straight line? And more basically: do these questions make any sense at all given that the algorithm only has access to the very high-dimensional data consisting of its sensory inputs and motor outputs? We demonstrate in this article how these questions can be given a positive answer. We show that it is possible to make an algorithm that, by analyzing the law that links its motor outputs to its sensory inputs, discovers information about the structure of the world regardless of the devices constituting the body it is linked to. We present results from simulations demonstrating a way to issue motor orders resulting in “fundamental” movements of the body as regards the structure of the physical world. 1
2 0.99416989 110 nips-2003-Learning a World Model and Planning with a Self-Organizing, Dynamic Neural System
Author: Marc Toussaint
Abstract: We present a connectionist architecture that can learn a model of the relations between perceptions and actions and use this model for behavior planning. State representations are learned with a growing selforganizing layer which is directly coupled to a perception and a motor layer. Knowledge about possible state transitions is encoded in the lateral connectivity. Motor signals modulate this lateral connectivity and a dynamic field on the layer organizes a planning process. All mechanisms are local and adaptation is based on Hebbian ideas. The model is continuous in the action, perception, and time domain.
3 0.98844355 87 nips-2003-Identifying Structure across Pre-partitioned Data
Author: Zvika Marx, Ido Dagan, Eli Shamir
Abstract: We propose an information-theoretic clustering approach that incorporates a pre-known partition of the data, aiming to identify common clusters that cut across the given partition. In the standard clustering setting the formation of clusters is guided by a single source of feature information. The newly utilized pre-partition factor introduces an additional bias that counterbalances the impact of the features whenever they become correlated with this known partition. The resulting algorithmic framework was applied successfully to synthetic data, as well as to identifying text-based cross-religion correspondences. 1 In t ro d u c t i o n The standard task of feature-based data clustering deals with a single set of elements that are characterized by a unified set of features. The goal of the clustering task is to identify implicit constructs, or themes, within the clustered set, grouping together elements that are characterized similarly by the features. In recent years there has been growing interest in more complex clustering settings, in which additional information is incorporated [1], [2]. Several such extensions ([3]-[5]) are based on the information bottleneck (IB) framework [6], which facilitates coherent information-theoretic representation of different information types. In a recent line of research we have investigated the cross-dataset clustering task [7], [8]. In this setting, some inherent a-priori partition of the clustered data to distinct subsets is given. The clustering goal it to identify corresponding (analogous) structures that cut across the different subsets, while ignoring internal structures that characterize individual subsets. To accomplish this task, those features that commonly characterize elements across the different subsets guide the clustering process, while within-subset regularities are neutralized. In [7], we presented a distance-based hard clustering algorithm for the coupledclustering problem, in which the clustered data is pre-partitioned to two subsets. In [8], our setting, generalized to pre-partitions of any number of subsets, was addressed by a heuristic extension of the probabilistic IB algorithm, yielding improved empirical results. Specifically, the algorithm in [8] was based on a modification of the IB stable-point equation, which amplified the impact of features characterizing a formed cluster across all, or most, subsets. This paper describes an information-theoretic framework that motivates and extends the algorithm proposed in [8]. The given pre-partitioning is represented via a probability distribution variable, which may represent “soft” pre-partitioning of the data, versus the strictly disjoint subsets assumed in the earlier cross-dataset framework. Further, we present a new functional that captures the cross-partition motivation. From the new functional, we derive a stable-point equation underlying our algorithmic framework in conjunction with the corresponding IB equation. Our algorithm was tested empirically on synthetic data and on a real-world textbased task that aimed to identify corresponding themes across distinct religions. We have cross-clustered five sets of keywords that were extracted from topical corpora of texts about Buddhism, Christianity, Hinduism, Islam and Judaism. In distinction from standard clustering results, our algorithm reveals themes that are common to all religions, such as sacred writings, festivals, narratives and myths and theological principles, and avoids topical clusters that correspond to individual religions (for example, ‘Christmas’ and ‘Easter’ are clustered together with ‘Ramadan’ rather than with ‘Church’). Finally, we have paid specific attention to the framework of clustering with side information [4]. While this approach was presented for a somewhat different mindset, it might be used directly to address clustering across pre-partitioned data. We compare the technical details of the two approaches and demonstrate empirically that clustering with side information does not seem appropriate for the kind of cross-partition tasks that we explored. 2 Th e In fo rmat i o n B ot t len eck M et h od Probabilistic (“soft”) data clustering outputs, for each element x of the set being clustered and each cluster c, an assignment probability p(c|x). The IB method [6] interprets probabilistic clustering as lossy data compression. The given data is represented by a random variable X ranging over the clustered elements. X is compressed through another random variable C, ranging over the clusters. Every element x is characterized by conditional probability distribution p(Y|x), where Y is a third random variable taking the members y of a given set of features as values. The IB method formalizes the clustering task as minimizing the IB functional: L(IB) = I(C; X) − β I(C; Y) . (1) As known from information theory (Ch. 13 of [9]), minimizing the mutual information I(C; X) optimizes distorted compression rate. A complementary bias to maximize I(C; Y) is interpreted in [6] as articulating the level of relevance of Y to the obtained clustering, inferred from the level by which C can predict Y. β is a free parameter counterbalancing the two biases. It is shown in [6] that p(c|x) values that minimize L(IB) satisfy the following equation: p(c|x) = 1 p (c )e −β DKL [ p ( Y |x )|| p (Y |c ) ] , z( β , x) (2) where DKL stands for the Kullback-Leibler (KL) divergence, or relative entropy, between two distributions and z(β ,x) is a normalization function over C. Eq. (2) implies that, optimally, x is assigned to c in proportion to their KL distance in a feature distribution space, where the distribution p(Y|c) takes the role of a Start at time t = 0 and iterate the following update-steps, till convergence: IB1: initialize p t (c|x) randomly or arbitrarily −β DKL [ p (Y | x )|| pt −1 (Y |c ) ] pt (c|x) ∝ IB2: pt (c) = IB3: pt (y|c) = pt −1 (c ) e ∑ x (t = 0) (t > 0) p t (c | x ) p ( x ) 1 ∑ pt ( c | x) p ( y | x ) p ( x) p t (c ) x Figure 1: The Information Bottleneck iterative algorithm (with fixed β and |C|). representative, or centroid, of c. The feature variable Y is hence utilized as the (exclusive) means to guide clustering, beyond the random nature of compression. Figure 1 presents the IB iterative algorithm for a fixed value of β . The IB1 update step follows Eq. (2). The other two steps, which are derived from the IB functional as well, estimate the p(c) and p(y|c) values required for the next iteration. The algorithm converges to a local minimum of the IB functional. The IB setting, particularly the derivation of steps IB1 and IB3 of the algorithm, assumes that Y and C are independent given X, that is: I(C; Y|X) = ∑x p(x) I(C|x; Y|x) = 0. The balancing parameter β affects the number of distinct clusters being formed in a manner that resembles (inverse) temperature in physical systems. The higher β is (i.e., the stronger the bias to construct C that predicts Y well), more distinct clusters are required for encoding the data. For each |C| = 2, 3, …, there is a minimal β value, enabling the formation of |C| distinct clusters. Setting β to be smaller than this critical value corresponding to the current |C| would result in two or more clusters that are identical to one another. Based on this, the iterative algorithm is applied repeatedly within a gradual cooling-like (deterministic annealing) scheme: starting with random initialization of the p0 (c|x)'s, generate two clusters with the critical β value, found empirically, for |C| = 2. Then, use a perturbation on the obtained two-cluster configuration to initialize the p0(c|x)'s for a larger set of clusters and execute additional runs of the algorithm to identify the critical β value for the larger |C|. And so on: each output configuration is used as a basis for a more granular one. The final outcome is a “soft hierarchy” of probabilistic clusters. 3 Cro ss- p a rt i t i o n Clu st eri n g Cross-partition (CP) clustering introduces a factor – a pre-given partition of the clustered data – additional to what considered in a standard clustering setting. For representing this factor we introduce the pre-partitioning variable W, ranging over all parts w of the pre-given partition. Every data element x is associated with W through a given probability distribution p(W|x). Our goal is to cluster the data, so that the clusters C would not be correlated with W. We notice that Y, which is intended to direct the formation of clusters, might be a-priori correlated with W, so the formed clusters might end up being correlated with W as well. Our method aims at eliminating this aspect of Y. 3.1 I n f or ma t i o n D e f oc us i n g As noted, some of the information conveyed by Y characterizes structures correlated with W, while the other part of the information characterizes the target cross-W structures. We are interested in detecting the latter while filtering out the former. However, there is no direct a-priori separation between the two parts of the Ymediated information. Our strategy in tackling this difficulty is: we follow in general Y's directions, as the IB method does, while avoiding Y's impact whenever it entails undesired inter-dependencies of C and W. Our strategy implies conflicting biases with regard to the mutual information I(C,Y): it should be maximized in order to form meaningful clusters, but be minimized as well in the specific context where Y entails C–W dependencies. Accordingly, we propose a computational procedure directed by two distinct cost-terms in tandem. The first one is the IB functional (Eq. 1), introducing the bias to maximize I(C,Y). With this bias alone, Y might dictate (or “explain”, in retrospect) substantial C–W dependencies, implying a low I(C;W|Y) value. 1 Hence, the guideline of preventing Y from accounting for C–W dependencies is realized through an opposing bias of maximizing I(C;W|Y) = ∑y p(y) I(C|y; W|y). The second cost term – the Information Defocusing (ID) functional – consequently counterbalances minimization of I(C,Y) against the new bias: L(ID) = I(C; Y) − η I(C;W|Y) , (3) where η is a free parameter articulating the tradeoff between the biases. The ID functional captures our goal of reducing the impact of Y selectively: “defocusing” a specific aspect of the information Y conveys: the information correlated with W. In a like manner to the stable-point equation of the IB functional (Eq. 2), we derive the following stable-point equation for the ID functional: η p ( w) 1 p ( c )∏ w p ( y | c, w) η +1 , p(c|y) = z (η , y ) (4) where z(η,y) is a normalization function over C. The derivation relies on an additional assumption, I(C;W) = 0, imposing the intended independence between C and W (the detailed derivation will be described elsewhere). The intuitive interpretation of Eq. (4) is as follows: a feature y is to be associated with a cluster c in proportion to a weighted, though flattened, geometric mean of the “W-projected centroids” p(y|c,w), priored by p(c). 2 This scheme overweighs y's that contribute to c evenly across W. Thus, clusters satisfying Eq. (4) are situated around centroids biased towards evenly contributing features. The higher η is, heavier emphasis is put on suppressing disagreements between the w's. For η → ∞ a plain weighted geometric-mean scheme is obtained. The inclusion of a step derived from Eq. (4) in our algorithm (see below) facilitates convergence on a configuration with centroids dominated by features that are evenly distributed across W. 3.2 T h e Cr os s - p a r t i t i on C l us t e r i n g A l g or i t h m Our proposed cross partition (CP) clustering algorithm (Fig. 2) seeks a clustering configuration that optimizes simultaneously both the IB and ID functionals, 1 Notice that “Z explaining well the dependencies between A and B” is equivalent with “A and B sharing little information in common given Z”, i.e. low I(A;B|Z) . Complete conditional independence is exemplified in the IB framework, assuming I(C;Y|X) = 0. 2 Eq. (4) resembles our suggestion in [8] to compute a geometric average over the subsets; in the current paper this scheme is analytically derived from the ID functional. Start at time t = 0 and iterate the following update-steps, till convergence: CP1: Initialize p t (c|x) randomly or arbitrarily −β DKL [ p (Y | x )|| pt −1 (Y |c ) ] pt (c|x) ∝ CP2: pt (c) = CP3: p*t (y|c,w) = CP4: (t = 0) p t −1 (c ) e ∑ x (t > 0) p t (c | x ) p ( x ) 1 ∑ pt ( c | x ) p ( y | x ) p ( w | x ) p ( x ) p t ( c ) p ( w) x Initialize p*t (c) randomly or arbitrarily (t = 0) p*t (c) (t > 0) = ∑ y p *t −1 (c | y ) p ( y ) η CP5: p*t (c|y) ∝ p *t (c)∏w p *t ( y | c, w) η +1 CP6: pt (y|c) = p ( w) p *t (c | y ) p ( y ) p *t (c ) Figure 2: The cross-partition clustering iterative algorithm (with fixed β, η, and |C|). thus obtaining clusters that cut across the pre-given partition W. To this end, the algorithm interleaves an iterative computation of the stable-point equations, and the additional estimated parameters, for both functionals. Steps CP1, CP2 and CP6 correspond to the computations related to the IB functional, while steps CP3, CP4 and CP5, which compute a separate set of parameters (denoted by an asterisk), correspond to the ID functional. Figure 3 summarizes the roles of the two functionals in the dynamics of the CP algorithm. The two components of the iterative cycle are tied together in steps CP3 and CP6, in which parameters from one set are used as input to compute a parameter of other set. The derivation of step CP3 relies on an additional assumption, namely that C, Y and W are jointly independent given X. This assumption, which extends to W the underlying assumption of the IB setting that C and Y are independent given X, still entails the IB stable point equation. At convergence, the stable point equations for both the IB and ID functionals are satisfied, each by its own set of parameters (in steps CP1 and CP5). The deterministic annealing scheme, which gradually increases β over repeated runs (see Sec. 2), is applied for the CP algorithm as well with η held fixed. For a given target number of clusters |C|, the algorithm empirically converges with a wide range of η values 3. I(C;X) ↓ IB β↑ I(C;Y) ↓ ID η↑ I(C; W|Y) I(C; Y; W|X) = 0 ← assumptions → I(C;W) = 0 Figure 3: The interplay of the IB and the ID functionals in the CP algorithm. High η values tend to dictate centroids with features that are unevenly distributed across W, resulting in shrinkage of some of the clusters. Further analysis will be provided in future work. 3 4 Exp e ri men t a l Resu lt s Our synthetic setting consisted of 75 virtual elements, evenly pre-partitioned into three 25-element parts denoted X 1 , X2 and X3 (in our formalism, for each clustered element x, p(w|x) = 1 holds for either w = 1, 2, or 3). On top of this pre-partition, we partitioned the data twice, getting two (exhaustive) clustering configurations: 1. Target cross-W clustering: five clusters, each with representatives from all X w's; 2. Masking within-w clustering: six clusters, each consisting of roughly half the elements of either X 1, X 2 or X3 with no representatives from the other X w's. Each cluster, of both configurations, was characterized by a designated subset of features. Masking clusters were designed to be more salient than target clusters: they had more designated features (60 vs. 48 per cluster, i.e., 360 vs. 240 in total) and their elements shared higher feature-element (virtual) co-occurrence counts with those designated features (900 vs. 450 per element-feature pair). Noise (random positive integer < 200) was added to all counts associating elements with their designated features (for both within-w and cross-W clusters), as well as to roughly quarter of the zero counts associating elements with the rest of the features. The plain IB method consistently produced configurations strongly correlated with the masking clustering, while the CP algorithm revealed the target configuration. We got (see Table 1A) almost perfect results in configurations of nearly equal-sized cross-W clusters, and somewhat less perfect reconstruction in configurations of diverging sizes (6, 9, 15, 21 and 24). Performance level was measured relatively to optimal target-output cluster match by the proportion of elements correctly assigned, where assignment of an element x follows its highest p(c|x). The results indicated were averaged over 200 runs. They were obtained for the optimal η, which was found to be higher in the diverging-sizes task. In the text-based task, the clustered elements – keywords – were automatically extracted from five distinct corpora addressing five religions: introductory web pages, online magazines, encyclopedic entries etc., all downloaded from the Internet. The clustered keyword set X was consequently pre-partitioned to disjoint subsets {X w} w∈W, one for each religion4 (|X w| ≈ 200 for each w). We conducted experiments simultaneously involving religion pairs as well as all five religions. We took the features Y to be a set of words that commonly occur within all five corpora (|Y| ≈ 7000). x–y co-occurrences were recorded within ±5-word sliding window truncated by sentence boundaries. η was fixed to a value (1.0) enabling the formation of 20 clusters in all settings. The obtained clusters revealed interesting cross religion themes (see Sec. 1). For instance, the cluster (one of nine) capturing the theme of sacred festivals: the three highest p(c/x) members within each religion were Full-moon, Ceremony, Celebration (Buddhism); Easter, Sunday, Christmas Table 1: Average correct assignment proportion scores for the synthetic task (A) and Jaccard-coefficient scores for the religion keyword classification task (B). A. Synthetic Data IB CP B. Religion Data IB Coupled Clustering [7] CP (cross-expert agreement on religion pairs .462±.232) equal-size clusters .305 .985 non-equal clusters .292 .827 4 religion pairs all five (one case) .200±.100 .220±.138 .407±.144 .104 ––––––– .167 A keyword x that appeared in the corpora of different religions was considered as a distinct element for each religion, so the Xw were kept disjointed. (Chrsitianity); Puja, Ceremony, Festival (Hinduism); Id-al-Fitr, Friday, Ramadan, (Islam); and Sukkoth, Shavuot, Rosh-Hodesh (Judaism). The closest cluster produced by the plain IB method was poorer by far, including Islamic Ramadan, and Id and Jewish Passover, Rosh-Hashanah and Sabbath (which our method ranked high too), but no single related term from the other religions. Our external evaluation standards were cross-religion keyword classes constructed manually by experts of comparative religion studies. One such expert classification involved all five religions, and eight classifications addressed religions in pairs. Each of the eight religion-pair classifications was contributed by two independent experts using the same keywords, so we could also assess the agreement between experts. As an overlap measure we employed the Jaccard coefficient: the number of element pairs co-assigned together by both one of the evaluated clusters and one of the expert classes, divided by the number of pairs co-assigned by either our clusters or the expert (or both). We did not assume the number of expert classes is known in advance (as done in the synthetic experiments), so the results were averaged over all configurations of 2–16 cluster hierarchy, for each experiment. The results shown in Table 1B – clear improvement relatively to plain IB and the distance-based coupled clustering [7] – are, however, persistent when the number of clusters is taken to be equal to the number of classes, or if only the best score in hierarchy is considered. The level of cross-expert agreement indicates that our results are reasonably close to the scores expected in such subjective task. 5 C o mp a ri so n t o R e la t ed W o r k The information bottleneck framework served as the basis for several approaches that represent additional information in their clustering setting. The multivariate information bottleneck (MIB) adapts the IB framework for networks of multiple variables [3]. However, all variables in such networks are either compressed (like X), or predicted (like Y). The incorporation of an empirical variable to be masked or defocused in the sense of our W is not possible. Including such variables in the MIB framework might be explored in future work. Particularly relevant to our work is the IB-based method for extracting relevant constructs with side information [4]. This approach addresses settings in which two different types of features are distinguished explicitly: relevant versus irrelevant ones, denoted by Y+ and Y−. Both types of features are incorporated within a single functional to be minimized: L(IB-side-info) = I(C; X) − β ( I(C; Y +) − γ I(C; Y−) ), which directly drives clustering to de-correlate C and Y−. Formally, our setting can be mapped to the side information setting by regarding the pre-partition W simply as the additional set of irrelevant features, giving symmetric (and opposite) roles to W and Y. However, it seems that this view does not address properly the desired cross-partition setting. In our setting, it is assumed that clustering should be guided in general by Y, while W should only neutralize particular information within Y that would otherwise yield the undesired correlation between C and W (as described in Section 3.1). For that reason, the defocusing functional tie the three variables together by conditioning the de-correlation of C and W on Y, while its underlying assumption ensures the global de-correlation. Indeed, our method was found empirically superior on the cross-dataset task. The side-information IB method (the iterative algorithm with best scoring γ) achieves correct assignment proportion of 0.52 in both synthetic tasks, where our method scored 0.99 and 0.83 (see Table 1A) and, in the religion-pair keyword classification task, Jaccard coefficient improved by 20% relatively to plain IB (compared to our 100% improvement, see Table 1B). 6 C o n c lu si o n s This paper addressed the problem of clustering a pre-partitioned dataset, aiming to detect new internal structures that are not correlated with the pre-given partition but rather cut across its components. The proposed framework extends the cross-dataset clustering algorithm [8], providing better formal grounding and representing any pre-given (soft) partition of the dataset. Supported by empirical evidence, we suggest that our framework is better suited for the cross-partition task than applying the side-information framework [4], which was originally developed to address a somewhat different setting. We also demonstrate substantial empirical advantage over the distance-based coupled-clustering algorithm [7]. As an applied real-world goal, the algorithm successfully detects cross-religion commonalities. This goal exemplifies the more general notion of detecting analogies across different systems, which is a somewhat vague and non-consensual task and therefore especially challenging for a computational framework. Our approach can be viewed as an initial step towards principled identification of “hidden” commonalities between substantially different real world systems, while suppressing the vast majority of attributes that are irrelevant for the analogy. Further research may study the role of defocusing in supervised learning, where some pre-given partitions might mask the role of underlying discriminative features. Additionally, it would be interesting to explore relationships to other disciplines, e.g., network information theory ([9], Ch. 14) which provided motivation for the side-information approach. Finally, both frameworks (ours and side-information) suggest the importance of dealing wisely with information that should not dictate the clustering output directly. A c k n ow l e d g me n t s We thank Yuval Krymolowski for helpful discussions and Tiina Mahlamäki, Eitan Reich and William Shepard, for contributing the religion keyword classifications. References [1] Hofmann, T. (2001) Unsupervised learning by probabilistic latent semantic analysis. Journal of Machine Learning Research, 41(1):177-196. [2] Wagstaff K., Cardie C., Rogers S. and Schroedl S., 2001. Constrained K-Means clustering with background knowledge. The 18th International Conference on Machine Learning (ICML-2001), pp 577-584. [3] Friedman N., Mosenzon O., Slonim N. & Tishby N. (2002) Multivariate information bottleneck. The 17th conference on Uncertainty in Artificial Intelligence (UAI-17), pp. 152161. [4] Chechik G. & Tishby N. (2002) Extracting relevant structures with side information. Advances in Neural Processing Information Systems 15 (NIPS'02). [5] Globerson, A., Chechik G. & Tishby N. (2003) Sufficient dimensionality reduction. Journal of Machine Learning Research, 3:1307-1331. [6] Tishby, N., Pereira, F. C. & Bialek, W. (1999) The information bottleneck method. The 37th Annual Allerton Conference on Communication, Control, and Computing, pp. 368-379. [7] Marx, Z., Dagan, I., Buhmann, J. M. & Shamir E. (2002) Coupled clustering: A method for detecting structural correspondence. Journal of Machine Learning Research, 3:747-780. [8] Dagan, I., Marx, Z. & Shamir E (2002) Cross-dataset clustering: Revealing corresponding themes across multiple corpora. Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pp. 15-21. [9] Cover T. M. & Thomas J. A. (1991) Elements of Information Theory. Sons, Inc., New York, New York. John Wiley &
4 0.9870922 105 nips-2003-Learning Near-Pareto-Optimal Conventions in Polynomial Time
Author: Xiaofeng Wang, Tuomas Sandholm
Abstract: We study how to learn to play a Pareto-optimal strict Nash equilibrium when there exist multiple equilibria and agents may have different preferences among the equilibria. We focus on repeated coordination games of non-identical interest where agents do not know the game structure up front and receive noisy payoffs. We design efficient near-optimal algorithms for both the perfect monitoring and the imperfect monitoring setting(where the agents only observe their own payoffs and the joint actions). 1
Author: G.C. Littlewort, M.S. Bartlett, I.R. Fasel, J. Chenu, T. Kanda, H. Ishiguro, J.R. Movellan
Abstract: Computer animated agents and robots bring a social dimension to human computer interaction and force us to think in new ways about how computers could be used in daily life. Face to face communication is a real-time process operating at a time scale of less than a second. In this paper we present progress on a perceptual primitive to automatically detect frontal faces in the video stream and code them with respect to 7 dimensions in real time: neutral, anger, disgust, fear, joy, sadness, surprise. The face finder employs a cascade of feature detectors trained with boosting techniques [13, 2]. The expression recognizer employs a novel combination of Adaboost and SVM’s. The generalization performance to new subjects for a 7-way forced choice was 93.3% and 97% correct on two publicly available datasets. The outputs of the classifier change smoothly as a function of time, providing a potentially valuable representation to code facial expression dynamics in a fully automatic and unobtrusive manner. The system was deployed and evaluated for measuring spontaneous facial expressions in the field in an application for automatic assessment of human-robot interaction.
same-paper 6 0.97189015 46 nips-2003-Clustering with the Connectivity Kernel
7 0.83321983 26 nips-2003-An MDP-Based Approach to Online Mechanism Design
8 0.82483375 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems
9 0.7969256 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning
10 0.78125799 68 nips-2003-Eye Movements for Reward Maximization
11 0.76769328 43 nips-2003-Bounded Invariance and the Formation of Place Fields
12 0.76675969 84 nips-2003-How to Combine Expert (and Novice) Advice when Actions Impact the Environment?
13 0.75815094 73 nips-2003-Feature Selection in Clustering Problems
14 0.75324082 55 nips-2003-Distributed Optimization in Adaptive Networks
15 0.7503469 14 nips-2003-A Nonlinear Predictive State Representation
16 0.74312991 107 nips-2003-Learning Spectral Clustering
17 0.74015898 79 nips-2003-Gene Expression Clustering with Functional Mixture Models
18 0.73673087 19 nips-2003-Algorithms for Interdependent Security Games
19 0.73416609 81 nips-2003-Geometric Analysis of Constrained Curves
20 0.73343235 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games