nips nips2003 nips2003-73 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Volker Roth, Tilman Lange
Abstract: A novel approach to combining clustering and feature selection is presented. It implements a wrapper strategy for feature selection, in the sense that the features are directly selected by optimizing the discriminative power of the used partitioning algorithm. On the technical side, we present an efficient optimization algorithm with guaranteed local convergence property. The only free parameter of this method is selected by a resampling-based stability analysis. Experiments with real-world datasets demonstrate that our method is able to infer both meaningful partitions and meaningful subsets of features.
Reference: text
sentIndex sentText sentNum sentScore
1 A novel approach to combining clustering and feature selection is presented. [sent-5, score-0.511]
2 It implements a wrapper strategy for feature selection, in the sense that the features are directly selected by optimizing the discriminative power of the used partitioning algorithm. [sent-6, score-0.676]
3 The only free parameter of this method is selected by a resampling-based stability analysis. [sent-8, score-0.347]
4 Experiments with real-world datasets demonstrate that our method is able to infer both meaningful partitions and meaningful subsets of features. [sent-9, score-0.364]
5 1 Introduction The task of selecting relevant features in classification problems can be viewed as one of the most fundamental problems in the field of machine learning. [sent-10, score-0.31]
6 A major motivation for selecting a subset of features from which a learning rule is constructed is the interest in sparse and interpretable rules, emphasizing only a few relevant variables. [sent-11, score-0.263]
7 In supervised learning scenarios, feature selection has been studied widely in the literature. [sent-12, score-0.322]
8 From a conceptual viewpoint, wrapper approaches are clearly advantageous, since the features are selected by optimizing the discriminative power of the finally used classifier. [sent-15, score-0.511]
9 Selecting features in unsupervised learning scenarios is a much harder problem, due to the absence of class labels that would guide the search for relevant information. [sent-16, score-0.233]
10 The common strategy of most approaches is the use of an iterated stepwise procedure: in the first step a set of hypothetical partitions is extracted (the clustering step), and in the second step features are scored for relevance (the relevance determination step). [sent-20, score-0.984]
11 A possible shortcoming is the way of combining these two steps in an “ad hoc” manner: usually the relevance determination mechanism implements a filter approach and does not take into account the properties of the clustering method used. [sent-21, score-0.56]
12 Usual scoring methods make an implicit independence assumption, while ignoring feature correlations. [sent-22, score-0.2]
13 It is thus of particular interest to combine wrapper selection strategies and clustering methods. [sent-23, score-0.521]
14 It combines a Gaussian mixture model with a Bayesian feature selection principle. [sent-25, score-0.363]
15 The usual combinatorial problems involved with wrapper approaches are overcome by using a Bayesian marginalization mechanism. [sent-26, score-0.24]
16 The only free model parameter is selected by a resampling-based stability analysis. [sent-28, score-0.347]
17 A comparison with ground-truth labels in control experiments indicates that the selected models induce sample clusters and feature subsets which both provide a clear interpretation. [sent-30, score-0.483]
18 Our approach to combining clustering and feature selection is based on a Gaussian mixture model, which is optimized by way of the classical expectation-maximization (EM) algorithm. [sent-31, score-0.552]
19 In order to incorporate the feature selection mechanism, the M-step is first reformulated as a linear discriminant analysis (LDA) problem, which makes use of the “fuzzy labels” estimated in the preceding E-step. [sent-32, score-0.407]
20 We finally arrive at an M-step with integrated feature selection mechanism. [sent-37, score-0.361]
21 2 Clustering and Bayesian relevance determination Gaussian mixtures and LDA. [sent-38, score-0.215]
22 The simultaneous estimation of scores and regression coefficients β constitutes the optimal scoring problem: minimize $M(\theta, \beta) = \|Z\theta - X\beta\|_2^2$ (2) under the constraint $\frac{1}{N}\|Z\theta\|_2^2 = 1$. [sent-49, score-0.207]
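A minimal sketch of how the optimal scoring objective and its constraint can be evaluated for given inputs. It assumes the usual reading of the symbols in optimal scoring: Z is an N×K class-indicator (or fuzzy-label) matrix, θ a vector of K scores, X the N×d data matrix and β a d-vector of coefficients. The extracted text does not define these symbols, so this reading and the helper names `matvec`, `optimal_scoring_objective` and `score_norm` are assumptions, and the numbers are toy values.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def optimal_scoring_objective(Z, theta, X, beta):
    """M(theta, beta) = ||Z theta - X beta||_2^2."""
    zt = matvec(Z, theta)
    xb = matvec(X, beta)
    return sum((a - b) ** 2 for a, b in zip(zt, xb))

def score_norm(Z, theta):
    """(1/N) ||Z theta||_2^2, which the constraint requires to equal 1."""
    zt = matvec(Z, theta)
    return sum(a * a for a in zt) / len(zt)

# Toy data (assumed): four samples, two classes, one feature.
Z = [[1, 0], [1, 0], [0, 1], [0, 1]]
theta = [1.0, -1.0]
X = [[2.0], [1.0], [-1.0], [-2.0]]
beta = [0.5]

print(score_norm(Z, theta))                         # constraint value
print(optimal_scoring_objective(Z, theta, X, beta)) # squared residual
```

In this toy case the scores satisfy the constraint, and the objective is just the squared residual of regressing the scored labels on the single feature.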
23 Returning from a standard LDA-problem to the above weighted and augmented problem, it turns out that it is not necessary to explicitly replicate the observations: the optimal scoring version of LDA allows an implicit solution of the augmented problem that still uses only N observations. [sent-52, score-0.192]
24 We now focus on incorporating the automatic feature selection mechanism into the EM-algorithm. [sent-58, score-0.427]
25 The feature selection mechanism can now be incorporated in the M-step by imposing a certain constraint on the linear regression. [sent-63, score-0.444]
26 The main idea of incorporating an automatic feature selection mechanism consists of replacing the Gaussian prior with an automatic relevance determination (ARD) prior of the form $p(\beta \mid \vartheta) = \prod_i N(0, \vartheta_i^{-1}) \propto \exp[-\sum_i \vartheta_i \beta_i^2]$. [sent-66, score-0.723]
27 Taking into account feature correlations, estimation of Si necessarily involves searching the space of all possible subsets of features containing the i-th one. [sent-73, score-0.315]
28 In the Bayesian ARD formalism, this combinatorial explosion of the search space is overcome by relaxing the binary selection variable to a positive real-valued variance of a Gaussian prior over each component of the coefficient vector. [sent-74, score-0.297]
29 Since space here precludes a detailed discussion of $\ell_1$-constrained regression problems, the reader is referred to [12], where a highly efficient algorithm with guaranteed global convergence has been proposed. [sent-77, score-0.303]
30 3 Model selection Our model has only one free parameter, namely the value of the $\ell_1$-constraint κ. [sent-86, score-0.251]
31 In the following we describe a method for selecting κ by observing the stability of data partitions. [sent-87, score-0.26]
32 For each of the partitions which we have identified as “stable”, we then examine the fluctuations involved in the feature selection process. [sent-88, score-0.589]
33 It should be noticed that the concept of measuring the stability of solutions as a means of model selection has been successfully applied to several unsupervised learning problems, see e. [sent-89, score-0.384]
34 We will usually find many potential splits of a dataset, depending on how many features are selected: if we select only one feature, we are likely to find many competing hypotheses for splits. [sent-92, score-0.273]
35 To obtain a quantitative measure of stability, we propose the following procedure: run the class discovery method once, corrupt the data vectors by a small amount of noise, repeat the grouping procedure, and calculate the Hamming distance between the two partitions as a measure of (in-)stability. [sent-96, score-0.317]
36 For computing Hamming distances, the partitions are viewed as vectors containing the cluster labels. [sent-97, score-0.56]
37 Simply taking the average stability over many such two-sample comparisons, however, would not allow an adequate handling of situations where there are two equally likely stable solutions, of which the clustering algorithm randomly selects one. [sent-98, score-0.456]
38 In such situations, the averaged stability will be very low, despite the fact that there exist two stable splitting hypotheses. [sent-99, score-0.325]
39 In order to make Hamming distances suitable for this purpose, we have to consider the inherent permutation symmetry of the clustering process: a cluster called “1” in the first partition can be called “2” in the second one. [sent-107, score-0.722]
40 When computing the pairwise Hamming distances, we thus have to minimize over the two possible permutations of cluster labels. [sent-108, score-0.345]
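The permutation-aware Hamming distance described in the sentences above can be sketched as follows. This is an illustration, not the authors' code: the function names `hamming` and `partition_distance` and the toy label vectors are assumptions, and partitions are taken to be lists of cluster labels drawn from {0, 1}.

```python
from itertools import permutations

def hamming(a, b):
    """Number of positions at which two equal-length label vectors differ."""
    if len(a) != len(b):
        raise ValueError("partitions must label the same samples")
    return sum(x != y for x, y in zip(a, b))

def partition_distance(a, b, labels=(0, 1)):
    """Hamming distance minimized over all relabelings of partition b."""
    best = len(a)
    for perm in permutations(labels):
        mapping = dict(zip(labels, perm))
        relabeled = [mapping[y] for y in b]
        best = min(best, hamming(a, relabeled))
    return best

p1 = [0, 0, 0, 1, 1, 1]
p2 = [1, 1, 1, 0, 0, 1]   # same split with swapped names, one sample moved

print(hamming(p1, p2))             # 5: naive comparison is misled by the naming
print(partition_distance(p1, p2))  # 1: only one sample really changed cluster
```

For two clusters there are only two permutations to try, which matches the text; the loop over `permutations(labels)` is written generally but becomes expensive as the number of labels grows.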
41 Steps (iv) and (v) need some further explanation: the problem of identifying compact groups in datasets which are represented by pairwise distances can be solved by optimizing the pairwise clustering cost function [7]. [sent-109, score-0.627]
42 We iteratively increase the number of clusters (which is a free parameter in the pairwise clustering functional) until the average dissimilarity in each group does not exceed a predefined threshold. [sent-110, score-0.489]
43 For the clusters which are considered homogeneous, we observe their populations, and out of all models investigated we choose the one leading to the partition cluster of largest size. [sent-113, score-0.557]
44 For this dominating cluster, we then select a prototypical partition. [sent-114, score-0.229]
45 For selecting such prototypical partitions in pairwise clustering problems, we refer the reader to [13], where it is shown that the pairwise clustering problem can be equivalently restated as a k-means problem in a suitably chosen embedding space. [sent-115, score-1.076]
46 Each partition is represented as a vector in this space. [sent-116, score-0.3]
47 Figure 1: Model selection: schematic work-flow for one fixed value of the constraint κ (stages shown: noisy resamples 1 to 100 of samples 1 to n, partitions, Hamming distances, dissimilarity matrix, embedding & clustering, histogramming cluster populations by cluster index). [sent-123, score-0.431]
48 This property allows us to select as representatives those partitions which are closest to the partition cluster centroids. [sent-124, score-0.73]
49 The whole work-flow of model selection is summarized schematically in figure 1. [sent-125, score-0.202]
50 In a first experiment we test our method for the task of clustering digits from the USPS handwritten digits database. [sent-128, score-0.327]
51 Figure 2: Sample images of digits ’6’ and ’7’ from the USPS database. [sent-130, score-0.22]
52 The 16 × 16 gray-value images of the digits are treated as 256-dimensional vectors. [sent-131, score-0.22]
53 For this experiment, we extracted a subset of 200 images, consisting of randomly selected digits ’6’ and ’7’. [sent-132, score-0.234]
54 Based on this dataset, we first selected the most stable model according to the model selection procedure described in section 3. [sent-133, score-0.403]
55 We observed the stability of the solutions for different constraint values κ on the interval [0. [sent-134, score-0.241]
56 The histograms show the relative population of partition clusters. [sent-171, score-0.231]
57 The solid line indicates the average pairwise Hamming distance between partitions (divided by 100). [sent-172, score-0.366]
58 Figure 3 exemplarily shows the outcomes of the stability analysis: in the left panel, the solution is so highly constrained that on average only 2. [sent-173, score-0.275]
59 Subsets of only two features seem to be too small for building a consistent splitting hypothesis. [sent-176, score-0.202]
60 Even the most populated partition cluster (index 3) contains only 30% of all partitions. [sent-177, score-0.416]
61 In this case, the dominating partition cluster (cluster no. [sent-185, score-0.555]
62 0, in a next step we select the representative partition (= the one nearest to the centroid) of the dominating partition cluster (no. [sent-188, score-0.836]
63 This partition splits the dataset into two clusters, which highly agree with the true labeling. [sent-190, score-0.341]
64 In the upper part of figure 4, both the inferred labels and the true labels are depicted by horizontal bar diagrams. [sent-191, score-0.247]
65 The lower panel of this figure shows several rows, each of which represents one automatically selected feature. [sent-193, score-0.294]
66 Each of the 200 grey-value coded pixel blocks in a row indicates the feature value for one sample. [sent-194, score-0.196]
67 Lower panel: each row consists of grey-value coded values of the selected features for all samples (1 pixel block = 1 sample). [sent-199, score-0.336]
68 We are not only interested in the stability of splittings of the dataset, but also in the stability of the feature selection process. [sent-200, score-0.686]
69 In order to quantify this latter stability, we return to the dominating partition cluster no. [sent-201, score-0.555]
70 1 in the middle panel of figure 3, and for each of the 73 partitions in this cluster, we count how often a particular feature has been selected. [sent-202, score-0.529]
71 The 22 features (pixels) which are selected in at least one half of the partitions are plotted in the second panel of figure 5. [sent-203, score-0.402]
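The feature-selection stability count described above (how often each feature is selected across the partitions of the dominating cluster, keeping those selected in at least half of them) might be computed along these lines. The function names and the toy runs are hypothetical; each run is assumed to be the set of feature indices with non-zero weight.

```python
def selection_frequency(selected_sets, n_features):
    """Fraction of partitions in which each feature index was selected."""
    counts = [0] * n_features
    for selected in selected_sets:
        for j in set(selected):
            counts[j] += 1
    return [c / len(selected_sets) for c in counts]

def stable_features(selected_sets, n_features, min_fraction=0.5):
    """Indices of features selected in at least min_fraction of the partitions."""
    freq = selection_frequency(selected_sets, n_features)
    return [j for j, f in enumerate(freq) if f >= min_fraction]

# Toy example (assumed): four partitions over four candidate features.
runs = [{0, 2}, {0, 2, 3}, {0, 1}, {0, 2}]

print(selection_frequency(runs, 4))  # [1.0, 0.25, 0.75, 0.25]
print(stable_features(runs, 4))      # [0, 2]
```

The frequencies play the role of the grey-value coded selection stability: 1.0 corresponds to a feature that is selected in every partition.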
72 The selection stability is grey-value coded (black = 100% stable). [sent-204, score-0.46]
73 To the left and to the right we have again plotted two typical sample images of both classes from the database. [sent-205, score-0.201]
74 A comparison with the selected features leads us to the conclusion that we were not only able to find reasonable clusters, but also selected exactly those discriminative features which we would have expected in this control experiment. [sent-206, score-0.59]
75 In a second experiment we applied our method to the problem of clustering face images. [sent-213, score-0.189]
76 html, where all images can be viewed in higher quality. [sent-226, score-0.198]
77 Since it appears to be infeasible to work directly on the set of pixels of the high-resolution images, in a first step we extracted the 10 leading eigenfaces of the total dataset (eigenfaces are simply the eigenvectors $v_i$ of the images treated as pixel-wise vectorial objects). [sent-228, score-0.466]
78 Note that the original images $I_j$ can be (partially) reconstructed from this truncated eigenvector expansion as $I_j = \sum_{i=1}^{10} v_i v_i^\top I_j$ (assuming the image vectors $I_j$ to be centered). [sent-231, score-0.221]
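As a small illustration of the truncated eigenvector expansion, the sketch below projects a centered image vector onto a few orthonormal basis vectors and sums the contributions back. The helper names `dot` and `reconstruct` and the three-pixel toy "image" are assumptions made for the example; real eigenfaces would come from an eigendecomposition of the image data, which is not shown here.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def reconstruct(image, eigvecs):
    """Sum of v_i (v_i^T image) over the given orthonormal eigenvectors."""
    out = [0.0] * len(image)
    for v in eigvecs:
        coeff = dot(v, image)        # v_i^T I_j
        for k, vk in enumerate(v):
            out[k] += coeff * vk     # accumulate coeff * v_i
    return out

# Toy basis (assumed): two orthonormal directions in a 3-pixel space.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
image = [2.0, -1.0, 0.5]             # already centered

print(reconstruct(image, basis))     # [2.0, -1.0, 0.0]
```

The component along the discarded third direction is lost, which is why the reconstruction is only partial when the expansion is truncated.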
79 Figure 8 depicts the outcome of the model selection procedure. [sent-244, score-0.202]
80 The left panel shows both the number of extracted features and the relative population of the largest partition cluster for different values of κ. [sent-245, score-0.812]
81 A detailed analysis of the selected features within the dominating partition cluster (no. [sent-250, score-0.815]
82 2, 3 and 7 are all selected with a stability of more than 98%. [sent-252, score-0.298]
83 Figure 8, axis labels: Hamming distance; relative population of dominating cluster (x10). [sent-260, score-0.446]
84 Left: average number of selected features and relative population of the dominating partition cluster vs. [sent-269, score-0.876]
85 Right: partition clusters for optimal model with κ = 1. [sent-271, score-0.268]
86 In every M-step of our algorithm, a linear discriminant analysis is performed, in which a weight vector β for all features is computed (due to the incorporated feature selection mechanism, most weights will be exactly zero). [sent-272, score-0.551]
87 Our method can, thus, be interpreted as a clustering method that finds a partition and simultaneously produces a “sparse” Fisherface which consists of a linear combination of the most discriminative eigenfaces. [sent-274, score-0.429]
88 Figure 9 shows the derived Fisherface, reconstructed from the weight vector of the representative partition (no. [sent-275, score-0.304]
89 Figure 9: The inferred Fisherface as a linear combination of 3 eigenfaces (eigenface 2, eigenface 3, eigenface 7). [sent-284, score-0.382]
90 5 in the right panel of figure 8) splits the images in two groups, which again highly coincide with the original groups of frontal and profile faces. [sent-286, score-0.485]
91 The success of the clustering method can be understood by reconstructing the original images from the inferred Fisherface (which is nothing but a weighted and truncated eigenvector reconstruction of the original images). [sent-291, score-0.392]
92 Figure 10 shows the same images as in figure 6, this time, however, reconstructed from the Fisherface. [sent-292, score-0.221]
93 Again, the whole set of reconstructed images can be viewed on our supplementary material web page in higher quality. [sent-295, score-0.323]
94 5 Conclusions The problem tackled in this paper consists of simultaneously clustering objects and automatically extracting subsets of features which are most discriminative for this object partition. [sent-297, score-0.49]
95 Some approaches have been proposed in the literature, most of which, however, bear several inherent shortcomings, such as an unclear probabilistic model, the simplifying assumption of features as being uncorrelated, or the absence of a plausible model selection strategy. [sent-298, score-0.393]
96 It has a clear interpretation in terms of a constrained Gaussian mixture model, which combines a clustering method with a Bayesian inference mechanism for automatically selecting relevant features. [sent-301, score-0.504]
97 The model has only one free parameter, κ, for which we propose a stability-based model selection procedure. [sent-303, score-0.251]
98 Experiments demonstrate that this method is able to correctly infer partitions and meaningful feature sets. [sent-304, score-0.387]
99 Our method currently only implements partitions of the object set into two clusters. [sent-305, score-0.312]
100 Such iterative splits have been successfully applied to the problem of simultaneously clustering gene expression datasets and selecting relevant genes. [sent-307, score-0.482]
wordName wordTfidf (topN-words)
[('partitions', 0.267), ('hamming', 0.261), ('cluster', 0.246), ('selection', 0.202), ('clustering', 0.189), ('stability', 0.182), ('partition', 0.17), ('images', 0.151), ('features', 0.144), ('panel', 0.142), ('dominating', 0.139), ('fisherface', 0.137), ('wrapper', 0.13), ('ard', 0.13), ('eigenfaces', 0.13), ('relevance', 0.12), ('feature', 0.12), ('lda', 0.116), ('selected', 0.116), ('eigenface', 0.11), ('lasso', 0.109), ('pairwise', 0.099), ('clusters', 0.098), ('determination', 0.095), ('mislabeled', 0.095), ('stable', 0.085), ('discriminant', 0.085), ('splits', 0.082), ('gure', 0.081), ('scoring', 0.08), ('selecting', 0.078), ('coded', 0.076), ('bayesian', 0.076), ('groups', 0.073), ('reconstructed', 0.07), ('discriminative', 0.07), ('distances', 0.07), ('digits', 0.069), ('regression', 0.068), ('resample', 0.065), ('reader', 0.064), ('representative', 0.064), ('mechanism', 0.063), ('population', 0.061), ('roth', 0.061), ('constraint', 0.059), ('splitting', 0.058), ('coef', 0.058), ('augmented', 0.056), ('overcome', 0.056), ('constrained', 0.056), ('supplementary', 0.055), ('penalized', 0.054), ('marginalization', 0.054), ('dissimilarity', 0.054), ('inferred', 0.052), ('dataset', 0.052), ('optimizing', 0.051), ('subsets', 0.051), ('bar', 0.05), ('sample', 0.05), ('discovery', 0.05), ('index', 0.05), ('free', 0.049), ('extracted', 0.049), ('depicted', 0.049), ('labels', 0.048), ('restated', 0.048), ('shortcoming', 0.048), ('homogeneity', 0.048), ('hyperpriors', 0.048), ('inherent', 0.047), ('faces', 0.047), ('select', 0.047), ('viewed', 0.047), ('gene', 0.046), ('datasets', 0.046), ('hastie', 0.045), ('implements', 0.045), ('usps', 0.045), ('nally', 0.045), ('lange', 0.043), ('stirling', 0.043), ('prototypical', 0.043), ('leading', 0.043), ('automatic', 0.042), ('mixture', 0.041), ('pixels', 0.041), ('relevant', 0.041), ('functional', 0.041), ('zurich', 0.04), ('buhmann', 0.04), ('prior', 0.039), ('arrive', 0.039), ('convergence', 0.039), ('guaranteed', 
0.039), ('iv', 0.038), ('highly', 0.037), ('automatically', 0.036), ('shrinkage', 0.036)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000008 73 nips-2003-Feature Selection in Clustering Problems
Author: Volker Roth, Tilman Lange
Abstract: A novel approach to combining clustering and feature selection is presented. It implements a wrapper strategy for feature selection, in the sense that the features are directly selected by optimizing the discriminative power of the used partitioning algorithm. On the technical side, we present an efficient optimization algorithm with guaranteed local convergence property. The only free parameter of this method is selected by a resampling-based stability analysis. Experiments with real-world datasets demonstrate that our method is able to infer both meaningful partitions and meaningful subsets of features.
2 0.22485489 46 nips-2003-Clustering with the Connectivity Kernel
Author: Bernd Fischer, Volker Roth, Joachim M. Buhmann
Abstract: Clustering aims at extracting hidden structure in datasets. While the problem of finding compact clusters has been widely studied in the literature, extracting arbitrarily formed elongated structures is considered a much harder problem. In this paper we present a novel clustering algorithm which tackles the problem by a two-step procedure: first the data are transformed in such a way that elongated structures become compact ones. In a second step, these new objects are clustered by optimizing a compactness-based criterion. The advantages of the method over related approaches are threefold: (i) robustness properties of compactness-based criteria naturally transfer to the problem of extracting elongated structures, leading to a model which is highly robust against outlier objects; (ii) the transformed distances induce a Mercer kernel which allows us to formulate a polynomial approximation scheme to the generally NP-hard clustering problem; (iii) the new method does not contain free kernel parameters in contrast to methods like spectral clustering or mean-shift clustering.
3 0.18985045 111 nips-2003-Learning the k in k-means
Author: Greg Hamerly, Charles Elkan
Abstract: When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. In this paper we present an improved algorithm for learning k while clustering. The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian. Two key advantages are that the hypothesis test does not limit the covariance of the data and does not compute a full covariance matrix. Additionally, G-means only requires one intuitive parameter, the standard statistical significance level α. We present results from experiments showing that the algorithm works well, and better than a recent method based on the BIC penalty for model complexity. In these experiments, we show that the BIC is ineffective as a scoring function, since it does not penalize strongly enough the model’s complexity.
1 Introduction and related work
Clustering algorithms are useful tools for data mining, compression, probability density estimation, and many other important tasks. However, most clustering algorithms require the user to specify the number of clusters (called k), and it is not always clear what is the best value for k. Figure 1 shows examples where k has been improperly chosen. Choosing k is often an ad hoc decision based on prior knowledge, assumptions, and practical experience. Choosing k is made more difficult when the data has many dimensions, even when clusters are well-separated. Center-based clustering algorithms (in particular k-means and Gaussian expectation-maximization) usually assume that each cluster adheres to a unimodal distribution, such as Gaussian. With these methods, only one center should be used to model each subset of data that follows a unimodal distribution.
If multiple centers are used to describe data drawn from one mode, the centers are a needlessly complex description of the data, and in fact the multiple centers capture the truth about the subset less well than one center. In this paper we present a simple algorithm called G-means that discovers an appropriate k using a statistical test for deciding whether to split a k-means center into two centers. We describe examples and present experimental results that show that the new algorithm is successful. This technique is useful and applicable for many clustering algorithms other than k-means, but here we consider only the k-means algorithm for simplicity.
Figure 1: Two clusterings where k was improperly chosen. Dark crosses are k-means centers. On the left, there are too few centers; five should be used. On the right, too many centers are used; one center is sufficient for representing the data. In general, one center should be used to represent one Gaussian cluster.
Several algorithms have been proposed previously to determine k automatically. Like our method, most previous methods are wrappers around k-means or some other clustering algorithm for fixed k. Wrapper methods use splitting and/or merging rules for centers to increase or decrease k as the algorithm proceeds. Pelleg and Moore [14] proposed a regularization framework for learning k, which they call X-means. The algorithm searches over many values of k and scores each clustering model using the so-called Bayesian Information Criterion [10]: $\mathrm{BIC}(C|X) = L(X|C) - \frac{p}{2}\log n$, where L(X|C) is the log-likelihood of the dataset X according to model C, p = k(d + 1) is the number of parameters in the model C with dimensionality d and k cluster centers, and n is the number of points in the dataset. X-means chooses the model with the best BIC score on the data.
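The BIC score as defined in the preceding sentences can be written as a two-line function. The sketch below is only an illustration of that formula: the function names are hypothetical, and the log-likelihood values in the usage example are made up for demonstration, not results from any experiment.

```python
import math

def n_parameters(k, d):
    """Parameter count p = k(d + 1) as given in the text."""
    return k * (d + 1)

def bic(log_likelihood, k, d, n):
    """BIC(C|X) = L(X|C) - (p / 2) * log(n); larger is better."""
    return log_likelihood - 0.5 * n_parameters(k, d) * math.log(n)

# Hypothetical log-likelihoods for two fits of the same 2-d dataset.
n, d = 1000, 2
candidates = {1: -1200.0, 2: -1195.0}   # k -> made-up L(X|C)
scores = {k: bic(ll, k, d, n) for k, ll in candidates.items()}
best_k = max(scores, key=scores.get)

print(scores, best_k)
```

With these made-up numbers the small likelihood gain of the two-center model does not outweigh its larger penalty, so the one-center model scores higher.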
Aside from the BIC, other scoring functions are also available. Bischof et al. [1] use a minimum description length (MDL) framework, where the description length is a measure of how well the data are fit by the model. Their algorithm starts with a large value for k and removes centers (reduces k) whenever that choice reduces the description length. Between steps of reducing k, they use the k-means algorithm to optimize the model fit to the data. With hierarchical clustering algorithms, other methods may be employed to determine the best number of clusters. One is to build a merging tree (“dendrogram”) of the data based on a cluster distance metric, and search for areas of the tree that are stable with respect to inter- and intra-cluster distances [9, Section 5.1]. This method of estimating k is best applied with domain-specific knowledge and human intuition.
2 The Gaussian-means (G-means) algorithm
The G-means algorithm starts with a small number of k-means centers, and grows the number of centers. Each iteration of the algorithm splits into two those centers whose data appear not to come from a Gaussian distribution. Between each round of splitting, we run k-means on the entire dataset and all the centers to refine the current solution. We can initialize with just k = 1, or we can choose some larger value of k if we have some prior knowledge about the range of k. G-means repeatedly makes decisions based on a statistical test for the data assigned to each center. If the data currently assigned to a k-means center appear to be Gaussian, then we want to represent that data with only one center. However, if the same data do not appear to be Gaussian, then we want to use multiple centers to model the data properly.
Algorithm 1 G-means(X, α)
1: Let C be the initial set of centers (usually $C \leftarrow \{\bar{x}\}$).
2: C ← kmeans(C, X).
3: Let $\{x_i \mid \mathrm{class}(x_i) = j\}$ be the set of datapoints assigned to center $c_j$.
4: Use a statistical test to detect if each $\{x_i \mid \mathrm{class}(x_i) = j\}$ follows a Gaussian distribution (at confidence level α).
5: If the data look Gaussian, keep $c_j$. Otherwise replace $c_j$ with two centers.
6: Repeat from step 2 until no more centers are added.
The algorithm will run k-means multiple times (up to k times when finding k centers), so the time complexity is at most O(k) times that of k-means. The k-means algorithm implicitly assumes that the datapoints in each cluster are spherically distributed around the center. Less restrictively, the Gaussian expectation-maximization algorithm assumes that the datapoints in each cluster have a multidimensional Gaussian distribution with a covariance matrix that may or may not be fixed, or shared. The Gaussian distribution test that we present below is valid for either covariance matrix assumption. The test also accounts for the number of datapoints n tested by incorporating n in the calculation of the critical value of the test (see Equation 2). This prevents the G-means algorithm from making bad decisions about clusters with few datapoints.
2.1 Testing clusters for Gaussian fit
To specify the G-means algorithm fully we need a test to detect whether the data assigned to a center are sampled from a Gaussian. The alternative hypotheses are
• H0: The data around the center are sampled from a Gaussian.
• H1: The data around the center are not sampled from a Gaussian.
If we accept the null hypothesis H0, then we believe that the one center is sufficient to model its data, and we should not split the cluster into two sub-clusters. If we reject H0 and accept H1, then we want to split the cluster. The test we use is based on the Anderson-Darling statistic. This one-dimensional test has been shown empirically to be the most powerful normality test that is based on the empirical cumulative distribution function (ECDF). Given a list of values $x_i$ that have been converted to mean 0 and variance 1, let $x_{(i)}$ be the ith ordered value.
Let $z_i = F(x_{(i)})$, where F is the N(0, 1) cumulative distribution function. Then the statistic is $A^2(Z) = -\frac{1}{n}\sum_{i=1}^{n}(2i - 1)\,[\log(z_i) + \log(1 - z_{n+1-i})] - n$ (1). Stephens [17] showed that for the case where µ and σ are estimated from the data (as in clustering), we must correct the statistic according to $A^2_*(Z) = A^2(Z)(1 + 4/n - 25/n^2)$ (2). Given a subset of data X in d dimensions that belongs to center c, the hypothesis test proceeds as follows:
1. Choose a significance level α for the test.
2. Initialize two centers, called “children” of c. See the text for good ways to do this.
3. Run k-means on these two centers in X. This can be run to completion, or to some early stopping point if desired. Let $c_1, c_2$ be the child centers chosen by k-means.
4. Let $v = c_1 - c_2$ be a d-dimensional vector that connects the two centers. This is the direction that k-means believes to be important for clustering. Then project X onto v: $x'_i = \langle x_i, v \rangle / \|v\|^2$. X′ is a 1-dimensional representation of the data projected onto v. Transform X′ so that it has mean 0 and variance 1.
5. Let $z_i = F(x'_{(i)})$. If $A^2_*(Z)$ is in the range of non-critical values at confidence level α, then accept H0, keep the original center, and discard $\{c_1, c_2\}$. Otherwise, reject H0 and keep $\{c_1, c_2\}$ in place of the original center.
A primary contribution of this work is simplifying the test for Gaussian fit by projecting the data to one dimension where the test is simple to apply. The authors of [5] also use this approach for online dimensionality reduction during clustering. The one-dimensional representation of the data allows us to consider only the data along the direction that k-means has found to be important for separating the data. This is related to the problem of projection pursuit [7], where here k-means searches for a direction in which the data appears non-Gaussian. We must choose the significance level of the test, α, which is the desired probability of making a Type I error (i.e. incorrectly rejecting H0). It is appropriate to use a Bonferroni adjustment to reduce the chance of making Type I errors over multiple tests. For example, if we want a 0.01 chance of making a Type I error in 100 tests, we should apply a Bonferroni adjustment to make each test use α = 0.01/100 = 0.0001. To find k final centers the G-means algorithm makes k statistical tests, so the Bonferroni correction does not need to be extreme. In our tests, we always use α = 0.0001. We consider two ways to initialize the two child centers. Both approaches initialize with c ± m, where c is a center and m is chosen. The first method chooses m as a random d-dimensional vector such that ||m|| is small compared to the distortion of the data. A second method finds the main principal component s of the data (having eigenvalue λ), and chooses $m = s\sqrt{2\lambda/\pi}$. This deterministic method places the two centers in their expected locations under H0. The principal component calculations require $O(nd^2 + d^3)$ time and $O(d^2)$ space, but since we only want the main principal component, we can use fast methods like the power method, which takes time that is at most linear in the ratio of the two largest eigenvalues [4]. In this paper we use principal-component-based splitting.
2.2 An example
Figure 2 shows a run of the G-means algorithm on a synthetic dataset with two true clusters and 1000 points, using α = 0.0001. The critical value for the Anderson-Darling test is 1.8692 for this confidence level. Starting with one center, after one iteration of G-means, we have 2 centers and the $A^2_*$ statistic is 38.103. This is much larger than the critical value, so we reject H0 and accept this split. On the next iteration, we split each new center and repeat the statistical test. The $A^2_*$ values for the two splits are 0.386 and 0.496, both of which are well below the critical value. Therefore we accept H0 for both tests, and discard these splits. Thus G-means gives a final answer of k = 2.
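The projection step and the corrected Anderson-Darling statistic described above can be sketched in a few lines of standard-library Python. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, the use of the sample standard deviation (n − 1 denominator) and the clamping constant `eps` are assumptions added to keep the logarithms finite, and the two test samples are synthetic. The critical value 1.8692 is the one quoted in the text for α = 0.0001.

```python
import math
from statistics import NormalDist

def project(points, c1, c2):
    """Project d-dimensional points onto v = c1 - c2: <x, v> / ||v||^2."""
    v = [a - b for a, b in zip(c1, c2)]
    norm2 = sum(a * a for a in v)
    return [sum(p * a for p, a in zip(x, v)) / norm2 for x in points]

def anderson_darling(values, eps=1e-12):
    """Corrected statistic A^2_* of equations (1) and (2) for 1-d values."""
    n = len(values)
    mean = sum(values) / n
    # Assumption: sample standard deviation (n - 1 in the denominator).
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    cdf = NormalDist().cdf
    # z_i = F(x_(i)), clamped away from 0 and 1 so the logs stay finite.
    z = [min(max(cdf((x - mean) / std), eps), 1.0 - eps)
         for x in sorted(values)]
    s = sum((2 * i - 1) * (math.log(z[i - 1]) + math.log(1.0 - z[n - i]))
            for i in range(1, n + 1))
    a2 = -s / n - n
    return a2 * (1.0 + 4.0 / n - 25.0 / n ** 2)

CRITICAL = 1.8692  # critical value quoted in the text for alpha = 0.0001

# Synthetic 1-d samples: exact normal quantiles, and two well-separated modes.
n = 200
gaussian_like = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
two_modes = ([-5.0 + 0.01 * i for i in range(100)]
             + [5.0 + 0.01 * i for i in range(100)])

for name, data in (("gaussian_like", gaussian_like), ("two_modes", two_modes)):
    stat = anderson_darling(data)
    print(name, round(stat, 3), "split" if stat > CRITICAL else "keep one center")
```

In a full G-means loop, `project` would be applied to the points of one cluster with the two child centers found by k-means, and the result passed to `anderson_darling`; the k-means step itself is left out of this sketch.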
2.3 Statistical power
Figure 3 shows the power of the Anderson-Darling test, as compared to the BIC. Lower is better for both plots. We run 1000 tests for each data point plotted for both plots. In the left plot, for each test we generate n datapoints from a single true Gaussian distribution, and then plot the frequency with which BIC and G-means will choose k = 2 rather than k = 1 (i.e. commit a Type I error). BIC tends to overfit by choosing too many centers when the data is not strictly spherical, while G-means does not. This is consistent with the tests of real-world data in the next section.

Figure 2: An example of running G-means for three iterations on a 2-dimensional dataset with two true clusters and 1000 points. Starting with one center (left plot), G-means splits into two centers (middle). The test for normality is significant, so G-means rejects $H_0$ and keeps the split. After splitting each center again (right), the test values are not significant, so G-means accepts $H_0$ for both tests and does not accept these splits. The middle plot is the G-means answer. See the text for further details.

Figure 3: A comparison of the power of the Anderson-Darling test versus the BIC (curves for G-means and X-means; x-axis: number of datapoints; y-axis: P(Type I error) on the left, P(Type II error) on the right). For the AD test we fix the significance level (α = 0.0001), while the BIC's significance level depends on n. The left plot shows the probability of incorrectly splitting (Type I error) one true 2-d cluster that is 5% elliptical. The right plot shows the probability of incorrectly not splitting two true clusters separated by 5σ (Type II error). Both plots are functions of n. Both plots show that the BIC overfits (splits clusters) when n is small.
While G-means commits more Type II errors when n is small, this prevents it from overfitting the data. The BIC can be considered a likelihood ratio test, but with a significance level that cannot be fixed. The significance level instead varies depending on n and ∆k (the change in the number of model parameters between two models). As n or ∆k decrease, the significance level increases (the BIC becomes weaker as a statistical test) [10]. Figure 3 shows this effect for varying n. In [11] the authors show that penalty-based methods require problem-specific tuning and don't generalize as well as other methods, such as cross validation.

3 Experiments
Table 1 shows the results from running G-means and X-means on many large synthetic datasets. On synthetic datasets with spherically distributed clusters, G-means and X-means do equally well at finding the correct k and maximizing the BIC statistic, so we don't show these results here. Most real-world data is not spherical, however. The synthetic datasets used here each have 5000 datapoints in d = 2/8/32 dimensions. The true ks are 5, 20, and 80. For each synthetic dataset type, we generate 30 datasets with the true center means chosen uniformly randomly from the unit hypercube, and choosing σ so that no two clusters are closer than 3σ apart. Each cluster is also given a transformation to make it non-spherical, by multiplying the data by a randomly chosen scaling and rotation matrix. We run G-means starting with one center.

Table 1: Results for many synthetic datasets. We report distortion relative to the optimum distortion for the correct clustering (closer to one is better), and time is reported relative to k-means run with the correct k. For BIC, larger values are better, but it is clear that finding the correct clustering does not always coincide with finding a larger BIC. Items with a star are where X-means always chose the largest number of centers we allowed.

dataset          d    method    k found        BIC (×10^4)     distortion (× optimal)   time (× k-means)
synthetic k=5    2    G-means   9.1 ± 9.9      -0.19 ± 2.70    0.89 ± 0.23              13.2
synthetic k=5    2    X-means   18.1 ± 3.2     0.70 ± 0.93     0.37 ± 0.12              2.8
synthetic k=20   2    G-means   20.1 ± 0.6     0.21 ± 0.18     0.99 ± 0.01              2.1
synthetic k=20   2    X-means   70.5 ± 11.6    14.83 ± 3.50    9.45 ± 28.02             1.2
synthetic k=80   2    G-means   80.0 ± 0.2     1.84 ± 0.12     1.00 ± 0.01              2.2
synthetic k=80   2    X-means   171.7 ± 23.7   40.16 ± 6.59    48.49 ± 70.04            1.8
synthetic k=5    8    G-means   5.0 ± 0.0      -0.74 ± 0.16    1.00 ± 0.00              4.6
synthetic k=5    8    X-means   *20.0 ± 0.0    -2.28 ± 0.20    0.47 ± 0.03              11.0
synthetic k=20   8    G-means   20.0 ± 0.1     -0.18 ± 0.17    0.99 ± 0.00              2.6
synthetic k=20   8    X-means   *80.0 ± 0.0    14.36 ± 0.21    0.47 ± 0.01              4.0
synthetic k=80   8    G-means   80.2 ± 0.5     1.45 ± 0.20     0.99 ± 0.00              2.9
synthetic k=80   8    X-means   229.2 ± 36.8   52.28 ± 9.26    0.57 ± 0.06              6.5
synthetic k=5    32   G-means   5.0 ± 0.0      -3.36 ± 0.21    1.00 ± 0.00              4.4
synthetic k=5    32   X-means   *20.0 ± 0.0    -27.92 ± 0.22   0.76 ± 0.00              29.9
synthetic k=20   32   G-means   20.0 ± 0.0     -2.73 ± 0.22    1.00 ± 0.00              2.3
synthetic k=20   32   X-means   *80.0 ± 0.0    -11.13 ± 0.23   0.76 ± 0.01              21.2
synthetic k=80   32   G-means   80.0 ± 0.0     -1.10 ± 0.16    1.00 ± 0.00              2.8
synthetic k=80   32   X-means   171.5 ± 10.9   11.78 ± 2.74    0.84 ± 0.01              53.3

Figure 4: 2-d synthetic dataset with 5 true clusters. On the left, G-means correctly chooses 5 centers and deals well with non-spherical data. On the right, the BIC causes X-means to overfit the data, choosing 20 unevenly distributed clusters.
We allow X-means to search between 2 and 4k centers (where here k is the true number of clusters). The G-means algorithm clearly does better at finding the correct k on non-spherical data. Its results are closer to the true distortions and the correct ks. The BIC statistic that X-means uses has been formulated to maximize the likelihood for spherically-distributed data. Thus it overestimates the number of true clusters in non-spherical data. This is especially evident when the number of points per cluster is small, as in datasets with 80 true clusters. Because of this overestimation, X-means often hits our limit of 4k centers. Figure 4 shows an example of overfitting on a dataset with 5 true clusters. X-means chooses k = 20 while G-means finds all 5 true cluster centers. Also of note is that X-means does not distribute centers evenly among clusters; some clusters receive one center, but others receive many. G-means runs faster than X-means for 8 and 32 dimensions, which we expect, since the kd-tree structures which make X-means fast in low dimensions take time exponential in d, making them slow for more than 8 to 12 dimensions. All our code is written in Matlab; X-means is written in C.

Figure 5: NIST and Pendigits datasets: correspondence between each digit (row) and each cluster (column) found by G-means. G-means did not have the labels, yet it found meaningful clusters corresponding with the labels.

3.1 Discovering true clusters in labeled data
We tested these algorithms on two real-world datasets for handwritten digit recognition: the NIST dataset [12] and the Pendigits dataset [2]. The goal is to cluster the data without knowledge of the labels and measure how well the clustering captures the true labels. Both datasets have 10 true classes (digits 0-9). NIST has 60000 training examples and 784 dimensions (28×28 pixels).
We use 6000 randomly chosen examples and we reduce the dimension to 50 by random projection (following [3]). The Pendigits dataset has 7984 examples and 16 dimensions; we did not change the data in any way. We cluster each dataset with G-means and X-means, and measure performance by comparing the cluster labels $L_c$ with the true labels $L_t$. We define the partition quality (PQ) as

$$PQ = \frac{\sum_{i=1}^{k_t} \sum_{j=1}^{k_c} p(i, j)^2}{\sum_{i=1}^{k_t} p(i)^2}$$

where $k_t$ is the true number of classes, and $k_c$ is the number of clusters found by the algorithm. This metric is maximized when $L_c$ induces the same partition of the data as $L_t$; in other words, when all points in each cluster have the same true label, and the estimated k is the true k. The p(i, j) term is the frequency-based probability that a datapoint will be labeled i by $L_t$ and j by $L_c$. This quality is normalized by the sum of true probabilities, squared. This statistic is related to the Rand statistic for comparing partitions [8].

For the NIST dataset, G-means finds 31 clusters in 30 seconds with a PQ score of 0.177. X-means finds 715 clusters in 4149 seconds, and 369 of these clusters contain only one point, indicating an overestimation problem with the BIC. X-means receives a PQ score of 0.024. For the Pendigits dataset, G-means finds 69 clusters in 30 seconds, with a PQ score of 0.196; X-means finds 235 clusters in 287 seconds, with a PQ score of 0.057. Figure 5 shows Hinton diagrams of the G-means clusterings of both datasets, showing that G-means succeeds at identifying the true clusters concisely, without aid of the labels. The confusions between different digits in the NIST dataset (seen in the off-diagonal elements) are common for other researchers using more sophisticated techniques; see [3].
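The partition quality score defined above can be computed directly from two label lists. The following is a small sketch, assuming the PQ formula is the ratio of summed squared joint frequencies to summed squared true-class frequencies as described in the text; the function name is hypothetical.

```python
from collections import Counter

def partition_quality(true_labels, cluster_labels):
    """PQ = sum_ij p(i, j)^2 / sum_i p(i)^2 from frequency-based probabilities."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, cluster_labels))   # counts for p(i, j)
    marginal = Counter(true_labels)                     # counts for p(i)
    numerator = sum((count / n) ** 2 for count in joint.values())
    denominator = sum((count / n) ** 2 for count in marginal.values())
    return numerator / denominator

# Clusters that reproduce the true partition (up to renaming) versus an
# over-segmented clustering that puts every point in its own cluster.
truth = [0, 0, 1, 1]
print(partition_quality(truth, [5, 5, 7, 7]))
print(partition_quality(truth, [0, 1, 2, 3]))
```

Splitting a true class across many clusters shrinks the squared joint frequencies, which is why over-segmentation of the kind reported for X-means lowers the score.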
4 Discussion and conclusions We have introduced the new G-means algorithm for learning k based on a statistical test for determining whether datapoints are a random sample from a Gaussian distribution with arbitrary dimension and covariance matrix. The splitting uses dimension reduction and a powerful test for Gaussian fitness. G-means uses this statistical test as a wrapper around k-means to discover the number of clusters automatically. The only parameter supplied to the algorithm is the significance level of the statistical test, which can easily be set in a standard way. The G-means algorithm takes linear time and space (plus the cost of the splitting heuristic and test) in the number of datapoints and dimension, since k-means is itself linear in time and space. Empirically, the G-means algorithm works well at finding the correct number of clusters and the locations of genuine cluster centers, and we have shown it works well in moderately high dimensions. Clustering in high dimensions has been an open problem for many years. Recent research has shown that it may be preferable to use dimensionality reduction techniques before clustering, and then use a low-dimensional clustering algorithm such as k-means, rather than clustering in the high dimension directly. In [3] the author shows that using a simple, inexpensive linear projection preserves many of the properties of data (such as cluster distances), while making it easier to find the clusters. Thus there is a need for good-quality, fast clustering algorithms for low-dimensional data. Our work is a step in this direction. Additionally, recent image segmentation algorithms such as normalized cut [16, 13] are based on eigenvector computations on distance matrices. These “spectral” clustering algorithms still use k-means as a post-processing step to find the actual segmentation and they require k to be specified. Thus we expect G-means will be useful in combination with spectral clustering. 
References
[1] Horst Bischof, Aleš Leonardis, and Alexander Selb. MDL principle for robust vector quantisation. Pattern Analysis and Applications, 2:59–72, 1999.
[2] C.L. Blake and C.J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
[3] Sanjoy Dasgupta. Experiments with random projection. In Uncertainty in Artificial Intelligence: Proceedings of the Sixteenth Conference (UAI-2000), pages 143–151, San Francisco, CA, 2000. Morgan Kaufmann Publishers.
[4] Gianna M. Del Corso. Estimating an eigenvector by the power method with a random start. SIAM Journal on Matrix Analysis and Applications, 18(4):913–937, 1997.
[5] Chris Ding, Xiaofeng He, Hongyuan Zha, and Horst Simon. Adaptive dimension reduction for clustering high dimensional data. In Proceedings of the 2nd IEEE International Conference on Data Mining, 2002.
[6] Fredrik Farnstrom, James Lewis, and Charles Elkan. Scalability for clustering algorithms revisited. SIGKDD Explorations, 2(1):51–57, 2000.
[7] Peter J. Huber. Projection pursuit. Annals of Statistics, 13(2):435–475, June 1985.
[8] L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2:193–218, 1985.
[9] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264–323, 1999.
[10] Robert E. Kass and Larry Wasserman. A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90(431):928–934, 1995.
[11] Michael J. Kearns, Yishay Mansour, Andrew Y. Ng, and Dana Ron. An experimental and theoretical comparison of model selection methods. In Computational Learning Theory (COLT), pages 21–30, 1995.
[12] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[13] Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. Neural Information Processing Systems, 14, 2002.
[14] Dan Pelleg and Andrew Moore. X-means: Extending K-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conf. on Machine Learning, pages 727–734. Morgan Kaufmann, San Francisco, CA, 2000.
[15] Peter Sand and Andrew Moore. Repairing faulty mixture models using density estimation. In Proceedings of the 18th International Conf. on Machine Learning. Morgan Kaufmann, San Francisco, CA, 2001.
[16] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
[17] M. A. Stephens. EDF statistics for goodness of fit and some comparisons. Journal of the American Statistical Association, 69(347):730–737, September 1974.
4 0.17265552 107 nips-2003-Learning Spectral Clustering
Author: Francis R. Bach, Michael I. Jordan
Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive a new cost function for spectral clustering based on a measure of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing this cost function with respect to the partition leads to a new spectral clustering algorithm. Minimizing with respect to the similarity matrix leads to an algorithm for learning the similarity matrix. We develop a tractable approximation of our cost function that is based on the power method of computing eigenvectors.
5 0.15540555 87 nips-2003-Identifying Structure across Pre-partitioned Data
Author: Zvika Marx, Ido Dagan, Eli Shamir
Abstract: We propose an information-theoretic clustering approach that incorporates a pre-known partition of the data, aiming to identify common clusters that cut across the given partition. In the standard clustering setting the formation of clusters is guided by a single source of feature information. The newly utilized pre-partition factor introduces an additional bias that counterbalances the impact of the features whenever they become correlated with this known partition. The resulting algorithmic framework was applied successfully to synthetic data, as well as to identifying text-based cross-religion correspondences.

1 Introduction
The standard task of feature-based data clustering deals with a single set of elements that are characterized by a unified set of features. The goal of the clustering task is to identify implicit constructs, or themes, within the clustered set, grouping together elements that are characterized similarly by the features. In recent years there has been growing interest in more complex clustering settings, in which additional information is incorporated [1], [2]. Several such extensions ([3]-[5]) are based on the information bottleneck (IB) framework [6], which facilitates coherent information-theoretic representation of different information types. In a recent line of research we have investigated the cross-dataset clustering task [7], [8]. In this setting, some inherent a-priori partition of the clustered data to distinct subsets is given. The clustering goal is to identify corresponding (analogous) structures that cut across the different subsets, while ignoring internal structures that characterize individual subsets. To accomplish this task, those features that commonly characterize elements across the different subsets guide the clustering process, while within-subset regularities are neutralized.
In [7], we presented a distance-based hard clustering algorithm for the coupled-clustering problem, in which the clustered data is pre-partitioned to two subsets. In [8], our setting, generalized to pre-partitions of any number of subsets, was addressed by a heuristic extension of the probabilistic IB algorithm, yielding improved empirical results. Specifically, the algorithm in [8] was based on a modification of the IB stable-point equation, which amplified the impact of features characterizing a formed cluster across all, or most, subsets. This paper describes an information-theoretic framework that motivates and extends the algorithm proposed in [8]. The given pre-partitioning is represented via a probability distribution variable, which may represent "soft" pre-partitioning of the data, versus the strictly disjoint subsets assumed in the earlier cross-dataset framework. Further, we present a new functional that captures the cross-partition motivation. From the new functional, we derive a stable-point equation underlying our algorithmic framework in conjunction with the corresponding IB equation. Our algorithm was tested empirically on synthetic data and on a real-world text-based task that aimed to identify corresponding themes across distinct religions. We have cross-clustered five sets of keywords that were extracted from topical corpora of texts about Buddhism, Christianity, Hinduism, Islam and Judaism. In distinction from standard clustering results, our algorithm reveals themes that are common to all religions, such as sacred writings, festivals, narratives and myths and theological principles, and avoids topical clusters that correspond to individual religions (for example, 'Christmas' and 'Easter' are clustered together with 'Ramadan' rather than with 'Church'). Finally, we have paid specific attention to the framework of clustering with side information [4].
While this approach was presented for a somewhat different mindset, it might be used directly to address clustering across pre-partitioned data. We compare the technical details of the two approaches and demonstrate empirically that clustering with side information does not seem appropriate for the kind of cross-partition tasks that we explored.

2 The Information Bottleneck Method
Probabilistic ("soft") data clustering outputs, for each element x of the set being clustered and each cluster c, an assignment probability p(c|x). The IB method [6] interprets probabilistic clustering as lossy data compression. The given data is represented by a random variable X ranging over the clustered elements. X is compressed through another random variable C, ranging over the clusters. Every element x is characterized by conditional probability distribution p(Y|x), where Y is a third random variable taking the members y of a given set of features as values. The IB method formalizes the clustering task as minimizing the IB functional:

$$L^{(IB)} = I(C; X) - \beta\, I(C; Y) \qquad (1)$$

As known from information theory (Ch. 13 of [9]), minimizing the mutual information I(C; X) optimizes distorted compression rate. A complementary bias to maximize I(C; Y) is interpreted in [6] as articulating the level of relevance of Y to the obtained clustering, inferred from the level by which C can predict Y. β is a free parameter counterbalancing the two biases. It is shown in [6] that p(c|x) values that minimize $L^{(IB)}$ satisfy the following equation:

$$p(c|x) = \frac{1}{z(\beta, x)}\, p(c)\, e^{-\beta D_{KL}[\, p(Y|x) \,\|\, p(Y|c) \,]} \qquad (2)$$

where $D_{KL}$ stands for the Kullback-Leibler (KL) divergence, or relative entropy, between two distributions and z(β, x) is a normalization function over C. Eq. (2) implies that, optimally, x is assigned to c in proportion to their KL distance in a feature distribution space, where the distribution p(Y|c) takes the role of a representative, or centroid, of c. The feature variable Y is hence utilized as the (exclusive) means to guide clustering, beyond the random nature of compression.

Figure 1: The Information Bottleneck iterative algorithm (with fixed β and |C|). Start at time t = 0 and iterate the following update-steps, till convergence:
IB1: initialize $p_t(c|x)$ randomly or arbitrarily (t = 0); $p_t(c|x) \propto p_{t-1}(c)\, e^{-\beta D_{KL}[\, p(Y|x) \,\|\, p_{t-1}(Y|c) \,]}$ (t > 0)
IB2: $p_t(c) = \sum_x p_t(c|x)\, p(x)$
IB3: $p_t(y|c) = \frac{1}{p_t(c)} \sum_x p_t(c|x)\, p(y|x)\, p(x)$

Figure 1 presents the IB iterative algorithm for a fixed value of β. The IB1 update step follows Eq. (2). The other two steps, which are derived from the IB functional as well, estimate the p(c) and p(y|c) values required for the next iteration. The algorithm converges to a local minimum of the IB functional. The IB setting, particularly the derivation of steps IB1 and IB3 of the algorithm, assumes that Y and C are independent given X, that is: $I(C; Y|X) = \sum_x p(x)\, I(C|x; Y|x) = 0$. The balancing parameter β affects the number of distinct clusters being formed in a manner that resembles (inverse) temperature in physical systems. The higher β is (i.e., the stronger the bias to construct C that predicts Y well), the more distinct clusters are required for encoding the data. For each |C| = 2, 3, …, there is a minimal β value, enabling the formation of |C| distinct clusters. Setting β to be smaller than this critical value corresponding to the current |C| would result in two or more clusters that are identical to one another.
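The three update steps of Figure 1 map almost line by line onto array operations. The sketch below is only an illustration under stated assumptions: the function names and the small constant EPS (used to keep logarithms and divisions finite) are not from the paper, the initial assignments are passed in by the caller rather than drawn at random, and a fixed iteration count stands in for a convergence check.

```python
import numpy as np

EPS = 1e-12  # numerical guard; an implementation detail of this sketch

def kl_to_centroids(p_y_given_x, p_y_given_c):
    """D_KL[p(Y|x) || p(Y|c)] for every (x, c) pair; result has shape (X, C)."""
    p = p_y_given_x[:, None, :]
    q = p_y_given_c[None, :, :]
    return np.sum(p * (np.log(p + EPS) - np.log(q + EPS)), axis=2)

def ib_iterate(p_x, p_y_given_x, p_c_given_x, beta, n_iter=100):
    """Iterate IB1-IB3 for fixed beta and |C|.

    p_x: (X,), p_y_given_x: (X, Y), p_c_given_x: (X, C) initial assignments
    (step IB1 at t = 0). Returns the final soft assignments p(c|x).
    """
    for _ in range(n_iter):
        # IB2: p(c) = sum_x p(c|x) p(x)
        p_c = p_c_given_x.T @ p_x
        # IB3: p(y|c) = (1 / p(c)) sum_x p(c|x) p(y|x) p(x)
        weighted = p_c_given_x * p_x[:, None]
        p_y_given_c = weighted.T @ p_y_given_x / np.maximum(p_c, EPS)[:, None]
        # IB1 (t > 0): p(c|x) proportional to p(c) exp(-beta * KL), as in Eq. (2)
        log_w = np.log(p_c + EPS)[None, :] - beta * kl_to_centroids(p_y_given_x, p_y_given_c)
        log_w -= log_w.max(axis=1, keepdims=True)  # stabilize the exponentials
        p_c_given_x = np.exp(log_w)
        p_c_given_x /= p_c_given_x.sum(axis=1, keepdims=True)  # the z(beta, x) term
    return p_c_given_x

# Hypothetical toy data: four elements with two clearly different feature profiles.
p_x = np.full(4, 0.25)
p_y_given_x = np.array([[0.90, 0.10], [0.85, 0.15], [0.10, 0.90], [0.15, 0.85]])
start = np.array([[0.6, 0.4], [0.6, 0.4], [0.4, 0.6], [0.4, 0.6]])
print(ib_iterate(p_x, p_y_given_x, start, beta=20.0).round(3))
```

With a β this far above the critical value for two clusters, the slightly asymmetric starting point is amplified at every iteration until the first two elements and the last two elements sit in different clusters.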
Based on this, the iterative algorithm is applied repeatedly within a gradual cooling-like (deterministic annealing) scheme: starting with random initialization of the $p_0(c|x)$'s, generate two clusters with the critical β value, found empirically, for |C| = 2. Then, use a perturbation on the obtained two-cluster configuration to initialize the $p_0(c|x)$'s for a larger set of clusters and execute additional runs of the algorithm to identify the critical β value for the larger |C|. And so on: each output configuration is used as a basis for a more granular one. The final outcome is a "soft hierarchy" of probabilistic clusters.

3 Cross-partition Clustering
Cross-partition (CP) clustering introduces a factor, a pre-given partition of the clustered data, in addition to what is considered in a standard clustering setting. For representing this factor we introduce the pre-partitioning variable W, ranging over all parts w of the pre-given partition. Every data element x is associated with W through a given probability distribution p(W|x). Our goal is to cluster the data, so that the clusters C would not be correlated with W. We notice that Y, which is intended to direct the formation of clusters, might be a-priori correlated with W, so the formed clusters might end up being correlated with W as well. Our method aims at eliminating this aspect of Y.

3.1 Information Defocusing
As noted, some of the information conveyed by Y characterizes structures correlated with W, while the other part of the information characterizes the target cross-W structures. We are interested in detecting the latter while filtering out the former. However, there is no direct a-priori separation between the two parts of the Y-mediated information. Our strategy in tackling this difficulty is: we follow in general Y's directions, as the IB method does, while avoiding Y's impact whenever it entails undesired inter-dependencies of C and W.
Our strategy implies conflicting biases with regard to the mutual information I(C;Y): it should be maximized in order to form meaningful clusters, but be minimized as well in the specific context where Y entails C–W dependencies. Accordingly, we propose a computational procedure directed by two distinct cost-terms in tandem. The first one is the IB functional (Eq. 1), introducing the bias to maximize I(C;Y). With this bias alone, Y might dictate (or "explain", in retrospect) substantial C–W dependencies, implying a low I(C;W|Y) value (footnote 1). Hence, the guideline of preventing Y from accounting for C–W dependencies is realized through an opposing bias of maximizing $I(C;W|Y) = \sum_y p(y)\, I(C|y; W|y)$. The second cost term, the Information Defocusing (ID) functional, consequently counterbalances minimization of I(C;Y) against the new bias:

$$L^{(ID)} = I(C; Y) - \eta\, I(C; W|Y) \qquad (3)$$

where η is a free parameter articulating the tradeoff between the biases. The ID functional captures our goal of reducing the impact of Y selectively: "defocusing" a specific aspect of the information Y conveys: the information correlated with W. In a like manner to the stable-point equation of the IB functional (Eq. 2), we derive the following stable-point equation for the ID functional:

$$p(c|y) = \frac{1}{z(\eta, y)}\, p(c) \prod_w p(y|c, w)^{\frac{\eta}{\eta+1}\, p(w)} \qquad (4)$$

where z(η, y) is a normalization function over C. The derivation relies on an additional assumption, I(C;W) = 0, imposing the intended independence between C and W (the detailed derivation will be described elsewhere). The intuitive interpretation of Eq. (4) is as follows: a feature y is to be associated with a cluster c in proportion to a weighted, though flattened, geometric mean of the "W-projected centroids" p(y|c,w), priored by p(c) (footnote 2). This scheme overweighs y's that contribute to c evenly across W. Thus, clusters satisfying Eq. (4) are situated around centroids biased towards evenly contributing features.
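A small numeric illustration may help to see why the geometric mean in Eq. (4) favors features that contribute evenly across W. The sketch below evaluates the right-hand side of Eq. (4) for a hypothetical two-cluster, two-part, two-feature configuration; the function name, the array layout (clusters, parts, features) and the constant 1e-12 are assumptions of this example.

```python
import numpy as np

def id_stable_point(p_c, p_y_given_cw, p_w, eta):
    """p(c|y) from Eq. (4); p_c: (C,), p_y_given_cw: (C, W, Y), p_w: (W,)."""
    exponent = eta / (eta + 1.0) * p_w                      # one exponent per part w
    # Weighted sum of logs over w = log of the flattened, weighted geometric mean.
    log_geo = np.einsum("w,cwy->cy", exponent, np.log(p_y_given_cw + 1e-12))
    log_post = np.log(p_c)[:, None] + log_geo
    log_post -= log_post.max(axis=0, keepdims=True)
    post = np.exp(log_post)
    return post / post.sum(axis=0, keepdims=True)           # normalize over clusters

# Cluster 0 sees each feature with probability 0.5 in both parts of W.
# Cluster 1 sees feature 0 mostly in part 0 and feature 1 mostly in part 1.
# The arithmetic mean over w is 0.5 for both clusters; the geometric mean differs.
p_y_given_cw = np.array([[[0.5, 0.5], [0.5, 0.5]],
                         [[0.9, 0.1], [0.1, 0.9]]])
p_c = np.array([0.5, 0.5])
p_w = np.array([0.5, 0.5])
print(id_stable_point(p_c, p_y_given_cw, p_w, eta=1.0).round(3))
print(id_stable_point(p_c, p_y_given_cw, p_w, eta=100.0).round(3))
```

In this configuration both features end up associated more strongly with cluster 0, and the preference grows with η, in line with the interpretation given for Eq. (4).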
The higher η is, the heavier the emphasis put on suppressing disagreements between the w's. For η → ∞ a plain weighted geometric-mean scheme is obtained. The inclusion of a step derived from Eq. (4) in our algorithm (see below) facilitates convergence on a configuration with centroids dominated by features that are evenly distributed across W.

Footnote 1: Notice that "Z explaining well the dependencies between A and B" is equivalent with "A and B sharing little information in common given Z", i.e. low I(A;B|Z). Complete conditional independence is exemplified in the IB framework, assuming I(C;Y|X) = 0.
Footnote 2: Eq. (4) resembles our suggestion in [8] to compute a geometric average over the subsets; in the current paper this scheme is analytically derived from the ID functional.

3.2 The Cross-partition Clustering Algorithm
Our proposed cross partition (CP) clustering algorithm (Fig. 2) seeks a clustering configuration that optimizes simultaneously both the IB and ID functionals, thus obtaining clusters that cut across the pre-given partition W. To this end, the algorithm interleaves an iterative computation of the stable-point equations, and the additional estimated parameters, for both functionals.

Figure 2: The cross-partition clustering iterative algorithm (with fixed β, η, and |C|). Start at time t = 0 and iterate the following update-steps, till convergence:
CP1: initialize $p_t(c|x)$ randomly or arbitrarily (t = 0); $p_t(c|x) \propto p_{t-1}(c)\, e^{-\beta D_{KL}[\, p(Y|x) \,\|\, p_{t-1}(Y|c) \,]}$ (t > 0)
CP2: $p_t(c) = \sum_x p_t(c|x)\, p(x)$
CP3: $p^*_t(y|c,w) = \frac{1}{p_t(c)\, p(w)} \sum_x p_t(c|x)\, p(y|x)\, p(w|x)\, p(x)$
CP4: initialize $p^*_t(c)$ randomly or arbitrarily (t = 0); $p^*_t(c) = \sum_y p^*_{t-1}(c|y)\, p(y)$ (t > 0)
CP5: $p^*_t(c|y) \propto p^*_t(c) \prod_w p^*_t(y|c,w)^{\frac{\eta}{\eta+1}\, p(w)}$
CP6: $p_t(y|c) = p^*_t(c|y)\, p(y) \,/\, p^*_t(c)$
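The six steps of Figure 2 can be written as one loop over array operations. The following is a rough sketch under several assumptions rather than the authors' implementation: the function and variable names are hypothetical, p*(c) starts out uniform at t = 0, a fixed iteration count replaces a convergence test, EPS guards logarithms and divisions, and p(y|c) is renormalized over y after step CP6 as a numerical safeguard that Figure 2 does not list.

```python
import numpy as np

EPS = 1e-12  # numerical guard; an implementation detail of this sketch

def cp_iterate(p_x, p_y_given_x, p_w_given_x, p_c_given_x, beta, eta, n_iter=50):
    """Iterate CP1-CP6 for fixed beta, eta and |C|.

    Shapes: p_x (X,), p_y_given_x (X, Y), p_w_given_x (X, W),
    p_c_given_x (X, C) initial assignments (step CP1 at t = 0).
    Returns p(c|x) of shape (X, C) and p*(c|y) of shape (C, Y).
    """
    p_y = p_x @ p_y_given_x
    p_w = p_x @ p_w_given_x
    n_c = p_c_given_x.shape[1]
    star_c = np.full(n_c, 1.0 / n_c)       # CP4 at t = 0: arbitrary p*(c)
    star_c_given_y = None
    for t in range(n_iter):
        # CP2: p(c) = sum_x p(c|x) p(x)
        p_c = p_c_given_x.T @ p_x
        # CP3: p*(y|c,w) = sum_x p(c|x) p(y|x) p(w|x) p(x) / (p(c) p(w))
        joint = np.einsum("xc,xy,xw,x->cwy", p_c_given_x, p_y_given_x, p_w_given_x, p_x)
        star_y_given_cw = joint / np.maximum(p_c[:, None, None] * p_w[None, :, None], EPS)
        # CP4 (t > 0): p*(c) = sum_y p*_{t-1}(c|y) p(y)
        if t > 0:
            star_c = star_c_given_y @ p_y
        # CP5: p*(c|y) proportional to p*(c) prod_w p*(y|c,w)^(eta/(eta+1) p(w))
        exponent = eta / (eta + 1.0) * p_w
        log_post = np.log(star_c + EPS)[:, None] + np.einsum(
            "w,cwy->cy", exponent, np.log(star_y_given_cw + EPS))
        post = np.exp(log_post - log_post.max(axis=0, keepdims=True))
        star_c_given_y = post / post.sum(axis=0, keepdims=True)
        # CP6: p(y|c) = p*(c|y) p(y) / p*(c), then renormalized over y (safeguard).
        p_y_given_c = star_c_given_y * p_y[None, :] / np.maximum(star_c, EPS)[:, None]
        p_y_given_c /= np.maximum(p_y_given_c.sum(axis=1, keepdims=True), EPS)
        # CP1 (next t): p(c|x) proportional to p(c) exp(-beta * KL[p(Y|x) || p(Y|c)])
        p = p_y_given_x[:, None, :]
        kl = np.sum(p * (np.log(p + EPS) - np.log(p_y_given_c[None, :, :] + EPS)), axis=2)
        log_a = np.log(p_c + EPS)[None, :] - beta * kl
        a = np.exp(log_a - log_a.max(axis=1, keepdims=True))
        p_c_given_x = a / a.sum(axis=1, keepdims=True)
    return p_c_given_x, star_c_given_y
```

A caller would supply p(W|x) as a one-hot row per element for a strict pre-partition, or as a general distribution for a soft one, and would wrap the call in the annealing scheme over β that the text describes.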
Steps CP1, CP2 and CP6 correspond to the computations related to the IB functional, while steps CP3, CP4 and CP5, which compute a separate set of parameters (denoted by an asterisk), correspond to the ID functional. Figure 3 summarizes the roles of the two functionals in the dynamics of the CP algorithm. The two components of the iterative cycle are tied together in steps CP3 and CP6, in which parameters from one set are used as input to compute a parameter of the other set. The derivation of step CP3 relies on an additional assumption, namely that C, Y and W are jointly independent given X. This assumption, which extends to W the underlying assumption of the IB setting that C and Y are independent given X, still entails the IB stable point equation. At convergence, the stable point equations for both the IB and ID functionals are satisfied, each by its own set of parameters (in steps CP1 and CP5). The deterministic annealing scheme, which gradually increases β over repeated runs (see Sec. 2), is applied for the CP algorithm as well with η held fixed. For a given target number of clusters |C|, the algorithm empirically converges with a wide range of η values (footnote 3).

Figure 3: The interplay of the IB and the ID functionals in the CP algorithm. (Diagram elements: the IB functional, weighted by β, links I(C;X) and I(C;Y); the ID functional, weighted by η, links I(C;Y) and I(C;W|Y); the assumptions are I(C;Y;W|X) = 0 and I(C;W) = 0.)

Footnote 3: High η values tend to dictate centroids with features that are unevenly distributed across W, resulting in shrinkage of some of the clusters. Further analysis will be provided in future work.

4 Experimental Results
Our synthetic setting consisted of 75 virtual elements, evenly pre-partitioned into three 25-element parts denoted $X_1$, $X_2$ and $X_3$ (in our formalism, for each clustered element x, p(w|x) = 1 holds for either w = 1, 2, or 3). On top of this pre-partition, we partitioned the data twice, getting two (exhaustive) clustering configurations:
1. Target cross-W clustering: five clusters, each with representatives from all $X_w$'s;
2. Masking within-w clustering: six clusters, each consisting of roughly half the elements of either $X_1$, $X_2$ or $X_3$ with no representatives from the other $X_w$'s.
Each cluster, of both configurations, was characterized by a designated subset of features. Masking clusters were designed to be more salient than target clusters: they had more designated features (60 vs. 48 per cluster, i.e., 360 vs. 240 in total) and their elements shared higher feature-element (virtual) co-occurrence counts with those designated features (900 vs. 450 per element-feature pair). Noise (random positive integer < 200) was added to all counts associating elements with their designated features (for both within-w and cross-W clusters), as well as to roughly a quarter of the zero counts associating elements with the rest of the features. The plain IB method consistently produced configurations strongly correlated with the masking clustering, while the CP algorithm revealed the target configuration. We got (see Table 1A) almost perfect results in configurations of nearly equal-sized cross-W clusters, and somewhat less perfect reconstruction in configurations of diverging sizes (6, 9, 15, 21 and 24). Performance level was measured relative to optimal target-output cluster match by the proportion of elements correctly assigned, where assignment of an element x follows its highest p(c|x). The results indicated were averaged over 200 runs. They were obtained for the optimal η, which was found to be higher in the diverging-sizes task.

In the text-based task, the clustered elements (keywords) were automatically extracted from five distinct corpora addressing five religions: introductory web pages, online magazines, encyclopedic entries etc., all downloaded from the Internet. The clustered keyword set X was consequently pre-partitioned to disjoint subsets $\{X_w\}_{w \in W}$, one for each religion (footnote 4) ($|X_w| \approx 200$ for each w).
We conducted experiments simultaneously involving religion pairs as well as all five religions. We took the features Y to be a set of words that commonly occur within all five corpora (|Y| ≈ 7000). x–y co-occurrences were recorded within a ±5-word sliding window truncated by sentence boundaries. η was fixed to a value (1.0) enabling the formation of 20 clusters in all settings. The obtained clusters revealed interesting cross religion themes (see Sec. 1). For instance, the cluster (one of nine) capturing the theme of sacred festivals: the three highest p(c|x) members within each religion were Full-moon, Ceremony, Celebration (Buddhism); Easter, Sunday, Christmas (Christianity); Puja, Ceremony, Festival (Hinduism); Id-al-Fitr, Friday, Ramadan (Islam); and Sukkoth, Shavuot, Rosh-Hodesh (Judaism). The closest cluster produced by the plain IB method was poorer by far, including Islamic Ramadan, and Id and Jewish Passover, Rosh-Hashanah and Sabbath (which our method ranked high too), but no single related term from the other religions. Our external evaluation standards were cross-religion keyword classes constructed manually by experts of comparative religion studies. One such expert classification involved all five religions, and eight classifications addressed religions in pairs.

Table 1: Average correct assignment proportion scores for the synthetic task (A) and Jaccard-coefficient scores for the religion keyword classification task (B).

A. Synthetic Data         IB      CP
equal-size clusters       .305    .985
non-equal clusters        .292    .827

B. Religion Data          IB           Coupled Clustering [7]   CP
religion pairs            .200±.100    .220±.138                .407±.144
all five (one case)       .104         -                        .167
(cross-expert agreement on religion pairs: .462±.232)

Footnote 4: A keyword x that appeared in the corpora of different religions was considered as a distinct element for each religion, so the $X_w$ were kept disjoint.
Each of the eight religion-pair classifications was contributed by two independent experts using the same keywords, so we could also assess the agreement between experts. As an overlap measure we employed the Jaccard coefficient: the number of element pairs co-assigned together by both one of the evaluated clusters and one of the expert classes, divided by the number of pairs co-assigned by either our clusters or the expert (or both). We did not assume the number of expert classes is known in advance (as done in the synthetic experiments), so the results were averaged over all configurations of the 2–16 cluster hierarchy, for each experiment. The results shown in Table 1B, a clear improvement relative to plain IB and the distance-based coupled clustering [7], are, however, persistent when the number of clusters is taken to be equal to the number of classes, or if only the best score in the hierarchy is considered. The level of cross-expert agreement indicates that our results are reasonably close to the scores expected in such a subjective task.

5 Comparison to Related Work

The information bottleneck framework served as the basis for several approaches that represent additional information in their clustering setting. The multivariate information bottleneck (MIB) adapts the IB framework for networks of multiple variables [3]. However, all variables in such networks are either compressed (like X), or predicted (like Y). The incorporation of an empirical variable to be masked or defocused in the sense of our W is not possible. Including such variables in the MIB framework might be explored in future work. Particularly relevant to our work is the IB-based method for extracting relevant constructs with side information [4]. This approach addresses settings in which two different types of features are distinguished explicitly: relevant versus irrelevant ones, denoted by Y+ and Y−.
Both types of features are incorporated within a single functional to be minimized: $L_{\text{IB-side-info}} = I(C; X) - \beta \left( I(C; Y^{+}) - \gamma\, I(C; Y^{-}) \right)$, which directly drives clustering to de-correlate C and Y−. Formally, our setting can be mapped to the side information setting by regarding the pre-partition W simply as the additional set of irrelevant features, giving symmetric (and opposite) roles to W and Y. However, it seems that this view does not properly address the desired cross-partition setting. In our setting, it is assumed that clustering should be guided in general by Y, while W should only neutralize particular information within Y that would otherwise yield the undesired correlation between C and W (as described in Section 3.1). For that reason, the defocusing functional ties the three variables together by conditioning the de-correlation of C and W on Y, while its underlying assumption ensures the global de-correlation. Indeed, our method was found empirically superior on the cross-dataset task. The side-information IB method (the iterative algorithm with the best-scoring γ) achieves a correct assignment proportion of 0.52 in both synthetic tasks, where our method scored 0.99 and 0.83 (see Table 1A) and, in the religion-pair keyword classification task, its Jaccard coefficient improved by 20% relative to plain IB (compared to our 100% improvement, see Table 1B).

6 Conclusions

This paper addressed the problem of clustering a pre-partitioned dataset, aiming to detect new internal structures that are not correlated with the pre-given partition but rather cut across its components. The proposed framework extends the cross-dataset clustering algorithm [8], providing better formal grounding and representing any pre-given (soft) partition of the dataset.
Supported by empirical evidence, we suggest that our framework is better suited for the cross-partition task than applying the side-information framework [4], which was originally developed to address a somewhat different setting. We also demonstrate a substantial empirical advantage over the distance-based coupled-clustering algorithm [7]. As an applied real-world goal, the algorithm successfully detects cross-religion commonalities. This goal exemplifies the more general notion of detecting analogies across different systems, which is a somewhat vague and non-consensual task and therefore especially challenging for a computational framework. Our approach can be viewed as an initial step towards principled identification of “hidden” commonalities between substantially different real-world systems, while suppressing the vast majority of attributes that are irrelevant for the analogy. Further research may study the role of defocusing in supervised learning, where some pre-given partitions might mask the role of underlying discriminative features. Additionally, it would be interesting to explore relationships to other disciplines, e.g., network information theory ([9], Ch. 14), which provided motivation for the side-information approach. Finally, both frameworks (ours and side-information) suggest the importance of dealing wisely with information that should not dictate the clustering output directly.

Acknowledgments

We thank Yuval Krymolowski for helpful discussions and Tiina Mahlamäki, Eitan Reich and William Shepard, for contributing the religion keyword classifications.

References

[1] Hofmann, T. (2001) Unsupervised learning by probabilistic latent semantic analysis. Journal of Machine Learning Research, 41(1):177-196. [2] Wagstaff K., Cardie C., Rogers S. and Schroedl S., 2001. Constrained K-Means clustering with background knowledge. The 18th International Conference on Machine Learning (ICML-2001), pp 577-584. [3] Friedman N., Mosenzon O., Slonim N.
& Tishby N. (2002) Multivariate information bottleneck. The 17th conference on Uncertainty in Artificial Intelligence (UAI-17), pp. 152-161. [4] Chechik G. & Tishby N. (2002) Extracting relevant structures with side information. Advances in Neural Information Processing Systems 15 (NIPS'02). [5] Globerson, A., Chechik G. & Tishby N. (2003) Sufficient dimensionality reduction. Journal of Machine Learning Research, 3:1307-1331. [6] Tishby, N., Pereira, F. C. & Bialek, W. (1999) The information bottleneck method. The 37th Annual Allerton Conference on Communication, Control, and Computing, pp. 368-379. [7] Marx, Z., Dagan, I., Buhmann, J. M. & Shamir E. (2002) Coupled clustering: A method for detecting structural correspondence. Journal of Machine Learning Research, 3:747-780. [8] Dagan, I., Marx, Z. & Shamir E. (2002) Cross-dataset clustering: Revealing corresponding themes across multiple corpora. Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pp. 15-21. [9] Cover T. M. & Thomas J. A. (1991) Elements of Information Theory. John Wiley & Sons, Inc., New York, New York.
6 0.15487573 79 nips-2003-Gene Expression Clustering with Functional Mixture Models
7 0.15417886 82 nips-2003-Geometric Clustering Using the Information Bottleneck Method
8 0.12804486 152 nips-2003-Pairwise Clustering and Graphical Models
9 0.11792072 1 nips-2003-1-norm Support Vector Machines
10 0.11393899 17 nips-2003-A Sampled Texture Prior for Image Super-Resolution
11 0.10256699 86 nips-2003-ICA-based Clustering of Genes from Microarray Expression Data
12 0.10087711 12 nips-2003-A Model for Learning the Semantics of Pictures
13 0.093275994 88 nips-2003-Image Reconstruction by Linear Programming
14 0.091848724 186 nips-2003-Towards Social Robots: Automatic Evaluation of Human-Robot Interaction by Facial Expression Classification
15 0.090926595 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
16 0.090199038 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection
17 0.08893571 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms
18 0.086323291 24 nips-2003-An Iterative Improvement Procedure for Hierarchical Clustering
19 0.084532641 169 nips-2003-Sample Propagation
20 0.083053514 113 nips-2003-Learning with Local and Global Consistency
topicId topicWeight
[(0, -0.3), (1, -0.163), (2, -0.028), (3, 0.053), (4, -0.202), (5, 0.112), (6, -0.109), (7, 0.168), (8, -0.083), (9, 0.085), (10, -0.078), (11, -0.038), (12, -0.133), (13, -0.037), (14, -0.046), (15, 0.152), (16, -0.098), (17, -0.005), (18, -0.051), (19, 0.006), (20, -0.059), (21, 0.016), (22, 0.034), (23, -0.0), (24, 0.015), (25, 0.055), (26, 0.009), (27, -0.06), (28, -0.013), (29, -0.018), (30, -0.073), (31, -0.054), (32, 0.059), (33, -0.068), (34, -0.018), (35, 0.037), (36, 0.018), (37, -0.026), (38, -0.003), (39, 0.092), (40, -0.062), (41, -0.019), (42, 0.045), (43, 0.067), (44, -0.05), (45, -0.037), (46, 0.018), (47, -0.012), (48, 0.073), (49, -0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.97434956 73 nips-2003-Feature Selection in Clustering Problems
Author: Volker Roth, Tilman Lange
Abstract: A novel approach to combining clustering and feature selection is presented. It implements a wrapper strategy for feature selection, in the sense that the features are directly selected by optimizing the discriminative power of the used partitioning algorithm. On the technical side, we present an efficient optimization algorithm with guaranteed local convergence property. The only free parameter of this method is selected by a resampling-based stability analysis. Experiments with real-world datasets demonstrate that our method is able to infer both meaningful partitions and meaningful subsets of features.
2 0.81340975 111 nips-2003-Learning the k in k-means
Author: Greg Hamerly, Charles Elkan
Abstract: When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. In this paper we present an improved algorithm for learning k while clustering. The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian. Two key advantages are that the hypothesis test does not limit the covariance of the data and does not compute a full covariance matrix. Additionally, G-means only requires one intuitive parameter, the standard statistical significance level α. We present results from experiments showing that the algorithm works well, and better than a recent method based on the BIC penalty for model complexity. In these experiments, we show that the BIC is ineffective as a scoring function, since it does not penalize strongly enough the model’s complexity.

1 Introduction and related work

Clustering algorithms are useful tools for data mining, compression, probability density estimation, and many other important tasks. However, most clustering algorithms require the user to specify the number of clusters (called k), and it is not always clear what is the best value for k. Figure 1 shows examples where k has been improperly chosen. Choosing k is often an ad hoc decision based on prior knowledge, assumptions, and practical experience. Choosing k is made more difficult when the data has many dimensions, even when clusters are well-separated. Center-based clustering algorithms (in particular k-means and Gaussian expectation-maximization) usually assume that each cluster adheres to a unimodal distribution, such as Gaussian. With these methods, only one center should be used to model each subset of data that follows a unimodal distribution.
If multiple centers are used to describe data drawn from one mode, the centers are a needlessly complex description of the data, and in fact the multiple centers capture the truth about the subset less well than one center. In this paper we present a simple algorithm called G-means that discovers an appropriate k using a statistical test for deciding whether to split a k-means center into two centers. We describe examples and present experimental results that show that the new algorithm is successful. This technique is useful and applicable for many clustering algorithms other than k-means, but here we consider only the k-means algorithm for simplicity.

Figure 1: Two clusterings where k was improperly chosen. Dark crosses are k-means centers. On the left, there are too few centers; five should be used. On the right, too many centers are used; one center is sufficient for representing the data. In general, one center should be used to represent one Gaussian cluster.

Several algorithms have been proposed previously to determine k automatically. Like our method, most previous methods are wrappers around k-means or some other clustering algorithm for fixed k. Wrapper methods use splitting and/or merging rules for centers to increase or decrease k as the algorithm proceeds. Pelleg and Moore [14] proposed a regularization framework for learning k, which they call X-means. The algorithm searches over many values of k and scores each clustering model using the so-called Bayesian Information Criterion [10]:

$$\mathrm{BIC}(C \mid X) = L(X \mid C) - \frac{p}{2} \log n$$

where L(X|C) is the log-likelihood of the dataset X according to model C, p = k(d + 1) is the number of parameters in the model C with dimensionality d and k cluster centers, and n is the number of points in the dataset. X-means chooses the model with the best BIC score on the data.
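As a small illustration of the scoring rule just described, the following sketch evaluates the BIC for a few candidate models and keeps the one with the highest score. The function names and the log-likelihood values are hypothetical and chosen only for illustration; this is not the X-means implementation, which also performs the search over k.

```python
import math


def bic_score(log_likelihood, k, d, n):
    """BIC(C|X) = L(X|C) - (p / 2) * log(n), with p = k * (d + 1) as in the text."""
    p = k * (d + 1)
    return log_likelihood - 0.5 * p * math.log(n)


def pick_best_k(log_likelihood_by_k, d, n):
    """Return the candidate k whose model has the highest BIC score."""
    return max(
        log_likelihood_by_k,
        key=lambda k: bic_score(log_likelihood_by_k[k], k, d, n),
    )


# Hypothetical log-likelihoods of three fitted models on n = 1000 points, d = 2.
candidates = {1: -5200.0, 2: -4100.0, 3: -4095.0}
best_k = pick_best_k(candidates, d=2, n=1000)
print(best_k)
```

In this made-up example the third center improves the log-likelihood only slightly, so the penalty term decides against it.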
Aside from the BIC, other scoring functions are also available. Bischof et al. [1] use a minimum description length (MDL) framework, where the description length is a measure of how well the data are fit by the model. Their algorithm starts with a large value for k and removes centers (reduces k) whenever that choice reduces the description length. Between steps of reducing k, they use the k-means algorithm to optimize the model fit to the data. With hierarchical clustering algorithms, other methods may be employed to determine the best number of clusters. One is to build a merging tree (“dendrogram”) of the data based on a cluster distance metric, and search for areas of the tree that are stable with respect to inter- and intra-cluster distances [9, Section 5.1]. This method of estimating k is best applied with domain-specific knowledge and human intuition.

2 The Gaussian-means (G-means) algorithm

The G-means algorithm starts with a small number of k-means centers, and grows the number of centers. Each iteration of the algorithm splits into two those centers whose data appear not to come from a Gaussian distribution. Between each round of splitting, we run k-means on the entire dataset and all the centers to refine the current solution. We can initialize with just k = 1, or we can choose some larger value of k if we have some prior knowledge about the range of k. G-means repeatedly makes decisions based on a statistical test for the data assigned to each center. If the data currently assigned to a k-means center appear to be Gaussian, then we want to represent that data with only one center. However, if the same data do not appear to be Gaussian, then we want to use multiple centers to model the data properly.

Algorithm 1 G-means(X, α)
1: Let C be the initial set of centers (usually $C \leftarrow \{\bar{x}\}$).
2: C ← kmeans(C, X).
3: Let $\{x_i \mid \mathrm{class}(x_i) = j\}$ be the set of datapoints assigned to center $c_j$.
4: Use a statistical test to detect if each $\{x_i \mid \mathrm{class}(x_i) = j\}$ follows a Gaussian distribution (at confidence level α).
5: If the data look Gaussian, keep $c_j$. Otherwise replace $c_j$ with two centers.
6: Repeat from step 2 until no more centers are added.

The algorithm will run k-means multiple times (up to k times when finding k centers), so the time complexity is at most O(k) times that of k-means. The k-means algorithm implicitly assumes that the datapoints in each cluster are spherically distributed around the center. Less restrictively, the Gaussian expectation-maximization algorithm assumes that the datapoints in each cluster have a multidimensional Gaussian distribution with a covariance matrix that may or may not be fixed, or shared. The Gaussian distribution test that we present below is valid for either covariance matrix assumption. The test also accounts for the number of datapoints n tested by incorporating n in the calculation of the critical value of the test (see Equation 2). This prevents the G-means algorithm from making bad decisions about clusters with few datapoints.

2.1 Testing clusters for Gaussian fit

To specify the G-means algorithm fully we need a test to detect whether the data assigned to a center are sampled from a Gaussian. The alternative hypotheses are
• H0: The data around the center are sampled from a Gaussian.
• H1: The data around the center are not sampled from a Gaussian.
If we accept the null hypothesis H0, then we believe that the one center is sufficient to model its data, and we should not split the cluster into two sub-clusters. If we reject H0 and accept H1, then we want to split the cluster. The test we use is based on the Anderson-Darling statistic. This one-dimensional test has been shown empirically to be the most powerful normality test that is based on the empirical cumulative distribution function (ECDF). Given a list of values $x_i$ that have been converted to mean 0 and variance 1, let $x_{(i)}$ be the ith ordered value.
Let $z_i = F(x_{(i)})$, where F is the N(0, 1) cumulative distribution function. Then the statistic is

$$A^2(Z) = -\frac{1}{n} \sum_{i=1}^{n} (2i - 1) \left[ \log(z_i) + \log(1 - z_{n+1-i}) \right] - n \qquad (1)$$

Stephens [17] showed that for the case where µ and σ are estimated from the data (as in clustering), we must correct the statistic according to

$$A^2_*(Z) = A^2(Z) \left( 1 + 4/n - 25/n^2 \right) \qquad (2)$$

Given a subset of data X in d dimensions that belongs to center c, the hypothesis test proceeds as follows:
1. Choose a significance level α for the test.
2. Initialize two centers, called “children” of c. See the text for good ways to do this.
3. Run k-means on these two centers in X. This can be run to completion, or to some early stopping point if desired. Let $c_1$, $c_2$ be the child centers chosen by k-means.
4. Let $v = c_1 - c_2$ be a d-dimensional vector that connects the two centers. This is the direction that k-means believes to be important for clustering. Then project X onto v: $x'_i = \langle x_i, v \rangle / \lVert v \rVert^2$. X′ is a 1-dimensional representation of the data projected onto v. Transform X′ so that it has mean 0 and variance 1.
5. Let $z_i = F(x'_{(i)})$. If $A^2_*(Z)$ is in the range of non-critical values at confidence level α, then accept H0, keep the original center, and discard $\{c_1, c_2\}$. Otherwise, reject H0 and keep $\{c_1, c_2\}$ in place of the original center.

A primary contribution of this work is simplifying the test for Gaussian fit by projecting the data to one dimension where the test is simple to apply. The authors of [5] also use this approach for online dimensionality reduction during clustering. The one-dimensional representation of the data allows us to consider only the data along the direction that k-means has found to be important for separating the data. This is related to the problem of projection pursuit [7], where here k-means searches for a direction in which the data appears non-Gaussian. We must choose the significance level of the test, α, which is the desired probability of making a Type I error (i.e.
incorrectly rejecting H0). It is appropriate to use a Bonferroni adjustment to reduce the chance of making Type I errors over multiple tests. For example, if we want a 0.01 chance of making a Type I error in 100 tests, we should apply a Bonferroni adjustment to make each test use α = 0.01/100 = 0.0001. To find k final centers the G-means algorithm makes k statistical tests, so the Bonferroni correction does not need to be extreme. In our tests, we always use α = 0.0001. We consider two ways to initialize the two child centers. Both approaches initialize with c ± m, where c is a center and m is chosen. The first method chooses m as a random d-dimensional vector such that ||m|| is small compared to the distortion of the data. A second method finds the main principal component s of the data (having eigenvalue λ), and chooses $m = s\sqrt{2\lambda/\pi}$. This deterministic method places the two centers in their expected locations under H0. The principal component calculations require $O(nd^2 + d^3)$ time and $O(d^2)$ space, but since we only want the main principal component, we can use fast methods like the power method, which takes time that is at most linear in the ratio of the two largest eigenvalues [4]. In this paper we use principal-component-based splitting.

2.2 An example

Figure 2 shows a run of the G-means algorithm on a synthetic dataset with two true clusters and 1000 points, using α = 0.0001. The critical value for the Anderson-Darling test is 1.8692 for this confidence level. Starting with one center, after one iteration of G-means, we have 2 centers and the $A^2_*$ statistic is 38.103. This is much larger than the critical value, so we reject H0 and accept this split. On the next iteration, we split each new center and repeat the statistical test. The $A^2_*$ values for the two splits are 0.386 and 0.496, both of which are well below the critical value. Therefore we accept H0 for both tests, and discard these splits. Thus G-means gives a final answer of k = 2.
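The split test of Section 2.1 can be sketched in a few lines. The code below is a minimal illustration, not the authors' Matlab implementation: the function names are hypothetical, the child centers are passed in rather than found by k-means, the sample variance uses an n − 1 denominator, and the CDF values are clamped away from 0 and 1 to keep the logarithms finite (both are assumptions of this sketch). The default critical value 1.8692 is the one quoted above for α = 0.0001.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # N(0, 1), used for the CDF F in Eq. (1)


def corrected_anderson_darling(values):
    """Return A*^2: the statistic of Eq. (1) with the correction of Eq. (2)."""
    n = len(values)
    mean = sum(values) / n
    # Assumption of this sketch: sample standard deviation with n - 1.
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    if std == 0.0:
        raise ValueError("values have zero variance")
    ordered = sorted((v - mean) / std for v in values)
    tiny = 1e-12  # assumption: clamp so that log() never sees 0
    z = [min(max(STD_NORMAL.cdf(x), tiny), 1.0 - tiny) for x in ordered]
    total = sum(
        (2 * i - 1) * (math.log(z[i - 1]) + math.log(1.0 - z[n - i]))
        for i in range(1, n + 1)
    )
    a2 = -total / n - n
    return a2 * (1.0 + 4.0 / n - 25.0 / n ** 2)


def project_onto_split(points, c1, c2):
    """Project each point onto v = c1 - c2: x'_i = <x_i, v> / ||v||^2."""
    v = [a - b for a, b in zip(c1, c2)]
    norm_sq = sum(c * c for c in v)
    return [sum(p * c for p, c in zip(point, v)) / norm_sq for point in points]


def should_split(points, c1, c2, critical_value=1.8692):
    """Reject H0 (and keep the two children) when A*^2 exceeds the critical value."""
    projected = project_onto_split(points, c1, c2)
    return corrected_anderson_darling(projected) > critical_value


# Deterministic toy data: Gaussian quantiles along the first axis.
quantiles = [STD_NORMAL.inv_cdf((i + 0.5) / 200) for i in range(200)]
one_blob = [(q, 0.0) for q in quantiles]
two_blobs = [(q - 5.0, 0.0) for q in quantiles] + [(q + 5.0, 0.0) for q in quantiles]

print(should_split(one_blob, (-0.8, 0.0), (0.8, 0.0)))
print(should_split(two_blobs, (-5.0, 0.0), (5.0, 0.0)))
```

The toy data is built from Gaussian quantiles rather than random draws so that the outcome does not depend on a random seed: the single blob is kept as one center and the two well-separated blobs are split.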
Figure 2: An example of running G-means for three iterations on a 2-dimensional dataset with two true clusters and 1000 points. Starting with one center (left plot), G-means splits into two centers (middle). The test for normality is significant, so G-means rejects H0 and keeps the split. After splitting each center again (right), the test values are not significant, so G-means accepts H0 for both tests and does not accept these splits. The middle plot is the G-means answer. See the text for further details.

2.3 Statistical power

Figure 3: A comparison of the power of the Anderson-Darling test versus the BIC. For the AD test we fix the significance level (α = 0.0001), while the BIC’s significance level depends on n. The left plot shows the probability of incorrectly splitting (Type I error) one true 2-d cluster that is 5% elliptical. The right plot shows the probability of incorrectly not splitting two true clusters separated by 5σ (Type II error). Both plots are functions of n (the number of datapoints). Both plots show that the BIC overfits (splits clusters) when n is small.

Figure 3 shows the power of the Anderson-Darling test, as compared to the BIC. Lower is better for both plots. We run 1000 tests for each data point plotted for both plots. In the left plot, for each test we generate n datapoints from a single true Gaussian distribution, and then plot the frequency with which BIC and G-means will choose k = 2 rather than k = 1 (i.e. commit a Type I error). BIC tends to overfit by choosing too many centers when the data is not strictly spherical, while G-means does not. This is consistent with the tests of real-world data in the next section.
While G-means commits more Type II errors when n is small, this prevents it from overfitting the data. The BIC can be considered a likelihood ratio test, but with a significance level that cannot be fixed. The significance level instead varies depending on n and ∆k (the change in the number of model parameters between two models). As n or ∆k decrease, the significance level increases (the BIC becomes weaker as a statistical test) [10]. Figure 3 shows this effect for varying n. In [11] the authors show that penalty-based methods require problem-specific tuning and don’t generalize as well as other methods, such as cross validation.

3 Experiments

Table 1: Results for many synthetic datasets. We report distortion relative to the optimum distortion for the correct clustering (closer to one is better), and time is reported relative to k-means run with the correct k. For BIC, larger values are better, but it is clear that finding the correct clustering does not always coincide with finding a larger BIC. Items with a star are where X-means always chose the largest number of centers we allowed.

dataset          d    method    k found        BIC(×10^4)     distortion(× optimal)   time(× k-means)
synthetic k=5    2    G-means     9.1± 9.9     -0.19±2.70      0.89± 0.23             13.2
synthetic k=5    2    X-means    18.1± 3.2      0.70±0.93      0.37± 0.12              2.8
synthetic k=20   2    G-means    20.1± 0.6      0.21±0.18      0.99± 0.01              2.1
synthetic k=20   2    X-means    70.5±11.6     14.83±3.50      9.45±28.02              1.2
synthetic k=80   2    G-means    80.0± 0.2      1.84±0.12      1.00± 0.01              2.2
synthetic k=80   2    X-means   171.7±23.7     40.16±6.59     48.49±70.04              1.8
synthetic k=5    8    G-means     5.0± 0.0     -0.74±0.16      1.00± 0.00              4.6
synthetic k=5    8    X-means   *20.0± 0.0     -2.28±0.20      0.47± 0.03             11.0
synthetic k=20   8    G-means    20.0± 0.1     -0.18±0.17      0.99± 0.00              2.6
synthetic k=20   8    X-means   *80.0± 0.0     14.36±0.21      0.47± 0.01              4.0
synthetic k=80   8    G-means    80.2± 0.5      1.45±0.20      0.99± 0.00              2.9
synthetic k=80   8    X-means   229.2±36.8     52.28±9.26      0.57± 0.06              6.5
synthetic k=5    32   G-means     5.0± 0.0     -3.36±0.21      1.00± 0.00              4.4
synthetic k=5    32   X-means   *20.0± 0.0    -27.92±0.22      0.76± 0.00             29.9
synthetic k=20   32   G-means    20.0± 0.0     -2.73±0.22      1.00± 0.00              2.3
synthetic k=20   32   X-means   *80.0± 0.0    -11.13±0.23      0.76± 0.01             21.2
synthetic k=80   32   G-means    80.0± 0.0     -1.10±0.16      1.00± 0.00              2.8
synthetic k=80   32   X-means   171.5±10.9     11.78±2.74      0.84± 0.01             53.3

Figure 4: 2-d synthetic dataset with 5 true clusters. On the left, G-means correctly chooses 5 centers and deals well with non-spherical data. On the right, the BIC causes X-means to overfit the data, choosing 20 unevenly distributed clusters.

Table 1 shows the results from running G-means and X-means on many large synthetic datasets. On synthetic datasets with spherically distributed clusters, G-means and X-means do equally well at finding the correct k and maximizing the BIC statistic, so we don’t show these results here. Most real-world data is not spherical, however. The synthetic datasets used here each have 5000 datapoints in d = 2/8/32 dimensions. The true ks are 5, 20, and 80. For each synthetic dataset type, we generate 30 datasets with the true center means chosen uniformly randomly from the unit hypercube, and choosing σ so that no two clusters are closer than 3σ apart. Each cluster is also given a transformation to make it non-spherical, by multiplying the data by a randomly chosen scaling and rotation matrix. We run G-means starting with one center.
We allow X-means to search between 2 and 4k centers (where here k is the true number of clusters). The G-means algorithm clearly does better at finding the correct k on non-spherical data. Its results are closer to the true distortions and the correct ks. The BIC statistic that X-means uses has been formulated to maximize the likelihood for spherically-distributed data. Thus it overestimates the number of true clusters in non-spherical data. This is especially evident when the number of points per cluster is small, as in datasets with 80 true clusters. Because of this overestimation, X-means often hits our limit of 4k centers. Figure 4 shows an example of overfitting on a dataset with 5 true clusters. X-means chooses k = 20 while G-means finds all 5 true cluster centers. Also of note is that X-means does not distribute centers evenly among clusters; some clusters receive one center, but others receive many. G-means runs faster than X-means for 8 and 32 dimensions, which we expect, since the kd-tree structures which make X-means fast in low dimensions take time exponential in d, making them slow for more than 8 to 12 dimensions. All our code is written in Matlab; X-means is written in C.

Figure 5: NIST and Pendigits datasets: correspondence between each digit (row) and each cluster (column) found by G-means. G-means did not have the labels, yet it found meaningful clusters corresponding with the labels.

3.1 Discovering true clusters in labeled data

We tested these algorithms on two real-world datasets for handwritten digit recognition: the NIST dataset [12] and the Pendigits dataset [2]. The goal is to cluster the data without knowledge of the labels and measure how well the clustering captures the true labels. Both datasets have 10 true classes (digits 0-9). NIST has 60000 training examples and 784 dimensions (28×28 pixels).
We use 6000 randomly chosen examples and we reduce the dimension to 50 by random projection (following [3]). The Pendigits dataset has 7984 examples and 16 dimensions; we did not change the data in any way. We cluster each dataset with G-means and X-means, and measure performance by comparing the cluster labels $L_c$ with the true labels $L_t$. We define the partition quality (PQ) as

$$\mathrm{PQ} = \frac{\sum_{i=1}^{k_t} \sum_{j=1}^{k_c} p(i, j)^2}{\sum_{i=1}^{k_t} p(i)^2}$$

where $k_t$ is the true number of classes, and $k_c$ is the number of clusters found by the algorithm. This metric is maximized when $L_c$ induces the same partition of the data as $L_t$; in other words, when all points in each cluster have the same true label, and the estimated k is the true k. The p(i, j) term is the frequency-based probability that a datapoint will be labeled i by $L_t$ and j by $L_c$. This quality is normalized by the sum of true probabilities, squared. This statistic is related to the Rand statistic for comparing partitions [8]. For the NIST dataset, G-means finds 31 clusters in 30 seconds with a PQ score of 0.177. X-means finds 715 clusters in 4149 seconds, and 369 of these clusters contain only one point, indicating an overestimation problem with the BIC. X-means receives a PQ score of 0.024. For the Pendigits dataset, G-means finds 69 clusters in 30 seconds, with a PQ score of 0.196; X-means finds 235 clusters in 287 seconds, with a PQ score of 0.057. Figure 5 shows Hinton diagrams of the G-means clusterings of both datasets, showing that G-means succeeds at identifying the true clusters concisely, without aid of the labels. The confusions between different digits in the NIST dataset (seen in the off-diagonal elements) are common for other researchers using more sophisticated techniques, see [3].
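To make the partition quality measure concrete, here is a minimal sketch that computes it from two label lists, reading the definition as the sum of squared joint frequencies p(i, j) divided by the sum of squared true-class frequencies p(i). The function name and the toy labels are hypothetical.

```python
from collections import Counter


def partition_quality(true_labels, cluster_labels):
    """PQ = sum_{i,j} p(i, j)^2 / sum_i p(i)^2, with frequency-based probabilities."""
    if not true_labels or len(true_labels) != len(cluster_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(true_labels)
    joint = Counter(zip(true_labels, cluster_labels))  # counts of (true i, cluster j)
    marginal = Counter(true_labels)                    # counts of true label i
    numerator = sum((count / n) ** 2 for count in joint.values())
    denominator = sum((count / n) ** 2 for count in marginal.values())
    return numerator / denominator


truth = [0, 0, 1, 1]
same_partition = ["a", "a", "b", "b"]  # cluster names differ, partition is identical
over_split = [0, 1, 2, 3]              # every point in its own cluster

print(partition_quality(truth, same_partition))
print(partition_quality(truth, over_split))
```

The second call illustrates how splitting each true class into singleton clusters lowers the score, which matches the low PQ values reported for X-means when it produces many one-point clusters.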
4 Discussion and conclusions We have introduced the new G-means algorithm for learning k based on a statistical test for determining whether datapoints are a random sample from a Gaussian distribution with arbitrary dimension and covariance matrix. The splitting uses dimension reduction and a powerful test for Gaussian fitness. G-means uses this statistical test as a wrapper around k-means to discover the number of clusters automatically. The only parameter supplied to the algorithm is the significance level of the statistical test, which can easily be set in a standard way. The G-means algorithm takes linear time and space (plus the cost of the splitting heuristic and test) in the number of datapoints and dimension, since k-means is itself linear in time and space. Empirically, the G-means algorithm works well at finding the correct number of clusters and the locations of genuine cluster centers, and we have shown it works well in moderately high dimensions. Clustering in high dimensions has been an open problem for many years. Recent research has shown that it may be preferable to use dimensionality reduction techniques before clustering, and then use a low-dimensional clustering algorithm such as k-means, rather than clustering in the high dimension directly. In [3] the author shows that using a simple, inexpensive linear projection preserves many of the properties of data (such as cluster distances), while making it easier to find the clusters. Thus there is a need for good-quality, fast clustering algorithms for low-dimensional data. Our work is a step in this direction. Additionally, recent image segmentation algorithms such as normalized cut [16, 13] are based on eigenvector computations on distance matrices. These “spectral” clustering algorithms still use k-means as a post-processing step to find the actual segmentation and they require k to be specified. Thus we expect G-means will be useful in combination with spectral clustering. 
References
[1] Horst Bischof, Aleš Leonardis, and Alexander Selb. MDL principle for robust vector quantisation. Pattern Analysis and Applications, 2:59–72, 1999.
[2] C.L. Blake and C.J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
[3] Sanjoy Dasgupta. Experiments with random projection. In Uncertainty in Artificial Intelligence: Proceedings of the Sixteenth Conference (UAI-2000), pages 143–151, San Francisco, CA, 2000. Morgan Kaufmann Publishers.
[4] Gianna M. Del Corso. Estimating an eigenvector by the power method with a random start. SIAM Journal on Matrix Analysis and Applications, 18(4):913–937, 1997.
[5] Chris Ding, Xiaofeng He, Hongyuan Zha, and Horst Simon. Adaptive dimension reduction for clustering high dimensional data. In Proceedings of the 2nd IEEE International Conference on Data Mining, 2002.
[6] Fredrik Farnstrom, James Lewis, and Charles Elkan. Scalability for clustering algorithms revisited. SIGKDD Explorations, 2(1):51–57, 2000.
[7] Peter J. Huber. Projection pursuit. Annals of Statistics, 13(2):435–475, June 1985.
[8] L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2:193–218, 1985.
[9] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264–323, 1999.
[10] Robert E. Kass and Larry Wasserman. A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90(431):928–934, 1995.
[11] Michael J. Kearns, Yishay Mansour, Andrew Y. Ng, and Dana Ron. An experimental and theoretical comparison of model selection methods. In Computational Learning Theory (COLT), pages 21–30, 1995.
[12] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[13] Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. Neural Information Processing Systems, 14, 2002.
[14] Dan Pelleg and Andrew Moore. X-means: Extending K-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conf. on Machine Learning, pages 727–734. Morgan Kaufmann, San Francisco, CA, 2000.
[15] Peter Sand and Andrew Moore. Repairing faulty mixture models using density estimation. In Proceedings of the 18th International Conf. on Machine Learning. Morgan Kaufmann, San Francisco, CA, 2001.
[16] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
[17] M. A. Stephens. EDF statistics for goodness of fit and some comparisons. Journal of the American Statistical Association, 69(347):730–737, September 1974.
3 0.78743124 46 nips-2003-Clustering with the Connectivity Kernel
Author: Bernd Fischer, Volker Roth, Joachim M. Buhmann
Abstract: Clustering aims at extracting hidden structure in a dataset. While the problem of finding compact clusters has been widely studied in the literature, extracting arbitrarily formed elongated structures is considered a much harder problem. In this paper we present a novel clustering algorithm which tackles the problem by a two step procedure: first the data are transformed in such a way that elongated structures become compact ones. In a second step, these new objects are clustered by optimizing a compactness-based criterion. The advantages of the method over related approaches are threefold: (i) robustness properties of compactness-based criteria naturally transfer to the problem of extracting elongated structures, leading to a model which is highly robust against outlier objects; (ii) the transformed distances induce a Mercer kernel which allows us to formulate a polynomial approximation scheme to the generally NP-hard clustering problem; (iii) the new method does not contain free kernel parameters in contrast to methods like spectral clustering or mean-shift clustering.
4 0.67492533 107 nips-2003-Learning Spectral Clustering
Author: Francis R. Bach, Michael I. Jordan
Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive a new cost function for spectral clustering based on a measure of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing this cost function with respect to the partition leads to a new spectral clustering algorithm. Minimizing with respect to the similarity matrix leads to an algorithm for learning the similarity matrix. We develop a tractable approximation of our cost function that is based on the power method of computing eigenvectors.
5 0.62784475 87 nips-2003-Identifying Structure across Pre-partitioned Data
Author: Zvika Marx, Ido Dagan, Eli Shamir
Abstract: We propose an information-theoretic clustering approach that incorporates a pre-known partition of the data, aiming to identify common clusters that cut across the given partition. In the standard clustering setting the formation of clusters is guided by a single source of feature information. The newly utilized pre-partition factor introduces an additional bias that counterbalances the impact of the features whenever they become correlated with this known partition. The resulting algorithmic framework was applied successfully to synthetic data, as well as to identifying text-based cross-religion correspondences. 1 Introduction The standard task of feature-based data clustering deals with a single set of elements that are characterized by a unified set of features. The goal of the clustering task is to identify implicit constructs, or themes, within the clustered set, grouping together elements that are characterized similarly by the features. In recent years there has been growing interest in more complex clustering settings, in which additional information is incorporated [1], [2]. Several such extensions ([3]-[5]) are based on the information bottleneck (IB) framework [6], which facilitates coherent information-theoretic representation of different information types. In a recent line of research we have investigated the cross-dataset clustering task [7], [8]. In this setting, some inherent a-priori partition of the clustered data to distinct subsets is given. The clustering goal is to identify corresponding (analogous) structures that cut across the different subsets, while ignoring internal structures that characterize individual subsets. To accomplish this task, those features that commonly characterize elements across the different subsets guide the clustering process, while within-subset regularities are neutralized.
In [7], we presented a distance-based hard clustering algorithm for the coupled-clustering problem, in which the clustered data is pre-partitioned to two subsets. In [8], our setting, generalized to pre-partitions of any number of subsets, was addressed by a heuristic extension of the probabilistic IB algorithm, yielding improved empirical results. Specifically, the algorithm in [8] was based on a modification of the IB stable-point equation, which amplified the impact of features characterizing a formed cluster across all, or most, subsets. This paper describes an information-theoretic framework that motivates and extends the algorithm proposed in [8]. The given pre-partitioning is represented via a probability distribution variable, which may represent “soft” pre-partitioning of the data, versus the strictly disjoint subsets assumed in the earlier cross-dataset framework. Further, we present a new functional that captures the cross-partition motivation. From the new functional, we derive a stable-point equation underlying our algorithmic framework in conjunction with the corresponding IB equation. Our algorithm was tested empirically on synthetic data and on a real-world text-based task that aimed to identify corresponding themes across distinct religions. We have cross-clustered five sets of keywords that were extracted from topical corpora of texts about Buddhism, Christianity, Hinduism, Islam and Judaism. In distinction from standard clustering results, our algorithm reveals themes that are common to all religions, such as sacred writings, festivals, narratives and myths and theological principles, and avoids topical clusters that correspond to individual religions (for example, ‘Christmas’ and ‘Easter’ are clustered together with ‘Ramadan’ rather than with ‘Church’). Finally, we have paid specific attention to the framework of clustering with side information [4].
While this approach was presented for a somewhat different mindset, it might be used directly to address clustering across pre-partitioned data. We compare the technical details of the two approaches and demonstrate empirically that clustering with side information does not seem appropriate for the kind of cross-partition tasks that we explored. 2 The Information Bottleneck Method Probabilistic (“soft”) data clustering outputs, for each element x of the set being clustered and each cluster c, an assignment probability p(c|x). The IB method [6] interprets probabilistic clustering as lossy data compression. The given data is represented by a random variable X ranging over the clustered elements. X is compressed through another random variable C, ranging over the clusters. Every element x is characterized by conditional probability distribution p(Y|x), where Y is a third random variable taking the members y of a given set of features as values. The IB method formalizes the clustering task as minimizing the IB functional:
$$L^{(IB)} = I(C;X) - \beta\, I(C;Y). \qquad (1)$$
As known from information theory (Ch. 13 of [9]), minimizing the mutual information I(C; X) optimizes distorted compression rate. A complementary bias to maximize I(C; Y) is interpreted in [6] as articulating the level of relevance of Y to the obtained clustering, inferred from the level by which C can predict Y. β is a free parameter counterbalancing the two biases. It is shown in [6] that p(c|x) values that minimize L(IB) satisfy the following equation:
$$p(c|x) = \frac{1}{z(\beta,x)}\, p(c)\, e^{-\beta D_{KL}[p(Y|x)\,\|\,p(Y|c)]}, \qquad (2)$$
where DKL stands for the Kullback-Leibler (KL) divergence, or relative entropy, between two distributions and z(β,x) is a normalization function over C. Eq. (2) implies that, optimally, x is assigned to c in proportion to their KL distance in a feature distribution space, where the distribution p(Y|c) takes the role of a representative, or centroid, of c. The feature variable Y is hence utilized as the (exclusive) means to guide clustering, beyond the random nature of compression.
Start at time t = 0 and iterate the following update-steps, till convergence:
IB1: initialize $p_t(c|x)$ randomly or arbitrarily (t = 0); $p_t(c|x) \propto p_{t-1}(c)\, e^{-\beta D_{KL}[p(Y|x)\,\|\,p_{t-1}(Y|c)]}$ (t > 0)
IB2: $p_t(c) = \sum_x p_t(c|x)\, p(x)$
IB3: $p_t(y|c) = \frac{1}{p_t(c)} \sum_x p_t(c|x)\, p(y|x)\, p(x)$
Figure 1: The Information Bottleneck iterative algorithm (with fixed β and |C|).
Figure 1 presents the IB iterative algorithm for a fixed value of β. The IB1 update step follows Eq. (2). The other two steps, which are derived from the IB functional as well, estimate the p(c) and p(y|c) values required for the next iteration. The algorithm converges to a local minimum of the IB functional. The IB setting, particularly the derivation of steps IB1 and IB3 of the algorithm, assumes that Y and C are independent given X, that is: I(C; Y|X) = ∑x p(x) I(C|x; Y|x) = 0. The balancing parameter β affects the number of distinct clusters being formed in a manner that resembles (inverse) temperature in physical systems. The higher β is (i.e., the stronger the bias to construct C that predicts Y well), more distinct clusters are required for encoding the data. For each |C| = 2, 3, …, there is a minimal β value, enabling the formation of |C| distinct clusters. Setting β to be smaller than this critical value corresponding to the current |C| would result in two or more clusters that are identical to one another.
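The update steps of Figure 1 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `ib_iterate`, `kl` and `normalize_exp` are hypothetical, distributions are passed as nested lists, a small floor `EPS` stands in for zero probabilities inside logarithms, and a fixed number of rounds replaces a convergence test.

```python
import math
import random

EPS = 1e-12  # floor that keeps logarithms and divisions finite (assumption)


def kl(p, q):
    """D_KL[p || q] in nats for two discrete distributions of equal length."""
    return sum(a * math.log(a / max(b, EPS)) for a, b in zip(p, q) if a > 0)


def normalize_exp(logits):
    """Exponentiate log-weights and normalize them to sum to one."""
    top = max(logits)
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [v / total for v in weights]


def ib_iterate(p_x, p_y_x, n_clusters, beta, n_iter=100, init=None, seed=0):
    """Run steps IB1-IB3 for a fixed beta and a fixed number of clusters.

    p_x[x] is p(x); p_y_x[x][y] is p(y|x); init, if given, is p_0(c|x).
    Returns (p(c|x), p(c), p(y|c)) after n_iter rounds.
    """
    n_x, n_y = len(p_x), len(p_y_x[0])
    if init is None:
        # IB1 at t = 0: random soft assignments.
        rng = random.Random(seed)
        init = [normalize_exp([rng.random() for _ in range(n_clusters)])
                for _ in range(n_x)]
    p_c_x = [row[:] for row in init]
    p_c, p_y_c = None, None
    for t in range(n_iter):
        if t > 0:
            # IB1 at t > 0: p(c|x) proportional to p(c) exp(-beta * KL).
            p_c_x = [normalize_exp([math.log(max(p_c[c], EPS))
                                    - beta * kl(p_y_x[x], p_y_c[c])
                                    for c in range(n_clusters)])
                     for x in range(n_x)]
        # IB2: cluster priors.
        p_c = [sum(p_c_x[x][c] * p_x[x] for x in range(n_x))
               for c in range(n_clusters)]
        # IB3: cluster centroids in feature space.
        p_y_c = [[sum(p_c_x[x][c] * p_y_x[x][y] * p_x[x] for x in range(n_x))
                  / max(p_c[c], EPS)
                  for y in range(n_y)]
                 for c in range(n_clusters)]
    return p_c_x, p_c, p_y_c
```

Passing a slightly biased `init` makes a run reproducible; with two well-separated groups of elements and a β above the critical value, the soft assignments sharpen towards a hard partition, whereas a small β leaves the clusters identical, in line with the discussion of critical β values above.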
Based on this, the iterative algorithm is applied repeatedly within a gradual cooling-like (deterministic annealing) scheme: starting with random initialization of the p0(c|x)'s, generate two clusters with the critical β value, found empirically, for |C| = 2. Then, use a perturbation on the obtained two-cluster configuration to initialize the p0(c|x)'s for a larger set of clusters and execute additional runs of the algorithm to identify the critical β value for the larger |C|. And so on: each output configuration is used as a basis for a more granular one. The final outcome is a “soft hierarchy” of probabilistic clusters. 3 Cross-partition Clustering Cross-partition (CP) clustering introduces a factor – a pre-given partition of the clustered data – additional to what is considered in a standard clustering setting. For representing this factor we introduce the pre-partitioning variable W, ranging over all parts w of the pre-given partition. Every data element x is associated with W through a given probability distribution p(W|x). Our goal is to cluster the data, so that the clusters C would not be correlated with W. We notice that Y, which is intended to direct the formation of clusters, might be a-priori correlated with W, so the formed clusters might end up being correlated with W as well. Our method aims at eliminating this aspect of Y. 3.1 Information Defocusing As noted, some of the information conveyed by Y characterizes structures correlated with W, while the other part of the information characterizes the target cross-W structures. We are interested in detecting the latter while filtering out the former. However, there is no direct a-priori separation between the two parts of the Y-mediated information. Our strategy in tackling this difficulty is: we follow in general Y's directions, as the IB method does, while avoiding Y's impact whenever it entails undesired inter-dependencies of C and W.
Our strategy implies conflicting biases with regard to the mutual information I(C;Y): it should be maximized in order to form meaningful clusters, but be minimized as well in the specific context where Y entails C–W dependencies. Accordingly, we propose a computational procedure directed by two distinct cost-terms in tandem. The first one is the IB functional (Eq. 1), introducing the bias to maximize I(C;Y). With this bias alone, Y might dictate (or “explain”, in retrospect) substantial C–W dependencies, implying a low I(C;W|Y) value.¹ Hence, the guideline of preventing Y from accounting for C–W dependencies is realized through an opposing bias of maximizing I(C;W|Y) = ∑y p(y) I(C|y; W|y). The second cost term – the Information Defocusing (ID) functional – consequently counterbalances minimization of I(C;Y) against the new bias:
$$L^{(ID)} = I(C;Y) - \eta\, I(C;W|Y), \qquad (3)$$
where η is a free parameter articulating the tradeoff between the biases. The ID functional captures our goal of reducing the impact of Y selectively: “defocusing” a specific aspect of the information Y conveys: the information correlated with W. In a like manner to the stable-point equation of the IB functional (Eq. 2), we derive the following stable-point equation for the ID functional:
$$p(c|y) = \frac{1}{z(\eta,y)}\, p(c) \prod_w p(y|c,w)^{\frac{\eta}{\eta+1} p(w)}, \qquad (4)$$
where z(η,y) is a normalization function over C. The derivation relies on an additional assumption, I(C;W) = 0, imposing the intended independence between C and W (the detailed derivation will be described elsewhere). The intuitive interpretation of Eq. (4) is as follows: a feature y is to be associated with a cluster c in proportion to a weighted, though flattened, geometric mean of the “W-projected centroids” p(y|c,w), priored by p(c).² This scheme overweighs y's that contribute to c evenly across W. Thus, clusters satisfying Eq. (4) are situated around centroids biased towards evenly contributing features.
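To make the two terms of the ID functional in Eq. (3) concrete, the sketch below evaluates I(C;Y), I(C;W|Y) and their weighted difference for a small joint distribution. The function name `id_functional` and the dictionary representation of p(c, y, w) are assumptions made for this example; quantities are in nats.

```python
import math
from collections import defaultdict


def id_functional(joint, eta):
    """Evaluate I(C;Y) - eta * I(C;W|Y) for a joint distribution.

    joint maps (c, y, w) triples to probabilities that sum to one.
    Returns (L_ID, I(C;Y), I(C;W|Y)).
    """
    p_c = defaultdict(float)
    p_y = defaultdict(float)
    p_cy = defaultdict(float)
    p_wy = defaultdict(float)
    for (c, y, w), p in joint.items():
        p_c[c] += p
        p_y[y] += p
        p_cy[(c, y)] += p
        p_wy[(w, y)] += p
    # I(C;Y) = sum_{c,y} p(c,y) log [ p(c,y) / (p(c) p(y)) ]
    i_cy = sum(p * math.log(p / (p_c[c] * p_y[y]))
               for (c, y), p in p_cy.items() if p > 0)
    # I(C;W|Y) = sum_{c,y,w} p(c,y,w) log [ p(c,y,w) p(y) / (p(c,y) p(w,y)) ]
    i_cw_y = sum(p * math.log(p * p_y[y] / (p_cy[(c, y)] * p_wy[(w, y)]))
                 for (c, y, w), p in joint.items() if p > 0)
    return i_cy - eta * i_cw_y, i_cy, i_cw_y


if __name__ == "__main__":
    # Clusters that copy the pre-partition (c == w), with Y independent of both.
    copy_w = {(0, 0, 0): 0.25, (0, 1, 0): 0.25,
              (1, 0, 1): 0.25, (1, 1, 1): 0.25}
    print(id_functional(copy_w, eta=1.0))
```

In this toy case Y carries no information about C, so I(C;Y) is zero, while C and W remain fully dependent given Y, so I(C;W|Y) equals log 2: Y does not "explain" the C–W dependency in the sense of footnote 1.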
The higher η is, the heavier the emphasis put on suppressing disagreements between the w's. For η → ∞ a plain weighted geometric-mean scheme is obtained. The inclusion of a step derived from Eq. (4) in our algorithm (see below) facilitates convergence on a configuration with centroids dominated by features that are evenly distributed across W. 3.2 The Cross-partition Clustering Algorithm Our proposed cross partition (CP) clustering algorithm (Fig. 2) seeks a clustering configuration that optimizes simultaneously both the IB and ID functionals, thus obtaining clusters that cut across the pre-given partition W. To this end, the algorithm interleaves an iterative computation of the stable-point equations, and the additional estimated parameters, for both functionals.
¹ Notice that “Z explaining well the dependencies between A and B” is equivalent with “A and B sharing little information in common given Z”, i.e. low I(A;B|Z). Complete conditional independence is exemplified in the IB framework, assuming I(C;Y|X) = 0.
² Eq. (4) resembles our suggestion in [8] to compute a geometric average over the subsets; in the current paper this scheme is analytically derived from the ID functional.
Start at time t = 0 and iterate the following update-steps, till convergence:
CP1: initialize $p_t(c|x)$ randomly or arbitrarily (t = 0); $p_t(c|x) \propto p_{t-1}(c)\, e^{-\beta D_{KL}[p(Y|x)\,\|\,p_{t-1}(Y|c)]}$ (t > 0)
CP2: $p_t(c) = \sum_x p_t(c|x)\, p(x)$
CP3: $p^*_t(y|c,w) = \frac{1}{p_t(c)\, p(w)} \sum_x p_t(c|x)\, p(y|x)\, p(w|x)\, p(x)$
CP4: initialize $p^*_t(c)$ randomly or arbitrarily (t = 0); $p^*_t(c) = \sum_y p^*_{t-1}(c|y)\, p(y)$ (t > 0)
CP5: $p^*_t(c|y) \propto p^*_t(c) \prod_w p^*_t(y|c,w)^{\frac{\eta}{\eta+1} p(w)}$
CP6: $p_t(y|c) = p^*_t(c|y)\, p(y) \,/\, p^*_t(c)$
Figure 2: The cross-partition clustering iterative algorithm (with fixed β, η, and |C|).
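The six update steps of Figure 2 can be sketched as follows. This is an illustrative reading of the figure, not the authors' code, and it makes several labeled assumptions: the names `cp_iterate`, `kl` and `normalize_exp` are hypothetical; the initial assignments p_0(c|x) are supplied by the caller; the arbitrary initialization of p*(c) at t = 0 simply reuses p(c) from CP2; zero probabilities are floored at `EPS` inside logarithms; and CP6 normalizes p(y|c) explicitly over y, which coincides with dividing by p*(c) once the iteration has converged.

```python
import math

EPS = 1e-12  # floor that keeps logarithms and divisions finite (assumption)


def kl(p, q):
    """D_KL[p || q] in nats for two discrete distributions of equal length."""
    return sum(a * math.log(a / max(b, EPS)) for a, b in zip(p, q) if a > 0)


def normalize_exp(logits):
    """Exponentiate log-weights and normalize them to sum to one."""
    top = max(logits)
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [v / total for v in weights]


def cp_iterate(p_x, p_y_x, p_w_x, init_c_x, beta, eta, n_iter=100):
    """Run steps CP1-CP6 for fixed beta, eta and number of clusters.

    p_x[x] = p(x), p_y_x[x][y] = p(y|x), p_w_x[x][w] = p(w|x),
    init_c_x[x][c] = p_0(c|x). Returns (p(c|x), p(y|c)).
    """
    xs, ys = range(len(p_x)), range(len(p_y_x[0]))
    ws, cs = range(len(p_w_x[0])), range(len(init_c_x[0]))
    p_y = [sum(p_y_x[x][y] * p_x[x] for x in xs) for y in ys]
    p_w = [sum(p_w_x[x][w] * p_x[x] for x in xs) for w in ws]
    flatten = eta / (eta + 1.0)  # exponent factor of Eq. (4)
    p_c_x = [row[:] for row in init_c_x]  # CP1 at t = 0
    p_c = p_y_c = star_c_y = None
    for t in range(n_iter):
        if t > 0:
            # CP1: IB stable-point update of the assignments.
            p_c_x = [normalize_exp([math.log(max(p_c[c], EPS))
                                    - beta * kl(p_y_x[x], p_y_c[c])
                                    for c in cs])
                     for x in xs]
        # CP2: cluster priors.
        p_c = [sum(p_c_x[x][c] * p_x[x] for x in xs) for c in cs]
        # CP3: W-projected centroids p*(y|c,w), indexed [c][y][w].
        star_y_cw = [[[sum(p_c_x[x][c] * p_y_x[x][y] * p_w_x[x][w] * p_x[x]
                           for x in xs) / max(p_c[c] * p_w[w], EPS)
                       for w in ws] for y in ys] for c in cs]
        # CP4: p*(c), taken from p(c) at t = 0 (assumed initialization).
        if t == 0:
            star_c = p_c[:]
        else:
            star_c = [sum(star_c_y[y][c] * p_y[y] for y in ys) for c in cs]
        # CP5: ID stable-point update, computed in the log domain.
        star_c_y = [normalize_exp([
            math.log(max(star_c[c], EPS))
            + flatten * sum(p_w[w] * math.log(max(star_y_cw[c][y][w], EPS))
                            for w in ws)
            for c in cs]) for y in ys]
        # CP6: centroids for the next CP1 step, normalized over y.
        p_y_c = []
        for c in cs:
            row = [star_c_y[y][c] * p_y[y] for y in ys]
            total = max(sum(row), EPS)
            p_y_c.append([v / total for v in row])
    return p_c_x, p_y_c
```

A toy call might use four elements split into two parts, with two features shared across the parts per theme and one part-specific feature each. Because the part-specific features contribute to a cluster in only one w, step CP5 gives them no discriminating weight, and the assignments are driven by the shared features.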
Steps CP1, CP2 and CP6 correspond to the computations related to the IB functional, while steps CP3, CP4 and CP5, which compute a separate set of parameters (denoted by an asterisk), correspond to the ID functional. Figure 3 summarizes the roles of the two functionals in the dynamics of the CP algorithm. The two components of the iterative cycle are tied together in steps CP3 and CP6, in which parameters from one set are used as input to compute a parameter of the other set. The derivation of step CP3 relies on an additional assumption, namely that C, Y and W are jointly independent given X. This assumption, which extends to W the underlying assumption of the IB setting that C and Y are independent given X, still entails the IB stable point equation. At convergence, the stable point equations for both the IB and ID functionals are satisfied, each by its own set of parameters (in steps CP1 and CP5). The deterministic annealing scheme, which gradually increases β over repeated runs (see Sec. 2), is applied for the CP algorithm as well with η held fixed. For a given target number of clusters |C|, the algorithm empirically converges with a wide range of η values.³
Figure 3: The interplay of the IB and the ID functionals in the CP algorithm. (Diagram not recoverable; its labels read: I(C;X) ↓, IB, β ↑, I(C;Y), ↓, ID, η ↑, I(C;W|Y), with the assumptions I(C;Y;W|X) = 0 and I(C;W) = 0.)
³ High η values tend to dictate centroids with features that are unevenly distributed across W, resulting in shrinkage of some of the clusters. Further analysis will be provided in future work.
4 Experimental Results Our synthetic setting consisted of 75 virtual elements, evenly pre-partitioned into three 25-element parts denoted X1, X2 and X3 (in our formalism, for each clustered element x, p(w|x) = 1 holds for either w = 1, 2, or 3). On top of this pre-partition, we partitioned the data twice, getting two (exhaustive) clustering configurations:
1. Target cross-W clustering: five clusters, each with representatives from all Xw's;
2. Masking within-w clustering: six clusters, each consisting of roughly half the elements of either X1, X2 or X3 with no representatives from the other Xw's.
Each cluster, of both configurations, was characterized by a designated subset of features. Masking clusters were designed to be more salient than target clusters: they had more designated features (60 vs. 48 per cluster, i.e., 360 vs. 240 in total) and their elements shared higher feature-element (virtual) co-occurrence counts with those designated features (900 vs. 450 per element-feature pair). Noise (random positive integer < 200) was added to all counts associating elements with their designated features (for both within-w and cross-W clusters), as well as to roughly a quarter of the zero counts associating elements with the rest of the features. The plain IB method consistently produced configurations strongly correlated with the masking clustering, while the CP algorithm revealed the target configuration. We got (see Table 1A) almost perfect results in configurations of nearly equal-sized cross-W clusters, and somewhat less perfect reconstruction in configurations of diverging sizes (6, 9, 15, 21 and 24). Performance level was measured relative to the optimal target-output cluster match by the proportion of elements correctly assigned, where assignment of an element x follows its highest p(c|x). The results indicated were averaged over 200 runs. They were obtained for the optimal η, which was found to be higher in the diverging-sizes task. In the text-based task, the clustered elements – keywords – were automatically extracted from five distinct corpora addressing five religions: introductory web pages, online magazines, encyclopedic entries etc., all downloaded from the Internet. The clustered keyword set X was consequently pre-partitioned to disjoint subsets {Xw} w∈W, one for each religion⁴ (|Xw| ≈ 200 for each w).
We conducted experiments simultaneously involving religion pairs as well as all five religions. We took the features Y to be a set of words that commonly occur within all five corpora (|Y| ≈ 7000). x–y co-occurrences were recorded within a ±5-word sliding window truncated by sentence boundaries. η was fixed to a value (1.0) enabling the formation of 20 clusters in all settings. The obtained clusters revealed interesting cross-religion themes (see Sec. 1). For instance, the cluster (one of nine) capturing the theme of sacred festivals: the three highest p(c|x) members within each religion were Full-moon, Ceremony, Celebration (Buddhism); Easter, Sunday, Christmas (Christianity); Puja, Ceremony, Festival (Hinduism); Id-al-Fitr, Friday, Ramadan (Islam); and Sukkoth, Shavuot, Rosh-Hodesh (Judaism).
Table 1: Average correct assignment proportion scores for the synthetic task (A) and Jaccard-coefficient scores for the religion keyword classification task (B).
A. Synthetic Data         IB           CP
   equal-size clusters    .305         .985
   non-equal clusters     .292         .827
B. Religion Data          IB           Coupled Clustering [7]    CP
   religion pairs         .200±.100    .220±.138                 .407±.144
   all five (one case)    .104         –                         .167
   (cross-expert agreement on religion pairs: .462±.232)
⁴ A keyword x that appeared in the corpora of different religions was considered as a distinct element for each religion, so the Xw were kept disjointed.
The closest cluster produced by the plain IB method was poorer by far, including Islamic Ramadan, and Id and Jewish Passover, Rosh-Hashanah and Sabbath (which our method ranked high too), but no single related term from the other religions. Our external evaluation standards were cross-religion keyword classes constructed manually by experts of comparative religion studies. One such expert classification involved all five religions, and eight classifications addressed religions in pairs.
Each of the eight religion-pair classifications was contributed by two independent experts using the same keywords, so we could also assess the agreement between experts. As an overlap measure we employed the Jaccard coefficient: the number of element pairs co-assigned together by both one of the evaluated clusters and one of the expert classes, divided by the number of pairs co-assigned by either our clusters or the expert (or both). We did not assume the number of expert classes is known in advance (as done in the synthetic experiments), so the results were averaged over all configurations of the 2–16 cluster hierarchy, for each experiment. The results shown in Table 1B – clear improvement relative to plain IB and the distance-based coupled clustering [7] – are, however, persistent when the number of clusters is taken to be equal to the number of classes, or if only the best score in the hierarchy is considered. The level of cross-expert agreement indicates that our results are reasonably close to the scores expected in such a subjective task. 5 Comparison to Related Work The information bottleneck framework served as the basis for several approaches that represent additional information in their clustering setting. The multivariate information bottleneck (MIB) adapts the IB framework for networks of multiple variables [3]. However, all variables in such networks are either compressed (like X), or predicted (like Y). The incorporation of an empirical variable to be masked or defocused in the sense of our W is not possible. Including such variables in the MIB framework might be explored in future work. Particularly relevant to our work is the IB-based method for extracting relevant constructs with side information [4]. This approach addresses settings in which two different types of features are distinguished explicitly: relevant versus irrelevant ones, denoted by Y+ and Y−.
Both types of features are incorporated within a single functional to be minimized: L(IB-side-info) = I(C; X) − β ( I(C; Y+) − γ I(C; Y−) ), which directly drives clustering to de-correlate C and Y−. Formally, our setting can be mapped to the side information setting by regarding the pre-partition W simply as the additional set of irrelevant features, giving symmetric (and opposite) roles to W and Y. However, it seems that this view does not address properly the desired cross-partition setting. In our setting, it is assumed that clustering should be guided in general by Y, while W should only neutralize particular information within Y that would otherwise yield the undesired correlation between C and W (as described in Section 3.1). For that reason, the defocusing functional ties the three variables together by conditioning the de-correlation of C and W on Y, while its underlying assumption ensures the global de-correlation. Indeed, our method was found empirically superior on the cross-dataset task. The side-information IB method (the iterative algorithm with best scoring γ) achieves a correct assignment proportion of 0.52 in both synthetic tasks, where our method scored 0.99 and 0.83 (see Table 1A) and, in the religion-pair keyword classification task, its Jaccard coefficient improved by 20% relative to plain IB (compared to our 100% improvement, see Table 1B). 6 Conclusions This paper addressed the problem of clustering a pre-partitioned dataset, aiming to detect new internal structures that are not correlated with the pre-given partition but rather cut across its components. The proposed framework extends the cross-dataset clustering algorithm [8], providing better formal grounding and representing any pre-given (soft) partition of the dataset.
Supported by empirical evidence, we suggest that our framework is better suited for the cross-partition task than applying the side-information framework [4], which was originally developed to address a somewhat different setting. We also demonstrate substantial empirical advantage over the distance-based coupled-clustering algorithm [7]. As an applied real-world goal, the algorithm successfully detects cross-religion commonalities. This goal exemplifies the more general notion of detecting analogies across different systems, which is a somewhat vague and non-consensual task and therefore especially challenging for a computational framework. Our approach can be viewed as an initial step towards principled identification of “hidden” commonalities between substantially different real world systems, while suppressing the vast majority of attributes that are irrelevant for the analogy. Further research may study the role of defocusing in supervised learning, where some pre-given partitions might mask the role of underlying discriminative features. Additionally, it would be interesting to explore relationships to other disciplines, e.g., network information theory ([9], Ch. 14) which provided motivation for the side-information approach. Finally, both frameworks (ours and side-information) suggest the importance of dealing wisely with information that should not dictate the clustering output directly.
Acknowledgments We thank Yuval Krymolowski for helpful discussions and Tiina Mahlamäki, Eitan Reich and William Shepard, for contributing the religion keyword classifications.
References
[1] Hofmann, T. (2001) Unsupervised learning by probabilistic latent semantic analysis. Journal of Machine Learning Research, 41(1):177-196.
[2] Wagstaff K., Cardie C., Rogers S. and Schroedl S., 2001. Constrained K-Means clustering with background knowledge. The 18th International Conference on Machine Learning (ICML-2001), pp. 577-584.
[3] Friedman N., Mosenzon O., Slonim N. & Tishby N. (2002) Multivariate information bottleneck. The 17th conference on Uncertainty in Artificial Intelligence (UAI-17), pp. 152-161.
[4] Chechik G. & Tishby N. (2002) Extracting relevant structures with side information. Advances in Neural Information Processing Systems 15 (NIPS'02).
[5] Globerson, A., Chechik G. & Tishby N. (2003) Sufficient dimensionality reduction. Journal of Machine Learning Research, 3:1307-1331.
[6] Tishby, N., Pereira, F. C. & Bialek, W. (1999) The information bottleneck method. The 37th Annual Allerton Conference on Communication, Control, and Computing, pp. 368-379.
[7] Marx, Z., Dagan, I., Buhmann, J. M. & Shamir E. (2002) Coupled clustering: A method for detecting structural correspondence. Journal of Machine Learning Research, 3:747-780.
[8] Dagan, I., Marx, Z. & Shamir E. (2002) Cross-dataset clustering: Revealing corresponding themes across multiple corpora. Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pp. 15-21.
[9] Cover T. M. & Thomas J. A. (1991) Elements of Information Theory. John Wiley & Sons, Inc., New York, New York.
6 0.62178952 82 nips-2003-Geometric Clustering Using the Information Bottleneck Method
7 0.612113 79 nips-2003-Gene Expression Clustering with Functional Mixture Models
8 0.59860927 86 nips-2003-ICA-based Clustering of Genes from Microarray Expression Data
9 0.56823748 24 nips-2003-An Iterative Improvement Procedure for Hierarchical Clustering
10 0.56777602 152 nips-2003-Pairwise Clustering and Graphical Models
11 0.4901942 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms
12 0.48621583 12 nips-2003-A Model for Learning the Semantics of Pictures
13 0.4711318 181 nips-2003-Statistical Debugging of Sampled Programs
14 0.46218252 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
15 0.44140747 17 nips-2003-A Sampled Texture Prior for Image Super-Resolution
16 0.41868463 6 nips-2003-A Fast Multi-Resolution Method for Detection of Significant Spatial Disease Clusters
17 0.41575652 48 nips-2003-Convex Methods for Transduction
19 0.38400277 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
20 0.38271028 88 nips-2003-Image Reconstruction by Linear Programming
topicId topicWeight
[(0, 0.038), (5, 0.203), (11, 0.032), (29, 0.016), (30, 0.018), (35, 0.11), (53, 0.146), (69, 0.015), (71, 0.054), (76, 0.058), (85, 0.053), (91, 0.177), (99, 0.015)]
simIndex simValue paperId paperTitle
1 0.95128053 131 nips-2003-Modeling User Rating Profiles For Collaborative Filtering
Author: Benjamin M. Marlin
Abstract: In this paper we present a generative latent variable model for rating-based collaborative filtering called the User Rating Profile model (URP). The generative process which underlies URP is designed to produce complete user rating profiles, an assignment of one rating to each item for each user. Our model represents each user as a mixture of user attitudes, and the mixing proportions are distributed according to a Dirichlet random variable. The rating for each item is generated by selecting a user attitude for the item, and then selecting a rating according to the preference pattern associated with that attitude. URP is related to several models including a multinomial mixture model, the aspect model [7], and LDA [1], but has clear advantages over each.
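The generative process described in this abstract can be illustrated with a minimal sketch. It is only a reading of the abstract's wording, not the paper's implementation: the names alpha (Dirichlet parameters over attitudes) and beta (per-attitude, per-item rating probabilities) are hypothetical, and ratings are assumed to be the integers 1..V.

```python
import random


def sample_dirichlet(alpha, rng):
    # Draw from a Dirichlet distribution by normalizing Gamma variates.
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_user_profile(alpha, beta, rng):
    """Generate one complete rating profile (one rating per item).

    alpha: Dirichlet parameters, one per user attitude (hypothetical name).
    beta:  beta[z][y] is the list of rating probabilities for item y under
           attitude z (hypothetical name); ratings are assumed to be 1..V.
    """
    theta = sample_dirichlet(alpha, rng)  # mixing proportions over attitudes
    num_attitudes = len(alpha)
    num_items = len(beta[0])
    profile = []
    for y in range(num_items):
        # Select a user attitude for this item.
        z = rng.choices(range(num_attitudes), weights=theta)[0]
        # Select a rating according to that attitude's preference pattern.
        ratings = range(1, len(beta[z][y]) + 1)
        profile.append(rng.choices(ratings, weights=beta[z][y])[0])
    return profile
```

Calling sample_user_profile once per user, each time with a fresh draw of the mixing proportions, would give a full user-by-item rating table under these assumptions.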
same-paper 2 0.90471756 73 nips-2003-Feature Selection in Clustering Problems
Author: Volker Roth, Tilman Lange
Abstract: A novel approach to combining clustering and feature selection is presented. It implements a wrapper strategy for feature selection, in the sense that the features are directly selected by optimizing the discriminative power of the used partitioning algorithm. On the technical side, we present an efficient optimization algorithm with guaranteed local convergence property. The only free parameter of this method is selected by a resampling-based stability analysis. Experiments with real-world datasets demonstrate that our method is able to infer both meaningful partitions and meaningful subsets of features.
3 0.85384327 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling
Author: Ting-fan Wu, Chih-jen Lin, Ruby C. Weng
Abstract: Pairwise coupling is a popular multi-class classification method that combines together all pairwise comparisons for each pair of classes. This paper presents two approaches for obtaining class probabilities. Both methods can be reduced to linear systems and are easy to implement. We show conceptually and experimentally that the proposed approaches are more stable than two existing popular methods: voting and [3].
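Voting, one of the existing baselines named in this abstract, can be sketched as follows. This is a generic illustration of the voting rule and not either of the paper's two proposed approaches. The input r is a hypothetical matrix in which r[i][j] estimates the probability that class i is preferred over class j, and ties are arbitrarily awarded to the higher-indexed class.

```python
def vote_probabilities(r):
    """Turn pairwise comparisons into class scores by simple voting.

    r[i][j]: hypothetical pairwise estimate that class i beats class j
             (diagonal entries are ignored).
    Returns the fraction of pairwise contests won by each class.
    """
    k = len(r)
    wins = [0] * k
    for i in range(k):
        for j in range(i + 1, k):
            if r[i][j] > r[j][i]:
                wins[i] += 1
            else:
                wins[j] += 1  # ties go to class j in this sketch
    total = k * (k - 1) // 2  # number of pairwise contests
    return [w / total for w in wins]
```

The resulting fractions sum to one, but they are coarse: with k classes each score can take only a handful of distinct values, which is one reason finer probability estimates are of interest.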
4 0.77361274 107 nips-2003-Learning Spectral Clustering
Author: Francis R. Bach, Michael I. Jordan
Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive a new cost function for spectral clustering based on a measure of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing this cost function with respect to the partition leads to a new spectral clustering algorithm. Minimizing with respect to the similarity matrix leads to an algorithm for learning the similarity matrix. We develop a tractable approximation of our cost function that is based on the power method of computing eigenvectors.
5 0.76988977 79 nips-2003-Gene Expression Clustering with Functional Mixture Models
Author: Darya Chudova, Christopher Hart, Eric Mjolsness, Padhraic Smyth
Abstract: We propose a functional mixture model for simultaneous clustering and alignment of sets of curves measured on a discrete time grid. The model is specifically tailored to gene expression time course data. Each functional cluster center is a nonlinear combination of solutions of a simple linear differential equation that describes the change of individual mRNA levels when the synthesis and decay rates are constant. The mixture of continuous time parametric functional forms allows one to (a) account for the heterogeneity in the observed profiles, (b) align the profiles in time by estimating real-valued time shifts, (c) capture the synthesis and decay of mRNA in the course of an experiment, and (d) regularize noisy profiles by enforcing smoothness in the mean curves. We derive an EM algorithm for estimating the parameters of the model, and apply the proposed approach to the set of cycling genes in yeast. The experiments show consistent improvement in predictive power and within cluster variance compared to regular Gaussian mixtures.
6 0.76698953 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems
7 0.76107168 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms
8 0.76017427 81 nips-2003-Geometric Analysis of Constrained Curves
9 0.76000553 86 nips-2003-ICA-based Clustering of Genes from Microarray Expression Data
10 0.75592476 78 nips-2003-Gaussian Processes in Reinforcement Learning
11 0.75526863 68 nips-2003-Eye Movements for Reward Maximization
12 0.75236028 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
13 0.75207233 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning
14 0.75161946 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model
15 0.75118619 30 nips-2003-Approximability of Probability Distributions
16 0.75028348 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data
17 0.74980092 55 nips-2003-Distributed Optimization in Adaptive Networks
18 0.74908066 146 nips-2003-Online Learning of Non-stationary Sequences
19 0.74652326 113 nips-2003-Learning with Local and Global Consistency
20 0.74515295 93 nips-2003-Information Dynamics and Emergent Computation in Recurrent Circuits of Spiking Neurons