nips nips2003 nips2003-111 knowledge-graph by maker-knowledge-mining

111 nips-2003-Learning the k in k-means


Source: pdf

Author: Greg Hamerly, Charles Elkan

Abstract: When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. In this paper we present an improved algorithm for learning k while clustering. The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian. Two key advantages are that the hypothesis test does not limit the covariance of the data and does not compute a full covariance matrix. Additionally, G-means only requires one intuitive parameter, the standard statistical significance level α. We present results from experiments showing that the algorithm works well, and better than a recent method based on the BIC penalty for model complexity. In these experiments, we show that the BIC is ineffective as a scoring function, since it does not penalize strongly enough the model’s complexity. 1 Introduction and related work Clustering algorithms are useful tools for data mining, compression, probability density estimation, and many other important tasks. However, most clustering algorithms require the user to specify the number of clusters (called k), and it is not always clear what is the best value for k. Figure 1 shows examples where k has been improperly chosen. Choosing k is often an ad hoc decision based on prior knowledge, assumptions, and practical experience. Choosing k is made more difficult when the data has many dimensions, even when clusters are well-separated. Center-based clustering algorithms (in particular k-means and Gaussian expectationmaximization) usually assume that each cluster adheres to a unimodal distribution, such as Gaussian. With these methods, only one center should be used to model each subset of data that follows a unimodal distribution. If multiple centers are used to describe data drawn from one mode, the centers are a needlessly complex description of the data, and in fact the multiple centers capture the truth about the subset less well than one center. In this paper we present a simple algorithm called G-means that discovers an appropriate k using a statistical test for deciding whether to split a k-means center into two centers. We describe examples and present experimental results that show that the new algorithm 0.9 4 0.8 3 0.7 2 0.6 1 0.5 0 0.4 −1 0.3 −2 −3 0.2 0.1 −0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 −4 −3 −2 −1 0 1 2 3 Figure 1: Two clusterings where k was improperly chosen. Dark crosses are k-means centers. On the left, there are too few centers; five should be used. On the right, too many centers are used; one center is sufficient for representing the data. In general, one center should be used to represent one Gaussian cluster. is successful. This technique is useful and applicable for many clustering algorithms other than k-means, but here we consider only the k-means algorithm for simplicity. Several algorithms have been proposed previously to determine k automatically. Like our method, most previous methods are wrappers around k-means or some other clustering algorithm for fixed k. Wrapper methods use splitting and/or merging rules for centers to increase or decrease k as the algorithm proceeds. Pelleg and Moore [14] proposed a regularization framework for learning k, which they call X-means. The algorithm searches over many values of k and scores each clustering model using the so-called Bayesian Information Criterion [10]: BIC(C|X) = L(X|C) − p log n 2 where L(X|C) is the log-likelihood of the dataset X according to model C, p = k(d + 1) is the number of parameters in the model C with dimensionality d and k cluster centers, and n is the number of points in the dataset. X-means chooses the model with the best BIC score on the data. Aside from the BIC, other scoring functions are also available. Bischof et al. [1] use a minimum description length (MDL) framework, where the description length is a measure of how well the data are fit by the model. Their algorithm starts with a large value for k and removes centers (reduces k) whenever that choice reduces the description length. Between steps of reducing k, they use the k-means algorithm to optimize the model fit to the data. With hierarchical clustering algorithms, other methods may be employed to determine the best number of clusters. One is to build a merging tree (“dendrogram”) of the data based on a cluster distance metric, and search for areas of the tree that are stable with respect to inter- and intra-cluster distances [9, Section 5.1]. This method of estimating k is best applied with domain-specific knowledge and human intuition. 2 The Gaussian-means (G-means) algorithm The G-means algorithm starts with a small number of k-means centers, and grows the number of centers. Each iteration of the algorithm splits into two those centers whose data appear not to come from a Gaussian distribution. Between each round of splitting, we run k-means on the entire dataset and all the centers to refine the current solution. We can initialize with just k = 1, or we can choose some larger value of k if we have some prior knowledge about the range of k. G-means repeatedly makes decisions based on a statistical test for the data assigned to each center. If the data currently assigned to a k-means center appear to be Gaussian, then we want to represent that data with only one center. However, if the same data do not appear Algorithm 1 G-means(X, α) 1: Let C be the initial set of centers (usually C ← {¯}). x 2: C ← kmeans(C, X). 3: Let {xi |class(xi ) = j} be the set of datapoints assigned to center cj . 4: Use a statistical test to detect if each {xi |class(xi ) = j} follow a Gaussian distribution (at confidence level α). 5: If the data look Gaussian, keep cj . Otherwise replace cj with two centers. 6: Repeat from step 2 until no more centers are added. to be Gaussian, then we want to use multiple centers to model the data properly. The algorithm will run k-means multiple times (up to k times when finding k centers), so the time complexity is at most O(k) times that of k-means. The k-means algorithm implicitly assumes that the datapoints in each cluster are spherically distributed around the center. Less restrictively, the Gaussian expectation-maximization algorithm assumes that the datapoints in each cluster have a multidimensional Gaussian distribution with a covariance matrix that may or may not be fixed, or shared. The Gaussian distribution test that we present below are valid for either covariance matrix assumption. The test also accounts for the number of datapoints n tested by incorporating n in the calculation of the critical value of the test (see Equation 2). This prevents the G-means algorithm from making bad decisions about clusters with few datapoints. 2.1 Testing clusters for Gaussian fit To specify the G-means algorithm fully we need a test to detect whether the data assigned to a center are sampled from a Gaussian. The alternative hypotheses are • H0 : The data around the center are sampled from a Gaussian. • H1 : The data around the center are not sampled from a Gaussian. If we accept the null hypothesis H0 , then we believe that the one center is sufficient to model its data, and we should not split the cluster into two sub-clusters. If we reject H0 and accept H1 , then we want to split the cluster. The test we use is based on the Anderson-Darling statistic. This one-dimensional test has been shown empirically to be the most powerful normality test that is based on the empirical cumulative distribution function (ECDF). Given a list of values xi that have been converted to mean 0 and variance 1, let x(i) be the ith ordered value. Let zi = F (x(i) ), where F is the N (0, 1) cumulative distribution function. Then the statistic is A2 (Z) = − 1 n n (2i − 1) [log(zi ) + log(1 − zn+1−i )] − n (1) i=1 Stephens [17] showed that for the case where µ and σ are estimated from the data (as in clustering), we must correct the statistic according to A2 (Z) ∗ = A2 (Z)(1 + 4/n − 25/(n2 )) (2) Given a subset of data X in d dimensions that belongs to center c, the hypothesis test proceeds as follows: 1. Choose a significance level α for the test. 2. Initialize two centers, called “children” of c. See the text for good ways to do this. 3. Run k-means on these two centers in X. This can be run to completion, or to some early stopping point if desired. Let c1 , c2 be the child centers chosen by k-means. 4. Let v = c1 − c2 be a d-dimensional vector that connects the two centers. This is the direction that k-means believes to be important for clustering. Then project X onto v: xi = xi , v /||v||2 . X is a 1-dimensional representation of the data projected onto v. Transform X so that it has mean 0 and variance 1. 5. Let zi = F (x(i) ). If A2 (Z) is in the range of non-critical values at confidence ∗ level α, then accept H0 , keep the original center, and discard {c1 , c2 }. Otherwise, reject H0 and keep {c1 , c2 } in place of the original center. A primary contribution of this work is simplifying the test for Gaussian fit by projecting the data to one dimension where the test is simple to apply. The authors of [5] also use this approach for online dimensionality reduction during clustering. The one-dimensional representation of the data allows us to consider only the data along the direction that kmeans has found to be important for separating the data. This is related to the problem of projection pursuit [7], where here k-means searches for a direction in which the data appears non-Gaussian. We must choose the significance level of the test, α, which is the desired probability of making a Type I error (i.e. incorrectly rejecting H0 ). It is appropriate to use a Bonferroni adjustment to reduce the chance of making Type I errors over multiple tests. For example, if we want a 0.01 chance of making a Type I error in 100 tests, we should apply a Bonferroni adjustment to make each test use α = 0.01/100 = 0.0001. To find k final centers the G-means algorithm makes k statistical tests, so the Bonferroni correction does not need to be extreme. In our tests, we always use α = 0.0001. We consider two ways to initialize the two child centers. Both approaches initialize with c ± m, where c is a center and m is chosen. The first method chooses m as a random d-dimensional vector such that ||m|| is small compared to the distortion of the data. A second method finds the main principal component s of the data (having eigenvalue λ), and chooses m = s 2λ/π. This deterministic method places the two centers in their expected locations under H0 . The principal component calculations require O(nd2 + d3 ) time and O(d2 ) space, but since we only want the main principal component, we can use fast methods like the power method, which takes time that is at most linear in the ratio of the two largest eigenvalues [4]. In this paper we use principal-component-based splitting. 2.2 An example Figure 2 shows a run of the G-means algorithm on a synthetic dataset with two true clusters and 1000 points, using α = 0.0001. The critical value for the Anderson-Darling test is 1.8692 for this confidence level. Starting with one center, after one iteration of G-means, we have 2 centers and the A2 statistic is 38.103. This is much larger than the critical value, ∗ so we reject H0 and accept this split. On the next iteration, we split each new center and repeat the statistical test. The A2 values for the two splits are 0.386 and 0.496, both of ∗ which are well below the critical value. Therefore we accept H0 for both tests, and discard these splits. Thus G-means gives a final answer of k = 2. 2.3 Statistical power Figure 3 shows the power of the Anderson-Darling test, as compared to the BIC. Lower is better for both plots. We run 1000 tests for each data point plotted for both plots. In the left 14 14 14 13 13 13 12 12 12 11 11 11 10 10 10 9 9 9 8 8 8 7 7 7 6 6 6 5 5 4 4 0 2 4 6 8 10 12 5 4 0 2 4 6 8 10 12 0 2 4 6 8 10 12 Figure 2: An example of running G-means for three iterations on a 2-dimensional dataset with two true clusters and 1000 points. Starting with one center (left plot), G-means splits into two centers (middle). The test for normality is significant, so G-means rejects H0 and keeps the split. After splitting each center again (right), the test values are not significant, so G-means accepts H0 for both tests and does not accept these splits. The middle plot is the G-means answer. See the text for further details. 1 1 G-means X-means 0.8 P(Type II error) P(Type I error) 0.8 G-means X-means 0.6 0.4 0.2 0.6 0.4 0.2 0 0 0 30 60 90 120 150 number of datapoints 180 210 0 30 60 90 120 150 number of datapoints 180 210 Figure 3: A comparison of the power of the Anderson-Darling test versus the BIC. For the AD test we fix the significance level (α = 0.0001), while the BIC’s significance level depends on n. The left plot shows the probability of incorrectly splitting (Type I error) one true 2-d cluster that is 5% elliptical. The right plot shows the probability of incorrectly not splitting two true clusters separated by 5σ (Type II error). Both plots are functions of n. Both plots show that the BIC overfits (splits clusters) when n is small. plot, for each test we generate n datapoints from a single true Gaussian distribution, and then plot the frequency with which BIC and G-means will choose k = 2 rather than k = 1 (i.e. commit a Type I error). BIC tends to overfit by choosing too many centers when the data is not strictly spherical, while G-means does not. This is consistent with the tests of real-world data in the next section. While G-means commits more Type II errors when n is small, this prevents it from overfitting the data. The BIC can be considered a likelihood ratio test, but with a significance level that cannot be fixed. The significance level instead varies depending on n and ∆k (the change in the number of model parameters between two models). As n or ∆k decrease, the significance level increases (the BIC becomes weaker as a statistical test) [10]. Figure 3 shows this effect for varying n. In [11] the authors show that penalty-based methods require problemspecific tuning and don’t generalize as well as other methods, such as cross validation. 3 Experiments Table 1 shows the results from running G-means and X-means on many large synthetic. On synthetic datasets with spherically distributed clusters, G-means and X-means do equally Table 1: Results for many synthetic datasets. We report distortion relative to the optimum distortion for the correct clustering (closer to one is better), and time is reported relative to k-means run with the correct k. For BIC, larger values are better, but it is clear that finding the correct clustering does not always coincide with finding a larger BIC. Items with a star are where X-means always chose the largest number of centers we allowed. dataset synthetic k=5 synthetic k=20 synthetic k=80 synthetic k=5 synthetic k=20 synthetic k=80 synthetic k=5 synthetic k=20 synthetic k=80 d 2 k found 9.1± 9.9 18.1± 3.2 20.1± 0.6 70.5±11.6 80.0± 0.2 171.7±23.7 5.0± 0.0 *20.0± 0.0 20.0± 0.1 *80.0± 0.0 80.2± 0.5 229.2±36.8 5.0± 0.0 *20.0± 0.0 20.0± 0.0 *80.0± 0.0 80.0± 0.0 171.5±10.9 method G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means 2 2 8 8 8 32 32 32 BIC(×104 ) -0.19±2.70 0.70±0.93 0.21±0.18 14.83±3.50 1.84±0.12 40.16±6.59 -0.74±0.16 -2.28±0.20 -0.18±0.17 14.36±0.21 1.45±0.20 52.28±9.26 -3.36±0.21 -27.92±0.22 -2.73±0.22 -11.13±0.23 -1.10±0.16 11.78±2.74 distortion(× optimal) 0.89± 0.23 0.37± 0.12 0.99± 0.01 9.45±28.02 1.00± 0.01 48.49±70.04 1.00± 0.00 0.47± 0.03 0.99± 0.00 0.47± 0.01 0.99± 0.00 0.57± 0.06 1.00± 0.00 0.76± 0.00 1.00± 0.00 0.76± 0.01 1.00± 0.00 0.84± 0.01 7 7 6 6 5 5 4 4 3 3 2 2 1 time(× k-means) 13.2 2.8 2.1 1.2 2.2 1.8 4.6 11.0 2.6 4.0 2.9 6.5 4.4 29.9 2.3 21.2 2.8 53.3 1 0 0 2 4 6 8 10 12 0 0 2 4 6 8 10 12 Figure 4: 2-d synthetic dataset with 5 true clusters. On the left, G-means correctly chooses 5 centers and deals well with non-spherical data. On the right, the BIC causes X-means to overfit the data, choosing 20 unevenly distributed clusters. well at finding the correct k and maximizing the BIC statistic, so we don’t show these results here. Most real-world data is not spherical, however. The synthetic datasets used here each have 5000 datapoints in d = 2/8/32 dimensions. The true ks are 5, 20, and 80. For each synthetic dataset type, we generate 30 datasets with the true center means chosen uniformly randomly from the unit hypercube, and choosing σ so that no two clusters are closer than 3σ apart. Each cluster is also given a transformation to make it non-spherical, by multiplying the data by a randomly chosen scaling and rotation matrix. We run G-means starting with one center. We allow X-means to search between 2 and 4k centers (where here k is the true number of clusters). The G-means algorithm clearly does better at finding the correct k on non-spherical data. Its results are closer to the true distortions and the correct ks. The BIC statistic that X-means uses has been formulated to maximize the likelihood for spherically-distributed data. Thus it overestimates the number of true clusters in non-spherical data. This is especially evident when the number of points per cluster is small, as in datasets with 80 true clusters. 1 2 2 3 3 4 4 Digit 0 1 Digit 0 5 5 6 6 7 7 8 8 9 9 5 10 15 20 25 30 Cluster 10 20 30 40 50 60 Cluster Figure 5: NIST and Pendigits datasets: correspondence between each digit (row) and each cluster (column) found by G-means. G-means did not have the labels, yet it found meaningful clusters corresponding with the labels. Because of this overestimation, X-means often hits our limit of 4k centers. Figure 4 shows an example of overfitting on a dataset with 5 true clusters. X-means chooses k = 20 while G-means finds all 5 true cluster centers. Also of note is that X-means does not distribute centers evenly among clusters; some clusters receive one center, but others receive many. G-means runs faster than X-means for 8 and 32 dimensions, which we expect, since the kd-tree structures which make X-means fast in low dimensions take time exponential in d, making them slow for more than 8 to 12 dimensions. All our code is written in Matlab; X-means is written in C. 3.1 Discovering true clusters in labeled data We tested these algorithms on two real-world datasets for handwritten digit recognition: the NIST dataset [12] and the Pendigits dataset [2]. The goal is to cluster the data without knowledge of the labels and measure how well the clustering captures the true labels. Both datasets have 10 true classes (digits 0-9). NIST has 60000 training examples and 784 dimensions (28×28 pixels). We use 6000 randomly chosen examples and we reduce the dimension to 50 by random projection (following [3]). The Pendigits dataset has 7984 examples and 16 dimensions; we did not change the data in any way. We cluster each dataset with G-means and X-means, and measure performance by comparing the cluster labels Lc with the true labels Lt . We define the partition quality (PQ) as kt kc kt 2 2 pq = i=1 j=1 p(i, j) i=1 p(i) where kt is the true number of classes, and kc is the number of clusters found by the algorithm. This metric is maximized when Lc induces the same partition of the data as Lt ; in other words, when all points in each cluster have the same true label, and the estimated k is the true k. The p(i, j) term is the frequency-based probability that a datapoint will be labeled i by Lt and j by Lc . This quality is normalized by the sum of true probabilities, squared. This statistic is related to the Rand statistic for comparing partitions [8]. For the NIST dataset, G-means finds 31 clusters in 30 seconds with a PQ score of 0.177. X-means finds 715 clusters in 4149 seconds, and 369 of these clusters contain only one point, indicating an overestimation problem with the BIC. X-means receives a PQ score of 0.024. For the Pendigits dataset, G-means finds 69 clusters in 30 seconds, with a PQ score of 0.196; X-means finds 235 clusters in 287 seconds, with a PQ score of 0.057. Figure 5 shows Hinton diagrams of the G-means clusterings of both datasets, showing that G-means succeeds at identifying the true clusters concisely, without aid of the labels. The confusions between different digits in the NIST dataset (seen in the off-diagonal elements) are common for other researchers using more sophisticated techniques, see [3]. 4 Discussion and conclusions We have introduced the new G-means algorithm for learning k based on a statistical test for determining whether datapoints are a random sample from a Gaussian distribution with arbitrary dimension and covariance matrix. The splitting uses dimension reduction and a powerful test for Gaussian fitness. G-means uses this statistical test as a wrapper around k-means to discover the number of clusters automatically. The only parameter supplied to the algorithm is the significance level of the statistical test, which can easily be set in a standard way. The G-means algorithm takes linear time and space (plus the cost of the splitting heuristic and test) in the number of datapoints and dimension, since k-means is itself linear in time and space. Empirically, the G-means algorithm works well at finding the correct number of clusters and the locations of genuine cluster centers, and we have shown it works well in moderately high dimensions. Clustering in high dimensions has been an open problem for many years. Recent research has shown that it may be preferable to use dimensionality reduction techniques before clustering, and then use a low-dimensional clustering algorithm such as k-means, rather than clustering in the high dimension directly. In [3] the author shows that using a simple, inexpensive linear projection preserves many of the properties of data (such as cluster distances), while making it easier to find the clusters. Thus there is a need for good-quality, fast clustering algorithms for low-dimensional data. Our work is a step in this direction. Additionally, recent image segmentation algorithms such as normalized cut [16, 13] are based on eigenvector computations on distance matrices. These “spectral” clustering algorithms still use k-means as a post-processing step to find the actual segmentation and they require k to be specified. Thus we expect G-means will be useful in combination with spectral clustering. References [1] Horst Bischof, Aleˇ Leonardis, and Alexander Selb. MDL principle for robust vector quantisation. Pattern analysis and applications, 2:59–72, s 1999. [2] C.L. Blake and C.J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/∼mlearn/MLRepository.html. [3] Sanjoy Dasgupta. Experiments with random projection. In Uncertainty in Artificial Intelligence: Proceedings of the Sixteenth Conference (UAI-2000), pages 143–151, San Francisco, CA, 2000. Morgan Kaufmann Publishers. [4] Gianna M. Del Corso. Estimating an eigenvector by the power method with a random start. SIAM Journal on Matrix Analysis and Applications, 18(4):913–937, 1997. [5] Chris Ding, Xiaofeng He, Hongyuan Zha, and Horst Simon. Adaptive dimension reduction for clustering high dimensional data. In Proceedings of the 2nd IEEE International Conference on Data Mining, 2002. [6] Fredrik Farnstrom, James Lewis, and Charles Elkan. Scalability for clustering algorithms revisited. SIGKDD Explorations, 2(1):51–57, 2000. [7] Peter J. Huber. Projection pursuit. Annals of Statistics, 13(2):435–475, June 1985. [8] L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2:193–218, 1985. [9] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264–323, 1999. [10] Robert E. Kass and Larry Wasserman. A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90(431):928–934, 1995. [11] Michael J. Kearns, Yishay Mansour, Andrew Y. Ng, and Dana Ron. An experimental and theoretical comparison of model selection methods. In Computational Learing Theory (COLT), pages 21–30, 1995. [12] Yann LeCun, L´ on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the e IEEE, 86(11):2278–2324, 1998. [13] Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. Neural Information Processing Systems, 14, 2002. [14] Dan Pelleg and Andrew Moore. X-means: Extending K-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conf. on Machine Learning, pages 727–734. Morgan Kaufmann, San Francisco, CA, 2000. [15] Peter Sand and Andrew Moore. Repairing faulty mixture models using density estimation. In Proceedings of the 18th International Conf. on Machine Learning. Morgan Kaufmann, San Francisco, CA, 2001. [16] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000. [17] M. A. Stephens. EDF statistics for goodness of fit and some comparisons. American Statistical Association, 69(347):730–737, September 1974.

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Department of Computer Science and Engineering University of California, San Diego La Jolla, California 92093-0114 Abstract When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. [sent-3, score-0.499]

2 The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. [sent-5, score-0.28]

3 G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian. [sent-6, score-0.417]

4 Two key advantages are that the hypothesis test does not limit the covariance of the data and does not compute a full covariance matrix. [sent-7, score-0.272]

5 However, most clustering algorithms require the user to specify the number of clusters (called k), and it is not always clear what is the best value for k. [sent-12, score-0.461]

6 Choosing k is made more difficult when the data has many dimensions, even when clusters are well-separated. [sent-15, score-0.31]

7 Center-based clustering algorithms (in particular k-means and Gaussian expectationmaximization) usually assume that each cluster adheres to a unimodal distribution, such as Gaussian. [sent-16, score-0.415]

8 With these methods, only one center should be used to model each subset of data that follows a unimodal distribution. [sent-17, score-0.198]

9 If multiple centers are used to describe data drawn from one mode, the centers are a needlessly complex description of the data, and in fact the multiple centers capture the truth about the subset less well than one center. [sent-18, score-1.353]

10 In this paper we present a simple algorithm called G-means that discovers an appropriate k using a statistical test for deciding whether to split a k-means center into two centers. [sent-19, score-0.356]

11 On the right, too many centers are used; one center is sufficient for representing the data. [sent-42, score-0.56]

12 This technique is useful and applicable for many clustering algorithms other than k-means, but here we consider only the k-means algorithm for simplicity. [sent-45, score-0.222]

13 Like our method, most previous methods are wrappers around k-means or some other clustering algorithm for fixed k. [sent-47, score-0.253]

14 Wrapper methods use splitting and/or merging rules for centers to increase or decrease k as the algorithm proceeds. [sent-48, score-0.615]

15 Their algorithm starts with a large value for k and removes centers (reduces k) whenever that choice reduces the description length. [sent-55, score-0.474]

16 With hierarchical clustering algorithms, other methods may be employed to determine the best number of clusters. [sent-57, score-0.187]

17 One is to build a merging tree (“dendrogram”) of the data based on a cluster distance metric, and search for areas of the tree that are stable with respect to inter- and intra-cluster distances [9, Section 5. [sent-58, score-0.254]

18 Each iteration of the algorithm splits into two those centers whose data appear not to come from a Gaussian distribution. [sent-62, score-0.569]

19 Between each round of splitting, we run k-means on the entire dataset and all the centers to refine the current solution. [sent-63, score-0.611]

20 G-means repeatedly makes decisions based on a statistical test for the data assigned to each center. [sent-65, score-0.238]

21 If the data currently assigned to a k-means center appear to be Gaussian, then we want to represent that data with only one center. [sent-66, score-0.278]

22 However, if the same data do not appear Algorithm 1 G-means(X, α) 1: Let C be the initial set of centers (usually C ← {¯}). [sent-67, score-0.475]

23 3: Let {xi |class(xi ) = j} be the set of datapoints assigned to center cj . [sent-69, score-0.418]

24 4: Use a statistical test to detect if each {xi |class(xi ) = j} follow a Gaussian distribution (at confidence level α). [sent-70, score-0.216]

25 6: Repeat from step 2 until no more centers are added. [sent-73, score-0.439]

26 to be Gaussian, then we want to use multiple centers to model the data properly. [sent-74, score-0.513]

27 The k-means algorithm implicitly assumes that the datapoints in each cluster are spherically distributed around the center. [sent-76, score-0.508]

28 Less restrictively, the Gaussian expectation-maximization algorithm assumes that the datapoints in each cluster have a multidimensional Gaussian distribution with a covariance matrix that may or may not be fixed, or shared. [sent-77, score-0.459]

29 The Gaussian distribution test that we present below are valid for either covariance matrix assumption. [sent-78, score-0.141]

30 The test also accounts for the number of datapoints n tested by incorporating n in the calculation of the critical value of the test (see Equation 2). [sent-79, score-0.439]

31 This prevents the G-means algorithm from making bad decisions about clusters with few datapoints. [sent-80, score-0.382]

32 1 Testing clusters for Gaussian fit To specify the G-means algorithm fully we need a test to detect whether the data assigned to a center are sampled from a Gaussian. [sent-82, score-0.613]

33 The alternative hypotheses are • H0 : The data around the center are sampled from a Gaussian. [sent-83, score-0.188]

34 • H1 : The data around the center are not sampled from a Gaussian. [sent-84, score-0.188]

35 If we accept the null hypothesis H0 , then we believe that the one center is sufficient to model its data, and we should not split the cluster into two sub-clusters. [sent-85, score-0.507]

36 If we reject H0 and accept H1 , then we want to split the cluster. [sent-86, score-0.237]

37 This one-dimensional test has been shown empirically to be the most powerful normality test that is based on the empirical cumulative distribution function (ECDF). [sent-88, score-0.276]

38 Let c1 , c2 be the child centers chosen by k-means. [sent-99, score-0.472]

39 If A2 (Z) is in the range of non-critical values at confidence ∗ level α, then accept H0 , keep the original center, and discard {c1 , c2 }. [sent-108, score-0.234]

40 A primary contribution of this work is simplifying the test for Gaussian fit by projecting the data to one dimension where the test is simple to apply. [sent-110, score-0.283]

41 01 chance of making a Type I error in 100 tests, we should apply a Bonferroni adjustment to make each test use α = 0. [sent-119, score-0.203]

42 To find k final centers the G-means algorithm makes k statistical tests, so the Bonferroni correction does not need to be extreme. [sent-122, score-0.529]

43 Both approaches initialize with c ± m, where c is a center and m is chosen. [sent-126, score-0.166]

44 This deterministic method places the two centers in their expected locations under H0 . [sent-129, score-0.439]

45 2 An example Figure 2 shows a run of the G-means algorithm on a synthetic dataset with two true clusters and 1000 points, using α = 0. [sent-133, score-0.74]

46 The critical value for the Anderson-Darling test is 1. [sent-135, score-0.143]

47 Starting with one center, after one iteration of G-means, we have 2 centers and the A2 statistic is 38. [sent-137, score-0.576]

48 This is much larger than the critical value, ∗ so we reject H0 and accept this split. [sent-139, score-0.197]

49 On the next iteration, we split each new center and repeat the statistical test. [sent-140, score-0.221]

50 We run 1000 tests for each data point plotted for both plots. [sent-149, score-0.181]

51 Starting with one center (left plot), G-means splits into two centers (middle). [sent-151, score-0.619]

52 The test for normality is significant, so G-means rejects H0 and keeps the split. [sent-152, score-0.147]

53 After splitting each center again (right), the test values are not significant, so G-means accepts H0 for both tests and does not accept these splits. [sent-153, score-0.576]

54 2 0 0 0 30 60 90 120 150 number of datapoints 180 210 0 30 60 90 120 150 number of datapoints 180 210 Figure 3: A comparison of the power of the Anderson-Darling test versus the BIC. [sent-164, score-0.531]

55 For the AD test we fix the significance level (α = 0. [sent-165, score-0.161]

56 The left plot shows the probability of incorrectly splitting (Type I error) one true 2-d cluster that is 5% elliptical. [sent-167, score-0.474]

57 The right plot shows the probability of incorrectly not splitting two true clusters separated by 5σ (Type II error). [sent-168, score-0.561]

58 plot, for each test we generate n datapoints from a single true Gaussian distribution, and then plot the frequency with which BIC and G-means will choose k = 2 rather than k = 1 (i. [sent-171, score-0.414]

59 BIC tends to overfit by choosing too many centers when the data is not strictly spherical, while G-means does not. [sent-174, score-0.513]

60 On synthetic datasets with spherically distributed clusters, G-means and X-means do equally Table 1: Results for many synthetic datasets. [sent-183, score-0.506]

61 We report distortion relative to the optimum distortion for the correct clustering (closer to one is better), and time is reported relative to k-means run with the correct k. [sent-184, score-0.46]

62 For BIC, larger values are better, but it is clear that finding the correct clustering does not always coincide with finding a larger BIC. [sent-185, score-0.233]

63 Items with a star are where X-means always chose the largest number of centers we allowed. [sent-186, score-0.439]

64 dataset synthetic k=5 synthetic k=20 synthetic k=80 synthetic k=5 synthetic k=20 synthetic k=80 synthetic k=5 synthetic k=20 synthetic k=80 d 2 k found 9. [sent-187, score-1.769]

65 3 1 0 0 2 4 6 8 10 12 0 0 2 4 6 8 10 12 Figure 4: 2-d synthetic dataset with 5 true clusters. [sent-313, score-0.372]

66 On the left, G-means correctly chooses 5 centers and deals well with non-spherical data. [sent-314, score-0.497]

67 The synthetic datasets used here each have 5000 datapoints in d = 2/8/32 dimensions. [sent-318, score-0.459]

68 For each synthetic dataset type, we generate 30 datasets with the true center means chosen uniformly randomly from the unit hypercube, and choosing σ so that no two clusters are closer than 3σ apart. [sent-320, score-0.884]

69 Each cluster is also given a transformation to make it non-spherical, by multiplying the data by a randomly chosen scaling and rotation matrix. [sent-321, score-0.223]

70 We allow X-means to search between 2 and 4k centers (where here k is the true number of clusters). [sent-323, score-0.514]

71 Thus it overestimates the number of true clusters in non-spherical data. [sent-327, score-0.349]

72 This is especially evident when the number of points per cluster is small, as in datasets with 80 true clusters. [sent-328, score-0.341]

73 1 2 2 3 3 4 4 Digit 0 1 Digit 0 5 5 6 6 7 7 8 8 9 9 5 10 15 20 25 30 Cluster 10 20 30 40 50 60 Cluster Figure 5: NIST and Pendigits datasets: correspondence between each digit (row) and each cluster (column) found by G-means. [sent-329, score-0.259]

74 G-means did not have the labels, yet it found meaningful clusters corresponding with the labels. [sent-330, score-0.274]

75 Figure 4 shows an example of overfitting on a dataset with 5 true clusters. [sent-332, score-0.188]

76 X-means chooses k = 20 while G-means finds all 5 true cluster centers. [sent-333, score-0.32]

77 Also of note is that X-means does not distribute centers evenly among clusters; some clusters receive one center, but others receive many. [sent-334, score-0.713]

78 1 Discovering true clusters in labeled data We tested these algorithms on two real-world datasets for handwritten digit recognition: the NIST dataset [12] and the Pendigits dataset [2]. [sent-338, score-0.762]

79 The goal is to cluster the data without knowledge of the labels and measure how well the clustering captures the true labels. [sent-339, score-0.52]

80 The Pendigits dataset has 7984 examples and 16 dimensions; we did not change the data in any way. [sent-343, score-0.149]

81 We cluster each dataset with G-means and X-means, and measure performance by comparing the cluster labels Lc with the true labels Lt . [sent-344, score-0.632]

82 We define the partition quality (PQ) as kt kc kt 2 2 pq = i=1 j=1 p(i, j) i=1 p(i) where kt is the true number of classes, and kc is the number of clusters found by the algorithm. [sent-345, score-0.783]

83 This metric is maximized when Lc induces the same partition of the data as Lt ; in other words, when all points in each cluster have the same true label, and the estimated k is the true k. [sent-346, score-0.373]

84 This statistic is related to the Rand statistic for comparing partitions [8]. [sent-349, score-0.274]

85 For the NIST dataset, G-means finds 31 clusters in 30 seconds with a PQ score of 0. [sent-350, score-0.374]

86 X-means finds 715 clusters in 4149 seconds, and 369 of these clusters contain only one point, indicating an overestimation problem with the BIC. [sent-352, score-0.595]

87 For the Pendigits dataset, G-means finds 69 clusters in 30 seconds, with a PQ score of 0. [sent-355, score-0.329]

88 196; X-means finds 235 clusters in 287 seconds, with a PQ score of 0. [sent-356, score-0.329]

89 Figure 5 shows Hinton diagrams of the G-means clusterings of both datasets, showing that G-means succeeds at identifying the true clusters concisely, without aid of the labels. [sent-358, score-0.393]

90 The confusions between different digits in the NIST dataset (seen in the off-diagonal elements) are common for other researchers using more sophisticated techniques, see [3]. [sent-359, score-0.143]

91 4 Discussion and conclusions We have introduced the new G-means algorithm for learning k based on a statistical test for determining whether datapoints are a random sample from a Gaussian distribution with arbitrary dimension and covariance matrix. [sent-360, score-0.474]

92 The splitting uses dimension reduction and a powerful test for Gaussian fitness. [sent-361, score-0.289]

93 G-means uses this statistical test as a wrapper around k-means to discover the number of clusters automatically. [sent-362, score-0.507]

94 The only parameter supplied to the algorithm is the significance level of the statistical test, which can easily be set in a standard way. [sent-363, score-0.151]

95 The G-means algorithm takes linear time and space (plus the cost of the splitting heuristic and test) in the number of datapoints and dimension, since k-means is itself linear in time and space. [sent-364, score-0.341]

96 Empirically, the G-means algorithm works well at finding the correct number of clusters and the locations of genuine cluster centers, and we have shown it works well in moderately high dimensions. [sent-365, score-0.542]

97 Recent research has shown that it may be preferable to use dimensionality reduction techniques before clustering, and then use a low-dimensional clustering algorithm such as k-means, rather than clustering in the high dimension directly. [sent-367, score-0.488]

98 In [3] the author shows that using a simple, inexpensive linear projection preserves many of the properties of data (such as cluster distances), while making it easier to find the clusters. [sent-368, score-0.296]

99 Thus there is a need for good-quality, fast clustering algorithms for low-dimensional data. [sent-369, score-0.187]

100 These “spectral” clustering algorithms still use k-means as a post-processing step to find the actual segmentation and they require k to be specified. [sent-372, score-0.217]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('centers', 0.439), ('bic', 0.35), ('clusters', 0.274), ('datapoints', 0.196), ('clustering', 0.187), ('cluster', 0.187), ('synthetic', 0.184), ('pq', 0.178), ('cance', 0.146), ('statistic', 0.137), ('center', 0.121), ('pendigits', 0.119), ('dataset', 0.113), ('splitting', 0.11), ('nist', 0.109), ('test', 0.1), ('accept', 0.1), ('bonferroni', 0.089), ('tests', 0.086), ('datasets', 0.079), ('true', 0.075), ('digit', 0.072), ('lc', 0.07), ('dimensions', 0.07), ('type', 0.062), ('lt', 0.062), ('level', 0.061), ('andrew', 0.061), ('distortion', 0.061), ('splits', 0.059), ('accepts', 0.059), ('bischof', 0.059), ('horst', 0.059), ('improperly', 0.059), ('spherically', 0.059), ('run', 0.059), ('incorrectly', 0.059), ('chooses', 0.058), ('kt', 0.056), ('statistical', 0.055), ('score', 0.055), ('cj', 0.054), ('reject', 0.054), ('hypothesis', 0.054), ('pelleg', 0.051), ('gaussian', 0.05), ('nds', 0.048), ('mdl', 0.047), ('overestimation', 0.047), ('kmeans', 0.047), ('normality', 0.047), ('wrapper', 0.047), ('assigned', 0.047), ('dimension', 0.047), ('san', 0.046), ('correct', 0.046), ('seconds', 0.045), ('split', 0.045), ('initialize', 0.045), ('kc', 0.044), ('clusterings', 0.044), ('francisco', 0.043), ('critical', 0.043), ('plot', 0.043), ('zi', 0.042), ('covariance', 0.041), ('unimodal', 0.041), ('charles', 0.039), ('ad', 0.039), ('power', 0.039), ('want', 0.038), ('choosing', 0.038), ('making', 0.038), ('spherical', 0.037), ('discard', 0.037), ('kaufmann', 0.037), ('data', 0.036), ('keep', 0.036), ('dence', 0.036), ('adjustment', 0.036), ('signi', 0.036), ('algorithm', 0.035), ('projection', 0.035), ('scoring', 0.035), ('labels', 0.035), ('prevents', 0.035), ('child', 0.033), ('morgan', 0.033), ('nding', 0.032), ('reduction', 0.032), ('merging', 0.031), ('spectral', 0.031), ('around', 0.031), ('peter', 0.03), ('xi', 0.03), ('digits', 0.03), ('segmentation', 0.03), ('chance', 0.029), ('cumulative', 0.029), ('eigenvector', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 111 nips-2003-Learning the k in k-means

Author: Greg Hamerly, Charles Elkan

Abstract: When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. In this paper we present an improved algorithm for learning k while clustering. The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian. Two key advantages are that the hypothesis test does not limit the covariance of the data and does not compute a full covariance matrix. Additionally, G-means only requires one intuitive parameter, the standard statistical significance level α. We present results from experiments showing that the algorithm works well, and better than a recent method based on the BIC penalty for model complexity. In these experiments, we show that the BIC is ineffective as a scoring function, since it does not penalize strongly enough the model’s complexity. 1 Introduction and related work Clustering algorithms are useful tools for data mining, compression, probability density estimation, and many other important tasks. However, most clustering algorithms require the user to specify the number of clusters (called k), and it is not always clear what is the best value for k. Figure 1 shows examples where k has been improperly chosen. Choosing k is often an ad hoc decision based on prior knowledge, assumptions, and practical experience. Choosing k is made more difficult when the data has many dimensions, even when clusters are well-separated. Center-based clustering algorithms (in particular k-means and Gaussian expectationmaximization) usually assume that each cluster adheres to a unimodal distribution, such as Gaussian. With these methods, only one center should be used to model each subset of data that follows a unimodal distribution. If multiple centers are used to describe data drawn from one mode, the centers are a needlessly complex description of the data, and in fact the multiple centers capture the truth about the subset less well than one center. In this paper we present a simple algorithm called G-means that discovers an appropriate k using a statistical test for deciding whether to split a k-means center into two centers. We describe examples and present experimental results that show that the new algorithm 0.9 4 0.8 3 0.7 2 0.6 1 0.5 0 0.4 −1 0.3 −2 −3 0.2 0.1 −0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 −4 −3 −2 −1 0 1 2 3 Figure 1: Two clusterings where k was improperly chosen. Dark crosses are k-means centers. On the left, there are too few centers; five should be used. On the right, too many centers are used; one center is sufficient for representing the data. In general, one center should be used to represent one Gaussian cluster. is successful. This technique is useful and applicable for many clustering algorithms other than k-means, but here we consider only the k-means algorithm for simplicity. Several algorithms have been proposed previously to determine k automatically. Like our method, most previous methods are wrappers around k-means or some other clustering algorithm for fixed k. Wrapper methods use splitting and/or merging rules for centers to increase or decrease k as the algorithm proceeds. Pelleg and Moore [14] proposed a regularization framework for learning k, which they call X-means. The algorithm searches over many values of k and scores each clustering model using the so-called Bayesian Information Criterion [10]: BIC(C|X) = L(X|C) − p log n 2 where L(X|C) is the log-likelihood of the dataset X according to model C, p = k(d + 1) is the number of parameters in the model C with dimensionality d and k cluster centers, and n is the number of points in the dataset. X-means chooses the model with the best BIC score on the data. Aside from the BIC, other scoring functions are also available. Bischof et al. [1] use a minimum description length (MDL) framework, where the description length is a measure of how well the data are fit by the model. Their algorithm starts with a large value for k and removes centers (reduces k) whenever that choice reduces the description length. Between steps of reducing k, they use the k-means algorithm to optimize the model fit to the data. With hierarchical clustering algorithms, other methods may be employed to determine the best number of clusters. One is to build a merging tree (“dendrogram”) of the data based on a cluster distance metric, and search for areas of the tree that are stable with respect to inter- and intra-cluster distances [9, Section 5.1]. This method of estimating k is best applied with domain-specific knowledge and human intuition. 2 The Gaussian-means (G-means) algorithm The G-means algorithm starts with a small number of k-means centers, and grows the number of centers. Each iteration of the algorithm splits into two those centers whose data appear not to come from a Gaussian distribution. Between each round of splitting, we run k-means on the entire dataset and all the centers to refine the current solution. We can initialize with just k = 1, or we can choose some larger value of k if we have some prior knowledge about the range of k. G-means repeatedly makes decisions based on a statistical test for the data assigned to each center. If the data currently assigned to a k-means center appear to be Gaussian, then we want to represent that data with only one center. However, if the same data do not appear Algorithm 1 G-means(X, α) 1: Let C be the initial set of centers (usually C ← {¯}). x 2: C ← kmeans(C, X). 3: Let {xi |class(xi ) = j} be the set of datapoints assigned to center cj . 4: Use a statistical test to detect if each {xi |class(xi ) = j} follow a Gaussian distribution (at confidence level α). 5: If the data look Gaussian, keep cj . Otherwise replace cj with two centers. 6: Repeat from step 2 until no more centers are added. to be Gaussian, then we want to use multiple centers to model the data properly. The algorithm will run k-means multiple times (up to k times when finding k centers), so the time complexity is at most O(k) times that of k-means. The k-means algorithm implicitly assumes that the datapoints in each cluster are spherically distributed around the center. Less restrictively, the Gaussian expectation-maximization algorithm assumes that the datapoints in each cluster have a multidimensional Gaussian distribution with a covariance matrix that may or may not be fixed, or shared. The Gaussian distribution test that we present below are valid for either covariance matrix assumption. The test also accounts for the number of datapoints n tested by incorporating n in the calculation of the critical value of the test (see Equation 2). This prevents the G-means algorithm from making bad decisions about clusters with few datapoints. 2.1 Testing clusters for Gaussian fit To specify the G-means algorithm fully we need a test to detect whether the data assigned to a center are sampled from a Gaussian. The alternative hypotheses are • H0 : The data around the center are sampled from a Gaussian. • H1 : The data around the center are not sampled from a Gaussian. If we accept the null hypothesis H0 , then we believe that the one center is sufficient to model its data, and we should not split the cluster into two sub-clusters. If we reject H0 and accept H1 , then we want to split the cluster. The test we use is based on the Anderson-Darling statistic. This one-dimensional test has been shown empirically to be the most powerful normality test that is based on the empirical cumulative distribution function (ECDF). Given a list of values xi that have been converted to mean 0 and variance 1, let x(i) be the ith ordered value. Let zi = F (x(i) ), where F is the N (0, 1) cumulative distribution function. Then the statistic is A2 (Z) = − 1 n n (2i − 1) [log(zi ) + log(1 − zn+1−i )] − n (1) i=1 Stephens [17] showed that for the case where µ and σ are estimated from the data (as in clustering), we must correct the statistic according to A2 (Z) ∗ = A2 (Z)(1 + 4/n − 25/(n2 )) (2) Given a subset of data X in d dimensions that belongs to center c, the hypothesis test proceeds as follows: 1. Choose a significance level α for the test. 2. Initialize two centers, called “children” of c. See the text for good ways to do this. 3. Run k-means on these two centers in X. This can be run to completion, or to some early stopping point if desired. Let c1 , c2 be the child centers chosen by k-means. 4. Let v = c1 − c2 be a d-dimensional vector that connects the two centers. This is the direction that k-means believes to be important for clustering. Then project X onto v: xi = xi , v /||v||2 . X is a 1-dimensional representation of the data projected onto v. Transform X so that it has mean 0 and variance 1. 5. Let zi = F (x(i) ). If A2 (Z) is in the range of non-critical values at confidence ∗ level α, then accept H0 , keep the original center, and discard {c1 , c2 }. Otherwise, reject H0 and keep {c1 , c2 } in place of the original center. A primary contribution of this work is simplifying the test for Gaussian fit by projecting the data to one dimension where the test is simple to apply. The authors of [5] also use this approach for online dimensionality reduction during clustering. The one-dimensional representation of the data allows us to consider only the data along the direction that kmeans has found to be important for separating the data. This is related to the problem of projection pursuit [7], where here k-means searches for a direction in which the data appears non-Gaussian. We must choose the significance level of the test, α, which is the desired probability of making a Type I error (i.e. incorrectly rejecting H0 ). It is appropriate to use a Bonferroni adjustment to reduce the chance of making Type I errors over multiple tests. For example, if we want a 0.01 chance of making a Type I error in 100 tests, we should apply a Bonferroni adjustment to make each test use α = 0.01/100 = 0.0001. To find k final centers the G-means algorithm makes k statistical tests, so the Bonferroni correction does not need to be extreme. In our tests, we always use α = 0.0001. We consider two ways to initialize the two child centers. Both approaches initialize with c ± m, where c is a center and m is chosen. The first method chooses m as a random d-dimensional vector such that ||m|| is small compared to the distortion of the data. A second method finds the main principal component s of the data (having eigenvalue λ), and chooses m = s 2λ/π. This deterministic method places the two centers in their expected locations under H0 . The principal component calculations require O(nd2 + d3 ) time and O(d2 ) space, but since we only want the main principal component, we can use fast methods like the power method, which takes time that is at most linear in the ratio of the two largest eigenvalues [4]. In this paper we use principal-component-based splitting. 2.2 An example Figure 2 shows a run of the G-means algorithm on a synthetic dataset with two true clusters and 1000 points, using α = 0.0001. The critical value for the Anderson-Darling test is 1.8692 for this confidence level. Starting with one center, after one iteration of G-means, we have 2 centers and the A2 statistic is 38.103. This is much larger than the critical value, ∗ so we reject H0 and accept this split. On the next iteration, we split each new center and repeat the statistical test. The A2 values for the two splits are 0.386 and 0.496, both of ∗ which are well below the critical value. Therefore we accept H0 for both tests, and discard these splits. Thus G-means gives a final answer of k = 2. 2.3 Statistical power Figure 3 shows the power of the Anderson-Darling test, as compared to the BIC. Lower is better for both plots. We run 1000 tests for each data point plotted for both plots. In the left 14 14 14 13 13 13 12 12 12 11 11 11 10 10 10 9 9 9 8 8 8 7 7 7 6 6 6 5 5 4 4 0 2 4 6 8 10 12 5 4 0 2 4 6 8 10 12 0 2 4 6 8 10 12 Figure 2: An example of running G-means for three iterations on a 2-dimensional dataset with two true clusters and 1000 points. Starting with one center (left plot), G-means splits into two centers (middle). The test for normality is significant, so G-means rejects H0 and keeps the split. After splitting each center again (right), the test values are not significant, so G-means accepts H0 for both tests and does not accept these splits. The middle plot is the G-means answer. See the text for further details. 1 1 G-means X-means 0.8 P(Type II error) P(Type I error) 0.8 G-means X-means 0.6 0.4 0.2 0.6 0.4 0.2 0 0 0 30 60 90 120 150 number of datapoints 180 210 0 30 60 90 120 150 number of datapoints 180 210 Figure 3: A comparison of the power of the Anderson-Darling test versus the BIC. For the AD test we fix the significance level (α = 0.0001), while the BIC’s significance level depends on n. The left plot shows the probability of incorrectly splitting (Type I error) one true 2-d cluster that is 5% elliptical. The right plot shows the probability of incorrectly not splitting two true clusters separated by 5σ (Type II error). Both plots are functions of n. Both plots show that the BIC overfits (splits clusters) when n is small. plot, for each test we generate n datapoints from a single true Gaussian distribution, and then plot the frequency with which BIC and G-means will choose k = 2 rather than k = 1 (i.e. commit a Type I error). BIC tends to overfit by choosing too many centers when the data is not strictly spherical, while G-means does not. This is consistent with the tests of real-world data in the next section. While G-means commits more Type II errors when n is small, this prevents it from overfitting the data. The BIC can be considered a likelihood ratio test, but with a significance level that cannot be fixed. The significance level instead varies depending on n and ∆k (the change in the number of model parameters between two models). As n or ∆k decrease, the significance level increases (the BIC becomes weaker as a statistical test) [10]. Figure 3 shows this effect for varying n. In [11] the authors show that penalty-based methods require problemspecific tuning and don’t generalize as well as other methods, such as cross validation. 3 Experiments Table 1 shows the results from running G-means and X-means on many large synthetic. On synthetic datasets with spherically distributed clusters, G-means and X-means do equally Table 1: Results for many synthetic datasets. We report distortion relative to the optimum distortion for the correct clustering (closer to one is better), and time is reported relative to k-means run with the correct k. For BIC, larger values are better, but it is clear that finding the correct clustering does not always coincide with finding a larger BIC. Items with a star are where X-means always chose the largest number of centers we allowed. dataset synthetic k=5 synthetic k=20 synthetic k=80 synthetic k=5 synthetic k=20 synthetic k=80 synthetic k=5 synthetic k=20 synthetic k=80 d 2 k found 9.1± 9.9 18.1± 3.2 20.1± 0.6 70.5±11.6 80.0± 0.2 171.7±23.7 5.0± 0.0 *20.0± 0.0 20.0± 0.1 *80.0± 0.0 80.2± 0.5 229.2±36.8 5.0± 0.0 *20.0± 0.0 20.0± 0.0 *80.0± 0.0 80.0± 0.0 171.5±10.9 method G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means 2 2 8 8 8 32 32 32 BIC(×104 ) -0.19±2.70 0.70±0.93 0.21±0.18 14.83±3.50 1.84±0.12 40.16±6.59 -0.74±0.16 -2.28±0.20 -0.18±0.17 14.36±0.21 1.45±0.20 52.28±9.26 -3.36±0.21 -27.92±0.22 -2.73±0.22 -11.13±0.23 -1.10±0.16 11.78±2.74 distortion(× optimal) 0.89± 0.23 0.37± 0.12 0.99± 0.01 9.45±28.02 1.00± 0.01 48.49±70.04 1.00± 0.00 0.47± 0.03 0.99± 0.00 0.47± 0.01 0.99± 0.00 0.57± 0.06 1.00± 0.00 0.76± 0.00 1.00± 0.00 0.76± 0.01 1.00± 0.00 0.84± 0.01 7 7 6 6 5 5 4 4 3 3 2 2 1 time(× k-means) 13.2 2.8 2.1 1.2 2.2 1.8 4.6 11.0 2.6 4.0 2.9 6.5 4.4 29.9 2.3 21.2 2.8 53.3 1 0 0 2 4 6 8 10 12 0 0 2 4 6 8 10 12 Figure 4: 2-d synthetic dataset with 5 true clusters. On the left, G-means correctly chooses 5 centers and deals well with non-spherical data. On the right, the BIC causes X-means to overfit the data, choosing 20 unevenly distributed clusters. well at finding the correct k and maximizing the BIC statistic, so we don’t show these results here. Most real-world data is not spherical, however. The synthetic datasets used here each have 5000 datapoints in d = 2/8/32 dimensions. The true ks are 5, 20, and 80. For each synthetic dataset type, we generate 30 datasets with the true center means chosen uniformly randomly from the unit hypercube, and choosing σ so that no two clusters are closer than 3σ apart. Each cluster is also given a transformation to make it non-spherical, by multiplying the data by a randomly chosen scaling and rotation matrix. We run G-means starting with one center. We allow X-means to search between 2 and 4k centers (where here k is the true number of clusters). The G-means algorithm clearly does better at finding the correct k on non-spherical data. Its results are closer to the true distortions and the correct ks. The BIC statistic that X-means uses has been formulated to maximize the likelihood for spherically-distributed data. Thus it overestimates the number of true clusters in non-spherical data. This is especially evident when the number of points per cluster is small, as in datasets with 80 true clusters. 1 2 2 3 3 4 4 Digit 0 1 Digit 0 5 5 6 6 7 7 8 8 9 9 5 10 15 20 25 30 Cluster 10 20 30 40 50 60 Cluster Figure 5: NIST and Pendigits datasets: correspondence between each digit (row) and each cluster (column) found by G-means. G-means did not have the labels, yet it found meaningful clusters corresponding with the labels. Because of this overestimation, X-means often hits our limit of 4k centers. Figure 4 shows an example of overfitting on a dataset with 5 true clusters. X-means chooses k = 20 while G-means finds all 5 true cluster centers. Also of note is that X-means does not distribute centers evenly among clusters; some clusters receive one center, but others receive many. G-means runs faster than X-means for 8 and 32 dimensions, which we expect, since the kd-tree structures which make X-means fast in low dimensions take time exponential in d, making them slow for more than 8 to 12 dimensions. All our code is written in Matlab; X-means is written in C. 3.1 Discovering true clusters in labeled data We tested these algorithms on two real-world datasets for handwritten digit recognition: the NIST dataset [12] and the Pendigits dataset [2]. The goal is to cluster the data without knowledge of the labels and measure how well the clustering captures the true labels. Both datasets have 10 true classes (digits 0-9). NIST has 60000 training examples and 784 dimensions (28×28 pixels). We use 6000 randomly chosen examples and we reduce the dimension to 50 by random projection (following [3]). The Pendigits dataset has 7984 examples and 16 dimensions; we did not change the data in any way. We cluster each dataset with G-means and X-means, and measure performance by comparing the cluster labels Lc with the true labels Lt . We define the partition quality (PQ) as kt kc kt 2 2 pq = i=1 j=1 p(i, j) i=1 p(i) where kt is the true number of classes, and kc is the number of clusters found by the algorithm. This metric is maximized when Lc induces the same partition of the data as Lt ; in other words, when all points in each cluster have the same true label, and the estimated k is the true k. The p(i, j) term is the frequency-based probability that a datapoint will be labeled i by Lt and j by Lc . This quality is normalized by the sum of true probabilities, squared. This statistic is related to the Rand statistic for comparing partitions [8]. For the NIST dataset, G-means finds 31 clusters in 30 seconds with a PQ score of 0.177. X-means finds 715 clusters in 4149 seconds, and 369 of these clusters contain only one point, indicating an overestimation problem with the BIC. X-means receives a PQ score of 0.024. For the Pendigits dataset, G-means finds 69 clusters in 30 seconds, with a PQ score of 0.196; X-means finds 235 clusters in 287 seconds, with a PQ score of 0.057. Figure 5 shows Hinton diagrams of the G-means clusterings of both datasets, showing that G-means succeeds at identifying the true clusters concisely, without aid of the labels. The confusions between different digits in the NIST dataset (seen in the off-diagonal elements) are common for other researchers using more sophisticated techniques, see [3]. 4 Discussion and conclusions We have introduced the new G-means algorithm for learning k based on a statistical test for determining whether datapoints are a random sample from a Gaussian distribution with arbitrary dimension and covariance matrix. The splitting uses dimension reduction and a powerful test for Gaussian fitness. G-means uses this statistical test as a wrapper around k-means to discover the number of clusters automatically. The only parameter supplied to the algorithm is the significance level of the statistical test, which can easily be set in a standard way. The G-means algorithm takes linear time and space (plus the cost of the splitting heuristic and test) in the number of datapoints and dimension, since k-means is itself linear in time and space. Empirically, the G-means algorithm works well at finding the correct number of clusters and the locations of genuine cluster centers, and we have shown it works well in moderately high dimensions. Clustering in high dimensions has been an open problem for many years. Recent research has shown that it may be preferable to use dimensionality reduction techniques before clustering, and then use a low-dimensional clustering algorithm such as k-means, rather than clustering in the high dimension directly. In [3] the author shows that using a simple, inexpensive linear projection preserves many of the properties of data (such as cluster distances), while making it easier to find the clusters. Thus there is a need for good-quality, fast clustering algorithms for low-dimensional data. Our work is a step in this direction. Additionally, recent image segmentation algorithms such as normalized cut [16, 13] are based on eigenvector computations on distance matrices. These “spectral” clustering algorithms still use k-means as a post-processing step to find the actual segmentation and they require k to be specified. Thus we expect G-means will be useful in combination with spectral clustering. References [1] Horst Bischof, Aleˇ Leonardis, and Alexander Selb. MDL principle for robust vector quantisation. Pattern analysis and applications, 2:59–72, s 1999. [2] C.L. Blake and C.J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/∼mlearn/MLRepository.html. [3] Sanjoy Dasgupta. Experiments with random projection. In Uncertainty in Artificial Intelligence: Proceedings of the Sixteenth Conference (UAI-2000), pages 143–151, San Francisco, CA, 2000. Morgan Kaufmann Publishers. [4] Gianna M. Del Corso. Estimating an eigenvector by the power method with a random start. SIAM Journal on Matrix Analysis and Applications, 18(4):913–937, 1997. [5] Chris Ding, Xiaofeng He, Hongyuan Zha, and Horst Simon. Adaptive dimension reduction for clustering high dimensional data. In Proceedings of the 2nd IEEE International Conference on Data Mining, 2002. [6] Fredrik Farnstrom, James Lewis, and Charles Elkan. Scalability for clustering algorithms revisited. SIGKDD Explorations, 2(1):51–57, 2000. [7] Peter J. Huber. Projection pursuit. Annals of Statistics, 13(2):435–475, June 1985. [8] L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2:193–218, 1985. [9] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264–323, 1999. [10] Robert E. Kass and Larry Wasserman. A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90(431):928–934, 1995. [11] Michael J. Kearns, Yishay Mansour, Andrew Y. Ng, and Dana Ron. An experimental and theoretical comparison of model selection methods. In Computational Learing Theory (COLT), pages 21–30, 1995. [12] Yann LeCun, L´ on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the e IEEE, 86(11):2278–2324, 1998. [13] Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. Neural Information Processing Systems, 14, 2002. [14] Dan Pelleg and Andrew Moore. X-means: Extending K-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conf. on Machine Learning, pages 727–734. Morgan Kaufmann, San Francisco, CA, 2000. [15] Peter Sand and Andrew Moore. Repairing faulty mixture models using density estimation. In Proceedings of the 18th International Conf. on Machine Learning. Morgan Kaufmann, San Francisco, CA, 2001. [16] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000. [17] M. A. Stephens. EDF statistics for goodness of fit and some comparisons. American Statistical Association, 69(347):730–737, September 1974.

2 0.22968702 82 nips-2003-Geometric Clustering Using the Information Bottleneck Method

Author: Susanne Still, William Bialek, Léon Bottou

Abstract: We argue that K–means and deterministic annealing algorithms for geometric clustering can be derived from the more general Information Bottleneck approach. If we cluster the identities of data points to preserve information about their location, the set of optimal solutions is massively degenerate. But if we treat the equations that define the optimal solution as an iterative algorithm, then a set of “smooth” initial conditions selects solutions with the desired geometrical properties. In addition to conceptual unification, we argue that this approach can be more efficient and robust than classic algorithms. 1

3 0.18985045 73 nips-2003-Feature Selection in Clustering Problems

Author: Volker Roth, Tilman Lange

Abstract: A novel approach to combining clustering and feature selection is presented. It implements a wrapper strategy for feature selection, in the sense that the features are directly selected by optimizing the discriminative power of the used partitioning algorithm. On the technical side, we present an efficient optimization algorithm with guaranteed local convergence property. The only free parameter of this method is selected by a resampling-based stability analysis. Experiments with real-world datasets demonstrate that our method is able to infer both meaningful partitions and meaningful subsets of features. 1

4 0.1720601 46 nips-2003-Clustering with the Connectivity Kernel

Author: Bernd Fischer, Volker Roth, Joachim M. Buhmann

Abstract: Clustering aims at extracting hidden structure in dataset. While the problem of finding compact clusters has been widely studied in the literature, extracting arbitrarily formed elongated structures is considered a much harder problem. In this paper we present a novel clustering algorithm which tackles the problem by a two step procedure: first the data are transformed in such a way that elongated structures become compact ones. In a second step, these new objects are clustered by optimizing a compactness-based criterion. The advantages of the method over related approaches are threefold: (i) robustness properties of compactness-based criteria naturally transfer to the problem of extracting elongated structures, leading to a model which is highly robust against outlier objects; (ii) the transformed distances induce a Mercer kernel which allows us to formulate a polynomial approximation scheme to the generally N Phard clustering problem; (iii) the new method does not contain free kernel parameters in contrast to methods like spectral clustering or mean-shift clustering. 1

5 0.16716361 87 nips-2003-Identifying Structure across Pre-partitioned Data

Author: Zvika Marx, Ido Dagan, Eli Shamir

Abstract: We propose an information-theoretic clustering approach that incorporates a pre-known partition of the data, aiming to identify common clusters that cut across the given partition. In the standard clustering setting the formation of clusters is guided by a single source of feature information. The newly utilized pre-partition factor introduces an additional bias that counterbalances the impact of the features whenever they become correlated with this known partition. The resulting algorithmic framework was applied successfully to synthetic data, as well as to identifying text-based cross-religion correspondences. 1 In t ro d u c t i o n The standard task of feature-based data clustering deals with a single set of elements that are characterized by a unified set of features. The goal of the clustering task is to identify implicit constructs, or themes, within the clustered set, grouping together elements that are characterized similarly by the features. In recent years there has been growing interest in more complex clustering settings, in which additional information is incorporated [1], [2]. Several such extensions ([3]-[5]) are based on the information bottleneck (IB) framework [6], which facilitates coherent information-theoretic representation of different information types. In a recent line of research we have investigated the cross-dataset clustering task [7], [8]. In this setting, some inherent a-priori partition of the clustered data to distinct subsets is given. The clustering goal it to identify corresponding (analogous) structures that cut across the different subsets, while ignoring internal structures that characterize individual subsets. To accomplish this task, those features that commonly characterize elements across the different subsets guide the clustering process, while within-subset regularities are neutralized. In [7], we presented a distance-based hard clustering algorithm for the coupledclustering problem, in which the clustered data is pre-partitioned to two subsets. In [8], our setting, generalized to pre-partitions of any number of subsets, was addressed by a heuristic extension of the probabilistic IB algorithm, yielding improved empirical results. Specifically, the algorithm in [8] was based on a modification of the IB stable-point equation, which amplified the impact of features characterizing a formed cluster across all, or most, subsets. This paper describes an information-theoretic framework that motivates and extends the algorithm proposed in [8]. The given pre-partitioning is represented via a probability distribution variable, which may represent “soft” pre-partitioning of the data, versus the strictly disjoint subsets assumed in the earlier cross-dataset framework. Further, we present a new functional that captures the cross-partition motivation. From the new functional, we derive a stable-point equation underlying our algorithmic framework in conjunction with the corresponding IB equation. Our algorithm was tested empirically on synthetic data and on a real-world textbased task that aimed to identify corresponding themes across distinct religions. We have cross-clustered five sets of keywords that were extracted from topical corpora of texts about Buddhism, Christianity, Hinduism, Islam and Judaism. In distinction from standard clustering results, our algorithm reveals themes that are common to all religions, such as sacred writings, festivals, narratives and myths and theological principles, and avoids topical clusters that correspond to individual religions (for example, ‘Christmas’ and ‘Easter’ are clustered together with ‘Ramadan’ rather than with ‘Church’). Finally, we have paid specific attention to the framework of clustering with side information [4]. While this approach was presented for a somewhat different mindset, it might be used directly to address clustering across pre-partitioned data. We compare the technical details of the two approaches and demonstrate empirically that clustering with side information does not seem appropriate for the kind of cross-partition tasks that we explored. 2 Th e In fo rmat i o n B ot t len eck M et h od Probabilistic (“soft”) data clustering outputs, for each element x of the set being clustered and each cluster c, an assignment probability p(c|x). The IB method [6] interprets probabilistic clustering as lossy data compression. The given data is represented by a random variable X ranging over the clustered elements. X is compressed through another random variable C, ranging over the clusters. Every element x is characterized by conditional probability distribution p(Y|x), where Y is a third random variable taking the members y of a given set of features as values. The IB method formalizes the clustering task as minimizing the IB functional: L(IB) = I(C; X) − β I(C; Y) . (1) As known from information theory (Ch. 13 of [9]), minimizing the mutual information I(C; X) optimizes distorted compression rate. A complementary bias to maximize I(C; Y) is interpreted in [6] as articulating the level of relevance of Y to the obtained clustering, inferred from the level by which C can predict Y. β is a free parameter counterbalancing the two biases. It is shown in [6] that p(c|x) values that minimize L(IB) satisfy the following equation: p(c|x) = 1 p (c )e −β DKL [ p ( Y |x )|| p (Y |c ) ] , z( β , x) (2) where DKL stands for the Kullback-Leibler (KL) divergence, or relative entropy, between two distributions and z(β ,x) is a normalization function over C. Eq. (2) implies that, optimally, x is assigned to c in proportion to their KL distance in a feature distribution space, where the distribution p(Y|c) takes the role of a Start at time t = 0 and iterate the following update-steps, till convergence: IB1: initialize p t (c|x) randomly or arbitrarily −β DKL [ p (Y | x )|| pt −1 (Y |c ) ] pt (c|x) ∝ IB2: pt (c) = IB3: pt (y|c) = pt −1 (c ) e ∑ x (t = 0) (t > 0) p t (c | x ) p ( x ) 1 ∑ pt ( c | x) p ( y | x ) p ( x) p t (c ) x Figure 1: The Information Bottleneck iterative algorithm (with fixed β and |C|). representative, or centroid, of c. The feature variable Y is hence utilized as the (exclusive) means to guide clustering, beyond the random nature of compression. Figure 1 presents the IB iterative algorithm for a fixed value of β . The IB1 update step follows Eq. (2). The other two steps, which are derived from the IB functional as well, estimate the p(c) and p(y|c) values required for the next iteration. The algorithm converges to a local minimum of the IB functional. The IB setting, particularly the derivation of steps IB1 and IB3 of the algorithm, assumes that Y and C are independent given X, that is: I(C; Y|X) = ∑x p(x) I(C|x; Y|x) = 0. The balancing parameter β affects the number of distinct clusters being formed in a manner that resembles (inverse) temperature in physical systems. The higher β is (i.e., the stronger the bias to construct C that predicts Y well), more distinct clusters are required for encoding the data. For each |C| = 2, 3, …, there is a minimal β value, enabling the formation of |C| distinct clusters. Setting β to be smaller than this critical value corresponding to the current |C| would result in two or more clusters that are identical to one another. Based on this, the iterative algorithm is applied repeatedly within a gradual cooling-like (deterministic annealing) scheme: starting with random initialization of the p0 (c|x)'s, generate two clusters with the critical β value, found empirically, for |C| = 2. Then, use a perturbation on the obtained two-cluster configuration to initialize the p0(c|x)'s for a larger set of clusters and execute additional runs of the algorithm to identify the critical β value for the larger |C|. And so on: each output configuration is used as a basis for a more granular one. The final outcome is a “soft hierarchy” of probabilistic clusters. 3 Cro ss- p a rt i t i o n Clu st eri n g Cross-partition (CP) clustering introduces a factor – a pre-given partition of the clustered data – additional to what considered in a standard clustering setting. For representing this factor we introduce the pre-partitioning variable W, ranging over all parts w of the pre-given partition. Every data element x is associated with W through a given probability distribution p(W|x). Our goal is to cluster the data, so that the clusters C would not be correlated with W. We notice that Y, which is intended to direct the formation of clusters, might be a-priori correlated with W, so the formed clusters might end up being correlated with W as well. Our method aims at eliminating this aspect of Y. 3.1 I n f or ma t i o n D e f oc us i n g As noted, some of the information conveyed by Y characterizes structures correlated with W, while the other part of the information characterizes the target cross-W structures. We are interested in detecting the latter while filtering out the former. However, there is no direct a-priori separation between the two parts of the Ymediated information. Our strategy in tackling this difficulty is: we follow in general Y's directions, as the IB method does, while avoiding Y's impact whenever it entails undesired inter-dependencies of C and W. Our strategy implies conflicting biases with regard to the mutual information I(C,Y): it should be maximized in order to form meaningful clusters, but be minimized as well in the specific context where Y entails C–W dependencies. Accordingly, we propose a computational procedure directed by two distinct cost-terms in tandem. The first one is the IB functional (Eq. 1), introducing the bias to maximize I(C,Y). With this bias alone, Y might dictate (or “explain”, in retrospect) substantial C–W dependencies, implying a low I(C;W|Y) value. 1 Hence, the guideline of preventing Y from accounting for C–W dependencies is realized through an opposing bias of maximizing I(C;W|Y) = ∑y p(y) I(C|y; W|y). The second cost term – the Information Defocusing (ID) functional – consequently counterbalances minimization of I(C,Y) against the new bias: L(ID) = I(C; Y) − η I(C;W|Y) , (3) where η is a free parameter articulating the tradeoff between the biases. The ID functional captures our goal of reducing the impact of Y selectively: “defocusing” a specific aspect of the information Y conveys: the information correlated with W. In a like manner to the stable-point equation of the IB functional (Eq. 2), we derive the following stable-point equation for the ID functional: η p ( w) 1 p ( c )∏ w p ( y | c, w) η +1 , p(c|y) = z (η , y ) (4) where z(η,y) is a normalization function over C. The derivation relies on an additional assumption, I(C;W) = 0, imposing the intended independence between C and W (the detailed derivation will be described elsewhere). The intuitive interpretation of Eq. (4) is as follows: a feature y is to be associated with a cluster c in proportion to a weighted, though flattened, geometric mean of the “W-projected centroids” p(y|c,w), priored by p(c). 2 This scheme overweighs y's that contribute to c evenly across W. Thus, clusters satisfying Eq. (4) are situated around centroids biased towards evenly contributing features. The higher η is, heavier emphasis is put on suppressing disagreements between the w's. For η → ∞ a plain weighted geometric-mean scheme is obtained. The inclusion of a step derived from Eq. (4) in our algorithm (see below) facilitates convergence on a configuration with centroids dominated by features that are evenly distributed across W. 3.2 T h e Cr os s - p a r t i t i on C l us t e r i n g A l g or i t h m Our proposed cross partition (CP) clustering algorithm (Fig. 2) seeks a clustering configuration that optimizes simultaneously both the IB and ID functionals, 1 Notice that “Z explaining well the dependencies between A and B” is equivalent with “A and B sharing little information in common given Z”, i.e. low I(A;B|Z) . Complete conditional independence is exemplified in the IB framework, assuming I(C;Y|X) = 0. 2 Eq. (4) resembles our suggestion in [8] to compute a geometric average over the subsets; in the current paper this scheme is analytically derived from the ID functional. Start at time t = 0 and iterate the following update-steps, till convergence: CP1: Initialize p t (c|x) randomly or arbitrarily −β DKL [ p (Y | x )|| pt −1 (Y |c ) ] pt (c|x) ∝ CP2: pt (c) = CP3: p*t (y|c,w) = CP4: (t = 0) p t −1 (c ) e ∑ x (t > 0) p t (c | x ) p ( x ) 1 ∑ pt ( c | x ) p ( y | x ) p ( w | x ) p ( x ) p t ( c ) p ( w) x Initialize p*t (c) randomly or arbitrarily (t = 0) p*t (c) (t > 0) = ∑ y p *t −1 (c | y ) p ( y ) η CP5: p*t (c|y) ∝ p *t (c)∏w p *t ( y | c, w) η +1 CP6: pt (y|c) = p ( w) p *t (c | y ) p ( y ) p *t (c ) Figure 2: The cross-partition clustering iterative algorithm (with fixed β, η, and |C|). thus obtaining clusters that cut across the pre-given partition W. To this end, the algorithm interleaves an iterative computation of the stable-point equations, and the additional estimated parameters, for both functionals. Steps CP1, CP2 and CP6 correspond to the computations related to the IB functional, while steps CP3, CP4 and CP5, which compute a separate set of parameters (denoted by an asterisk), correspond to the ID functional. Figure 3 summarizes the roles of the two functionals in the dynamics of the CP algorithm. The two components of the iterative cycle are tied together in steps CP3 and CP6, in which parameters from one set are used as input to compute a parameter of other set. The derivation of step CP3 relies on an additional assumption, namely that C, Y and W are jointly independent given X. This assumption, which extends to W the underlying assumption of the IB setting that C and Y are independent given X, still entails the IB stable point equation. At convergence, the stable point equations for both the IB and ID functionals are satisfied, each by its own set of parameters (in steps CP1 and CP5). The deterministic annealing scheme, which gradually increases β over repeated runs (see Sec. 2), is applied for the CP algorithm as well with η held fixed. For a given target number of clusters |C|, the algorithm empirically converges with a wide range of η values 3. I(C;X) ↓ IB β↑ I(C;Y) ↓ ID η↑ I(C; W|Y) I(C; Y; W|X) = 0 ← assumptions → I(C;W) = 0 Figure 3: The interplay of the IB and the ID functionals in the CP algorithm. High η values tend to dictate centroids with features that are unevenly distributed across W, resulting in shrinkage of some of the clusters. Further analysis will be provided in future work. 3 4 Exp e ri men t a l Resu lt s Our synthetic setting consisted of 75 virtual elements, evenly pre-partitioned into three 25-element parts denoted X 1 , X2 and X3 (in our formalism, for each clustered element x, p(w|x) = 1 holds for either w = 1, 2, or 3). On top of this pre-partition, we partitioned the data twice, getting two (exhaustive) clustering configurations: 1. Target cross-W clustering: five clusters, each with representatives from all X w's; 2. Masking within-w clustering: six clusters, each consisting of roughly half the elements of either X 1, X 2 or X3 with no representatives from the other X w's. Each cluster, of both configurations, was characterized by a designated subset of features. Masking clusters were designed to be more salient than target clusters: they had more designated features (60 vs. 48 per cluster, i.e., 360 vs. 240 in total) and their elements shared higher feature-element (virtual) co-occurrence counts with those designated features (900 vs. 450 per element-feature pair). Noise (random positive integer < 200) was added to all counts associating elements with their designated features (for both within-w and cross-W clusters), as well as to roughly quarter of the zero counts associating elements with the rest of the features. The plain IB method consistently produced configurations strongly correlated with the masking clustering, while the CP algorithm revealed the target configuration. We got (see Table 1A) almost perfect results in configurations of nearly equal-sized cross-W clusters, and somewhat less perfect reconstruction in configurations of diverging sizes (6, 9, 15, 21 and 24). Performance level was measured relatively to optimal target-output cluster match by the proportion of elements correctly assigned, where assignment of an element x follows its highest p(c|x). The results indicated were averaged over 200 runs. They were obtained for the optimal η, which was found to be higher in the diverging-sizes task. In the text-based task, the clustered elements – keywords – were automatically extracted from five distinct corpora addressing five religions: introductory web pages, online magazines, encyclopedic entries etc., all downloaded from the Internet. The clustered keyword set X was consequently pre-partitioned to disjoint subsets {X w} w∈W, one for each religion4 (|X w| ≈ 200 for each w). We conducted experiments simultaneously involving religion pairs as well as all five religions. We took the features Y to be a set of words that commonly occur within all five corpora (|Y| ≈ 7000). x–y co-occurrences were recorded within ±5-word sliding window truncated by sentence boundaries. η was fixed to a value (1.0) enabling the formation of 20 clusters in all settings. The obtained clusters revealed interesting cross religion themes (see Sec. 1). For instance, the cluster (one of nine) capturing the theme of sacred festivals: the three highest p(c/x) members within each religion were Full-moon, Ceremony, Celebration (Buddhism); Easter, Sunday, Christmas Table 1: Average correct assignment proportion scores for the synthetic task (A) and Jaccard-coefficient scores for the religion keyword classification task (B). A. Synthetic Data IB CP B. Religion Data IB Coupled Clustering [7] CP (cross-expert agreement on religion pairs .462±.232) equal-size clusters .305 .985 non-equal clusters .292 .827 4 religion pairs all five (one case) .200±.100 .220±.138 .407±.144 .104 ––––––– .167 A keyword x that appeared in the corpora of different religions was considered as a distinct element for each religion, so the Xw were kept disjointed. (Chrsitianity); Puja, Ceremony, Festival (Hinduism); Id-al-Fitr, Friday, Ramadan, (Islam); and Sukkoth, Shavuot, Rosh-Hodesh (Judaism). The closest cluster produced by the plain IB method was poorer by far, including Islamic Ramadan, and Id and Jewish Passover, Rosh-Hashanah and Sabbath (which our method ranked high too), but no single related term from the other religions. Our external evaluation standards were cross-religion keyword classes constructed manually by experts of comparative religion studies. One such expert classification involved all five religions, and eight classifications addressed religions in pairs. Each of the eight religion-pair classifications was contributed by two independent experts using the same keywords, so we could also assess the agreement between experts. As an overlap measure we employed the Jaccard coefficient: the number of element pairs co-assigned together by both one of the evaluated clusters and one of the expert classes, divided by the number of pairs co-assigned by either our clusters or the expert (or both). We did not assume the number of expert classes is known in advance (as done in the synthetic experiments), so the results were averaged over all configurations of 2–16 cluster hierarchy, for each experiment. The results shown in Table 1B – clear improvement relatively to plain IB and the distance-based coupled clustering [7] – are, however, persistent when the number of clusters is taken to be equal to the number of classes, or if only the best score in hierarchy is considered. The level of cross-expert agreement indicates that our results are reasonably close to the scores expected in such subjective task. 5 C o mp a ri so n t o R e la t ed W o r k The information bottleneck framework served as the basis for several approaches that represent additional information in their clustering setting. The multivariate information bottleneck (MIB) adapts the IB framework for networks of multiple variables [3]. However, all variables in such networks are either compressed (like X), or predicted (like Y). The incorporation of an empirical variable to be masked or defocused in the sense of our W is not possible. Including such variables in the MIB framework might be explored in future work. Particularly relevant to our work is the IB-based method for extracting relevant constructs with side information [4]. This approach addresses settings in which two different types of features are distinguished explicitly: relevant versus irrelevant ones, denoted by Y+ and Y−. Both types of features are incorporated within a single functional to be minimized: L(IB-side-info) = I(C; X) − β ( I(C; Y +) − γ I(C; Y−) ), which directly drives clustering to de-correlate C and Y−. Formally, our setting can be mapped to the side information setting by regarding the pre-partition W simply as the additional set of irrelevant features, giving symmetric (and opposite) roles to W and Y. However, it seems that this view does not address properly the desired cross-partition setting. In our setting, it is assumed that clustering should be guided in general by Y, while W should only neutralize particular information within Y that would otherwise yield the undesired correlation between C and W (as described in Section 3.1). For that reason, the defocusing functional tie the three variables together by conditioning the de-correlation of C and W on Y, while its underlying assumption ensures the global de-correlation. Indeed, our method was found empirically superior on the cross-dataset task. The side-information IB method (the iterative algorithm with best scoring γ) achieves correct assignment proportion of 0.52 in both synthetic tasks, where our method scored 0.99 and 0.83 (see Table 1A) and, in the religion-pair keyword classification task, Jaccard coefficient improved by 20% relatively to plain IB (compared to our 100% improvement, see Table 1B). 6 C o n c lu si o n s This paper addressed the problem of clustering a pre-partitioned dataset, aiming to detect new internal structures that are not correlated with the pre-given partition but rather cut across its components. The proposed framework extends the cross-dataset clustering algorithm [8], providing better formal grounding and representing any pre-given (soft) partition of the dataset. Supported by empirical evidence, we suggest that our framework is better suited for the cross-partition task than applying the side-information framework [4], which was originally developed to address a somewhat different setting. We also demonstrate substantial empirical advantage over the distance-based coupled-clustering algorithm [7]. As an applied real-world goal, the algorithm successfully detects cross-religion commonalities. This goal exemplifies the more general notion of detecting analogies across different systems, which is a somewhat vague and non-consensual task and therefore especially challenging for a computational framework. Our approach can be viewed as an initial step towards principled identification of “hidden” commonalities between substantially different real world systems, while suppressing the vast majority of attributes that are irrelevant for the analogy. Further research may study the role of defocusing in supervised learning, where some pre-given partitions might mask the role of underlying discriminative features. Additionally, it would be interesting to explore relationships to other disciplines, e.g., network information theory ([9], Ch. 14) which provided motivation for the side-information approach. Finally, both frameworks (ours and side-information) suggest the importance of dealing wisely with information that should not dictate the clustering output directly. A c k n ow l e d g me n t s We thank Yuval Krymolowski for helpful discussions and Tiina Mahlamäki, Eitan Reich and William Shepard, for contributing the religion keyword classifications. References [1] Hofmann, T. (2001) Unsupervised learning by probabilistic latent semantic analysis. Journal of Machine Learning Research, 41(1):177-196. [2] Wagstaff K., Cardie C., Rogers S. and Schroedl S., 2001. Constrained K-Means clustering with background knowledge. The 18th International Conference on Machine Learning (ICML-2001), pp 577-584. [3] Friedman N., Mosenzon O., Slonim N. & Tishby N. (2002) Multivariate information bottleneck. The 17th conference on Uncertainty in Artificial Intelligence (UAI-17), pp. 152161. [4] Chechik G. & Tishby N. (2002) Extracting relevant structures with side information. Advances in Neural Processing Information Systems 15 (NIPS'02). [5] Globerson, A., Chechik G. & Tishby N. (2003) Sufficient dimensionality reduction. Journal of Machine Learning Research, 3:1307-1331. [6] Tishby, N., Pereira, F. C. & Bialek, W. (1999) The information bottleneck method. The 37th Annual Allerton Conference on Communication, Control, and Computing, pp. 368-379. [7] Marx, Z., Dagan, I., Buhmann, J. M. & Shamir E. (2002) Coupled clustering: A method for detecting structural correspondence. Journal of Machine Learning Research, 3:747-780. [8] Dagan, I., Marx, Z. & Shamir E (2002) Cross-dataset clustering: Revealing corresponding themes across multiple corpora. Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pp. 15-21. [9] Cover T. M. & Thomas J. A. (1991) Elements of Information Theory. Sons, Inc., New York, New York. John Wiley &

6 0.16266951 107 nips-2003-Learning Spectral Clustering

7 0.13063437 152 nips-2003-Pairwise Clustering and Graphical Models

8 0.12769219 40 nips-2003-Bias-Corrected Bootstrap and Model Uncertainty

9 0.11215192 24 nips-2003-An Iterative Improvement Procedure for Hierarchical Clustering

10 0.096542202 79 nips-2003-Gene Expression Clustering with Functional Mixture Models

11 0.086139478 86 nips-2003-ICA-based Clustering of Genes from Microarray Expression Data

12 0.080661982 29 nips-2003-Applying Metric-Trees to Belief-Point POMDPs

13 0.075365767 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data

14 0.074394852 150 nips-2003-Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering

15 0.073653951 113 nips-2003-Learning with Local and Global Consistency

16 0.070287324 169 nips-2003-Sample Propagation

17 0.068973266 136 nips-2003-New Algorithms for Efficient High Dimensional Non-parametric Classification

18 0.066196181 173 nips-2003-Semi-supervised Protein Classification Using Cluster Kernels

19 0.062511325 126 nips-2003-Measure Based Regularization

20 0.061725639 63 nips-2003-Error Bounds for Transductive Learning via Compression and Clustering


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.232), (1, -0.121), (2, -0.062), (3, 0.123), (4, -0.158), (5, 0.146), (6, -0.153), (7, 0.23), (8, -0.083), (9, 0.151), (10, -0.08), (11, 0.01), (12, -0.048), (13, -0.06), (14, 0.012), (15, 0.085), (16, -0.003), (17, -0.152), (18, 0.046), (19, 0.019), (20, 0.01), (21, -0.011), (22, -0.01), (23, -0.036), (24, 0.04), (25, 0.024), (26, -0.01), (27, 0.029), (28, 0.03), (29, -0.009), (30, 0.029), (31, -0.0), (32, 0.086), (33, -0.029), (34, 0.029), (35, 0.093), (36, 0.059), (37, -0.054), (38, -0.014), (39, 0.073), (40, -0.072), (41, -0.016), (42, -0.048), (43, -0.01), (44, 0.054), (45, -0.116), (46, 0.029), (47, 0.008), (48, 0.072), (49, -0.002)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96221536 111 nips-2003-Learning the k in k-means

Author: Greg Hamerly, Charles Elkan

Abstract: When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. In this paper we present an improved algorithm for learning k while clustering. The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian. Two key advantages are that the hypothesis test does not limit the covariance of the data and does not compute a full covariance matrix. Additionally, G-means only requires one intuitive parameter, the standard statistical significance level α. We present results from experiments showing that the algorithm works well, and better than a recent method based on the BIC penalty for model complexity. In these experiments, we show that the BIC is ineffective as a scoring function, since it does not penalize strongly enough the model’s complexity. 1 Introduction and related work Clustering algorithms are useful tools for data mining, compression, probability density estimation, and many other important tasks. However, most clustering algorithms require the user to specify the number of clusters (called k), and it is not always clear what is the best value for k. Figure 1 shows examples where k has been improperly chosen. Choosing k is often an ad hoc decision based on prior knowledge, assumptions, and practical experience. Choosing k is made more difficult when the data has many dimensions, even when clusters are well-separated. Center-based clustering algorithms (in particular k-means and Gaussian expectationmaximization) usually assume that each cluster adheres to a unimodal distribution, such as Gaussian. With these methods, only one center should be used to model each subset of data that follows a unimodal distribution. If multiple centers are used to describe data drawn from one mode, the centers are a needlessly complex description of the data, and in fact the multiple centers capture the truth about the subset less well than one center. In this paper we present a simple algorithm called G-means that discovers an appropriate k using a statistical test for deciding whether to split a k-means center into two centers. We describe examples and present experimental results that show that the new algorithm 0.9 4 0.8 3 0.7 2 0.6 1 0.5 0 0.4 −1 0.3 −2 −3 0.2 0.1 −0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 −4 −3 −2 −1 0 1 2 3 Figure 1: Two clusterings where k was improperly chosen. Dark crosses are k-means centers. On the left, there are too few centers; five should be used. On the right, too many centers are used; one center is sufficient for representing the data. In general, one center should be used to represent one Gaussian cluster. is successful. This technique is useful and applicable for many clustering algorithms other than k-means, but here we consider only the k-means algorithm for simplicity. Several algorithms have been proposed previously to determine k automatically. Like our method, most previous methods are wrappers around k-means or some other clustering algorithm for fixed k. Wrapper methods use splitting and/or merging rules for centers to increase or decrease k as the algorithm proceeds. Pelleg and Moore [14] proposed a regularization framework for learning k, which they call X-means. The algorithm searches over many values of k and scores each clustering model using the so-called Bayesian Information Criterion [10]: BIC(C|X) = L(X|C) − p log n 2 where L(X|C) is the log-likelihood of the dataset X according to model C, p = k(d + 1) is the number of parameters in the model C with dimensionality d and k cluster centers, and n is the number of points in the dataset. X-means chooses the model with the best BIC score on the data. Aside from the BIC, other scoring functions are also available. Bischof et al. [1] use a minimum description length (MDL) framework, where the description length is a measure of how well the data are fit by the model. Their algorithm starts with a large value for k and removes centers (reduces k) whenever that choice reduces the description length. Between steps of reducing k, they use the k-means algorithm to optimize the model fit to the data. With hierarchical clustering algorithms, other methods may be employed to determine the best number of clusters. One is to build a merging tree (“dendrogram”) of the data based on a cluster distance metric, and search for areas of the tree that are stable with respect to inter- and intra-cluster distances [9, Section 5.1]. This method of estimating k is best applied with domain-specific knowledge and human intuition. 2 The Gaussian-means (G-means) algorithm The G-means algorithm starts with a small number of k-means centers, and grows the number of centers. Each iteration of the algorithm splits into two those centers whose data appear not to come from a Gaussian distribution. Between each round of splitting, we run k-means on the entire dataset and all the centers to refine the current solution. We can initialize with just k = 1, or we can choose some larger value of k if we have some prior knowledge about the range of k. G-means repeatedly makes decisions based on a statistical test for the data assigned to each center. If the data currently assigned to a k-means center appear to be Gaussian, then we want to represent that data with only one center. However, if the same data do not appear Algorithm 1 G-means(X, α) 1: Let C be the initial set of centers (usually C ← {¯}). x 2: C ← kmeans(C, X). 3: Let {xi |class(xi ) = j} be the set of datapoints assigned to center cj . 4: Use a statistical test to detect if each {xi |class(xi ) = j} follow a Gaussian distribution (at confidence level α). 5: If the data look Gaussian, keep cj . Otherwise replace cj with two centers. 6: Repeat from step 2 until no more centers are added. to be Gaussian, then we want to use multiple centers to model the data properly. The algorithm will run k-means multiple times (up to k times when finding k centers), so the time complexity is at most O(k) times that of k-means. The k-means algorithm implicitly assumes that the datapoints in each cluster are spherically distributed around the center. Less restrictively, the Gaussian expectation-maximization algorithm assumes that the datapoints in each cluster have a multidimensional Gaussian distribution with a covariance matrix that may or may not be fixed, or shared. The Gaussian distribution test that we present below are valid for either covariance matrix assumption. The test also accounts for the number of datapoints n tested by incorporating n in the calculation of the critical value of the test (see Equation 2). This prevents the G-means algorithm from making bad decisions about clusters with few datapoints. 2.1 Testing clusters for Gaussian fit To specify the G-means algorithm fully we need a test to detect whether the data assigned to a center are sampled from a Gaussian. The alternative hypotheses are • H0 : The data around the center are sampled from a Gaussian. • H1 : The data around the center are not sampled from a Gaussian. If we accept the null hypothesis H0 , then we believe that the one center is sufficient to model its data, and we should not split the cluster into two sub-clusters. If we reject H0 and accept H1 , then we want to split the cluster. The test we use is based on the Anderson-Darling statistic. This one-dimensional test has been shown empirically to be the most powerful normality test that is based on the empirical cumulative distribution function (ECDF). Given a list of values xi that have been converted to mean 0 and variance 1, let x(i) be the ith ordered value. Let zi = F (x(i) ), where F is the N (0, 1) cumulative distribution function. Then the statistic is A2 (Z) = − 1 n n (2i − 1) [log(zi ) + log(1 − zn+1−i )] − n (1) i=1 Stephens [17] showed that for the case where µ and σ are estimated from the data (as in clustering), we must correct the statistic according to A2 (Z) ∗ = A2 (Z)(1 + 4/n − 25/(n2 )) (2) Given a subset of data X in d dimensions that belongs to center c, the hypothesis test proceeds as follows: 1. Choose a significance level α for the test. 2. Initialize two centers, called “children” of c. See the text for good ways to do this. 3. Run k-means on these two centers in X. This can be run to completion, or to some early stopping point if desired. Let c1 , c2 be the child centers chosen by k-means. 4. Let v = c1 − c2 be a d-dimensional vector that connects the two centers. This is the direction that k-means believes to be important for clustering. Then project X onto v: xi = xi , v /||v||2 . X is a 1-dimensional representation of the data projected onto v. Transform X so that it has mean 0 and variance 1. 5. Let zi = F (x(i) ). If A2 (Z) is in the range of non-critical values at confidence ∗ level α, then accept H0 , keep the original center, and discard {c1 , c2 }. Otherwise, reject H0 and keep {c1 , c2 } in place of the original center. A primary contribution of this work is simplifying the test for Gaussian fit by projecting the data to one dimension where the test is simple to apply. The authors of [5] also use this approach for online dimensionality reduction during clustering. The one-dimensional representation of the data allows us to consider only the data along the direction that kmeans has found to be important for separating the data. This is related to the problem of projection pursuit [7], where here k-means searches for a direction in which the data appears non-Gaussian. We must choose the significance level of the test, α, which is the desired probability of making a Type I error (i.e. incorrectly rejecting H0 ). It is appropriate to use a Bonferroni adjustment to reduce the chance of making Type I errors over multiple tests. For example, if we want a 0.01 chance of making a Type I error in 100 tests, we should apply a Bonferroni adjustment to make each test use α = 0.01/100 = 0.0001. To find k final centers the G-means algorithm makes k statistical tests, so the Bonferroni correction does not need to be extreme. In our tests, we always use α = 0.0001. We consider two ways to initialize the two child centers. Both approaches initialize with c ± m, where c is a center and m is chosen. The first method chooses m as a random d-dimensional vector such that ||m|| is small compared to the distortion of the data. A second method finds the main principal component s of the data (having eigenvalue λ), and chooses m = s 2λ/π. This deterministic method places the two centers in their expected locations under H0 . The principal component calculations require O(nd2 + d3 ) time and O(d2 ) space, but since we only want the main principal component, we can use fast methods like the power method, which takes time that is at most linear in the ratio of the two largest eigenvalues [4]. In this paper we use principal-component-based splitting. 2.2 An example Figure 2 shows a run of the G-means algorithm on a synthetic dataset with two true clusters and 1000 points, using α = 0.0001. The critical value for the Anderson-Darling test is 1.8692 for this confidence level. Starting with one center, after one iteration of G-means, we have 2 centers and the A2 statistic is 38.103. This is much larger than the critical value, ∗ so we reject H0 and accept this split. On the next iteration, we split each new center and repeat the statistical test. The A2 values for the two splits are 0.386 and 0.496, both of ∗ which are well below the critical value. Therefore we accept H0 for both tests, and discard these splits. Thus G-means gives a final answer of k = 2. 2.3 Statistical power Figure 3 shows the power of the Anderson-Darling test, as compared to the BIC. Lower is better for both plots. We run 1000 tests for each data point plotted for both plots. In the left 14 14 14 13 13 13 12 12 12 11 11 11 10 10 10 9 9 9 8 8 8 7 7 7 6 6 6 5 5 4 4 0 2 4 6 8 10 12 5 4 0 2 4 6 8 10 12 0 2 4 6 8 10 12 Figure 2: An example of running G-means for three iterations on a 2-dimensional dataset with two true clusters and 1000 points. Starting with one center (left plot), G-means splits into two centers (middle). The test for normality is significant, so G-means rejects H0 and keeps the split. After splitting each center again (right), the test values are not significant, so G-means accepts H0 for both tests and does not accept these splits. The middle plot is the G-means answer. See the text for further details. 1 1 G-means X-means 0.8 P(Type II error) P(Type I error) 0.8 G-means X-means 0.6 0.4 0.2 0.6 0.4 0.2 0 0 0 30 60 90 120 150 number of datapoints 180 210 0 30 60 90 120 150 number of datapoints 180 210 Figure 3: A comparison of the power of the Anderson-Darling test versus the BIC. For the AD test we fix the significance level (α = 0.0001), while the BIC’s significance level depends on n. The left plot shows the probability of incorrectly splitting (Type I error) one true 2-d cluster that is 5% elliptical. The right plot shows the probability of incorrectly not splitting two true clusters separated by 5σ (Type II error). Both plots are functions of n. Both plots show that the BIC overfits (splits clusters) when n is small. plot, for each test we generate n datapoints from a single true Gaussian distribution, and then plot the frequency with which BIC and G-means will choose k = 2 rather than k = 1 (i.e. commit a Type I error). BIC tends to overfit by choosing too many centers when the data is not strictly spherical, while G-means does not. This is consistent with the tests of real-world data in the next section. While G-means commits more Type II errors when n is small, this prevents it from overfitting the data. The BIC can be considered a likelihood ratio test, but with a significance level that cannot be fixed. The significance level instead varies depending on n and ∆k (the change in the number of model parameters between two models). As n or ∆k decrease, the significance level increases (the BIC becomes weaker as a statistical test) [10]. Figure 3 shows this effect for varying n. In [11] the authors show that penalty-based methods require problemspecific tuning and don’t generalize as well as other methods, such as cross validation. 3 Experiments Table 1 shows the results from running G-means and X-means on many large synthetic. On synthetic datasets with spherically distributed clusters, G-means and X-means do equally Table 1: Results for many synthetic datasets. We report distortion relative to the optimum distortion for the correct clustering (closer to one is better), and time is reported relative to k-means run with the correct k. For BIC, larger values are better, but it is clear that finding the correct clustering does not always coincide with finding a larger BIC. Items with a star are where X-means always chose the largest number of centers we allowed. dataset synthetic k=5 synthetic k=20 synthetic k=80 synthetic k=5 synthetic k=20 synthetic k=80 synthetic k=5 synthetic k=20 synthetic k=80 d 2 k found 9.1± 9.9 18.1± 3.2 20.1± 0.6 70.5±11.6 80.0± 0.2 171.7±23.7 5.0± 0.0 *20.0± 0.0 20.0± 0.1 *80.0± 0.0 80.2± 0.5 229.2±36.8 5.0± 0.0 *20.0± 0.0 20.0± 0.0 *80.0± 0.0 80.0± 0.0 171.5±10.9 method G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means 2 2 8 8 8 32 32 32 BIC(×104 ) -0.19±2.70 0.70±0.93 0.21±0.18 14.83±3.50 1.84±0.12 40.16±6.59 -0.74±0.16 -2.28±0.20 -0.18±0.17 14.36±0.21 1.45±0.20 52.28±9.26 -3.36±0.21 -27.92±0.22 -2.73±0.22 -11.13±0.23 -1.10±0.16 11.78±2.74 distortion(× optimal) 0.89± 0.23 0.37± 0.12 0.99± 0.01 9.45±28.02 1.00± 0.01 48.49±70.04 1.00± 0.00 0.47± 0.03 0.99± 0.00 0.47± 0.01 0.99± 0.00 0.57± 0.06 1.00± 0.00 0.76± 0.00 1.00± 0.00 0.76± 0.01 1.00± 0.00 0.84± 0.01 7 7 6 6 5 5 4 4 3 3 2 2 1 time(× k-means) 13.2 2.8 2.1 1.2 2.2 1.8 4.6 11.0 2.6 4.0 2.9 6.5 4.4 29.9 2.3 21.2 2.8 53.3 1 0 0 2 4 6 8 10 12 0 0 2 4 6 8 10 12 Figure 4: 2-d synthetic dataset with 5 true clusters. On the left, G-means correctly chooses 5 centers and deals well with non-spherical data. On the right, the BIC causes X-means to overfit the data, choosing 20 unevenly distributed clusters. well at finding the correct k and maximizing the BIC statistic, so we don’t show these results here. Most real-world data is not spherical, however. The synthetic datasets used here each have 5000 datapoints in d = 2/8/32 dimensions. The true ks are 5, 20, and 80. For each synthetic dataset type, we generate 30 datasets with the true center means chosen uniformly randomly from the unit hypercube, and choosing σ so that no two clusters are closer than 3σ apart. Each cluster is also given a transformation to make it non-spherical, by multiplying the data by a randomly chosen scaling and rotation matrix. We run G-means starting with one center. We allow X-means to search between 2 and 4k centers (where here k is the true number of clusters). The G-means algorithm clearly does better at finding the correct k on non-spherical data. Its results are closer to the true distortions and the correct ks. The BIC statistic that X-means uses has been formulated to maximize the likelihood for spherically-distributed data. Thus it overestimates the number of true clusters in non-spherical data. This is especially evident when the number of points per cluster is small, as in datasets with 80 true clusters. 1 2 2 3 3 4 4 Digit 0 1 Digit 0 5 5 6 6 7 7 8 8 9 9 5 10 15 20 25 30 Cluster 10 20 30 40 50 60 Cluster Figure 5: NIST and Pendigits datasets: correspondence between each digit (row) and each cluster (column) found by G-means. G-means did not have the labels, yet it found meaningful clusters corresponding with the labels. Because of this overestimation, X-means often hits our limit of 4k centers. Figure 4 shows an example of overfitting on a dataset with 5 true clusters. X-means chooses k = 20 while G-means finds all 5 true cluster centers. Also of note is that X-means does not distribute centers evenly among clusters; some clusters receive one center, but others receive many. G-means runs faster than X-means for 8 and 32 dimensions, which we expect, since the kd-tree structures which make X-means fast in low dimensions take time exponential in d, making them slow for more than 8 to 12 dimensions. All our code is written in Matlab; X-means is written in C. 3.1 Discovering true clusters in labeled data We tested these algorithms on two real-world datasets for handwritten digit recognition: the NIST dataset [12] and the Pendigits dataset [2]. The goal is to cluster the data without knowledge of the labels and measure how well the clustering captures the true labels. Both datasets have 10 true classes (digits 0-9). NIST has 60000 training examples and 784 dimensions (28×28 pixels). We use 6000 randomly chosen examples and we reduce the dimension to 50 by random projection (following [3]). The Pendigits dataset has 7984 examples and 16 dimensions; we did not change the data in any way. We cluster each dataset with G-means and X-means, and measure performance by comparing the cluster labels Lc with the true labels Lt . We define the partition quality (PQ) as kt kc kt 2 2 pq = i=1 j=1 p(i, j) i=1 p(i) where kt is the true number of classes, and kc is the number of clusters found by the algorithm. This metric is maximized when Lc induces the same partition of the data as Lt ; in other words, when all points in each cluster have the same true label, and the estimated k is the true k. The p(i, j) term is the frequency-based probability that a datapoint will be labeled i by Lt and j by Lc . This quality is normalized by the sum of true probabilities, squared. This statistic is related to the Rand statistic for comparing partitions [8]. For the NIST dataset, G-means finds 31 clusters in 30 seconds with a PQ score of 0.177. X-means finds 715 clusters in 4149 seconds, and 369 of these clusters contain only one point, indicating an overestimation problem with the BIC. X-means receives a PQ score of 0.024. For the Pendigits dataset, G-means finds 69 clusters in 30 seconds, with a PQ score of 0.196; X-means finds 235 clusters in 287 seconds, with a PQ score of 0.057. Figure 5 shows Hinton diagrams of the G-means clusterings of both datasets, showing that G-means succeeds at identifying the true clusters concisely, without aid of the labels. The confusions between different digits in the NIST dataset (seen in the off-diagonal elements) are common for other researchers using more sophisticated techniques, see [3]. 4 Discussion and conclusions We have introduced the new G-means algorithm for learning k based on a statistical test for determining whether datapoints are a random sample from a Gaussian distribution with arbitrary dimension and covariance matrix. The splitting uses dimension reduction and a powerful test for Gaussian fitness. G-means uses this statistical test as a wrapper around k-means to discover the number of clusters automatically. The only parameter supplied to the algorithm is the significance level of the statistical test, which can easily be set in a standard way. The G-means algorithm takes linear time and space (plus the cost of the splitting heuristic and test) in the number of datapoints and dimension, since k-means is itself linear in time and space. Empirically, the G-means algorithm works well at finding the correct number of clusters and the locations of genuine cluster centers, and we have shown it works well in moderately high dimensions. Clustering in high dimensions has been an open problem for many years. Recent research has shown that it may be preferable to use dimensionality reduction techniques before clustering, and then use a low-dimensional clustering algorithm such as k-means, rather than clustering in the high dimension directly. In [3] the author shows that using a simple, inexpensive linear projection preserves many of the properties of data (such as cluster distances), while making it easier to find the clusters. Thus there is a need for good-quality, fast clustering algorithms for low-dimensional data. Our work is a step in this direction. Additionally, recent image segmentation algorithms such as normalized cut [16, 13] are based on eigenvector computations on distance matrices. These “spectral” clustering algorithms still use k-means as a post-processing step to find the actual segmentation and they require k to be specified. Thus we expect G-means will be useful in combination with spectral clustering. References [1] Horst Bischof, Aleˇ Leonardis, and Alexander Selb. MDL principle for robust vector quantisation. Pattern analysis and applications, 2:59–72, s 1999. [2] C.L. Blake and C.J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/∼mlearn/MLRepository.html. [3] Sanjoy Dasgupta. Experiments with random projection. In Uncertainty in Artificial Intelligence: Proceedings of the Sixteenth Conference (UAI-2000), pages 143–151, San Francisco, CA, 2000. Morgan Kaufmann Publishers. [4] Gianna M. Del Corso. Estimating an eigenvector by the power method with a random start. SIAM Journal on Matrix Analysis and Applications, 18(4):913–937, 1997. [5] Chris Ding, Xiaofeng He, Hongyuan Zha, and Horst Simon. Adaptive dimension reduction for clustering high dimensional data. In Proceedings of the 2nd IEEE International Conference on Data Mining, 2002. [6] Fredrik Farnstrom, James Lewis, and Charles Elkan. Scalability for clustering algorithms revisited. SIGKDD Explorations, 2(1):51–57, 2000. [7] Peter J. Huber. Projection pursuit. Annals of Statistics, 13(2):435–475, June 1985. [8] L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2:193–218, 1985. [9] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264–323, 1999. [10] Robert E. Kass and Larry Wasserman. A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90(431):928–934, 1995. [11] Michael J. Kearns, Yishay Mansour, Andrew Y. Ng, and Dana Ron. An experimental and theoretical comparison of model selection methods. In Computational Learing Theory (COLT), pages 21–30, 1995. [12] Yann LeCun, L´ on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the e IEEE, 86(11):2278–2324, 1998. [13] Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. Neural Information Processing Systems, 14, 2002. [14] Dan Pelleg and Andrew Moore. X-means: Extending K-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conf. on Machine Learning, pages 727–734. Morgan Kaufmann, San Francisco, CA, 2000. [15] Peter Sand and Andrew Moore. Repairing faulty mixture models using density estimation. In Proceedings of the 18th International Conf. on Machine Learning. Morgan Kaufmann, San Francisco, CA, 2001. [16] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000. [17] M. A. Stephens. EDF statistics for goodness of fit and some comparisons. American Statistical Association, 69(347):730–737, September 1974.

2 0.76790547 46 nips-2003-Clustering with the Connectivity Kernel

Author: Bernd Fischer, Volker Roth, Joachim M. Buhmann

Abstract: Clustering aims at extracting hidden structure in dataset. While the problem of finding compact clusters has been widely studied in the literature, extracting arbitrarily formed elongated structures is considered a much harder problem. In this paper we present a novel clustering algorithm which tackles the problem by a two step procedure: first the data are transformed in such a way that elongated structures become compact ones. In a second step, these new objects are clustered by optimizing a compactness-based criterion. The advantages of the method over related approaches are threefold: (i) robustness properties of compactness-based criteria naturally transfer to the problem of extracting elongated structures, leading to a model which is highly robust against outlier objects; (ii) the transformed distances induce a Mercer kernel which allows us to formulate a polynomial approximation scheme to the generally N Phard clustering problem; (iii) the new method does not contain free kernel parameters in contrast to methods like spectral clustering or mean-shift clustering. 1

3 0.74686217 73 nips-2003-Feature Selection in Clustering Problems

Author: Volker Roth, Tilman Lange

Abstract: A novel approach to combining clustering and feature selection is presented. It implements a wrapper strategy for feature selection, in the sense that the features are directly selected by optimizing the discriminative power of the used partitioning algorithm. On the technical side, we present an efficient optimization algorithm with guaranteed local convergence property. The only free parameter of this method is selected by a resampling-based stability analysis. Experiments with real-world datasets demonstrate that our method is able to infer both meaningful partitions and meaningful subsets of features. 1

4 0.72545761 87 nips-2003-Identifying Structure across Pre-partitioned Data

Author: Zvika Marx, Ido Dagan, Eli Shamir

Abstract: We propose an information-theoretic clustering approach that incorporates a pre-known partition of the data, aiming to identify common clusters that cut across the given partition. In the standard clustering setting the formation of clusters is guided by a single source of feature information. The newly utilized pre-partition factor introduces an additional bias that counterbalances the impact of the features whenever they become correlated with this known partition. The resulting algorithmic framework was applied successfully to synthetic data, as well as to identifying text-based cross-religion correspondences. 1 In t ro d u c t i o n The standard task of feature-based data clustering deals with a single set of elements that are characterized by a unified set of features. The goal of the clustering task is to identify implicit constructs, or themes, within the clustered set, grouping together elements that are characterized similarly by the features. In recent years there has been growing interest in more complex clustering settings, in which additional information is incorporated [1], [2]. Several such extensions ([3]-[5]) are based on the information bottleneck (IB) framework [6], which facilitates coherent information-theoretic representation of different information types. In a recent line of research we have investigated the cross-dataset clustering task [7], [8]. In this setting, some inherent a-priori partition of the clustered data to distinct subsets is given. The clustering goal it to identify corresponding (analogous) structures that cut across the different subsets, while ignoring internal structures that characterize individual subsets. To accomplish this task, those features that commonly characterize elements across the different subsets guide the clustering process, while within-subset regularities are neutralized. In [7], we presented a distance-based hard clustering algorithm for the coupledclustering problem, in which the clustered data is pre-partitioned to two subsets. In [8], our setting, generalized to pre-partitions of any number of subsets, was addressed by a heuristic extension of the probabilistic IB algorithm, yielding improved empirical results. Specifically, the algorithm in [8] was based on a modification of the IB stable-point equation, which amplified the impact of features characterizing a formed cluster across all, or most, subsets. This paper describes an information-theoretic framework that motivates and extends the algorithm proposed in [8]. The given pre-partitioning is represented via a probability distribution variable, which may represent “soft” pre-partitioning of the data, versus the strictly disjoint subsets assumed in the earlier cross-dataset framework. Further, we present a new functional that captures the cross-partition motivation. From the new functional, we derive a stable-point equation underlying our algorithmic framework in conjunction with the corresponding IB equation. Our algorithm was tested empirically on synthetic data and on a real-world textbased task that aimed to identify corresponding themes across distinct religions. We have cross-clustered five sets of keywords that were extracted from topical corpora of texts about Buddhism, Christianity, Hinduism, Islam and Judaism. In distinction from standard clustering results, our algorithm reveals themes that are common to all religions, such as sacred writings, festivals, narratives and myths and theological principles, and avoids topical clusters that correspond to individual religions (for example, ‘Christmas’ and ‘Easter’ are clustered together with ‘Ramadan’ rather than with ‘Church’). Finally, we have paid specific attention to the framework of clustering with side information [4]. While this approach was presented for a somewhat different mindset, it might be used directly to address clustering across pre-partitioned data. We compare the technical details of the two approaches and demonstrate empirically that clustering with side information does not seem appropriate for the kind of cross-partition tasks that we explored. 2 Th e In fo rmat i o n B ot t len eck M et h od Probabilistic (“soft”) data clustering outputs, for each element x of the set being clustered and each cluster c, an assignment probability p(c|x). The IB method [6] interprets probabilistic clustering as lossy data compression. The given data is represented by a random variable X ranging over the clustered elements. X is compressed through another random variable C, ranging over the clusters. Every element x is characterized by conditional probability distribution p(Y|x), where Y is a third random variable taking the members y of a given set of features as values. The IB method formalizes the clustering task as minimizing the IB functional: L(IB) = I(C; X) − β I(C; Y) . (1) As known from information theory (Ch. 13 of [9]), minimizing the mutual information I(C; X) optimizes distorted compression rate. A complementary bias to maximize I(C; Y) is interpreted in [6] as articulating the level of relevance of Y to the obtained clustering, inferred from the level by which C can predict Y. β is a free parameter counterbalancing the two biases. It is shown in [6] that p(c|x) values that minimize L(IB) satisfy the following equation: p(c|x) = 1 p (c )e −β DKL [ p ( Y |x )|| p (Y |c ) ] , z( β , x) (2) where DKL stands for the Kullback-Leibler (KL) divergence, or relative entropy, between two distributions and z(β ,x) is a normalization function over C. Eq. (2) implies that, optimally, x is assigned to c in proportion to their KL distance in a feature distribution space, where the distribution p(Y|c) takes the role of a Start at time t = 0 and iterate the following update-steps, till convergence: IB1: initialize p t (c|x) randomly or arbitrarily −β DKL [ p (Y | x )|| pt −1 (Y |c ) ] pt (c|x) ∝ IB2: pt (c) = IB3: pt (y|c) = pt −1 (c ) e ∑ x (t = 0) (t > 0) p t (c | x ) p ( x ) 1 ∑ pt ( c | x) p ( y | x ) p ( x) p t (c ) x Figure 1: The Information Bottleneck iterative algorithm (with fixed β and |C|). representative, or centroid, of c. The feature variable Y is hence utilized as the (exclusive) means to guide clustering, beyond the random nature of compression. Figure 1 presents the IB iterative algorithm for a fixed value of β . The IB1 update step follows Eq. (2). The other two steps, which are derived from the IB functional as well, estimate the p(c) and p(y|c) values required for the next iteration. The algorithm converges to a local minimum of the IB functional. The IB setting, particularly the derivation of steps IB1 and IB3 of the algorithm, assumes that Y and C are independent given X, that is: I(C; Y|X) = ∑x p(x) I(C|x; Y|x) = 0. The balancing parameter β affects the number of distinct clusters being formed in a manner that resembles (inverse) temperature in physical systems. The higher β is (i.e., the stronger the bias to construct C that predicts Y well), more distinct clusters are required for encoding the data. For each |C| = 2, 3, …, there is a minimal β value, enabling the formation of |C| distinct clusters. Setting β to be smaller than this critical value corresponding to the current |C| would result in two or more clusters that are identical to one another. Based on this, the iterative algorithm is applied repeatedly within a gradual cooling-like (deterministic annealing) scheme: starting with random initialization of the p0 (c|x)'s, generate two clusters with the critical β value, found empirically, for |C| = 2. Then, use a perturbation on the obtained two-cluster configuration to initialize the p0(c|x)'s for a larger set of clusters and execute additional runs of the algorithm to identify the critical β value for the larger |C|. And so on: each output configuration is used as a basis for a more granular one. The final outcome is a “soft hierarchy” of probabilistic clusters. 3 Cro ss- p a rt i t i o n Clu st eri n g Cross-partition (CP) clustering introduces a factor – a pre-given partition of the clustered data – additional to what considered in a standard clustering setting. For representing this factor we introduce the pre-partitioning variable W, ranging over all parts w of the pre-given partition. Every data element x is associated with W through a given probability distribution p(W|x). Our goal is to cluster the data, so that the clusters C would not be correlated with W. We notice that Y, which is intended to direct the formation of clusters, might be a-priori correlated with W, so the formed clusters might end up being correlated with W as well. Our method aims at eliminating this aspect of Y. 3.1 I n f or ma t i o n D e f oc us i n g As noted, some of the information conveyed by Y characterizes structures correlated with W, while the other part of the information characterizes the target cross-W structures. We are interested in detecting the latter while filtering out the former. However, there is no direct a-priori separation between the two parts of the Ymediated information. Our strategy in tackling this difficulty is: we follow in general Y's directions, as the IB method does, while avoiding Y's impact whenever it entails undesired inter-dependencies of C and W. Our strategy implies conflicting biases with regard to the mutual information I(C,Y): it should be maximized in order to form meaningful clusters, but be minimized as well in the specific context where Y entails C–W dependencies. Accordingly, we propose a computational procedure directed by two distinct cost-terms in tandem. The first one is the IB functional (Eq. 1), introducing the bias to maximize I(C,Y). With this bias alone, Y might dictate (or “explain”, in retrospect) substantial C–W dependencies, implying a low I(C;W|Y) value. 1 Hence, the guideline of preventing Y from accounting for C–W dependencies is realized through an opposing bias of maximizing I(C;W|Y) = ∑y p(y) I(C|y; W|y). The second cost term – the Information Defocusing (ID) functional – consequently counterbalances minimization of I(C,Y) against the new bias: L(ID) = I(C; Y) − η I(C;W|Y) , (3) where η is a free parameter articulating the tradeoff between the biases. The ID functional captures our goal of reducing the impact of Y selectively: “defocusing” a specific aspect of the information Y conveys: the information correlated with W. In a like manner to the stable-point equation of the IB functional (Eq. 2), we derive the following stable-point equation for the ID functional: η p ( w) 1 p ( c )∏ w p ( y | c, w) η +1 , p(c|y) = z (η , y ) (4) where z(η,y) is a normalization function over C. The derivation relies on an additional assumption, I(C;W) = 0, imposing the intended independence between C and W (the detailed derivation will be described elsewhere). The intuitive interpretation of Eq. (4) is as follows: a feature y is to be associated with a cluster c in proportion to a weighted, though flattened, geometric mean of the “W-projected centroids” p(y|c,w), priored by p(c). 2 This scheme overweighs y's that contribute to c evenly across W. Thus, clusters satisfying Eq. (4) are situated around centroids biased towards evenly contributing features. The higher η is, heavier emphasis is put on suppressing disagreements between the w's. For η → ∞ a plain weighted geometric-mean scheme is obtained. The inclusion of a step derived from Eq. (4) in our algorithm (see below) facilitates convergence on a configuration with centroids dominated by features that are evenly distributed across W. 3.2 T h e Cr os s - p a r t i t i on C l us t e r i n g A l g or i t h m Our proposed cross partition (CP) clustering algorithm (Fig. 2) seeks a clustering configuration that optimizes simultaneously both the IB and ID functionals, 1 Notice that “Z explaining well the dependencies between A and B” is equivalent with “A and B sharing little information in common given Z”, i.e. low I(A;B|Z) . Complete conditional independence is exemplified in the IB framework, assuming I(C;Y|X) = 0. 2 Eq. (4) resembles our suggestion in [8] to compute a geometric average over the subsets; in the current paper this scheme is analytically derived from the ID functional. Start at time t = 0 and iterate the following update-steps, till convergence: CP1: Initialize p t (c|x) randomly or arbitrarily −β DKL [ p (Y | x )|| pt −1 (Y |c ) ] pt (c|x) ∝ CP2: pt (c) = CP3: p*t (y|c,w) = CP4: (t = 0) p t −1 (c ) e ∑ x (t > 0) p t (c | x ) p ( x ) 1 ∑ pt ( c | x ) p ( y | x ) p ( w | x ) p ( x ) p t ( c ) p ( w) x Initialize p*t (c) randomly or arbitrarily (t = 0) p*t (c) (t > 0) = ∑ y p *t −1 (c | y ) p ( y ) η CP5: p*t (c|y) ∝ p *t (c)∏w p *t ( y | c, w) η +1 CP6: pt (y|c) = p ( w) p *t (c | y ) p ( y ) p *t (c ) Figure 2: The cross-partition clustering iterative algorithm (with fixed β, η, and |C|). thus obtaining clusters that cut across the pre-given partition W. To this end, the algorithm interleaves an iterative computation of the stable-point equations, and the additional estimated parameters, for both functionals. Steps CP1, CP2 and CP6 correspond to the computations related to the IB functional, while steps CP3, CP4 and CP5, which compute a separate set of parameters (denoted by an asterisk), correspond to the ID functional. Figure 3 summarizes the roles of the two functionals in the dynamics of the CP algorithm. The two components of the iterative cycle are tied together in steps CP3 and CP6, in which parameters from one set are used as input to compute a parameter of other set. The derivation of step CP3 relies on an additional assumption, namely that C, Y and W are jointly independent given X. This assumption, which extends to W the underlying assumption of the IB setting that C and Y are independent given X, still entails the IB stable point equation. At convergence, the stable point equations for both the IB and ID functionals are satisfied, each by its own set of parameters (in steps CP1 and CP5). The deterministic annealing scheme, which gradually increases β over repeated runs (see Sec. 2), is applied for the CP algorithm as well with η held fixed. For a given target number of clusters |C|, the algorithm empirically converges with a wide range of η values 3. I(C;X) ↓ IB β↑ I(C;Y) ↓ ID η↑ I(C; W|Y) I(C; Y; W|X) = 0 ← assumptions → I(C;W) = 0 Figure 3: The interplay of the IB and the ID functionals in the CP algorithm. High η values tend to dictate centroids with features that are unevenly distributed across W, resulting in shrinkage of some of the clusters. Further analysis will be provided in future work. 3 4 Exp e ri men t a l Resu lt s Our synthetic setting consisted of 75 virtual elements, evenly pre-partitioned into three 25-element parts denoted X 1 , X2 and X3 (in our formalism, for each clustered element x, p(w|x) = 1 holds for either w = 1, 2, or 3). On top of this pre-partition, we partitioned the data twice, getting two (exhaustive) clustering configurations: 1. Target cross-W clustering: five clusters, each with representatives from all X w's; 2. Masking within-w clustering: six clusters, each consisting of roughly half the elements of either X 1, X 2 or X3 with no representatives from the other X w's. Each cluster, of both configurations, was characterized by a designated subset of features. Masking clusters were designed to be more salient than target clusters: they had more designated features (60 vs. 48 per cluster, i.e., 360 vs. 240 in total) and their elements shared higher feature-element (virtual) co-occurrence counts with those designated features (900 vs. 450 per element-feature pair). Noise (random positive integer < 200) was added to all counts associating elements with their designated features (for both within-w and cross-W clusters), as well as to roughly quarter of the zero counts associating elements with the rest of the features. The plain IB method consistently produced configurations strongly correlated with the masking clustering, while the CP algorithm revealed the target configuration. We got (see Table 1A) almost perfect results in configurations of nearly equal-sized cross-W clusters, and somewhat less perfect reconstruction in configurations of diverging sizes (6, 9, 15, 21 and 24). Performance level was measured relatively to optimal target-output cluster match by the proportion of elements correctly assigned, where assignment of an element x follows its highest p(c|x). The results indicated were averaged over 200 runs. They were obtained for the optimal η, which was found to be higher in the diverging-sizes task. In the text-based task, the clustered elements – keywords – were automatically extracted from five distinct corpora addressing five religions: introductory web pages, online magazines, encyclopedic entries etc., all downloaded from the Internet. The clustered keyword set X was consequently pre-partitioned to disjoint subsets {X w} w∈W, one for each religion4 (|X w| ≈ 200 for each w). We conducted experiments simultaneously involving religion pairs as well as all five religions. We took the features Y to be a set of words that commonly occur within all five corpora (|Y| ≈ 7000). x–y co-occurrences were recorded within ±5-word sliding window truncated by sentence boundaries. η was fixed to a value (1.0) enabling the formation of 20 clusters in all settings. The obtained clusters revealed interesting cross religion themes (see Sec. 1). For instance, the cluster (one of nine) capturing the theme of sacred festivals: the three highest p(c/x) members within each religion were Full-moon, Ceremony, Celebration (Buddhism); Easter, Sunday, Christmas Table 1: Average correct assignment proportion scores for the synthetic task (A) and Jaccard-coefficient scores for the religion keyword classification task (B). A. Synthetic Data IB CP B. Religion Data IB Coupled Clustering [7] CP (cross-expert agreement on religion pairs .462±.232) equal-size clusters .305 .985 non-equal clusters .292 .827 4 religion pairs all five (one case) .200±.100 .220±.138 .407±.144 .104 ––––––– .167 A keyword x that appeared in the corpora of different religions was considered as a distinct element for each religion, so the Xw were kept disjointed. (Chrsitianity); Puja, Ceremony, Festival (Hinduism); Id-al-Fitr, Friday, Ramadan, (Islam); and Sukkoth, Shavuot, Rosh-Hodesh (Judaism). The closest cluster produced by the plain IB method was poorer by far, including Islamic Ramadan, and Id and Jewish Passover, Rosh-Hashanah and Sabbath (which our method ranked high too), but no single related term from the other religions. Our external evaluation standards were cross-religion keyword classes constructed manually by experts of comparative religion studies. One such expert classification involved all five religions, and eight classifications addressed religions in pairs. Each of the eight religion-pair classifications was contributed by two independent experts using the same keywords, so we could also assess the agreement between experts. As an overlap measure we employed the Jaccard coefficient: the number of element pairs co-assigned together by both one of the evaluated clusters and one of the expert classes, divided by the number of pairs co-assigned by either our clusters or the expert (or both). We did not assume the number of expert classes is known in advance (as done in the synthetic experiments), so the results were averaged over all configurations of 2–16 cluster hierarchy, for each experiment. The results shown in Table 1B – clear improvement relatively to plain IB and the distance-based coupled clustering [7] – are, however, persistent when the number of clusters is taken to be equal to the number of classes, or if only the best score in hierarchy is considered. The level of cross-expert agreement indicates that our results are reasonably close to the scores expected in such subjective task. 5 C o mp a ri so n t o R e la t ed W o r k The information bottleneck framework served as the basis for several approaches that represent additional information in their clustering setting. The multivariate information bottleneck (MIB) adapts the IB framework for networks of multiple variables [3]. However, all variables in such networks are either compressed (like X), or predicted (like Y). The incorporation of an empirical variable to be masked or defocused in the sense of our W is not possible. Including such variables in the MIB framework might be explored in future work. Particularly relevant to our work is the IB-based method for extracting relevant constructs with side information [4]. This approach addresses settings in which two different types of features are distinguished explicitly: relevant versus irrelevant ones, denoted by Y+ and Y−. Both types of features are incorporated within a single functional to be minimized: L(IB-side-info) = I(C; X) − β ( I(C; Y +) − γ I(C; Y−) ), which directly drives clustering to de-correlate C and Y−. Formally, our setting can be mapped to the side information setting by regarding the pre-partition W simply as the additional set of irrelevant features, giving symmetric (and opposite) roles to W and Y. However, it seems that this view does not address properly the desired cross-partition setting. In our setting, it is assumed that clustering should be guided in general by Y, while W should only neutralize particular information within Y that would otherwise yield the undesired correlation between C and W (as described in Section 3.1). For that reason, the defocusing functional tie the three variables together by conditioning the de-correlation of C and W on Y, while its underlying assumption ensures the global de-correlation. Indeed, our method was found empirically superior on the cross-dataset task. The side-information IB method (the iterative algorithm with best scoring γ) achieves correct assignment proportion of 0.52 in both synthetic tasks, where our method scored 0.99 and 0.83 (see Table 1A) and, in the religion-pair keyword classification task, Jaccard coefficient improved by 20% relatively to plain IB (compared to our 100% improvement, see Table 1B). 6 C o n c lu si o n s This paper addressed the problem of clustering a pre-partitioned dataset, aiming to detect new internal structures that are not correlated with the pre-given partition but rather cut across its components. The proposed framework extends the cross-dataset clustering algorithm [8], providing better formal grounding and representing any pre-given (soft) partition of the dataset. Supported by empirical evidence, we suggest that our framework is better suited for the cross-partition task than applying the side-information framework [4], which was originally developed to address a somewhat different setting. We also demonstrate substantial empirical advantage over the distance-based coupled-clustering algorithm [7]. As an applied real-world goal, the algorithm successfully detects cross-religion commonalities. This goal exemplifies the more general notion of detecting analogies across different systems, which is a somewhat vague and non-consensual task and therefore especially challenging for a computational framework. Our approach can be viewed as an initial step towards principled identification of “hidden” commonalities between substantially different real world systems, while suppressing the vast majority of attributes that are irrelevant for the analogy. Further research may study the role of defocusing in supervised learning, where some pre-given partitions might mask the role of underlying discriminative features. Additionally, it would be interesting to explore relationships to other disciplines, e.g., network information theory ([9], Ch. 14) which provided motivation for the side-information approach. Finally, both frameworks (ours and side-information) suggest the importance of dealing wisely with information that should not dictate the clustering output directly. A c k n ow l e d g me n t s We thank Yuval Krymolowski for helpful discussions and Tiina Mahlamäki, Eitan Reich and William Shepard, for contributing the religion keyword classifications. References [1] Hofmann, T. (2001) Unsupervised learning by probabilistic latent semantic analysis. Journal of Machine Learning Research, 41(1):177-196. [2] Wagstaff K., Cardie C., Rogers S. and Schroedl S., 2001. Constrained K-Means clustering with background knowledge. The 18th International Conference on Machine Learning (ICML-2001), pp 577-584. [3] Friedman N., Mosenzon O., Slonim N. & Tishby N. (2002) Multivariate information bottleneck. The 17th conference on Uncertainty in Artificial Intelligence (UAI-17), pp. 152161. [4] Chechik G. & Tishby N. (2002) Extracting relevant structures with side information. Advances in Neural Processing Information Systems 15 (NIPS'02). [5] Globerson, A., Chechik G. & Tishby N. (2003) Sufficient dimensionality reduction. Journal of Machine Learning Research, 3:1307-1331. [6] Tishby, N., Pereira, F. C. & Bialek, W. (1999) The information bottleneck method. The 37th Annual Allerton Conference on Communication, Control, and Computing, pp. 368-379. [7] Marx, Z., Dagan, I., Buhmann, J. M. & Shamir E. (2002) Coupled clustering: A method for detecting structural correspondence. Journal of Machine Learning Research, 3:747-780. [8] Dagan, I., Marx, Z. & Shamir E (2002) Cross-dataset clustering: Revealing corresponding themes across multiple corpora. Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pp. 15-21. [9] Cover T. M. & Thomas J. A. (1991) Elements of Information Theory. Sons, Inc., New York, New York. John Wiley &

5 0.71049207 82 nips-2003-Geometric Clustering Using the Information Bottleneck Method

Author: Susanne Still, William Bialek, Léon Bottou

Abstract: We argue that K–means and deterministic annealing algorithms for geometric clustering can be derived from the more general Information Bottleneck approach. If we cluster the identities of data points to preserve information about their location, the set of optimal solutions is massively degenerate. But if we treat the equations that define the optimal solution as an iterative algorithm, then a set of “smooth” initial conditions selects solutions with the desired geometrical properties. In addition to conceptual unification, we argue that this approach can be more efficient and robust than classic algorithms. 1

6 0.66762125 107 nips-2003-Learning Spectral Clustering

7 0.62712789 24 nips-2003-An Iterative Improvement Procedure for Hierarchical Clustering

8 0.48880666 152 nips-2003-Pairwise Clustering and Graphical Models

9 0.44725403 79 nips-2003-Gene Expression Clustering with Functional Mixture Models

10 0.41804674 40 nips-2003-Bias-Corrected Bootstrap and Model Uncertainty

11 0.39316681 63 nips-2003-Error Bounds for Transductive Learning via Compression and Clustering

12 0.36296842 113 nips-2003-Learning with Local and Global Consistency

13 0.35815004 173 nips-2003-Semi-supervised Protein Classification Using Cluster Kernels

14 0.34902132 86 nips-2003-ICA-based Clustering of Genes from Microarray Expression Data

15 0.34464848 6 nips-2003-A Fast Multi-Resolution Method for Detection of Significant Spatial Disease Clusters

16 0.33575356 172 nips-2003-Semi-Supervised Learning with Trees

17 0.33236799 181 nips-2003-Statistical Debugging of Sampled Programs

18 0.33095038 48 nips-2003-Convex Methods for Transduction

19 0.33084753 92 nips-2003-Information Bottleneck for Gaussian Variables

20 0.3280665 126 nips-2003-Measure Based Regularization


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.025), (11, 0.011), (30, 0.015), (35, 0.027), (53, 0.629), (71, 0.042), (76, 0.025), (85, 0.053), (91, 0.078)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99453729 166 nips-2003-Reconstructing MEG Sources with Unknown Correlations

Author: Maneesh Sahani, Srikantan S. Nagarajan

Abstract: Existing source location and recovery algorithms used in magnetoencephalographic imaging generally assume that the source activity at different brain locations is independent or that the correlation structure is known. However, electrophysiological recordings of local field potentials show strong correlations in aggregate activity over significant distances. Indeed, it seems very likely that stimulus-evoked activity would follow strongly correlated time-courses in different brain areas. Here, we present, and validate through simulations, a new approach to source reconstruction in which the correlation between sources is modelled and estimated explicitly by variational Bayesian methods, facilitating accurate recovery of source locations and the time-courses of their activation. 1

same-paper 2 0.99319988 111 nips-2003-Learning the k in k-means

Author: Greg Hamerly, Charles Elkan

Abstract: When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. In this paper we present an improved algorithm for learning k while clustering. The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian. Two key advantages are that the hypothesis test does not limit the covariance of the data and does not compute a full covariance matrix. Additionally, G-means only requires one intuitive parameter, the standard statistical significance level α. We present results from experiments showing that the algorithm works well, and better than a recent method based on the BIC penalty for model complexity. In these experiments, we show that the BIC is ineffective as a scoring function, since it does not penalize strongly enough the model’s complexity. 1 Introduction and related work Clustering algorithms are useful tools for data mining, compression, probability density estimation, and many other important tasks. However, most clustering algorithms require the user to specify the number of clusters (called k), and it is not always clear what is the best value for k. Figure 1 shows examples where k has been improperly chosen. Choosing k is often an ad hoc decision based on prior knowledge, assumptions, and practical experience. Choosing k is made more difficult when the data has many dimensions, even when clusters are well-separated. Center-based clustering algorithms (in particular k-means and Gaussian expectationmaximization) usually assume that each cluster adheres to a unimodal distribution, such as Gaussian. With these methods, only one center should be used to model each subset of data that follows a unimodal distribution. If multiple centers are used to describe data drawn from one mode, the centers are a needlessly complex description of the data, and in fact the multiple centers capture the truth about the subset less well than one center. In this paper we present a simple algorithm called G-means that discovers an appropriate k using a statistical test for deciding whether to split a k-means center into two centers. We describe examples and present experimental results that show that the new algorithm 0.9 4 0.8 3 0.7 2 0.6 1 0.5 0 0.4 −1 0.3 −2 −3 0.2 0.1 −0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 −4 −3 −2 −1 0 1 2 3 Figure 1: Two clusterings where k was improperly chosen. Dark crosses are k-means centers. On the left, there are too few centers; five should be used. On the right, too many centers are used; one center is sufficient for representing the data. In general, one center should be used to represent one Gaussian cluster. is successful. This technique is useful and applicable for many clustering algorithms other than k-means, but here we consider only the k-means algorithm for simplicity. Several algorithms have been proposed previously to determine k automatically. Like our method, most previous methods are wrappers around k-means or some other clustering algorithm for fixed k. Wrapper methods use splitting and/or merging rules for centers to increase or decrease k as the algorithm proceeds. Pelleg and Moore [14] proposed a regularization framework for learning k, which they call X-means. The algorithm searches over many values of k and scores each clustering model using the so-called Bayesian Information Criterion [10]: BIC(C|X) = L(X|C) − p log n 2 where L(X|C) is the log-likelihood of the dataset X according to model C, p = k(d + 1) is the number of parameters in the model C with dimensionality d and k cluster centers, and n is the number of points in the dataset. X-means chooses the model with the best BIC score on the data. Aside from the BIC, other scoring functions are also available. Bischof et al. [1] use a minimum description length (MDL) framework, where the description length is a measure of how well the data are fit by the model. Their algorithm starts with a large value for k and removes centers (reduces k) whenever that choice reduces the description length. Between steps of reducing k, they use the k-means algorithm to optimize the model fit to the data. With hierarchical clustering algorithms, other methods may be employed to determine the best number of clusters. One is to build a merging tree (“dendrogram”) of the data based on a cluster distance metric, and search for areas of the tree that are stable with respect to inter- and intra-cluster distances [9, Section 5.1]. This method of estimating k is best applied with domain-specific knowledge and human intuition. 2 The Gaussian-means (G-means) algorithm The G-means algorithm starts with a small number of k-means centers, and grows the number of centers. Each iteration of the algorithm splits into two those centers whose data appear not to come from a Gaussian distribution. Between each round of splitting, we run k-means on the entire dataset and all the centers to refine the current solution. We can initialize with just k = 1, or we can choose some larger value of k if we have some prior knowledge about the range of k. G-means repeatedly makes decisions based on a statistical test for the data assigned to each center. If the data currently assigned to a k-means center appear to be Gaussian, then we want to represent that data with only one center. However, if the same data do not appear Algorithm 1 G-means(X, α) 1: Let C be the initial set of centers (usually C ← {¯}). x 2: C ← kmeans(C, X). 3: Let {xi |class(xi ) = j} be the set of datapoints assigned to center cj . 4: Use a statistical test to detect if each {xi |class(xi ) = j} follow a Gaussian distribution (at confidence level α). 5: If the data look Gaussian, keep cj . Otherwise replace cj with two centers. 6: Repeat from step 2 until no more centers are added. to be Gaussian, then we want to use multiple centers to model the data properly. The algorithm will run k-means multiple times (up to k times when finding k centers), so the time complexity is at most O(k) times that of k-means. The k-means algorithm implicitly assumes that the datapoints in each cluster are spherically distributed around the center. Less restrictively, the Gaussian expectation-maximization algorithm assumes that the datapoints in each cluster have a multidimensional Gaussian distribution with a covariance matrix that may or may not be fixed, or shared. The Gaussian distribution test that we present below are valid for either covariance matrix assumption. The test also accounts for the number of datapoints n tested by incorporating n in the calculation of the critical value of the test (see Equation 2). This prevents the G-means algorithm from making bad decisions about clusters with few datapoints. 2.1 Testing clusters for Gaussian fit To specify the G-means algorithm fully we need a test to detect whether the data assigned to a center are sampled from a Gaussian. The alternative hypotheses are • H0 : The data around the center are sampled from a Gaussian. • H1 : The data around the center are not sampled from a Gaussian. If we accept the null hypothesis H0 , then we believe that the one center is sufficient to model its data, and we should not split the cluster into two sub-clusters. If we reject H0 and accept H1 , then we want to split the cluster. The test we use is based on the Anderson-Darling statistic. This one-dimensional test has been shown empirically to be the most powerful normality test that is based on the empirical cumulative distribution function (ECDF). Given a list of values xi that have been converted to mean 0 and variance 1, let x(i) be the ith ordered value. Let zi = F (x(i) ), where F is the N (0, 1) cumulative distribution function. Then the statistic is A2 (Z) = − 1 n n (2i − 1) [log(zi ) + log(1 − zn+1−i )] − n (1) i=1 Stephens [17] showed that for the case where µ and σ are estimated from the data (as in clustering), we must correct the statistic according to A2 (Z) ∗ = A2 (Z)(1 + 4/n − 25/(n2 )) (2) Given a subset of data X in d dimensions that belongs to center c, the hypothesis test proceeds as follows: 1. Choose a significance level α for the test. 2. Initialize two centers, called “children” of c. See the text for good ways to do this. 3. Run k-means on these two centers in X. This can be run to completion, or to some early stopping point if desired. Let c1 , c2 be the child centers chosen by k-means. 4. Let v = c1 − c2 be a d-dimensional vector that connects the two centers. This is the direction that k-means believes to be important for clustering. Then project X onto v: xi = xi , v /||v||2 . X is a 1-dimensional representation of the data projected onto v. Transform X so that it has mean 0 and variance 1. 5. Let zi = F (x(i) ). If A2 (Z) is in the range of non-critical values at confidence ∗ level α, then accept H0 , keep the original center, and discard {c1 , c2 }. Otherwise, reject H0 and keep {c1 , c2 } in place of the original center. A primary contribution of this work is simplifying the test for Gaussian fit by projecting the data to one dimension where the test is simple to apply. The authors of [5] also use this approach for online dimensionality reduction during clustering. The one-dimensional representation of the data allows us to consider only the data along the direction that kmeans has found to be important for separating the data. This is related to the problem of projection pursuit [7], where here k-means searches for a direction in which the data appears non-Gaussian. We must choose the significance level of the test, α, which is the desired probability of making a Type I error (i.e. incorrectly rejecting H0 ). It is appropriate to use a Bonferroni adjustment to reduce the chance of making Type I errors over multiple tests. For example, if we want a 0.01 chance of making a Type I error in 100 tests, we should apply a Bonferroni adjustment to make each test use α = 0.01/100 = 0.0001. To find k final centers the G-means algorithm makes k statistical tests, so the Bonferroni correction does not need to be extreme. In our tests, we always use α = 0.0001. We consider two ways to initialize the two child centers. Both approaches initialize with c ± m, where c is a center and m is chosen. The first method chooses m as a random d-dimensional vector such that ||m|| is small compared to the distortion of the data. A second method finds the main principal component s of the data (having eigenvalue λ), and chooses m = s 2λ/π. This deterministic method places the two centers in their expected locations under H0 . The principal component calculations require O(nd2 + d3 ) time and O(d2 ) space, but since we only want the main principal component, we can use fast methods like the power method, which takes time that is at most linear in the ratio of the two largest eigenvalues [4]. In this paper we use principal-component-based splitting. 2.2 An example Figure 2 shows a run of the G-means algorithm on a synthetic dataset with two true clusters and 1000 points, using α = 0.0001. The critical value for the Anderson-Darling test is 1.8692 for this confidence level. Starting with one center, after one iteration of G-means, we have 2 centers and the A2 statistic is 38.103. This is much larger than the critical value, ∗ so we reject H0 and accept this split. On the next iteration, we split each new center and repeat the statistical test. The A2 values for the two splits are 0.386 and 0.496, both of ∗ which are well below the critical value. Therefore we accept H0 for both tests, and discard these splits. Thus G-means gives a final answer of k = 2. 2.3 Statistical power Figure 3 shows the power of the Anderson-Darling test, as compared to the BIC. Lower is better for both plots. We run 1000 tests for each data point plotted for both plots. In the left 14 14 14 13 13 13 12 12 12 11 11 11 10 10 10 9 9 9 8 8 8 7 7 7 6 6 6 5 5 4 4 0 2 4 6 8 10 12 5 4 0 2 4 6 8 10 12 0 2 4 6 8 10 12 Figure 2: An example of running G-means for three iterations on a 2-dimensional dataset with two true clusters and 1000 points. Starting with one center (left plot), G-means splits into two centers (middle). The test for normality is significant, so G-means rejects H0 and keeps the split. After splitting each center again (right), the test values are not significant, so G-means accepts H0 for both tests and does not accept these splits. The middle plot is the G-means answer. See the text for further details. 1 1 G-means X-means 0.8 P(Type II error) P(Type I error) 0.8 G-means X-means 0.6 0.4 0.2 0.6 0.4 0.2 0 0 0 30 60 90 120 150 number of datapoints 180 210 0 30 60 90 120 150 number of datapoints 180 210 Figure 3: A comparison of the power of the Anderson-Darling test versus the BIC. For the AD test we fix the significance level (α = 0.0001), while the BIC’s significance level depends on n. The left plot shows the probability of incorrectly splitting (Type I error) one true 2-d cluster that is 5% elliptical. The right plot shows the probability of incorrectly not splitting two true clusters separated by 5σ (Type II error). Both plots are functions of n. Both plots show that the BIC overfits (splits clusters) when n is small. plot, for each test we generate n datapoints from a single true Gaussian distribution, and then plot the frequency with which BIC and G-means will choose k = 2 rather than k = 1 (i.e. commit a Type I error). BIC tends to overfit by choosing too many centers when the data is not strictly spherical, while G-means does not. This is consistent with the tests of real-world data in the next section. While G-means commits more Type II errors when n is small, this prevents it from overfitting the data. The BIC can be considered a likelihood ratio test, but with a significance level that cannot be fixed. The significance level instead varies depending on n and ∆k (the change in the number of model parameters between two models). As n or ∆k decrease, the significance level increases (the BIC becomes weaker as a statistical test) [10]. Figure 3 shows this effect for varying n. In [11] the authors show that penalty-based methods require problemspecific tuning and don’t generalize as well as other methods, such as cross validation. 3 Experiments Table 1 shows the results from running G-means and X-means on many large synthetic. On synthetic datasets with spherically distributed clusters, G-means and X-means do equally Table 1: Results for many synthetic datasets. We report distortion relative to the optimum distortion for the correct clustering (closer to one is better), and time is reported relative to k-means run with the correct k. For BIC, larger values are better, but it is clear that finding the correct clustering does not always coincide with finding a larger BIC. Items with a star are where X-means always chose the largest number of centers we allowed. dataset synthetic k=5 synthetic k=20 synthetic k=80 synthetic k=5 synthetic k=20 synthetic k=80 synthetic k=5 synthetic k=20 synthetic k=80 d 2 k found 9.1± 9.9 18.1± 3.2 20.1± 0.6 70.5±11.6 80.0± 0.2 171.7±23.7 5.0± 0.0 *20.0± 0.0 20.0± 0.1 *80.0± 0.0 80.2± 0.5 229.2±36.8 5.0± 0.0 *20.0± 0.0 20.0± 0.0 *80.0± 0.0 80.0± 0.0 171.5±10.9 method G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means 2 2 8 8 8 32 32 32 BIC(×104 ) -0.19±2.70 0.70±0.93 0.21±0.18 14.83±3.50 1.84±0.12 40.16±6.59 -0.74±0.16 -2.28±0.20 -0.18±0.17 14.36±0.21 1.45±0.20 52.28±9.26 -3.36±0.21 -27.92±0.22 -2.73±0.22 -11.13±0.23 -1.10±0.16 11.78±2.74 distortion(× optimal) 0.89± 0.23 0.37± 0.12 0.99± 0.01 9.45±28.02 1.00± 0.01 48.49±70.04 1.00± 0.00 0.47± 0.03 0.99± 0.00 0.47± 0.01 0.99± 0.00 0.57± 0.06 1.00± 0.00 0.76± 0.00 1.00± 0.00 0.76± 0.01 1.00± 0.00 0.84± 0.01 7 7 6 6 5 5 4 4 3 3 2 2 1 time(× k-means) 13.2 2.8 2.1 1.2 2.2 1.8 4.6 11.0 2.6 4.0 2.9 6.5 4.4 29.9 2.3 21.2 2.8 53.3 1 0 0 2 4 6 8 10 12 0 0 2 4 6 8 10 12 Figure 4: 2-d synthetic dataset with 5 true clusters. On the left, G-means correctly chooses 5 centers and deals well with non-spherical data. On the right, the BIC causes X-means to overfit the data, choosing 20 unevenly distributed clusters. well at finding the correct k and maximizing the BIC statistic, so we don’t show these results here. Most real-world data is not spherical, however. The synthetic datasets used here each have 5000 datapoints in d = 2/8/32 dimensions. The true ks are 5, 20, and 80. For each synthetic dataset type, we generate 30 datasets with the true center means chosen uniformly randomly from the unit hypercube, and choosing σ so that no two clusters are closer than 3σ apart. Each cluster is also given a transformation to make it non-spherical, by multiplying the data by a randomly chosen scaling and rotation matrix. We run G-means starting with one center. We allow X-means to search between 2 and 4k centers (where here k is the true number of clusters). The G-means algorithm clearly does better at finding the correct k on non-spherical data. Its results are closer to the true distortions and the correct ks. The BIC statistic that X-means uses has been formulated to maximize the likelihood for spherically-distributed data. Thus it overestimates the number of true clusters in non-spherical data. This is especially evident when the number of points per cluster is small, as in datasets with 80 true clusters. 1 2 2 3 3 4 4 Digit 0 1 Digit 0 5 5 6 6 7 7 8 8 9 9 5 10 15 20 25 30 Cluster 10 20 30 40 50 60 Cluster Figure 5: NIST and Pendigits datasets: correspondence between each digit (row) and each cluster (column) found by G-means. G-means did not have the labels, yet it found meaningful clusters corresponding with the labels. Because of this overestimation, X-means often hits our limit of 4k centers. Figure 4 shows an example of overfitting on a dataset with 5 true clusters. X-means chooses k = 20 while G-means finds all 5 true cluster centers. Also of note is that X-means does not distribute centers evenly among clusters; some clusters receive one center, but others receive many. G-means runs faster than X-means for 8 and 32 dimensions, which we expect, since the kd-tree structures which make X-means fast in low dimensions take time exponential in d, making them slow for more than 8 to 12 dimensions. All our code is written in Matlab; X-means is written in C. 3.1 Discovering true clusters in labeled data We tested these algorithms on two real-world datasets for handwritten digit recognition: the NIST dataset [12] and the Pendigits dataset [2]. The goal is to cluster the data without knowledge of the labels and measure how well the clustering captures the true labels. Both datasets have 10 true classes (digits 0-9). NIST has 60000 training examples and 784 dimensions (28×28 pixels). We use 6000 randomly chosen examples and we reduce the dimension to 50 by random projection (following [3]). The Pendigits dataset has 7984 examples and 16 dimensions; we did not change the data in any way. We cluster each dataset with G-means and X-means, and measure performance by comparing the cluster labels Lc with the true labels Lt . We define the partition quality (PQ) as kt kc kt 2 2 pq = i=1 j=1 p(i, j) i=1 p(i) where kt is the true number of classes, and kc is the number of clusters found by the algorithm. This metric is maximized when Lc induces the same partition of the data as Lt ; in other words, when all points in each cluster have the same true label, and the estimated k is the true k. The p(i, j) term is the frequency-based probability that a datapoint will be labeled i by Lt and j by Lc . This quality is normalized by the sum of true probabilities, squared. This statistic is related to the Rand statistic for comparing partitions [8]. For the NIST dataset, G-means finds 31 clusters in 30 seconds with a PQ score of 0.177. X-means finds 715 clusters in 4149 seconds, and 369 of these clusters contain only one point, indicating an overestimation problem with the BIC. X-means receives a PQ score of 0.024. For the Pendigits dataset, G-means finds 69 clusters in 30 seconds, with a PQ score of 0.196; X-means finds 235 clusters in 287 seconds, with a PQ score of 0.057. Figure 5 shows Hinton diagrams of the G-means clusterings of both datasets, showing that G-means succeeds at identifying the true clusters concisely, without aid of the labels. The confusions between different digits in the NIST dataset (seen in the off-diagonal elements) are common for other researchers using more sophisticated techniques, see [3]. 4 Discussion and conclusions We have introduced the new G-means algorithm for learning k based on a statistical test for determining whether datapoints are a random sample from a Gaussian distribution with arbitrary dimension and covariance matrix. The splitting uses dimension reduction and a powerful test for Gaussian fitness. G-means uses this statistical test as a wrapper around k-means to discover the number of clusters automatically. The only parameter supplied to the algorithm is the significance level of the statistical test, which can easily be set in a standard way. The G-means algorithm takes linear time and space (plus the cost of the splitting heuristic and test) in the number of datapoints and dimension, since k-means is itself linear in time and space. Empirically, the G-means algorithm works well at finding the correct number of clusters and the locations of genuine cluster centers, and we have shown it works well in moderately high dimensions. Clustering in high dimensions has been an open problem for many years. Recent research has shown that it may be preferable to use dimensionality reduction techniques before clustering, and then use a low-dimensional clustering algorithm such as k-means, rather than clustering in the high dimension directly. In [3] the author shows that using a simple, inexpensive linear projection preserves many of the properties of data (such as cluster distances), while making it easier to find the clusters. Thus there is a need for good-quality, fast clustering algorithms for low-dimensional data. Our work is a step in this direction. Additionally, recent image segmentation algorithms such as normalized cut [16, 13] are based on eigenvector computations on distance matrices. These “spectral” clustering algorithms still use k-means as a post-processing step to find the actual segmentation and they require k to be specified. Thus we expect G-means will be useful in combination with spectral clustering. References [1] Horst Bischof, Aleˇ Leonardis, and Alexander Selb. MDL principle for robust vector quantisation. Pattern analysis and applications, 2:59–72, s 1999. [2] C.L. Blake and C.J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/∼mlearn/MLRepository.html. [3] Sanjoy Dasgupta. Experiments with random projection. In Uncertainty in Artificial Intelligence: Proceedings of the Sixteenth Conference (UAI-2000), pages 143–151, San Francisco, CA, 2000. Morgan Kaufmann Publishers. [4] Gianna M. Del Corso. Estimating an eigenvector by the power method with a random start. SIAM Journal on Matrix Analysis and Applications, 18(4):913–937, 1997. [5] Chris Ding, Xiaofeng He, Hongyuan Zha, and Horst Simon. Adaptive dimension reduction for clustering high dimensional data. In Proceedings of the 2nd IEEE International Conference on Data Mining, 2002. [6] Fredrik Farnstrom, James Lewis, and Charles Elkan. Scalability for clustering algorithms revisited. SIGKDD Explorations, 2(1):51–57, 2000. [7] Peter J. Huber. Projection pursuit. Annals of Statistics, 13(2):435–475, June 1985. [8] L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2:193–218, 1985. [9] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264–323, 1999. [10] Robert E. Kass and Larry Wasserman. A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90(431):928–934, 1995. [11] Michael J. Kearns, Yishay Mansour, Andrew Y. Ng, and Dana Ron. An experimental and theoretical comparison of model selection methods. In Computational Learing Theory (COLT), pages 21–30, 1995. [12] Yann LeCun, L´ on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the e IEEE, 86(11):2278–2324, 1998. [13] Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. Neural Information Processing Systems, 14, 2002. [14] Dan Pelleg and Andrew Moore. X-means: Extending K-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conf. on Machine Learning, pages 727–734. Morgan Kaufmann, San Francisco, CA, 2000. [15] Peter Sand and Andrew Moore. Repairing faulty mixture models using density estimation. In Proceedings of the 18th International Conf. on Machine Learning. Morgan Kaufmann, San Francisco, CA, 2001. [16] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000. [17] M. A. Stephens. EDF statistics for goodness of fit and some comparisons. American Statistical Association, 69(347):730–737, September 1974.

3 0.98768091 45 nips-2003-Circuit Optimization Predicts Dynamic Networks for Chemosensory Orientation in Nematode C. elegans

Author: Nathan A. Dunn, John S. Conery, Shawn R. Lockery

Abstract: The connectivity of the nervous system of the nematode Caenorhabditis elegans has been described completely, but the analysis of the neuronal basis of behavior in this system is just beginning. Here, we used an optimization algorithm to search for patterns of connectivity sufficient to compute the sensorimotor transformation underlying C. elegans chemotaxis, a simple form of spatial orientation behavior in which turning probability is modulated by the rate of change of chemical concentration. Optimization produced differentiator networks with inhibitory feedback among all neurons. Further analysis showed that feedback regulates the latency between sensory input and behavior. Common patterns of connectivity between the model and biological networks suggest new functions for previously identified connections in the C. elegans nervous system. 1

4 0.9858942 92 nips-2003-Information Bottleneck for Gaussian Variables

Author: Gal Chechik, Amir Globerson, Naftali Tishby, Yair Weiss

Abstract: The problem of extracting the relevant aspects of data was addressed through the information bottleneck (IB) method, by (soft) clustering one variable while preserving information about another - relevance - variable. An interesting question addressed in the current work is the extension of these ideas to obtain continuous representations that preserve relevant information, rather than discrete clusters. We give a formal definition of the general continuous IB problem and obtain an analytic solution for the optimal representation for the important case of multivariate Gaussian variables. The obtained optimal representation is a noisy linear projection to eigenvectors of the normalized correlation matrix Σx|y Σ−1 , which x is also the basis obtained in Canonical Correlation Analysis. However, in Gaussian IB, the compression tradeoff parameter uniquely determines the dimension, as well as the scale of each eigenvector. This introduces a novel interpretation where solutions of different ranks lie on a continuum parametrized by the compression level. Our analysis also provides an analytic expression for the optimal tradeoff - the information curve - in terms of the eigenvalue spectrum. 1

5 0.98058075 59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion

Author: Haifeng Li, Tao Jiang, Keshu Zhang

Abstract: A new feature extraction criterion, maximum margin criterion (MMC), is proposed in this paper. This new criterion is general in the sense that, when combined with a suitable constraint, it can actually give rise to the most popular feature extractor in the literature, linear discriminate analysis (LDA). We derive a new feature extractor based on MMC using a different constraint that does not depend on the nonsingularity of the within-class scatter matrix Sw . Such a dependence is a major drawback of LDA especially when the sample size is small. The kernelized (nonlinear) counterpart of this linear feature extractor is also established in this paper. Our preliminary experimental results on face images demonstrate that the new feature extractors are efficient and stable. 1

6 0.972422 90 nips-2003-Increase Information Transfer Rates in BCI by CSP Extension to Multi-class

7 0.86322474 149 nips-2003-Optimal Manifold Representation of Data: An Information Theoretic Approach

8 0.83838463 82 nips-2003-Geometric Clustering Using the Information Bottleneck Method

9 0.83561897 179 nips-2003-Sparse Representation and Its Applications in Blind Source Separation

10 0.80663037 89 nips-2003-Impact of an Energy Normalization Transform on the Performance of the LF-ASD Brain Computer Interface

11 0.80523026 66 nips-2003-Extreme Components Analysis

12 0.79641092 114 nips-2003-Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA

13 0.78269601 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data

14 0.77244627 93 nips-2003-Information Dynamics and Emergent Computation in Recurrent Circuits of Spiking Neurons

15 0.77063298 137 nips-2003-No Unbiased Estimator of the Variance of K-Fold Cross-Validation

16 0.76890701 86 nips-2003-ICA-based Clustering of Genes from Microarray Expression Data

17 0.76859093 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach

18 0.76606673 77 nips-2003-Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data

19 0.7594077 141 nips-2003-Nonstationary Covariance Functions for Gaussian Process Regression

20 0.75875843 185 nips-2003-The Doubly Balanced Network of Spiking Neurons: A Memory Model with High Capacity