nips nips2006 nips2006-158 knowledge-graph by maker-knowledge-mining

158 nips-2006-PG-means: learning the number of clusters in data


Source: pdf

Author: Yu Feng, Greg Hamerly

Abstract: We present a novel algorithm called PG-means which is able to learn the number of clusters in a classical Gaussian mixture model. Our method is robust and efficient; it uses statistical hypothesis tests on one-dimensional projections of the data and model to determine if the examples are well represented by the model. In so doing, we are applying a statistical test for the entire model at once, not just on a per-cluster basis. We show that our method works well in difficult cases such as non-Gaussian data, overlapping clusters, eccentric clusters, high dimension, and many true clusters. Further, our new method provides a much more stable estimate of the number of clusters than existing methods. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 PG-means: learning the number of clusters in data Yu Feng Greg Hamerly Computer Science Department Baylor University Waco, Texas 76798 {yu feng, greg hamerly}@baylor. [sent-1, score-0.432]

2 edu Abstract We present a novel algorithm called PG-means which is able to learn the number of clusters in a classical Gaussian mixture model. [sent-2, score-0.468]

3 Our method is robust and efficient; it uses statistical hypothesis tests on one-dimensional projections of the data and model to determine if the examples are well represented by the model. [sent-3, score-0.474]

4 We show that our method works well in difficult cases such as non-Gaussian data, overlapping clusters, eccentric clusters, high dimension, and many true clusters. [sent-5, score-0.347]

5 Further, our new method provides a much more stable estimate of the number of clusters than existing methods. [sent-6, score-0.398]

6 However, most of them require that the user know the number of clusters (k) beforehand, while an appropriate value for k is not always clear. [sent-9, score-0.367]

7 In this paper, we present an algorithm called PG-means (PG stands for projected Gaussian) which is able to discover an appropriate number of Gaussian clusters and their locations and orientations. [sent-12, score-0.482]

8 Our method is a wrapper around the standard and widely used Gaussian mixture model. [sent-13, score-0.201]

9 The paper’s primary contribution is a novel method of determining if a whole mixture model fits its data well, based on projections and statistical tests. [sent-14, score-0.358]

10 We show that the new approach works well not only in simple cases in which the clusters are well separated, but also in situations where the clusters are overlapping, eccentric, high-dimensional, or even non-Gaussian. [sent-15, score-0.775]

11 One drawback of the X-means algorithm is that the cluster covariances are all assumed to be spherical and of the same width. [sent-24, score-0.406]

12 G-means uses projection and a statistical test for the hypothesis that the data in a cluster come from a Gaussian distribution. [sent-27, score-0.537]

13 It applies a statistical test to each cluster and those which are not accepted as Gaussians are split into two clusters. [sent-29, score-0.383]

14 While this method does not assume spherical clusters and works well if the true clusters are well separated, it has difficulties when the true clusters overlap, since the hard assignment of k-means can clip data into subsets that look non-Gaussian. [sent-31, score-1.607]

15 [13] proposed the Gap statistic, which compares the likelihood of a learned model with the distribution of the likelihood of models trained on data drawn from a null distribution. [sent-38, score-0.209]

16 3 Methodology Our approach is called PG-means, where PG stands for projected Gaussian and refers to the fact that the method applies projections to the clustering model as well as the data, before performing each hypothesis test for model fitness. [sent-46, score-0.61]

17 PG-means uses the standard Gaussian mixture model with Expectation-Maximization training, but any underlying algorithm for training a Gaussian mixture might be used. [sent-47, score-0.288]

18 Each time EM learning converges, PG-means projects both the dataset and the learned model to one dimension, and then applies the Kolmogorov-Smirnov (KS) test to determine whether the projected model fits the projected data. [sent-50, score-0.662]

19 PG-means repeats this projection and test step several times for a single learned model. [sent-51, score-0.377]

20 If any test rejects the null hypothesis that the data follows the model’s distribution, then it adds one cluster and starts again with EM learning. [sent-52, score-0.578]

21 If every test accepts the null hypothesis for a given model, then the algorithm terminates. [sent-53, score-0.258]

22 When adding a new cluster, PG-means preserves the k clusters it has already learned and adds one more. [sent-55, score-0.733]

23 To find the best new model, PG-means runs EM 10 times each time it adds a cluster, each run using a different initial location for the new cluster. [sent-57, score-0.273]

24 The mean of each new cluster is chosen from a set of randomly selected examples as well as from points with low model-assigned probability density. [sent-58, score-0.191]

25 The initial covariance of the new cluster is based on the average of the existing clusters’ covariances, and the new cluster prior is assigned 1/k and all priors are re-normalized. [sent-59, score-0.382]
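
As a rough illustration of this cluster-addition step, the sketch below assumes a fitted scikit-learn-style GaussianMixture exposing weights_, means_, covariances_, and score_samples; the candidate pool sizes and the exact selection rule are my assumptions, not the paper's precise procedure.

```python
import numpy as np

def propose_new_cluster(gmm, X, rng):
    """Pick an initial mean for the (k+1)-th cluster from a random example or a
    low-density point, give it the average existing covariance, and re-normalize
    the mixing weights (sketch of the initialization described above)."""
    # Candidate means: the lowest-density points plus a few random examples.
    log_dens = gmm.score_samples(X)                 # model-assigned log density
    low_density = X[np.argsort(log_dens)[:10]]
    random_pick = X[rng.choice(len(X), size=10, replace=False)]
    candidates = np.vstack([low_density, random_pick])
    new_mean = candidates[rng.integers(len(candidates))]

    new_cov = gmm.covariances_.mean(axis=0)         # average of existing covariances
    k = len(gmm.weights_)
    new_weights = np.append(gmm.weights_, 1.0 / k)
    new_weights /= new_weights.sum()                # re-normalize the priors

    means = np.vstack([gmm.means_, new_mean])
    covs = np.vstack([gmm.covariances_, new_cov[None]])
    return new_weights, means, covs
```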

26 Initialize the cluster with the mean and covariance of X. [sent-63, score-0.191]

27 Use the KS test at significance level α to test if the projected model fits the projected dataset. [sent-68, score-0.447]

28 If the test rejects the null hypothesis, then break out of the loop. [sent-69, score-0.2]

29 end for. If any test rejected the null hypothesis, then: for i = 1 to 10 do [sent-70, score-0.225]

30 Initialize k + 1 clusters as the k previously learned plus one new cluster. [sent-73, score-0.46]

31 end for. Retain the model of k + 1 clusters with the best likelihood. [sent-75, score-0.408]

32 end if. When every test accepts the null hypothesis, stop and return the model. [sent-77, score-0.196]
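
Putting the pseudocode fragments above together, here is a minimal Python sketch of the control flow, assuming scikit-learn's GaussianMixture as the EM trainer and the helpers random_unit_vectors, ks_statistic, and mc_critical_value sketched later in this summary. The significance level and max_k are placeholders, and for brevity this version refits from scratch at each k instead of preserving the learned clusters and trying 10 restarts as the paper describes.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def pg_means(X, alpha=0.001, n_projections=12, max_k=50):
    """Grow k until every projected Kolmogorov-Smirnov test accepts the model."""
    for k in range(1, max_k + 1):
        gmm = GaussianMixture(n_components=k, covariance_type='full').fit(X)
        rejected = False
        for P in random_unit_vectors(X.shape[1], n_projections):
            x1d = X @ P                                   # project the data
            mu1d = gmm.means_ @ P                         # project the component means
            sd1d = np.sqrt(np.einsum('i,kij,j->k', P, gmm.covariances_, P))
            D = ks_statistic(x1d, gmm.weights_, mu1d, sd1d)
            crit = mc_critical_value(gmm.weights_, mu1d, sd1d,
                                     n=len(X), alpha=alpha)
            if D > crit:                                  # this projection rejects the model
                rejected = True
                break
        if not rejected:                                  # every test accepted: stop
            return gmm
    return gmm                                            # fallback if max_k is reached
```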

33 (3.1 Projection of the model and the dataset) PG-means is novel because it applies projection to the learned model as well as to the dataset prior to testing for model fitness. [sent-81, score-0.579]

34 First, a mixture of Gaussians remains a mixture of Gaussians after being linearly projected. [sent-83, score-0.202]

35 Second, there are many effective and efficient tests for model fitness in one dimension, but in higher dimensions such testing is more difficult. [sent-84, score-0.197]

36 We can project each cluster model to obtain a one-dimensional projection of an entire mixture along P . [sent-89, score-0.504]

37 Then we wish to test whether the projected model fits the projected data. [sent-90, score-0.359]
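
Concretely, projecting along a unit vector P sends each component N(mu, Sigma) to the univariate N(P.mu, P'Sigma P) with its mixing weight unchanged. The helper below is a minimal numpy sketch of this step (an illustration, not the authors' code).

```python
import numpy as np

def project_mixture(weights, means, covs, P):
    """Project a d-dimensional Gaussian mixture onto the direction P:
    each component N(mu, Sigma) becomes the univariate N(P.mu, P'Sigma P)."""
    P = P / np.linalg.norm(P)
    mu1d = means @ P                               # projected means, shape (k,)
    var1d = np.einsum('i,kij,j->k', P, covs, P)    # projected variances, shape (k,)
    return weights, mu1d, np.sqrt(var1d)

def project_data(X, P):
    """Project the n x d dataset onto the same direction."""
    return X @ (P / np.linalg.norm(P))
```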

38 The G-means and X-means algorithms both perform statistical tests for each cluster individually. [sent-91, score-0.33]

39 However, this approach is problematic when clusters overlap, since the hard assignment results in ‘clipped’ clusters, making them appear very non-Gaussian. [sent-93, score-0.404]

40 PG-means tests all clusters and all data at once. [sent-94, score-0.477]

41 Then if two true clusters overlap, the additive probability of the learned Gaussians representing those clusters will correctly model the increased density in the overlapping region. [sent-95, score-1.0]

42 (3.2 The Kolmogorov-Smirnov test and critical values) After projection, PG-means uses the univariate Kolmogorov-Smirnov [7] test for model fitness. [sent-97, score-0.394]

43 The KS test statistic is D = max_X |F(X) − S(X)|, the maximum absolute difference between the true CDF F(X) and the sample CDF S(X). [sent-98, score-0.246]
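
A sketch of computing D against the projected mixture's CDF using scipy's normal CDF; weights, mu1d, and sigma1d are the projected one-dimensional mixture parameters from the previous sketch (illustrative, not the authors' code).

```python
import numpy as np
from scipy.stats import norm

def ks_statistic(x1d, weights, mu1d, sigma1d):
    """D = max_x |F(x) - S(x)|: the projected-mixture CDF F against the empirical
    CDF S of the projected sample, checked on both sides of each step."""
    x = np.sort(np.asarray(x1d))
    n = len(x)
    # Mixture CDF evaluated at the sorted sample points.
    F = np.sum(weights * norm.cdf((x[:, None] - mu1d) / sigma1d), axis=1)
    S_hi = np.arange(1, n + 1) / n          # empirical CDF just after each point
    S_lo = np.arange(0, n) / n              # and just before it
    return max(np.max(np.abs(F - S_hi)), np.max(np.abs(F - S_lo)))
```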

44 Lilliefors [6] gave a table of smaller critical values for the KS test which correct for estimated parameters of a single univariate Gaussian. [sent-102, score-0.22]

45 Along this vein, we create our own test critical values for a mixture of univariate Gaussians. [sent-104, score-0.321]

46 To generate the critical values for the KS test statistic, we use the projected one-dimensional model that has been learned to generate many different datasets, and then measure the KS test statistic for each dataset. [sent-105, score-0.601]

47 Then we take the KS test statistic at the desired significance level α in this simulated distribution; this is the critical value we want. [sent-106, score-0.264]
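
A sketch of that Monte Carlo step, reusing ks_statistic above; the number of simulated datasets and the (1 − α) quantile convention are my assumptions rather than the paper's exact settings.

```python
import numpy as np

def mc_critical_value(weights, mu1d, sigma1d, n, alpha=0.001,
                      n_sims=3000, rng=None):
    """Draw many size-n samples from the learned projected 1-D mixture, compute
    the KS statistic for each, and return the (1 - alpha) quantile as the
    critical value tailored to this particular model."""
    rng = rng or np.random.default_rng()
    stats = np.empty(n_sims)
    for s in range(n_sims):
        comp = rng.choice(len(weights), size=n, p=weights)   # pick components
        sample = rng.normal(mu1d[comp], sigma1d[comp])       # then sample from them
        stats[s] = ks_statistic(sample, weights, mu1d, sigma1d)
    return np.quantile(stats, 1.0 - alpha)
```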

48 It is much more efficient than if we were to generate datasets from the full-dimensional data and project these to obtain the statistic distribution, yet the two approaches are equivalent. [sent-108, score-0.229]

49 (3.3 Number of projections) We wish to use a small but sufficient number of projections and tests to discover when a model does not fit data well. [sent-116, score-0.525]

50 However, a projection can cause the data from two or more true clusters to be collapsed together, so that the test cannot see that there should be multiple densities used to model them. [sent-118, score-0.719]

51 Therefore multiple projections are necessary to see these model and data discrepancies. [sent-119, score-0.228]

52 Other possible methods include using the leading directions from principal components analysis, which gives a stable set of vectors which can be re-used, or choosing k − 1 vectors that span the same subspace spanned by the k cluster centers. [sent-122, score-0.222]
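
For the default choice of random directions, a one-function sketch is given below (the PCA-based and center-spanning alternatives mentioned above would simply return different rows); this is the helper assumed in the main-loop sketch earlier.

```python
import numpy as np

def random_unit_vectors(d, p, rng=None):
    """Draw p directions uniformly on the unit sphere in R^d by normalizing
    standard Gaussian vectors."""
    rng = rng or np.random.default_rng()
    V = rng.standard_normal((p, d))
    return V / np.linalg.norm(V, axis=1, keepdims=True)
```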

53 Consider two cluster centers µ1 and µ2 in d dimensions and the vector which connects them, m = µ2 −µ1 . [sent-123, score-0.287]

54 We assume for simplicity that the two clusters have the same spherical covariance Σ and are c-separated, that is, ||m|| ≥ c·sqrt(trace(Σ)). [sent-124, score-0.542]

55 The probability that a random projection P is 'good', that is, that it maintains c-separation between the cluster means when projected, is Pr(|P^T m| ≥ c·sqrt(P^T Σ P)) > 1 − Erf(c·sqrt(d·P^T Σ P) / sqrt(2 c^2 trace(Σ))) = 1 − Erf(sqrt(1/2)), where Erf is the standard Gaussian error function. [sent-130, score-0.228]

56 If we perform p random projections, we wish the probability that all p projections are 'bad' to be less than some ε: Pr(p bad projections) = Erf(sqrt(1/2))^p < ε. [sent-135, score-0.216]

57 Therefore we need approximately p ≥ log(ε)/log(Erf(sqrt(1/2))) ≈ −2.6198 log(ε) projections to find a projection that keeps the two cluster means c-separated. [sent-136, score-0.537]
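
As a quick worked check of this bound (using natural logarithms and assuming ε = 0.01, which is consistent with the truncated error rate reported for the experiments below), we get roughly 12 projections, matching the 12 projections and tests per model used there.

```python
import math

eps = 0.01                              # assumed target probability that all projections are 'bad'
erf_half = math.erf(math.sqrt(0.5))     # Erf(sqrt(1/2)) ~ 0.6827
p = math.log(eps) / math.log(erf_half)  # ~ -2.6198 * ln(eps) ~ 12.07
print(round(p))                         # -> 12 projections
```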

58 Let K be the final learned number of clusters on n data points. [sent-142, score-0.46]

59 PG-means uses a fixed number of projections for each model and each projection is linear in n, d, and k; therefore the projections do not increase the algorithm’s asymptotic run time. [sent-145, score-0.627]

60 Note also that EM starts with k learned centers and one new randomly initialized center, so EM convergence is much faster in practice than if all k + 1 clusters were randomly initialized. [sent-146, score-0.553]

61 We must also factor in the cost of the Monte Carlo simulations for determining the KS test critical value, which are O(K d^2 n log(n)/α) for each simulation. [sent-147, score-0.223]

62 [Figure 2 plot residue: panels for Eccentricity=1 and Eccentricity=4, columns c=2, c=4, c=6; x-axis: dimension (0 to 20); y-axis: VI metric score.] [sent-152, score-0.23]

63 Figure 2: Each point represents the average VI metric comparing the learned clustering to the correct labels for various types of synthetic datasets. [sent-161, score-0.51]

64 For each model, PG-means uses 12 projections and tests, corresponding to an error rate of ε < 0. [sent-168, score-0.232]

65 Figure 1 shows the number of clusters found by running PG-means, G-means, X-means and BKM on many synthetic datasets. [sent-173, score-0.473]

66 [Figure 3 panel labels: PG-means, G-means, X-means.] Figure 3: The leftmost dataset has 10 true clusters with significant overlap (c = 1). [sent-175, score-0.624]

67 On the right are the results for PG-means, G-means, and X-means on a dataset with 5 true eccentric and overlapping clusters. [sent-177, score-0.41]

68 The centers of the clusters in each dataset are chosen randomly, and each cluster generates the same number of points. [sent-180, score-0.712]

69 Each Gaussian mixture dataset is specified by the average c-separation between each cluster center and its nearest neighbor (either 2, 4 or 6) and each cluster’s eccentricity (either 1 or 4). [sent-181, score-0.622]

70 The eccentricity is defined as Ecc = λmax/λmin, where λmax and λmin are the maximum and minimum eigenvalues of the cluster covariance. [sent-182, score-0.388]

71 We generate 10 datasets of each type and run PG-means, G-means and X-means on each; we run BKM on only one of them due to its long running time. [sent-184, score-0.214]

72 It is clear that PG-means performs better than G-means and X-means when the data are eccentric (Ecc=4), especially when the clusters overlap (c = 2). [sent-186, score-0.651]

73 Figure 1 only reports the learned number of clusters, which is not enough to measure the true quality of the learned models. [sent-192, score-0.256]

74 In order to better evaluate the approaches, we use Meila’s VI (Variation of Information) metric [8] to compare the induced clustering to the true labels. [sent-193, score-0.197]

75 It is zero when the two compared clusterings are identical (modulo clusters being relabeled). [sent-195, score-0.402]
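
For reference, Meila's VI can be computed from the joint distribution of the two labelings as VI(A, B) = H(A) + H(B) − 2 I(A, B); the sketch below is a short self-contained implementation of that formula, not the authors' code.

```python
import numpy as np

def variation_of_information(labels_a, labels_b):
    """VI(A, B) = H(A) + H(B) - 2 I(A, B); zero iff the two labelings agree
    up to a relabeling of the clusters."""
    a_vals, a = np.unique(labels_a, return_inverse=True)
    b_vals, b = np.unique(labels_b, return_inverse=True)
    joint = np.zeros((len(a_vals), len(b_vals)))
    np.add.at(joint, (a, b), 1)                 # contingency counts
    joint /= joint.sum()                        # joint distribution p(a, b)
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    h_a = -np.sum(pa[pa > 0] * np.log(pa[pa > 0]))
    h_b = -np.sum(pb[pb > 0] * np.log(pb[pb > 0]))
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / (pa[:, None] * pb[None, :])[nz]))
    return h_a + h_b - 2.0 * mi
```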

76 Figure 2 shows the average VI metric obtained by running PG-means, G-means, X-means, and BKM on the same synthetic datasets as in Figure 1. [sent-196, score-0.282]

77 However, the top-left plot shows that PG-means does not perform as well as G-means, X-means and BKM for spherical and overlapping data. [sent-198, score-0.237]

78 The reason is that when two spherical clusters overlap, they can look like a single eccentric cluster. [sent-199, score-0.746]

79 Since PG-means can capture eccentric clusters effectively, it will accept these two overlapped spherical clusters as one cluster. [sent-200, score-1.152]

80 Therefore, although PG-means gives fewer clusters for spherical and overlapping data, the models it learns are reasonable. [sent-202, score-0.604]

81 Figure 3 shows how 10 true overlapping clusters may look like far fewer clusters, and that PG-means can find an appropriate model with only 4 clusters. [sent-203, score-0.57]

82 These datasets are generated in the same way as our synthetic Gaussian datasets, except that each true cluster has a uniform distribution. [sent-208, score-0.427]

83 Each cluster is not necessarily axis-aligned or square; it is scaled for eccentricity and rotated. [sent-209, score-0.388]

84 The eccentricity and c-separation values for the datasets are both 4. [sent-211, score-0.289]

85 We run PG-means, G-means and X-means on 10 different datasets, and BKM on only a subset due to its running time. Table 1: Results for synthetic non-Gaussian data and the handwritten digits dataset. [sent-212, score-0.332]

86 Each non-Gaussian dataset contains 4000 points in 8 dimensions, sampled from 20 true clusters each having a uniform distribution. [sent-213, score-0.587]

87 [Table 1 residue; columns: Algorithm (PG-means, G-means, X-means, BKM); Non-Gaussian datasets (20 true clusters): Learned k and VI metric; Handwritten digits dataset (10 true classes): Learned k and VI metric. The first row (PG-means) shows Learned k = 20 ± 0 and VI = 0 ± 0 on the non-Gaussian data; the remaining numeric entries are truncated in this extract.] [sent-217, score-0.246]

88 [Table 1 residue, continued: the Handwritten digits dataset (10 true classes) column group with Learned k and VI metric headers; the numeric entries are truncated in this extract.] [sent-225, score-0.322]

89 G-means and X-means overfit the non-Gaussian datasets, while PG-means and BKM both perform excellently in the number of clusters learned and in learning the true labels according to the VI metric. [sent-231, score-0.53]

90 Postal Service handwritten digits dataset (both the train and test portions, obtained from http://www-stat. [sent-234, score-0.313]

91 Our goal is to cluster the dataset without knowing the true labels and analyze the result to find out how well PG-means captures the true classes. [sent-241, score-0.435]

92 We use random linear projection to project the dataset to 16 dimensions and run PG-means, G-means, X-means, and BKM on it. [sent-242, score-0.366]

93 5 Conclusions and future work We presented a new algorithm called PG-means for learning the number of Gaussian clusters k in data. [sent-247, score-0.367]

94 Then it projects both the model and the dataset to one dimension and tests for model fitness with the Kolmogorov-Smirnov test and its own critical values. [sent-250, score-0.576]

95 It performs multiple projections and tests per model, to avoid being fooled by a poorly chosen projection. [sent-251, score-0.297]

96 PG-means finds better models than G-means and X-means when the true clusters are eccentric or overlap, especially in low-dimension. [sent-257, score-0.638]

97 PG-means gives far more stable estimates of the number of clusters than the other two methods over many different types of data. [sent-259, score-0.398]

98 Our techniques would also be applicable as a wrapper around the k-means algorithm, which is really just a mixture of spherical Gaussians, or any other mixture of Gaussians with limited covariance. [sent-262, score-0.477]

99 On the real-world handwritten digits dataset PG-means finds a very good clustering with nearly the correct number of classes, and PG-means and BKM are equally close to identifying the original labels among the algorithms we tested. [sent-263, score-0.268]

100 Estimating the number of clusters in a dataset via the Gap statistic. [sent-315, score-0.471]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('bkm', 0.473), ('clusters', 0.367), ('eccentricity', 0.197), ('cluster', 0.191), ('projections', 0.187), ('ks', 0.181), ('spherical', 0.175), ('eccentric', 0.174), ('em', 0.136), ('projection', 0.122), ('projected', 0.115), ('tests', 0.11), ('tness', 0.108), ('erf', 0.108), ('vi', 0.108), ('dataset', 0.104), ('mixture', 0.101), ('hamerly', 0.1), ('wrapper', 0.1), ('learned', 0.093), ('pg', 0.092), ('datasets', 0.092), ('critical', 0.088), ('test', 0.088), ('statistic', 0.088), ('metric', 0.084), ('overlap', 0.083), ('adds', 0.082), ('null', 0.075), ('lilliefors', 0.075), ('synthetic', 0.074), ('repeats', 0.074), ('dimension', 0.072), ('true', 0.07), ('gaussians', 0.067), ('pelleg', 0.065), ('greg', 0.065), ('digits', 0.064), ('overlapping', 0.062), ('hypothesis', 0.062), ('handwritten', 0.057), ('bic', 0.052), ('centers', 0.05), ('dallal', 0.05), ('ecc', 0.05), ('elkan', 0.05), ('gmeans', 0.05), ('pgmeans', 0.05), ('repairing', 0.05), ('sand', 0.05), ('sanjoy', 0.05), ('project', 0.049), ('carlo', 0.048), ('monte', 0.048), ('simulations', 0.047), ('dimensions', 0.046), ('uses', 0.045), ('run', 0.045), ('gaussian', 0.044), ('univariate', 0.044), ('annealing', 0.044), ('starts', 0.043), ('dasgupta', 0.043), ('kurihara', 0.043), ('feng', 0.043), ('nds', 0.043), ('clustering', 0.043), ('accepted', 0.042), ('model', 0.041), ('works', 0.041), ('trace', 0.041), ('covariances', 0.04), ('overlapped', 0.039), ('cdf', 0.039), ('assignment', 0.037), ('means', 0.037), ('rejects', 0.037), ('bayesian', 0.035), ('dan', 0.035), ('clusterings', 0.035), ('american', 0.035), ('association', 0.034), ('accepts', 0.033), ('applies', 0.033), ('ts', 0.032), ('running', 0.032), ('charles', 0.032), ('projects', 0.032), ('moore', 0.032), ('score', 0.031), ('stable', 0.031), ('cause', 0.031), ('accept', 0.03), ('tend', 0.03), ('look', 0.03), ('bad', 0.029), ('center', 0.029), ('statistical', 0.029), ('tibshirani', 0.028), ('especially', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 158 nips-2006-PG-means: learning the number of clusters in data

Author: Yu Feng, Greg Hamerly

Abstract: We present a novel algorithm called PG-means which is able to learn the number of clusters in a classical Gaussian mixture model. Our method is robust and efficient; it uses statistical hypothesis tests on one-dimensional projections of the data and model to determine if the examples are well represented by the model. In so doing, we are applying a statistical test for the entire model at once, not just on a per-cluster basis. We show that our method works well in difficult cases such as non-Gaussian data, overlapping clusters, eccentric clusters, high dimension, and many true clusters. Further, our new method provides a much more stable estimate of the number of clusters than existing methods. 1

2 0.17030445 80 nips-2006-Fundamental Limitations of Spectral Clustering

Author: Boaz Nadler, Meirav Galun

Abstract: Spectral clustering methods are common graph-based approaches to clustering of data. Spectral clustering algorithms typically start from local information encoded in a weighted graph on the data and cluster according to the global eigenvectors of the corresponding (normalized) similarity matrix. One contribution of this paper is to present fundamental limitations of this general local to global approach. We show that based only on local information, the normalized cut functional is not a suitable measure for the quality of clustering. Further, even with a suitable similarity measure, we show that the first few eigenvectors of such adjacency matrices cannot successfully cluster datasets that contain structures at different scales of size and density. Based on these findings, a second contribution of this paper is a novel diffusion based measure to evaluate the coherence of individual clusters. Our measure can be used in conjunction with any bottom-up graph-based clustering method, it is scale-free and can determine coherent clusters at all scales. We present both synthetic examples and real image segmentation problems where various spectral clustering algorithms fail. In contrast, using this coherence measure finds the expected clusters at all scales. Keywords: Clustering, kernels, learning theory. 1

3 0.13790795 67 nips-2006-Differential Entropic Clustering of Multivariate Gaussians

Author: Jason V. Davis, Inderjit S. Dhillon

Abstract: Gaussian data is pervasive and many learning algorithms (e.g., k-means) model their inputs as a single sample drawn from a multivariate Gaussian. However, in many real-life settings, each input object is best described by multiple samples drawn from a multivariate Gaussian. Such data can arise, for example, in a movie review database where each movie is rated by several users, or in time-series domains such as sensor networks. Here, each input can be naturally described by both a mean vector and covariance matrix which parameterize the Gaussian distribution. In this paper, we consider the problem of clustering such input objects, each represented as a multivariate Gaussian. We formulate the problem using an information theoretic approach and draw several interesting theoretical connections to Bregman divergences and also Bregman matrix divergences. We evaluate our method across several domains, including synthetic data, sensor network data, and a statistical debugging application. 1

4 0.094202206 100 nips-2006-Information Bottleneck for Non Co-Occurrence Data

Author: Yevgeny Seldin, Noam Slonim, Naftali Tishby

Abstract: We present a general model-independent approach to the analysis of data in cases when these data do not appear in the form of co-occurrence of two variables X, Y , but rather as a sample of values of an unknown (stochastic) function Z(X, Y ). For example, in gene expression data, the expression level Z is a function of gene X and condition Y ; or in movie ratings data the rating Z is a function of viewer X and movie Y . The approach represents a consistent extension of the Information Bottleneck method that has previously relied on the availability of co-occurrence statistics. By altering the relevance variable we eliminate the need in the sample of joint distribution of all input variables. This new formulation also enables simple MDL-like model complexity control and prediction of missing values of Z. The approach is analyzed and shown to be on a par with the best known clustering algorithms for a wide range of domains. For the prediction of missing values (collaborative filtering) it improves the currently best known results. 1

5 0.093114048 7 nips-2006-A Local Learning Approach for Clustering

Author: Mingrui Wu, Bernhard Schölkopf

Abstract: We present a local learning approach for clustering. The basic idea is that a good clustering result should have the property that the cluster label of each data point can be well predicted based on its neighboring data and their cluster labels, using current supervised learning methods. An optimization problem is formulated such that its solution has the above property. Relaxation and eigen-decomposition are applied to solve this optimization problem. We also briefly investigate the parameter selection issue and provide a simple parameter selection method for the proposed algorithm. Experimental results are provided to validate the effectiveness of the proposed approach. 1

6 0.087788515 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds

7 0.08745864 175 nips-2006-Simplifying Mixture Models through Function Approximation

8 0.08721301 5 nips-2006-A Kernel Method for the Two-Sample-Problem

9 0.080193326 131 nips-2006-Mixture Regression for Covariate Shift

10 0.075965956 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures

11 0.072118178 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning

12 0.071517244 132 nips-2006-Modeling Dyadic Data with Binary Latent Factors

13 0.06820479 33 nips-2006-Analysis of Representations for Domain Adaptation

14 0.066677347 65 nips-2006-Denoising and Dimension Reduction in Feature Space

15 0.060423907 91 nips-2006-Hierarchical Dirichlet Processes with Random Effects

16 0.059619941 98 nips-2006-Inferring Network Structure from Co-Occurrences

17 0.059340689 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization

18 0.055512365 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension

19 0.053950686 70 nips-2006-Doubly Stochastic Normalization for Spectral Clustering

20 0.053628426 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.194), (1, 0.067), (2, 0.04), (3, 0.102), (4, 0.026), (5, -0.06), (6, 0.078), (7, 0.009), (8, 0.062), (9, 0.049), (10, -0.02), (11, 0.152), (12, 0.005), (13, 0.096), (14, -0.022), (15, -0.037), (16, 0.066), (17, -0.027), (18, 0.093), (19, -0.018), (20, -0.019), (21, -0.011), (22, 0.018), (23, 0.114), (24, -0.037), (25, -0.002), (26, 0.087), (27, 0.175), (28, 0.003), (29, -0.014), (30, -0.012), (31, -0.08), (32, -0.11), (33, -0.024), (34, -0.068), (35, -0.12), (36, 0.007), (37, 0.093), (38, 0.094), (39, -0.001), (40, -0.02), (41, -0.069), (42, 0.127), (43, -0.235), (44, 0.042), (45, -0.138), (46, 0.125), (47, -0.027), (48, 0.114), (49, 0.049)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96009696 158 nips-2006-PG-means: learning the number of clusters in data

Author: Yu Feng, Greg Hamerly

Abstract: We present a novel algorithm called PG-means which is able to learn the number of clusters in a classical Gaussian mixture model. Our method is robust and efficient; it uses statistical hypothesis tests on one-dimensional projections of the data and model to determine if the examples are well represented by the model. In so doing, we are applying a statistical test for the entire model at once, not just on a per-cluster basis. We show that our method works well in difficult cases such as non-Gaussian data, overlapping clusters, eccentric clusters, high dimension, and many true clusters. Further, our new method provides a much more stable estimate of the number of clusters than existing methods. 1

2 0.61643285 67 nips-2006-Differential Entropic Clustering of Multivariate Gaussians

Author: Jason V. Davis, Inderjit S. Dhillon

Abstract: Gaussian data is pervasive and many learning algorithms (e.g., k-means) model their inputs as a single sample drawn from a multivariate Gaussian. However, in many real-life settings, each input object is best described by multiple samples drawn from a multivariate Gaussian. Such data can arise, for example, in a movie review database where each movie is rated by several users, or in time-series domains such as sensor networks. Here, each input can be naturally described by both a mean vector and covariance matrix which parameterize the Gaussian distribution. In this paper, we consider the problem of clustering such input objects, each represented as a multivariate Gaussian. We formulate the problem using an information theoretic approach and draw several interesting theoretical connections to Bregman divergences and also Bregman matrix divergences. We evaluate our method across several domains, including synthetic data, sensor network data, and a statistical debugging application. 1

3 0.58242244 100 nips-2006-Information Bottleneck for Non Co-Occurrence Data

Author: Yevgeny Seldin, Noam Slonim, Naftali Tishby

Abstract: We present a general model-independent approach to the analysis of data in cases when these data do not appear in the form of co-occurrence of two variables X, Y , but rather as a sample of values of an unknown (stochastic) function Z(X, Y ). For example, in gene expression data, the expression level Z is a function of gene X and condition Y ; or in movie ratings data the rating Z is a function of viewer X and movie Y . The approach represents a consistent extension of the Information Bottleneck method that has previously relied on the availability of co-occurrence statistics. By altering the relevance variable we eliminate the need in the sample of joint distribution of all input variables. This new formulation also enables simple MDL-like model complexity control and prediction of missing values of Z. The approach is analyzed and shown to be on a par with the best known clustering algorithms for a wide range of domains. For the prediction of missing values (collaborative filtering) it improves the currently best known results. 1

4 0.55623913 131 nips-2006-Mixture Regression for Covariate Shift

Author: Masashi Sugiyama, Amos J. Storkey

Abstract: In supervised learning there is a typical presumption that the training and test points are taken from the same distribution. In practice this assumption is commonly violated. The situations where the training and test data are from different distributions is called covariate shift. Recent work has examined techniques for dealing with covariate shift in terms of minimisation of generalisation error. As yet the literature lacks a Bayesian generative perspective on this problem. This paper tackles this issue for regression models. Recent work on covariate shift can be understood in terms of mixture regression. Using this view, we obtain a general approach to regression under covariate shift, which reproduces previous work as a special case. The main advantages of this new formulation over previous models for covariate shift are that we no longer need to presume the test and training densities are known, the regression and density estimation are combined into a single procedure, and previous methods are reproduced as special cases of this procedure, shedding light on the implicit assumptions the methods are making. 1

5 0.54365128 5 nips-2006-A Kernel Method for the Two-Sample-Problem

Author: Arthur Gretton, Karsten M. Borgwardt, Malte Rasch, Bernhard Schölkopf, Alex J. Smola

Abstract: We propose two statistical tests to determine if two samples are from different distributions. Our test statistic is in both cases the distance between the means of the two samples mapped into a reproducing kernel Hilbert space (RKHS). The first test is based on a large deviation bound for the test statistic, while the second is based on the asymptotic distribution of this statistic. The test statistic can be computed in O(m2 ) time. We apply our approach to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where our test performs strongly. We also demonstrate excellent performance when comparing distributions over graphs, for which no alternative tests currently exist.

6 0.53250813 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds

7 0.51848245 56 nips-2006-Conditional Random Sampling: A Sketch-based Sampling Technique for Sparse Data

8 0.50280029 90 nips-2006-Hidden Markov Dirichlet Process: Modeling Genetic Recombination in Open Ancestral Space

9 0.46768251 80 nips-2006-Fundamental Limitations of Spectral Clustering

10 0.44810405 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures

11 0.43510437 33 nips-2006-Analysis of Representations for Domain Adaptation

12 0.43056649 109 nips-2006-Learnability and the doubling dimension

13 0.42680544 175 nips-2006-Simplifying Mixture Models through Function Approximation

14 0.39474505 7 nips-2006-A Local Learning Approach for Clustering

15 0.38900653 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation

16 0.37223119 91 nips-2006-Hierarchical Dirichlet Processes with Random Effects

17 0.35905504 129 nips-2006-Map-Reduce for Machine Learning on Multicore

18 0.35239369 144 nips-2006-Near-Uniform Sampling of Combinatorial Spaces Using XOR Constraints

19 0.33775979 114 nips-2006-Learning Time-Intensity Profiles of Human Activity using Non-Parametric Bayesian Models

20 0.33594289 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.094), (3, 0.011), (6, 0.014), (7, 0.09), (9, 0.11), (11, 0.011), (20, 0.034), (22, 0.074), (44, 0.06), (57, 0.085), (65, 0.036), (69, 0.062), (89, 0.221)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.82743573 158 nips-2006-PG-means: learning the number of clusters in data

Author: Yu Feng, Greg Hamerly

Abstract: We present a novel algorithm called PG-means which is able to learn the number of clusters in a classical Gaussian mixture model. Our method is robust and efficient; it uses statistical hypothesis tests on one-dimensional projections of the data and model to determine if the examples are well represented by the model. In so doing, we are applying a statistical test for the entire model at once, not just on a per-cluster basis. We show that our method works well in difficult cases such as non-Gaussian data, overlapping clusters, eccentric clusters, high dimension, and many true clusters. Further, our new method provides a much more stable estimate of the number of clusters than existing methods. 1

2 0.67875916 167 nips-2006-Recursive ICA

Author: Honghao Shan, Lingyun Zhang, Garrison W. Cottrell

Abstract: Independent Component Analysis (ICA) is a popular method for extracting independent features from visual data. However, as a fundamentally linear technique, there is always nonlinear residual redundancy that is not captured by ICA. Hence there have been many attempts to try to create a hierarchical version of ICA, but so far none of the approaches have a natural way to apply them more than once. Here we show that there is a relatively simple technique that transforms the absolute values of the outputs of a previous application of ICA into a normal distribution, to which ICA maybe applied again. This results in a recursive ICA algorithm that may be applied any number of times in order to extract higher order structure from previous layers. 1

3 0.66495156 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis

Author: J.t. Lindgren, Aapo Hyvärinen

Abstract: In previous studies, quadratic modelling of natural images has resulted in cell models that react strongly to edges and bars. Here we apply quadratic Independent Component Analysis to natural image patches, and show that up to a small approximation error, the estimated components are computing conjunctions of two linear features. These conjunctive features appear to represent not only edges and bars, but also inherently two-dimensional stimuli, such as corners. In addition, we show that for many of the components, the underlying linear features have essentially V1 simple cell receptive field characteristics. Our results indicate that the development of the V2 cells preferring angles and corners may be partly explainable by the principle of unsupervised sparse coding of natural images. 1

4 0.66205579 7 nips-2006-A Local Learning Approach for Clustering

Author: Mingrui Wu, Bernhard Schölkopf

Abstract: We present a local learning approach for clustering. The basic idea is that a good clustering result should have the property that the cluster label of each data point can be well predicted based on its neighboring data and their cluster labels, using current supervised learning methods. An optimization problem is formulated such that its solution has the above property. Relaxation and eigen-decomposition are applied to solve this optimization problem. We also briefly investigate the parameter selection issue and provide a simple parameter selection method for the proposed algorithm. Experimental results are provided to validate the effectiveness of the proposed approach. 1

5 0.65940338 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning

Author: Hamed Valizadegan, Rong Jin

Abstract: Maximum margin clustering was proposed lately and has shown promising performance in recent studies [1, 2]. It extends the theory of support vector machine to unsupervised learning. Despite its good performance, there are three major problems with maximum margin clustering that question its efficiency for real-world applications. First, it is computationally expensive and difficult to scale to large-scale datasets because the number of parameters in maximum margin clustering is quadratic in the number of examples. Second, it requires data preprocessing to ensure that any clustering boundary will pass through the origins, which makes it unsuitable for clustering unbalanced dataset. Third, it is sensitive to the choice of kernel functions, and requires external procedure to determine the appropriate values for the parameters of kernel functions. In this paper, we propose “generalized maximum margin clustering” framework that addresses the above three problems simultaneously. The new framework generalizes the maximum margin clustering algorithm by allowing any clustering boundaries including those not passing through the origins. It significantly improves the computational efficiency by reducing the number of parameters. Furthermore, the new framework is able to automatically determine the appropriate kernel matrix without any labeled data. Finally, we show a formal connection between maximum margin clustering and spectral clustering. We demonstrate the efficiency of the generalized maximum margin clustering algorithm using both synthetic datasets and real datasets from the UCI repository. 1

6 0.65913182 72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model

7 0.65823889 80 nips-2006-Fundamental Limitations of Spectral Clustering

8 0.64913011 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation

9 0.6483652 65 nips-2006-Denoising and Dimension Reduction in Feature Space

10 0.64746457 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds

11 0.64549124 160 nips-2006-Part-based Probabilistic Point Matching using Equivalence Constraints

12 0.63947022 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions

13 0.6389209 175 nips-2006-Simplifying Mixture Models through Function Approximation

14 0.6384421 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming

15 0.63728911 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization

16 0.63572901 34 nips-2006-Approximate Correspondences in High Dimensions

17 0.63523221 97 nips-2006-Inducing Metric Violations in Human Similarity Judgements

18 0.63434339 67 nips-2006-Differential Entropic Clustering of Multivariate Gaussians

19 0.63411582 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment

20 0.63350612 43 nips-2006-Bayesian Model Scoring in Markov Random Fields