nips nips2007 nips2007-61 knowledge-graph by maker-knowledge-mining

61 nips-2007-Convex Clustering with Exemplar-Based Models


Source: pdf

Author: Danial Lashkari, Polina Golland

Abstract: Clustering is often formulated as the maximum likelihood estimation of a mixture model that explains the data. The EM algorithm widely used to solve the resulting optimization problem is inherently a gradient-descent method and is sensitive to initialization. The resulting solution is a local optimum in the neighborhood of the initial guess. This sensitivity to initialization presents a significant challenge in clustering large data sets into many clusters. In this paper, we present a different approach to approximate mixture fitting for clustering. We introduce an exemplar-based likelihood function that approximates the exact likelihood. This formulation leads to a convex minimization problem and an efficient algorithm with guaranteed convergence to the globally optimal solution. The resulting clustering can be thought of as a probabilistic mapping of the data points to the set of exemplars that minimizes the average distance and the information-theoretic cost of mapping. We present experimental results illustrating the performance of our algorithm and its comparison with the conventional approach to mixture model clustering. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 This sensitivity to initialization presents a significant challenge in clustering large data sets into many clusters. [sent-6, score-0.457]

2 This formulation leads to a convex minimization problem and an efficient algorithm with guaranteed convergence to the globally optimal solution. [sent-9, score-0.26]

3 The resulting clustering can be thought of as a probabilistic mapping of the data points to the set of exemplars that minimizes the average distance and the information-theoretic cost of mapping. [sent-10, score-0.728]

4 The input is either vectorial data, that is, vectors of data points in the feature space, or proximity data, the pairwise similarity or dissimilarity values between the data points. [sent-13, score-0.394]

5 The choice of the clustering cost function and the optimization algorithm employed to solve the problem determines the resulting clustering [1]. [sent-14, score-0.608]

6 Intuitively, most methods seek compact clusters of data points, namely, clusters with relatively small intra-cluster and high inter-cluster distances. [sent-15, score-0.55]

7 Other approaches, such as Spectral Clustering [2], look for clusters of more complex shapes lying on some low dimensional manifolds in the feature space. [sent-16, score-0.286]

8 Although this approach yields satisfactory results for problems with a small number of clusters and is relatively fast, its use of a gradient-descent algorithm for minimization of a cost function with many local optima makes it sensitive to initialization. [sent-20, score-0.489]

9 As the search space grows, that is, the number of data points or clusters increases, it becomes harder to find a good initialization. [sent-21, score-0.388]

10 This problem often arises in emerging applications of clustering for large biological data sets such as gene-expression data. [sent-22, score-0.301]

11 We aim to circumvent the initialization procedure by designing a convex problem whose global optimum can be found with a simple algorithm. [sent-25, score-0.261]

12 It has been shown that mixture modeling can be formulated as an instance of iterative distance minimization between two sets of probability distributions [3]. [sent-26, score-0.323]

13 This formulation shows that the non-convexity of the mixture modeling cost function comes from the parametrization of the model components. [sent-27, score-0.324]

14 More precisely, any mixture model is, by definition, a convex combination of some set of distributions. [sent-28, score-0.254]

15 However, for a fixed number of mixture components, the set of all such mixture models is usually not convex when the component distributions have free parameters, say, free means in the case of normal distributions. [sent-29, score-0.474]

16 Inspired by combinatorial, non-parametric methods such as k-medoids [5] and affinity propagation [6], our main idea is to employ the notion of exemplar finding, namely, finding the data points which could best describe the data set. [sent-30, score-0.333]

17 We assume that the clusters are dense enough such that there is always a data point very close to the real cluster centroid and, thus, restrict the set of possible cluster means to the set of data points. [sent-31, score-0.685]

18 Further, by taking all data points as exemplar candidates, the modeling cost function becomes convex. [sent-32, score-0.351]

19 Moreover, since the number of clusters is not specified a priori, the algorithm automatically finds the number of clusters depending only on one temperature-like parameter. [sent-35, score-0.509]

20 This parameter, which is equivalent to a common fixed variance in the case of Gaussian models, defines the width scale of the desired clusters in the feature space. [sent-36, score-0.274]

21 Our method works exactly in the same way with both proximity and vectorial data, unifying their treatment and providing insights into the modeling assumptions underlying the conversion of feature vectors into pairwise proximity data. [sent-37, score-0.314]

22 Thus, we can restrict the set of mixture components to the distributions centered at the data points. [sent-49, score-0.257]

23 Yet, for a specified number of clusters k, the problem still has a combinatorial nature of choosing the right k cluster centers among n data points. [sent-52, score-0.564]

24 The new log-likelihood function is $l(\{q_j\}_{j=1}^{n}; \mathcal{X}) = \frac{1}{n}\sum_{i=1}^{n} \log \sum_{j=1}^{n} q_j f_j(x_i) = \frac{1}{n}\sum_{i=1}^{n} \log \sum_{j=1}^{n} q_j e^{-\beta d_\phi(x_i, x_j)} + \mathrm{const}$. [sent-54, score-1.328]
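
A minimal numpy sketch of this exemplar-based log-likelihood, assuming squared Euclidean distance as the divergence $d_\phi$; the function name `exemplar_loglik` and its signature are our own illustration, not code from the paper.

```python
import numpy as np

def exemplar_loglik(X, q, beta):
    """Exemplar-based log-likelihood l({q_j}; X), up to the additive constant.

    X : (n, d) data matrix; every data point also serves as a candidate exemplar.
    q : (n,) mixture weights over exemplars, summing to 1.
    """
    # pairwise squared Euclidean distances d_phi(x_i, x_j)
    d = np.square(X[:, None, :] - X[None, :, :]).sum(axis=-1)
    s = np.exp(-beta * d)              # s_ij = exp(-beta * d_phi(x_i, x_j))
    return np.log(s @ q).mean()        # (1/n) sum_i log sum_j q_j s_ij
```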

25 We maximize $l(\,\cdot\,; \mathcal{X})$ over the set of all mixture distributions $\mathcal{Q} = \{\,Q \mid Q(\cdot) = \sum_{j=1}^{n} q_j f_j(\cdot)\,\}$. [sent-57, score-0.882]

26 Consequently, the maximum likelihood problem can be equivalently stated as the minimization of the KL-divergence between $\hat{P}$ and the set of mixture distributions $\mathcal{Q}$. [sent-60, score-0.289]

27 3 Connection to Rate-Distortion Problems Now, we present an equivalent statement of our problem on the product set of exemplars and data points. [sent-66, score-0.236]

28 Let $\mathcal{Q}'$ be the set of distributions of the complete data random variable $(J, X) \in \{1, \cdots, n\} \times \mathbb{R}^d$ with elements $Q'(j, x) = q_j f_j(x)$. [sent-69, score-0.757]

29 Specifically, we define: $Q'(j, x) = q_j\, C(x)\, e^{-\beta d_\phi(x, x_j)}$ and $\hat{P}'(j, x) = \hat{P}(x) P(j|x) = \begin{cases} \frac{1}{n} r_{ij}, & x = x_i \in \mathcal{X} \\ 0, & \text{otherwise} \end{cases}$ (6), where $q_j$ and $r_{ij} = P(j \mid x = x_i)$ are probability distributions over the set $\{j\}_{j=1}^{n}$. [sent-74, score-1.898]

30 This formulation ensures that $\hat{P}' \in \mathcal{P}'$, $Q' \in \mathcal{Q}'$, and the objective function is expressed only in terms of the variables $q_j$ and $P(j|x)$ for $x \in \mathcal{X}$. [sent-75, score-0.663]

31 Our goal is then to solve the minimization problem in the space of distributions of the random variable $(J, I) \in \{j\}_{j=1}^{n} \times \{j\}_{j=1}^{n}$, namely, in the product space of exemplar and data point indices. [sent-76, score-0.331]

32 Substituting expressions (6) into the KL-divergence $D(\hat{P}' \| Q')$, we obtain the equivalent cost function: $D(\hat{P}' \| Q') = \frac{1}{n}\sum_{i,j=1}^{n} r_{ij}\left[\log \frac{r_{ij}}{q_j} + \beta d_\phi(x_i, x_j)\right] + \mathrm{const}$. (7) [sent-77, score-0.714]

33 It is straightforward to show that for any set of values $r_{ij}$, setting $q_j = \frac{1}{n}\sum_i r_{ij}$ minimizes (7). [sent-78, score-1.787]

34 Substituting this expression into the cost function, we obtain the final expression $D(\hat{P}' \| Q'^{*}(\hat{P}')) = \frac{1}{n}\sum_{i,j=1}^{n} r_{ij}\left[\log \frac{r_{ij}}{\frac{1}{n}\sum_{i'} r_{i'j}} + \beta d_\phi(x_i, x_j)\right] + \mathrm{const}$. (8) [sent-79, score-0.714]
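
A small sketch that evaluates the cost in (8) for a given responsibility matrix, writing the rate as the mutual information induced by $r_{ij}$ under a uniform distribution over data points; the function name and signature are illustrative assumptions, not the paper's code.

```python
import numpy as np

def rate_distortion_cost(r, d, beta, eps=1e-12):
    """Objective (8): rate I(I;J) plus beta times the average distortion.

    r : (n, n) responsibilities; row i is a distribution over exemplars j.
    d : (n, n) pairwise distances d_phi(x_i, x_j).
    """
    n = r.shape[0]
    q = r.mean(axis=0)                                    # q_j = (1/n) sum_i r_ij
    rate = np.sum(r * np.log((r + eps) / (q + eps))) / n  # I(I;J) with P(i) = 1/n
    distortion = np.sum(r * d) / n                        # E_{I,J} d_phi(x_i, x_j)
    return rate + beta * distortion
```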

35 The $n^2$ unknown values of $r_{ij}$ lie on $n$ separate $n$-dimensional simplices. [sent-82, score-0.325]

36 These parameters have the same role as cluster responsibilities in soft k-means: they stand for the probability of data point xi choosing data point xj as its cluster-center. [sent-83, score-0.695]

37 We cannot compress the information source beyond this rate without tolerating some distortion, when the original data points are encoded into other points with nonzero distances between them. [sent-94, score-0.354]

38 We can then consider the $r_{ij}$'s as a probabilistic encoding of our data set onto itself, with the corresponding average distortion $D = \mathbb{E}_{I,J}\, d_\phi(x_i, x_j)$ and the rate $I(I; J)$. [sent-95, score-0.506]

39 A solution $r^{*}_{ij}$ that minimizes (8) for some $\beta$ yields the least rate that can be achieved having no more than the corresponding average distortion $D$. [sent-96, score-0.412]

40 The weight $q_j$ for the data point $x_j$ is a measure of how likely this point is to appear in the compressed representation of the data set. [sent-99, score-0.823]

41 Here, we can rigorously quantify our intuitive idea that a higher number of clusters (corresponding to a higher rate) is the inherent cost of attaining lower average distortion. [sent-102, score-0.361]

42 At the fixed point, the values of $\eta_j$ are equal to 1 for all data points in the support of $q_j$ and are less than 1 otherwise [10]. [sent-106, score-0.753]

43 In practice, we compute the gap between $\max_j (\log \eta_j)$ and $\sum_j q_j \log \eta_j$ in each iteration and stop the algorithm when this gap becomes less than a small threshold. [sent-107, score-0.665]

44 Note that the soft assignments $r_{ij}^{(t)} = q_j^{(t)} s_{ij} / (n z_i^{(t)})$ need to be computed only once after the algorithm has converged. [sent-108, score-1.263]
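
The sentences above outline the whole fixed-point scheme: scale the weights by $\eta_j = \frac{1}{n}\sum_i s_{ij}/z_i$ with $z_i = \sum_j q_j s_{ij}$, monitor the gap between $\max_j \log\eta_j$ and $\sum_j q_j \log\eta_j$, and read off the responsibilities once at the end. The minimal sketch below assumes a precomputed distance matrix and uses the standard multiplicative (EM-style) update for mixture weights with fixed components, which matches the fixed-point condition quoted above; `convex_cluster` and its defaults are our own illustration, not the authors' released code.

```python
import numpy as np

def convex_cluster(d, beta, tol=1e-6, max_iter=10000):
    """Fixed-point iteration for exemplar weights q_j and soft assignments r_ij.

    d    : (n, n) matrix of pairwise distances d_phi(x_i, x_j) (or any d_ij).
    beta : width (inverse-variance-like) parameter.
    """
    n = d.shape[0]
    s = np.exp(-beta * d)            # s_ij = exp(-beta * d_ij)
    q = np.full(n, 1.0 / n)          # start from the uniform distribution over exemplars

    for _ in range(max_iter):
        z = s @ q                            # z_i = sum_j q_j s_ij
        eta = (s / z[:, None]).mean(axis=0)  # eta_j = (1/n) sum_i s_ij / z_i
        q *= eta                             # multiplicative update; q stays on the simplex
        log_eta = np.log(eta)
        gap = log_eta.max() - q @ log_eta    # stopping criterion from the text
        if gap < tol:
            break

    # responsibilities proportional to q_j * s_ij, computed once after convergence
    r = q * s
    r /= r.sum(axis=1, keepdims=True)
    return q, r
```

In this sketch `r` is normalized over $j$, i.e. it is the conditional $P(j \mid x_i)$; the joint distribution used in (6) is simply `r / n`.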

45 Any value of $\beta \in [0, \infty)$ yields a different solution to (8) with a different number of nonzero $q_j$ values. [sent-109, score-0.675]

46 Smaller values of β correspond to having wider clusters and greater values correspond to narrower clusters. [sent-110, score-0.301]

47 Neither extreme, one assigning all data points to the central exemplar and the other taking all data points as exemplars, is interesting. [sent-111, score-0.38]

48 For reasonable ranges of β, the solution is sparse and the resulting number of nonzero components of qj determines the final number of clusters. [sent-112, score-0.727]

49 Similar to other interior-point methods, the convergence of our algorithm becomes slow as we move close to the vertices of the probability simplex where some qj ’s are very small. [sent-113, score-0.663]

50 In order to improve the convergence rate, after each iteration we identify all $q_j$'s that are below a certain threshold ($10^{-3}/n$ in our experiments), set them to zero, and re-normalize the entire distribution over the remaining indices. [sent-114, score-0.61]

51 This effectively excludes the corresponding points as possible exemplars and reduces the cost of the following iterations. [sent-115, score-0.365]

52 In order to further speed up the algorithm for very large data sets, we can search over the values of $s_{ij}$ for each $i$ and keep only the largest $n_o$ values in any row, turning the proximity matrix into a sparse one. [sent-116, score-0.373]
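
A sketch of the two speed-ups just described, under the same assumed variable names as the iteration sketch above; the $10^{-3}/n$ threshold and the per-row cutoff $n_o$ come from the text, everything else is illustrative.

```python
import numpy as np

def prune_weights(q, thresh_scale=1e-3):
    """Set exemplar weights below thresh_scale/n to zero and renormalize the rest."""
    n = q.shape[0]
    q = np.where(q < thresh_scale / n, 0.0, q)
    return q / q.sum()

def sparsify_similarities(s, n_o):
    """Keep only the n_o largest similarities s_ij in each row, zeroing the others."""
    keep = np.argsort(s, axis=1)[:, -n_o:]     # column indices of the n_o largest entries per row
    s_sparse = np.zeros_like(s)
    np.put_along_axis(s_sparse, keep, np.take_along_axis(s, keep, axis=1), axis=1)
    return s_sparse
```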

53 The reasoning is simply that we expect any point to be represented in the final solution with exemplars relatively close to it. [sent-117, score-0.272]

54 Right: the exponential of the rate (dotted line) and the number of hard clusters (solid line) for different values of $\beta$. [sent-123, score-0.368]

55 To visualize the clustering results, we turn the soft responsibilities into hard assignments. [sent-131, score-0.539]

56 Here, we first choose the set of exemplars to be the set of all indices $j$ that are MAP-estimate exemplars for some data point $i$ under $P(j|x_i)$. [sent-132, score-0.479]
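
A sketch of this soft-to-hard conversion, assuming the responsibility matrix `r` from the iteration sketch above; the names are illustrative.

```python
import numpy as np

def hard_assignments(r):
    """MAP conversion of soft responsibilities into hard clusters.

    Each point i is assigned to its most probable exemplar argmax_j r_ij;
    the exemplar set consists of the indices chosen by at least one point.
    """
    labels = r.argmax(axis=1)        # MAP exemplar index for every data point
    exemplars = np.unique(labels)    # indices that serve as an exemplar for some point
    return labels, exemplars
```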

57 Figure 2 illustrates the shapes of the resulting hard clusters for different values of β. [sent-134, score-0.378]

58 Since $\beta$ has dimensions of inverse variance in the case of Gaussian models, we chose an empirical value $\beta_o = n^2 \log n / \sum_{i,j} \|x_i - x_j\|^2$ so that values of $\beta$ around $\beta_o$ give reasonable results. [sent-135, score-0.241]
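
The empirical scale above translates directly into a one-liner; a sketch assuming Euclidean data `X` (the name `empirical_beta` is our own).

```python
import numpy as np

def empirical_beta(X):
    """beta_o = n^2 log(n) / sum_{i,j} ||x_i - x_j||^2."""
    n = X.shape[0]
    sq_dists = np.square(X[:, None, :] - X[None, :, :]).sum(axis=-1)
    return n ** 2 * np.log(n) / sq_dists.sum()
```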

59 We can see how clusters split when we increase β. [sent-136, score-0.243]

60 Such cluster splitting behavior also occurs in the case of a Gaussian mixture model with unconstrained cluster centers and has been studied as the phase transitions of a corresponding statistical system [9]. [sent-137, score-0.558]

61 The resulting number of hard clusters for different values of $\beta$ is shown in Figure 1 (right). [sent-139, score-0.335]

62 The distribution of data points in Figure 2 shows that this is a reasonable choice of number of clusters for this data set. [sent-141, score-0.392]

63 However, we also observe some fluctuations in the number of clusters even in the more stable regime of values of β. [sent-142, score-0.272]

64 Comparing this behavior with the monotonicity of our rate shows how, by turning the soft assignments into the hard ones, we lose the strong optimality guarantees we have for the original soft solution. [sent-143, score-0.659]

65 Nevertheless, since our global optimum is the minimum of a well-justified cost function, we expect to obtain relatively good hard assignments. [sent-144, score-0.239]

66 The main motivation for developing a convex formulation of clustering is to avoid the well-known problem of local optima and sensitivity to initialization. [sent-146, score-0.467]

67 We will refer to this regular mixture model as the soft k-means. [sent-148, score-0.423]

68 The k-means algorithm is a limiting case of this mixture-model problem when $\beta \to \infty$, hence the name soft k-means. [sent-149, score-0.243]

69 We generate synthetic data sets by drawing points from unit-variance Gaussian distributions centered around a set of vectors. [sent-151, score-0.271]

70 There is an important distinction between the soft k-means and our algorithm: although the results of both algorithms depend on the choice of β, only the soft k-means needs the number of clusters k as an input. [sent-152, score-0.683]

71 The exemplar data point of each cluster is denoted by a cross. [sent-159, score-0.394]

72 The range of normal distributions for any mixture model is illustrated here by circles around these exemplar points, with radius equal to the square root of the variance corresponding to the value of $\beta$ used by the algorithm ($\sigma = (2\beta)^{-1/2}$). [sent-160, score-0.569]

73 As a measure of clustering quality, we use micro-averaged precision. [sent-163, score-0.237]

74 We form the contingency tables for the cluster assignments found by the algorithm and the true cluster labels. [sent-164, score-0.399]

75 The percentage of the total number of data points assigned to the right cluster is taken as the precision value of the clustering result. [sent-165, score-0.645]
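
One common way to compute this from the contingency table, which we assume matches the intent here: give each found cluster its majority true label and count the points that end up correctly labeled. A sketch, not the authors' evaluation script.

```python
import numpy as np

def micro_averaged_precision(pred, truth):
    """Fraction of points whose cluster's majority true label equals their own label.

    pred, truth : non-negative integer arrays of length n; cluster ids need not match label ids.
    """
    correct = 0
    for c in np.unique(pred):
        members = truth[pred == c]
        correct += np.bincount(members).max()   # points carrying the dominant true label
    return correct / truth.shape[0]
```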

76 In the first experiment, we look at the performance of the two algorithms as the number of clusters increases. [sent-167, score-0.243]

77 Different data sets are generated by drawing 3000 data points around some number of cluster centers in $\mathbb{R}^{20}$, with all clusters having the same number of data points. [sent-168, score-0.75]

78 Each component of any data-point vector comes from an independent Gaussian distribution with unit variance around the value of the corresponding component of its cluster center. [sent-169, score-0.234]
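
A sketch of this data-generation setup: equal-sized clusters around centers in $\mathbb{R}^{20}$ with unit-variance Gaussian noise per component. How the centers themselves were drawn is not stated for this experiment, so the `center_scale` spread below is an assumption, as are the function name and defaults.

```python
import numpy as np

def make_synthetic(n_points=3000, n_clusters=20, dim=20, center_scale=5.0, seed=0):
    """Draw equal-sized clusters around random centers with unit variance per component."""
    rng = np.random.default_rng(seed)
    centers = center_scale * rng.standard_normal((n_clusters, dim))
    labels = np.repeat(np.arange(n_clusters), n_points // n_clusters)
    X = centers[labels] + rng.standard_normal((labels.shape[0], dim))
    return X, labels
```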

79 In this experiment, for any value of β, we repeat soft k-means 1000 times with random initialization and pick the solution with the highest likelihood value. [sent-171, score-0.326]

80 Figure 3 (left) presents the precision values as a function of the number of clusters in the mixture distribution that generates the 3000 data points. [sent-172, score-0.633]

81 We can see that the performance of soft k-means drops as the number of clusters increases, while our performance remains relatively stable. [sent-174, score-0.492]

82 Since the total number of data points remains the same, increasing the number of clusters results in increasing complexity of the problem, with presumably more local minima of the cost function. [sent-178, score-0.488]

83 We draw 100 random vectors, with a unit-variance Gaussian distribution in each component, around each of the 40 cluster centers to make data sets of 4000 data points in total. [sent-181, score-0.401]

84 The cluster centers are chosen to be of the form $(0, \cdots, 0, \sqrt{50}, 0, \cdots, 0)$, where we change the position of the nonzero component to make different cluster centers. [sent-182, score-0.463]

85 In this way, the pairwise distance between all cluster centers is 50 by formation. [sent-183, score-0.31]

86 Figure 4 (left) presents the precision values found for the two algorithms when 4000 points lie in spaces with different dimensionality. [sent-184, score-0.301]

87 Again, the relative performance of Convex Clustering when compared to soft k-means improves with the increasing problem complexity. [sent-186, score-0.243]

88 This is further evidence that for larger data sets the less precise nature of our constrained search, as compared to full mixture models, is well compensated by its ability to always find its global optimum. [sent-187, score-0.274]

89 6 Discussion and Related Work Since only the distances take part in our formulation and the values of data point vectors are not required, we can extend this method to any proximity data. [sent-190, score-0.314]

90 Given a matrix $D_{n \times n} = [d_{ij}]$ that describes the pairwise symmetric or asymmetric dissimilarities between data points, we can replace the $d_\phi(x_i, x_j)$'s in (8) with the $d_{ij}$'s and solve the same minimization problem, whose convexity can be directly verified. [sent-191, score-0.277]
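
Since the iteration only ever touches the distance matrix, the earlier `convex_cluster` sketch applies to proximity data unchanged; a hypothetical usage example, assuming `D` holds the given dissimilarities and reusing the illustrative helpers defined above.

```python
# D is an (n, n) matrix of (possibly asymmetric) dissimilarities d_ij;
# it simply replaces d_phi(x_i, x_j) in the objective (8).
q, r = convex_cluster(D, beta=1.0)
labels, exemplars = hard_assignments(r)
```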

91 A previous application of rate-distortion theoretic ideas in clustering led to deterministic annealing (DA). [sent-193, score-0.284]

92 Affinity propagation finds the exemplars by forming a factor graph and running a message passing algorithm on the graph as a way to minimize the clustering cost function [6]. [sent-198, score-0.569]

93 The second term is needed to put some cost on picking any point as an exemplar, to prevent the trivial case of every point being sent to itself. [sent-200, score-0.321]

94 We can interpret our algorithm as a relaxation of this combinatorial problem to the soft assignment case by introducing probabilities $P(c_i = j) = r_{ij}$ of associating point $i$ with an exemplar $j$. [sent-204, score-0.759]

95 The marginal distribution $q_j = \frac{1}{n}\sum_i r_{ij}$ is the probability that point $j$ is an exemplar. [sent-205, score-0.945]

96 A possible choice might be $H(q)$, the entropy of the distribution $q_j$, which is bounded above by $\log k$. [sent-207, score-0.687]

97 However, the entropy function is concave, and any local or global minimum of a concave minimization problem over a simplex occurs at an extreme point of the feasible domain, which in our case corresponds to the original combinatorial hard assignments [11]. [sent-208, score-0.435]

98 In contrast, using the mutual information $I(I; J)$ induced by $r_{ij}$ as the regularizing term turns the problem into a convex problem. [sent-209, score-0.396]

99 In conclusion, we proposed a framework for constraining the search space of general mixture models to achieve global optimality of the solution. [sent-213, score-0.289]

100 In particular, our method promises to be useful in problems with large data sets where regular mixture models fail to yield consistent results due to their sensitivity to initialization. [sent-214, score-0.307]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('qj', 0.61), ('rij', 0.269), ('clusters', 0.243), ('clustering', 0.237), ('soft', 0.22), ('exemplars', 0.201), ('cluster', 0.165), ('mixture', 0.16), ('exemplar', 0.152), ('precision', 0.129), ('proximity', 0.098), ('sij', 0.095), ('convex', 0.094), ('cost', 0.085), ('initialization', 0.079), ('points', 0.079), ('distortion', 0.078), ('fj', 0.076), ('centers', 0.068), ('minimization', 0.066), ('nonzero', 0.065), ('xj', 0.059), ('formulation', 0.053), ('combinatorial', 0.053), ('xi', 0.052), ('danial', 0.051), ('polina', 0.051), ('bregman', 0.051), ('em', 0.051), ('global', 0.05), ('optimality', 0.048), ('annealing', 0.047), ('assignments', 0.046), ('mj', 0.045), ('pairwise', 0.045), ('entropy', 0.045), ('responsibilities', 0.045), ('vectorial', 0.045), ('nity', 0.044), ('af', 0.043), ('regular', 0.043), ('shapes', 0.043), ('optima', 0.043), ('point', 0.042), ('sensitivity', 0.04), ('around', 0.038), ('optimum', 0.038), ('convexity', 0.038), ('presents', 0.037), ('hard', 0.037), ('distributions', 0.036), ('data', 0.035), ('source', 0.035), ('divergences', 0.034), ('dij', 0.034), ('average', 0.033), ('mutual', 0.033), ('divergence', 0.033), ('turning', 0.033), ('concave', 0.033), ('distance', 0.032), ('propagation', 0.032), ('rate', 0.032), ('log', 0.032), ('variance', 0.031), ('gain', 0.031), ('proposition', 0.031), ('search', 0.031), ('simplex', 0.03), ('sets', 0.029), ('distances', 0.029), ('values', 0.029), ('euclidean', 0.029), ('relatively', 0.029), ('da', 0.028), ('compression', 0.028), ('vectors', 0.028), ('namely', 0.028), ('connection', 0.028), ('exponential', 0.027), ('likelihood', 0.027), ('lie', 0.027), ('substituting', 0.027), ('components', 0.026), ('bits', 0.026), ('resulting', 0.026), ('illustrated', 0.026), ('family', 0.025), ('nds', 0.024), ('consequently', 0.024), ('marginal', 0.024), ('normal', 0.024), ('globally', 0.024), ('increasing', 0.023), ('algorithm', 0.023), ('nal', 0.023), ('guarantees', 0.023), ('drawing', 0.023), ('plan', 0.023), ('passing', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 61 nips-2007-Convex Clustering with Exemplar-Based Models

Author: Danial Lashkari, Polina Golland

Abstract: Clustering is often formulated as the maximum likelihood estimation of a mixture model that explains the data. The EM algorithm widely used to solve the resulting optimization problem is inherently a gradient-descent method and is sensitive to initialization. The resulting solution is a local optimum in the neighborhood of the initial guess. This sensitivity to initialization presents a significant challenge in clustering large data sets into many clusters. In this paper, we present a different approach to approximate mixture fitting for clustering. We introduce an exemplar-based likelihood function that approximates the exact likelihood. This formulation leads to a convex minimization problem and an efficient algorithm with guaranteed convergence to the globally optimal solution. The resulting clustering can be thought of as a probabilistic mapping of the data points to the set of exemplars that minimizes the average distance and the information-theoretic cost of mapping. We present experimental results illustrating the performance of our algorithm and its comparison with the conventional approach to mixture model clustering. 1

2 0.20215751 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering

Author: Francis R. Bach, Zaïd Harchaoui

Abstract: We present a novel linear clustering framework (D IFFRAC) which relies on a linear discriminative cost function and a convex relaxation of a combinatorial optimization problem. The large convex optimization problem is solved through a sequence of lower dimensional singular value decompositions. This framework has several attractive properties: (1) although apparently similar to K-means, it exhibits superior clustering performance than K-means, in particular in terms of robustness to noise. (2) It can be readily extended to non linear clustering if the discriminative cost function is based on positive definite kernels, and can then be seen as an alternative to spectral clustering. (3) Prior information on the partition is easily incorporated, leading to state-of-the-art performance for semi-supervised learning, for clustering or classification. We present empirical evaluations of our algorithms on synthetic and real medium-scale datasets.

3 0.17875537 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs

Author: Ozgur Sumer, Umut Acar, Alexander T. Ihler, Ramgopal R. Mettu

Abstract: Motivated by stochastic systems in which observed evidence and conditional dependencies between states of the network change over time, and certain quantities of interest (marginal distributions, likelihood estimates etc.) must be updated, we study the problem of adaptive inference in tree-structured Bayesian networks. We describe an algorithm for adaptive inference that handles a broad range of changes to the network and is able to maintain marginal distributions, MAP estimates, and data likelihoods in all expected logarithmic time. We give an implementation of our algorithm and provide experiments that show that the algorithm can yield up to two orders of magnitude speedups on answering queries and responding to dynamic changes over the sum-product algorithm. 1

4 0.17347904 80 nips-2007-Ensemble Clustering using Semidefinite Programming

Author: Vikas Singh, Lopamudra Mukherjee, Jiming Peng, Jinhui Xu

Abstract: We consider the ensemble clustering problem where the task is to ‘aggregate’ multiple clustering solutions into a single consolidated clustering that maximizes the shared information among given clustering solutions. We obtain several new results for this problem. First, we note that the notion of agreement under such circumstances can be better captured using an agreement measure based on a 2D string encoding rather than voting strategy based methods proposed in literature. Using this generalization, we first derive a nonlinear optimization model to maximize the new agreement measure. We then show that our optimization problem can be transformed into a strict 0-1 Semidefinite Program (SDP) via novel convexification techniques which can subsequently be relaxed to a polynomial time solvable SDP. Our experiments indicate improvements not only in terms of the proposed agreement measure but also the existing agreement measures based on voting strategies. We discuss evaluations on clustering and image segmentation databases. 1

5 0.1474113 46 nips-2007-Cluster Stability for Finite Samples

Author: Ohad Shamir, Naftali Tishby

Abstract: Over the past few years, the notion of stability in data clustering has received growing attention as a cluster validation criterion in a sample-based framework. However, recent work has shown that as the sample size increases, any clustering model will usually become asymptotically stable. This led to the conclusion that stability is lacking as a theoretical and practical tool. The discrepancy between this conclusion and the success of stability in practice has remained an open question, which we attempt to address. Our theoretical approach is that stability, as used by cluster validation algorithms, is similar in certain respects to measures of generalization in a model-selection framework. In such cases, the model chosen governs the convergence rate of generalization bounds. By arguing that these rates are more important than the sample size, we are led to the prediction that stability-based cluster validation algorithms should not degrade with increasing sample size, despite the asymptotic universal stability. This prediction is substantiated by a theoretical analysis as well as some empirical results. We conclude that stability remains a meaningful cluster validation criterion over finite samples. 1

6 0.12767012 87 nips-2007-Fast Variational Inference for Large-scale Internet Diagnosis

7 0.098732509 84 nips-2007-Expectation Maximization and Posterior Constraints

8 0.090083323 58 nips-2007-Consistent Minimization of Clustering Objective Functions

9 0.081703499 63 nips-2007-Convex Relaxations of Latent Variable Training

10 0.076094419 92 nips-2007-Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations

11 0.075919583 8 nips-2007-A New View of Automatic Relevance Determination

12 0.070814505 193 nips-2007-The Distribution Family of Similarity Distances

13 0.070103042 38 nips-2007-Boosting Algorithms for Maximizing the Soft Margin

14 0.06877172 143 nips-2007-Object Recognition by Scene Alignment

15 0.066456869 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations

16 0.066050708 78 nips-2007-Efficient Principled Learning of Thin Junction Trees

17 0.063707069 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning

18 0.062107891 181 nips-2007-Sparse Overcomplete Latent Variable Decomposition of Counts Data

19 0.061886244 70 nips-2007-Discriminative K-means for Clustering

20 0.059574146 145 nips-2007-On Sparsity and Overcompleteness in Image Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.22), (1, 0.034), (2, -0.132), (3, 0.022), (4, -0.019), (5, -0.149), (6, 0.006), (7, 0.082), (8, -0.08), (9, 0.03), (10, -0.048), (11, 0.054), (12, 0.094), (13, -0.25), (14, -0.077), (15, 0.249), (16, 0.025), (17, 0.018), (18, -0.025), (19, 0.028), (20, -0.046), (21, -0.043), (22, -0.124), (23, 0.065), (24, 0.007), (25, 0.037), (26, -0.008), (27, -0.105), (28, 0.029), (29, 0.048), (30, 0.131), (31, -0.092), (32, 0.019), (33, -0.05), (34, 0.021), (35, 0.076), (36, -0.014), (37, -0.095), (38, -0.025), (39, 0.025), (40, -0.042), (41, 0.067), (42, 0.005), (43, -0.044), (44, 0.02), (45, -0.035), (46, -0.042), (47, -0.085), (48, -0.015), (49, 0.069)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95406139 61 nips-2007-Convex Clustering with Exemplar-Based Models

Author: Danial Lashkari, Polina Golland

Abstract: Clustering is often formulated as the maximum likelihood estimation of a mixture model that explains the data. The EM algorithm widely used to solve the resulting optimization problem is inherently a gradient-descent method and is sensitive to initialization. The resulting solution is a local optimum in the neighborhood of the initial guess. This sensitivity to initialization presents a significant challenge in clustering large data sets into many clusters. In this paper, we present a different approach to approximate mixture fitting for clustering. We introduce an exemplar-based likelihood function that approximates the exact likelihood. This formulation leads to a convex minimization problem and an efficient algorithm with guaranteed convergence to the globally optimal solution. The resulting clustering can be thought of as a probabilistic mapping of the data points to the set of exemplars that minimizes the average distance and the information-theoretic cost of mapping. We present experimental results illustrating the performance of our algorithm and its comparison with the conventional approach to mixture model clustering. 1

2 0.84616208 80 nips-2007-Ensemble Clustering using Semidefinite Programming

Author: Vikas Singh, Lopamudra Mukherjee, Jiming Peng, Jinhui Xu

Abstract: We consider the ensemble clustering problem where the task is to ‘aggregate’ multiple clustering solutions into a single consolidated clustering that maximizes the shared information among given clustering solutions. We obtain several new results for this problem. First, we note that the notion of agreement under such circumstances can be better captured using an agreement measure based on a 2D string encoding rather than voting strategy based methods proposed in literature. Using this generalization, we first derive a nonlinear optimization model to maximize the new agreement measure. We then show that our optimization problem can be transformed into a strict 0-1 Semidefinite Program (SDP) via novel convexification techniques which can subsequently be relaxed to a polynomial time solvable SDP. Our experiments indicate improvements not only in terms of the proposed agreement measure but also the existing agreement measures based on voting strategies. We discuss evaluations on clustering and image segmentation databases. 1

3 0.7800222 46 nips-2007-Cluster Stability for Finite Samples

Author: Ohad Shamir, Naftali Tishby

Abstract: Over the past few years, the notion of stability in data clustering has received growing attention as a cluster validation criterion in a sample-based framework. However, recent work has shown that as the sample size increases, any clustering model will usually become asymptotically stable. This led to the conclusion that stability is lacking as a theoretical and practical tool. The discrepancy between this conclusion and the success of stability in practice has remained an open question, which we attempt to address. Our theoretical approach is that stability, as used by cluster validation algorithms, is similar in certain respects to measures of generalization in a model-selection framework. In such cases, the model chosen governs the convergence rate of generalization bounds. By arguing that these rates are more important than the sample size, we are led to the prediction that stability-based cluster validation algorithms should not degrade with increasing sample size, despite the asymptotic universal stability. This prediction is substantiated by a theoretical analysis as well as some empirical results. We conclude that stability remains a meaningful cluster validation criterion over finite samples. 1

4 0.68137056 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering

Author: Francis R. Bach, Zaïd Harchaoui

Abstract: We present a novel linear clustering framework (D IFFRAC) which relies on a linear discriminative cost function and a convex relaxation of a combinatorial optimization problem. The large convex optimization problem is solved through a sequence of lower dimensional singular value decompositions. This framework has several attractive properties: (1) although apparently similar to K-means, it exhibits superior clustering performance than K-means, in particular in terms of robustness to noise. (2) It can be readily extended to non linear clustering if the discriminative cost function is based on positive definite kernels, and can then be seen as an alternative to spectral clustering. (3) Prior information on the partition is easily incorporated, leading to state-of-the-art performance for semi-supervised learning, for clustering or classification. We present empirical evaluations of our algorithms on synthetic and real medium-scale datasets.

5 0.62941569 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs

Author: Ozgur Sumer, Umut Acar, Alexander T. Ihler, Ramgopal R. Mettu

Abstract: Motivated by stochastic systems in which observed evidence and conditional dependencies between states of the network change over time, and certain quantities of interest (marginal distributions, likelihood estimates etc.) must be updated, we study the problem of adaptive inference in tree-structured Bayesian networks. We describe an algorithm for adaptive inference that handles a broad range of changes to the network and is able to maintain marginal distributions, MAP estimates, and data likelihoods in all expected logarithmic time. We give an implementation of our algorithm and provide experiments that show that the algorithm can yield up to two orders of magnitude speedups on answering queries and responding to dynamic changes over the sum-product algorithm. 1

6 0.57100332 70 nips-2007-Discriminative K-means for Clustering

7 0.50042498 58 nips-2007-Consistent Minimization of Clustering Objective Functions

8 0.44678932 84 nips-2007-Expectation Maximization and Posterior Constraints

9 0.4145278 31 nips-2007-Bayesian Agglomerative Clustering with Coalescents

10 0.40925354 63 nips-2007-Convex Relaxations of Latent Variable Training

11 0.4055064 87 nips-2007-Fast Variational Inference for Large-scale Internet Diagnosis

12 0.39935714 8 nips-2007-A New View of Automatic Relevance Determination

13 0.38092712 139 nips-2007-Nearest-Neighbor-Based Active Learning for Rare Category Detection

14 0.37876457 89 nips-2007-Feature Selection Methods for Improving Protein Structure Prediction with Rosetta

15 0.3719525 78 nips-2007-Efficient Principled Learning of Thin Junction Trees

16 0.35589433 82 nips-2007-Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization

17 0.35231069 211 nips-2007-Unsupervised Feature Selection for Accurate Recommendation of High-Dimensional Image Data

18 0.35196123 4 nips-2007-A Constraint Generation Approach to Learning Stable Linear Dynamical Systems

19 0.34026012 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)

20 0.33795959 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.047), (13, 0.023), (16, 0.013), (21, 0.044), (31, 0.017), (34, 0.014), (35, 0.025), (47, 0.062), (49, 0.01), (83, 0.592), (85, 0.014), (90, 0.055)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99775147 61 nips-2007-Convex Clustering with Exemplar-Based Models

Author: Danial Lashkari, Polina Golland

Abstract: Clustering is often formulated as the maximum likelihood estimation of a mixture model that explains the data. The EM algorithm widely used to solve the resulting optimization problem is inherently a gradient-descent method and is sensitive to initialization. The resulting solution is a local optimum in the neighborhood of the initial guess. This sensitivity to initialization presents a significant challenge in clustering large data sets into many clusters. In this paper, we present a different approach to approximate mixture fitting for clustering. We introduce an exemplar-based likelihood function that approximates the exact likelihood. This formulation leads to a convex minimization problem and an efficient algorithm with guaranteed convergence to the globally optimal solution. The resulting clustering can be thought of as a probabilistic mapping of the data points to the set of exemplars that minimizes the average distance and the information-theoretic cost of mapping. We present experimental results illustrating the performance of our algorithm and its comparison with the conventional approach to mixture model clustering. 1

2 0.99771255 159 nips-2007-Progressive mixture rules are deviation suboptimal

Author: Jean-yves Audibert

Abstract: We consider the learning task consisting in predicting as well as the best function in a finite reference set G up to the smallest possible additive term. If R(g) denotes the generalization error of a prediction function g, under reasonable assumptions on the loss function (typically satisfied by the least square loss when the output is bounded), it is known that the progressive mixture rule g satisfies ˆ log |G| (1) ER(ˆ) ≤ ming∈G R(g) + Cst g , n where n denotes the size of the training set, and E denotes the expectation w.r.t. the training set distribution.This work shows that, surprisingly, for appropriate reference sets G, the deviation convergence rate of the progressive mixture rule is √ no better than Cst / n: it fails to achieve the expected Cst /n. We also provide an algorithm which does not suffer from this drawback, and which is optimal in both deviation and expectation convergence rates. 1

3 0.99684548 110 nips-2007-Learning Bounds for Domain Adaptation

Author: John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, Jennifer Wortman

Abstract: Empirical risk minimization offers well-known learning guarantees when training and test data come from the same domain. In the real world, though, we often wish to adapt a classifier from a source domain with a large amount of training data to different target domain with very little training data. In this work we give uniform convergence bounds for algorithms that minimize a convex combination of source and target empirical risk. The bounds explicitly model the inherent trade-off between training on a large but inaccurate source data set and a small but accurate target training set. Our theory also gives results when we have multiple source domains, each of which may have a different number of instances, and we exhibit cases in which minimizing a non-uniform combination of source risks can achieve much lower target error than standard empirical risk minimization. 1

4 0.99653888 132 nips-2007-Modeling image patches with a directed hierarchy of Markov random fields

Author: Simon Osindero, Geoffrey E. Hinton

Abstract: We describe an efficient learning procedure for multilayer generative models that combine the best aspects of Markov random fields and deep, directed belief nets. The generative models can be learned one layer at a time and when learning is complete they have a very fast inference procedure for computing a good approximation to the posterior distribution in all of the hidden layers. Each hidden layer has its own MRF whose energy function is modulated by the top-down directed connections from the layer above. To generate from the model, each layer in turn must settle to equilibrium given its top-down input. We show that this type of model is good at capturing the statistics of patches of natural images. 1

5 0.99594307 20 nips-2007-Adaptive Embedded Subgraph Algorithms using Walk-Sum Analysis

Author: Venkat Chandrasekaran, Alan S. Willsky, Jason K. Johnson

Abstract: We consider the estimation problem in Gaussian graphical models with arbitrary structure. We analyze the Embedded Trees algorithm, which solves a sequence of problems on tractable subgraphs thereby leading to the solution of the estimation problem on an intractable graph. Our analysis is based on the recently developed walk-sum interpretation of Gaussian estimation. We show that non-stationary iterations of the Embedded Trees algorithm using any sequence of subgraphs converge in walk-summable models. Based on walk-sum calculations, we develop adaptive methods that optimize the choice of subgraphs used at each iteration with a view to achieving maximum reduction in error. These adaptive procedures provide a significant speedup in convergence over stationary iterative methods, and also appear to converge in a larger class of models. 1

6 0.99591362 109 nips-2007-Kernels on Attributed Pointsets with Applications

7 0.94979018 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels

8 0.93791169 38 nips-2007-Boosting Algorithms for Maximizing the Soft Margin

9 0.92203194 120 nips-2007-Linear programming analysis of loopy belief propagation for weighted matching

10 0.92144322 116 nips-2007-Learning the structure of manifolds using random projections

11 0.92091715 54 nips-2007-Computational Equivalence of Fixed Points and No Regret Algorithms, and Convergence to Equilibria

12 0.9205209 40 nips-2007-Bundle Methods for Machine Learning

13 0.91759497 21 nips-2007-Adaptive Online Gradient Descent

14 0.91593075 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering

15 0.90946406 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs

16 0.90945214 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs

17 0.9041431 178 nips-2007-Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains

18 0.90390188 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning

19 0.90289462 46 nips-2007-Cluster Stability for Finite Samples

20 0.8990736 101 nips-2007-How SVMs can estimate quantiles and the median