nips nips2002 nips2002-27 knowledge-graph by maker-knowledge-mining

27 nips-2002-An Impossibility Theorem for Clustering


Source: pdf

Author: Jon M. Kleinberg

Abstract: Although the study of clustering is centered around an intuitively compelling goal, it has been very difficult to develop a unified framework for reasoning about it at a technical level, and profoundly diverse approaches to clustering abound in the research community. Here we suggest a formal perspective on the difficulty in finding such a unification, in the form of an impossibility theorem: for a set of three simple properties, we show that there is no clustering function satisfying all three. Relaxations of these properties expose some of the interesting (and unavoidable) trade-offs at work in well-studied clustering techniques such as single-linkage, sum-of-pairs, k-means, and k-median. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Here we suggest a formal perspective on the difficulty in finding such a unification, in the form of an impossibility theorem: for a set of three simple properties, we show that there is no clustering function satisfying all three. [sent-2, score-0.912]

2 Relaxations of these properties expose some of the interesting (and unavoidable) trade-offs at work in well-studied clustering techniques such as single-linkage, sum-of-pairs, k-means, and k-median. [sent-3, score-0.575]

3 A standard approach is to represent the collection of objects as a set of abstract points, and define distances among the points to represent similarities — the closer the points, the more similar they are. [sent-5, score-0.352]

4 Thus, clustering is centered around an intuitively compelling but vaguely defined goal: given an underlying set of points, partition them into a collection of clusters so that points in the same cluster are close together, while points in different clusters are far apart. [sent-6, score-1.525]

5 Given the scope of the issue, there has been relatively little work aimed at reasoning about clustering independently of any particular algorithm, objective function, or generative data model. [sent-10, score-0.544]

6 In some striking cases, as in Arrow’s celebrated theorem on social choice functions [2], the result is impossibility — there is no solution that simultaneously satisfies a small collection of simple properties. [sent-13, score-0.664]

7 First, as is standard, we define a clustering function to be any function f that takes a set S of n points with pairwise distances between them, and returns a partition of S. [sent-15, score-1.091]

8 (The points in S are not assumed to belong to any ambient space; the pairwise distances are the only data one has about them.) [sent-16, score-0.41]

9 We then consider the effect of requiring the clustering function to obey certain natural properties. [sent-17, score-0.545]

10 None of these properties is redundant, in the sense that it is easy to construct clustering functions satisfying any two of the three. [sent-19, score-0.709]

11 We also show, by way of contrast, that certain natural relaxations of this set of properties are satisfied by versions of well-known clustering functions, including those derived from single-linkage and sum-of-pairs. [sent-20, score-0.666]

12 In particular, we fully characterize the set of possible outputs of a clustering function that satisfies the scale-invariance and consistency properties. [sent-21, score-0.701]

13 How should one interpret an impossibility result in this setting? [sent-22, score-0.319]

14 It indicates a set of basic trade-offs that are inherent in the clustering problem, and offers a way to distinguish between clustering methods based not simply on operational grounds, but on the ways in which they resolve the choices implicit in these trade-offs. [sent-24, score-1.014]

15 As discussed above, the vast majority of approaches to clustering are derived from the application of specific algorithms, the optima of specific objective functions, or the consequences of particular probabilistic generative models for the data. [sent-27, score-0.544]

16 Jardine and Sibson [7] and Puzicha, Hofmann, and Buhmann [12] have considered axiomatic approaches to clustering, although they operate in formalisms quite different from ours, and they do not seek impossibility results. [sent-29, score-0.501]

17 consider properties of cost functions on partitions; these implicitly define clustering functions through the process of choosing a minimum-cost partition. [sent-33, score-0.721]

18 They investigate a particular class of clustering functions that arises if one requires the cost function to decompose into a certain additive form. [sent-34, score-0.651]

19 2 The Impossibility Theorem. A clustering function operates on a set S of n ≥ 2 points and the pairwise distances among them. [sent-40, score-0.888]

20 Since we wish to deal with point sets that do not necessarily belong to an ambient space, we identify the points with the set S = {1, 2, . [sent-41, score-0.241]

21 We then define a distance function to be any function d : S × S → R such that for distinct i, j ∈ S, we have d(i, j) ≥ 0, d(i, j) = 0 if and only if i = j, and d(i, j) = d(j, i). [sent-45, score-0.24]

22 One can optionally restrict attention to distance functions that are metrics by imposing the triangle inequality: d(i, k) ≤ d(i, j) + d(j, k) for all i, j, k ∈ S. [sent-46, score-0.257]
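To make the definitions concrete, here is a minimal Python sketch (not from the paper; the helper name and the dense-matrix representation are my own) that checks whether an n × n matrix satisfies the distance-function conditions above, with the triangle inequality as an optional extra:

```python
import numpy as np

def is_distance_function(d, check_triangle=False, tol=1e-12):
    """Check the conditions above: d is symmetric, zero exactly on the diagonal,
    and positive off the diagonal; optionally also check the triangle inequality."""
    d = np.asarray(d, dtype=float)
    n = d.shape[0]
    if d.shape != (n, n):
        return False
    if not np.allclose(d, d.T, atol=tol):            # d(i, j) = d(j, i)
        return False
    if not np.allclose(np.diag(d), 0.0, atol=tol):   # d(i, i) = 0
        return False
    if np.any(d[~np.eye(n, dtype=bool)] <= 0):       # d(i, j) > 0 for i != j
        return False
    if check_triangle:
        # d(i, k) <= d(i, j) + d(j, k) for all i, j, k
        if not np.all(d[:, None, :] <= d[:, :, None] + d[None, :, :] + tol):
            return False
    return True
```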

23 A clustering function is a function f that takes a distance function d on S and returns a partition Γ of S. [sent-48, score-0.981]

24 We note that, as written, a clustering function is defined only on point sets of a particular size (n); however, all the specific clustering functions we consider here will be defined for all values of n larger than some small base value. [sent-50, score-1.112]

25 Here is a first property one could require of a clustering function. [sent-51, score-0.532]

26 If d is a distance function, we write α·d to denote the distance function in which the distance between i and j is αd(i, j). [sent-52, score-0.458]

27 This is simply the requirement that the clustering function not be sensitive to changes in the units of distance measurement — it should not have a built-in “length scale.” [sent-55, score-0.685]
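As a sketch of what Scale-Invariance asks for (my own code, not the paper's), one can test a candidate clustering function f — here any callable from a distance matrix to a partition, represented as a collection of clusters of point indices — on a particular d and α > 0; the property itself quantifies over all such choices:

```python
import numpy as np

def same_partition(gamma1, gamma2):
    """Compare two partitions given as collections of clusters (sets of indices)."""
    return {frozenset(c) for c in gamma1} == {frozenset(c) for c in gamma2}

def scale_invariant_on(f, d, alpha):
    """True if f(alpha * d) = f(d) for this particular d and alpha > 0."""
    d = np.asarray(d, dtype=float)
    return same_partition(f(d), f(alpha * d))
```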

28 A second property is that the output of the clustering function should be “rich” — every partition of S is a possible output. [sent-56, score-0.818]

29 To state this more compactly, let Range(f ) denote the set of all partitions Γ such that f (d) = Γ for some distance function d. [sent-57, score-0.427]

30 Richness requires that for any desired partition Γ, it should be possible to construct a distance function d on S for which f (d) = Γ. [sent-63, score-0.424]

31 We think of a clustering function as being “consistent” if it exhibits the following behavior: when we shrink distances between points inside a cluster and expand distances between points in different clusters, we get the same result. [sent-65, score-1.239]

32 Let Γ be a partition of S, and d and d′ two distance functions on S. [sent-67, score-0.42]

33 We say that d′ is a Γ-transformation of d if (a) for all i, j ∈ S belonging to the same cluster of Γ, we have d′(i, j) ≤ d(i, j); and (b) for all i, j ∈ S belonging to different clusters of Γ, we have d′(i, j) ≥ d(i, j). [sent-68, score-0.399]
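The Γ-transformation condition translates directly into code; the sketch below (my own helper names, partitions given as collections of index sets) checks conditions (a) and (b) for two distance matrices. Consistency then demands f(d′) = Γ whenever f(d) = Γ and this check passes.

```python
import numpy as np

def cluster_labels(gamma):
    """Map each point index to the id of its cluster in partition gamma."""
    label = {}
    for cid, cluster in enumerate(gamma):
        for i in cluster:
            label[i] = cid
    return label

def is_gamma_transformation(d_prime, d, gamma):
    """Check (a) within-cluster distances only shrink and (b) between-cluster
    distances only grow when moving from d to d_prime."""
    d_prime, d = np.asarray(d_prime, float), np.asarray(d, float)
    label = cluster_labels(gamma)
    n = d.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            same = label[i] == label[j]
            if same and d_prime[i, j] > d[i, j]:        # violates (a)
                return False
            if not same and d_prime[i, j] < d[i, j]:    # violates (b)
                return False
    return True
```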

34 In other words, suppose that the clustering Γ arises from the distance function d. [sent-72, score-0.76]

35 If we now produce d′ by reducing distances within the clusters and enlarging distances between the clusters, then the same clustering Γ should arise from d′. [sent-73, score-1.05]

36 We can now state the impossibility theorem very simply. [sent-74, score-0.448]

37 Theorem 2.1: For each n ≥ 2, there is no clustering function f that satisfies Scale-Invariance, Richness, and Consistency. [sent-76, score-0.545]

38 Before doing this, we reflect on the relation of these properties to one another by showing that there exist natural clustering functions satisfying any two of the three properties. [sent-79, score-0.708]

39 One example is the single-linkage procedure (see e.g. [6]), which in fact defines a family of clustering functions. [sent-82, score-0.507]

40 Intuitively, single-linkage operates by initializing each point as its own cluster, and then repeatedly merging the pair of clusters whose distance to one another (as measured from their closest points of approach) is minimum. [sent-83, score-0.454]

41 The procedure forms the complete graph Gd on S, with edge weights given by d; it then orders the edges of Gd by non-decreasing weight (breaking ties lexicographically), and adds edges one at a time until a specified stopping condition is reached. [sent-85, score-0.464]

42 Let Hd denote the subgraph consisting of all edges that are added before the stopping condition is reached; the connected components of Hd are the clusters. [sent-86, score-0.45]

43 Thus, by choosing a stopping condition for the single-linkage procedure, one obtains a clustering function, which maps the input distance function to the set of connected components that results at the end of the procedure. [sent-87, score-1.05]
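The single-linkage procedure described here is easy to implement with a union-find structure. The sketch below is my own reading of the text: the k-cluster condition stops once k components remain, the distance-r condition only adds edges of weight at most r, and the scale-α condition only adds edges of weight at most α times the maximum pairwise distance; the exact formulations in the paper may differ in detail.

```python
import numpy as np

def single_linkage(d, stop):
    """Single-linkage as described above: sort edges by weight, add them one at a
    time, and return the connected components of the graph H_d built so far.

    `stop` is one of ("k-cluster", k), ("distance-r", r), ("scale-alpha", alpha).
    """
    d = np.asarray(d, dtype=float)
    n = d.shape[0]
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = sorted((d[i, j], i, j) for i in range(n) for j in range(i + 1, n))
    kind, value = stop
    threshold = None
    if kind == "distance-r":
        threshold = value
    elif kind == "scale-alpha":
        threshold = value * max(w for w, _, _ in edges)  # alpha * max pairwise distance

    components = n
    for w, i, j in edges:
        if kind == "k-cluster" and components <= value:
            break
        if threshold is not None and w > threshold:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), set()).add(i)
    return list(clusters.values())
```

For instance, single_linkage(d, ("k-cluster", 3)) returns a partition into at most three clusters; by Theorem 2.2 below, that choice trades Richness for Scale-Invariance and Consistency.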

44 For any two of the three properties in Theorem 2.1, one can choose a single-linkage stopping condition so that the resulting clustering function satisfies these two properties. [sent-89, score-0.916]

45 Here are the three types of stopping conditions we will consider. [sent-90, score-0.264]

46 It is clear that these various stopping conditions qualitatively trade off certain of the properties in Theorem 2.1. [sent-101, score-0.332]

47 Thus, for example, the k-cluster stopping condition does not attempt to produce all possible partitions, while the distance-r stopping condition builds in a fundamental length scale, and hence is not scale-invariant. [sent-103, score-0.678]

48 However, by the appropriate choice of one of these stopping conditions, one can achieve any two of the three properties in Theorem 2.1. [sent-104, score-0.358]

49 Theorem 2.2: (a) For any k ≥ 1, and any n ≥ k, single-linkage with the k-cluster stopping condition satisfies Scale-Invariance and Consistency. [sent-107, score-0.339]

50 (b) For any positive α < 1, and any n ≥ 3, single-linkage with the scale-α stopping condition satisfies Scale-Invariance and Richness. [sent-108, score-0.339]

51 (c) For any r > 0, and any n ≥ 2, single-linkage with the distance-r stopping condition satisfies Richness and Consistency. [sent-109, score-0.339]

52 3 Antichains of Partitions. We now state and prove a strengthening of the impossibility result. [sent-110, score-0.358]

53 We say that a partition Γ′ is a refinement of a partition Γ if for every set C′ ∈ Γ′, there is a set C ∈ Γ such that C′ ⊆ C. [sent-111, score-0.504]

54 Following the terminology of partially ordered sets, we say that a collection of partitions is an antichain if it does not contain two distinct partitions such that one is a refinement of the other. [sent-113, score-0.687]
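These two definitions transcribe directly into code; the sketch below (my own naming, with partitions again given as collections of index sets) is only meant to make the antichain condition concrete:

```python
def is_refinement(gamma_prime, gamma):
    """True if every cluster of gamma_prime is contained in some cluster of gamma."""
    return all(any(set(c) <= set(cc) for cc in gamma) for c in gamma_prime)

def is_antichain(partitions):
    """True if no two distinct partitions in the collection are related by refinement."""
    canon = [frozenset(frozenset(c) for c in p) for p in partitions]
    for a, ca in zip(partitions, canon):
        for b, cb in zip(partitions, canon):
            if ca != cb and is_refinement(a, b):
                return False
    return True
```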

55 For a set of n ≥ 2 points, the collection of all partitions does not form an antichain; thus, Theorem 2.1 follows from the stronger statement below. [sent-114, score-0.278]

56 Theorem 3.1: If a clustering function f satisfies Scale-Invariance and Consistency, then Range(f) is an antichain. [sent-116, score-0.545]

57 For a partition Γ, we say that a distance function d (a, b)-conforms to Γ if, for all pairs of points i, j that belong to the same cluster of Γ, we have d(i, j) ≤ a, while for all pairs of points i, j that belong to different clusters, we have d(i, j) ≥ b. [sent-118, score-1.111]
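Again as a concrete reading (my own sketch, not the paper's code), (a, b)-conformity is a simple pairwise check:

```python
import numpy as np

def ab_conforms(d, gamma, a, b):
    """True if d (a, b)-conforms to gamma: distances within a cluster are at most a,
    distances across clusters are at least b."""
    d = np.asarray(d, dtype=float)
    label = {i: cid for cid, cluster in enumerate(gamma) for i in cluster}
    n = d.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            if label[i] == label[j] and d[i, j] > a:
                return False
            if label[i] != label[j] and d[i, j] < b:
                return False
    return True
```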

58 With respect to a given clustering function f , we say that a pair of positive real numbers (a, b) is Γ-forcing if, for all distance functions d that (a, b)-conform to Γ, we have f (d) = Γ. [sent-119, score-0.819]

59 Let f be a clustering function that satisfies Consistency. [sent-120, score-0.545]

60 We claim that for any partition Γ ∈ Range(f ), there exist positive real numbers a < b such that the pair (a, b) is Γ-forcing. [sent-121, score-0.308]

61 Now, let a be the minimum distance among pairs of points in the same cluster of Γ, and let b be the maximum distance among pairs of points that do not belong to the same cluster of Γ. [sent-123, score-1.166]

62 Now suppose further that the clustering function f satisfies Scale-Invariance, and that there exist distinct partitions Γ0, Γ1 ∈ Range(f) such that Γ0 is a refinement of Γ1. [sent-127, score-0.835]

63 But for points i, j in the same cluster of Γ0 we have d′(i, j) ≤ εb0/a2 < a0, while for points i, j that do not belong to the same cluster of Γ0 we have d′(i, j) ≥ a2·b0/a2 ≥ b0. [sent-135, score-0.644]

64 The proof above uses our assumption that the clustering function f is defined on the set of all distance functions on n points. [sent-138, score-0.775]

65 However, essentially the same proof yields a corresponding impossibility result for clustering functions f that are defined only on metrics, or only on distance functions arising from n points in a Euclidean space of some dimension. [sent-139, score-1.261]

66 To adapt the proof, one need only be careful to choose the constant a2 and distance function d to satisfy the required properties. [sent-140, score-0.264]

67 Combined with Theorem 3.1, this completely characterizes the possible values of Range(f) for clustering functions f that satisfy Scale-Invariance and Consistency. [sent-142, score-0.621]

68 Theorem 3.2: For every antichain of partitions A, there is a clustering function f satisfying Scale-Invariance and Consistency for which Range(f) = A. [sent-144, score-0.97]

69 Given an arbitrary antichain A, it is not clear how to produce a stopping condition for the single-linkage procedure that gives rise to a clustering function f with Range(f ) = A. [sent-146, score-1.021]

70 (Note that the k-cluster stopping condition yields a clustering function whose range is the antichain consisting of all partitions into k sets.) [sent-147, score-1.312]

71 Thus, to prove this result, we use a variant of the sum-of-pairs clustering function. [sent-148, score-0.584]

72 For a partition Γ ∈ A, we write (i, j) ∼ Γ if both i and j belong to the same cluster in Γ. [sent-152, score-0.467]

73 The A-sum-of-pairs function f seeks the partition Γ ∈ A that minimizes the sum of all distances between pairs of points in the same cluster; in other words, it seeks the Γ ∈ A minimizing the objective function Φd(Γ) = Σ_{(i,j)∼Γ} d(i, j). [sent-153, score-0.707]

74 It is crucial that the minimization is only over partitions in A; clearly, if we wished to minimize this objective function over all partitions, we would choose the partition in which each point forms its own cluster. [sent-155, score-0.539]
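A brute-force sketch of the A-sum-of-pairs function as described above (my own code; ties are broken by Python's min, whereas the construction in the paper would need some fixed tie-breaking rule):

```python
import numpy as np

def phi(d, gamma):
    """The objective Phi_d(Gamma): sum of d(i, j) over pairs in the same cluster."""
    return sum(d[i, j] for cluster in gamma for i in cluster for j in cluster if i < j)

def a_sum_of_pairs(d, antichain_A):
    """Return the partition in the antichain A minimizing Phi_d."""
    d = np.asarray(d, dtype=float)
    return min(antichain_A, key=lambda gamma: phi(d, gamma))
```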

75 For any partition Γ ∈ A, construct a distance function d with the following properties: d(i, j) < n⁻³ for every pair of points i, j belonging to the same cluster of Γ, and d(i, j) ≥ 1 for every pair of points i, j belonging to different clusters of Γ. [sent-158, score-1.157]

76 For any partition Γ′, let ∆(Γ′) = Φd′(Γ′) − Φd(Γ′). [sent-163, score-0.257]

77 4 Centroid-Based Clustering and Consistency. In a widely-used approach to clustering, one selects k of the input points as centroids, and then defines clusters by assigning each point in S to its nearest centroid. [sent-166, score-0.253]

78 This overall approach arises both from combinatorial optimization perspectives, where it has roots in facility location problems [9], and from maximum-likelihood methods, where the centroids may represent centers of probability density functions [4, 6]. [sent-168, score-0.331]

79 We show here that for a fairly general class of centroid-based clustering functions, including k-means and k-median, none of the functions in the class satisfies the Consistency property. [sent-169, score-0.591]

80 Specifically, for any natural number k ≥ 2, and any continuous, non-decreasing, and unbounded function g : R+ → R+ , we define the (k, g)-centroid clustering function as follows. [sent-171, score-0.583]

81 Then we define a partition of S into k clusters by assigning each point to the element of T closest to it. [sent-174, score-0.354]

82 The k-median function [9] is obtained by setting g to be the identity function, while the objective function underlying k-means clustering [4, 6] is obtained by setting g(d) = d². [sent-175, score-0.62]
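A brute-force sketch of the (k, g)-centroid function under the reading above (my own code; the sentence formally defining the centroid set T is not among the extracted sentences, so the objective below — choose the k input points minimizing the sum of g applied to each point's distance to its nearest centroid — is an assumption). With g the identity this is the k-median objective; with g(x) = x² it is the k-means-style objective mentioned here.

```python
import itertools
import numpy as np

def kg_centroid(d, k, g):
    """(k, g)-centroid clustering by exhaustive search over centroid sets T."""
    d = np.asarray(d, dtype=float)
    n = d.shape[0]
    best_T, best_cost = None, float("inf")
    for T in itertools.combinations(range(n), k):
        cost = sum(g(min(d[i, t] for t in T)) for i in range(n))
        if cost < best_cost:
            best_T, best_cost = T, cost
    # Assign each point to its nearest centroid in the optimal T.
    clusters = {t: set() for t in best_T}
    for i in range(n):
        clusters[min(best_T, key=lambda t: d[i, t])].add(i)
    return list(clusters.values())
```

The exhaustive search is exponential in k and is meant only to make the definition concrete, not to be an efficient k-median or k-means solver.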

83 Theorem 4.1: For every k ≥ 2 and every function g chosen as above, and for n sufficiently large relative to k, the (k, g)-centroid clustering function does not satisfy the Consistency property. [sent-177, score-0.693]

84 The distance between points in X is r, the distance between points in Y is ε < r, and the distance from a point in X to a point in Y is r + δ, for a small number δ > 0. [sent-181, score-0.658]

85 By choosing γ, r, ε, and δ appropriately, the optimal choice of k = 2 centroids will consist of one point from X and one from Y , and the resulting partition Γ will have clusters X and Y . [sent-182, score-0.505]

86 Now, suppose we divide X into sets X0 and X1 of equal size, and reduce the distances between points in the same Xi to be r′ < r (keeping all other distances the same). [sent-183, score-0.442]

87 This can be done, for r′ small enough, so that the optimal choice of two centroids will now consist of one point from each Xi, yielding a different partition of S. [sent-184, score-0.345]

88 5 Relaxing the Properties. In addition to looking for clustering functions that satisfy subsets of the basic properties, we can also study the effect of relaxing the properties themselves. [sent-186, score-0.723]

89 As another example, it is interesting to note that single-linkage with the distance-r stopping condition satisfies a natural relaxation of Scale-Invariance: if α > 1, then f(α · d) is a refinement of f(d). [sent-189, score-0.378]

90 Let f be a clustering function, and d a distance function such that f (d) = Γ. [sent-191, score-0.685]

91 If we reduce distances within clusters and expand distances between clusters, Consistency requires that f output the same partition Γ. [sent-192, score-0.651]

92 But one could imagine requiring something less: perhaps changing distances this way should be allowed to create additional sub-structure, leading to a new partition in which each cluster is a subset of one of the original clusters. [sent-193, score-0.514]

93 Theorem 2.1 still holds in this setting: there is no clustering function that satisfies Scale-Invariance, Richness, and Refinement-Consistency. [sent-196, score-0.545]

94 If Γ∗ denotes the partition in which each point forms its own cluster, then there exist clustering functions f that satisfy Scale-Invariance and Refinement-Consistency, and for which Range(f) consists of all partitions except Γ∗. [sent-202, score-0.858]

95 (One example is single-linkage with the distance-(αδ) stopping condition, where δ = min_{i,j} d(i, j) is the minimum inter-point distance, and α ≥ 1.) [sent-203, score-0.545]

96 Such functions f, in addition to Scale-Invariance and Refinement-Consistency, thus satisfy a kind of Near-Richness property: one can obtain every partition as output except for a single, trivial partition. [sent-204, score-0.362]

97 It is in this sense that our impossibility result for Refinement-Consistency differs from Theorem 2.1. [sent-205, score-0.319]

98 It is possible to construct clustering functions f that satisfy this even weaker variant of Consistency, together with Scale-Invariance and Richness. [sent-209, score-0.67]

99 Roberts, “An impossibility result in axiomatic location theory,” Mathematics of Operations Research 21(1996). [sent-228, score-0.535]

100 Giles, “Social choice theory and recommender systems: Analysis of the axiomatic foundations of collaborative filtering,” Proc. [sent-250, score-0.256]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('clustering', 0.507), ('impossibility', 0.319), ('stopping', 0.264), ('partition', 0.22), ('partitions', 0.212), ('axiomatic', 0.182), ('cluster', 0.159), ('consistency', 0.156), ('distance', 0.14), ('antichain', 0.137), ('richness', 0.137), ('distances', 0.135), ('clusters', 0.134), ('theorem', 0.129), ('points', 0.119), ('satis', 0.114), ('nement', 0.111), ('centroids', 0.099), ('jardine', 0.091), ('relaxations', 0.091), ('sibson', 0.091), ('erent', 0.09), ('belong', 0.088), ('di', 0.085), ('condition', 0.075), ('es', 0.075), ('facility', 0.068), ('properties', 0.068), ('collection', 0.066), ('social', 0.064), ('functions', 0.06), ('re', 0.06), ('satisfy', 0.054), ('range', 0.052), ('pairs', 0.052), ('satisfying', 0.048), ('edges', 0.048), ('collaborative', 0.048), ('arrow', 0.048), ('puzicha', 0.048), ('arises', 0.046), ('kalai', 0.046), ('papadimitriou', 0.046), ('pennock', 0.04), ('vempala', 0.04), ('vetta', 0.04), ('prove', 0.039), ('relaxation', 0.039), ('function', 0.038), ('pair', 0.038), ('objective', 0.037), ('let', 0.037), ('say', 0.036), ('hansen', 0.036), ('subgraph', 0.036), ('belonging', 0.035), ('location', 0.034), ('pairwise', 0.034), ('seeks', 0.034), ('compelling', 0.034), ('ambient', 0.034), ('relaxing', 0.034), ('uni', 0.034), ('intuitively', 0.033), ('among', 0.032), ('choose', 0.032), ('roberts', 0.032), ('proof', 0.03), ('suppose', 0.029), ('ne', 0.029), ('hd', 0.029), ('ties', 0.029), ('triangle', 0.029), ('wiley', 0.028), ('every', 0.028), ('metrics', 0.028), ('award', 0.028), ('gd', 0.028), ('diverse', 0.028), ('buhmann', 0.028), ('consisting', 0.027), ('ltering', 0.027), ('uniquely', 0.027), ('expand', 0.027), ('ect', 0.027), ('construct', 0.026), ('arising', 0.026), ('choice', 0.026), ('choosing', 0.026), ('claim', 0.025), ('exist', 0.025), ('property', 0.025), ('combinatorial', 0.024), ('collections', 0.024), ('distinct', 0.024), ('de', 0.024), ('divide', 0.024), ('none', 0.024), ('operates', 0.023), ('together', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 27 nips-2002-An Impossibility Theorem for Clustering

Author: Jon M. Kleinberg

Abstract: Although the study of clustering is centered around an intuitively compelling goal, it has been very difficult to develop a unified framework for reasoning about it at a technical level, and profoundly diverse approaches to clustering abound in the research community. Here we suggest a formal perspective on the difficulty in finding such a unification, in the form of an impossibility theorem: for a set of three simple properties, we show that there is no clustering function satisfying all three. Relaxations of these properties expose some of the interesting (and unavoidable) trade-offs at work in well-studied clustering techniques such as single-linkage, sum-of-pairs, k-means, and k-median. 1

2 0.20406401 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information

Author: Eric P. Xing, Michael I. Jordan, Stuart Russell, Andrew Y. Ng

Abstract: Many algorithms rely critically on being given a good metric over their inputs. For instance, data can often be clustered in many “plausible” ways, and if a clustering algorithm such as K-means initially fails to find one that is meaningful to a user, the only recourse may be for the user to manually tweak the metric until sufficiently good clusters are found. For these and other applications requiring good metrics, it is desirable that we provide a more systematic way for users to indicate what they consider “similar.” For instance, we may ask them to provide examples. In this paper, we present an algorithm that, given examples of similar (and, , learns a distance metric over if desired, dissimilar) pairs of points in that respects these relationships. Our method is based on posing metric learning as a convex optimization problem, which allows us to give efficient, local-optima-free algorithms. We also demonstrate empirically that the learned metrics can be used to significantly improve clustering performance. £ ¤¢ £ ¥¢

3 0.1827646 53 nips-2002-Clustering with the Fisher Score

Author: Koji Tsuda, Motoaki Kawanabe, Klaus-Robert Müller

Abstract: Recently the Fisher score (or the Fisher kernel) is increasingly used as a feature extractor for classification problems. The Fisher score is a vector of parameter derivatives of loglikelihood of a probabilistic model. This paper gives a theoretical analysis about how class information is preserved in the space of the Fisher score, which turns out that the Fisher score consists of a few important dimensions with class information and many nuisance dimensions. When we perform clustering with the Fisher score, K-Means type methods are obviously inappropriate because they make use of all dimensions. So we will develop a novel but simple clustering algorithm specialized for the Fisher score, which can exploit important dimensions. This algorithm is successfully tested in experiments with artificial data and real data (amino acid sequences).

4 0.16685344 98 nips-2002-Going Metric: Denoising Pairwise Data

Author: Volker Roth, Julian Laub, Klaus-Robert Müller, Joachim M. Buhmann

Abstract: Pairwise data in empirical sciences typically violate metricity, either due to noise or due to fallible estimates, and therefore are hard to analyze by conventional machine learning technology. In this paper we therefore study ways to work around this problem. First, we present an alternative embedding to multi-dimensional scaling (MDS) that allows us to apply a variety of classical machine learning and signal processing algorithms. The class of pairwise grouping algorithms which share the shift-invariance property is statistically invariant under this embedding procedure, leading to identical assignments of objects to clusters. Based on this new vectorial representation, denoising methods are applied in a second step. Both steps provide a theoretically well controlled setup to translate from pairwise data to the respective denoised metric representation. We demonstrate the practical usefulness of our theoretical reasoning by discovering structure in protein sequence data bases, visibly improving performance upon existing automatic methods. 1

5 0.15868884 90 nips-2002-Feature Selection in Mixture-Based Clustering

Author: Martin H. Law, Anil K. Jain, Mário Figueiredo

Abstract: There exist many approaches to clustering, but the important issue of feature selection, i.e., selecting the data attributes that are relevant for clustering, is rarely addressed. Feature selection for clustering is difficult due to the absence of class labels. We propose two approaches to feature selection in the context of Gaussian mixture-based clustering. In the first one, instead of making hard selections, we estimate feature saliencies. An expectation-maximization (EM) algorithm is derived for this task. The second approach extends Koller and Sahami’s mutual-informationbased feature relevance criterion to the unsupervised case. Feature selection is then carried out by a backward search scheme. This scheme can be classified as a “wrapper”, since it wraps mixture estimation in an outer layer that performs feature selection. Experimental results on synthetic and real data show that both methods have promising performance.

6 0.1551163 188 nips-2002-Stability-Based Model Selection

7 0.13145724 83 nips-2002-Extracting Relevant Structures with Side Information

8 0.12950294 52 nips-2002-Cluster Kernels for Semi-Supervised Learning

9 0.11667128 8 nips-2002-A Maximum Entropy Approach to Collaborative Filtering in Dynamic, Sparse, High-Dimensional Domains

10 0.099692047 28 nips-2002-An Information Theoretic Approach to the Functional Classification of Neurons

11 0.093521543 100 nips-2002-Half-Lives of EigenFlows for Spectral Clustering

12 0.087415881 40 nips-2002-Bayesian Models of Inductive Generalization

13 0.083054587 142 nips-2002-Maximum Likelihood and the Information Bottleneck

14 0.079794683 56 nips-2002-Concentration Inequalities for the Missing Mass and for Histogram Rule Error

15 0.079524465 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

16 0.076205678 80 nips-2002-Exact MAP Estimates by (Hyper)tree Agreement

17 0.075686112 87 nips-2002-Fast Transformation-Invariant Factor Analysis

18 0.069956839 131 nips-2002-Learning to Classify Galaxy Shapes Using the EM Algorithm

19 0.06130635 72 nips-2002-Dyadic Classification Trees via Structural Risk Minimization

20 0.058676317 4 nips-2002-A Differential Semantics for Jointree Algorithms


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.186), (1, -0.078), (2, 0.004), (3, 0.007), (4, -0.102), (5, 0.095), (6, -0.087), (7, -0.242), (8, -0.219), (9, 0.209), (10, 0.141), (11, 0.067), (12, -0.118), (13, 0.069), (14, -0.106), (15, -0.101), (16, 0.047), (17, -0.157), (18, -0.058), (19, 0.084), (20, -0.001), (21, -0.015), (22, 0.041), (23, 0.149), (24, 0.12), (25, 0.034), (26, 0.028), (27, 0.094), (28, 0.013), (29, -0.11), (30, -0.021), (31, 0.037), (32, 0.099), (33, 0.108), (34, 0.125), (35, 0.009), (36, 0.025), (37, -0.077), (38, 0.0), (39, -0.078), (40, 0.009), (41, -0.031), (42, 0.064), (43, -0.034), (44, -0.03), (45, 0.061), (46, -0.044), (47, -0.019), (48, -0.015), (49, -0.075)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98157084 27 nips-2002-An Impossibility Theorem for Clustering

Author: Jon M. Kleinberg

Abstract: Although the study of clustering is centered around an intuitively compelling goal, it has been very difficult to develop a unified framework for reasoning about it at a technical level, and profoundly diverse approaches to clustering abound in the research community. Here we suggest a formal perspective on the difficulty in finding such a unification, in the form of an impossibility theorem: for a set of three simple properties, we show that there is no clustering function satisfying all three. Relaxations of these properties expose some of the interesting (and unavoidable) trade-offs at work in well-studied clustering techniques such as single-linkage, sum-of-pairs, k-means, and k-median. 1

2 0.78513491 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information

Author: Eric P. Xing, Michael I. Jordan, Stuart Russell, Andrew Y. Ng

Abstract: Many algorithms rely critically on being given a good metric over their inputs. For instance, data can often be clustered in many “plausible” ways, and if a clustering algorithm such as K-means initially fails to find one that is meaningful to a user, the only recourse may be for the user to manually tweak the metric until sufficiently good clusters are found. For these and other applications requiring good metrics, it is desirable that we provide a more systematic way for users to indicate what they consider “similar.” For instance, we may ask them to provide examples. In this paper, we present an algorithm that, given examples of similar (and, , learns a distance metric over if desired, dissimilar) pairs of points in that respects these relationships. Our method is based on posing metric learning as a convex optimization problem, which allows us to give efficient, local-optima-free algorithms. We also demonstrate empirically that the learned metrics can be used to significantly improve clustering performance. £ ¤¢ £ ¥¢

3 0.70193499 188 nips-2002-Stability-Based Model Selection

Author: Tilman Lange, Mikio L. Braun, Volker Roth, Joachim M. Buhmann

Abstract: Model selection is linked to model assessment, which is the problem of comparing different models, or model parameters, for a specific learning task. For supervised learning, the standard practical technique is crossvalidation, which is not applicable for semi-supervised and unsupervised settings. In this paper, a new model assessment scheme is introduced which is based on a notion of stability. The stability measure yields an upper bound to cross-validation in the supervised case, but extends to semi-supervised and unsupervised problems. In the experimental part, the performance of the stability measure is studied for model order selection in comparison to standard techniques in this area.

4 0.63446009 53 nips-2002-Clustering with the Fisher Score

Author: Koji Tsuda, Motoaki Kawanabe, Klaus-Robert Müller

Abstract: Recently the Fisher score (or the Fisher kernel) is increasingly used as a feature extractor for classification problems. The Fisher score is a vector of parameter derivatives of loglikelihood of a probabilistic model. This paper gives a theoretical analysis about how class information is preserved in the space of the Fisher score, which turns out that the Fisher score consists of a few important dimensions with class information and many nuisance dimensions. When we perform clustering with the Fisher score, K-Means type methods are obviously inappropriate because they make use of all dimensions. So we will develop a novel but simple clustering algorithm specialized for the Fisher score, which can exploit important dimensions. This algorithm is successfully tested in experiments with artificial data and real data (amino acid sequences).

5 0.61718011 98 nips-2002-Going Metric: Denoising Pairwise Data

Author: Volker Roth, Julian Laub, Klaus-Robert Müller, Joachim M. Buhmann

Abstract: Pairwise data in empirical sciences typically violate metricity, either due to noise or due to fallible estimates, and therefore are hard to analyze by conventional machine learning technology. In this paper we therefore study ways to work around this problem. First, we present an alternative embedding to multi-dimensional scaling (MDS) that allows us to apply a variety of classical machine learning and signal processing algorithms. The class of pairwise grouping algorithms which share the shift-invariance property is statistically invariant under this embedding procedure, leading to identical assignments of objects to clusters. Based on this new vectorial representation, denoising methods are applied in a second step. Both steps provide a theoretically well controlled setup to translate from pairwise data to the respective denoised metric representation. We demonstrate the practical usefulness of our theoretical reasoning by discovering structure in protein sequence data bases, visibly improving performance upon existing automatic methods. 1

6 0.60730648 8 nips-2002-A Maximum Entropy Approach to Collaborative Filtering in Dynamic, Sparse, High-Dimensional Domains

7 0.52655053 100 nips-2002-Half-Lives of EigenFlows for Spectral Clustering

8 0.52576405 83 nips-2002-Extracting Relevant Structures with Side Information

9 0.45379546 107 nips-2002-Identity Uncertainty and Citation Matching

10 0.40873891 142 nips-2002-Maximum Likelihood and the Information Bottleneck

11 0.38037443 40 nips-2002-Bayesian Models of Inductive Generalization

12 0.37426785 131 nips-2002-Learning to Classify Galaxy Shapes Using the EM Algorithm

13 0.35094297 90 nips-2002-Feature Selection in Mixture-Based Clustering

14 0.34906536 52 nips-2002-Cluster Kernels for Semi-Supervised Learning

15 0.34039247 28 nips-2002-An Information Theoretic Approach to the Functional Classification of Neurons

16 0.32862973 56 nips-2002-Concentration Inequalities for the Missing Mass and for Histogram Rule Error

17 0.3280952 34 nips-2002-Artefactual Structure from Least-Squares Multidimensional Scaling

18 0.32346442 4 nips-2002-A Differential Semantics for Jointree Algorithms

19 0.30386701 87 nips-2002-Fast Transformation-Invariant Factor Analysis

20 0.26021516 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(11, 0.027), (14, 0.03), (23, 0.026), (34, 0.211), (42, 0.053), (54, 0.138), (55, 0.04), (67, 0.034), (68, 0.019), (74, 0.139), (92, 0.085), (98, 0.111)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.86956668 123 nips-2002-Learning Attractor Landscapes for Learning Motor Primitives

Author: Auke J. Ijspeert, Jun Nakanishi, Stefan Schaal

Abstract: Many control problems take place in continuous state-action spaces, e.g., as in manipulator robotics, where the control objective is often defined as finding a desired trajectory that reaches a particular goal state. While reinforcement learning offers a theoretical framework to learn such control policies from scratch, its applicability to higher dimensional continuous state-action spaces remains rather limited to date. Instead of learning from scratch, in this paper we suggest to learn a desired complex control policy by transforming an existing simple canonical control policy. For this purpose, we represent canonical policies in terms of differential equations with well-defined attractor properties. By nonlinearly transforming the canonical attractor dynamics using techniques from nonparametric regression, almost arbitrary new nonlinear policies can be generated without losing the stability properties of the canonical system. We demonstrate our techniques in the context of learning a set of movement skills for a humanoid robot from demonstrations of a human teacher. Policies are acquired rapidly, and, due to the properties of well formulated differential equations, can be re-used and modified on-line under dynamic changes of the environment. The linear parameterization of nonparametric regression moreover lends itself to recognize and classify previously learned movement skills. Evaluations in simulations and on an actual 30 degree-offreedom humanoid robot exemplify the feasibility and robustness of our approach. 1

same-paper 2 0.84115815 27 nips-2002-An Impossibility Theorem for Clustering

Author: Jon M. Kleinberg

Abstract: Although the study of clustering is centered around an intuitively compelling goal, it has been very difficult to develop a unified framework for reasoning about it at a technical level, and profoundly diverse approaches to clustering abound in the research community. Here we suggest a formal perspective on the difficulty in finding such a unification, in the form of an impossibility theorem: for a set of three simple properties, we show that there is no clustering function satisfying all three. Relaxations of these properties expose some of the interesting (and unavoidable) trade-offs at work in well-studied clustering techniques such as single-linkage, sum-of-pairs, k-means, and k-median. 1

3 0.73273939 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond

Author: Bernd Fischer, Johann Schumann, Wray Buntine, Alexander G. Gray

Abstract: Machine learning has reached a point where many probabilistic methods can be understood as variations, extensions and combinations of a much smaller set of abstract themes, e.g., as different instances of the EM algorithm. This enables the systematic derivation of algorithms customized for different models. Here, we describe the AUTO BAYES system which takes a high-level statistical model specification, uses powerful symbolic techniques based on schema-based program synthesis and computer algebra to derive an efficient specialized algorithm for learning that model, and generates executable code implementing that algorithm. This capability is far beyond that of code collections such as Matlab toolboxes or even tools for model-independent optimization such as BUGS for Gibbs sampling: complex new algorithms can be generated without new programming, algorithms can be highly specialized and tightly crafted for the exact structure of the model and data, and efficient and commented code can be generated for different languages or systems. We present automatically-derived algorithms ranging from closed-form solutions of Bayesian textbook problems to recently-proposed EM algorithms for clustering, regression, and a multinomial form of PCA. 1 Automatic Derivation of Statistical Algorithms Overview. We describe a symbolic program synthesis system which works as a “statistical algorithm compiler:” it compiles a statistical model specification into a custom algorithm design and from that further down into a working program implementing the algorithm design. This system, AUTO BAYES, can be loosely thought of as “part theorem prover, part Mathematica, part learning textbook, and part Numerical Recipes.” It provides much more flexibility than a fixed code repository such as a Matlab toolbox, and allows the creation of efficient algorithms which have never before been implemented, or even written down. AUTO BAYES is intended to automate the more routine application of complex methods in novel contexts. For example, recent multinomial extensions to PCA [2, 4] can be derived in this way. The algorithm design problem. Given a dataset and a task, creating a learning method can be characterized by two main questions: 1. What is the model? 2. What algorithm will optimize the model parameters? The statistical algorithm (i.e., a parameter optimization algorithm for the statistical model) can then be implemented manually. The system in this paper answers the algorithm question given that the user has chosen a model for the data,and continues through to implementation. Performing this task at the state-of-the-art level requires an intertwined meld of probability theory, computational mathematics, and software engineering. However, a number of factors unite to allow us to solve the algorithm design problem computationally: 1. The existence of fundamental building blocks (e.g., standardized probability distributions, standard optimization procedures, and generic data structures). 2. The existence of common representations (i.e., graphical models [3, 13] and program schemas). 3. The formalization of schema applicability constraints as guards. 1 The challenges of algorithm design. The design problem has an inherently combinatorial nature, since subparts of a function may be optimized recursively and in different ways. It also involves the use of new data structures or approximations to gain performance. 
As the research in statistical algorithms advances, its creative focus should move beyond the ultimately mechanical aspects and towards extending the abstract applicability of already existing schemas (algorithmic principles like EM), improving schemas in ways that generalize across anything they can be applied to, and inventing radically new schemas. 2 Combining Schema-based Synthesis and Bayesian Networks Statistical Models. Externally, AUTO BAYES has the look and feel of 2 const int n_points as ’nr. of data points’ a compiler. Users specify their model 3 with 0 < n_points; 4 const int n_classes := 3 as ’nr. classes’ of interest in a high-level specification 5 with 0 < n_classes language (as opposed to a program6 with n_classes << n_points; ming language). The figure shows the 7 double phi(1..n_classes) as ’weights’ specification of the mixture of Gaus8 with 1 = sum(I := 1..n_classes, phi(I)); 9 double mu(1..n_classes); sians example used throughout this 9 double sigma(1..n_classes); paper.2 Note the constraint that the 10 int c(1..n_points) as ’class labels’; sum of the class probabilities must 11 c ˜ disc(vec(I := 1..n_classes, phi(I))); equal one (line 8) along with others 12 data double x(1..n_points) as ’data’; (lines 3 and 5) that make optimization 13 x(I) ˜ gauss(mu(c(I)), sigma(c(I))); of the model well-defined. Also note 14 max pr(x| phi,mu,sigma ) wrt phi,mu,sigma ; the ability to specify assumptions of the kind in line 6, which may be used by some algorithms. The last line specifies the goal inference task: maximize the conditional probability pr with respect to the parameters , , and . Note that moving the parameters across to the left of the conditioning bar converts this from a maximum likelihood to a maximum a posteriori problem. 1 model mog as ’Mixture of Gaussians’; ¡   £  £  £ §¤¢ £ © ¨ ¦ ¥ ©   ¡     ¡ £ £ £ ¨ Computational logic and theorem proving. Internally, AUTO BAYES uses a class of techniques known as computational logic which has its roots in automated theorem proving. AUTO BAYES begins with an initial goal and a set of initial assertions, or axioms, and adds new assertions, or theorems, by repeated application of the axioms, until the goal is proven. In our context, the goal is given by the input model; the derived algorithms are side effects of constructive theorems proving the existence of algorithms for the goal. 1 Schema guards vary widely; for example, compare Nead-Melder simplex or simulated annealing (which require only function evaluation), conjugate gradient (which require both Jacobian and Hessian), EM and its variational extension [6] (which require a latent-variable structure model). 2 Here, keywords have been underlined and line numbers have been added for reference in the text. The as-keyword allows annotations to variables which end up in the generated code’s comments. Also, n classes has been set to three (line 4), while n points is left unspecified. The class variable and single data variable are vectors, which defines them as i.i.d. Computer algebra. The first core element which makes automatic algorithm derivation feasible is the fact that we can mechanize the required symbol manipulation, using computer algebra methods. General symbolic differentiation and expression simplification are capabilities fundamental to our approach. AUTO BAYES contains a computer algebra engine using term rewrite rules which are an efficient mechanism for substitution of equal quantities or expressions and thus well-suited for this task.3 Schema-based synthesis. 
The computational cost of full-blown theorem proving grinds simple tasks to a halt while elementary and intermediate facts are reinvented from scratch. To achieve the scale of deduction required by algorithm derivation, we thus follow a schema-based synthesis technique which breaks away from strict theorem proving. Instead, we formalize high-level domain knowledge, such as the general EM strategy, as schemas. A schema combines a generic code fragment with explicitly specified preconditions which describe the applicability of the code fragment. The second core element which makes automatic algorithm derivation feasible is the fact that we can use Bayesian networks to efficiently encode the preconditions of complex algorithms such as EM. First-order logic representation of Bayesian netNclasses works. A first-order logic representation of Bayesian µ σ networks was developed by Haddawy [7]. In this framework, random variables are represented by functor symbols and indexes (i.e., specific instances φ x c of i.i.d. vectors) are represented as functor arguments. discrete gauss Nclasses Since unknown index values can be represented by Npoints implicitly universally quantified Prolog variables, this approach allows a compact encoding of networks involving i.i.d. variables or plates [3]; the figure shows the initial network for our running example. Moreover, such networks correspond to backtrack-free datalog programs, allowing the dependencies to be efficiently computed. We have extended the framework to work with non-ground probability queries since we seek to determine probabilities over entire i.i.d. vectors and matrices. Tests for independence on these indexed Bayesian networks are easily developed in Lauritzen’s framework which uses ancestral sets and set separation [9] and is more amenable to a theorem prover than the double negatives of the more widely known d-separation criteria. Given a Bayesian network, some probabilities can easily be extracted by enumerating the component probabilities at each node: § ¥ ¨¦¡ ¡ ¢© Lemma 1. Let be sets of variables over a Bayesian network with . Then descendents and parents hold 4 in the corresponding dependency graph iff the following probability statement holds: £ ¤  ¡ parents B % % 9 C0A@ ! 9  @8 § ¥   ¢   2 ' % % 310  parents    ©¢   £ ¡ !    ' % #!  

4 0.72355932 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks

Author: Christopher M. Bishop, David Spiegelhalter, John Winn

Abstract: In recent years variational methods have become a popular tool for approximate inference and learning in a wide variety of probabilistic models. For each new application, however, it is currently necessary first to derive the variational update equations, and then to implement them in application-specific code. Each of these steps is both time consuming and error prone. In this paper we describe a general purpose inference engine called VIBES (‘Variational Inference for Bayesian Networks’) which allows a wide variety of probabilistic models to be implemented and solved variationally without recourse to coding. New models are specified either through a simple script or via a graphical interface analogous to a drawing package. VIBES then automatically generates and solves the variational equations. We illustrate the power and flexibility of VIBES using examples from Bayesian mixture modelling. 1

5 0.72348416 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation

Author: Peter Meinicke, Thorsten Twellmann, Helge Ritter

Abstract: We propose a framework for classifier design based on discriminative densities for representation of the differences of the class-conditional distributions in a way that is optimal for classification. The densities are selected from a parametrized set by constrained maximization of some objective function which measures the average (bounded) difference, i.e. the contrast between discriminative densities. We show that maximization of the contrast is equivalent to minimization of an approximation of the Bayes risk. Therefore using suitable classes of probability density functions, the resulting maximum contrast classifiers (MCCs) can approximate the Bayes rule for the general multiclass case. In particular for a certain parametrization of the density functions we obtain MCCs which have the same functional form as the well-known Support Vector Machines (SVMs). We show that MCC-training in general requires some nonlinear optimization but under certain conditions the problem is concave and can be tackled by a single linear program. We indicate the close relation between SVM- and MCC-training and in particular we show that Linear Programming Machines can be viewed as an approximate realization of MCCs. In the experiments on benchmark data sets, the MCC shows a competitive classification performance.

6 0.72328317 53 nips-2002-Clustering with the Fisher Score

7 0.72323453 89 nips-2002-Feature Selection by Maximum Marginal Diversity

8 0.72263384 2 nips-2002-A Bilinear Model for Sparse Coding

9 0.72194648 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition

10 0.72192174 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture

11 0.71984208 10 nips-2002-A Model for Learning Variance Components of Natural Images

12 0.71783078 83 nips-2002-Extracting Relevant Structures with Side Information

13 0.71753848 74 nips-2002-Dynamic Structure Super-Resolution

14 0.71583706 3 nips-2002-A Convergent Form of Approximate Policy Iteration

15 0.71566385 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

16 0.71501952 52 nips-2002-Cluster Kernels for Semi-Supervised Learning

17 0.71467328 175 nips-2002-Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games

18 0.71284705 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction

19 0.71239215 122 nips-2002-Learning About Multiple Objects in Images: Factorial Learning without Factorial Search

20 0.71236038 124 nips-2002-Learning Graphical Models with Mercer Kernels