jmlr jmlr2010 jmlr2010-19 knowledge-graph by maker-knowledge-mining

19 jmlr-2010-Characterization, Stability and Convergence of Hierarchical Clustering Methods


Source: pdf

Author: Gunnar Carlsson, Facundo Mémoli

Abstract: We study hierarchical clustering schemes under an axiomatic view. We show that within this framework, one can prove a theorem analogous to one of Kleinberg (2002), in which one obtains an existence and uniqueness theorem instead of a non-existence result. We explore further properties of this unique scheme: stability and convergence are established. We represent dendrograms as ultrametric spaces and use tools from metric geometry, namely the Gromov-Hausdorff distance, to quantify the degree to which perturbations in the input metric space affect the result of hierarchical methods. Keywords: clustering, hierarchical clustering, stability of clustering, Gromov-Hausdorff distance

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We represent dendrograms as ultrametric spaces and use tools from metric geometry, namely the Gromov-Hausdorff distance, to quantify the degree to which perturbations in the input metric space affect the result of hierarchical methods. [sent-8, score-1.392]

2 Standard clustering methods take as input a finite metric space (X, d) and output a partition of X. [sent-21, score-0.426]

3 Fix a standard clustering method f and a metric space (X, d) and let f(X, d) = Π ∈ P(X). [sent-27, score-0.425]

4 Hierarchical methods take as input a finite metric space (X, d) and output a hierarchical family of partitions of X. [sent-50, score-0.374]

5 For this reason we opt to carry out our analysis under the model that hierarchical methods take as input a finite metric space X and output a proximity dendrogram over X, see Remark 3. [sent-57, score-0.674]

6 We remind the reader that we are using the term standard clustering methods to refer to procedures that take a finite metric space as input and output a fixed single partition of the metric space. [sent-58, score-0.656]

7 The unique HC method characterized by our theorem turns out to be single linkage hierarchical clustering. [sent-61, score-0.391]

8 The curvature conditions can be formulated in terms of properties of triangles within the metric spaces, and the most extreme of these properties is that embodied in ultrametric spaces. [sent-69, score-0.547]

9 A second idea of Gromov’s is to make the collection of all metric spaces into its own metric space, and the resulting metric gives a very useful and natural way to distinguish between metric spaces (Gromov, 2007). [sent-70, score-1.104]

10 This metric is known as the Gromov-Hausdorff distance and its restriction to the subclass of ultrametric spaces is therefore a very natural object to study. [sent-71, score-0.666]

11 Since input data are modelled as finite metric spaces, and the output of hierarchical methods can be regarded as finite ultrametric spaces, the Gromov-Hausdorff distance provides a natural tool for studying variability or perturbation of the inputs and outputs of hierarchical clustering methods. [sent-74, score-0.929]

12 We formalize this concept by equivalently representing dendrograms as ultrametrics and then computing the Gromov-Hausdorff distance between the resulting metrics. [sent-80, score-0.418]

13 These results make precise the fact that for compact metric spaces X, clustering the finite subsets of X yields a collection of dendrograms which ultimately converge to the dendrogram for X. [sent-86, score-1.214]

14 In order for this to happen, one needs the metric on the ultrametric spaces as well as control of the behavior of the clustering construction with respect to the Gromov-Hausdorff distance, which is what Proposition 2 provides. [sent-87, score-0.793]

15 3 discusses the representation of dendrograms as ultrametric spaces and establishes the equivalence of both representations; and §3. [sent-101, score-0.832]

16 5 delves into the issue of constructing a notion of distance between dendrograms which is based on the equivalence of dendrograms and ultrametrics; §3. [sent-102, score-0.909]

17 A metric space (X, u) is an ultrametric space if and only if for all x, x′, x″ ∈ X, max(u(x, x′), u(x′, x″)) ≥ u(x, x″). (1) [sent-117, score-0.593]

18 Ultrametric spaces are therefore metric spaces which satisfy a stronger type of triangle inequality. [sent-118, score-0.386]
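
As an illustration of the strong triangle inequality (1), here is a minimal Python sketch (not taken from the paper; the function name and the tiny matrices are made up) that checks whether a finite symmetric distance matrix is an ultrametric. The second example is an ordinary metric whose three sides are all different, which is exactly the situation ruled out for ultrametric spaces a few sentences below.

```python
import itertools

import numpy as np


def is_ultrametric(d, tol=1e-12):
    """Check the strong triangle inequality d(x, z) <= max(d(x, y), d(y, z))
    for every triple of points of a symmetric distance matrix d."""
    n = d.shape[0]
    for i, j, k in itertools.product(range(n), repeat=3):
        if d[i, k] > max(d[i, j], d[j, k]) + tol:
            return False
    return True


# An ultrametric triangle: the two smallest sides are equal (sides 1, 3, 3).
u = np.array([[0.0, 1.0, 3.0],
              [1.0, 0.0, 3.0],
              [3.0, 3.0, 0.0]])
print(is_ultrametric(u))  # True

# A valid metric triangle with all sides different (1, 2, 3) is not ultrametric.
d = np.array([[0.0, 1.0, 3.0],
              [1.0, 0.0, 2.0],
              [3.0, 2.0, 0.0]])
print(is_ultrametric(d))  # False
```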

19 Let X = ∪n≥1 Xn denote the collection of all finite metric spaces and U = ∪n≥1 Un the collection of all finite ultrametric spaces. [sent-131, score-0.664]

20 Indeed, assume that all sides a, b, c of a triangle in a given ultrametric space are different. [sent-160, score-0.362]

21 Hierarchical Clustering: Formulation In this section we formally define hierarchical clustering methods as maps that assign a dendrogram to a finite metric space. [sent-177, score-0.784]

22 3 we prove that the collection of all dendrograms over a finite set is in one-to-one correspondence with the collection of all ultrametrics on this set. [sent-182, score-0.588]

23 We then redefine HC methods as maps from the collection of finite metric spaces to the collection of all finite ultrametric spaces. [sent-183, score-0.744]

24 5, we discuss the construction of notions of distance between dendrograms by appealing to the ultrametric representation. [sent-186, score-0.813]

25 We formally describe dendrograms as pairs (X, θ), where X is a finite set and θ : [0, ∞) → P(X). [sent-194, score-0.409]
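
Since θ changes value at only finitely many parameters, one convenient way to encode a dendrogram is to store its transition levels together with the partition in force on each interval. The class below is an assumed, purely illustrative sketch of that encoding, not a data structure from the paper.

```python
from bisect import bisect_right


class Dendrogram:
    """Store a dendrogram (X, theta) by its transition levels
    0 = r_0 < r_1 < ... < r_m and the partition valid on [r_i, r_{i+1})."""

    def __init__(self, levels, partitions):
        self.levels = levels          # e.g. [0.0, 1.0, 2.0]
        self.partitions = partitions  # one partition (list of frozensets) per level

    def theta(self, r):
        """Return the partition of X in force at scale r >= 0."""
        i = bisect_right(self.levels, r) - 1
        return self.partitions[i]


# Dendrogram over X = {a, b, c}: a and b merge at 1.0, everything merges at 2.0.
D = Dendrogram(
    levels=[0.0, 1.0, 2.0],
    partitions=[
        [frozenset("a"), frozenset("b"), frozenset("c")],
        [frozenset("ab"), frozenset("c")],
        [frozenset("abc")],
    ],
)
print(D.theta(0.5))   # three singleton blocks
print(D.theta(1.5))   # {a, b} merged, {c} still separate
print(D.theta(10.0))  # a single block {a, b, c}
```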

26 (technical condition) Let D(X) denote the collection of all possible dendrograms over a given finite set X. [sent-218, score-0.459]

27 When understood from context, we will omit the first component of a dendrogram (X, θ) ∈ D(X) and refer to θ as a dendrogram over X. [sent-219, score-0.574]

28 Remark 3 (About our definition of dendrogram) Our definition coincides with what Jain and Dubes call proximity dendrograms in Jain and Dubes (1988, §3. [sent-220, score-0.45]

29 Usually, Hierarchical Clustering methods are defined as those maps that assign to each finite metric space (X, d) a dendrogram over X. [sent-226, score-0.57]

30 Whereas this tie breaking strategy seems reasonable from a computational point of view, it invariably leads to dendrograms that depend on the ordering of the points. [sent-262, score-0.409]

31 The construction of the dendrogram θ is completed by repeating the previous steps until all points have been merged into a single cluster. [sent-323, score-0.383]

32 This metric space arises from considering the graph metric on the graph depicted in Figure 6. [sent-327, score-0.483]

33 The three dendrograms shown on the right are the three possible outputs one finds when choosing different orders for the agglomeration processes. [sent-361, score-0.409]

34 Clearly, these three dendrograms are not equivalent under permutations of the labels of their base points. [sent-362, score-0.409]

35 This construction reflects the fact that at step i one agglomerates those clusters at distance Ri from each other (as measured by the linkage function ℓ). [sent-384, score-0.423]
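
The sketch below illustrates one way to implement this permutation-invariant agglomerative scheme for a generic linkage function ℓ: at each step it finds the smallest linkage value R between distinct clusters and merges every group of clusters chained together at linkage distance at most R, so ties are resolved simultaneously rather than by an arbitrary ordering of the points. Function names and the three-point example are assumptions for illustration, not the paper's code.

```python
import numpy as np


def single_linkage(d, A, B):
    """Single linkage value between two clusters (sets of point indices)."""
    return min(d[i, j] for i in A for j in B)


def agglomerate(d, linkage=single_linkage):
    """Return [(merge level, partition), ...], merging at each step all
    clusters chained together at the current minimal linkage value R."""
    clusters = [frozenset([i]) for i in range(d.shape[0])]
    history = [(0.0, list(clusters))]
    while len(clusters) > 1:
        R = min(linkage(d, A, B)
                for a, A in enumerate(clusters) for B in clusters[a + 1:])
        # Union-find over clusters: connect every pair at linkage distance <= R.
        parent = list(range(len(clusters)))

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                if linkage(d, clusters[a], clusters[b]) <= R:
                    parent[find(a)] = find(b)
        groups = {}
        for a, A in enumerate(clusters):
            groups.setdefault(find(a), []).append(A)
        clusters = [frozenset().union(*g) for g in groups.values()]
        history.append((R, list(clusters)))
    return history


# Points 0 and 1 are at distance 1; point 2 is at distance 5 from both.
d = np.array([[0.0, 1.0, 5.0],
              [1.0, 0.0, 5.0],
              [5.0, 5.0, 0.0]])
for level, parts in agglomerate(d):
    print(level, [sorted(c) for c in parts])  # merges at levels 1.0 and 5.0
```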

36 Example 4 Note for example that for the five point metric space in Example 3, the result of applying CL (according to the permutation invariant formulation) is the dendrogram shown in (a). [sent-396, score-0.54]

37 (b) shows the dendrogram that one obtains as output of (the permutation invariant formulation of) SL, AL and CL applied to the metric space L3 . [sent-399, score-0.54]

38 Proposition 8 Let (X, d) be a finite metric space and θSL be the dendrogram over X obtained by the single linkage agglomerative procedure described above, and let θ* be the dendrogram over X constructed in Example 2. [sent-412, score-1.158]

39 3 Dendrograms as Ultrametric Spaces The representation of dendrograms as ultrametrics is well known: it appears in the book by Jardine and Sibson (1971), it has already been used in the work of Hartigan (1985), and it is touched upon in the classical reference of Jain and Dubes (1988, §3. [sent-415, score-0.488]

40 The formulation of the output of hierarchical clustering algorithms as ultrametric spaces is powerful when one is proving stability results, as well as results about the approximation of the dendrograms of metric spaces by their finite subspaces. [sent-419, score-1.382]

41 This is because, once a dendrogram is regarded as a metric space, the Gromov-Hausdorff metric provides a very natural notion of distance on the output, in which the right kind of stability results are easily formulated. [sent-420, score-0.854]

42 The main result in this section is that dendrograms and ultrametrics are equivalent. [sent-422, score-0.488]

43 Note that condition (4) in the definition of dendrograms guarantees that uθ is well defined. [sent-431, score-0.434]

44 Figure 9: A graphical representation of a dendrogram θ over X = {x1, x2, x3, x4} and the ultrametric uθ; the matrix of uθ has uθ(x1, x2) = r1, uθ(x3, x4) = r2, and uθ(xi, xj) = r3 for every other pair of distinct points, with merge levels r1 < r2 < r3. [sent-434, score-0.604]

45 2 FROM ULTRAMETRICS TO DENDROGRAMS Conversely, given an ultrametric u : X × X → R+, its associated dendrogram θu : [0, ∞) → P(X) can be obtained as follows: for each r ≥ 0 let θu(r) be the collection of equivalence classes of X under the relation x ∼ x′ if and only if u(x, x′) ≤ r. [sent-439, score-0.721]
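
A minimal sketch of this direction of the correspondence, assuming the points are indexed 0, ..., n−1 and the ultrametric is given as a matrix: θu(r) groups together points at distance at most r. The code takes connected components of the thresholded graph, which coincide with the equivalence classes whenever u is an ultrametric; the numeric example mirrors the matrix of Figure 9 with r1 = 1, r2 = 2, r3 = 3.

```python
import numpy as np


def partition_at_level(u, r):
    """theta_u(r): blocks obtained by declaring x ~ x' whenever u(x, x') <= r.
    Implemented as connected components of the graph thresholded at r, which
    equal the equivalence classes when u is an ultrametric."""
    unvisited = set(range(u.shape[0]))
    blocks = []
    while unvisited:
        stack = [unvisited.pop()]
        block = set(stack)
        while stack:
            x = stack.pop()
            for y in list(unvisited):
                if u[x, y] <= r:
                    unvisited.remove(y)
                    block.add(y)
                    stack.append(y)
        blocks.append(sorted(block))
    return sorted(blocks)


# The ultrametric u_theta of Figure 9 with r1 = 1, r2 = 2, r3 = 3.
u = np.array([[0, 1, 3, 3],
              [1, 0, 3, 3],
              [3, 3, 0, 2],
              [3, 3, 2, 0]], dtype=float)
print(partition_at_level(u, 0.5))  # [[0], [1], [2], [3]]
print(partition_at_level(u, 1.0))  # [[0, 1], [2], [3]]
print(partition_at_level(u, 2.0))  # [[0, 1], [2, 3]]
print(partition_at_level(u, 3.0))  # [[0, 1, 2, 3]]
```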

46 It is easy to check that (1) given any dendrogram θ on X, θuθ = θ, and (2) given any ultrametric u on X, uθu = u. [sent-457, score-0.604]

47 From (3) we see that Ψ satisfies the condition above. From now on, whenever given a dendrogram θX over a set X, we will use the notation Ψ(θX) for the ultrametric associated to θX given by Theorem 9. [sent-462, score-0.604]

48 In a similar manner, given an ultrametric u on X, Ψ⁻¹(u) will denote the dendrogram over X given by Theorem 9. [sent-463, score-0.604]

49 4 Reformulation of Hierarchical Clustering using Ultrametrics In the sequel, appealing to Theorem 9 which states the equivalence between ultrametrics and dendrograms, we represent dendrograms as ultrametric spaces. [sent-465, score-0.844]

50 Then, any hierarchical clustering method can be regarded as a map from finite metric spaces into finite ultrametric spaces. [sent-466, score-0.851]

51 Example 6 For a given finite metric space (X, d) consider the HC method TSL given by TSL(X, d) = (X, Ψ(θSL)), where θSL is the single linkage dendrogram over X defined in §3. [sent-470, score-0.804]

52 This construction is sometimes known as the maximal sub-dominant ultrametric and it has the property that if u ≤ d is any other ultrametric on X, then u ≤ u*. [sent-489, score-0.669]
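
As an illustration (assumed implementation, not the paper's code), the maximal sub-dominant ultrametric of a finite metric d can be computed as the minimax chain cost, u*(x, x′) = min over chains x = x0, ..., xk = x′ of max_i d(xi, xi+1), here via a Floyd-Warshall-style sweep. By Lemma 12 and the sentences following it, this u* is exactly the ultrametric produced by single linkage, that is, θSL = Ψ⁻¹(u*).

```python
import numpy as np


def subdominant_ultrametric(d):
    """Maximal ultrametric below d: u*(x, x') is the smallest possible value
    of the largest hop over all chains joining x to x', computed with a
    Floyd-Warshall-style minimax recursion."""
    u = d.astype(float).copy()
    n = u.shape[0]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                u[i, j] = min(u[i, j], max(u[i, k], u[k, j]))
    return u


# Three points on a line at positions 0, 1 and 3.
d = np.array([[0.0, 1.0, 3.0],
              [1.0, 0.0, 2.0],
              [3.0, 2.0, 0.0]])
print(subdominant_ultrametric(d))
# u*(0, 2) drops from 3 to 2: the chain 0 -> 1 -> 2 has largest hop 2.
```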

53 The Lemma below proves that this canonical construction is equivalent to the ultrametric induced by the equivalence relation in Example 1. [sent-490, score-0.391]

54 Lemma 12 For (X, d) ∈ X write T*(X, d) = (X, u*) and let (X, θ*) ∈ D(X) be the dendrogram arising from the construction in Example 2. [sent-491, score-0.377]

55 It turns out that T* yields exactly single linkage clustering as defined in §3. [sent-494, score-0.408]

56 Equivalently, for any finite metric space X, the single linkage dendrogram θSL on X agrees with Ψ⁻¹(u*). [sent-497, score-0.804]

57 5 Comparing results of Hierarchical Clustering Methods One of the goals of this paper is to study the stability of clustering methods to perturbations in the input metric space. [sent-516, score-0.452]

58 We choose to do this by appealing to the ultrametric representation of dendrograms, which provides a natural way of defining a distance between hierarchical clusterings. [sent-518, score-0.462]

59 Consider first the simple case of two different dendrograms (X, α) and (X, β) over the same fixed finite set X. [sent-520, score-0.409]

60 In this case, as a tentative measure of dissimilarity between the dendrograms we look at the maximal difference between the associated ultrametrics given by Theorem 9, uα = Ψ(α) and uβ = Ψ(β): max over x, x′ ∈ X of |uα(x, x′) − uβ(x, x′)|. [sent-521, score-0.488]
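
For dendrograms over the same labelled set this tentative dissimilarity is simply the sup-norm distance between the two associated ultrametric matrices; the sketch below uses made-up merge levels for illustration.

```python
import numpy as np


def dendrogram_sup_distance(u_alpha, u_beta):
    """Sup-norm difference of the ultrametrics associated with two
    dendrograms over the same labelled finite set."""
    return float(np.max(np.abs(u_alpha - u_beta)))


# Two ultrametrics over four points whose merge levels differ by at most 0.25.
u_alpha = np.array([[0.00, 1.00, 3.00, 3.00],
                    [1.00, 0.00, 3.00, 3.00],
                    [3.00, 3.00, 0.00, 2.00],
                    [3.00, 3.00, 2.00, 0.00]])
u_beta = np.array([[0.00, 1.25, 3.00, 3.00],
                   [1.25, 0.00, 3.00, 3.00],
                   [3.00, 3.00, 0.00, 1.75],
                   [3.00, 3.00, 1.75, 0.00]])
print(dendrogram_sup_distance(u_alpha, u_beta))  # 0.25
```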

61 There is a natural interpretation of the condition that max over x, x′ ∈ X of |uα(x, x′) − uβ(x, x′)| ≤ ε: if we look at the graphical representation of the dendrograms α and β, then the horizontal transition lines in Figure 11 have to occur within ε of each other. [sent-522, score-0.434]

62 Figure 11: Two different dendrograms (X, α) and (X, β) over the same underlying set X = {x1, x2, x3, x4}. [sent-527, score-0.409]

63 Now, in a slightly more general situation we may be faced with the task of comparing two different dendrograms α and β without knowing (or caring about) the exact labels of the points. [sent-529, score-0.409]

64 The most general case arises when we do not know whether the dendrograms come from the same underlying set or not. [sent-531, score-0.409]

65 For the next section we do not need to make use of the full generality in these considerations: there we only compare dendrograms defined over the same underlying set. [sent-554, score-0.409]

66 We finish this section with a precise result regarding the stability of dendrograms arising from SLHC. [sent-556, score-0.491]

67 Figure 13: These are the same dendrograms as in Figure 12. [sent-579, score-0.409]

68 Similarly, let ∆n be the metric space with the same underlying set and metric d∆n(pi, pj) = 1 for i ≠ j, i, j ∈ {1, . [sent-599, score-0.545]

69 Figure 14: The metric spaces Ln and ∆n both have n points. [sent-607, score-0.381]

70 Single linkage HC applied to either of them yields the dendrogram in the center. [sent-608, score-0.551]

71 Define Ln to be the metric space with underlying set P and metric dLn(pi, pj) = |i − j| + |ai − aj|. [sent-617, score-0.517]

72 Similarly, define ∆εn to be the metric space with underlying set P and metric d∆εn(pi, pj) = |si − sj| + |bi − bj|. [sent-618, score-0.517]

73 In Carlsson and Mémoli (2009) we study multiparameter clustering methods, which are similar to HC methods but where we track connected components in a multiparameter landscape. [sent-643, score-0.359]

74 We make the notion of similarity between dendrograms precise in §5 by interpreting dendrograms as ultrametric spaces and by computing the Gromov-Hausdorff distance between these ultrametric spaces. [sent-661, score-1.449]

75 The behavior is that the map of metric spaces should induce a map of clusters, that is, that if two points in the domain space belong to the same cluster, then so do their images in the clustering of the image metric space. [sent-663, score-0.741]

76 Condition (I) is clear: the two-point metric space contains only one degree of freedom, which has to determine unambiguously the behavior of any clustering method T. [sent-675, score-0.397]

77 In terms of dendrograms, this means that the two-point metric space ({A, B}, d) with d(A, B) = δ must be mapped to the dendrogram where A and B are merged at parameter value δ, see Figure 16. [sent-676, score-0.575]

78 Condition (II) is crucial and roughly says that whenever one shrinks some distances (even to zero) to obtain a new (pseudo) metric space, then the corresponding efforts in this new space have to be smaller than the efforts in the original metric space. [sent-677, score-0.483]

79 Let θX = Ψ⁻¹(uX) and θY = Ψ⁻¹(uY) be the dendrograms associated to uX and uY. [sent-679, score-0.409]

80 Condition (III) expresses the fact that in order to join two points x, x′ ∈ X, any clustering method T has to make an effort of at least the separation sep(X, d) of the metric space. [sent-682, score-0.4]

81 Figure 16: Interpretation of Condition I: for all δ > 0 the two-point metric space on the left must be mapped by T into the dendrogram on the right. [sent-685, score-0.54]

82 Remark 20 It is interesting to point out why complete linkage and average linkage hierarchical clustering, as defined in §3. [sent-686, score-0.621]

83 Consider the metric spaces X = {A, B, C} with metric given by the edge lengths {4, 3, 5} and Y = {A′, B′, C′} with metric given by the edge lengths {4, 3, 2}, as given in Figure 19. [sent-690, score-0.757]

84 This notion of distance permits regarding the collection of all compact metric spaces as a metric space in itself. [sent-710, score-0.679]

85 Finite metric spaces are by now ubiquitous in virtually all areas of data analysis, and the idea of assigning a metric to the collection of all of them is in fact quite an old one. [sent-712, score-0.577]

86 For Euclidean metric spaces, for example, the idea of constructing a metric was used by Kendall et al. [sent-713, score-0.46]

87 The Gromov-Hausdorff distance dGH(X, Y) between compact metric spaces (X, dX) and (Y, dY) was originally defined to be the infimal ε > 0 s. [sent-728, score-0.376]
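
An equivalent and standard reformulation (found, for example, in Burago et al.) expresses dGH(X, Y) as one half of the smallest possible distortion of a correspondence R ⊆ X × Y, that is, a relation whose projections onto X and Y are both surjective. The brute-force sketch below is purely illustrative (exponential in |X|·|Y|) and evaluates this formula on one possible assignment of the edge lengths of the two three-point spaces from Remark 20.

```python
import itertools

import numpy as np


def gromov_hausdorff_bruteforce(dX, dY):
    """dGH for tiny finite metric spaces via correspondences:
    dGH = (1/2) * min over correspondences R of
          max_{(x,y),(x',y') in R} |dX(x,x') - dY(y,y')|.
    Enumerates all subsets of X x Y, so only usable on very small examples."""
    nX, nY = dX.shape[0], dY.shape[0]
    pairs = list(itertools.product(range(nX), range(nY)))
    best = float("inf")
    for mask in range(1, 1 << len(pairs)):
        R = [pairs[k] for k in range(len(pairs)) if (mask >> k) & 1]
        if {x for x, _ in R} != set(range(nX)) or {y for _, y in R} != set(range(nY)):
            continue  # both projections must be surjective
        dis = max(abs(dX[x, xp] - dY[y, yp]) for (x, y) in R for (xp, yp) in R)
        best = min(best, dis)
    return best / 2.0


# One labelling of the spaces of Remark 20: edge lengths {4, 3, 5} vs {4, 3, 2}.
dX = np.array([[0.0, 4.0, 3.0],
               [4.0, 0.0, 5.0],
               [3.0, 5.0, 0.0]])
dY = np.array([[0.0, 4.0, 3.0],
               [4.0, 0.0, 2.0],
               [3.0, 2.0, 0.0]])
print(gromov_hausdorff_bruteforce(dX, dY))  # 0.5
```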

88 3 A Probabilistic Convergence Result In this section, we prove a precise result which describes how the dendrograms attached to compact metric spaces by single linkage clustering can be obtained as the limits of the dendrograms attached to finite subsets of the metric space. [sent-821, score-1.78]

89 This kind of result is of great importance, since we are often interested in infinite metric spaces but typically do not have access to more than finitely many random samples from the metric space. [sent-823, score-0.527]

90 A triple (X, dX, µX), where (X, dX) is a metric space and µX is a Borel probability measure on X with compact support, will be called an mm-space (short for measure metric space). [sent-834, score-0.51]

91 Our stability results seem to be novel and complement classical observations that CL and AL are discontinuous as maps from finite metric spaces into dendrograms. [sent-868, score-0.382]

92 The authors of the present paper reported a characterization for proximity dendrograms (Carlsson and Mémoli, 2008) using the language of category theory. [sent-878, score-0.45]

93 7 More classical is the work of Jardine and Sibson (1971) who also ultimately view HC methods as maps from finite metric spaces to finite ultrametric spaces. [sent-880, score-0.644]

94 Recall that the difference between these two types of dendrograms is that proximity dendrograms retain the linkage value at which mergings take place whereas threshold dendrograms only record the order, see Remark 3. [sent-882, score-1.532]

95 Collection of all dendrograms over the finite set X, page 1431. [sent-908, score-0.473]

96 An ultrametric obtained from the dendrogram θ, page 1440. [sent-914, score-0.668]

97 A dendrogram obtained from the ultrametric u, page 1441. [sent-915, score-0.668]

98 Hausdorff distance between subsets of the metric space Z, page 1455. [sent-919, score-0.369]

99 Standard linkage based HC methods seen as maps from X to U , page 1442. [sent-920, score-0.358]

100 Condition (1) in the definition of dendrograms implies that x = x′. [sent-998, score-0.409]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('dendrograms', 0.409), ('ultrametric', 0.317), ('dendrogram', 0.287), ('linkage', 0.264), ('metric', 0.23), ('dgh', 0.198), ('sl', 0.164), ('moli', 0.163), ('arlsson', 0.152), ('haracterization', 0.145), ('ierarchical', 0.145), ('clustering', 0.144), ('hc', 0.144), ('ux', 0.132), ('fx', 0.126), ('lustering', 0.124), ('bx', 0.117), ('dx', 0.114), ('tability', 0.112), ('dln', 0.106), ('cl', 0.098), ('da', 0.096), ('hierarchical', 0.093), ('dis', 0.089), ('dh', 0.086), ('dz', 0.084), ('onvergence', 0.082), ('ultrametrics', 0.079), ('dubes', 0.073), ('ethods', 0.072), ('spaces', 0.067), ('carlsson', 0.066), ('uln', 0.066), ('supp', 0.066), ('kleinberg', 0.065), ('sep', 0.065), ('page', 0.064), ('dy', 0.062), ('uy', 0.061), ('xk', 0.058), ('al', 0.058), ('block', 0.057), ('ua', 0.056), ('stability', 0.055), ('distance', 0.052), ('hartigan', 0.051), ('wa', 0.051), ('collection', 0.05), ('jain', 0.049), ('ri', 0.046), ('xn', 0.042), ('proximity', 0.041), ('hausdorff', 0.041), ('burago', 0.04), ('eachother', 0.04), ('gromov', 0.04), ('agglomerative', 0.039), ('equivalence', 0.039), ('xr', 0.037), ('remark', 0.037), ('construction', 0.035), ('dr', 0.035), ('merged', 0.035), ('maxi', 0.035), ('pi', 0.034), ('theorem', 0.034), ('dxn', 0.033), ('dzn', 0.033), ('tcl', 0.033), ('clusters', 0.032), ('zn', 0.032), ('fix', 0.031), ('merge', 0.031), ('dl', 0.031), ('proposition', 0.03), ('maps', 0.03), ('partition', 0.029), ('partitions', 0.028), ('let', 0.028), ('xi', 0.028), ('arising', 0.027), ('compact', 0.027), ('multiparameter', 0.026), ('tsl', 0.026), ('points', 0.026), ('condition', 0.025), ('xt', 0.024), ('gn', 0.023), ('perturbations', 0.023), ('space', 0.023), ('ys', 0.023), ('triangle', 0.022), ('impossibility', 0.022), ('multiscale', 0.022), ('belong', 0.021), ('claim', 0.021), ('proof', 0.021), ('undesirable', 0.021), ('notice', 0.02), ('pick', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999946 19 jmlr-2010-Characterization, Stability and Convergence of Hierarchical Clustering Methods

Author: Gunnar Carlsson, Facundo Mémoli

Abstract: We study hierarchical clustering schemes under an axiomatic view. We show that within this framework, one can prove a theorem analogous to one of Kleinberg (2002), in which one obtains an existence and uniqueness theorem instead of a non-existence result. We explore further properties of this unique scheme: stability and convergence are established. We represent dendrograms as ultrametric spaces and use tools from metric geometry, namely the Gromov-Hausdorff distance, to quantify the degree to which perturbations in the input metric space affect the result of hierarchical methods. Keywords: clustering, hierarchical clustering, stability of clustering, Gromov-Hausdorff distance

2 0.075403355 47 jmlr-2010-Hilbert Space Embeddings and Metrics on Probability Measures

Author: Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, Gert R.G. Lanckriet

Abstract: A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as γk , indexed by the kernel function k that defines the inner product in the RKHS. We present three theoretical properties of γk . First, we consider the question of determining the conditions on the kernel k for which γk is a metric: such k are denoted characteristic kernels. Unlike pseudometrics, a metric is zero only when two distributions coincide, thus ensuring the RKHS embedding maps all distributions uniquely (i.e., the embedding is injective). While previously published conditions may apply only in restricted circumstances (e.g., on compact domains), and are difficult to check, our conditions are straightforward and intuitive: integrally strictly positive definite kernels are characteristic. Alternatively, if a bounded continuous kernel is translation-invariant on Rd , then it is characteristic if and only if the support of its Fourier transform is the entire Rd . Second, we show that the distance between distributions under γk results from an interplay between the properties of the kernel and the distributions, by demonstrating that distributions are close in the embedding space when their differences occur at higher frequencies. Third, to understand the ∗. Also at Carnegie Mellon University, Pittsburgh, PA 15213, USA. c 2010 Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Sch¨ lkopf and Gert R. G. Lanckriet. o ¨ S RIPERUMBUDUR , G RETTON , F UKUMIZU , S CH OLKOPF AND L ANCKRIET nature of the topology induced by γk , we relate γk to other popular metrics on probability measures, and present conditions on the kernel k und

3 0.075182252 55 jmlr-2010-Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance

Author: Nguyen Xuan Vinh, Julien Epps, James Bailey

Abstract: Information theoretic measures form a fundamental class of measures for comparing clusterings, and have recently received increasing interest. Nevertheless, a number of questions concerning their properties and inter-relationships remain unresolved. In this paper, we perform an organized study of information theoretic measures for clustering comparison, including several existing popular measures in the literature, as well as some newly proposed ones. We discuss and prove their important properties, such as the metric property and the normalization property. We then highlight to the clustering community the importance of correcting information theoretic measures for chance, especially when the data size is small compared to the number of clusters present therein. Of the available information theoretic based measures, we advocate the normalized information distance (NID) as a general measure of choice, for it possesses concurrently several important properties, such as being both a metric and a normalized measure, admitting an exact analytical adjusted-for-chance form, and using the nominal [0, 1] range better than other normalized variants. Keywords: clustering comparison, information theory, adjustment for chance, normalized information distance

4 0.059946936 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond

Author: Yevgeny Seldin, Naftali Tishby

Abstract: We derive PAC-Bayesian generalization bounds for supervised and unsupervised learning models based on clustering, such as co-clustering, matrix tri-factorization, graphical models, graph clustering, and pairwise clustering.1 We begin with the analysis of co-clustering, which is a widely used approach to the analysis of data matrices. We distinguish among two tasks in matrix data analysis: discriminative prediction of the missing entries in data matrices and estimation of the joint probability distribution of row and column variables in co-occurrence matrices. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two tasks. The analysis yields regularization terms that were absent in the previous formulations of co-clustering. The bounds suggest that the expected performance of co-clustering is governed by a trade-off between its empirical performance and the mutual information preserved by the cluster variables on row and column IDs. We derive an iterative projection algorithm for finding a local optimum of this trade-off for discriminative prediction tasks. This algorithm achieved stateof-the-art performance in the MovieLens collaborative filtering task. Our co-clustering model can also be seen as matrix tri-factorization and the results provide generalization bounds, regularization terms, and new algorithms for this form of matrix factorization. The analysis of co-clustering is extended to tree-shaped graphical models, which can be used to analyze high dimensional tensors. According to the bounds, the generalization abilities of treeshaped graphical models depend on a trade-off between their empirical data fit and the mutual information that is propagated up the tree levels. We also formulate weighted graph clustering as a prediction problem: given a subset of edge weights we analyze the ability of graph clustering to predict the remaining edge weights. The analysis of co-clustering easily

5 0.04745381 96 jmlr-2010-Rate Minimaxity of the Lasso and Dantzig Selector for thelqLoss inlrBalls

Author: Fei Ye, Cun-Hui Zhang

Abstract: We consider the estimation of regression coefficients in a high-dimensional linear model. For regression coefficients in ℓr balls, we provide lower bounds for the minimax ℓq risk and minimax quantiles of the ℓq loss for all design matrices. Under an ℓ0 sparsity condition on a target coefficient vector, we sharpen and unify existing oracle inequalities for the Lasso and Dantzig selector. We derive oracle inequalities for target coefficient vectors with many small elements and smaller threshold levels than the universal threshold. These oracle inequalities provide sufficient conditions on the design matrix for the rate minimaxity of the Lasso and Dantzig selector for the ℓq risk and loss in ℓr balls, 0 ≤ r ≤ 1 ≤ q ≤ ∞. By allowing q = ∞, our risk bounds imply the variable selection consistency of threshold Lasso and Dantzig selectors. Keywords: variable selection, estimation, oracle inequality, minimax, linear regression, penalized least squares, linear programming

6 0.041525405 10 jmlr-2010-An Exponential Model for Infinite Rankings

7 0.039463662 60 jmlr-2010-Learnability, Stability and Uniform Convergence

8 0.038362559 88 jmlr-2010-Optimal Search on Clustered Structural Constraint for Learning Bayesian Network Structure

9 0.035330553 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes

10 0.034430861 69 jmlr-2010-Lp-Nested Symmetric Distributions

11 0.029289652 62 jmlr-2010-Learning Gradients: Predictive Models that Infer Geometry and Statistical Dependence

12 0.029243484 22 jmlr-2010-Classification Using Geometric Level Sets

13 0.029211901 72 jmlr-2010-Matrix Completion from Noisy Entries

14 0.028665736 40 jmlr-2010-Fast and Scalable Local Kernel Machines

15 0.028066956 51 jmlr-2010-Importance Sampling for Continuous Time Bayesian Networks

16 0.027970374 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data

17 0.02768518 45 jmlr-2010-High-dimensional Variable Selection with Sparse Random Projections: Measurement Sparsity and Statistical Efficiency

18 0.027610496 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming

19 0.026275888 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking

20 0.025863631 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.133), (1, -0.032), (2, 0.023), (3, -0.009), (4, -0.078), (5, -0.001), (6, -0.093), (7, 0.047), (8, 0.031), (9, 0.079), (10, -0.043), (11, 0.106), (12, 0.022), (13, 0.02), (14, -0.095), (15, -0.119), (16, -0.151), (17, -0.015), (18, 0.073), (19, -0.113), (20, 0.0), (21, -0.066), (22, 0.01), (23, -0.148), (24, -0.001), (25, 0.079), (26, -0.257), (27, -0.094), (28, 0.027), (29, -0.001), (30, -0.01), (31, -0.033), (32, 0.003), (33, -0.122), (34, 0.103), (35, 0.027), (36, 0.199), (37, 0.009), (38, 0.092), (39, -0.03), (40, 0.219), (41, -0.0), (42, -0.069), (43, 0.109), (44, -0.035), (45, -0.053), (46, 0.165), (47, -0.064), (48, -0.01), (49, 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94525862 19 jmlr-2010-Characterization, Stability and Convergence of Hierarchical Clustering Methods

Author: Gunnar Carlsson, Facundo Mémoli

Abstract: We study hierarchical clustering schemes under an axiomatic view. We show that within this framework, one can prove a theorem analogous to one of Kleinberg (2002), in which one obtains an existence and uniqueness theorem instead of a non-existence result. We explore further properties of this unique scheme: stability and convergence are established. We represent dendrograms as ultrametric spaces and use tools from metric geometry, namely the Gromov-Hausdorff distance, to quantify the degree to which perturbations in the input metric space affect the result of hierarchical methods. Keywords: clustering, hierarchical clustering, stability of clustering, Gromov-Hausdorff distance

2 0.59954482 55 jmlr-2010-Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance

Author: Nguyen Xuan Vinh, Julien Epps, James Bailey

Abstract: Information theoretic measures form a fundamental class of measures for comparing clusterings, and have recently received increasing interest. Nevertheless, a number of questions concerning their properties and inter-relationships remain unresolved. In this paper, we perform an organized study of information theoretic measures for clustering comparison, including several existing popular measures in the literature, as well as some newly proposed ones. We discuss and prove their important properties, such as the metric property and the normalization property. We then highlight to the clustering community the importance of correcting information theoretic measures for chance, especially when the data size is small compared to the number of clusters present therein. Of the available information theoretic based measures, we advocate the normalized information distance (NID) as a general measure of choice, for it possesses concurrently several important properties, such as being both a metric and a normalized measure, admitting an exact analytical adjusted-for-chance form, and using the nominal [0, 1] range better than other normalized variants. Keywords: clustering comparison, information theory, adjustment for chance, normalized information distance

3 0.39314252 47 jmlr-2010-Hilbert Space Embeddings and Metrics on Probability Measures

Author: Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, Gert R.G. Lanckriet

Abstract: A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as γk , indexed by the kernel function k that defines the inner product in the RKHS. We present three theoretical properties of γk . First, we consider the question of determining the conditions on the kernel k for which γk is a metric: such k are denoted characteristic kernels. Unlike pseudometrics, a metric is zero only when two distributions coincide, thus ensuring the RKHS embedding maps all distributions uniquely (i.e., the embedding is injective). While previously published conditions may apply only in restricted circumstances (e.g., on compact domains), and are difficult to check, our conditions are straightforward and intuitive: integrally strictly positive definite kernels are characteristic. Alternatively, if a bounded continuous kernel is translation-invariant on Rd , then it is characteristic if and only if the support of its Fourier transform is the entire Rd . Second, we show that the distance between distributions under γk results from an interplay between the properties of the kernel and the distributions, by demonstrating that distributions are close in the embedding space when their differences occur at higher frequencies. Third, to understand the ∗. Also at Carnegie Mellon University, Pittsburgh, PA 15213, USA. c 2010 Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Sch¨ lkopf and Gert R. G. Lanckriet. o ¨ S RIPERUMBUDUR , G RETTON , F UKUMIZU , S CH OLKOPF AND L ANCKRIET nature of the topology induced by γk , we relate γk to other popular metrics on probability measures, and present conditions on the kernel k und

4 0.36544111 10 jmlr-2010-An Exponential Model for Infinite Rankings

Author: Marina Meilă, Le Bao

Abstract: This paper presents a statistical model for expressing preferences through rankings, when the number of alternatives (items to rank) is large. A human ranker will then typically rank only the most preferred items, and may not even examine the whole set of items, or know how many they are. Similarly, a user presented with the ranked output of a search engine, will only consider the highest ranked items. We model such situations by introducing a stagewise ranking model that operates with finite ordered lists called top-t orderings over an infinite space of items. We give algorithms to estimate this model from data, and demonstrate that it has sufficient statistics, being thus an exponential family model with continuous and discrete parameters. We describe its conjugate prior and other statistical properties. Then, we extend the estimation problem to multimodal data by introducing an Exponential-Blurring-Mean-Shift nonparametric clustering algorithm. The experiments highlight the properties of our model and demonstrate that infinite models over permutations can be simple, elegant and practical. Keywords: permutations, partial orderings, Mallows model, distance based ranking model, exponential family, non-parametric clustering, branch-and-bound

5 0.33115056 88 jmlr-2010-Optimal Search on Clustered Structural Constraint for Learning Bayesian Network Structure

Author: Kaname Kojima, Eric Perrier, Seiya Imoto, Satoru Miyano

Abstract: We study the problem of learning an optimal Bayesian network in a constrained search space; skeletons are compelled to be subgraphs of a given undirected graph called the super-structure. The previously derived constrained optimal search (COS) remains limited even for sparse superstructures. To extend its feasibility, we propose to divide the super-structure into several clusters and perform an optimal search on each of them. Further, to ensure acyclicity, we introduce the concept of ancestral constraints (ACs) and derive an optimal algorithm satisfying a given set of ACs. Finally, we theoretically derive the necessary and sufficient sets of ACs to be considered for finding an optimal constrained graph. Empirical evaluations demonstrate that our algorithm can learn optimal Bayesian networks for some graphs containing several hundreds of vertices, and even for super-structures having a high average degree (up to four), which is a drastic improvement in feasibility over the previous optimal algorithm. Learnt networks are shown to largely outperform state-of-the-art heuristic algorithms both in terms of score and structural hamming distance. Keywords: Bayesian networks, structure learning, constrained optimal search

6 0.31007379 96 jmlr-2010-Rate Minimaxity of the Lasso and Dantzig Selector for thelqLoss inlrBalls

7 0.29790339 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond

8 0.29663551 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks

9 0.27227679 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm

10 0.25419623 60 jmlr-2010-Learnability, Stability and Uniform Convergence

11 0.24574417 54 jmlr-2010-Information Retrieval Perspective to Nonlinear Dimensionality Reduction for Data Visualization

12 0.23157457 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data

13 0.21998602 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking

14 0.20704682 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization

15 0.19951442 86 jmlr-2010-On the Rate of Convergence of the Bagged Nearest Neighbor Estimate

16 0.18912828 98 jmlr-2010-Regularized Discriminant Analysis, Ridge Regression and Beyond

17 0.18597889 115 jmlr-2010-Using Contextual Representations to Efficiently Learn Context-Free Languages

18 0.18423326 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes

19 0.17501317 69 jmlr-2010-Lp-Nested Symmetric Distributions

20 0.16016537 65 jmlr-2010-Learning Translation Invariant Kernels for Classification


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.05), (4, 0.024), (8, 0.013), (21, 0.021), (22, 0.492), (32, 0.045), (36, 0.026), (37, 0.036), (75, 0.096), (81, 0.03), (85, 0.044), (96, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.72924918 19 jmlr-2010-Characterization, Stability and Convergence of Hierarchical Clustering Methods

Author: Gunnar Carlsson, Facundo Mémoli

Abstract: We study hierarchical clustering schemes under an axiomatic view. We show that within this framework, one can prove a theorem analogous to one of Kleinberg (2002), in which one obtains an existence and uniqueness theorem instead of a non-existence result. We explore further properties of this unique scheme: stability and convergence are established. We represent dendrograms as ultrametric spaces and use tools from metric geometry, namely the Gromov-Hausdorff distance, to quantify the degree to which perturbations in the input metric space affect the result of hierarchical methods. Keywords: clustering, hierarchical clustering, stability of clustering, Gromov-Hausdorff distance

2 0.70975697 29 jmlr-2010-Covariance in Unsupervised Learning of Probabilistic Grammars

Author: Shay B. Cohen, Noah A. Smith

Abstract: Probabilistic grammars offer great flexibility in modeling discrete sequential data like natural language text. Their symbolic component is amenable to inspection by humans, while their probabilistic component helps resolve ambiguity. They also permit the use of well-understood, generalpurpose learning algorithms. There has been an increased interest in using probabilistic grammars in the Bayesian setting. To date, most of the literature has focused on using a Dirichlet prior. The Dirichlet prior has several limitations, including that it cannot directly model covariance between the probabilistic grammar’s parameters. Yet, various grammar parameters are expected to be correlated because the elements in language they represent share linguistic properties. In this paper, we suggest an alternative to the Dirichlet prior, a family of logistic normal distributions. We derive an inference algorithm for this family of distributions and experiment with the task of dependency grammar induction, demonstrating performance improvements with our priors on a set of six treebanks in different natural languages. Our covariance framework permits soft parameter tying within grammars and across grammars for text in different languages, and we show empirical gains in a novel learning setting using bilingual, non-parallel data. Keywords: dependency grammar induction, variational inference, logistic normal distribution, Bayesian inference

3 0.34779397 53 jmlr-2010-Inducing Tree-Substitution Grammars

Author: Trevor Cohn, Phil Blunsom, Sharon Goldwater

Abstract: Inducing a grammar from text has proven to be a notoriously challenging learning task despite decades of research. The primary reason for its difficulty is that in order to induce plausible grammars, the underlying model must be capable of representing the intricacies of language while also ensuring that it can be readily learned from data. The majority of existing work on grammar induction has favoured model simplicity (and thus learnability) over representational capacity by using context free grammars and first order dependency grammars, which are not sufficiently expressive to model many common linguistic constructions. We propose a novel compromise by inferring a probabilistic tree substitution grammar, a formalism which allows for arbitrarily large tree fragments and thereby better represent complex linguistic structures. To limit the model’s complexity we employ a Bayesian non-parametric prior which biases the model towards a sparse grammar with shallow productions. We demonstrate the model’s efficacy on supervised phrase-structure parsing, where we induce a latent segmentation of the training treebank, and on unsupervised dependency grammar induction. In both cases the model uncovers interesting latent linguistic structures while producing competitive results. Keywords: grammar induction, tree substitution grammar, Bayesian non-parametrics, Pitman-Yor process, Chinese restaurant process

4 0.27000928 115 jmlr-2010-Using Contextual Representations to Efficiently Learn Context-Free Languages

Author: Alexander Clark, Rémi Eyraud, Amaury Habrard

Abstract: We present a polynomial update time algorithm for the inductive inference of a large class of context-free languages using the paradigm of positive data and a membership oracle. We achieve this result by moving to a novel representation, called Contextual Binary Feature Grammars (CBFGs), which are capable of representing richly structured context-free languages as well as some context sensitive languages. These representations explicitly model the lattice structure of the distribution of a set of substrings and can be inferred using a generalisation of distributional learning. This formalism is an attempt to bridge the gap between simple learnable classes and the sorts of highly expressive representations necessary for linguistic representation: it allows the learnability of a large class of context-free languages, that includes all regular languages and those context-free languages that satisfy two simple constraints. The formalism and the algorithm seem well suited to natural language and in particular to the modeling of first language acquisition. Preliminary experimental results confirm the effectiveness of this approach. Keywords: grammatical inference, context-free language, positive data only, membership queries

5 0.26924098 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models

Author: Kuzman Ganchev, João Graça, Jennifer Gillenwater, Ben Taskar

Abstract: We present posterior regularization, a probabilistic framework for structured, weakly supervised learning. Our framework efficiently incorporates indirect supervision via constraints on posterior distributions of probabilistic models with latent variables. Posterior regularization separates model complexity from the complexity of structural constraints it is desired to satisfy. By directly imposing decomposable regularization on the posterior moments of latent variables during learning, we retain the computational efficiency of the unconstrained model while ensuring desired constraints hold in expectation. We present an efficient algorithm for learning with posterior regularization and illustrate its versatility on a diverse set of structural constraints such as bijectivity, symmetry and group sparsity in several large scale experiments, including multi-view learning, cross-lingual dependency grammar induction, unsupervised part-of-speech induction, and bitext word alignment.1 Keywords: posterior regularization framework, unsupervised learning, latent variables models, prior knowledge, natural language processing

6 0.26358011 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning

7 0.25777769 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming

8 0.25283396 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond

9 0.25160548 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices

10 0.2496713 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions

11 0.2485247 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data

12 0.24739452 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels

13 0.24736097 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization

14 0.24613057 102 jmlr-2010-Semi-Supervised Novelty Detection

15 0.24579127 72 jmlr-2010-Matrix Completion from Noisy Entries

16 0.24538015 69 jmlr-2010-Lp-Nested Symmetric Distributions

17 0.24369092 82 jmlr-2010-On Learning with Integral Operators

18 0.24317156 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking

19 0.24284874 109 jmlr-2010-Stochastic Composite Likelihood

20 0.24274459 84 jmlr-2010-On Spectral Learning