jmlr jmlr2005 jmlr2005-20 knowledge-graph by maker-knowledge-mining

20 jmlr-2005-Clustering with Bregman Divergences


Source: pdf

Author: Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, Joydeep Ghosh

Abstract: A wide variety of distortion functions, such as squared Euclidean distance, Mahalanobis distance, Itakura-Saito distance and relative entropy, have been used for clustering. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroid-based parametric clustering approaches, such as classical kmeans, the Linde-Buzo-Gray (LBG) algorithm and information-theoretic clustering, which arise by special choices of the Bregman divergence. The algorithms maintain the simplicity and scalability of the classical kmeans algorithm, while generalizing the method to a large class of clustering loss functions. This is achieved by first posing the hard clustering problem in terms of minimizing the loss in Bregman information, a quantity motivated by rate distortion theory, and then deriving an iterative algorithm that monotonically decreases this loss. In addition, we show that there is a bijection between regular exponential families and a large class of Bregman divergences, that we call regular Bregman divergences. This result enables the development of an alternative interpretation of an efficient EM scheme for learning mixtures of exponential family distributions, and leads to a simple soft clustering algorithm for regular Bregman divergences. Finally, we discuss the connection between rate distortion theory and Bregman clustering and present an information theoretic analysis of Bregman clustering algorithms in terms of a trade-off between compression and loss in Bregman information. Keywords: clustering, Bregman divergences, Bregman information, exponential families, expectation maximization, information theory
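
For concreteness, the Bregman divergence generated by a strictly convex, differentiable function φ is dφ(x, y) = φ(x) − φ(y) − ⟨∇φ(y), x − y⟩. The following minimal Python sketch (ours, not taken from the paper; the helper names are illustrative) instantiates it for the distortion functions named in the abstract — squared Euclidean distance, relative entropy (on the simplex), and the Itakura-Saito distance:

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """d_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

# Squared Euclidean distance: phi(x) = ||x||^2
sq_norm = (lambda x: np.dot(x, x), lambda x: 2.0 * x)
# Relative entropy (KL divergence) on the simplex: phi(p) = sum_i p_i log p_i
neg_entropy = (lambda p: np.sum(p * np.log(p)), lambda p: np.log(p) + 1.0)
# Itakura-Saito distance: phi(x) = -sum_i log x_i
burg_entropy = (lambda x: -np.sum(np.log(x)), lambda x: -1.0 / x)

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])
print(bregman(*sq_norm, x, y))      # ||x - y||^2
print(bregman(*neg_entropy, x, y))  # sum_i x_i log(x_i / y_i)  (both on the simplex)
print(bregman(*burg_entropy, x, y)) # sum_i (x_i/y_i - log(x_i/y_i) - 1)
```

Similarly, the Mahalanobis distance (x − y)ᵀA(x − y) is obtained from φ(x) = xᵀAx for a symmetric positive definite A.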

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. [sent-13, score-0.421]

2 This is achieved by first posing the hard clustering problem in terms of minimizing the loss in Bregman information, a quantity motivated by rate distortion theory, and then deriving an iterative algorithm that monotonically decreases this loss. [sent-16, score-0.407]

3 In addition, we show that there is a bijection between regular exponential families and a large class of Bregman divergences, that we call regular Bregman divergences. [sent-17, score-0.331]

4 This result enables the development of an alternative interpretation of an efficient EM scheme for learning mixtures of exponential family distributions, and leads to a simple soft clustering algorithm for regular Bregman divergences. [sent-18, score-0.425]

5 Finally, we discuss the connection between rate distortion theory and Bregman clustering and present an information theoretic analysis of Bregman clustering algorithms in terms of a trade-off between compression and loss in Bregman information. [sent-19, score-0.569]

6 One can think of hard clustering as a special case of soft clustering where these probabilities only take values 0 or 1. [sent-28, score-0.421]

7 Among the hard clustering algorithms, the most well-known is the iterative relocation scheme for the Euclidean kmeans algorithm (MacQueen, 1967; Jain and Dubes, 1988; Duda et al. [sent-31, score-0.276]

8 We pose the hard clustering problem as one of obtaining an optimal quantization in terms of minimizing the loss in Bregman information, a quantity motivated by rate distortion theory. [sent-48, score-0.451]

9 Partitional hard clustering to minimize the loss in mutual information, otherwise known as information theoretic clustering (Dhillon et al. [sent-50, score-0.444]

10 In this paper, we formally prove an observation made by Forster and Warmuth (2000) that there exists a unique Bregman divergence corresponding to every regular exponential family. [sent-55, score-0.259]

11 In fact, we show that there is a bijection between regular exponential families and a class of Bregman divergences, that we call regular Bregman divergences. [sent-56, score-0.331]

12 Hard clustering with Bregman divergences is posed as a quantization problem that involves minimizing loss of Bregman information. [sent-68, score-0.361]

13 Based on our analysis of the Bregman clustering problem, we present a meta hard clustering algorithm that is applicable to all Bregman divergences (Section 3). [sent-72, score-0.499]

14 The meta clustering algorithm retains the simplicity and scalability of kmeans and is a direct generalization of all previously known centroid-based parametric hard clustering algorithms. [sent-73, score-0.418]
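
As an illustration of that scheme, here is a minimal Python sketch (ours, not the paper's pseudocode) of the two alternating steps: assign each point to the representative with the smallest Bregman divergence, then recompute each representative as the arithmetic mean of its assigned points.

```python
import numpy as np

def bregman_hard_cluster(X, k, phi, grad_phi, n_iter=100, seed=0):
    """Illustrative Bregman hard clustering: assign each point to the
    representative with smallest d_phi(x, mu_h), then set each
    representative to the arithmetic mean of its assigned points."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]   # initial representatives
    d = lambda x, y: phi(x) - phi(y) - np.dot(grad_phi(y), x - y)
    for _ in range(n_iter):
        # assignment step: nearest representative under the Bregman divergence
        labels = np.array([np.argmin([d(x, m) for m in mu]) for x in X])
        # re-estimation step: the arithmetic mean, which is the optimal
        # representative for every Bregman divergence
        new_mu = np.array([X[labels == h].mean(axis=0) if np.any(labels == h) else mu[h]
                           for h in range(k)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return labels, mu

# With phi(x) = ||x||^2 this is exactly the classical kmeans iteration.
X = np.vstack([np.random.default_rng(1).normal(0, 1, (50, 2)),
               np.random.default_rng(2).normal(5, 1, (50, 2))])
labels, mu = bregman_hard_cluster(X, 2, lambda x: np.dot(x, x), lambda x: 2.0 * x)
print(mu)
```

Swapping in a different convex φ changes only the divergence used in the assignment step; the mean update stays the same.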

15 To obtain a similar generalization for the soft clustering case, we show (Theorem 4, Section 4) that there is a uniquely determined Bregman divergence corresponding to every regular exponential family. [sent-75, score-0.495]

16 In Section 4.4, we define regular Bregman divergences using exponentially convex functions and show that there is a bijection between regular exponential families and regular Bregman divergences. [sent-80, score-0.646]

17 Using the correspondence between exponential families and Bregman divergences, we show that the mixture estimation problem based on regular exponential families is identical to a Bregman soft clustering problem (Section 5). [sent-82, score-0.557]

18 Finally, we study the relationship between Bregman clustering and rate distortion theory (Section 6). [sent-87, score-0.333]

19 Following Banerjee et al. (2004a), we observe that the Bregman hard and soft clustering formulations correspond to the “scalar” and asymptotic rate distortion problems respectively, where distortion is measured using a regular Bregman divergence. [sent-89, score-0.67]

20 Then, we motivate the Bregman hard clustering problem as a quantization problem that involves minimizing the loss in Bregman information and show its equivalence to a more direct formulation. [sent-131, score-0.28]

21 We then present a clustering algorithm that generalizes the iterative relocation scheme of kmeans to monotonically decrease the loss in Bregman information. [sent-142, score-0.256]

22 The achieved distortion is called the distortion-rate function. [sent-147, score-0.318]

23 Let the distortion be measured by a Bregman divergence dφ . [sent-151, score-0.244]

24 For all random variables X, if E[X] minimizes the expected distortion of X to a fixed point for a smooth distortion function F(x, y) (see Appendix B for details), then F(x, y) has to be a Bregman divergence (Banerjee et al. [sent-182, score-0.391]
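
A quick numerical illustration of this property (our own choice of data and divergence, not an experiment from the paper): for the Itakura-Saito divergence, the value of y that minimizes the average distortion to a positive sample essentially coincides with the sample mean.

```python
import numpy as np

# One-dimensional check: among candidate representatives y, the arithmetic mean
# (approximately) minimizes the average Itakura-Saito divergence to the sample,
# even though that divergence is far from symmetric.
rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=1.5, size=10_000)   # positive data, mean = 3

def itakura_saito(x, y):
    return x / y - np.log(x / y) - 1.0

candidates = np.linspace(0.5, 6.0, 2000)
avg_distortion = np.array([itakura_saito(x, y).mean() for y in candidates])
print("empirical mean     :", x.mean())
print("argmin of E[d(x,y)]:", candidates[np.argmin(avg_distortion)])
```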

25 We choose the information theoretic viewpoint to pose the problem, since we will study the connections of both the hard and soft clustering problems to rate distortion theory in Section 6. [sent-239, score-0.456]

26 Thus we define the Bregman hard clustering problem as that of finding a partitioning of X , or, equivalently, finding the random variable M, such that the loss in Bregman information due to quantization, Lφ (M) = Iφ (X) − Iφ (M), is minimized. [sent-240, score-0.251]

27 Exhaustiveness: The Bregman hard clustering algorithm with cluster centroids as optimal representatives works for all Bregman divergences and only for Bregman divergences since the arithmetic mean is the best predictor only for Bregman divergences (Banerjee et al. [sent-280, score-0.657]

28 Scalability: The computational complexity of each iteration of the Bregman hard clustering algorithm is linear in the number of data points and the number of desired clusters for all Bregman divergences, which makes the algorithm scalable and appropriate for large clustering problems. [sent-289, score-0.384]

29 The Bregman divergence corresponding to a convex combination of the component convex functions can then be used to cluster the data. [sent-292, score-0.283]

30 To accomplish our goal, we first establish that there is a unique Bregman divergence corresponding to every regular exponential family distribution. [sent-295, score-0.295]

31 Later, we make this relation more precise by establishing a bijection between regular exponential families and regular Bregman divergences. [sent-296, score-0.331]

32 For our analysis, it is convenient to work with the minimal natural sufficient statistic x and hence, we redefine regular exponential families in terms of the probability density of x ∈ Rd , noting that the original probability space can actually be quite general. [sent-331, score-0.248]

33 Definition 3 A multivariate parametric family Fψ of distributions {p(ψ,θ) | θ ∈ Θ = int(Θ) = dom(ψ) ⊆ Rd} is called a regular exponential family if each probability density is of the form p(ψ,θ)(x) = exp(⟨x, θ⟩ − ψ(θ)) p0(x), ∀x ∈ Rd, where x is a minimal sufficient statistic for the family. [sent-332, score-0.292]

34 Lemma 1 Let ψ be the cumulant function of a regular exponential family with natural parameter space Θ = dom(ψ). [sent-340, score-0.248]

35 Azoury and Warmuth (2001) remarked that the log-likelihood of the density of an exponential family distribution p(ψ,θ) can be written as the sum of the negative of a uniquely determined Bregman divergence dφ(x, µ) and a function that does not depend on the distribution parameters. [sent-383, score-0.244]
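
This decomposition is easy to check numerically for a simple member of the family. The sketch below (our illustration) uses the Poisson distribution with mean m and the divergence generated by φ(t) = t log t − t, and verifies that log p(x; m) + dφ(x, m) does not change as m varies:

```python
import numpy as np
from math import lgamma

# For the Poisson family, the Bregman divergence generated by
# phi(t) = t*log(t) - t is d_phi(x, m) = x*log(x/m) - x + m.  The claim is that
# log p(x; m) + d_phi(x, m) does not depend on the mean parameter m.
def log_poisson(x, m):
    return x * np.log(m) - m - lgamma(x + 1)

def d_phi(x, m):
    return x * np.log(x / m) - x + m

x = 4
for m in [0.5, 2.0, 4.0, 9.0]:
    print(m, log_poisson(x, m) + d_phi(x, m))   # same value for every m
```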

36 We are now ready to formally show that there is a unique Bregman divergence corresponding to every regular exponential family distribution. [sent-420, score-0.295]

37 From Theorem 4 we note that every regular exponential family corresponds to a unique and distinct Bregman divergence (one-to-one mapping). [sent-437, score-0.295]

38 Now, we investigate whether there is a regular exponential family corresponding to every choice of Bregman divergence (onto mapping). [sent-438, score-0.295]

39 For regular exponential families, the cumulant function ψ as well as its conjugate φ are convex functions of Legendre type. [sent-439, score-0.329]

40 Hence, for a Bregman divergence generated from a convex function φ to correspond to a regular exponential family, it is necessary that φ be of Legendre type. [sent-440, score-0.338]

41 Further, it is necessary that the Legendre conjugate ψ of φ be C∞, since cumulant functions of regular exponential families are C∞. [sent-441, score-0.293]

42 Although it is well known that the logarithm of an exponentially convex function is a convex function (Akhiezer, 1965), we are interested in the case where the logarithm is strictly convex with an open domain. [sent-447, score-0.273]

43 Using this class of exponentially convex functions, we now define a class of Bregman divergences called regular Bregman divergences. [sent-448, score-0.315]

44 We will now prove that there is a bijection between regular exponential families and regular Bregman divergences. [sent-452, score-0.331]

45 Then we show that there exists a unique regular exponential family determined by every regular Bregman divergence (onto). [sent-467, score-0.388]

46 Theorem 6 There is a bijection between regular exponential families and regular Bregman divergences. [sent-468, score-0.331]

47 That is, there is a regular Bregman divergence corresponding to every regular exponential family Fψ with cumulant function ψ and natural parameter space Θ. [sent-471, score-0.438]

48 That is, every regular Bregman divergence corresponds to a unique regular exponential family. [sent-478, score-0.352]

49 Using the correspondence between regular exponential families and regular Bregman divergences, we now pose the Bregman soft clustering problem as a parameter estimation problem for mixture models based on regular exponential family distributions. [sent-542, score-0.736]

50 We revisit the expectation maximization (EM) framework for estimating mixture densities and develop a Bregman soft clustering algorithm (Algorithm 3) for regular Bregman divergences. [sent-543, score-0.353]

51 Finally, we show how the hard clustering algorithm can be interpreted as a special case of the soft clustering algorithm and also discuss an alternative formulation of hard clustering in terms of a dual divergence derived from the conjugate function. [sent-545, score-0.781]

52 The model yields a soft clustering where clusters correspond to the components of the mixture model, and the soft membership of a data point in each cluster is proportional to the probability of the data point being generated by the corresponding density function. [sent-555, score-0.349]

53 Using the Bregman divergence viewpoint, we get a simplified version of the above EM algorithm that we call the Bregman soft clustering algorithm (Algorithm 3). [sent-564, score-0.309]
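
A minimal sketch of such a scheme, assuming the E-step posterior takes the form p(h|x) ∝ πh exp(−dφ(x, µh)) and the M-step updates the representatives as posterior-weighted means (our simplification; see Algorithm 3 in the paper for the precise statement):

```python
import numpy as np

def bregman_soft_cluster(X, k, d, n_iter=100, seed=0):
    """Sketch of a Bregman soft clustering iteration: posteriors come from
    exp(-d_phi(x, mu_h)) weighted by the mixing proportions, and the
    means are updated as posterior-weighted averages of the data."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: p(h | x) proportional to pi_h * exp(-d_phi(x, mu_h))
        logw = np.log(pi) - np.array([[d(x, m) for m in mu] for x in X])
        logw -= logw.max(axis=1, keepdims=True)        # numerical stability
        post = np.exp(logw)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: mixing proportions and posterior-weighted means
        pi = post.mean(axis=0)
        mu = (post.T @ X) / post.sum(axis=0)[:, None]
    return pi, mu, post

sq_euclid = lambda x, y: np.dot(x - y, x - y)
X = np.vstack([np.random.default_rng(1).normal(0, 1, (100, 2)),
               np.random.default_rng(2).normal(4, 1, (100, 2))])
pi, mu, post = bregman_soft_cluster(X, 2, sq_euclid)
print(pi, mu, sep="\n")
```

With squared Euclidean distortion this behaves like EM for a mixture of spherical Gaussians (up to a fixed variance scaling).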

54 In fact, in some situations it may be easier to design regular Bregman divergences for mixture modeling of data than to come up with an appropriate exponential family. [sent-581, score-0.318]

55 Using this result, we then show that Algorithms 2 and 3 are exactly equivalent for regular Bregman divergences and exponential families. [sent-584, score-0.29]

56 Hence, for a mixture model with density given by (16), the EM algorithm (Algorithm 2) reduces to the Bregman soft clustering algorithm (Algorithm 3). [sent-597, score-0.258]

57 So far we have considered the Bregman soft clustering problem for a set X where all the elements are equally important and assumed to have been independently sampled from some particular exponential distribution. [sent-598, score-0.281]

58 The corresponding update for the representatives is the posterior-weighted mean µh = ∑i νi p(h|xi) xi / ∑i νi p(h|xi). Finally, we note that the Bregman hard clustering algorithm is a limiting case of the above soft clustering algorithm. [sent-602, score-0.421]

59 For every convex function φ and positive constant β, βφ is also a convex function with the corresponding Bregman divergence dβφ = βdφ . [sent-603, score-0.255]

60 In the limit, when β → ∞, the posterior probabilities in the E-step take values in {0, 1} and hence, the E and M steps of the soft clustering algorithm reduce to the assignment and re-estimation steps of the hard clustering algorithm. [sent-604, score-0.421]
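
The sharpening effect of β is easy to see numerically; the toy values below are ours:

```python
import numpy as np

# Scaling the convex function by beta scales the divergence: d_{beta*phi} = beta*d_phi.
# As beta grows, the posteriors p(h|x) ∝ pi_h exp(-beta*d_phi(x, mu_h))
# concentrate on the closest representative, i.e. they become hard assignments.
d_to_reps = np.array([0.9, 1.0, 2.5])   # divergences from one point to k=3 representatives
pi = np.array([1/3, 1/3, 1/3])
for beta in [1.0, 10.0, 100.0]:
    w = pi * np.exp(-beta * (d_to_reps - d_to_reps.min()))   # shift for stability
    print(beta, np.round(w / w.sum(), 4))
```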

61 Since Bregman divergences are not symmetric (with the exception of squared Euclidean distance), we now consider an alternative formulation of Bregman clustering where cluster representatives are the first argument of the Bregman divergence. [sent-607, score-0.368]

62 Then the alternative Bregman hard clustering problem is to find clusters {Xh}, h = 1, …, k, and corresponding representatives {µh}, h = 1, …, k, that solve min_{µ1,…,µk} ∑_{h=1}^{k} ∑_{xi ∈ Xh} νi dφ(µh, xi). (19) [sent-611, score-0.304]

63 As mentioned earlier, Bregman divergences are convex in the first argument and hence the resulting optimization problem for each cluster is convex, so there is a unique optimal representative for each cluster. [sent-612, score-0.339]

64 It is now straightforward to see that this is our original Bregman hard clustering problem for the set X θ consisting of the dual data points with the same measure ν and the dual Bregman divergence dψ . [sent-621, score-0.351]
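
The duality being used here is the identity dφ(x, y) = dψ(∇φ(y), ∇φ(x)) for a Legendre pair (φ, ψ). A one-dimensional numerical check (our example) with φ(x) = x log x − x, whose conjugate is ψ(θ) = exp(θ):

```python
import numpy as np

# Legendre duality check in 1-D: grad phi(x) = log(x), psi(theta) = exp(theta).
# d_phi(x, y) should equal d_psi(grad_phi(y), grad_phi(x)), with arguments swapped.
d_phi = lambda x, y: x * np.log(x / y) - x + y                     # generalized I-divergence
d_psi = lambda a, b: np.exp(a) - np.exp(b) - np.exp(b) * (a - b)   # divergence of psi = exp
x, y = 2.0, 5.0
print(d_phi(x, y))
print(d_psi(np.log(y), np.log(x)))   # identical value
```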

65 We restrict our attention to regular exponential families and regular Bregman divergences in this section. [sent-632, score-0.426]

66 The trade-off is between the rate, i.e., the average number of bits for encoding a symbol, and the expected distortion between the source and the reproduction random variables, based on an appropriate distortion function d : X × X̂ → R+. [sent-639, score-0.367]

67 The rate distortion problem is a convex problem that involves optimizing over the probabilistic assignments p(x̂|x) and can be theoretically solved using the Blahut-Arimoto algorithm (Arimoto, 1972; Blahut, 1972; Csiszár, 1974; Cover and Thomas, 1991). [sent-641, score-0.25]
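
For reference, the textbook Blahut-Arimoto iteration for a discrete source with a fixed finite reproduction alphabet and Lagrange parameter β alternates p(x̂|x) ∝ q(x̂) exp(−β d(x, x̂)) and q(x̂) = ∑x p(x) p(x̂|x). A small self-contained sketch (our toy example, with squared error — itself a Bregman divergence — as the distortion):

```python
import numpy as np

def blahut_arimoto(p_x, d, beta, n_iter=200):
    """Textbook Blahut-Arimoto iteration for a discrete source: alternate
    q(xhat) = sum_x p(x) p(xhat|x)  and  p(xhat|x) ∝ q(xhat) exp(-beta*d(x,xhat))."""
    _, m = d.shape                    # d[x, xhat]: distortion matrix
    q = np.full(m, 1.0 / m)
    for _ in range(n_iter):
        w = q * np.exp(-beta * d)                       # shape (n, m)
        p_cond = w / w.sum(axis=1, keepdims=True)       # p(xhat | x)
        q = p_x @ p_cond                                # q(xhat)
    rate = np.sum(p_x[:, None] * p_cond *
                  np.log(p_cond / q[None, :] + 1e-300)) # mutual information (nats)
    distortion = np.sum(p_x[:, None] * p_cond * d)
    return rate, distortion

# Tiny example: 4 source symbols on the line, squared-error distortion, 2 codewords.
src = np.array([0.0, 1.0, 4.0, 5.0]); code = np.array([0.5, 4.5])
d = (src[:, None] - code[None, :]) ** 2
print(blahut_arimoto(np.full(4, 0.25), d, beta=2.0))
```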

68 This builds on results in Banerjee et al. (2004a), which extend previous work (Rose, 1994; Gray and Neuhoff, 1998) that related kmeans clustering to vector quantization and rate distortion based on squared Euclidean distortion. [sent-652, score-0.252]

69 Let Z, Ẑ denote suitable sufficient statistic representations of X, X̂ so that the distortion can be measured by a Bregman divergence dφ in the sufficient statistic space. [sent-653, score-0.294]

70 Unlike the basic rate distortion problem (21), the RDFC problem (23) is no longer a convex problem since it involves optimization over both Ẑs and p(ẑ|z). [sent-655, score-0.25]

71 From Section 5, we know that the maximum likelihood mixture estimation problem for any regular exponential family is equivalent to the Bregman soft clustering problem for the corresponding regular Bregman divergence. [sent-667, score-0.531]

72 Then the RDFC problem (23) for the source Z with regular Bregman divergence dφ, variational parameter βD, and reproduction random variable Ẑ with |Ẑ| = k is equivalent to the Bregman soft clustering problem (16) based on the Bregman divergence dβD φ with the number of clusters set to k. [sent-670, score-0.585]

73 The Bregman soft clustering problem corresponds to the RDFC problem and not to the basic rate distortion problem (21). [sent-677, score-0.383]

74 The resultant rate distortion function is referred to as the “scalar” or “order 1” rate distortion function R1 (D) (Gray and Neuhoff, 1998). [sent-681, score-0.342]

75 The problem is solved by performing hard assignments of the source symbols to the closest codebook members, which is similar to the assignment step in the Bregman hard clustering problem. [sent-682, score-0.282]

76 In fact, the “order 1” or “1-shot” rate distortion problem, assuming a known finite cardinality of the optimal reproduction support set, turns out to be exactly equivalent to the Bregman hard clustering problem. [sent-683, score-0.427]

77 In rate distortion theory, the loss in “information” is quantified by the expected Bregman distortion between Z and Ẑ. [sent-688, score-0.345]

78 (Banerjee et al., 2004a) The expected Bregman distortion between the source and the reproduction random variables is exactly equal to the loss in Bregman information due to compression. [sent-691, score-0.247]

79 Thus the information bottleneck problem is seen to be a special case of the RDFC problem (23), and hence also of the Bregman soft clustering problem and the mixture estimation problem for exponential families. [sent-721, score-0.309]

80 There are results in the literature that illustrate the usefulness of specific Bregman divergences and the corresponding Bregman clustering algorithms in important application domains. [sent-732, score-0.29]

81 The classical kmeans algorithm, which is a special case of the Bregman hard clustering algorithm for the squared Euclidean distance, has been successfully applied to a large number of domains where a Gaussian distribution assumption is valid. [sent-733, score-0.274]

82 These algorithms are, respectively, special cases of the Bregman hard and soft clustering algorithms for KL-divergence, and have been shown to provide high quality results on large real datasets such as the 20-Newsgroups, Reuters and Dmoz text datasets. [sent-738, score-0.273]
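
For instance, clustering L1-normalized frequency vectors with the divergence generated by negative entropy uses KL-divergence in the assignment step; a toy sketch (ours) of that assignment:

```python
import numpy as np

# With phi(p) = sum_i p_i log p_i on L1-normalized vectors, the generic hard
# clustering scheme uses KL-divergence in its assignment step, i.e.
# information-theoretic clustering of, say, word-frequency vectors.
rng = np.random.default_rng(0)
docs = rng.dirichlet(np.ones(5), size=6)   # 6 toy word-frequency vectors
reps = rng.dirichlet(np.ones(5), size=2)   # 2 cluster representatives
kl = lambda p, q: np.sum(p * np.log(p / q))
labels = [int(np.argmin([kl(p, m) for m in reps])) for p in docs]
print(labels)
```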

83 Since special cases of Bregman clustering algorithms have already been shown to be effective in various domains, we do not experimentally re-evaluate the Bregman clustering algorithms against other methods. [sent-744, score-0.324]

84 In particular, we study Bregman clustering of data generated from mixtures of exponential family distributions using the corresponding Bregman divergence as well as non-matching divergences. [sent-746, score-0.392]

85 From Table 3, we can see that clustering quality is significantly better when the Bregman divergence used in the clustering algorithm matches that of the generative model. [sent-782, score-0.421]

86 That is, the choice of the Bregman divergence used for clustering is important for obtaining good quality clusters. [sent-787, score-0.259]

87 It is meaningless to compare the clustering objective function values as they are different for the three versions of the Bregman clustering algorithm. [sent-791, score-0.324]

88 In general, however, the divergence used for clustering need not be the one corresponding to the generative model. [sent-811, score-0.259]

89 Second, our soft clustering approach is based on the relationship between Bregman divergences and exponential family distributions and the suitability of Bregman divergences as distortion or loss functions for data drawn from exponential distributions. [sent-824, score-0.816]

90 In our work, we extend this concept to say that the Bregman divergence of the Legendre conjugate of the cumulant function is a natural distance function for the data drawn according to a mixture model based on that exponential family. [sent-827, score-0.301]

91 (2001) develop a generalization of PCA for exponential families using loss functions based on the corresponding Bregman divergences and propose alternate minimization schemes for solving the problem. [sent-834, score-0.267]

92 As we discussed in Section 6, our treatment of clustering is very closely tied to rate distortion theory (Berger, 1971; Berger and Gibson, 1998; Gray and Neuhoff, 1998). [sent-841, score-0.333]

93 In addition, the results also establish a connection between the rate distortion problem with Bregman divergences and the mixture model estimation problem for exponential families (Banerjee et al. [sent-844, score-0.439]

94 In the literature, there are clustering algorithms that involve minimizing loss functions based on distortion measures that are somewhat different from Bregman divergences. [sent-846, score-0.336]

95 For example, Modha and Spangler (2003) present the convex-kmeans clustering algorithm for distortion measures that are always non-negative and convex in the second argument, using the notion of a generalized centroid. [sent-847, score-0.388]

96 Concluding Remarks In this paper, we have presented hard and soft clustering algorithms to minimize loss functions involving Bregman divergences. [sent-855, score-0.286]

97 Further, using a related result, we see that Bregman divergences are the only distortion functions for which such a centroid-based clustering scheme is possible. [sent-858, score-0.452]

98 Second, we formally show that there is a one-to-one correspondence between regular exponential families and regular Bregman divergences. [sent-859, score-0.298]

99 Hence, φa is also strictly convex and differentiable and the Bregman divergences associated with φ and φa are identical. [sent-929, score-0.252]

100 Let P0 be any non-negative bounded measure on Rd and Fψ = {p(ψ,θ) , θ ∈ Θ ⊆ Rd } be a regular exponential family with cumulant function ψ and base measure P0 , as discussed in Section 4. [sent-981, score-0.274]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('bregman', 0.842), ('clustering', 0.162), ('dom', 0.154), ('distortion', 0.147), ('banerjee', 0.14), ('divergences', 0.128), ('legendre', 0.121), ('xh', 0.116), ('rd', 0.104), ('divergence', 0.097), ('regular', 0.093), ('regman', 0.09), ('erugu', 0.086), ('ivergences', 0.086), ('convex', 0.079), ('rdfc', 0.074), ('hillon', 0.072), ('hosh', 0.072), ('exponential', 0.069), ('int', 0.066), ('ri', 0.061), ('lustering', 0.054), ('cumulant', 0.05), ('soft', 0.05), ('hard', 0.047), ('reproduction', 0.047), ('dhillon', 0.046), ('quantization', 0.044), ('exp', 0.044), ('families', 0.043), ('conjugate', 0.038), ('log', 0.036), ('representatives', 0.036), ('family', 0.036), ('bijection', 0.033), ('kmeans', 0.032), ('csisz', 0.03), ('qd', 0.03), ('em', 0.029), ('mixture', 0.028), ('cluster', 0.028), ('buzo', 0.027), ('nwald', 0.027), ('zu', 0.027), ('loss', 0.027), ('source', 0.026), ('theoretic', 0.026), ('statistic', 0.025), ('representative', 0.025), ('uniquely', 0.024), ('differentiable', 0.024), ('ui', 0.024), ('rate', 0.024), ('linde', 0.023), ('xi', 0.023), ('compression', 0.021), ('strictly', 0.021), ('absolutely', 0.02), ('counting', 0.02), ('mutual', 0.02), ('expectation', 0.02), ('relocation', 0.02), ('kazakos', 0.02), ('lbg', 0.02), ('redner', 0.02), ('distance', 0.019), ('density', 0.018), ('austin', 0.017), ('binomial', 0.017), ('gr', 0.017), ('poisson', 0.017), ('zv', 0.016), ('nq', 0.016), ('collins', 0.016), ('zs', 0.016), ('dual', 0.016), ('azoury', 0.016), ('parametric', 0.015), ('duality', 0.015), ('exponentially', 0.015), ('partitioning', 0.015), ('shannon', 0.015), ('tishby', 0.015), ('scheme', 0.015), ('lossy', 0.015), ('gray', 0.014), ('euclidean', 0.014), ('lebesgue', 0.014), ('xj', 0.014), ('proposition', 0.014), ('squared', 0.014), ('berger', 0.014), ('datasets', 0.014), ('argmin', 0.013), ('clusters', 0.013), ('ib', 0.013), ('measure', 0.013), ('walker', 0.013), ('qj', 0.013), ('utexas', 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999994 20 jmlr-2005-Clustering with Bregman Divergences

Author: Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, Joydeep Ghosh

Abstract: A wide variety of distortion functions, such as squared Euclidean distance, Mahalanobis distance, Itakura-Saito distance and relative entropy, have been used for clustering. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroid-based parametric clustering approaches, such as classical kmeans, the Linde-Buzo-Gray (LBG) algorithm and information-theoretic clustering, which arise by special choices of the Bregman divergence. The algorithms maintain the simplicity and scalability of the classical kmeans algorithm, while generalizing the method to a large class of clustering loss functions. This is achieved by first posing the hard clustering problem in terms of minimizing the loss in Bregman information, a quantity motivated by rate distortion theory, and then deriving an iterative algorithm that monotonically decreases this loss. In addition, we show that there is a bijection between regular exponential families and a large class of Bregman divergences, that we call regular Bregman divergences. This result enables the development of an alternative interpretation of an efficient EM scheme for learning mixtures of exponential family distributions, and leads to a simple soft clustering algorithm for regular Bregman divergences. Finally, we discuss the connection between rate distortion theory and Bregman clustering and present an information theoretic analysis of Bregman clustering algorithms in terms of a trade-off between compression and loss in Bregman information. Keywords: clustering, Bregman divergences, Bregman information, exponential families, expectation maximization, information theory

2 0.15322472 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection

Author: Koji Tsuda, Gunnar Rätsch, Manfred K. Warmuth

Abstract: We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: on-line learning with a simple square loss, and finding a symmetric positive definite matrix subject to linear constraints. The updates generalize the exponentiated gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the derivation and the analyses of the original EG update and AdaBoost generalize to the non-diagonal case. We apply the resulting matrix exponentiated gradient (MEG) update and DefiniteBoost to the problem of learning a kernel matrix from distance measurements.

3 0.14608657 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions

Author: Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, Suvrit Sra

Abstract: Several large scale data mining applications, such as text categorization and gene expression analysis, involve high-dimensional data that is also inherently directional in nature. Often such data is L2 normalized so that it lies on the surface of a unit hypersphere. Popular models such as (mixtures of) multi-variate Gaussians are inadequate for characterizing such data. This paper proposes a generative mixture-model approach to clustering directional data based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. In particular, we derive and analyze two variants of the Expectation Maximization (EM) framework for estimating the mean and concentration parameters of this mixture. Numerical estimation of the concentration parameters is non-trivial in high dimensions since it involves functional inversion of ratios of Bessel functions. We also formulate two clustering algorithms corresponding to the variants of EM that we derive. Our approach provides a theoretical basis for the use of cosine similarity that has been widely employed by the information retrieval community, and obtains the spherical kmeans algorithm (kmeans with cosine similarity) as a special case of both variants. Empirical results on clustering of high-dimensional text and gene-expression data based on a mixture of vMF distributions show that the ability to estimate the concentration parameter for each vMF component, which is not present in existing approaches, yields superior results, especially for difficult clustering tasks in high-dimensional spaces. Keywords: clustering, directional distributions, mixtures, von Mises-Fisher, expectation maximization, maximum likelihood, high dimensional data

4 0.067048654 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior

Author: Hal Daumé III, Daniel Marcu

Abstract: We develop a Bayesian framework for tackling the supervised clustering problem, the generic problem encountered in tasks such as reference matching, coreference resolution, identity uncertainty and record linkage. Our clustering model is based on the Dirichlet process prior, which enables us to define distributions over the countably infinite sets that naturally arise in this problem. We add supervision to our model by positing the existence of a set of unobserved random variables (we call these “reference types”) that are generic across all clusters. Inference in our framework, which requires integrating over infinitely many parameters, is solved using Markov chain Monte Carlo techniques. We present algorithms for both conjugate and non-conjugate priors. We present a simple—but general—parameterization of our model based on a Gaussian assumption. We evaluate this model on one artificial task and three real-world tasks, comparing it against both unsupervised and state-of-the-art supervised algorithms. Our results show that our model is able to outperform other models across a variety of tasks and performance metrics. Keywords: supervised clustering, record linkage, citation matching, coreference, Dirichlet process, non-parametric Bayesian

5 0.04412584 39 jmlr-2005-Information Bottleneck for Gaussian Variables

Author: Gal Chechik, Amir Globerson, Naftali Tishby, Yair Weiss

Abstract: The problem of extracting the relevant aspects of data was previously addressed through the information bottleneck (IB) method, through (soft) clustering one variable while preserving information about another - relevance - variable. The current work extends these ideas to obtain continuous representations that preserve relevant information, rather than discrete clusters, for the special case of multivariate Gaussian variables. While the general continuous IB problem is difficult to solve, we provide an analytic solution for the optimal representation and tradeoff between compression and relevance for the this important case. The obtained optimal representation is a noisy linear projection to eigenvectors of the normalized regression matrix Σx|y Σ−1 , which is also the basis obtained x in canonical correlation analysis. However, in Gaussian IB, the compression tradeoff parameter uniquely determines the dimension, as well as the scale of each eigenvector, through a cascade of structural phase transitions. This introduces a novel interpretation where solutions of different ranks lie on a continuum parametrized by the compression level. Our analysis also provides a complete analytic expression of the preserved information as a function of the compression (the “information-curve”), in terms of the eigenvalue spectrum of the data. As in the discrete case, the information curve is concave and smooth, though it is made of different analytic segments for each optimal dimension. Finally, we show how the algorithmic theory developed in the IB framework provides an iterative algorithm for obtaining the optimal Gaussian projections. Keywords: information bottleneck, Gaussian processes, dimensionality reduction, canonical correlation analysis

6 0.033433817 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach

7 0.028784439 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization

8 0.025301998 71 jmlr-2005-Variational Message Passing

9 0.025251016 23 jmlr-2005-Convergence Theorems for Generalized Alternating Minimization Procedures

10 0.024991348 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables

11 0.024112908 32 jmlr-2005-Expectation Consistent Approximate Inference

12 0.022437047 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds

13 0.022132719 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach

14 0.021913644 64 jmlr-2005-Semigroup Kernels on Measures

15 0.021418594 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks

16 0.020486049 16 jmlr-2005-Asymptotics in Empirical Risk Minimization

17 0.020274876 6 jmlr-2005-A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs

18 0.020065604 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods

19 0.019699018 3 jmlr-2005-A Classification Framework for Anomaly Detection

20 0.01926396 51 jmlr-2005-Local Propagation in Conditional Gaussian Bayesian Networks


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.152), (1, 0.015), (2, 0.046), (3, -0.091), (4, -0.159), (5, 0.005), (6, -0.438), (7, -0.047), (8, 0.147), (9, 0.223), (10, 0.132), (11, 0.025), (12, -0.352), (13, -0.021), (14, -0.096), (15, 0.178), (16, -0.085), (17, 0.155), (18, 0.03), (19, -0.039), (20, -0.054), (21, 0.02), (22, -0.046), (23, -0.155), (24, 0.012), (25, 0.023), (26, 0.103), (27, 0.029), (28, 0.025), (29, 0.047), (30, -0.023), (31, -0.048), (32, -0.022), (33, -0.107), (34, 0.037), (35, -0.112), (36, 0.144), (37, 0.025), (38, -0.047), (39, -0.021), (40, 0.004), (41, -0.065), (42, 0.087), (43, -0.009), (44, -0.013), (45, 0.037), (46, 0.101), (47, 0.062), (48, 0.097), (49, 0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97525221 20 jmlr-2005-Clustering with Bregman Divergences

Author: Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, Joydeep Ghosh

Abstract: A wide variety of distortion functions, such as squared Euclidean distance, Mahalanobis distance, Itakura-Saito distance and relative entropy, have been used for clustering. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroid-based parametric clustering approaches, such as classical kmeans, the Linde-Buzo-Gray (LBG) algorithm and information-theoretic clustering, which arise by special choices of the Bregman divergence. The algorithms maintain the simplicity and scalability of the classical kmeans algorithm, while generalizing the method to a large class of clustering loss functions. This is achieved by first posing the hard clustering problem in terms of minimizing the loss in Bregman information, a quantity motivated by rate distortion theory, and then deriving an iterative algorithm that monotonically decreases this loss. In addition, we show that there is a bijection between regular exponential families and a large class of Bregman divergences, that we call regular Bregman divergences. This result enables the development of an alternative interpretation of an efficient EM scheme for learning mixtures of exponential family distributions, and leads to a simple soft clustering algorithm for regular Bregman divergences. Finally, we discuss the connection between rate distortion theory and Bregman clustering and present an information theoretic analysis of Bregman clustering algorithms in terms of a trade-off between compression and loss in Bregman information. Keywords: clustering, Bregman divergences, Bregman information, exponential families, expectation maximization, information theory

2 0.587991 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection

Author: Koji Tsuda, Gunnar Rätsch, Manfred K. Warmuth

Abstract: We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: on-line learning with a simple square loss, and finding a symmetric positive definite matrix subject to linear constraints. The updates generalize the exponentiated gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the derivation and the analyses of the original EG update and AdaBoost generalize to the non-diagonal case. We apply the resulting matrix exponentiated gradient (MEG) update and DefiniteBoost to the problem of learning a kernel matrix from distance measurements.

3 0.48861963 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions

Author: Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, Suvrit Sra

Abstract: Several large scale data mining applications, such as text categorization and gene expression analysis, involve high-dimensional data that is also inherently directional in nature. Often such data is L2 normalized so that it lies on the surface of a unit hypersphere. Popular models such as (mixtures of) multi-variate Gaussians are inadequate for characterizing such data. This paper proposes a generative mixture-model approach to clustering directional data based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. In particular, we derive and analyze two variants of the Expectation Maximization (EM) framework for estimating the mean and concentration parameters of this mixture. Numerical estimation of the concentration parameters is non-trivial in high dimensions since it involves functional inversion of ratios of Bessel functions. We also formulate two clustering algorithms corresponding to the variants of EM that we derive. Our approach provides a theoretical basis for the use of cosine similarity that has been widely employed by the information retrieval community, and obtains the spherical kmeans algorithm (kmeans with cosine similarity) as a special case of both variants. Empirical results on clustering of high-dimensional text and gene-expression data based on a mixture of vMF distributions show that the ability to estimate the concentration parameter for each vMF component, which is not present in existing approaches, yields superior results, especially for difficult clustering tasks in high-dimensional spaces. Keywords: clustering, directional distributions, mixtures, von Mises-Fisher, expectation maximization, maximum likelihood, high dimensional data

4 0.18904993 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior

Author: Hal Daumé III, Daniel Marcu

Abstract: We develop a Bayesian framework for tackling the supervised clustering problem, the generic problem encountered in tasks such as reference matching, coreference resolution, identity uncertainty and record linkage. Our clustering model is based on the Dirichlet process prior, which enables us to define distributions over the countably infinite sets that naturally arise in this problem. We add supervision to our model by positing the existence of a set of unobserved random variables (we call these “reference types”) that are generic across all clusters. Inference in our framework, which requires integrating over infinitely many parameters, is solved using Markov chain Monte Carlo techniques. We present algorithms for both conjugate and non-conjugate priors. We present a simple—but general—parameterization of our model based on a Gaussian assumption. We evaluate this model on one artificial task and three real-world tasks, comparing it against both unsupervised and state-of-the-art supervised algorithms. Our results show that our model is able to outperform other models across a variety of tasks and performance metrics. Keywords: supervised clustering, record linkage, citation matching, coreference, Dirichlet process, non-parametric Bayesian

5 0.12804946 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach

Author: Gal Elidan, Nir Friedman

Abstract: A central challenge in learning probabilistic graphical models is dealing with domains that involve hidden variables. The common approach for learning model parameters in such domains is the expectation maximization (EM) algorithm. This algorithm, however, can easily get trapped in suboptimal local maxima. Learning the model structure is even more challenging. The structural EM algorithm can adapt the structure in the presence of hidden variables, but usually performs poorly without prior knowledge about the cardinality and location of the hidden variables. In this work, we present a general approach for learning Bayesian networks with hidden variables that overcomes these problems. The approach builds on the information bottleneck framework of Tishby et al. (1999). We start by proving formal correspondence between the information bottleneck objective and the standard parametric EM functional. We then use this correspondence to construct a learning algorithm that combines an information-theoretic smoothing term with a continuation procedure. Intuitively, the algorithm bypasses local maxima and achieves superior solutions by following a continuous path from a solution of, an easy and smooth, target function, to a solution of the desired likelihood function. As we show, our algorithmic framework allows learning of the parameters as well as the structure of a network. In addition, it also allows us to introduce new hidden variables during model selection and learn their cardinality. We demonstrate the performance of our procedure on several challenging real-life data sets. Keywords: Bayesian networks, hidden variables, information bottleneck, continuation, variational methods

6 0.11984937 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization

7 0.10543185 64 jmlr-2005-Semigroup Kernels on Measures

8 0.10349512 39 jmlr-2005-Information Bottleneck for Gaussian Variables

9 0.10015006 16 jmlr-2005-Asymptotics in Empirical Risk Minimization

10 0.096957877 23 jmlr-2005-Convergence Theorems for Generalized Alternating Minimization Procedures

11 0.094953582 41 jmlr-2005-Kernel Methods for Measuring Independence

12 0.092280231 40 jmlr-2005-Inner Product Spaces for Bayesian Networks

13 0.083327986 48 jmlr-2005-Learning the Kernel Function via Regularization

14 0.082031578 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors

15 0.080289952 51 jmlr-2005-Local Propagation in Conditional Gaussian Bayesian Networks

16 0.080282219 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification

17 0.07625258 12 jmlr-2005-An MDP-Based Recommender System

18 0.075757176 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds

19 0.075353287 71 jmlr-2005-Variational Message Passing

20 0.074373029 37 jmlr-2005-Generalization Bounds and Complexities Based on Sparsity and Clustering for Convex Combinations of Functions from Random Classes


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.03), (17, 0.028), (19, 0.059), (28, 0.321), (36, 0.026), (37, 0.021), (42, 0.015), (43, 0.032), (47, 0.016), (52, 0.045), (59, 0.022), (70, 0.018), (80, 0.018), (84, 0.096), (88, 0.102), (90, 0.025), (94, 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.67742914 20 jmlr-2005-Clustering with Bregman Divergences

Author: Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, Joydeep Ghosh

Abstract: A wide variety of distortion functions, such as squared Euclidean distance, Mahalanobis distance, Itakura-Saito distance and relative entropy, have been used for clustering. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroid-based parametric clustering approaches, such as classical kmeans, the Linde-Buzo-Gray (LBG) algorithm and information-theoretic clustering, which arise by special choices of the Bregman divergence. The algorithms maintain the simplicity and scalability of the classical kmeans algorithm, while generalizing the method to a large class of clustering loss functions. This is achieved by first posing the hard clustering problem in terms of minimizing the loss in Bregman information, a quantity motivated by rate distortion theory, and then deriving an iterative algorithm that monotonically decreases this loss. In addition, we show that there is a bijection between regular exponential families and a large class of Bregman divergences, that we call regular Bregman divergences. This result enables the development of an alternative interpretation of an efficient EM scheme for learning mixtures of exponential family distributions, and leads to a simple soft clustering algorithm for regular Bregman divergences. Finally, we discuss the connection between rate distortion theory and Bregman clustering and present an information theoretic analysis of Bregman clustering algorithms in terms of a trade-off between compression and loss in Bregman information. Keywords: clustering, Bregman divergences, Bregman information, exponential families, expectation maximization, information theory

2 0.4189066 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions

Author: Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, Suvrit Sra

Abstract: Several large scale data mining applications, such as text categorization and gene expression analysis, involve high-dimensional data that is also inherently directional in nature. Often such data is L2 normalized so that it lies on the surface of a unit hypersphere. Popular models such as (mixtures of) multi-variate Gaussians are inadequate for characterizing such data. This paper proposes a generative mixture-model approach to clustering directional data based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. In particular, we derive and analyze two variants of the Expectation Maximization (EM) framework for estimating the mean and concentration parameters of this mixture. Numerical estimation of the concentration parameters is non-trivial in high dimensions since it involves functional inversion of ratios of Bessel functions. We also formulate two clustering algorithms corresponding to the variants of EM that we derive. Our approach provides a theoretical basis for the use of cosine similarity that has been widely employed by the information retrieval community, and obtains the spherical kmeans algorithm (kmeans with cosine similarity) as a special case of both variants. Empirical results on clustering of high-dimensional text and gene-expression data based on a mixture of vMF distributions show that the ability to estimate the concentration parameter for each vMF component, which is not present in existing approaches, yields superior results, especially for difficult clustering tasks in high-dimensional spaces. Keywords: clustering, directional distributions, mixtures, von Mises-Fisher, expectation maximization, maximum likelihood, high dimensional data

3 0.35178724 49 jmlr-2005-Learning the Kernel with Hyperkernels     (Kernel Machines Section)

Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson

Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming

4 0.35171282 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables

Author: Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, Yasemin Altun

Abstract: Learning general functional dependencies between arbitrary input and output spaces is one of the key challenges in computational intelligence. While recent progress in machine learning has mainly focused on designing flexible and powerful input representations, this paper addresses the complementary issue of designing classification algorithms that can deal with more complex outputs, such as trees, sequences, or sets. More generally, we consider problems involving multiple dependent output variables, structured output spaces, and classification problems with class attributes. In order to accomplish this, we propose to appropriately generalize the well-known notion of a separation margin and derive a corresponding maximum-margin formulation. While this leads to a quadratic program with a potentially prohibitive, i.e. exponential, number of constraints, we present a cutting plane algorithm that solves the optimization problem in polynomial time for a large class of problems. The proposed method has important applications in areas such as computational biology, natural language processing, information retrieval/extraction, and optical character recognition. Experiments from various domains involving different types of output spaces emphasize the breadth and generality of our approach.

5 0.34929544 64 jmlr-2005-Semigroup Kernels on Measures

Author: Marco Cuturi, Kenji Fukumizu, Jean-Philippe Vert

Abstract: We present a family of positive definite kernels on measures, characterized by the fact that the value of the kernel between two measures is a function of their sum. These kernels can be used to derive kernels on structured objects, such as images and texts, by representing these objects as sets of components, such as pixels or words, or more generally as measures on the space of components. Several kernels studied in this work make use of common quantities defined on measures such as entropy or generalized variance to detect similarities. Given an a priori kernel on the space of components itself, the approach is further extended by restating the previous results in a more efficient and flexible framework using the “kernel trick”. Finally, a constructive approach to such positive definite kernels through an integral representation theorem is proved, before presenting experimental results on a benchmark experiment of handwritten digits classification to illustrate the validity of the approach. Keywords: kernels on measures, semigroup theory, Jensen divergence, generalized variance, reproducing kernel Hilbert space

6 0.3482964 48 jmlr-2005-Learning the Kernel Function via Regularization

7 0.34772506 39 jmlr-2005-Information Bottleneck for Gaussian Variables

8 0.34673297 71 jmlr-2005-Variational Message Passing

9 0.34348071 13 jmlr-2005-Analysis of Variance of Cross-Validation Estimators of the Generalization Error

10 0.34076107 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection

11 0.33996418 11 jmlr-2005-Algorithmic Stability and Meta-Learning

12 0.3396669 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors

13 0.33847955 3 jmlr-2005-A Classification Framework for Anomaly Detection

14 0.33835071 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach

15 0.33483943 60 jmlr-2005-On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning

16 0.33480072 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels

17 0.33332089 44 jmlr-2005-Learning Module Networks

18 0.33323362 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching

19 0.33051506 38 jmlr-2005-Generalization Bounds for the Area Under the ROC Curve

20 0.32813257 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning