jmlr jmlr2005 jmlr2005-19 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, Suvrit Sra
Abstract: Several large scale data mining applications, such as text categorization and gene expression analysis, involve high-dimensional data that is also inherently directional in nature. Often such data is L2 normalized so that it lies on the surface of a unit hypersphere. Popular models such as (mixtures of) multi-variate Gaussians are inadequate for characterizing such data. This paper proposes a generative mixture-model approach to clustering directional data based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. In particular, we derive and analyze two variants of the Expectation Maximization (EM) framework for estimating the mean and concentration parameters of this mixture. Numerical estimation of the concentration parameters is non-trivial in high dimensions since it involves functional inversion of ratios of Bessel functions. We also formulate two clustering algorithms corresponding to the variants of EM that we derive. Our approach provides a theoretical basis for the use of cosine similarity that has been widely employed by the information retrieval community, and obtains the spherical kmeans algorithm (kmeans with cosine similarity) as a special case of both variants. Empirical results on clustering of high-dimensional text and gene-expression data based on a mixture of vMF distributions show that the ability to estimate the concentration parameter for each vMF component, which is not present in existing approaches, yields superior results, especially for difficult clustering tasks in high-dimensional spaces. Keywords: clustering, directional distributions, mixtures, von Mises-Fisher, expectation maximization, maximum likelihood, high dimensional data
Reference: text
sentIndex sentText sentNum sentScore
1 This paper proposes a generative mixture-model approach to clustering directional data based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. [sent-13, score-0.55]
2 Our approach provides a theoretical basis for the use of cosine similarity that has been widely employed by the information retrieval community, and obtains the spherical kmeans algorithm (kmeans with cosine similarity) as a special case of both variants. [sent-17, score-0.348]
3 Keywords: clustering, directional distributions, mixtures, von Mises-Fisher, expectation maximization, maximum likelihood, high dimensional data. [sent-19, score-0.312]
4 However, traditional methods for clustering data are severely challenged by a variety of complex characteristics exhibited by certain recent data sets examined by the machine learning and data mining communities. [sent-23, score-0.229]
5 In this paper we focus on clustering objects such as text documents and gene expressions, where the complexity arises from their representation as vectors that are not only very high dimensional (and often sparse) but also directional, i. [sent-25, score-0.412]
6 One can broadly categorize clustering approaches to be either generative (also known as parametric or probabilistic) (Smyth, 1997; Jaakkola and Haussler, 1999) or discriminative (non-parametric) (Indyk, 1999; Schölkopf and Smola, 2001). [sent-28, score-0.262]
7 A lot of domain knowledge can be incorporated into generative models, so that clustering of data uncovers specific desirable patterns that one is looking for. [sent-31, score-0.262]
8 For vector data, there are well studied clustering algorithms for popular generative models such as a mixture of multivariate Gaussians, whose effect is analogous to the use of Euclidean or Mahalanobis type distances as the chosen measure of distortion from the discriminative perspective. [sent-34, score-0.324]
9 The choice of a particular distortion measure (or the corresponding generative model) can be crucial to the performance of a clustering procedure. [sent-35, score-0.262]
10 For example, studies in information retrieval applications convincingly demonstrate cosine similarity to be a more effective measure of similarity for analyzing and clustering text documents. [sent-38, score-0.445]
11 Further, the spherical kmeans (spkmeans) algorithm (Dhillon and Modha, 2001), that performs kmeans using cosine similarity instead of Euclidean distortion, has been found to work well for text clustering. [sent-40, score-0.402]
12 Hence, analysis and clustering of data using Pearson correlations is essentially a clustering problem for directional data. [sent-58, score-0.553]
13 Contributions: In this paper we present a generative model, consisting of a mixture of von Mises-Fisher (vMF) distributions, tailored for directional data distributed on the surface of a unit hypersphere. [sent-60, score-0.405]
14 We derive two clustering algorithms based on EM for estimating the parameters of the mixture model from first principles. [sent-61, score-0.269]
15 The ability to adapt κ on a per-component basis leads to substantial performance improvements over existing generative approaches to modeling directional data. [sent-63, score-0.22]
16 We show a connection between the proposed methods and a class of existing algorithms for clustering high-dimensional directional data. [sent-64, score-0.346]
17 In particular, our generative model has the same relation to spkmeans as a model based on a mixture of unit covariance Gaussians has to classical kmeans that uses squared Euclidean distances. [sent-65, score-0.53]
18 We also present detailed experimental comparisons of the proposed algorithms with spkmeans and one of its variants. [sent-66, score-0.321]
19 This result is particularly important when using mixtures of vMFs since the computational needs for hard assignments are lower than what is required for the standard soft assignments (E-step) for these models. [sent-74, score-0.309]
20 In the learning community, perhaps the most widely studied high-dimensional directional data stem from text documents represented by vector space models. [sent-85, score-0.245]
21 For example, hierarchical agglomerative methods based on cosine, Jaccard or Dice coefficients were dominant for text clustering till the mid-1990s (Rasmussen, 1992). [sent-88, score-0.258]
22 A fairly extensive list of references on generative approaches to text clustering can be found in (Zhong and Ghosh, 2003a). [sent-95, score-0.313]
23 Of particular relevance to this work is the spkmeans algorithm (Dhillon and Modha, 2001), which adapts the kmeans algorithm to normalized data by using the cosine similarity for cluster allocation, and also by re-normalizing the cluster means to unit length. [sent-96, score-0.807]
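To make the assignment-and-renormalization loop just described concrete, here is a minimal spherical kmeans (spkmeans) sketch in Python; it is not the Dhillon and Modha implementation, and the function name, initialization, and defaults are ours.

```python
import numpy as np

def spkmeans(X, k, n_iter=50, seed=0):
    """X: (n, d) array of L2-normalized rows; returns labels and unit-length cluster means."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    M = X[rng.choice(n, size=k, replace=False)].copy()   # initial unit centroids
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        sims = X @ M.T                     # cosine similarities, since all rows are unit norm
        labels = sims.argmax(axis=1)       # allocate each point to the most similar centroid
        for h in range(k):
            members = X[labels == h]
            if len(members) == 0:
                continue                   # keep the old centroid for an empty cluster
            m = members.sum(axis=0)
            M[h] = m / np.linalg.norm(m)   # re-normalize the cluster mean to unit length
    return labels, M
```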
24 The spkmeans algorithm is superior to regular kmeans for high-dimensional text data, and competitive or superior in both performance and speed to a wide range of other existing alternatives for text clustering (Strehl et al. [sent-97, score-0.722]
25 The larger topic of clustering very high-dimensional data (dimension in the thousands or more), irrespective of whether it is directional or not, has also attracted great interest lately. [sent-100, score-0.346]
26 Among generative approaches for clustering high-dimensional data, perhaps the most noteworthy is one that uses low dimensional projections of mixtures of Gaussians (Dasgupta, 1999). [sent-102, score-0.285]
27 (2003) use a mixture of two circular von Mises distributions to estimate the parameters using a quasi-Newton procedure. [sent-110, score-0.273]
28 Wallace and Dowe (2000) perform mixture modeling for circular von Mises distributions and have produced a software called Snob that implements their ideas. [sent-111, score-0.299]
29 McLachlan and Peel (2000) discuss mixture analysis of directional data and mention the possibility of using Fisher distributions (3-dimensional vMFs), but instead use 3-dimensional Kent distributions (Mardia and Jupp, 2000). [sent-112, score-0.263]
30 They also mention work related to the clustering of directional data, but all the efforts included by them are restricted to 2-D or 3-D vMFs. [sent-113, score-0.346]
31 The connection between a generative model involving vMF distributions with constant κ and the spkmeans algorithm was first observed by Banerjee and Ghosh (2002). [sent-117, score-0.407]
32 This enhancement leads to even better clustering performance for difficult clustering tasks: when clusters overlap, when cluster sizes are skewed, and when cluster sizes are small relative to the dimensionality of the data. [sent-125, score-0.849]
33 soft mixture modeling and the behavior of vMF based mixture models are obtained. [sent-127, score-0.224]
34 Sections 3 and 4 form the basis for two clustering algorithms using soft and hard assignments, respectively, which are proposed in Section 5. [sent-133, score-0.281]
35 A unit random vector x (x ∈ R^d with ||x|| = 1, or equivalently x ∈ S^{d−1}) is said to have the d-variate von Mises-Fisher (vMF) distribution if its probability density function is given by f(x | µ, κ) = c_d(κ) exp(κ µ^T x). [sent-151, score-0.241]
36 Assuming the x_i to be independent and identically distributed, we can write the log-likelihood of X as ln P(X | µ, κ) = n ln c_d(κ) + κ µ^T r, where r = ∑_{i=1}^n x_i is the resultant vector. [sent-170, score-0.345]
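A hedged sketch of these quantities, assuming the standard vMF normalizer c_d(κ) = κ^{d/2−1} / ((2π)^{d/2} I_{d/2−1}(κ)) and using SciPy's exponentially scaled Bessel function for numerical stability; the function names are ours, not the authors'.

```python
import numpy as np
from scipy.special import ive

def log_cd(kappa, d):
    """log of the vMF normalizing constant c_d(kappa)."""
    nu = d / 2.0 - 1.0
    log_bessel = np.log(ive(nu, kappa)) + kappa          # log I_nu(kappa), since ive = iv * exp(-kappa)
    return nu * np.log(kappa) - (d / 2.0) * np.log(2.0 * np.pi) - log_bessel

def vmf_log_likelihood(X, mu, kappa):
    """ln P(X | mu, kappa) = n ln c_d(kappa) + kappa * mu^T r for L2-normalized rows of X."""
    n, d = X.shape
    r = X.sum(axis=0)                                     # resultant vector
    return n * log_cd(kappa, d) + kappa * (mu @ r)

# Toy usage with synthetic data:
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
X /= np.linalg.norm(X, axis=1, keepdims=True)             # project onto the unit hypersphere
mu = X.sum(axis=0); mu /= np.linalg.norm(mu)
print(vmf_log_likelihood(X, mu, kappa=5.0))
```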
37 EM on a Mixture of vMFs (moVMF): We now consider a mixture of k vMF (moVMF) distributions that serves as a generative model for directional data. [sent-182, score-0.287]
38 2 Analysis of Hard Assignments In this subsection, we show that hard assignments should give a reasonable clustering in terms of the log-likelihood since they actually maximize a tight lower bound on the incomplete log-likelihood of the data. [sent-311, score-0.395]
39 On the other hand, hard assignments only need to maintain the cluster assignments of each point, i. [sent-325, score-0.34]
40 Thus, clustering using hard-assignments essentially maximizes a lower bound on the incomplete log-likelihood. [sent-349, score-0.247]
41 Now, adding the incomplete-data log-likelihood to both sides of the inequality proven above, we obtain E_p[ln P(Z | X, Θ)] + ln P(X | Θ) ≤ E_q[ln P(Z | X, Θ)] + ln P(X | Θ), hence E_p[ln(P(Z | X, Θ) P(X | Θ))] ≤ E_q[ln(P(Z | X, Θ) P(X | Θ))], i.e., E_p[ln P(X, Z | Θ)] ≤ E_q[ln P(X, Z | Θ)]. [sent-381, score-0.246]
42 Then, for h̃_i = h_i*, E_q[ln P(Z | X, Θ)] = ∑_{i=1}^n ∑_{h=1}^k q(h|x_i, Θ) ln p(h|x_i, Θ) ≤ ∑_{i=1}^n ln p(h_i*|x_i, Θ) = ∑_{i=1}^n ln p(h̃_i|x_i, Θ) = ∑_{i=1}^n ∑_{h=1}^k q̃(h|x_i, Θ) ln p(h|x_i, Θ) = E_q̃[ln P(Z | X, Θ)]. [sent-387, score-0.412]
43 Algorithms The developments of the previous section naturally lead to two algorithms for clustering directional data. [sent-392, score-0.346]
44 Upon termination, Algorithm 2 yields a hard clustering of the data and the parameters Θ = {α_h, µ_h, κ_h}_{h=1}^k of the k vMFs that model the input data set X. [sent-405, score-0.291]
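As a rough illustration of the two algorithms (not the authors' code), the sketch below performs one EM iteration for a mixture of k vMFs, with a flag switching between the soft (soft-moVMF) and hard (hard-moVMF) assignment variants. The concentration update uses a closed-form approximation to inverting the Bessel-function ratio A_d; the paper's own estimator may differ in detail.

```python
import numpy as np
from scipy.special import ive

def _log_cd(kappa, d):
    # log of the vMF normalizing constant, computed stably via the exp-scaled Bessel function.
    nu = d / 2.0 - 1.0
    return nu * np.log(kappa) - (d / 2.0) * np.log(2.0 * np.pi) - (np.log(ive(nu, kappa)) + kappa)

def movmf_em_step(X, alphas, mus, kappas, hard=False):
    """X: (n, d) L2-normalized rows; alphas: (k,); mus: (k, d) unit rows; kappas: (k,)."""
    n, d = X.shape
    k = len(alphas)
    # E-step: log p(h | x_i, Theta) up to an additive constant, normalized via log-sum-exp.
    log_post = (np.log(alphas)[None, :]
                + np.array([_log_cd(kappas[h], d) for h in range(k)])[None, :]
                + (X @ mus.T) * kappas[None, :])
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    if hard:
        # hard-moVMF: winner-take-all assignment instead of soft responsibilities
        one_hot = np.zeros_like(post)
        one_hot[np.arange(n), post.argmax(axis=1)] = 1.0
        post = one_hot
    # M-step: update priors, mean directions, and per-component concentrations.
    nh = post.sum(axis=0)
    alphas = nh / n
    R = post.T @ X                                        # (k, d) weighted resultant vectors
    norms = np.linalg.norm(R, axis=1)
    mus = R / np.maximum(norms, 1e-12)[:, None]
    rbar = np.clip(norms / np.maximum(nh, 1e-12), 1e-8, 1 - 1e-8)
    kappas = rbar * (d - rbar ** 2) / (1.0 - rbar ** 2)   # approximate A_d^{-1}(rbar)
    return alphas, mus, kappas, post
```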
45 Revisiting Spherical Kmeans: In this section we show that upon enforcing certain restrictive assumptions on the generative model, the spkmeans algorithm (Algorithm 3) can be viewed as a special case of both the soft-moVMF and hard-moVMF algorithms. [sent-407, score-0.376]
46 algorithm reduces to assigning a point to its nearest cluster, where nearness is computed as a cosine similarity between the point and the cluster representative. [sent-416, score-0.266]
47 Thus, a point x_i will be assigned to cluster h* = argmax_h x_i^T µ_h, since lim_{κ→∞} p(h*|x_i, Θ) = lim_{κ→∞} exp(κ x_i^T µ_{h*}) / ∑_{h=1}^k exp(κ x_i^T µ_h) = 1, and p(h|x_i, Θ) → 0 as κ → ∞ for all h ≠ h*. [sent-417, score-0.269]
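A small numerical illustration of this limit (the cosine similarities below are made up): as κ grows, the posterior with equal priors and a shared κ concentrates on the component with the largest cosine similarity, which is exactly the spkmeans assignment rule.

```python
import numpy as np

cos_sims = np.array([0.90, 0.85, 0.10])     # hypothetical values of x^T mu_h for three components
for kappa in (1.0, 10.0, 100.0):
    logits = kappa * cos_sims
    p = np.exp(logits - logits.max())       # softmax over kappa-scaled cosine similarities
    p /= p.sum()
    print(kappa, np.round(p, 4))
# At kappa = 100 essentially all posterior mass sits on argmax_h x^T mu_h.
```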
48 To show that spkmeans can also be seen as a special case of the hard-moVMF, in addition to assuming the priors of the components to be equal, we further assume that the concentration parameters of all the components are equal, i. [sent-418, score-0.38]
49 With these assumptions on the model, the estimation of the common concentration parameter becomes unessential since the hard assignment will depend only on the value of the cosine similarity x_i^T µ_h, and hard-moVMF reduces to spkmeans. [sent-421, score-0.311]
50 In addition to the abovementioned algorithms, we report experimental results on another algorithm fskmeans (Banerjee and Ghosh, 2002) that belongs to the same class in the sense that, like spkmeans, it can be derived from the mixture of vMF models with some restrictive assumptions. [sent-422, score-0.292]
51 It has already been established that kmeans using Euclidean distance performs much worse than spkmeans for text data (Strehl et al. [sent-439, score-0.464]
52 We also created various subsets of some of the data sets for gaining greater insight into the nature of clusters discovered or to model some particular clustering scenario (e. [sent-445, score-0.386]
53 The underlying clusters in this data set are highly skewed in terms of the number of documents per cluster, with sizes ranging from 9 to 494. [sent-499, score-0.258]
54 Gene-expression data was selected to offer a clustering domain different from text analysis. [sent-537, score-0.258]
55 As previously motivated, the use of Pearson correlation for the analysis of gene expression data is common, so a directional model is well-suited. [sent-538, score-0.26]
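A quick sketch of why this is so (synthetic data, helper name ours): centering a vector and L2-normalizing it turns the Pearson correlation between two vectors into a plain dot product, i.e., a cosine similarity on the unit hypersphere.

```python
import numpy as np

def to_direction(x):
    z = x - x.mean()                 # center
    return z / np.linalg.norm(z)     # L2-normalize onto the unit sphere

rng = np.random.default_rng(1)
x, y = rng.normal(size=20), rng.normal(size=20)
pearson = np.corrcoef(x, y)[0, 1]
cosine = to_direction(x) @ to_direction(y)
print(np.isclose(pearson, cosine))   # True: correlation = cosine of the transformed vectors
```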
56 Out of these we selected a subset of 996 genes for clustering (Dhillon et al. [sent-546, score-0.233]
57 Methodology: Except for the gene expression data set, performance of the algorithms on all the data sets has been analyzed using mutual information (MI) between the cluster and class labels. [sent-550, score-0.288]
58 For soft-moVMF we “harden” the clustering produced by labeling a point with the cluster label for which it has the highest value of posterior probability (ties broken arbitrarily), in order to evaluate MI. [sent-556, score-0.335]
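A minimal sketch of this evaluation step (function names ours): harden the soft clustering by taking the argmax posterior per point, then compute the mutual information between cluster and class labels from the empirical joint distribution. The exact MI normalization used in the experiments may differ.

```python
import numpy as np

def mutual_information(labels_true, labels_pred):
    labels_true, labels_pred = np.asarray(labels_true), np.asarray(labels_pred)
    mi = 0.0
    for c in np.unique(labels_true):
        for h in np.unique(labels_pred):
            p_ch = np.mean((labels_true == c) & (labels_pred == h))
            if p_ch > 0:
                p_c = np.mean(labels_true == c)
                p_h = np.mean(labels_pred == h)
                mi += p_ch * np.log(p_ch / (p_c * p_h))
    return mi

# Hardening hypothetical soft posteriors `post` of shape (n, k):
#   labels_pred = post.argmax(axis=1)
#   score = mutual_information(class_labels, labels_pred)
```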
59 The clustering produced by our soft cluster assignment algorithm is shown in Figure 1(b). [sent-593, score-0.439]
60 The confusion matrix, obtained by “hardening” the clustering produced by soft-moVMF for the small-mix data set, is [26 1; 0 23]. [sent-605, score-0.264]
61 As is evident from this confusion matrix, the clustering performed by soft-moVMF is excellent, though not surprising, since small-mix is a data set with well-separated clusters. [sent-606, score-0.264]
62 Even though Classic300 is well separated, the small number of documents per cluster makes the problem somewhat difficult for fskmeans and spkmeans, while hard-moVMF has a much better performance due to its model flexibility. [sent-648, score-0.413]
63 It seems that the low number of documents does not pose a problem for soft-moVMF and it ends up getting an almost perfect clustering for this data set. [sent-650, score-0.262]
64 Table 5: Comparative confusion matrices for 3 clusters of Classic300 (columns: med, cisi, cran; rows: the three clusters found by each algorithm).
fskmeans:    29 38 22 | 31 27 38 | 40 35 40
spkmeans:    29 38 22 | 31 27 38 | 40 35 40
hard-moVMF:   3 72  1 | 62 28 17 | 35  0 82
soft-moVMF:   0 98  0 | 99  2  0 |  1  0 100 [sent-652, score-2.171]
65 As before, due to the small number of documents per cluster, fskmeans and spkmeans give a rather mixed confusion matrix. [sent-656, score-0.663]
66 When 4 clusters are requested from soft-moVMF, it returns 4 very pure clusters (not shown in the confusion matrices) two of which are almost equal sized segments of the bigger cluster. [sent-660, score-0.415]
67 In Table 7 we show the confusion matrices resulting from 5 clusters of the Classic3 corpus. [sent-662, score-0.236]
68 This behavior of our moVMF algorithms coupled with the observations in Table 6, suggest a clustering method in which one could generate a slightly higher number of clusters than required, and then agglomerate them appropriately. [sent-664, score-0.386]
69 Among the other three, hard-moVMF seems to perform better than spkmeans and fskmeans across the range of clusters. [sent-670, score-0.551]
[Figure: mutual information values plotted against the number of clusters for fskmeans, spkmeans, hard-moVMF and soft-moVMF; only the legend and axis-label text survived extraction.]
72 Yahoo News Data Set: The Yahoo News data set is a relatively difficult data set for clustering since it has a fair amount of overlap among its clusters and the number of points per cluster is low. [sent-711, score-0.514]
73 In addition, the clusters are highly skewed in terms of their comparative sizes. [sent-712, score-0.227]
74 All the algorithms perform similarly until the true number of clusters after which soft-moVMF and spkmeans perform better than the others. [sent-719, score-0.5]
75 Since the number of documents per cluster is small (100), as before spkmeans and fskmeans do not perform that well, even at the true number of clusters, whereas soft-moVMF performs considerably better than the others over the entire range. [sent-722, score-0.734]
[Figure: further plots of mutual information versus the number of clusters, comparing fskmeans, spkmeans, hard-moVMF and soft-moVMF; only the legend and axis-label text survived extraction.]
79 These internal measures have been earlier employed for evaluating clustering of genes (e. [sent-793, score-0.233]
80 As can easily be seen, a higher homogeneity means that the individual elements of each cluster are quite similar to the cluster representative. [sent-806, score-0.256]
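A sketch of an average-homogeneity measure in this spirit (assumed form: mean cosine similarity of each point to its cluster's unit mean direction; the paper's exact Havg definition may differ in details).

```python
import numpy as np

def h_avg(X, labels):
    """X: (n, d) L2-normalized rows; labels: cluster index per row."""
    total = 0.0
    for h in np.unique(labels):
        members = X[labels == h]
        m = members.sum(axis=0)
        mu_h = m / np.linalg.norm(m)      # cluster representative on the sphere
        total += (members @ mu_h).sum()   # cosine similarities to the representative
    return total / len(X)
```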
81 , 2004) have started looking at supervised methods of evaluating the gene clustering results using public genome databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) and the gene ontology (GO). [sent-817, score-0.405]
82 Figure 5 shows the various cluster quality figures of merit as computed for clusters of our gene expression data. [sent-822, score-0.428]
[Figure 5: Measures of cluster quality for gene data — (a) Havg values, (b) Hmin values, (c) Savg values and (d) Smax values, each plotted against the number of clusters (k).]
85 We see from Figure 5(a) that both hard-moVMF and soft-moVMF yield clusters that are much more homogeneous than those furnished by fskmeans and spkmeans. [sent-855, score-0.409]
86 Though the inter-cluster similarities do not differ that much between the four algorithms, soft-moVMF seems to be forming clusters with higher inter-cluster similarity than other algorithms. [sent-857, score-0.228]
87 We could explain this behavior of soft-moVMF by noting that it tends to form overlapping clusters (because of soft-assignments) and those clusters remain closer even after hardening. [sent-858, score-0.358]
88 Since Havg essentially measures the average cosine similarity, we note that using our moVMF based algorithms, we are able to achieve clusters that are more coherent and better separated—a fact that could be attributed to the richer model employed by our algorithms. [sent-859, score-0.268]
89 Discussion: The mixture of vMF distributions gives a parametric, model-based generalization of the widely used cosine similarity measure. [sent-883, score-0.231]
90 1, the spherical kmeans algorithm that uses cosine similarity arises as a special case of EM on mixture of vMFs when, among other things, the concentration κ of all the distributions is held constant. [sent-885, score-0.411]
91 Assuming the Fisher information matrix is the identity, the Fisher kernel similarity (Jaakkola and Haussler, 1999) corresponding to the vMF distribution is given by K(x_i, x_j) = (∇_µ ln f(x_i|µ))^T (∇_µ ln f(x_j|µ)). [sent-888, score-0.255]
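A small illustrative sketch of this kernel: treating µ as a free parameter (ignoring its unit-norm constraint), ∇_µ ln f(x|µ, κ) = κx because c_d(κ) does not depend on µ, so the kernel evaluates to κ² x_i^T x_j, i.e., it is proportional to cosine similarity for unit-normalized data.

```python
import numpy as np

def fisher_kernel_vmf(xi, xj, kappa):
    grad_i, grad_j = kappa * xi, kappa * xj   # gradients of the vMF log-density w.r.t. mu
    return grad_i @ grad_j                    # = kappa^2 * (xi . xj)

xi, xj = np.array([1.0, 0.0]), np.array([0.6, 0.8])
print(fisher_kernel_vmf(xi, xj, kappa=3.0), 3.0 ** 2 * (xi @ xj))   # identical values
```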
92 The cost of the added flexibility in soft-moVMF over spkmeans is the extra computation in estimating the κ values. [sent-909, score-0.321]
93 fskmeans and spkmeans do not do well on difficult data sets, due both to their hard assignment scheme and to their significantly weaker modeling capabilities. [sent-912, score-0.691]
94 Finally, a word on model selection, since choosing the number of clusters remains one of the widely debated topics in clustering (McLachlan and Peel, 2000). [sent-913, score-0.386]
95 Banerjee and Langford (2004) have proposed a new objective criterion for evaluation and model-selection for clustering algorithms: how well does the clustering algorithm perform as a prediction algorithm. [sent-914, score-0.414]
96 From the vMF density we obtain ln P(X | µ, κ) = n ln c_d(κ) + κ µ^T r. [sent-972, score-0.298]
97 Equation (A.13c) gives us c′_d(κ_h)/c_d(κ_h) = −||∑_{i=1}^n x_i p(h|x_i, Θ)|| / ∑_{i=1}^n p(h|x_i, Θ), which can be written as A_d(κ_h) = r̄_h, where A_d(κ) = I_{d/2}(κ)/I_{d/2−1}(κ) is the ratio of modified Bessel functions and r̄_h denotes the right-hand side above. [sent-1036, score-0.231]
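A hedged sketch of solving A_d(κ) = r̄ numerically: a closed-form initialization (the form below is the approximation commonly associated with this paper, reproduced from memory, so treat it as an assumption) followed by a couple of Newton steps using the Bessel-ratio derivative A_d′(κ) = 1 − A_d(κ)² − (d−1)A_d(κ)/κ.

```python
import numpy as np
from scipy.special import ive

def A_d(kappa, d):
    # I_{d/2}(kappa) / I_{d/2-1}(kappa); the exponential scaling of ive cancels in the ratio.
    return ive(d / 2.0, kappa) / ive(d / 2.0 - 1.0, kappa)

def estimate_kappa(rbar, d, newton_steps=2):
    kappa = rbar * (d - rbar ** 2) / (1.0 - rbar ** 2)        # closed-form initialization (assumed form)
    for _ in range(newton_steps):
        a = A_d(kappa, d)
        kappa -= (a - rbar) / (1.0 - a ** 2 - (d - 1.0) * a / kappa)
    return kappa

print(estimate_kappa(rbar=0.8, d=1000))   # high-dimensional example
```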
98 Iterative clustering of high dimensional text data augmented by local search. [sent-1215, score-0.258]
99 CLICK: A clustering algorithm with applications to gene expression analysis. [sent-1473, score-0.328]
100 MML clustering of multi-state, Poisson, von Mises circular and Gaussian distributions. [sent-1514, score-0.387]
wordName wordTfidf (topN-words)
[('vmf', 0.362), ('movmf', 0.346), ('spkmeans', 0.321), ('fskmeans', 0.23), ('banerjee', 0.214), ('clustering', 0.207), ('clusters', 0.179), ('von', 0.149), ('ises', 0.148), ('directional', 0.139), ('cisi', 0.132), ('cran', 0.132), ('hillon', 0.131), ('hosh', 0.131), ('cluster', 0.128), ('isher', 0.124), ('istributions', 0.124), ('mi', 0.111), ('dhillon', 0.11), ('ln', 0.103), ('gene', 0.099), ('ra', 0.098), ('lustering', 0.093), ('cd', 0.092), ('kmeans', 0.092), ('em', 0.09), ('cosine', 0.089), ('hard', 0.084), ('med', 0.082), ('ghosh', 0.08), ('soft', 0.074), ('bessel', 0.066), ('ad', 0.065), ('assignments', 0.064), ('fh', 0.062), ('mixture', 0.062), ('austin', 0.061), ('sd', 0.061), ('concentration', 0.059), ('jupp', 0.058), ('vmfs', 0.058), ('confusion', 0.057), ('mardia', 0.055), ('generative', 0.055), ('documents', 0.055), ('text', 0.051), ('peel', 0.049), ('similarity', 0.049), ('newsgroup', 0.048), ('xi', 0.047), ('mclachlan', 0.045), ('cmu', 0.045), ('eq', 0.043), ('havg', 0.041), ('savg', 0.041), ('zhong', 0.041), ('incomplete', 0.04), ('mutual', 0.039), ('yahoo', 0.039), ('pearson', 0.039), ('xh', 0.037), ('texas', 0.036), ('messages', 0.036), ('smax', 0.034), ('hypersphere', 0.034), ('yeast', 0.033), ('hmin', 0.033), ('mises', 0.033), ('distributions', 0.031), ('newsgroups', 0.031), ('fl', 0.031), ('circular', 0.031), ('assignment', 0.03), ('spherical', 0.029), ('approximations', 0.028), ('langford', 0.028), ('estimates', 0.028), ('strehl', 0.028), ('utexas', 0.028), ('likelihood', 0.027), ('modeling', 0.026), ('genes', 0.026), ('news', 0.025), ('tx', 0.025), ('lagrangian', 0.025), ('sra', 0.025), ('suvrit', 0.025), ('rh', 0.024), ('karypis', 0.024), ('kent', 0.024), ('salton', 0.024), ('comparative', 0.024), ('neal', 0.024), ('expectation', 0.024), ('plane', 0.024), ('skewed', 0.024), ('mixtures', 0.023), ('annealing', 0.022), ('mining', 0.022), ('expression', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999923 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
Author: Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, Suvrit Sra
Abstract: Several large scale data mining applications, such as text categorization and gene expression analysis, involve high-dimensional data that is also inherently directional in nature. Often such data is L2 normalized so that it lies on the surface of a unit hypersphere. Popular models such as (mixtures of) multi-variate Gaussians are inadequate for characterizing such data. This paper proposes a generative mixture-model approach to clustering directional data based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. In particular, we derive and analyze two variants of the Expectation Maximization (EM) framework for estimating the mean and concentration parameters of this mixture. Numerical estimation of the concentration parameters is non-trivial in high dimensions since it involves functional inversion of ratios of Bessel functions. We also formulate two clustering algorithms corresponding to the variants of EM that we derive. Our approach provides a theoretical basis for the use of cosine similarity that has been widely employed by the information retrieval community, and obtains the spherical kmeans algorithm (kmeans with cosine similarity) as a special case of both variants. Empirical results on clustering of high-dimensional text and gene-expression data based on a mixture of vMF distributions show that the ability to estimate the concentration parameter for each vMF component, which is not present in existing approaches, yields superior results, especially for difficult clustering tasks in high-dimensional spaces. Keywords: clustering, directional distributions, mixtures, von Mises-Fisher, expectation maximization, maximum likelihood, high dimensional data
2 0.14608657 20 jmlr-2005-Clustering with Bregman Divergences
Author: Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, Joydeep Ghosh
Abstract: A wide variety of distortion functions, such as squared Euclidean distance, Mahalanobis distance, Itakura-Saito distance and relative entropy, have been used for clustering. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroid-based parametric clustering approaches, such as classical kmeans, the Linde-Buzo-Gray (LBG) algorithm and information-theoretic clustering, which arise by special choices of the Bregman divergence. The algorithms maintain the simplicity and scalability of the classical kmeans algorithm, while generalizing the method to a large class of clustering loss functions. This is achieved by first posing the hard clustering problem in terms of minimizing the loss in Bregman information, a quantity motivated by rate distortion theory, and then deriving an iterative algorithm that monotonically decreases this loss. In addition, we show that there is a bijection between regular exponential families and a large class of Bregman divergences, that we call regular Bregman divergences. This result enables the development of an alternative interpretation of an efficient EM scheme for learning mixtures of exponential family distributions, and leads to a simple soft clustering algorithm for regular Bregman divergences. Finally, we discuss the connection between rate distortion theory and Bregman clustering and present an information theoretic analysis of Bregman clustering algorithms in terms of a trade-off between compression and loss in Bregman information. Keywords: clustering, Bregman divergences, Bregman information, exponential families, expectation maximization, information theory
3 0.1211364 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
Author: Hal Daumé III, Daniel Marcu
Abstract: We develop a Bayesian framework for tackling the supervised clustering problem, the generic problem encountered in tasks such as reference matching, coreference resolution, identity uncertainty and record linkage. Our clustering model is based on the Dirichlet process prior, which enables us to define distributions over the countably infinite sets that naturally arise in this problem. We add supervision to our model by positing the existence of a set of unobserved random variables (we call these “reference types”) that are generic across all clusters. Inference in our framework, which requires integrating over infinitely many parameters, is solved using Markov chain Monte Carlo techniques. We present algorithms for both conjugate and non-conjugate priors. We present a simple—but general—parameterization of our model based on a Gaussian assumption. We evaluate this model on one artificial task and three real-world tasks, comparing it against both unsupervised and state-of-the-art supervised algorithms. Our results show that our model is able to outperform other models across a variety of tasks and performance metrics. Keywords: supervised clustering, record linkage, citation matching, coreference, Dirichlet process, non-parametric Bayesian
Author: Lior Wolf, Amnon Shashua
Abstract: The problem of selecting a subset of relevant features in a potentially overwhelming quantity of data is classic and found in many branches of science. Examples in computer vision, text processing and more recently bio-informatics are abundant. In text classification tasks, for example, it is not uncommon to have 104 to 107 features of the size of the vocabulary containing word frequency counts, with the expectation that only a small fraction of them are relevant. Typical examples include the automatic sorting of URLs into a web directory and the detection of spam email. In this work we present a definition of “relevancy” based on spectral properties of the Laplacian of the features’ measurement matrix. The feature selection process is then based on a continuous ranking of the features defined by a least-squares optimization process. A remarkable property of the feature relevance function is that sparse solutions for the ranking values naturally emerge as a result of a “biased non-negativity” of a key matrix in the process. As a result, a simple leastsquares optimization process converges onto a sparse solution, i.e., a selection of a subset of features which form a local maximum over the relevance function. The feature selection algorithm can be embedded in both unsupervised and supervised inference problems and empirical evidence show that the feature selections typically achieve high accuracy even when only a small fraction of the features are relevant.
5 0.081298448 27 jmlr-2005-Dimension Reduction in Text Classification with Support Vector Machines
Author: Hyunsoo Kim, Peg Howland, Haesun Park
Abstract: Support vector machines (SVMs) have been recognized as one of the most successful classification methods for many applications including text classification. Even though the learning ability and computational complexity of training in support vector machines may be independent of the dimension of the feature space, reducing computational complexity is an essential issue to efficiently handle a large number of terms in practical applications of text classification. In this paper, we adopt novel dimension reduction methods to reduce the dimension of the document vectors dramatically. We also introduce decision functions for the centroid-based classification algorithm and support vector classifiers to handle the classification problem where a document may belong to multiple classes. Our substantial experimental results show that with several dimension reduction methods that are designed particularly for clustered data, higher efficiency for both training and testing can be achieved without sacrificing prediction accuracy of text classification even when the dimension of the input space is significantly reduced. Keywords: dimension reduction, support vector machines, text classification, linear discriminant analysis, centroids
6 0.07090234 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach
8 0.057635605 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification
9 0.056398146 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification
10 0.0545532 36 jmlr-2005-Gaussian Processes for Ordinal Regression
11 0.05451097 71 jmlr-2005-Variational Message Passing
12 0.053526953 51 jmlr-2005-Local Propagation in Conditional Gaussian Bayesian Networks
13 0.052653268 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks
14 0.042911794 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
15 0.042198684 32 jmlr-2005-Expectation Consistent Approximate Inference
16 0.040420841 17 jmlr-2005-Change Point Problems in Linear Dynamical Systems
17 0.036226187 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines
18 0.031898852 22 jmlr-2005-Concentration Bounds for Unigram Language Models
19 0.031827047 39 jmlr-2005-Information Bottleneck for Gaussian Variables
20 0.031596854 44 jmlr-2005-Learning Module Networks
topicId topicWeight
[(0, 0.247), (1, -0.081), (2, -0.013), (3, -0.122), (4, 0.01), (5, 0.011), (6, -0.469), (7, 0.042), (8, 0.098), (9, -0.047), (10, 0.048), (11, 0.068), (12, -0.259), (13, 0.045), (14, -0.008), (15, 0.004), (16, -0.03), (17, 0.063), (18, 0.066), (19, 0.072), (20, -0.176), (21, 0.085), (22, 0.097), (23, -0.05), (24, 0.152), (25, 0.039), (26, 0.112), (27, 0.02), (28, 0.05), (29, -0.0), (30, 0.016), (31, 0.007), (32, -0.012), (33, -0.065), (34, 0.069), (35, 0.096), (36, -0.131), (37, -0.073), (38, -0.045), (39, -0.096), (40, -0.002), (41, 0.177), (42, -0.05), (43, 0.159), (44, -0.081), (45, 0.043), (46, -0.176), (47, -0.048), (48, -0.0), (49, 0.044)]
simIndex simValue paperId paperTitle
same-paper 1 0.95239377 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
Author: Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, Suvrit Sra
Abstract: Several large scale data mining applications, such as text categorization and gene expression analysis, involve high-dimensional data that is also inherently directional in nature. Often such data is L2 normalized so that it lies on the surface of a unit hypersphere. Popular models such as (mixtures of) multi-variate Gaussians are inadequate for characterizing such data. This paper proposes a generative mixture-model approach to clustering directional data based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. In particular, we derive and analyze two variants of the Expectation Maximization (EM) framework for estimating the mean and concentration parameters of this mixture. Numerical estimation of the concentration parameters is non-trivial in high dimensions since it involves functional inversion of ratios of Bessel functions. We also formulate two clustering algorithms corresponding to the variants of EM that we derive. Our approach provides a theoretical basis for the use of cosine similarity that has been widely employed by the information retrieval community, and obtains the spherical kmeans algorithm (kmeans with cosine similarity) as a special case of both variants. Empirical results on clustering of high-dimensional text and gene-expression data based on a mixture of vMF distributions show that the ability to estimate the concentration parameter for each vMF component, which is not present in existing approaches, yields superior results, especially for difficult clustering tasks in high-dimensional spaces. Keywords: clustering, directional distributions, mixtures, von Mises-Fisher, expectation maximization, maximum likelihood, high dimensional data
2 0.55951422 20 jmlr-2005-Clustering with Bregman Divergences
Author: Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, Joydeep Ghosh
Abstract: A wide variety of distortion functions, such as squared Euclidean distance, Mahalanobis distance, Itakura-Saito distance and relative entropy, have been used for clustering. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroid-based parametric clustering approaches, such as classical kmeans, the Linde-Buzo-Gray (LBG) algorithm and information-theoretic clustering, which arise by special choices of the Bregman divergence. The algorithms maintain the simplicity and scalability of the classical kmeans algorithm, while generalizing the method to a large class of clustering loss functions. This is achieved by first posing the hard clustering problem in terms of minimizing the loss in Bregman information, a quantity motivated by rate distortion theory, and then deriving an iterative algorithm that monotonically decreases this loss. In addition, we show that there is a bijection between regular exponential families and a large class of Bregman divergences, that we call regular Bregman divergences. This result enables the development of an alternative interpretation of an efficient EM scheme for learning mixtures of exponential family distributions, and leads to a simple soft clustering algorithm for regular Bregman divergences. Finally, we discuss the connection between rate distortion theory and Bregman clustering and present an information theoretic analysis of Bregman clustering algorithms in terms of a trade-off between compression and loss in Bregman information. Keywords: clustering, Bregman divergences, Bregman information, exponential families, expectation maximization, information theory
3 0.51225984 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
Author: Hal Daumé III, Daniel Marcu
Abstract: We develop a Bayesian framework for tackling the supervised clustering problem, the generic problem encountered in tasks such as reference matching, coreference resolution, identity uncertainty and record linkage. Our clustering model is based on the Dirichlet process prior, which enables us to define distributions over the countably infinite sets that naturally arise in this problem. We add supervision to our model by positing the existence of a set of unobserved random variables (we call these “reference types”) that are generic across all clusters. Inference in our framework, which requires integrating over infinitely many parameters, is solved using Markov chain Monte Carlo techniques. We present algorithms for both conjugate and non-conjugate priors. We present a simple—but general—parameterization of our model based on a Gaussian assumption. We evaluate this model on one artificial task and three real-world tasks, comparing it against both unsupervised and state-of-the-art supervised algorithms. Our results show that our model is able to outperform other models across a variety of tasks and performance metrics. Keywords: supervised clustering, record linkage, citation matching, coreference, Dirichlet process, non-parametric Bayesian
Author: Lior Wolf, Amnon Shashua
Abstract: The problem of selecting a subset of relevant features in a potentially overwhelming quantity of data is classic and found in many branches of science. Examples in computer vision, text processing and more recently bio-informatics are abundant. In text classification tasks, for example, it is not uncommon to have 104 to 107 features of the size of the vocabulary containing word frequency counts, with the expectation that only a small fraction of them are relevant. Typical examples include the automatic sorting of URLs into a web directory and the detection of spam email. In this work we present a definition of “relevancy” based on spectral properties of the Laplacian of the features’ measurement matrix. The feature selection process is then based on a continuous ranking of the features defined by a least-squares optimization process. A remarkable property of the feature relevance function is that sparse solutions for the ranking values naturally emerge as a result of a “biased non-negativity” of a key matrix in the process. As a result, a simple leastsquares optimization process converges onto a sparse solution, i.e., a selection of a subset of features which form a local maximum over the relevance function. The feature selection algorithm can be embedded in both unsupervised and supervised inference problems and empirical evidence show that the feature selections typically achieve high accuracy even when only a small fraction of the features are relevant.
Author: Savina Andonova Jaeger
Abstract: A unified approach is taken for deriving new generalization data dependent bounds for several classes of algorithms explored in the existing literature by different approaches. This unified approach is based on an extension of Vapnik’s inequality for VC classes of sets to random classes of sets - that is, classes depending on the random data, invariant under permutation of the data and possessing the increasing property. Generalization bounds are derived for convex combinations of functions from random classes with certain properties. Algorithms, such as SVMs (support vector machines), boosting with decision stumps, radial basis function networks, some hierarchies of kernel machines or convex combinations of indicator functions over sets with finite VC dimension, generate classifier functions that fall into the above category. We also explore the individual complexities of the classifiers, such as sparsity of weights and weighted variance over clusters from the convex combination introduced by Koltchinskii and Panchenko (2004), and show sparsity-type and cluster-variance-type generalization bounds for random classes. Keywords: complexities of classifiers, generalization bounds, SVM, voting classifiers, random classes
6 0.24977015 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach
7 0.23807539 27 jmlr-2005-Dimension Reduction in Text Classification with Support Vector Machines
8 0.22263187 17 jmlr-2005-Change Point Problems in Linear Dynamical Systems
9 0.1946999 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification
10 0.19361047 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks
11 0.18030661 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines
12 0.1779546 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification
13 0.15775204 53 jmlr-2005-Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application
14 0.14995185 36 jmlr-2005-Gaussian Processes for Ordinal Regression
15 0.13842507 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
16 0.13837355 22 jmlr-2005-Concentration Bounds for Unigram Language Models
17 0.1345205 65 jmlr-2005-Separating a Real-Life Nonlinear Image Mixture
18 0.12721452 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
19 0.12455808 51 jmlr-2005-Local Propagation in Conditional Gaussian Bayesian Networks
20 0.12418165 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
topicId topicWeight
[(13, 0.024), (17, 0.023), (19, 0.02), (36, 0.025), (37, 0.034), (42, 0.045), (43, 0.038), (47, 0.02), (52, 0.102), (59, 0.031), (70, 0.049), (84, 0.374), (88, 0.093), (90, 0.021), (92, 0.011), (94, 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.733603 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
Author: Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, Suvrit Sra
Abstract: Several large scale data mining applications, such as text categorization and gene expression analysis, involve high-dimensional data that is also inherently directional in nature. Often such data is L2 normalized so that it lies on the surface of a unit hypersphere. Popular models such as (mixtures of) multi-variate Gaussians are inadequate for characterizing such data. This paper proposes a generative mixture-model approach to clustering directional data based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. In particular, we derive and analyze two variants of the Expectation Maximization (EM) framework for estimating the mean and concentration parameters of this mixture. Numerical estimation of the concentration parameters is non-trivial in high dimensions since it involves functional inversion of ratios of Bessel functions. We also formulate two clustering algorithms corresponding to the variants of EM that we derive. Our approach provides a theoretical basis for the use of cosine similarity that has been widely employed by the information retrieval community, and obtains the spherical kmeans algorithm (kmeans with cosine similarity) as a special case of both variants. Empirical results on clustering of high-dimensional text and gene-expression data based on a mixture of vMF distributions show that the ability to estimate the concentration parameter for each vMF component, which is not present in existing approaches, yields superior results, especially for difficult clustering tasks in high-dimensional spaces. Keywords: clustering, directional distributions, mixtures, von Mises-Fisher, expectation maximization, maximum likelihood, high dimensional data
2 0.4598155 20 jmlr-2005-Clustering with Bregman Divergences
Author: Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, Joydeep Ghosh
Abstract: A wide variety of distortion functions, such as squared Euclidean distance, Mahalanobis distance, Itakura-Saito distance and relative entropy, have been used for clustering. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroid-based parametric clustering approaches, such as classical kmeans, the Linde-Buzo-Gray (LBG) algorithm and information-theoretic clustering, which arise by special choices of the Bregman divergence. The algorithms maintain the simplicity and scalability of the classical kmeans algorithm, while generalizing the method to a large class of clustering loss functions. This is achieved by first posing the hard clustering problem in terms of minimizing the loss in Bregman information, a quantity motivated by rate distortion theory, and then deriving an iterative algorithm that monotonically decreases this loss. In addition, we show that there is a bijection between regular exponential families and a large class of Bregman divergences, that we call regular Bregman divergences. This result enables the development of an alternative interpretation of an efficient EM scheme for learning mixtures of exponential family distributions, and leads to a simple soft clustering algorithm for regular Bregman divergences. Finally, we discuss the connection between rate distortion theory and Bregman clustering and present an information theoretic analysis of Bregman clustering algorithms in terms of a trade-off between compression and loss in Bregman information. Keywords: clustering, Bregman divergences, Bregman information, exponential families, expectation maximization, information theory
3 0.38567796 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
Author: Antoine Bordes, Seyda Ertekin, Jason Weston, Léon Bottou
Abstract: Very high dimensional learning systems become theoretically possible when training examples are abundant. The computing cost then becomes the limiting factor. Any efficient learning algorithm should at least take a brief look at each example. But should all examples be given equal attention? This contribution proposes an empirical answer. We first present an online SVM algorithm based on this premise. LASVM yields competitive misclassification rates after a single pass over the training examples, outspeeding state-of-the-art SVM solvers. Then we show how active example selection can yield faster training, higher accuracies, and simpler models, using only a fraction of the training example labels.
4 0.37989935 36 jmlr-2005-Gaussian Processes for Ordinal Regression
Author: Wei Chu, Zoubin Ghahramani
Abstract: We present a probabilistic kernel approach to ordinal regression based on Gaussian processes. A threshold model that generalizes the probit function is used as the likelihood function for ordinal variables. Two inference techniques, based on the Laplace approximation and the expectation propagation algorithm respectively, are derived for hyperparameter learning and model selection. We compare these two Gaussian process approaches with a previous ordinal regression method based on support vector machines on some benchmark and real-world data sets, including applications of ordinal regression to collaborative filtering and gene expression analysis. Experimental results on these data sets verify the usefulness of our approach. Keywords: Gaussian processes, ordinal regression, approximate Bayesian inference, collaborative filtering, gene expression analysis, feature selection
5 0.37799627 46 jmlr-2005-Learning a Mahalanobis Metric from Equivalence Constraints
Author: Aharon Bar-Hillel, Tomer Hertz, Noam Shental, Daphna Weinshall
Abstract: Many learning algorithms use a metric defined over the input space as a principal tool, and their performance critically depends on the quality of this metric. We address the problem of learning metrics using side-information in the form of equivalence constraints. Unlike labels, we demonstrate that this type of side-information can sometimes be automatically obtained without the need of human intervention. We show how such side-information can be used to modify the representation of the data, leading to improved clustering and classification. Specifically, we present the Relevant Component Analysis (RCA) algorithm, which is a simple and efficient algorithm for learning a Mahalanobis metric. We show that RCA is the solution of an interesting optimization problem, founded on an information theoretic basis. If dimensionality reduction is allowed within RCA, we show that it is optimally accomplished by a version of Fisher’s linear discriminant that uses constraints. Moreover, under certain Gaussian assumptions, RCA can be viewed as a Maximum Likelihood estimation of the within class covariance matrix. We conclude with extensive empirical evaluations of RCA, showing its advantage over alternative methods. Keywords: clustering, metric learning, dimensionality reduction, equivalence constraints, side information.
6 0.37766039 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
7 0.37524962 71 jmlr-2005-Variational Message Passing
8 0.37155429 3 jmlr-2005-A Classification Framework for Anomaly Detection
9 0.36970603 39 jmlr-2005-Information Bottleneck for Gaussian Variables
10 0.36775792 64 jmlr-2005-Semigroup Kernels on Measures
11 0.36678857 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
12 0.36655298 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
13 0.36366132 44 jmlr-2005-Learning Module Networks
14 0.36119193 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification
15 0.35929167 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
16 0.35890064 9 jmlr-2005-Active Learning to Recognize Multiple Types of Plankton
17 0.35837942 11 jmlr-2005-Algorithmic Stability and Meta-Learning
18 0.35532784 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets
19 0.35387355 54 jmlr-2005-Managing Diversity in Regression Ensembles
20 0.35344434 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning