jmlr jmlr2005 jmlr2005-19 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, Suvrit Sra
Abstract: Several large scale data mining applications, such as text categorization and gene expression analysis, involve high-dimensional data that is also inherently directional in nature. Often such data is L2 normalized so that it lies on the surface of a unit hypersphere. Popular models such as (mixtures of) multi-variate Gaussians are inadequate for characterizing such data. This paper proposes a generative mixture-model approach to clustering directional data based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. In particular, we derive and analyze two variants of the Expectation Maximization (EM) framework for estimating the mean and concentration parameters of this mixture. Numerical estimation of the concentration parameters is non-trivial in high dimensions since it involves functional inversion of ratios of Bessel functions. We also formulate two clustering algorithms corresponding to the variants of EM that we derive. Our approach provides a theoretical basis for the use of cosine similarity that has been widely employed by the information retrieval community, and obtains the spherical kmeans algorithm (kmeans with cosine similarity) as a special case of both variants. Empirical results on clustering of high-dimensional text and gene-expression data based on a mixture of vMF distributions show that the ability to estimate the concentration parameter for each vMF component, which is not present in existing approaches, yields superior results, especially for difficult clustering tasks in high-dimensional spaces. Keywords: clustering, directional distributions, mixtures, von Mises-Fisher, expectation maximization, maximum likelihood, high dimensional data
Reference: text
sentIndex sentText sentNum sentScore
1 This paper proposes a generative mixture-model approach to clustering directional data based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. [sent-13, score-0.55]
2 Our approach provides a theoretical basis for the use of cosine similarity that has been widely employed by the information retrieval community, and obtains the spherical kmeans algorithm (kmeans with cosine similarity) as a special case of both variants. [sent-17, score-0.348]
3 Keywords: clustering, directional distributions, mixtures, von Mises-Fisher, expectation maximization, maximum likelihood, high dimensional data. [sent-19, score-0.312]
4 However, traditional methods for clustering data are severely challenged by a variety of complex characteristics exhibited by certain recent data sets examined by the machine learning and data mining communities. [sent-23, score-0.229]
5 In this paper we focus on clustering objects such as text documents and gene expressions, where the complexity arises from their representation as vectors that are not only very high dimensional (and often sparse) but also directional, i. [sent-25, score-0.412]
6 One can broadly categorize clustering approaches to be either generative (also known as parametric or probabilistic) (Smyth, 1997; Jaakkola and Haussler, 1999) or discriminative (non-parametric) (Indyk, 1999; Schölkopf and Smola, 2001). [sent-28, score-0.262]
7 A lot of domain knowledge can be incorporated into generative models, so that clustering of data uncovers specific desirable patterns that one is looking for. [sent-31, score-0.262]
8 For vector data, there are well studied clustering algorithms for popular generative models such as a mixture of multivariate Gaussians, whose effect is analogous to the use of Euclidean or Mahalanobis type distances as the chosen measure of distortion from the discriminative perspective. [sent-34, score-0.324]
9 The choice of a particular distortion measure (or the corresponding generative model) can be crucial to the performance of a clustering procedure. [sent-35, score-0.262]
10 For example, studies in information retrieval applications convincingly demonstrate cosine similarity to be a more effective measure of similarity for analyzing and clustering text documents. [sent-38, score-0.445]
11 Further, the spherical kmeans (spkmeans) algorithm (Dhillon and Modha, 2001), that performs kmeans using cosine similarity instead of Euclidean distortion, has been found to work well for text clustering. [sent-40, score-0.402]
12 Hence, analysis and clustering of data using Pearson correlations is essentially a clustering problem for directional data. [sent-58, score-0.553]
13 Contributions: In this paper we present a generative model, consisting of a mixture of von Mises-Fisher (vMF) distributions, tailored for directional data distributed on the surface of a unit hypersphere. [sent-60, score-0.405]
14 We derive two clustering algorithms based on EM for estimating the parameters of the mixture model from first principles. [sent-61, score-0.269]
15 The ability to adapt κ on a per-component basis leads to substantial performance improvements over existing generative approaches to modeling directional data. [sent-63, score-0.22]
16 We show a connection between the proposed methods and a class of existing algorithms for clustering high-dimensional directional data. [sent-64, score-0.346]
17 In particular, our generative model has the same relation to spkmeans as a model based on a mixture of unit covariance Gaussians has to classical kmeans that uses squared Euclidean distances. [sent-65, score-0.53]
18 We also present detailed experimental comparisons of the proposed algorithms with spkmeans and one of its variants. [sent-66, score-0.321]
19 This result is particularly important when using mixtures of vMFs since the computational needs for hard assignments are lower than what is required for the standard soft assignments (E-step) for these models. [sent-74, score-0.309]
20 In the learning community, perhaps the most widely studied high-dimensional directional data stem from text documents represented by vector space models. [sent-85, score-0.245]
21 For example, hierarchical agglomerative methods based on cosine, Jaccard or Dice coefficients were dominant for text clustering till the mid-1990s (Rasmussen, 1992). [sent-88, score-0.258]
22 A fairly extensive list of references on generative approaches to text clustering can be found in (Zhong and Ghosh, 2003a). [sent-95, score-0.313]
23 Of particular relevance to this work is the spkmeans algorithm (Dhillon and Modha, 2001), which adapts the kmeans algorithm to normalized data by using the cosine similarity for cluster allocation, and also by re-normalizing the cluster means to unit length. [sent-96, score-0.807]
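To make the assignment-and-renormalization loop just described concrete, here is a minimal spherical kmeans (spkmeans) sketch in Python; it is not the Dhillon and Modha implementation, and the function name, initialization, and defaults are ours.

```python
import numpy as np

def spkmeans(X, k, n_iter=50, seed=0):
    """X: (n, d) array of L2-normalized rows; returns labels and unit-length cluster means."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    M = X[rng.choice(n, size=k, replace=False)].copy()   # initial unit centroids
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        sims = X @ M.T                     # cosine similarities, since all rows are unit norm
        labels = sims.argmax(axis=1)       # allocate each point to the most similar centroid
        for h in range(k):
            members = X[labels == h]
            if len(members) == 0:
                continue                   # keep the old centroid for an empty cluster
            m = members.sum(axis=0)
            M[h] = m / np.linalg.norm(m)   # re-normalize the cluster mean to unit length
    return labels, M
```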
24 The spkmeans algorithm is superior to regular kmeans for high-dimensional text data, and competitive or superior in both performance and speed to a wide range of other existing alternatives for text clustering (Strehl et al. [sent-97, score-0.722]
25 The larger topic of clustering very high-dimensional data (dimension in the thousands or more), irrespective of whether it is directional or not, has also attracted great interest lately. [sent-100, score-0.346]
26 Among generative approaches for clustering high-dimensional data, perhaps the most noteworthy is one that uses low dimensional projections of mixtures of Gaussians (Dasgupta, 1999). [sent-102, score-0.285]
27 (2003) use a mixture of two circular von Mises distributions to estimate the parameters using a quasi-Newton procedure. [sent-110, score-0.273]
28 Wallace and Dowe (2000) perform mixture modeling for circular von Mises distributions and have produced a software called Snob that implements their ideas. [sent-111, score-0.299]
29 McLachlan and Peel (2000) discuss mixture analysis of directional data and mention the possibility of using Fisher distributions (3-dimensional vMFs), but instead use 3-dimensional Kent distributions (Mardia and Jupp, 2000). [sent-112, score-0.263]
30 They also mention work related to the clustering of directional data, but all the efforts included by them are restricted to 2-D or 3-D vMFs. [sent-113, score-0.346]
31 The connection between a generative model involving vMF distributions with constant κ and the spkmeans algorithm was first observed by Banerjee and Ghosh (2002). [sent-117, score-0.407]
32 This enhancement leads to even better clustering performance for difficult clustering tasks: when clusters overlap, when cluster sizes are skewed, and when cluster sizes are small relative to the dimensionality of the data. [sent-125, score-0.849]
33 soft mixture modeling and the behavior of vMF based mixture models are obtained. [sent-127, score-0.224]
34 Sections 3 and 4 form the basis for two clustering algorithms using soft and hard assignments, respectively, which are proposed in Section 5. [sent-133, score-0.281]
35 A unit random vector x (x ∈ R^d with ||x|| = 1, or equivalently x ∈ S^{d−1}) is said to have the d-variate von Mises-Fisher (vMF) distribution if its probability density function is given by f(x | µ, κ) = c_d(κ) exp(κ µ^T x). [sent-151, score-0.241]
36 Assuming the x_i to be independent and identically distributed, we can write the log-likelihood of X as ln P(X | µ, κ) = n ln c_d(κ) + κ µ^T r, where r = ∑_{i=1}^n x_i is the resultant vector. [sent-170, score-0.345]
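A hedged sketch of these quantities, assuming the standard vMF normalizer c_d(κ) = κ^{d/2−1} / ((2π)^{d/2} I_{d/2−1}(κ)) and using SciPy's exponentially scaled Bessel function for numerical stability; the function names are ours, not the authors'.

```python
import numpy as np
from scipy.special import ive

def log_cd(kappa, d):
    """log of the vMF normalizing constant c_d(kappa)."""
    nu = d / 2.0 - 1.0
    log_bessel = np.log(ive(nu, kappa)) + kappa          # log I_nu(kappa), since ive = iv * exp(-kappa)
    return nu * np.log(kappa) - (d / 2.0) * np.log(2.0 * np.pi) - log_bessel

def vmf_log_likelihood(X, mu, kappa):
    """ln P(X | mu, kappa) = n ln c_d(kappa) + kappa * mu^T r for L2-normalized rows of X."""
    n, d = X.shape
    r = X.sum(axis=0)                                     # resultant vector
    return n * log_cd(kappa, d) + kappa * (mu @ r)

# Toy usage with synthetic data:
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
X /= np.linalg.norm(X, axis=1, keepdims=True)             # project onto the unit hypersphere
mu = X.sum(axis=0); mu /= np.linalg.norm(mu)
print(vmf_log_likelihood(X, mu, kappa=5.0))
```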
37 EM on a Mixture of vMFs (moVMF): We now consider a mixture of k vMF (moVMF) distributions that serves as a generative model for directional data. [sent-182, score-0.287]
38 2 Analysis of Hard Assignments In this subsection, we show that hard assignments should give a reasonable clustering in terms of the log-likelihood since they actually maximize a tight lower bound on the incomplete log-likelihood of the data. [sent-311, score-0.395]
39 On the other hand, hard assignments only need to maintain the cluster assignments of each point, i. [sent-325, score-0.34]
40 Thus, clustering using hard-assignments essentially maximizes a lower bound on the incomplete log-likelihood. [sent-349, score-0.247]
41 Now, adding the incomplete-data log-likelihood to both sides of the inequality proven above, we obtain E_p[ln P(Z | X, Θ)] + ln P(X | Θ) ≤ E_q[ln P(Z | X, Θ)] + ln P(X | Θ), hence E_p[ln(P(Z | X, Θ) P(X | Θ))] ≤ E_q[ln(P(Z | X, Θ) P(X | Θ))], i.e., E_p[ln P(X, Z | Θ)] ≤ E_q[ln P(X, Z | Θ)]. [sent-381, score-0.246]
42 Then, for h̃_i = h_i*, E_q[ln P(Z | X, Θ)] = ∑_{i=1}^n ∑_{h=1}^k q(h|x_i, Θ) ln p(h|x_i, Θ) ≤ ∑_{i=1}^n ln p(h_i*|x_i, Θ) = ∑_{i=1}^n ln p(h̃_i|x_i, Θ) = ∑_{i=1}^n ∑_{h=1}^k q̃(h|x_i, Θ) ln p(h|x_i, Θ) = E_q̃[ln P(Z | X, Θ)]. [sent-387, score-0.412]
43 Algorithms The developments of the previous section naturally lead to two algorithms for clustering directional data. [sent-392, score-0.346]
44 Upon termination, Algorithm 2 yields a hard clustering of the data and the parameters Θ = {α_h, µ_h, κ_h}_{h=1}^k of the k vMFs that model the input data set X. [sent-405, score-0.291]
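As a rough illustration of the two algorithms (not the authors' code), the sketch below performs one EM iteration for a mixture of k vMFs, with a flag switching between the soft (soft-moVMF) and hard (hard-moVMF) assignment variants. The concentration update uses a closed-form approximation to inverting the Bessel-function ratio A_d; the paper's own estimator may differ in detail.

```python
import numpy as np
from scipy.special import ive

def _log_cd(kappa, d):
    # log of the vMF normalizing constant, computed stably via the exp-scaled Bessel function.
    nu = d / 2.0 - 1.0
    return nu * np.log(kappa) - (d / 2.0) * np.log(2.0 * np.pi) - (np.log(ive(nu, kappa)) + kappa)

def movmf_em_step(X, alphas, mus, kappas, hard=False):
    """X: (n, d) L2-normalized rows; alphas: (k,); mus: (k, d) unit rows; kappas: (k,)."""
    n, d = X.shape
    k = len(alphas)
    # E-step: log p(h | x_i, Theta) up to an additive constant, normalized via log-sum-exp.
    log_post = (np.log(alphas)[None, :]
                + np.array([_log_cd(kappas[h], d) for h in range(k)])[None, :]
                + (X @ mus.T) * kappas[None, :])
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    if hard:
        # hard-moVMF: winner-take-all assignment instead of soft responsibilities
        one_hot = np.zeros_like(post)
        one_hot[np.arange(n), post.argmax(axis=1)] = 1.0
        post = one_hot
    # M-step: update priors, mean directions, and per-component concentrations.
    nh = post.sum(axis=0)
    alphas = nh / n
    R = post.T @ X                                        # (k, d) weighted resultant vectors
    norms = np.linalg.norm(R, axis=1)
    mus = R / np.maximum(norms, 1e-12)[:, None]
    rbar = np.clip(norms / np.maximum(nh, 1e-12), 1e-8, 1 - 1e-8)
    kappas = rbar * (d - rbar ** 2) / (1.0 - rbar ** 2)   # approximate A_d^{-1}(rbar)
    return alphas, mus, kappas, post
```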
45 Revisiting Spherical Kmeans: In this section we show that upon enforcing certain restrictive assumptions on the generative model, the spkmeans algorithm (Algorithm 3) can be viewed as a special case of both the soft-moVMF and hard-moVMF algorithms. [sent-407, score-0.376]
46 algorithm reduces to assigning a point to its nearest cluster, where nearness is computed as a cosine similarity between the point and the cluster representative. [sent-416, score-0.266]
47 Thus, a point x_i will be assigned to cluster h* = argmax_h x_i^T µ_h, since lim_{κ→∞} p(h*|x_i, Θ) = lim_{κ→∞} exp(κ x_i^T µ_{h*}) / ∑_{h=1}^k exp(κ x_i^T µ_h) = 1, and p(h|x_i, Θ) → 0 as κ → ∞ for all h ≠ h*. [sent-417, score-0.269]
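A small numerical illustration of this limit (the cosine similarities below are made up): as κ grows, the posterior with equal priors and a shared κ concentrates on the component with the largest cosine similarity, which is exactly the spkmeans assignment rule.

```python
import numpy as np

cos_sims = np.array([0.90, 0.85, 0.10])     # hypothetical values of x^T mu_h for three components
for kappa in (1.0, 10.0, 100.0):
    logits = kappa * cos_sims
    p = np.exp(logits - logits.max())       # softmax over kappa-scaled cosine similarities
    p /= p.sum()
    print(kappa, np.round(p, 4))
# At kappa = 100 essentially all posterior mass sits on argmax_h x^T mu_h.
```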
48 To show that spkmeans can also be seen as a special case of the hard-moVMF, in addition to assuming the priors of the components to be equal, we further assume that the concentration parameters of all the components are equal, i. [sent-418, score-0.38]
49 With these assumptions on the model, the estimation of the common concentration parameter becomes unessential since the hard assignment will depend only on the value of the cosine similarity x_i^T µ_h, and hard-moVMF reduces to spkmeans. [sent-421, score-0.311]
50 In addition to the abovementioned algorithms, we report experimental results on another algorithm fskmeans (Banerjee and Ghosh, 2002) that belongs to the same class in the sense that, like spkmeans, it can be derived from the mixture of vMF models with some restrictive assumptions. [sent-422, score-0.292]
51 It has already been established that kmeans using Euclidean distance performs much worse than spkmeans for text data (Strehl et al. [sent-439, score-0.464]
52 We also created various subsets of some of the data sets for gaining greater insight into the nature of clusters discovered or to model some particular clustering scenario (e. [sent-445, score-0.386]
53 The underlying clusters in this data set are highly skewed in terms of the number of documents per cluster, with sizes ranging from 9 to 494. [sent-499, score-0.258]
54 Gene-expression data was selected to offer a clustering domain different from text analysis. [sent-537, score-0.258]
55 As previously motivated, the use of Pearson correlation for the analysis of gene expression data is common, so a directional model is well-suited. [sent-538, score-0.26]
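A quick sketch of why this is so (synthetic data, helper name ours): centering a vector and L2-normalizing it turns the Pearson correlation between two vectors into a plain dot product, i.e., a cosine similarity on the unit hypersphere.

```python
import numpy as np

def to_direction(x):
    z = x - x.mean()                 # center
    return z / np.linalg.norm(z)     # L2-normalize onto the unit sphere

rng = np.random.default_rng(1)
x, y = rng.normal(size=20), rng.normal(size=20)
pearson = np.corrcoef(x, y)[0, 1]
cosine = to_direction(x) @ to_direction(y)
print(np.isclose(pearson, cosine))   # True: correlation = cosine of the transformed vectors
```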
56 Out of these we selected a subset of 996 genes for clustering (Dhillon et al. [sent-546, score-0.233]
57 Methodology: Except for the gene expression data set, performance of the algorithms on all the data sets has been analyzed using mutual information (MI) between the cluster and class labels. [sent-550, score-0.288]
58 For soft-moVMF we “harden” the clustering produced by labeling a point with the cluster label for which it has the highest value of posterior probability (ties broken arbitrarily), in order to evaluate MI. [sent-556, score-0.335]
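A minimal sketch of this evaluation step (function names ours): harden the soft clustering by taking the argmax posterior per point, then compute the mutual information between cluster and class labels from the empirical joint distribution. The exact MI normalization used in the experiments may differ.

```python
import numpy as np

def mutual_information(labels_true, labels_pred):
    labels_true, labels_pred = np.asarray(labels_true), np.asarray(labels_pred)
    mi = 0.0
    for c in np.unique(labels_true):
        for h in np.unique(labels_pred):
            p_ch = np.mean((labels_true == c) & (labels_pred == h))
            if p_ch > 0:
                p_c = np.mean(labels_true == c)
                p_h = np.mean(labels_pred == h)
                mi += p_ch * np.log(p_ch / (p_c * p_h))
    return mi

# Hardening hypothetical soft posteriors `post` of shape (n, k):
#   labels_pred = post.argmax(axis=1)
#   score = mutual_information(class_labels, labels_pred)
```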
59 The clustering produced by our soft cluster assignment algorithm is shown in Figure 1(b). [sent-593, score-0.439]
60 The confusion matrix, obtained by “hardening” the clustering produced by soft-moVMF for the small-mix data set, is [26 1; 0 23]. [sent-605, score-0.264]
61 As is evident from this confusion matrix, the clustering performed by soft-moVMF is excellent, though not surprising, since small-mix is a data set with well-separated clusters. [sent-606, score-0.264]
62 Even though Classic300 is well separated, the small number of documents per cluster makes the problem somewhat difficult for fskmeans and spkmeans, while hard-moVMF has a much better performance due to its model flexibility. [sent-648, score-0.413]
63 It seems that the low number of documents does not pose a problem for soft-moVMF and it ends up getting an almost perfect clustering for this data set. [sent-650, score-0.262]
64 Table 5: Comparative confusion matrices for 3 clusters of Classic300 (columns: med, cisi, cran; rows: the three clusters found by each algorithm).
fskmeans:    29 38 22 | 31 27 38 | 40 35 40
spkmeans:    29 38 22 | 31 27 38 | 40 35 40
hard-moVMF:   3 72  1 | 62 28 17 | 35  0 82
soft-moVMF:   0 98  0 | 99  2  0 |  1  0 100 [sent-652, score-2.171]
65 As before, due to the small number of documents per cluster, fskmeans and spkmeans give a rather mixed confusion matrix. [sent-656, score-0.663]
66 When 4 clusters are requested from soft-moVMF, it returns 4 very pure clusters (not shown in the confusion matrices) two of which are almost equal sized segments of the bigger cluster. [sent-660, score-0.415]
67 In Table 7 we show the confusion matrices resulting from 5 clusters of the Classic3 corpus. [sent-662, score-0.236]
68 This behavior of our moVMF algorithms coupled with the observations in Table 6, suggest a clustering method in which one could generate a slightly higher number of clusters than required, and then agglomerate them appropriately. [sent-664, score-0.386]
69 Among the other three, hard-moVMF seems to perform better than spkmeans and fskmeans across the range of clusters. [sent-670, score-0.551]
[Figure: mutual information values plotted against the number of clusters for fskmeans, spkmeans, hard-moVMF and soft-moVMF; only the legend and axis-label text survived extraction.]
72 Yahoo News Data Set: The Yahoo News data set is a relatively difficult data set for clustering since it has a fair amount of overlap among its clusters and the number of points per cluster is low. [sent-711, score-0.514]
73 In addition, the clusters are highly skewed in terms of their comparative sizes. [sent-712, score-0.227]
74 All the algorithms perform similarly until the true number of clusters after which soft-moVMF and spkmeans perform better than the others. [sent-719, score-0.5]
75 Since the number of documents per cluster is small (100), as before spkmeans and fskmeans do not perform that well, even at the true number of clusters, whereas soft-moVMF performs considerably better than the others over the entire range. [sent-722, score-0.734]
[Figure: further plots of mutual information versus the number of clusters, comparing fskmeans, spkmeans, hard-moVMF and soft-moVMF; only the legend and axis-label text survived extraction.]
79 These internal measures have been earlier employed for evaluating clustering of genes (e. [sent-793, score-0.233]
80 As can easily be seen, a higher homogeneity means that the individual elements of each cluster are quite similar to the cluster representative. [sent-806, score-0.256]
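A sketch of an average-homogeneity measure in this spirit (assumed form: mean cosine similarity of each point to its cluster's unit mean direction; the paper's exact Havg definition may differ in details).

```python
import numpy as np

def h_avg(X, labels):
    """X: (n, d) L2-normalized rows; labels: cluster index per row."""
    total = 0.0
    for h in np.unique(labels):
        members = X[labels == h]
        m = members.sum(axis=0)
        mu_h = m / np.linalg.norm(m)      # cluster representative on the sphere
        total += (members @ mu_h).sum()   # cosine similarities to the representative
    return total / len(X)
```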
81 , 2004) have started looking at supervised methods of evaluating the gene clustering results using public genome databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) and the gene ontology (GO). [sent-817, score-0.405]
82 Figure 5 shows the various cluster quality figures of merit as computed for clusters of our gene expression data. [sent-822, score-0.428]
[Figure 5: Measures of cluster quality for gene data — (a) Havg values, (b) Hmin values, (c) Savg values and (d) Smax values, each plotted against the number of clusters (k).]
85 We see from Figure 5(a) that both hard-moVMF and soft-moVMF yield clusters that are much more homogeneous than those furnished by fskmeans and spkmeans. [sent-855, score-0.409]
86 Though the inter-cluster similarities do not differ that much between the four algorithms, soft-moVMF seems to be forming clusters with higher inter-cluster similarity than other algorithms. [sent-857, score-0.228]
87 We could explain this behavior of soft-moVMF by noting that it tends to form overlapping clusters (because of soft-assignments) and those clusters remain closer even after hardening. [sent-858, score-0.358]
88 Since Havg essentially measures the average cosine similarity, we note that using our moVMF based algorithms, we are able to achieve clusters that are more coherent and better separated—a fact that could be attributed to the richer model employed by our algorithms. [sent-859, score-0.268]
89 Discussion: The mixture of vMF distributions gives a parametric, model-based generalization of the widely used cosine similarity measure. [sent-883, score-0.231]
90 1, the spherical kmeans algorithm that uses cosine similarity arises as a special case of EM on mixture of vMFs when, among other things, the concentration κ of all the distributions is held constant. [sent-885, score-0.411]
91 Assuming the Fisher information matrix is the identity, the Fisher kernel similarity (Jaakkola and Haussler, 1999) corresponding to the vMF distribution is given by K(x_i, x_j) = (∇_µ ln f(x_i|µ))^T (∇_µ ln f(x_j|µ)). [sent-888, score-0.255]
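A small illustrative sketch of this kernel: treating µ as a free parameter (ignoring its unit-norm constraint), ∇_µ ln f(x|µ, κ) = κx because c_d(κ) does not depend on µ, so the kernel evaluates to κ² x_i^T x_j, i.e., it is proportional to cosine similarity for unit-normalized data.

```python
import numpy as np

def fisher_kernel_vmf(xi, xj, kappa):
    grad_i, grad_j = kappa * xi, kappa * xj   # gradients of the vMF log-density w.r.t. mu
    return grad_i @ grad_j                    # = kappa^2 * (xi . xj)

xi, xj = np.array([1.0, 0.0]), np.array([0.6, 0.8])
print(fisher_kernel_vmf(xi, xj, kappa=3.0), 3.0 ** 2 * (xi @ xj))   # identical values
```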
92 The cost of the added flexibility in soft-moVMF over spkmeans is the extra computation in estimating the κ values. [sent-909, score-0.321]
93 fskmeans and spkmeans do not do well on difficult data sets, due both to their hard assignment scheme and to their significantly weaker modeling capabilities. [sent-912, score-0.691]
94 Finally, a word on model selection, since choosing the number of clusters remains one of the widely debated topics in clustering (McLachlan and Peel, 2000). [sent-913, score-0.386]
95 Banerjee and Langford (2004) have proposed a new objective criterion for evaluation and model-selection for clustering algorithms: how well does the clustering algorithm perform as a prediction algorithm. [sent-914, score-0.414]
96 From the vMF density we obtain ln P(X | µ, κ) = n ln c_d(κ) + κ µ^T r. [sent-972, score-0.298]
97 Equation (A.13c) gives us c′_d(κ_h)/c_d(κ_h) = −||∑_{i=1}^n x_i p(h|x_i, Θ)|| / ∑_{i=1}^n p(h|x_i, Θ), which can be written as A_d(κ_h) = r̄_h, where A_d(κ) = I_{d/2}(κ)/I_{d/2−1}(κ) is the ratio of modified Bessel functions and r̄_h denotes the right-hand side above. [sent-1036, score-0.231]
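A hedged sketch of solving A_d(κ) = r̄ numerically: a closed-form initialization (the form below is the approximation commonly associated with this paper, reproduced from memory, so treat it as an assumption) followed by a couple of Newton steps using the Bessel-ratio derivative A_d′(κ) = 1 − A_d(κ)² − (d−1)A_d(κ)/κ.

```python
import numpy as np
from scipy.special import ive

def A_d(kappa, d):
    # I_{d/2}(kappa) / I_{d/2-1}(kappa); the exponential scaling of ive cancels in the ratio.
    return ive(d / 2.0, kappa) / ive(d / 2.0 - 1.0, kappa)

def estimate_kappa(rbar, d, newton_steps=2):
    kappa = rbar * (d - rbar ** 2) / (1.0 - rbar ** 2)        # closed-form initialization (assumed form)
    for _ in range(newton_steps):
        a = A_d(kappa, d)
        kappa -= (a - rbar) / (1.0 - a ** 2 - (d - 1.0) * a / kappa)
    return kappa

print(estimate_kappa(rbar=0.8, d=1000))   # high-dimensional example
```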
98 Iterative clustering of high dimensional text data augmented by local search. [sent-1215, score-0.258]
99 CLICK: A clustering algorithm with applications to gene expression analysis. [sent-1473, score-0.328]
100 MML clustering of multi-state, Poisson, von Mises circular and Gaussian distributions. [sent-1514, score-0.387]
wordName wordTfidf (topN-words)
[('vmf', 0.362), ('movmf', 0.346), ('spkmeans', 0.321), ('fskmeans', 0.23), ('banerjee', 0.214), ('clustering', 0.207), ('clusters', 0.179), ('von', 0.149), ('ises', 0.148), ('directional', 0.139), ('cisi', 0.132), ('cran', 0.132), ('hillon', 0.131), ('hosh', 0.131), ('cluster', 0.128), ('isher', 0.124), ('istributions', 0.124), ('mi', 0.111), ('dhillon', 0.11), ('ln', 0.103), ('gene', 0.099), ('ra', 0.098), ('lustering', 0.093), ('cd', 0.092), ('kmeans', 0.092), ('em', 0.09), ('cosine', 0.089), ('hard', 0.084), ('med', 0.082), ('ghosh', 0.08), ('soft', 0.074), ('bessel', 0.066), ('ad', 0.065), ('assignments', 0.064), ('fh', 0.062), ('mixture', 0.062), ('austin', 0.061), ('sd', 0.061), ('concentration', 0.059), ('jupp', 0.058), ('vmfs', 0.058), ('confusion', 0.057), ('mardia', 0.055), ('generative', 0.055), ('documents', 0.055), ('text', 0.051), ('peel', 0.049), ('similarity', 0.049), ('newsgroup', 0.048), ('xi', 0.047), ('mclachlan', 0.045), ('cmu', 0.045), ('eq', 0.043), ('havg', 0.041), ('savg', 0.041), ('zhong', 0.041), ('incomplete', 0.04), ('mutual', 0.039), ('yahoo', 0.039), ('pearson', 0.039), ('xh', 0.037), ('texas', 0.036), ('messages', 0.036), ('smax', 0.034), ('hypersphere', 0.034), ('yeast', 0.033), ('hmin', 0.033), ('mises', 0.033), ('distributions', 0.031), ('newsgroups', 0.031), ('fl', 0.031), ('circular', 0.031), ('assignment', 0.03), ('spherical', 0.029), ('approximations', 0.028), ('langford', 0.028), ('estimates', 0.028), ('strehl', 0.028), ('utexas', 0.028), ('likelihood', 0.027), ('modeling', 0.026), ('genes', 0.026), ('news', 0.025), ('tx', 0.025), ('lagrangian', 0.025), ('sra', 0.025), ('suvrit', 0.025), ('rh', 0.024), ('karypis', 0.024), ('kent', 0.024), ('salton', 0.024), ('comparative', 0.024), ('neal', 0.024), ('expectation', 0.024), ('plane', 0.024), ('skewed', 0.024), ('mixtures', 0.023), ('annealing', 0.022), ('mining', 0.022), ('expression', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999923 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
Author: Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, Suvrit Sra
Abstract: Several large scale data mining applications, such as text categorization and gene expression analysis, involve high-dimensional data that is also inherently directional in nature. Often such data is L2 normalized so that it lies on the surface of a unit hypersphere. Popular models such as (mixtures of) multi-variate Gaussians are inadequate for characterizing such data. This paper proposes a generative mixture-model approach to clustering directional data based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. In particular, we derive and analyze two variants of the Expectation Maximization (EM) framework for estimating the mean and concentration parameters of this mixture. Numerical estimation of the concentration parameters is non-trivial in high dimensions since it involves functional inversion of ratios of Bessel functions. We also formulate two clustering algorithms corresponding to the variants of EM that we derive. Our approach provides a theoretical basis for the use of cosine similarity that has been widely employed by the information retrieval community, and obtains the spherical kmeans algorithm (kmeans with cosine similarity) as a special case of both variants. Empirical results on clustering of high-dimensional text and gene-expression data based on a mixture of vMF distributions show that the ability to estimate the concentration parameter for each vMF component, which is not present in existing approaches, yields superior results, especially for difficult clustering tasks in high-dimensional spaces. Keywords: clustering, directional distributions, mixtures, von Mises-Fisher, expectation maximization, maximum likelihood, high dimensional data
2 0.14608657 20 jmlr-2005-Clustering with Bregman Divergences
Author: Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, Joydeep Ghosh
Abstract: A wide variety of distortion functions, such as squared Euclidean distance, Mahalanobis distance, Itakura-Saito distance and relative entropy, have been used for clustering. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroid-based parametric clustering approaches, such as classical kmeans, the Linde-Buzo-Gray (LBG) algorithm and information-theoretic clustering, which arise by special choices of the Bregman divergence. The algorithms maintain the simplicity and scalability of the classical kmeans algorithm, while generalizing the method to a large class of clustering loss functions. This is achieved by first posing the hard clustering problem in terms of minimizing the loss in Bregman information, a quantity motivated by rate distortion theory, and then deriving an iterative algorithm that monotonically decreases this loss. In addition, we show that there is a bijection between regular exponential families and a large class of Bregman divergences, that we call regular Bregman divergences. This result enables the development of an alternative interpretation of an efficient EM scheme for learning mixtures of exponential family distributions, and leads to a simple soft clustering algorithm for regular Bregman divergences. Finally, we discuss the connection between rate distortion theory and Bregman clustering and present an information theoretic analysis of Bregman clustering algorithms in terms of a trade-off between compression and loss in Bregman information. Keywords: clustering, Bregman divergences, Bregman information, exponential families, expectation maximization, information theory
3 0.1211364 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
Author: Hal Daumé III, Daniel Marcu
Abstract: We develop a Bayesian framework for tackling the supervised clustering problem, the generic problem encountered in tasks such as reference matching, coreference resolution, identity uncertainty and record linkage. Our clustering model is based on the Dirichlet process prior, which enables us to define distributions over the countably infinite sets that naturally arise in this problem. We add supervision to our model by positing the existence of a set of unobserved random variables (we call these “reference types”) that are generic across all clusters. Inference in our framework, which requires integrating over infinitely many parameters, is solved using Markov chain Monte Carlo techniques. We present algorithms for both conjugate and non-conjugate priors. We present a simple—but general—parameterization of our model based on a Gaussian assumption. We evaluate this model on one artificial task and three real-world tasks, comparing it against both unsupervised and state-of-the-art supervised algorithms. Our results show that our model is able to outperform other models across a variety of tasks and performance metrics. Keywords: supervised clustering, record linkage, citation matching, coreference, Dirichlet process, non-parametric Bayesian
Author: Lior Wolf, Amnon Shashua
Abstract: The problem of selecting a subset of relevant features in a potentially overwhelming quantity of data is classic and found in many branches of science. Examples in computer vision, text processing and more recently bio-informatics are abundant. In text classification tasks, for example, it is not uncommon to have 104 to 107 features of the size of the vocabulary containing word frequency counts, with the expectation that only a small fraction of them are relevant. Typical examples include the automatic sorting of URLs into a web directory and the detection of spam email. In this work we present a definition of “relevancy” based on spectral properties of the Laplacian of the features’ measurement matrix. The feature selection process is then based on a continuous ranking of the features defined by a least-squares optimization process. A remarkable property of the feature relevance function is that sparse solutions for the ranking values naturally emerge as a result of a “biased non-negativity” of a key matrix in the process. As a result, a simple leastsquares optimization process converges onto a sparse solution, i.e., a selection of a subset of features which form a local maximum over the relevance function. The feature selection algorithm can be embedded in both unsupervised and supervised inference problems and empirical evidence show that the feature selections typically achieve high accuracy even when only a small fraction of the features are relevant.
5 0.081298448 27 jmlr-2005-Dimension Reduction in Text Classification with Support Vector Machines
Author: Hyunsoo Kim, Peg Howland, Haesun Park
Abstract: Support vector machines (SVMs) have been recognized as one of the most successful classification methods for many applications including text classification. Even though the learning ability and computational complexity of training in support vector machines may be independent of the dimension of the feature space, reducing computational complexity is an essential issue to efficiently handle a large number of terms in practical applications of text classification. In this paper, we adopt novel dimension reduction methods to reduce the dimension of the document vectors dramatically. We also introduce decision functions for the centroid-based classification algorithm and support vector classifiers to handle the classification problem where a document may belong to multiple classes. Our substantial experimental results show that with several dimension reduction methods that are designed particularly for clustered data, higher efficiency for both training and testing can be achieved without sacrificing prediction accuracy of text classification even when the dimension of the input space is significantly reduced. Keywords: dimension reduction, support vector machines, text classification, linear discriminant analysis, centroids
6 0.07090234 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach
8 0.057635605 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification
9 0.056398146 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification
10 0.0545532 36 jmlr-2005-Gaussian Processes for Ordinal Regression
11 0.05451097 71 jmlr-2005-Variational Message Passing
12 0.053526953 51 jmlr-2005-Local Propagation in Conditional Gaussian Bayesian Networks
13 0.052653268 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks
14 0.042911794 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
15 0.042198684 32 jmlr-2005-Expectation Consistent Approximate Inference
16 0.040420841 17 jmlr-2005-Change Point Problems in Linear Dynamical Systems
17 0.036226187 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines
18 0.031898852 22 jmlr-2005-Concentration Bounds for Unigram Language Models
19 0.031827047 39 jmlr-2005-Information Bottleneck for Gaussian Variables
20 0.031596854 44 jmlr-2005-Learning Module Networks
topicId topicWeight
[(0, 0.247), (1, -0.081), (2, -0.013), (3, -0.122), (4, 0.01), (5, 0.011), (6, -0.469), (7, 0.042), (8, 0.098), (9, -0.047), (10, 0.048), (11, 0.068), (12, -0.259), (13, 0.045), (14, -0.008), (15, 0.004), (16, -0.03), (17, 0.063), (18, 0.066), (19, 0.072), (20, -0.176), (21, 0.085), (22, 0.097), (23, -0.05), (24, 0.152), (25, 0.039), (26, 0.112), (27, 0.02), (28, 0.05), (29, -0.0), (30, 0.016), (31, 0.007), (32, -0.012), (33, -0.065), (34, 0.069), (35, 0.096), (36, -0.131), (37, -0.073), (38, -0.045), (39, -0.096), (40, -0.002), (41, 0.177), (42, -0.05), (43, 0.159), (44, -0.081), (45, 0.043), (46, -0.176), (47, -0.048), (48, -0.0), (49, 0.044)]
simIndex simValue paperId paperTitle
same-paper 1 0.95239377 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
Author: Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, Suvrit Sra
Abstract: Several large scale data mining applications, such as text categorization and gene expression analysis, involve high-dimensional data that is also inherently directional in nature. Often such data is L2 normalized so that it lies on the surface of a unit hypersphere. Popular models such as (mixtures of) multi-variate Gaussians are inadequate for characterizing such data. This paper proposes a generative mixture-model approach to clustering directional data based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. In particular, we derive and analyze two variants of the Expectation Maximization (EM) framework for estimating the mean and concentration parameters of this mixture. Numerical estimation of the concentration parameters is non-trivial in high dimensions since it involves functional inversion of ratios of Bessel functions. We also formulate two clustering algorithms corresponding to the variants of EM that we derive. Our approach provides a theoretical basis for the use of cosine similarity that has been widely employed by the information retrieval community, and obtains the spherical kmeans algorithm (kmeans with cosine similarity) as a special case of both variants. Empirical results on clustering of high-dimensional text and gene-expression data based on a mixture of vMF distributions show that the ability to estimate the concentration parameter for each vMF component, which is not present in existing approaches, yields superior results, especially for difficult clustering tasks in high-dimensional spaces. Keywords: clustering, directional distributions, mixtures, von Mises-Fisher, expectation maximization, maximum likelihood, high dimensional data
2 0.55951422 20 jmlr-2005-Clustering with Bregman Divergences
Author: Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, Joydeep Ghosh
Abstract: A wide variety of distortion functions, such as squared Euclidean distance, Mahalanobis distance, Itakura-Saito distance and relative entropy, have been used for clustering. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroid-based parametric clustering approaches, such as classical kmeans, the Linde-Buzo-Gray (LBG) algorithm and information-theoretic clustering, which arise by special choices of the Bregman divergence. The algorithms maintain the simplicity and scalability of the classical kmeans algorithm, while generalizing the method to a large class of clustering loss functions. This is achieved by first posing the hard clustering problem in terms of minimizing the loss in Bregman information, a quantity motivated by rate distortion theory, and then deriving an iterative algorithm that monotonically decreases this loss. In addition, we show that there is a bijection between regular exponential families and a large class of Bregman divergences, that we call regular Bregman divergences. This result enables the development of an alternative interpretation of an efficient EM scheme for learning mixtures of exponential family distributions, and leads to a simple soft clustering algorithm for regular Bregman divergences. Finally, we discuss the connection between rate distortion theory and Bregman clustering and present an information theoretic analysis of Bregman clustering algorithms in terms of a trade-off between compression and loss in Bregman information. Keywords: clustering, Bregman divergences, Bregman information, exponential families, expectation maximization, information theory
3 0.51225984 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
Author: Hal Daumé III, Daniel Marcu
Abstract: We develop a Bayesian framework for tackling the supervised clustering problem, the generic problem encountered in tasks such as reference matching, coreference resolution, identity uncertainty and record linkage. Our clustering model is based on the Dirichlet process prior, which enables us to define distributions over the countably infinite sets that naturally arise in this problem. We add supervision to our model by positing the existence of a set of unobserved random variables (we call these “reference types”) that are generic across all clusters. Inference in our framework, which requires integrating over infinitely many parameters, is solved using Markov chain Monte Carlo techniques. We present algorithms for both conjugate and non-conjugate priors. We present a simple—but general—parameterization of our model based on a Gaussian assumption. We evaluate this model on one artificial task and three real-world tasks, comparing it against both unsupervised and state-of-the-art supervised algorithms. Our results show that our model is able to outperform other models across a variety of tasks and performance metrics. Keywords: supervised clustering, record linkage, citation matching, coreference, Dirichlet process, non-parametric Bayesian
Author: Lior Wolf, Amnon Shashua
Abstract: The problem of selecting a subset of relevant features in a potentially overwhelming quantity of data is classic and found in many branches of science. Examples in computer vision, text processing and more recently bio-informatics are abundant. In text classification tasks, for example, it is not uncommon to have 104 to 107 features of the size of the vocabulary containing word frequency counts, with the expectation that only a small fraction of them are relevant. Typical examples include the automatic sorting of URLs into a web directory and the detection of spam email. In this work we present a definition of “relevancy” based on spectral properties of the Laplacian of the features’ measurement matrix. The feature selection process is then based on a continuous ranking of the features defined by a least-squares optimization process. A remarkable property of the feature relevance function is that sparse solutions for the ranking values naturally emerge as a result of a “biased non-negativity” of a key matrix in the process. As a result, a simple leastsquares optimization process converges onto a sparse solution, i.e., a selection of a subset of features which form a local maximum over the relevance function. The feature selection algorithm can be embedded in both unsupervised and supervised inference problems and empirical evidence show that the feature selections typically achieve high accuracy even when only a small fraction of the features are relevant.
Author: Savina Andonova Jaeger
Abstract: A unified approach is taken for deriving new generalization data dependent bounds for several classes of algorithms explored in the existing literature by different approaches. This unified approach is based on an extension of Vapnik’s inequality for VC classes of sets to random classes of sets - that is, classes depending on the random data, invariant under permutation of the data and possessing the increasing property. Generalization bounds are derived for convex combinations of functions from random classes with certain properties. Algorithms, such as SVMs (support vector machines), boosting with decision stumps, radial basis function networks, some hierarchies of kernel machines or convex combinations of indicator functions over sets with finite VC dimension, generate classifier functions that fall into the above category. We also explore the individual complexities of the classifiers, such as sparsity of weights and weighted variance over clusters from the convex combination introduced by Koltchinskii and Panchenko (2004), and show sparsity-type and cluster-variance-type generalization bounds for random classes. Keywords: complexities of classifiers, generalization bounds, SVM, voting classifiers, random classes
6 0.24977015 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach
7 0.23807539 27 jmlr-2005-Dimension Reduction in Text Classification with Support Vector Machines
8 0.22263187 17 jmlr-2005-Change Point Problems in Linear Dynamical Systems
9 0.1946999 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification
10 0.19361047 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks
11 0.18030661 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines
12 0.1779546 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification
13 0.15775204 53 jmlr-2005-Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application
14 0.14995185 36 jmlr-2005-Gaussian Processes for Ordinal Regression
15 0.13842507 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
16 0.13837355 22 jmlr-2005-Concentration Bounds for Unigram Language Models
17 0.1345205 65 jmlr-2005-Separating a Real-Life Nonlinear Image Mixture
18 0.12721452 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
19 0.12455808 51 jmlr-2005-Local Propagation in Conditional Gaussian Bayesian Networks
20 0.12418165 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
topicId topicWeight
[(13, 0.024), (17, 0.023), (19, 0.02), (36, 0.025), (37, 0.034), (42, 0.045), (43, 0.038), (47, 0.02), (52, 0.102), (59, 0.031), (70, 0.049), (84, 0.374), (88, 0.093), (90, 0.021), (92, 0.011), (94, 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.733603 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
Author: Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, Suvrit Sra
Abstract: Several large scale data mining applications, such as text categorization and gene expression analysis, involve high-dimensional data that is also inherently directional in nature. Often such data is L2 normalized so that it lies on the surface of a unit hypersphere. Popular models such as (mixtures of) multi-variate Gaussians are inadequate for characterizing such data. This paper proposes a generative mixture-model approach to clustering directional data based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. In particular, we derive and analyze two variants of the Expectation Maximization (EM) framework for estimating the mean and concentration parameters of this mixture. Numerical estimation of the concentration parameters is non-trivial in high dimensions since it involves functional inversion of ratios of Bessel functions. We also formulate two clustering algorithms corresponding to the variants of EM that we derive. Our approach provides a theoretical basis for the use of cosine similarity that has been widely employed by the information retrieval community, and obtains the spherical kmeans algorithm (kmeans with cosine similarity) as a special case of both variants. Empirical results on clustering of high-dimensional text and gene-expression data based on a mixture of vMF distributions show that the ability to estimate the concentration parameter for each vMF component, which is not present in existing approaches, yields superior results, especially for difficult clustering tasks in high-dimensional spaces. Keywords: clustering, directional distributions, mixtures, von Mises-Fisher, expectation maximization, maximum likelihood, high dimensional data
2 0.4598155 20 jmlr-2005-Clustering with Bregman Divergences
Author: Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, Joydeep Ghosh
Abstract: A wide variety of distortion functions, such as squared Euclidean distance, Mahalanobis distance, Itakura-Saito distance and relative entropy, have been used for clustering. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroid-based parametric clustering approaches, such as classical kmeans, the Linde-Buzo-Gray (LBG) algorithm and information-theoretic clustering, which arise by special choices of the Bregman divergence. The algorithms maintain the simplicity and scalability of the classical kmeans algorithm, while generalizing the method to a large class of clustering loss functions. This is achieved by first posing the hard clustering problem in terms of minimizing the loss in Bregman information, a quantity motivated by rate distortion theory, and then deriving an iterative algorithm that monotonically decreases this loss. In addition, we show that there is a bijection between regular exponential families and a large class of Bregman divergences, that we call regular Bregman divergences. This result enables the development of an alternative interpretation of an efficient EM scheme for learning mixtures of exponential family distributions, and leads to a simple soft clustering algorithm for regular Bregman divergences. Finally, we discuss the connection between rate distortion theory and Bregman clustering and present an information theoretic analysis of Bregman clustering algorithms in terms of a trade-off between compression and loss in Bregman information. Keywords: clustering, Bregman divergences, Bregman information, exponential families, expectation maximization, information theory
3 0.38567796 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
Author: Antoine Bordes, Seyda Ertekin, Jason Weston, Léon Bottou
Abstract: Very high dimensional learning systems become theoretically possible when training examples are abundant. The computing cost then becomes the limiting factor. Any efficient learning algorithm should at least take a brief look at each example. But should all examples be given equal attention? This contribution proposes an empirical answer. We first present an online SVM algorithm based on this premise. LASVM yields competitive misclassification rates after a single pass over the training examples, outspeeding state-of-the-art SVM solvers. Then we show how active example selection can yield faster training, higher accuracies, and simpler models, using only a fraction of the training example labels.
4 0.37989935 36 jmlr-2005-Gaussian Processes for Ordinal Regression
Author: Wei Chu, Zoubin Ghahramani
Abstract: We present a probabilistic kernel approach to ordinal regression based on Gaussian processes. A threshold model that generalizes the probit function is used as the likelihood function for ordinal variables. Two inference techniques, based on the Laplace approximation and the expectation propagation algorithm respectively, are derived for hyperparameter learning and model selection. We compare these two Gaussian process approaches with a previous ordinal regression method based on support vector machines on some benchmark and real-world data sets, including applications of ordinal regression to collaborative filtering and gene expression analysis. Experimental results on these data sets verify the usefulness of our approach. Keywords: Gaussian processes, ordinal regression, approximate Bayesian inference, collaborative filtering, gene expression analysis, feature selection
5 0.37799627 46 jmlr-2005-Learning a Mahalanobis Metric from Equivalence Constraints
Author: Aharon Bar-Hillel, Tomer Hertz, Noam Shental, Daphna Weinshall
Abstract: Many learning algorithms use a metric defined over the input space as a principal tool, and their performance critically depends on the quality of this metric. We address the problem of learning metrics using side-information in the form of equivalence constraints. Unlike labels, we demonstrate that this type of side-information can sometimes be automatically obtained without the need of human intervention. We show how such side-information can be used to modify the representation of the data, leading to improved clustering and classification. Specifically, we present the Relevant Component Analysis (RCA) algorithm, which is a simple and efficient algorithm for learning a Mahalanobis metric. We show that RCA is the solution of an interesting optimization problem, founded on an information theoretic basis. If dimensionality reduction is allowed within RCA, we show that it is optimally accomplished by a version of Fisher’s linear discriminant that uses constraints. Moreover, under certain Gaussian assumptions, RCA can be viewed as a Maximum Likelihood estimation of the within class covariance matrix. We conclude with extensive empirical evaluations of RCA, showing its advantage over alternative methods. Keywords: clustering, metric learning, dimensionality reduction, equivalence constraints, side information.
6 0.37766039 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
7 0.37524962 71 jmlr-2005-Variational Message Passing
8 0.37155429 3 jmlr-2005-A Classification Framework for Anomaly Detection
9 0.36970603 39 jmlr-2005-Information Bottleneck for Gaussian Variables
10 0.36775792 64 jmlr-2005-Semigroup Kernels on Measures
11 0.36678857 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
12 0.36655298 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
13 0.36366132 44 jmlr-2005-Learning Module Networks
14 0.36119193 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification
15 0.35929167 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
16 0.35890064 9 jmlr-2005-Active Learning to Recognize Multiple Types of Plankton
17 0.35837942 11 jmlr-2005-Algorithmic Stability and Meta-Learning
18 0.35532784 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets
19 0.35387355 54 jmlr-2005-Managing Diversity in Regression Ensembles
20 0.35344434 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning