jmlr jmlr2005 jmlr2005-27 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Hyunsoo Kim, Peg Howland, Haesun Park
Abstract: Support vector machines (SVMs) have been recognized as one of the most successful classification methods for many applications including text classification. Even though the learning ability and computational complexity of training in support vector machines may be independent of the dimension of the feature space, reducing computational complexity is an essential issue to efficiently handle a large number of terms in practical applications of text classification. In this paper, we adopt novel dimension reduction methods to reduce the dimension of the document vectors dramatically. We also introduce decision functions for the centroid-based classification algorithm and support vector classifiers to handle the classification problem where a document may belong to multiple classes. Our substantial experimental results show that with several dimension reduction methods that are designed particularly for clustered data, higher efficiency for both training and testing can be achieved without sacrificing prediction accuracy of text classification even when the dimension of the input space is significantly reduced. Keywords: dimension reduction, support vector machines, text classification, linear discriminant analysis, centroids
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper, we adopt novel dimension reduction methods to reduce the dimension of the document vectors dramatically. [sent-11, score-0.668]
2 Keywords: dimension reduction, support vector machines, text classification, linear discriminant analysis, centroids 1. [sent-14, score-0.425]
3 Several characteristics have been observed in vector space based methods for text classification (20; 21), including the high dimensionality of the input space, sparsity of document vectors, linear separability in most text classification problems, and the belief that few features are irrelevant. [sent-17, score-0.389]
4 It has been conjectured that an aggressive dimension reduction may result in a significant loss of information and therefore lead to poor classification results (13). [sent-18, score-0.349]
5 The cost of evaluating the kernel function depends on the dimension of the input data, since the kernel functions involve the inner product of two input vectors (for the linear and polynomial kernels) or the distance between two vectors (for the Gaussian RBF kernel). [sent-32, score-0.254]
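To make that dependence on the input dimension concrete, here is a minimal NumPy sketch of the three kernels mentioned (an illustration, not code from the paper); the parameter names degree, coef0, and gamma are our own.

```python
import numpy as np

def linear_kernel(x, z):
    # Inner product of two input vectors: cost grows linearly with their dimension.
    return np.dot(x, z)

def polynomial_kernel(x, z, degree=3, coef0=1.0):
    # (x'z + coef0)^degree -- still dominated by one m-dimensional inner product.
    return (np.dot(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma=0.1):
    # exp(-gamma * ||x - z||^2) -- dominated by one m-dimensional distance computation.
    diff = x - z
    return np.exp(-gamma * np.dot(diff, diff))

# Shorter vectors (after dimension reduction) make every kernel evaluation cheaper.
x, z = np.random.rand(22095), np.random.rand(22095)   # full-space document vectors
x_red, z_red = np.random.rand(5), np.random.rand(5)   # reduced representations
print(rbf_kernel(x, z), rbf_kernel(x_red, z_red))
```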
6 Throughout the paper, we will assume that the document set is represented in an m × n term-document matrix A = (a_ij), in which each column represents a document, and each entry a_ij represents the weighted frequency of term i in document j (1; 2). [sent-36, score-0.409]
7 Although the truncated SVD provides the closest approximation to A in Frobenius or L2 norm, LSI ignores the cluster structure while reducing the dimension of the data. [sent-39, score-0.348]
8 With dimension reduction, computational complexity can be dramatically reduced for all classifiers including support vector machines and k-nearest neighbor classification. [sent-41, score-0.312]
9 In many document data sets, documents can be assigned to more than one cluster upon classification. [sent-44, score-0.438]
10 Our numerical experiments illustrate that the cluster-preserving dimension reduction algorithms we employ reduce the data dimension without any significant loss of information. [sent-46, score-0.501]
11 In fact, in many cases they seem to have the effect of noise reduction, since prediction accuracy improves after dimension reduction compared to that in the original high dimensional input space. [sent-47, score-0.452]
12 Low-Rank Approximation Using Latent Semantic Indexing. LSI is based on the assumption that there is some underlying latent semantic structure in the term-document matrix that is corrupted by the wide variety of words used in documents and queries. [sent-49, score-0.27]
13 The basic idea is that if two document vectors represent the same topic, they will share many associated words with a keyword, and they will have very close semantic structures after dimension reduction via the SVD. [sent-51, score-0.588]
14 Thus LSI/SVD breaks the original relationship of the data into linearly independent components (6), where the original term vectors are represented by left singular vectors and document vectors by right singular vectors. [sent-52, score-0.345]
15 That is, if l ≤ rank(A), then A ≈ U_l Σ_l V_l^T, where the columns of U_l are the leading l left singular vectors, Σ_l is an l × l diagonal matrix with the l largest singular values in nonincreasing order along its diagonal, and the columns of V_l are the leading l right singular vectors. [sent-53, score-0.41]
16 Then Σ_l V_l^T is the reduced dimensional representation of A; equivalently, a new document q ∈ R^{m×1} can be represented in the l-dimensional space as q̂ = U_l^T q. [sent-54, score-0.317]
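A minimal NumPy sketch of the LSI/SVD reduction just described; the matrix sizes and the choice of l below are placeholders, not values from the paper.

```python
import numpy as np

def lsi_reduction(A, l):
    """Truncated SVD: A ~ U_l Sigma_l V_l^T, where A is the m x n term-document matrix."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_l = U[:, :l]                              # leading l left singular vectors
    reduced_docs = np.diag(s[:l]) @ Vt[:l, :]   # Sigma_l V_l^T: l x n reduced representation
    return U_l, reduced_docs

def project_query(U_l, q):
    """Represent a new document q in the l-dimensional space: q_hat = U_l^T q."""
    return U_l.T @ q

# Toy term-document matrix: 100 terms, 20 documents.
A = np.random.rand(100, 20)
U_l, Y = lsi_reduction(A, l=5)
q_hat = project_query(U_l, np.random.rand(100))
```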
17 Since complete orthogonal decompositions such as ULV or URV have computational advantages over the SVD, including easier updating (22; 23; 24) and downdating (17), dimension reduction by these faster low-rank orthogonal decompositions has also been exploited (3). [sent-56, score-0.709]
18 In addition, since there is no theoretical optimum value for the reduced dimension, potentially expensive experimentation may be required to determine a reduced dimension l. [sent-58, score-0.31]
19 The experimental results confirm that when the data set is already clustered, the dimension reduction methods we present in the next section are more effective for classification of new data. [sent-60, score-0.349]
20 Compute the centroid c_i of the ith cluster, 1 ≤ i ≤ p. [sent-63, score-0.697]
21 Compute the centroid c_i of the ith cluster, 1 ≤ i ≤ p. [sent-67, score-0.697]
22 Dimension Reduction Algorithms for Clustered Data. To achieve greater efficiency in manipulating data represented in a high dimensional space, it is often necessary to reduce the dimension dramatically. [sent-71, score-0.223]
23 In this section, several dimension reduction methods that preserve the cluster structure are reviewed. [sent-72, score-0.496]
24 Each method attempts to choose a projection to a reduced dimensional space that will capture the cluster structure of the data collection as much as possible. [sent-73, score-0.297]
25 Instead of treating each column of the matrix A equally regardless of its membership in a specific cluster as in LSI/SVD, we want to find a lower dimensional representation Y of A so that the p clusters are preserved in Y . [sent-76, score-0.393]
26 Given a term-document matrix, the problem is to find a transformation that maps each document vector in the m dimensional space to a vector in the l dimensional space for some l < m. [sent-77, score-0.309]
27 For this, either the dimension reducing transformation G^T ∈ R^{l×m} is computed explicitly, or the problem is formulated as a rank reducing approximation where the given matrix A is to be decomposed into two matrices B and Y. [sent-78, score-0.323]
28 The matrix B accounts for the dimension reducing transformation. [sent-80, score-0.244]
29 However, it is not necessary to compute the dimension reducing transformation G from B explicitly, as long as we can find the reduced dimensional representation of a given data item. [sent-81, score-0.351]
30 Any given document q ∈ R^{m×1} can be transformed to the lower dimensional space by solving the minimization problem min_{q̂ ∈ R^{l×1}} ||B q̂ − q||_2 (4). [sent-83, score-0.238]
31 In the Centroid dimension reduction algorithm (see Algorithm 1), the ith column of B is the centroid vector of the ith cluster, which is the average of the data items in the ith cluster, for 1 ≤ i ≤ p. [sent-86, score-1.001]
32 Then any vector q ∈ R^{m×1} can be represented in the p-dimensional space as q̂, the solution of the least squares problem (4), where B is the centroid matrix. [sent-88, score-0.624]
33 In the Orthogonal Centroid algorithm (see Algorithm 2), the p-dimensional representation of a data vector q ∈ R^{m×1} is given as q̂ = Q_p^T q, where Q_p is an orthonormal basis for the centroid matrix obtained from its QR decomposition. [sent-89, score-0.667]
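A minimal sketch of the Centroid and Orthogonal Centroid reductions under the notation above; the data and cluster labels are synthetic placeholders.

```python
import numpy as np

def centroid_matrix(A, labels, p):
    """B whose ith column is the centroid of cluster i (A is m x n, labels in {0,...,p-1})."""
    return np.column_stack([A[:, labels == i].mean(axis=1) for i in range(p)])

def centroid_reduce(B, q):
    """Centroid algorithm: q_hat solves min_{q_hat} ||B q_hat - q||_2 (least squares problem (4))."""
    q_hat, *_ = np.linalg.lstsq(B, q, rcond=None)
    return q_hat

def orthogonal_centroid_reduce(B, q):
    """Orthogonal Centroid: q_hat = Q_p^T q, where B = Q_p R is the thin QR decomposition of B."""
    Q_p, _ = np.linalg.qr(B)
    return Q_p.T @ q

A = np.random.rand(100, 60)                  # 100 terms, 60 documents
labels = np.random.randint(0, 3, size=60)    # 3 clusters
B = centroid_matrix(A, labels, p=3)
q = np.random.rand(100)
print(centroid_reduce(B, q), orthogonal_centroid_reduce(B, q))
```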
34 The centroid-based dimension reduction algorithms are computationally less costly than LSI/SVD. [sent-90, score-0.349]
35 Although the centroid-based schemes can be applied only when the data are linearly separable, they are suitable for text classification problems, since text data is usually linearly separable in the original dimensional space (13). [sent-92, score-0.335]
36 2 Generalized Discriminant Analysis based on the Generalized Singular Value Decomposition. Recently, a new algorithm has been developed for cluster-preserving dimension reduction based on the generalized singular value decomposition (GSVD) (10). [sent-95, score-0.488]
37 Classical discriminant analysis (7; 25) preserves cluster structure by maximizing the scatter between clusters while minimizing the scatter within clusters. [sent-97, score-0.346]
38 If we denote by N_i the set of column indices that belong to cluster i, by n_i the number of columns in cluster i, and by c the global centroid, then S_w = ∑_{i=1}^{p} ∑_{j∈N_i} (a_j − c_i)(a_j − c_i)^T and S_b = ∑_{i=1}^{p} ∑_{j∈N_i} (c_i − c)(c_i − c)^T = ∑_{i=1}^{p} n_i (c_i − c)(c_i − c)^T. [sent-99, score-0.699]
39 In fact, when the reduced dimension l ≥ p − 1, trace(S_w^{−1} S_b) is exactly preserved upon dimension reduction, and equals λ_1 + · · · + λ_{p−1}, where each λ_i ≥ 0. [sent-117, score-0.462]
40 Without loss of generality, we assume that the term-document matrix A is partitioned as A = [A_1, · · · , A_p], where the columns of each block A_i ∈ R^{m×n_i} belong to cluster i. [sent-118, score-0.269]
41 As the product of an m × n matrix with an n × m matrix, S_w will be singular when the number of terms m exceeds the number of documents n. [sent-126, score-0.227]
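As a concrete illustration of the scatter matrices defined above (a sketch with synthetic data, not the paper's LDA/GSVD implementation), S_w and S_b can be formed as follows; note that the rank of S_w is at most n − p here, so it is singular whenever m > n.

```python
import numpy as np

def scatter_matrices(A, labels, p):
    """Within-cluster (S_w) and between-cluster (S_b) scatter of the m x n matrix A."""
    m, n = A.shape
    c = A.mean(axis=1, keepdims=True)              # global centroid
    S_w = np.zeros((m, m))
    S_b = np.zeros((m, m))
    for i in range(p):
        A_i = A[:, labels == i]
        c_i = A_i.mean(axis=1, keepdims=True)      # centroid of cluster i
        D = A_i - c_i
        S_w += D @ D.T                             # sum_j (a_j - c_i)(a_j - c_i)^T
        S_b += A_i.shape[1] * (c_i - c) @ (c_i - c).T   # n_i (c_i - c)(c_i - c)^T
    return S_w, S_b

A = np.random.rand(50, 30)                         # m = 50 terms, n = 30 documents
labels = np.random.randint(0, 3, size=30)
S_w, S_b = scatter_matrices(A, labels, p=3)
print(np.linalg.matrix_rank(S_w))                  # at most n - p < m, hence S_w is singular
```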
42 Classification Methods. To test the effect of dimension reduction in text classification, three different classification methods were used: centroid-based classification, k-nearest neighbor (kNN), and support vector machines (SVMs). [sent-139, score-0.541]
43 Algorithm 4: Centroid-based Classification. Given a data matrix A with p clusters and p corresponding centroids c_i, 1 ≤ i ≤ p, and a vector q ∈ R^{m×1}, this method finds the index j of the cluster to which the vector q belongs. [sent-142, score-0.35]
44 • Find the index j such that sim(q, c_i), 1 ≤ i ≤ p, is minimum (or maximum), where sim(q, c_i) is the similarity measure between q and c_i. [sent-143, score-0.398]
45 (For example, sim(q, c_i) = ||q − c_i||_2 using the L2 norm, and we take the index with the minimum value. [sent-144, score-0.222]
46 Using the cosine measure, sim(q, c_i) = cos(q, c_i) = q^T c_i / (||q||_2 ||c_i||_2), and we take the index with the maximum value. [sent-145, score-0.661]
47 Using the cosine similarity measure, we can classify a test document q by computing arg max_{1≤i≤p} q^T c_i / (||q||_2 ||c_i||_2) (8), where c_i is the centroid of the ith cluster of the training data. [sent-149, score-1.515]
48 When dimension reduction is performed by the Centroid algorithm, the centroids of the full space become the columns e_i ∈ R^{p×1} of the identity matrix. [sent-150, score-0.55]
49 Then the decision rule becomes arg max_{1≤i≤p} q̂^T e_i / (||q̂||_2 ||e_i||_2) (9), where q̂ is the reduced dimensional representation of the document q. [sent-151, score-0.317]
50 We can also classify using the L2 norm similarity measure by finding the centroid that is closest to q in L2 norm. [sent-154, score-0.618]
51 The original form of centroid-based classification finds the nearest centroid and assigns the corresponding class as the predicted class. [sent-155, score-0.584]
52 In this way, document x will be a member of class j if its similarity to the centroid vector c_j for that class is above a given threshold. [sent-157, score-0.785]
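A minimal sketch of the centroid-based decision rules just described: the cosine rule of equation (8) and the thresholded variant for multi-class membership; the threshold value is an illustrative placeholder.

```python
import numpy as np

def cosine_sims(q, centroids):
    """Cosine similarity of q to each column c_i of the centroid matrix (equation (8))."""
    q_norm = np.linalg.norm(q)
    c_norms = np.linalg.norm(centroids, axis=0)
    return (centroids.T @ q) / (q_norm * c_norms)

def classify_single(q, centroids):
    # Assign q to the cluster whose centroid has the largest cosine similarity.
    return int(np.argmax(cosine_sims(q, centroids)))

def classify_multi(q, centroids, threshold=0.5):
    # Multi-label variant: q belongs to every class j with sim(q, c_j) above the threshold.
    sims = cosine_sims(q, centroids)
    return np.where(sims > threshold)[0]

centroids = np.random.rand(100, 5)      # 5 class centroids in a 100-dimensional space
q = np.random.rand(100)
print(classify_single(q, centroids), classify_multi(q, centroids))
```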
53 Experimental Results. Prediction results are compared for the test documents in the full space without any dimension reduction as well as those in the reduced space obtained by LSI/SVD, Centroid, Orthogonal Centroid, and LDA/GSVD dimension reduction methods. [sent-179, score-0.915]
54 For SVMs, we optimized the regularization parameter C, polynomial degree d for the polynomial kernel, and γ for the Gaussian RBF (radial basis function) kernel for each full and reduced dimension data set. [sent-180, score-0.376]
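As a hedged illustration of that tuning step (not the authors' actual setup), the following sketch uses scikit-learn's GridSearchCV; the parameter grids and synthetic data are our own choices.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# X_reduced: documents in the reduced space (one row per document), y: class labels.
X_reduced = np.random.rand(200, 5)
y = np.random.randint(0, 2, size=200)

# Tune C, the polynomial degree d, and the RBF gamma, as described above.
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["poly"], "C": [0.1, 1, 10], "degree": [2, 3, 4]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1.0]},
]
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X_reduced, y)
print(search.best_params_, search.best_score_)
```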
55 [Table 1 residue: rows list the classification methods (centroid with L2 and Cosine measures, 5NN/15NN/30NN with L2 and Cosine measures, and SVM) and columns are the reduced dimensions, starting at l = 5; the numerical entries are not recoverable here.] [sent-181, score-1.106]
56 Table 1: Text classification accuracy (%) using centroid-based classification, k-nearest neighbor classification, and SVMs, with LSI/SVD dimension reduction on the MEDLINE data set. [sent-262, score-0.433]
57 The Euclidean norm (L2 ) and the cosine similarity measure (Cosine) were used for the centroid-based and kNN classification. [sent-263, score-0.282]
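For concreteness, a minimal sketch of kNN classification with both measures used in these tables (Euclidean distance and cosine similarity); the training data below is synthetic.

```python
import numpy as np
from collections import Counter

def knn_predict(q, X_train, y_train, k=15, measure="cosine"):
    """Classify q by majority vote among its k nearest training documents (columns of X_train)."""
    if measure == "cosine":
        sims = (X_train.T @ q) / (np.linalg.norm(X_train, axis=0) * np.linalg.norm(q))
        neighbors = np.argsort(-sims)[:k]          # largest similarity first
    else:  # L2 norm
        dists = np.linalg.norm(X_train - q[:, None], axis=0)
        neighbors = np.argsort(dists)[:k]          # smallest distance first
    return Counter(y_train[neighbors]).most_common(1)[0][0]

X_train = np.random.rand(100, 300)                 # 100-dimensional vectors, 300 training documents
y_train = np.random.randint(0, 5, size=300)
print(knn_predict(np.random.rand(100), X_train, y_train, k=15))
```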
58 Table 1 reports text classification accuracy for the MEDLINE data set using LSI/SVD with a range of values for the reduced dimension. [sent-275, score-0.222]
59 The smallest reduced dimension, l = 5, is included in order to compare with centroid-based and LDA/GSVD methods, which reduce the dimension to 5 and 4, respectively. [sent-276, score-0.231]
60 For a training set of size 1250, the reduced dimension l = 300 is generous. [sent-278, score-0.231]
61 This is consistent with the common belief that cosine similarity performs better with unnormalized text data. [sent-280, score-0.393]
62 Also, classification accuracy using 5NN lags that for higher values of k, suggesting that k = 5 is too small for classes [...]. [Table 2 residue: columns are the full space (22095×1250) and the Centroid (5×1250), Orthogonal Centroid (5×1250), and LDA/GSVD (4×1250) reduced spaces, with rows for the different SVM kernels; numerical entries not recoverable here.] [sent-281, score-0.263]
63 Table 2: Text classification accuracy (%) with different kernels in SVMs with and without dimension reduction on the MEDLINE data set. [sent-352, score-0.381]
64 It is noteworthy that even with LSI, which makes no attempt to preserve the cluster structure upon dimension reduction, SVM classification achieves very consistent classification results for reduced dimensions of 100 or greater, and the SVM accuracy exceeds that of the other classification methods. [sent-357, score-0.439]
65 Table 2 shows text classification accuracy (%) with different kernels in SVMs, with and without dimension reduction on the MEDLINE data set. [sent-358, score-0.492]
66 This table shows that the prediction results in the reduced dimension are similar to those in the original full dimensional space, while achieving a significant reduction in time and space complexity. [sent-360, score-0.542]
67 In the reduced space obtained by the Orthogonal Centroid dimension reduction algorithm, the classification accuracy is insensitive to the choice of the kernel. [sent-361, score-0.46]
68 Table 3 shows classification accuracy obtained by all three classification methods – centroid-based, kNN with three different values of k, and the optimal result from SVM – for each dimension-reduced data set and for the full space. [sent-363, score-0.306]
69 For the LDA/GSVD dimension reduction method, the classification accuracy with the cosine similarity measure is lower for centroid-based classification as well as for kNN, while the results with the L2 norm are better. [sent-364, score-0.663]
70 With LDA/GSVD, documents from the same class in the full dimensional space tend to be transformed to a very tight cluster, or even to a single point, in the reduced space, since the LDA/GSVD algorithm tends to minimize the trace of the within-cluster scatter. [sent-366, score-1.201]
[Table 3 residue: rows list the classification methods (centroid and kNN with L2 and Cosine measures, and SVM); the first column is the full space (22095×1250); numerical entries not recoverable here.]
71 Table 3: Text classification accuracy (%) using centroid-based classification, k-nearest neighbor classification, and SVMs, with and without dimension reduction on the MEDLINE data set. [sent-411, score-0.433]
72 The Euclidean norm (L2) and the cosine similarity measure (Cosine) were used for centroid-based and kNN classification. [sent-412, score-0.282]
74 Table 4 shows text classification accuracy for the 5 classes using SVMs with and without dimension reduction methods on the MEDLINE data set. [sent-447, score-0.492]
75 The REUTERS data set has many documents that are assigned to more than two classes, whereas no document in the MEDLINE data set belongs to more than one class. [sent-449, score-0.291]
76 Table 5: Comparison of micro-averaged F1 scores for 3 different classification methods with and without dimension reduction on the REUTERS data set. [sent-468, score-0.263]
77 The Euclidean norm (L2 ) and the cosine similarity measure (Cosine) were used for the centroid-based classification. [sent-469, score-0.282]
78 The cosine similarity measure was used for the kNN classification. [sent-470, score-0.282]
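Micro-averaged F1, used in Table 5, pools true positives, false positives, and false negatives over all classes before computing precision and recall; a small sketch under that definition, with synthetic 0/1 label matrices:

```python
import numpy as np

def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for multi-label data given as 0/1 indicator matrices (documents x classes)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0

y_true = np.random.randint(0, 2, size=(50, 10))   # 50 documents, 10 classes
y_pred = np.random.randint(0, 2, size=(50, 10))
print(micro_f1(y_true, y_pred))
```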
79 The dimension of the full training term-document matrix is 11941×9579 and that of the reduced matrix is 90×9579. [sent-471, score-0.36]
80 Although we could handle relatively large matrices using a sparse matrix representation and sparse QR decomposition in the Centroid and Orthogonal Centroid dimension reduction methods, results for the LDA/GSVD dimension reduction method are not reported, since we ran out of memory while computing the GSVD. [sent-472, score-0.791]
81 Table 5 shows that the effectiveness of classification was preserved for the Orthogonal Centroid dimension reduction algorithm, while it became worse for the Centroid dimension reduction algorithm. [sent-476, score-0.778]
82 This is due to a property of the Centroid algorithm that the centroids of the full space are projected to the columns of the identity matrix in the reduced space. [sent-477, score-0.323]
83 This orthogonality between the centroids may make it difficult to represent the multiclass membership of a document by separating closely related classes after dimension reduction. [sent-478, score-0.46]
84 Conclusion and Discussion. In this paper, we applied three methods, Centroid, Orthogonal Centroid, and LDA/GSVD, which are designed for reducing the dimension of clustered data. [sent-482, score-0.262]
85 For comparison, we also applied LSI/SVD, which does not attempt to preserve cluster structure upon dimension reduction. [sent-483, score-0.328]
86 We tested the effectiveness in classification with dimension reduction using three different classification methods: centroid-based classification, k-nearest neighbor classification, and SVMs. [Table residue: rows list the REUTERS classes earn, acq, money-fx, grain, crude, trade, interest, ship, wheat, and corn, plus microavg (top 10), avg (top 10), and microavg (all); the first column is the full space (11941×9579); numerical entries not recoverable here.] [sent-484, score-0.397]
87 The dimension of the full training term-document matrix is 11941×9579 and that of the reduced matrix is 90×9579. [sent-525, score-0.36]
88 These results justify dimension reduction as a worthwhile preprocessing stage for achieving high efficiency and effectiveness. [sent-528, score-0.349]
89 Especially for kNN classification, the savings in computational complexity after dimension reduction are significant. [sent-529, score-0.389]
90 In the case of SVMs the savings are also clear, since the distances between pairs of input data points need to be computed repeatedly, with or without the use of the kernel function, and the vectors become significantly shorter with dimension reduction. [sent-530, score-0.226]
91 Prediction results with the Centroid dimension reduction method improved compared to those from the full space for the completely disjoint MEDLINE data set, but became worse for the REUTERS data set. [sent-532, score-0.452]
92 Since the Centroid dimension reduction method maps the centroids to unit vectors e_i which are orthogonal to each other, it is helpful for the disjoint data set, but not for a data set which contains documents belonging to multiple classes. [sent-533, score-0.677]
93 We observed that prediction accuracy with the Orthogonal Centroid dimension reduction algorithm was preserved for SVMs as well as with centroid-based classification. [sent-534, score-0.431]
94 Another way to handle non-linearly separable data is to apply nonlinear extensions of the dimension reduction methods, including those presented in (18; 19). [sent-539, score-0.391]
95 All of the dimension reduction methods presented here can also be applied to visualize the higher dimensional structure by reducing the dimension to 2- or 3-dimensional space. [sent-540, score-0.621]
96 We conclude that dramatic dimension reduction of text documents can be achieved, without sacrificing classification accuracy. [sent-541, score-0.555]
97 For the document sets we tested, the Orthogonal Centroid method did particularly well at preserving the cluster structure from the full dimensional representation. [sent-542, score-0.428]
98 That is, the prediction accuracies for Orthogonal Centroid rival those of the full space, even though the dimension is reduced to the number of clusters. [sent-543, score-0.274]
99 Dimensional reduction based on centroids and least squares for efficient processing of text data. [sent-623, score-0.416]
100 Lower dimensional representation of text data based on centroids and least squares, BIT Numerical Mathematics, 42(2):1–22, 2003. [sent-648, score-0.29]
wordName wordTfidf (topN-words)
[('centroid', 0.553), ('knn', 0.269), ('cosine', 0.217), ('reduction', 0.197), ('document', 0.167), ('medline', 0.16), ('dimension', 0.152), ('cluster', 0.147), ('ext', 0.128), ('imension', 0.128), ('owland', 0.128), ('orthogonal', 0.125), ('sb', 0.121), ('sw', 0.119), ('ci', 0.111), ('text', 0.111), ('rm', 0.11), ('centroids', 0.108), ('eduction', 0.107), ('park', 0.099), ('sim', 0.098), ('lsi', 0.096), ('hw', 0.095), ('documents', 0.095), ('singular', 0.089), ('im', 0.087), ('hb', 0.087), ('classi', 0.086), ('lassification', 0.08), ('reduced', 0.079), ('svms', 0.073), ('semantic', 0.072), ('dimensional', 0.071), ('qt', 0.07), ('ul', 0.067), ('howland', 0.067), ('svd', 0.066), ('rl', 0.065), ('similarity', 0.065), ('urv', 0.064), ('clustered', 0.061), ('gt', 0.058), ('sv', 0.058), ('qr', 0.054), ('discriminant', 0.054), ('neighbor', 0.052), ('ni', 0.052), ('columns', 0.05), ('preserved', 0.05), ('decomposition', 0.05), ('svm', 0.049), ('clusters', 0.049), ('reducing', 0.049), ('microavg', 0.048), ('vlt', 0.048), ('rh', 0.048), ('scatter', 0.048), ('rbf', 0.045), ('trace', 0.045), ('indexing', 0.043), ('matrix', 0.043), ('full', 0.043), ('separable', 0.042), ('minnesota', 0.04), ('savings', 0.04), ('umn', 0.04), ('qh', 0.04), ('reuters', 0.04), ('siam', 0.039), ('berry', 0.036), ('jeon', 0.036), ('di', 0.035), ('polynomial', 0.034), ('kernel', 0.034), ('membership', 0.033), ('ith', 0.033), ('cation', 0.033), ('accuracy', 0.032), ('bq', 0.032), ('centroidbased', 0.032), ('colon', 0.032), ('downdating', 0.032), ('haesun', 0.032), ('hyunsoo', 0.032), ('linearopt', 0.032), ('oral', 0.032), ('peg', 0.032), ('rbfopt', 0.032), ('termdocument', 0.032), ('nearest', 0.031), ('became', 0.03), ('rank', 0.03), ('cancer', 0.03), ('machines', 0.029), ('belong', 0.029), ('upon', 0.029), ('latent', 0.028), ('decompositions', 0.028), ('ult', 0.027), ('paige', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 27 jmlr-2005-Dimension Reduction in Text Classification with Support Vector Machines
Author: Hyunsoo Kim, Peg Howland, Haesun Park
Abstract: Support vector machines (SVMs) have been recognized as one of the most successful classification methods for many applications including text classification. Even though the learning ability and computational complexity of training in support vector machines may be independent of the dimension of the feature space, reducing computational complexity is an essential issue to efficiently handle a large number of terms in practical applications of text classification. In this paper, we adopt novel dimension reduction methods to reduce the dimension of the document vectors dramatically. We also introduce decision functions for the centroid-based classification algorithm and support vector classifiers to handle the classification problem where a document may belong to multiple classes. Our substantial experimental results show that with several dimension reduction methods that are designed particularly for clustered data, higher efficiency for both training and testing can be achieved without sacrificing prediction accuracy of text classification even when the dimension of the input space is significantly reduced. Keywords: dimension reduction, support vector machines, text classification, linear discriminant analysis, centroids
Author: Jieping Ye
Abstract: A generalized discriminant analysis based on a new optimization criterion is presented. The criterion extends the optimization criteria of the classical Linear Discriminant Analysis (LDA) when the scatter matrices are singular. An efficient algorithm for the new optimization problem is presented. The solutions to the proposed criterion form a family of algorithms for generalized LDA, which can be characterized in a closed form. We study two specific algorithms, namely Uncorrelated LDA (ULDA) and Orthogonal LDA (OLDA). ULDA was previously proposed for feature extraction and dimension reduction, whereas OLDA is a novel algorithm proposed in this paper. The features in the reduced space of ULDA are uncorrelated, while the discriminant vectors of OLDA are orthogonal to each other. We have conducted a comparative study on a variety of real-world data sets to evaluate ULDA and OLDA in terms of classification accuracy. Keywords: dimension reduction, linear discriminant analysis, uncorrelated LDA, orthogonal LDA, singular value decomposition
3 0.081298448 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
Author: Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, Suvrit Sra
Abstract: Several large scale data mining applications, such as text categorization and gene expression analysis, involve high-dimensional data that is also inherently directional in nature. Often such data is L2 normalized so that it lies on the surface of a unit hypersphere. Popular models such as (mixtures of) multi-variate Gaussians are inadequate for characterizing such data. This paper proposes a generative mixture-model approach to clustering directional data based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. In particular, we derive and analyze two variants of the Expectation Maximization (EM) framework for estimating the mean and concentration parameters of this mixture. Numerical estimation of the concentration parameters is non-trivial in high dimensions since it involves functional inversion of ratios of Bessel functions. We also formulate two clustering algorithms corresponding to the variants of EM that we derive. Our approach provides a theoretical basis for the use of cosine similarity that has been widely employed by the information retrieval community, and obtains the spherical kmeans algorithm (kmeans with cosine similarity) as a special case of both variants. Empirical results on clustering of high-dimensional text and gene-expression data based on a mixture of vMF distributions show that the ability to estimate the concentration parameter for each vMF component, which is not present in existing approaches, yields superior results, especially for difficult clustering tasks in high-dimensional spaces. Keywords: clustering, directional distributions, mixtures, von Mises-Fisher, expectation maximization, maximum likelihood, high dimensional data c 2005 Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh and Suvrit Sra. BANERJEE , D HILLON , G HOSH AND S RA
Author: Lior Wolf, Amnon Shashua
Abstract: The problem of selecting a subset of relevant features in a potentially overwhelming quantity of data is classic and found in many branches of science. Examples in computer vision, text processing and more recently bio-informatics are abundant. In text classification tasks, for example, it is not uncommon to have 104 to 107 features of the size of the vocabulary containing word frequency counts, with the expectation that only a small fraction of them are relevant. Typical examples include the automatic sorting of URLs into a web directory and the detection of spam email. In this work we present a definition of “relevancy” based on spectral properties of the Laplacian of the features’ measurement matrix. The feature selection process is then based on a continuous ranking of the features defined by a least-squares optimization process. A remarkable property of the feature relevance function is that sparse solutions for the ranking values naturally emerge as a result of a “biased non-negativity” of a key matrix in the process. As a result, a simple leastsquares optimization process converges onto a sparse solution, i.e., a selection of a subset of features which form a local maximum over the relevance function. The feature selection algorithm can be embedded in both unsupervised and supervised inference problems and empirical evidence show that the feature selections typically achieve high accuracy even when only a small fraction of the features are relevant.
5 0.055829067 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
Author: Rie Kubota Ando, Tong Zhang
Abstract: One of the most important issues in machine learning is whether one can improve the performance of a supervised learning algorithm by including unlabeled data. Methods that use both labeled and unlabeled data are generally referred to as semi-supervised learning. Although a number of such methods are proposed, at the current stage, we still don’t have a complete understanding of their effectiveness. This paper investigates a closely related problem, which leads to a novel approach to semi-supervised learning. Specifically we consider learning predictive structures on hypothesis spaces (that is, what kind of classifiers have good predictive power) from multiple learning tasks. We present a general framework in which the structural learning problem can be formulated and analyzed theoretically, and relate it to learning with unlabeled data. Under this framework, algorithms for structural learning will be proposed, and computational issues will be investigated. Experiments will be given to demonstrate the effectiveness of the proposed algorithms in the semi-supervised learning setting.
6 0.054803681 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
7 0.050897971 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds
8 0.046129413 51 jmlr-2005-Local Propagation in Conditional Gaussian Bayesian Networks
9 0.045209214 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
10 0.042697843 6 jmlr-2005-A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs
11 0.04163852 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
12 0.04106278 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
13 0.040618151 60 jmlr-2005-On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning
14 0.039860331 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
15 0.039335914 5 jmlr-2005-A Generalization Error for Q-Learning
16 0.037510339 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines
17 0.033027574 64 jmlr-2005-Semigroup Kernels on Measures
18 0.032714657 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks
19 0.03270191 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
20 0.032173462 73 jmlr-2005-Working Set Selection Using Second Order Information for Training Support Vector Machines
topicId topicWeight
[(0, 0.211), (1, 0.106), (2, 0.126), (3, -0.067), (4, 0.166), (5, -0.052), (6, -0.247), (7, 0.291), (8, -0.017), (9, -0.125), (10, -0.432), (11, 0.182), (12, 0.075), (13, -0.039), (14, 0.046), (15, -0.068), (16, -0.001), (17, -0.086), (18, 0.098), (19, -0.022), (20, 0.001), (21, -0.007), (22, -0.029), (23, -0.122), (24, 0.007), (25, 0.033), (26, 0.07), (27, 0.031), (28, 0.081), (29, 0.04), (30, -0.015), (31, 0.024), (32, -0.023), (33, 0.073), (34, -0.007), (35, -0.061), (36, 0.036), (37, 0.079), (38, 0.012), (39, 0.032), (40, 0.014), (41, -0.005), (42, 0.012), (43, -0.059), (44, -0.043), (45, 0.019), (46, 0.087), (47, 0.053), (48, -0.023), (49, 0.039)]
simIndex simValue paperId paperTitle
same-paper 1 0.96064478 27 jmlr-2005-Dimension Reduction in Text Classification with Support Vector Machines
Author: Hyunsoo Kim, Peg Howland, Haesun Park
Abstract: Support vector machines (SVMs) have been recognized as one of the most successful classification methods for many applications including text classification. Even though the learning ability and computational complexity of training in support vector machines may be independent of the dimension of the feature space, reducing computational complexity is an essential issue to efficiently handle a large number of terms in practical applications of text classification. In this paper, we adopt novel dimension reduction methods to reduce the dimension of the document vectors dramatically. We also introduce decision functions for the centroid-based classification algorithm and support vector classifiers to handle the classification problem where a document may belong to multiple classes. Our substantial experimental results show that with several dimension reduction methods that are designed particularly for clustered data, higher efficiency for both training and testing can be achieved without sacrificing prediction accuracy of text classification even when the dimension of the input space is significantly reduced. Keywords: dimension reduction, support vector machines, text classification, linear discriminant analysis, centroids
Author: Jieping Ye
Abstract: A generalized discriminant analysis based on a new optimization criterion is presented. The criterion extends the optimization criteria of the classical Linear Discriminant Analysis (LDA) when the scatter matrices are singular. An efficient algorithm for the new optimization problem is presented. The solutions to the proposed criterion form a family of algorithms for generalized LDA, which can be characterized in a closed form. We study two specific algorithms, namely Uncorrelated LDA (ULDA) and Orthogonal LDA (OLDA). ULDA was previously proposed for feature extraction and dimension reduction, whereas OLDA is a novel algorithm proposed in this paper. The features in the reduced space of ULDA are uncorrelated, while the discriminant vectors of OLDA are orthogonal to each other. We have conducted a comparative study on a variety of real-world data sets to evaluate ULDA and OLDA in terms of classification accuracy. Keywords: dimension reduction, linear discriminant analysis, uncorrelated LDA, orthogonal LDA, singular value decomposition
Author: Lior Wolf, Amnon Shashua
Abstract: The problem of selecting a subset of relevant features in a potentially overwhelming quantity of data is classic and found in many branches of science. Examples in computer vision, text processing and more recently bio-informatics are abundant. In text classification tasks, for example, it is not uncommon to have 104 to 107 features of the size of the vocabulary containing word frequency counts, with the expectation that only a small fraction of them are relevant. Typical examples include the automatic sorting of URLs into a web directory and the detection of spam email. In this work we present a definition of “relevancy” based on spectral properties of the Laplacian of the features’ measurement matrix. The feature selection process is then based on a continuous ranking of the features defined by a least-squares optimization process. A remarkable property of the feature relevance function is that sparse solutions for the ranking values naturally emerge as a result of a “biased non-negativity” of a key matrix in the process. As a result, a simple leastsquares optimization process converges onto a sparse solution, i.e., a selection of a subset of features which form a local maximum over the relevance function. The feature selection algorithm can be embedded in both unsupervised and supervised inference problems and empirical evidence show that the feature selections typically achieve high accuracy even when only a small fraction of the features are relevant.
4 0.25806659 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
Author: Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, Suvrit Sra
Abstract: Several large scale data mining applications, such as text categorization and gene expression analysis, involve high-dimensional data that is also inherently directional in nature. Often such data is L2 normalized so that it lies on the surface of a unit hypersphere. Popular models such as (mixtures of) multi-variate Gaussians are inadequate for characterizing such data. This paper proposes a generative mixture-model approach to clustering directional data based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. In particular, we derive and analyze two variants of the Expectation Maximization (EM) framework for estimating the mean and concentration parameters of this mixture. Numerical estimation of the concentration parameters is non-trivial in high dimensions since it involves functional inversion of ratios of Bessel functions. We also formulate two clustering algorithms corresponding to the variants of EM that we derive. Our approach provides a theoretical basis for the use of cosine similarity that has been widely employed by the information retrieval community, and obtains the spherical kmeans algorithm (kmeans with cosine similarity) as a special case of both variants. Empirical results on clustering of high-dimensional text and gene-expression data based on a mixture of vMF distributions show that the ability to estimate the concentration parameter for each vMF component, which is not present in existing approaches, yields superior results, especially for difficult clustering tasks in high-dimensional spaces. Keywords: clustering, directional distributions, mixtures, von Mises-Fisher, expectation maximization, maximum likelihood, high dimensional data c 2005 Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh and Suvrit Sra. BANERJEE , D HILLON , G HOSH AND S RA
5 0.23576 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
Author: Rie Kubota Ando, Tong Zhang
Abstract: One of the most important issues in machine learning is whether one can improve the performance of a supervised learning algorithm by including unlabeled data. Methods that use both labeled and unlabeled data are generally referred to as semi-supervised learning. Although a number of such methods are proposed, at the current stage, we still don’t have a complete understanding of their effectiveness. This paper investigates a closely related problem, which leads to a novel approach to semi-supervised learning. Specifically we consider learning predictive structures on hypothesis spaces (that is, what kind of classifiers have good predictive power) from multiple learning tasks. We present a general framework in which the structural learning problem can be formulated and analyzed theoretically, and relate it to learning with unlabeled data. Under this framework, algorithms for structural learning will be proposed, and computational issues will be investigated. Experiments will be given to demonstrate the effectiveness of the proposed algorithms in the semi-supervised learning setting.
6 0.2039104 60 jmlr-2005-On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning
7 0.20196629 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds
8 0.18676385 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines
9 0.18487333 5 jmlr-2005-A Generalization Error for Q-Learning
10 0.18431322 51 jmlr-2005-Local Propagation in Conditional Gaussian Bayesian Networks
11 0.1792748 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
12 0.17811403 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
13 0.17310628 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
14 0.16724533 70 jmlr-2005-Universal Algorithms for Learning Theory Part I : Piecewise Constant Functions
15 0.16142565 6 jmlr-2005-A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs
16 0.16003367 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
17 0.15382908 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets
18 0.15245211 9 jmlr-2005-Active Learning to Recognize Multiple Types of Plankton
19 0.14448775 73 jmlr-2005-Working Set Selection Using Second Order Information for Training Support Vector Machines
20 0.14394076 41 jmlr-2005-Kernel Methods for Measuring Independence
topicId topicWeight
[(13, 0.024), (17, 0.016), (19, 0.013), (26, 0.027), (36, 0.02), (37, 0.018), (42, 0.53), (43, 0.035), (47, 0.028), (52, 0.077), (70, 0.028), (88, 0.084), (94, 0.015)]
simIndex simValue paperId paperTitle
1 0.95213944 21 jmlr-2005-Combining Information Extraction Systems Using Voting and Stacked Generalization
Author: Georgios Sigletos, Georgios Paliouras, Constantine D. Spyropoulos, Michalis Hatzopoulos
Abstract: This article investigates the effectiveness of voting and stacked generalization -also known as stacking- in the context of information extraction (IE). A new stacking framework is proposed that accommodates well-known approaches for IE. The key idea is to perform cross-validation on the base-level data set, which consists of text documents annotated with relevant information, in order to create a meta-level data set that consists of feature vectors. A classifier is then trained using the new vectors. Therefore, base-level IE systems are combined with a common classifier at the metalevel. Various voting schemes are presented for comparing against stacking in various IE domains. Well known IE systems are employed at the base-level, together with a variety of classifiers at the meta-level. Results show that both voting and stacking work better when relying on probabilistic estimates by the base-level systems. Voting proved to be effective in most domains in the experiments. Stacking, on the other hand, proved to be consistently effective over all domains, doing comparably or better than voting and always better than the best base-level systems. Particular emphasis is also given to explaining the results obtained by voting and stacking at the meta-level, with respect to the varying degree of similarity in the output of the base-level systems. Keywords: stacking, voting, information extraction, cross-validation 1
same-paper 2 0.90162253 27 jmlr-2005-Dimension Reduction in Text Classification with Support Vector Machines
Author: Hyunsoo Kim, Peg Howland, Haesun Park
Abstract: Support vector machines (SVMs) have been recognized as one of the most successful classification methods for many applications including text classification. Even though the learning ability and computational complexity of training in support vector machines may be independent of the dimension of the feature space, reducing computational complexity is an essential issue to efficiently handle a large number of terms in practical applications of text classification. In this paper, we adopt novel dimension reduction methods to reduce the dimension of the document vectors dramatically. We also introduce decision functions for the centroid-based classification algorithm and support vector classifiers to handle the classification problem where a document may belong to multiple classes. Our substantial experimental results show that with several dimension reduction methods that are designed particularly for clustered data, higher efficiency for both training and testing can be achieved without sacrificing prediction accuracy of text classification even when the dimension of the input space is significantly reduced. Keywords: dimension reduction, support vector machines, text classification, linear discriminant analysis, centroids
Author: Jieping Ye
Abstract: A generalized discriminant analysis based on a new optimization criterion is presented. The criterion extends the optimization criteria of the classical Linear Discriminant Analysis (LDA) when the scatter matrices are singular. An efficient algorithm for the new optimization problem is presented. The solutions to the proposed criterion form a family of algorithms for generalized LDA, which can be characterized in a closed form. We study two specific algorithms, namely Uncorrelated LDA (ULDA) and Orthogonal LDA (OLDA). ULDA was previously proposed for feature extraction and dimension reduction, whereas OLDA is a novel algorithm proposed in this paper. The features in the reduced space of ULDA are uncorrelated, while the discriminant vectors of OLDA are orthogonal to each other. We have conducted a comparative study on a variety of real-world data sets to evaluate ULDA and OLDA in terms of classification accuracy. Keywords: dimension reduction, linear discriminant analysis, uncorrelated LDA, orthogonal LDA, singular value decomposition
4 0.32765505 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
Author: Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, Suvrit Sra
Abstract: Several large scale data mining applications, such as text categorization and gene expression analysis, involve high-dimensional data that is also inherently directional in nature. Often such data is L2 normalized so that it lies on the surface of a unit hypersphere. Popular models such as (mixtures of) multi-variate Gaussians are inadequate for characterizing such data. This paper proposes a generative mixture-model approach to clustering directional data based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. In particular, we derive and analyze two variants of the Expectation Maximization (EM) framework for estimating the mean and concentration parameters of this mixture. Numerical estimation of the concentration parameters is non-trivial in high dimensions since it involves functional inversion of ratios of Bessel functions. We also formulate two clustering algorithms corresponding to the variants of EM that we derive. Our approach provides a theoretical basis for the use of cosine similarity that has been widely employed by the information retrieval community, and obtains the spherical kmeans algorithm (kmeans with cosine similarity) as a special case of both variants. Empirical results on clustering of high-dimensional text and gene-expression data based on a mixture of vMF distributions show that the ability to estimate the concentration parameter for each vMF component, which is not present in existing approaches, yields superior results, especially for difficult clustering tasks in high-dimensional spaces. Keywords: clustering, directional distributions, mixtures, von Mises-Fisher, expectation maximization, maximum likelihood, high dimensional data c 2005 Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh and Suvrit Sra. BANERJEE , D HILLON , G HOSH AND S RA
5 0.29396647 46 jmlr-2005-Learning a Mahalanobis Metric from Equivalence Constraints
Author: Aharon Bar-Hillel, Tomer Hertz, Noam Shental, Daphna Weinshall
Abstract: Many learning algorithms use a metric defined over the input space as a principal tool, and their performance critically depends on the quality of this metric. We address the problem of learning metrics using side-information in the form of equivalence constraints. Unlike labels, we demonstrate that this type of side-information can sometimes be automatically obtained without the need of human intervention. We show how such side-information can be used to modify the representation of the data, leading to improved clustering and classification. Specifically, we present the Relevant Component Analysis (RCA) algorithm, which is a simple and efficient algorithm for learning a Mahalanobis metric. We show that RCA is the solution of an interesting optimization problem, founded on an information theoretic basis. If dimensionality reduction is allowed within RCA, we show that it is optimally accomplished by a version of Fisher’s linear discriminant that uses constraints. Moreover, under certain Gaussian assumptions, RCA can be viewed as a Maximum Likelihood estimation of the within class covariance matrix. We conclude with extensive empirical evaluations of RCA, showing its advantage over alternative methods. Keywords: clustering, metric learning, dimensionality reduction, equivalence constraints, side information.
6 0.29373032 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
7 0.2828601 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
8 0.28230804 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets
9 0.28167328 3 jmlr-2005-A Classification Framework for Anomaly Detection
10 0.28118703 54 jmlr-2005-Managing Diversity in Regression Ensembles
11 0.27508157 64 jmlr-2005-Semigroup Kernels on Measures
12 0.27306357 39 jmlr-2005-Information Bottleneck for Gaussian Variables
13 0.27293968 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
14 0.27193934 36 jmlr-2005-Gaussian Processes for Ordinal Regression
15 0.26861519 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
16 0.26832408 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds
17 0.26808944 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
18 0.26463202 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
19 0.25988537 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
20 0.25849715 38 jmlr-2005-Generalization Bounds for the Area Under the ROC Curve