jmlr jmlr2010 jmlr2010-112 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard, Chih-Jen Lin
Abstract: Kernel techniques have long been used in SVM to handle linearly inseparable problems by transforming data to a high dimensional space, but training and testing large data sets is often time consuming. In contrast, we can efficiently train and test much larger data sets using linear SVM without kernels. In this work, we apply fast linear-SVM methods to the explicit form of polynomially mapped data and investigate implementation issues. The approach enjoys fast training and testing, but may sometimes achieve accuracy close to that of using highly nonlinear kernels. Empirical experiments show that the proposed method is useful for certain large-scale data sets. We successfully apply the proposed method to a natural language processing (NLP) application by improving the testing accuracy under some training/testing speed requirements. Keywords: decomposition methods, low-degree polynomial mapping, kernel functions, support vector machines, dependency parsing, natural language processing
Reference: text
sentIndex sentText sentNum sentScore
1 We successfully apply the proposed method to a natural language processing (NLP) application by improving the testing accuracy under some training/testing speed requirements. [sent-20, score-0.244]
2 Keywords: decomposition methods, low-degree polynomial mapping, kernel functions, support vector machines, dependency parsing, natural language processing 1. [sent-21, score-0.603]
3 In addition, the testing procedure is slow due to the kernel calculation involving support vectors and testing instances. [sent-28, score-0.31]
4 For some problems (e.g., document classification), people have shown that testing accuracy is similar with/without a nonlinear mapping. [sent-31, score-0.256]
5 If l is the number of training data, n̄ is the average number of non-zero features per instance, and each kernel evaluation takes O(n̄) time, then the cost per decomposition iteration for nonlinear SVM is O(l n̄). [sent-37, score-0.383]
6 Currently, polynomial kernels are less widely used than the RBF (Gaussian) kernel, which maps data to an infinite dimensional space. [sent-45, score-0.328]
7 This might be because under similar training and testing cost, a polynomial kernel may not give higher accuracy. [sent-46, score-0.555]
8 We show that for some data, the testing accuracy of using low-degree polynomial mappings is only slightly worse than RBF, but training/testing via linear-SVM strategies is much faster. [sent-47, score-0.548]
9 An exception where polynomial kernels have been popular is NLP (natural language processing). [sent-54, score-0.38]
10 Some have explored the fast calculation of low-degree polynomial kernels to save the testing time. [sent-55, score-0.433]
11 In Section 3, we discuss the proposed method for efficiently training and testing SVM for low-degree polynomial data mappings. [sent-61, score-0.455]
12 A particular emphasis is on the degree-2 polynomial mapping. [sent-62, score-0.242]
13 We give an NLP application on dependency parsing in Section 5. [sent-64, score-0.4]
14 We put an emphasis on the degree-2 polynomial mapping. [sent-105, score-0.242]
15 1 Low-degree Polynomial Mappings: A polynomial kernel takes the following form: K(xi, xj) = (γ xi^T xj + r)^d, (3) where γ and r are parameters and d is the degree. [sent-107, score-0.342]
16 The polynomial kernel is the inner product of two vectors φ(xi) and φ(xj). [sent-108, score-0.342]
17 For the polynomial kernel (3), the dimensionality of φ(x) is C(n + d, d) = (n + d)(n + d − 1) · · · (n + 1) / d!. [sent-131, score-0.384]
18 If the input data are dense, then the number of non-zero elements in φ(x) is O(n^d), where d is the degree of the polynomial mapping. [sent-164, score-0.352]
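To make the explicit mapping concrete, here is a small Python sketch (ours, not the paper's code) that enumerates the non-zero coordinates of the degree-2 mapping of a sparse vector. It assumes γ = 1 and r = 1 and uses the usual √2 scaling so that the inner product of two mapped vectors reproduces the kernel value; the function name expand_degree2 is hypothetical.

```python
import math

def expand_degree2(x):
    """Explicit degree-2 mapping of a sparse vector x given as {index: value}.
    With gamma = 1 and r = 1, the inner product of two expanded vectors
    equals (x^T z + 1)^2."""
    phi = {"bias": 1.0}
    idx = sorted(x)
    for i in idx:
        phi[("lin", i)] = math.sqrt(2.0) * x[i]     # sqrt(2) * x_i
        phi[("sq", i)] = x[i] * x[i]                # x_i^2
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):
            r, s = idx[a], idx[b]
            phi[("pair", r, s)] = math.sqrt(2.0) * x[r] * x[s]  # sqrt(2) * x_r * x_s
    return phi

# Only O(n_bar^2) of the C(n+2, 2) coordinates are non-zero for a sparse x.
x = {3: 0.5, 10: 1.0, 42: -2.0}
z = {3: 1.0, 42: 1.0}
phi_x, phi_z = expand_degree2(x), expand_degree2(z)
dot = sum(v * phi_z.get(k, 0.0) for k, v in phi_x.items())
lin = sum(x.get(k, 0.0) * z.get(k, 0.0) for k in set(x) | set(z))
assert abs(dot - (lin + 1.0) ** 2) < 1e-9   # matches the kernel value (3)
```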
19 If φ(xi), ∀i can be stored in memory, we construct a hash mapping from each index pair (r, s) to a feature index j. [sent-244, score-0.236]
20 In prediction, for any testing instance x, we use the same hash table to compute the inner product w^T φ(x). [sent-250, score-0.272]
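A minimal sketch of that idea, with a Python dict standing in for the hash table; the helper names (column, predict) are ours, and the γ, r, and √2 scaling factors of the mapping are omitted for brevity.

```python
# A plain dict plays the role of the hash table described above: it maps each
# degree-2 feature, identified by an index pair (r, s) with r <= s, to a
# column of w.  New columns are assigned while the training data are mapped.
pair_to_col = {}

def column(r, s):
    key = (min(r, s), max(r, s))
    if key not in pair_to_col:
        pair_to_col[key] = len(pair_to_col)   # next free column index
    return pair_to_col[key]

def predict(w, x):
    """Compute w^T phi(x) directly from a sparse x (dict: index -> value),
    without materializing phi(x); pairs unseen in training contribute zero."""
    idx = sorted(x)
    val = 0.0
    for a in range(len(idx)):
        for b in range(a, len(idx)):          # b == a gives the squared terms
            col = pair_to_col.get((idx[a], idx[b]))
            if col is not None:
                val += w[col] * x[idx[a]] * x[idx[b]]
    return val
```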
21 5 Relations with the RBF kernel: The RBF kernel (Gaussian kernel) may be the most used kernel in training nonlinear SVM. [sent-260, score-0.474]
22 This result implies that with a suitable parameter selection, the testing accuracy of using the RBF kernel is at least as good as using the linear kernel. [sent-263, score-0.256]
23 For polynomial kernels, Lippert and Rifkin (2006) discuss the relation with RBF. [sent-264, score-0.242]
24 For a positive integer d, the limit of SVM with the RBF kernel approaches SVM with a degree-d polynomial mapping of data. [sent-266, score-0.436]
25 The polynomial mapping is related only to the degree d. [sent-267, score-0.374]
26 This result seems to indicate that RBF is again at least as good as polynomial kernels. [sent-268, score-0.242]
27 However, the polynomial mapping for (3) is more general due to two additional parameters γ and r. [sent-269, score-0.336]
28 Thus the situation is unclear if parameter selection is taken into account. [Table 1 residue: number of features n per data set: a9a 123, real-sim 20,958, news20 1,355,181, ijcnn1 22, MNIST38 752, covtype 54, webspam 254; the n̄ column is truncated in this extraction.] [sent-270, score-0.363]
29 In Section 4, we give a detailed comparison between degree-2 polynomial mappings and RBF. [sent-282, score-0.392]
30 6 Parameter Selection: The polynomial kernel defined in (3) has three parameters (d, γ, and r). [sent-284, score-0.342]
31 This result is obtained by proving that a polynomial kernel K̄(xi, xj) = (γ̄ xi^T xj + r̄)^d with parameters (C̄, γ̄, r̄) (12) results in the same model as the polynomial kernel K(xi, xj) = (γ xi^T xj + 1)^d with parameters γ = γ̄/r̄ and C = r̄^d C̄. [sent-288, score-0.724]
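The identity behind this equivalence can be written out in one line (our restatement, with bars marking the original parameters):

```latex
\left(\bar{\gamma}\, x_i^{T} x_j + \bar{r}\right)^{d}
  = \bar{r}^{\,d}\left(\frac{\bar{\gamma}}{\bar{r}}\, x_i^{T} x_j + 1\right)^{d}
```

Rescaling the kernel by the constant r̄^d and rescaling the penalty parameter by the same constant leave the SVM solution unchanged, so r can be fixed to 1 and only d, γ, and C need to be searched.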
32 7 Prediction: Assume a degree-d polynomial mapping is considered and #SV is the number of support vectors. [sent-290, score-0.336]
33 For any testing data x, the prediction time with/without kernels is ∑_{i: αi > 0} αi yi K(xi, x) ⇒ O(#SV · n̄) (14) and w^T φ(x) ⇒ O(n̂) (15), where n̄ and n̂ are respectively the average number of non-zero elements in x and φ(x). [sent-291, score-0.26]
34 Experiments In this section, we experimentally analyze the proposed approach for degree-2 polynomial mappings. [sent-297, score-0.242]
35 [Table residue: analysis of φ(xi) for a9a, real-sim, news20, ijcnn1, MNIST38, covtype, and webspam, with columns n̄²/2, n̂, l·n̄, and C(n+2, 2); numeric entries lost in extraction.] [sent-301, score-0.363]
36 Problems real-sim, news20, covtype and webspam have no original test sets, so we use an 80/20 split for training and testing. [sent-359, score-0.471]
37 We compare implicit mappings (kernel) and explicit mappings of data by LIBSVM (Chang and Lin, 2001) and an extension of LIBLINEAR (Fan et al., 2008). [sent-366, score-0.3]
38 Table 2 presents these two values for the degree-2 polynomial mapping. [sent-376, score-0.242]
39 The column n̂/n̄ indicates the ratio between the memory consumption for storing φ(xi) and xi. [sent-420, score-0.268]
40 Table 3 lists n̂/n̄ to show the ratio between two methods' memory consumption on storing φ(xi) and xi, ∀i. [sent-439, score-0.268]
41 In the same table we present each method’s number of L2 cache misses and training time. [sent-440, score-0.323]
42 [Table residue: Linear (LIBLINEAR) results — columns C, training time (s), and accuracy for a9a, real-sim, ijcnn1, MNIST38, covtype, and webspam; numeric entries lost in extraction.] [sent-457, score-0.363]
43 [Table residue: per-data-set parameter values C for a9a, real-sim, ijcnn1, MNIST38, covtype, and webspam; numeric entries lost in extraction.] [sent-488, score-0.363]
44 Table 5: Training time (in seconds) and testing accuracy of using the degree-2 polynomial mapping. [sent-521, score-0.398]
45 4 Accuracy and Time of Using Linear, Degree-2 Polynomial, and RBF: We compare training time, testing time, and testing accuracy of using three mappings: linear, degree-2 polynomial, and RBF. [sent-525, score-0.369]
46 The best (C, γ) are then used to train the whole training set and obtain the testing accuracy. [sent-528, score-0.25]
47 To reduce the training time, LIBSVM allocates some memory space, called kernel cache, to store recently used kernel elements. [sent-529, score-0.439]
48 Degree-2 polynomial mappings may be useful for these data. [sent-538, score-0.392]
49 We then explore the performance of the degree-2 polynomial mapping. [sent-540, score-0.242]
50 [Table residue: per-data-set entries for the degree-2 mapping (a9a, real-sim, ijcnn1, MNIST38, covtype, webspam); numeric values lost in extraction.] [sent-548, score-0.363]
51 Table 5 also presents the testing accuracy difference between degree-2 polynomial and linear/RBF. [sent-568, score-0.398]
52 It is observed that for nearly all problems, the performance of the degree-2 polynomial mapping can compete with RBF, while for covtype, the performance is only similar to the linear mapping. [sent-569, score-0.336]
53 Regarding the training time, LIBLINEAR with degree-2 polynomial mappings is faster than LIBSVM with RBF. [sent-571, score-0.534]
54 Next, we compare the training time between LIBLINEAR and LIBSVM when the same degree-2 polynomial mapping is used. [sent-573, score-0.444]
55 Thus for applications needing to use low-degree polynomial kernels, the training time can be significantly reduced. [sent-575, score-0.35]
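For readers who want to reproduce this kind of comparison without the paper's exact code, the sketch below uses scikit-learn's wrappers of LIBSVM (SVC) and LIBLINEAR (LinearSVC); the data path and the (C, γ, r) values are placeholders, and PolynomialFeatures differs slightly from the paper's mapping (5) in how it scales the cross terms, so treat this as an approximation rather than the paper's setup.

```python
import time
from sklearn.datasets import load_svmlight_file
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC, LinearSVC

# Placeholder file name and parameter values; the paper selects (C, gamma)
# by cross-validation for each data set.
X, y = load_svmlight_file("a9a")
C, gamma, coef0 = 8.0, 0.5, 1.0

# Implicit mapping: LIBSVM (via SVC) with a degree-2 polynomial kernel.
t0 = time.time()
ksvm = SVC(kernel="poly", degree=2, gamma=gamma, coef0=coef0, C=C).fit(X, y)
print("kernel SVM: %.1f s" % (time.time() - t0))

# Explicit mapping + linear solver: roughly what LIBLINEAR does on phi(x).
# PolynomialFeatures uses plain products, not the sqrt(2)/gamma scaling of (5),
# so the two models are close but not identical.
t0 = time.time()
Phi = PolynomialFeatures(degree=2, include_bias=True).fit_transform(X)
lsvm = LinearSVC(C=C).fit(Phi, y)
print("explicit mapping + linear SVM: %.1f s" % (time.time() - t0))
```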
56 After a degree-2 polynomial mapping, the number of features may be very large. [sent-583, score-0.391]
57 In this section, we conduct a preliminary investigation on training degree-2 polynomial mappings via (17). [sent-585, score-0.5]
58 [Table residue: Linear results — training time (s), sparsity (%), and accuracy (%) for a9a, real-sim, ijcnn1, MNIST38, covtype, and webspam; numeric entries lost in extraction.] [sent-601, score-0.363]
59 [Table residue: Degree-2 polynomial results — training time (s) with φ(xi) stored vs. calculated, sparsity (%), and accuracy (%); numeric entries lost in extraction.] [sent-619, score-0.279]
60 Table 7: L1-regularized SVM: a comparison between linear and degree-2 polynomial mappings. [sent-649, score-0.242]
61 For the degree-2 polynomial mapping, we show the training time of both calculating and storing φ(x). [sent-652, score-0.499]
62 We extend a primal decomposition implementation for (17) in LIBLINEAR to handle degree-2 polynomial mappings. [sent-670, score-0.336]
63 Table 7 compares linear and degree-2 polynomial mappings by showing training time, w’s sparsity, and testing accuracy. [sent-674, score-0.605]
64 For the training time of using degree-2 polynomials, we present results by storing and calculating φ(xi ). [sent-675, score-0.257]
65 If using a degree-2 polynomial mapping, the dimensionality is C(n + 2, 2) = 283,881. [sent-689, score-0.284]
66 Due to the nice sparsity results, L1 regularization for low-degree polynomial mappings may be a promising future direction. [sent-692, score-0.44]
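A rough sketch of that direction, with scikit-learn's LIBLINEAR wrapper standing in for the paper's extended solver for (17); the parameter value and the helper name l1_degree2 are ours.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC

def l1_degree2(X, y, C=1.0):
    """L1-regularized L2-loss linear SVM trained on the explicit degree-2
    mapping; returns the model and the percentage of non-zero weights in w."""
    Phi = PolynomialFeatures(degree=2, include_bias=True).fit_transform(X)
    clf = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=C).fit(Phi, y)
    w = clf.coef_.ravel()
    return clf, 100.0 * np.count_nonzero(w) / w.size
```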
67 Data-driven dependency parsing is a common method to construct dependency graphs. [sent-697, score-0.555]
68 Different from grammar-based parsing, it learns to produce the dependency graph solely by using the training data. [sent-698, score-0.263]
69 Data-driven dependency parsing has become popular because it was chosen as the shared task at CONLL-X and CONLL 2007. [sent-699, score-0.4]
70 More information about dependency parsing can be found in, for example, McDonald and Nivre (2007). [sent-700, score-0.4]
71 1 A Multi-class Problem: We study a transition-based parsing method proposed by Nivre (2003). [sent-705, score-0.245]
72 The parsing algorithm builds a labeled dependency graph in one left-to-right pass over the input with a stack to store partially processed tokens. [sent-706, score-0.555]
73 (2006) use LIBSVM with a degree-2 polynomial kernel to train the multi-class classification problems, and get good results at CONLL-X. [sent-751, score-0.379]
74 The treebank is converted to dependency format using Penn2Malt, and the data is split into sections 02–21 for training and section 23 for testing. [sent-754, score-0.263]
75 During training we construct a canonical transition sequence from the dependency graph of each sentence in the training corpus, adopting an arc-eager approach for disambiguation. [sent-755, score-0.407]
76 When parsing a sentence, the classifiers are used for predicting the next transition, based on the features extracted from the current parse state. [sent-757, score-0.3]
77 When all the input tokens have been processed, the dependency graph is extracted from the transition sequence. [sent-758, score-0.273]
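A highly simplified sketch of this transition loop; the transition names, the feature extractor, and the classifier are placeholders, and the arc-eager preconditions (e.g., REDUCE only when the stack top already has a head) are omitted.

```python
def parse(tokens, extract_features, classify):
    """Greedy transition-based parsing: one left-to-right pass with a stack.
    `classify` returns 'SHIFT', 'LEFT-<label>', 'RIGHT-<label>', or 'REDUCE'."""
    stack, buffer, arcs = [], list(range(len(tokens))), []
    while buffer:
        action = classify(extract_features(stack, buffer, tokens, arcs))
        if action == "SHIFT" or not stack:
            stack.append(buffer.pop(0))
        elif action.startswith("LEFT-"):            # head = front of buffer
            arcs.append((buffer[0], stack.pop(), action[5:]))
        elif action.startswith("RIGHT-"):           # head = top of stack
            arcs.append((stack[-1], buffer[0], action[6:]))
            stack.append(buffer.pop(0))
        else:                                       # REDUCE
            stack.pop()
    return arcs  # (head, dependent, label) triples form the dependency graph
```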
78 We divide the training data into 125 smaller training sets according to the part-of-speech tag of the current input token. [sent-763, score-0.253]
79 Table 9: Summary of the dependency parsing data set (l = 294,582 training instances, #nz = 3,913,845 non-zero feature values). [sent-773, score-0.4]
80 2 Implementations: We consider the degree-2 polynomial mapping in (5). [sent-777, score-0.336]
81 The dimensionality of the degree-2 polynomial mapping is huge. [sent-782, score-0.378]
82 We use a dependency parsing system at Google, which calls LIBSVM/LIBLINEAR for training/testing. [sent-803, score-0.4]
83 As the whole parsing system is quite complex, we have not conducted a complete parameter optimization. [sent-805, score-0.245]
84 We compare using the linear mapping (i.e., the original input data) and the degree-2 polynomial mapping via (5). [sent-813, score-0.373]
85 Table 10: Accuracy, training time, and parsing speed (relative to LIBSVM with polynomial kernel) for the dependency parsing. [sent-836, score-0.786]
86 The accuracy of dependency parsing is measured by two evaluation metrics: [sent-839, score-0.451]
87 (1) Labeled attachment score (LAS): the percentage of tokens with correct dependency head and dependency label. [sent-840, score-0.355]
88 For LIBSVM, the polynomial kernel gives better accuracy than the RBF kernel, consistent with previous observations that polynomial mappings are important for parsing (Kudo and Matsumoto, 2000; McDonald and Pereira, 2006; Yamada and Matsumoto, 2003; Goldberg and Elhadad, 2008). [sent-843, score-1.03]
89 Moreover, LIBSVM using a degree-2 polynomial kernel produces better results in terms of UAS/LAS than LIBLINEAR using just a linear mapping of features. [sent-844, score-0.436]
90 However, parsing using LIBSVM is slow compared to LIBLINEAR. [sent-845, score-0.245]
91 We can speed up parsing by a factor of 1,652 with only a 2. [sent-846, score-0.281]
92 With a degree-2 polynomial mapping (5), we achieve UAS/LAS results similar to LIBSVM, while still maintaining high parsing speed, 103 times faster than LIBSVM. [sent-848, score-0.615]
93 4 Related Work: Earlier works have improved the testing speed of SVM with low-degree polynomial kernels. [sent-857, score-0.383]
94 They do not really form w, but their result from expanding the degree-2 polynomial kernel leads to something very similar. [sent-861, score-0.342]
95 Goldberg and Elhadad (2008) propose speeding up the calculation of low-degree polynomial kernels by separating features into rare and common ones. [sent-865, score-0.418]
96 have actually taken long training time,” so they must select a subset using properties of dependency parsing. [sent-873, score-0.263]
97 Our approach considers linear SVM on explicitly mapped data, applies state-of-the-art training techniques, and can simultaneously achieve fast training and testing. [sent-874, score-0.252]
98 It has certain requirements on the training and testing speed, but we also hope to achieve better testing accuracy. [sent-906, score-0.318]
99 splitSVM: Fast, space-efficient, non-heuristic, polynomial kernel computation for NLP applications. [sent-957, score-0.342]
100 Labeled pseudo-projective dependency parsing with support vector machines. [sent-1046, score-0.4]
wordName wordTfidf (topN-words)
[('libsvm', 0.261), ('parsing', 0.245), ('polynomial', 0.242), ('svm', 0.241), ('covtype', 0.228), ('liblinear', 0.208), ('inggaard', 0.176), ('hang', 0.175), ('rbf', 0.16), ('appings', 0.158), ('olynomial', 0.158), ('dependency', 0.155), ('mappings', 0.15), ('sieh', 0.135), ('webspam', 0.135), ('esting', 0.122), ('hsieh', 0.114), ('nlp', 0.114), ('raining', 0.112), ('training', 0.108), ('hash', 0.105), ('testing', 0.105), ('storing', 0.101), ('kernel', 0.1), ('cache', 0.099), ('inear', 0.099), ('mapping', 0.094), ('kudo', 0.09), ('kernels', 0.086), ('xi', 0.085), ('memory', 0.082), ('nivre', 0.081), ('ni', 0.078), ('elhadad', 0.07), ('joakim', 0.07), ('chang', 0.07), ('matsumoto', 0.07), ('stack', 0.069), ('nonlinear', 0.066), ('wt', 0.066), ('xr', 0.066), ('keerthi', 0.065), ('goldberg', 0.062), ('table', 0.062), ('features', 0.055), ('sathiya', 0.055), ('decomposition', 0.054), ('csie', 0.054), ('ntu', 0.054), ('misses', 0.054), ('isozaki', 0.054), ('dei', 0.053), ('gbytes', 0.053), ('hashing', 0.053), ('kazawa', 0.053), ('ringgaard', 0.053), ('na', 0.052), ('taiwan', 0.052), ('xl', 0.052), ('language', 0.052), ('accuracy', 0.051), ('mcdonald', 0.05), ('store', 0.049), ('xs', 0.049), ('sparsity', 0.048), ('calculating', 0.048), ('yuan', 0.047), ('tw', 0.047), ('tokens', 0.045), ('token', 0.045), ('yuji', 0.045), ('dual', 0.045), ('langford', 0.042), ('dimensionality', 0.042), ('ryan', 0.041), ('yamada', 0.041), ('primal', 0.04), ('rd', 0.04), ('degree', 0.038), ('input', 0.037), ('stored', 0.037), ('train', 0.037), ('huge', 0.036), ('speed', 0.036), ('transition', 0.036), ('mapped', 0.036), ('cachegrind', 0.035), ('gertz', 0.035), ('las', 0.035), ('lowdegree', 0.035), ('moh', 0.035), ('pops', 0.035), ('svetoslav', 0.035), ('taku', 0.035), ('uas', 0.035), ('elements', 0.035), ('document', 0.034), ('yi', 0.034), ('faster', 0.034), ('shi', 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999851 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
Author: Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard, Chih-Jen Lin
Abstract: Kernel techniques have long been used in SVM to handle linearly inseparable problems by transforming data to a high dimensional space, but training and testing large data sets is often time consuming. In contrast, we can efficiently train and test much larger data sets using linear SVM without kernels. In this work, we apply fast linear-SVM methods to the explicit form of polynomially mapped data and investigate implementation issues. The approach enjoys fast training and testing, but may sometimes achieve accuracy close to that of using highly nonlinear kernels. Empirical experiments show that the proposed method is useful for certain large-scale data sets. We successfully apply the proposed method to a natural language processing (NLP) application by improving the testing accuracy under some training/testing speed requirements. Keywords: decomposition methods, low-degree polynomial mapping, kernel functions, support vector machines, dependency parsing, natural language processing
2 0.14460175 40 jmlr-2010-Fast and Scalable Local Kernel Machines
Author: Nicola Segata, Enrico Blanzieri
Abstract: A computationally efficient approach to local learning with kernel methods is presented. The Fast Local Kernel Support Vector Machine (FaLK-SVM) trains a set of local SVMs on redundant neighbourhoods in the training set and an appropriate model for each query point is selected at testing time according to a proximity strategy. Supported by a recent result by Zakai and Ritov (2009) relating consistency and localizability, our approach achieves high classification accuracies by dividing the separation function in local optimisation problems that can be handled very efficiently from the computational viewpoint. The introduction of a fast local model selection further speeds-up the learning process. Learning and complexity bounds are derived for FaLK-SVM, and the empirical evaluation of the approach (with data sets up to 3 million points) showed that it is much faster and more accurate and scalable than state-of-the-art accurate and approximated SVM solvers at least for non high-dimensional data sets. More generally, we show that locality can be an important factor to sensibly speed-up learning approaches and kernel methods, differently from other recent techniques that tend to dismiss local information in order to improve scalability. Keywords: locality, kernel methods, local learning algorithms, support vector machines, instancebased learning
3 0.10986774 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning
Author: James Henderson, Ivan Titov
Abstract: We propose a class of Bayesian networks appropriate for structured prediction problems where the Bayesian network’s model structure is a function of the predicted output structure. These incremental sigmoid belief networks (ISBNs) make decoding possible because inference with partial output structures does not require summing over the unboundedly many compatible model structures, due to their directed edges and incrementally specified model structure. ISBNs are specifically targeted at challenging structured prediction problems such as natural language parsing, where learning the domain’s complex statistical dependencies benefits from large numbers of latent variables. While exact inference in ISBNs with large numbers of latent variables is not tractable, we propose two efficient approximations. First, we demonstrate that a previous neural network parsing model can be viewed as a coarse mean-field approximation to inference with ISBNs. We then derive a more accurate but still tractable variational approximation, which proves effective in artificial experiments. We compare the effectiveness of these models on a benchmark natural language parsing task, where they achieve accuracy competitive with the state-of-the-art. The model which is a closer approximation to an ISBN has better parsing accuracy, suggesting that ISBNs are an appropriate abstract model of natural language grammar learning. Keywords: Bayesian networks, dynamic Bayesian networks, grammar learning, natural language parsing, neural networks
4 0.10016796 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models
Author: Kuzman Ganchev, João Graça, Jennifer Gillenwater, Ben Taskar
Abstract: We present posterior regularization, a probabilistic framework for structured, weakly supervised learning. Our framework efficiently incorporates indirect supervision via constraints on posterior distributions of probabilistic models with latent variables. Posterior regularization separates model complexity from the complexity of structural constraints it is desired to satisfy. By directly imposing decomposable regularization on the posterior moments of latent variables during learning, we retain the computational efficiency of the unconstrained model while ensuring desired constraints hold in expectation. We present an efficient algorithm for learning with posterior regularization and illustrate its versatility on a diverse set of structural constraints such as bijectivity, symmetry and group sparsity in several large scale experiments, including multi-view learning, cross-lingual dependency grammar induction, unsupervised part-of-speech induction, and bitext word alignment.1 Keywords: posterior regularization framework, unsupervised learning, latent variables models, prior knowledge, natural language processing
5 0.10011916 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
Author: Kamaledin Ghiasi-Shirazi, Reza Safabakhsh, Mostafa Shamsi
Abstract: Appropriate selection of the kernel function, which implicitly defines the feature space of an algorithm, has a crucial role in the success of kernel methods. In this paper, we consider the problem of optimizing a kernel function over the class of translation invariant kernels for the task of binary classification. The learning capacity of this class is invariant with respect to rotation and scaling of the features and it encompasses the set of radial kernels. We show that how translation invariant kernel functions can be embedded in a nested set of sub-classes and consider the kernel learning problem over one of these sub-classes. This allows the choice of an appropriate sub-class based on the problem at hand. We use the criterion proposed by Lanckriet et al. (2004) to obtain a functional formulation for the problem. It will be proven that the optimal kernel is a finite mixture of cosine functions. The kernel learning problem is then formulated as a semi-infinite programming (SIP) problem which is solved by a sequence of quadratically constrained quadratic programming (QCQP) sub-problems. Using the fact that the cosine kernel is of rank two, we propose a formulation of a QCQP sub-problem which does not require the kernel matrices to be loaded into memory, making the method applicable to large-scale problems. We also address the issue of including other classes of kernels, such as individual kernels and isotropic Gaussian kernels, in the learning process. Another interesting feature of the proposed method is that the optimal classifier has an expansion in terms of the number of cosine kernels, instead of support vectors, leading to a remarkable speedup at run-time. As a by-product, we also generalize the kernel trick to complex-valued kernel functions. Our experiments on artificial and real-world benchmark data sets, including the USPS and the MNIST digit recognition data sets, show the usefulness of the proposed method. Keywords: kernel learning, translation invariant k
6 0.098212093 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems
7 0.092645153 15 jmlr-2010-Approximate Tree Kernels
8 0.090155892 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
9 0.081186615 1 jmlr-2010-A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification
10 0.080861531 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
11 0.075855419 53 jmlr-2010-Inducing Tree-Substitution Grammars
12 0.0723157 44 jmlr-2010-Graph Kernels
13 0.070322558 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data
14 0.065556176 29 jmlr-2010-Covariance in Unsupervised Learning of Probabilistic Grammars
15 0.06434425 90 jmlr-2010-Permutation Tests for Studying Classifier Performance
16 0.057810325 47 jmlr-2010-Hilbert Space Embeddings and Metrics on Probability Measures
17 0.056960627 83 jmlr-2010-On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation
18 0.054457702 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
19 0.054269098 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines
20 0.053956199 22 jmlr-2010-Classification Using Geometric Level Sets
topicId topicWeight
[(0, -0.256), (1, -0.018), (2, -0.081), (3, 0.164), (4, 0.188), (5, -0.022), (6, -0.159), (7, -0.104), (8, -0.04), (9, -0.079), (10, 0.092), (11, -0.069), (12, 0.023), (13, 0.088), (14, -0.033), (15, -0.088), (16, 0.106), (17, -0.138), (18, 0.154), (19, -0.118), (20, 0.145), (21, 0.124), (22, -0.024), (23, 0.074), (24, -0.003), (25, -0.023), (26, -0.014), (27, 0.098), (28, 0.25), (29, 0.028), (30, -0.043), (31, -0.019), (32, 0.057), (33, -0.011), (34, 0.077), (35, -0.014), (36, -0.051), (37, -0.125), (38, -0.018), (39, 0.025), (40, 0.071), (41, 0.071), (42, 0.041), (43, -0.03), (44, 0.079), (45, 0.003), (46, 0.121), (47, -0.046), (48, 0.021), (49, 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.95330906 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
Author: Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard, Chih-Jen Lin
Abstract: Kernel techniques have long been used in SVM to handle linearly inseparable problems by transforming data to a high dimensional space, but training and testing large data sets is often time consuming. In contrast, we can efficiently train and test much larger data sets using linear SVM without kernels. In this work, we apply fast linear-SVM methods to the explicit form of polynomially mapped data and investigate implementation issues. The approach enjoys fast training and testing, but may sometimes achieve accuracy close to that of using highly nonlinear kernels. Empirical experiments show that the proposed method is useful for certain large-scale data sets. We successfully apply the proposed method to a natural language processing (NLP) application by improving the testing accuracy under some training/testing speed requirements. Keywords: decomposition methods, low-degree polynomial mapping, kernel functions, support vector machines, dependency parsing, natural language processing
2 0.74693924 40 jmlr-2010-Fast and Scalable Local Kernel Machines
Author: Nicola Segata, Enrico Blanzieri
Abstract: A computationally efficient approach to local learning with kernel methods is presented. The Fast Local Kernel Support Vector Machine (FaLK-SVM) trains a set of local SVMs on redundant neighbourhoods in the training set and an appropriate model for each query point is selected at testing time according to a proximity strategy. Supported by a recent result by Zakai and Ritov (2009) relating consistency and localizability, our approach achieves high classification accuracies by dividing the separation function in local optimisation problems that can be handled very efficiently from the computational viewpoint. The introduction of a fast local model selection further speeds-up the learning process. Learning and complexity bounds are derived for FaLK-SVM, and the empirical evaluation of the approach (with data sets up to 3 million points) showed that it is much faster and more accurate and scalable than state-of-the-art accurate and approximated SVM solvers at least for non high-dimensional data sets. More generally, we show that locality can be an important factor to sensibly speed-up learning approaches and kernel methods, differently from other recent techniques that tend to dismiss local information in order to improve scalability. Keywords: locality, kernel methods, local learning algorithms, support vector machines, instancebased learning
3 0.697249 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems
Author: Fu Chang, Chien-Yang Guo, Xiao-Rong Lin, Chi-Jen Lu
Abstract: To handle problems created by large data sets, we propose a method that uses a decision tree to decompose a given data space and train SVMs on the decomposed regions. Although there are other means of decomposing a data space, we show that the decision tree has several merits for large-scale SVM training. First, it can classify some data points by its own means, thereby reducing the cost of SVM training for the remaining data points. Second, it is efficient in determining the parameter values that maximize the validation accuracy, which helps maintain good test accuracy. Third, the tree decomposition method can derive a generalization error bound for the classifier. For data sets whose size can be handled by current non-linear, or kernel-based, SVM training techniques, the proposed method can speed up the training by a factor of thousands, and still achieve comparable test accuracy. Keywords: binary tree, generalization error bound, margin-based theory, pattern classification, tree decomposition, support vector machine, VC theory
4 0.59771931 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
Author: Pannagadatta K. Shivaswamy, Tony Jebara
Abstract: Leading classification methods such as support vector machines (SVMs) and their counterparts achieve strong generalization performance by maximizing the margin of separation between data classes. While the maximum margin approach has achieved promising performance, this article identifies its sensitivity to affine transformations of the data and to directions with large data spread. Maximum margin solutions may be misled by the spread of data and preferentially separate classes along large spread directions. This article corrects these weaknesses by measuring margin not in the absolute sense but rather only relative to the spread of data in any projection direction. Maximum relative margin corresponds to a data-dependent regularization on the classification function while maximum absolute margin corresponds to an ℓ2 norm constraint on the classification function. Interestingly, the proposed improvements only require simple extensions to existing maximum margin formulations and preserve the computational efficiency of SVMs. Through the maximization of relative margin, surprising performance gains are achieved on real-world problems such as digit, text classification and on several other benchmark data sets. In addition, risk bounds are derived for the new formulation based on Rademacher averages. Keywords: support vector machines, kernel methods, large margin, Rademacher complexity
5 0.45791125 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
Author: Kamaledin Ghiasi-Shirazi, Reza Safabakhsh, Mostafa Shamsi
Abstract: Appropriate selection of the kernel function, which implicitly defines the feature space of an algorithm, has a crucial role in the success of kernel methods. In this paper, we consider the problem of optimizing a kernel function over the class of translation invariant kernels for the task of binary classification. The learning capacity of this class is invariant with respect to rotation and scaling of the features and it encompasses the set of radial kernels. We show that how translation invariant kernel functions can be embedded in a nested set of sub-classes and consider the kernel learning problem over one of these sub-classes. This allows the choice of an appropriate sub-class based on the problem at hand. We use the criterion proposed by Lanckriet et al. (2004) to obtain a functional formulation for the problem. It will be proven that the optimal kernel is a finite mixture of cosine functions. The kernel learning problem is then formulated as a semi-infinite programming (SIP) problem which is solved by a sequence of quadratically constrained quadratic programming (QCQP) sub-problems. Using the fact that the cosine kernel is of rank two, we propose a formulation of a QCQP sub-problem which does not require the kernel matrices to be loaded into memory, making the method applicable to large-scale problems. We also address the issue of including other classes of kernels, such as individual kernels and isotropic Gaussian kernels, in the learning process. Another interesting feature of the proposed method is that the optimal classifier has an expansion in terms of the number of cosine kernels, instead of support vectors, leading to a remarkable speedup at run-time. As a by-product, we also generalize the kernel trick to complex-valued kernel functions. Our experiments on artificial and real-world benchmark data sets, including the USPS and the MNIST digit recognition data sets, show the usefulness of the proposed method. Keywords: kernel learning, translation invariant k
6 0.41522458 1 jmlr-2010-A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification
7 0.41117072 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning
8 0.40278187 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines
9 0.35278246 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models
10 0.34481922 15 jmlr-2010-Approximate Tree Kernels
11 0.34147412 110 jmlr-2010-The SHOGUN Machine Learning Toolbox
12 0.33266971 57 jmlr-2010-Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models
13 0.32875577 53 jmlr-2010-Inducing Tree-Substitution Grammars
14 0.30593455 23 jmlr-2010-Classification with Incomplete Data Using Dirichlet Process Priors
15 0.29968768 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
16 0.27051607 22 jmlr-2010-Classification Using Geometric Level Sets
17 0.26886004 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data
18 0.26850355 48 jmlr-2010-How to Explain Individual Classification Decisions
19 0.26671112 98 jmlr-2010-Regularized Discriminant Analysis, Ridge Regression and Beyond
20 0.26412362 44 jmlr-2010-Graph Kernels
topicId topicWeight
[(1, 0.01), (8, 0.018), (10, 0.023), (21, 0.361), (22, 0.015), (24, 0.012), (32, 0.039), (36, 0.034), (37, 0.048), (46, 0.02), (75, 0.166), (79, 0.013), (81, 0.014), (85, 0.092), (96, 0.031)]
simIndex simValue paperId paperTitle
1 0.91538441 86 jmlr-2010-On the Rate of Convergence of the Bagged Nearest Neighbor Estimate
Author: Gérard Biau, Frédéric Cérou, Arnaud Guyader
Abstract: Bagging is a simple way to combine estimates in order to improve their performance. This method, suggested by Breiman in 1996, proceeds by resampling from the original data set, constructing a predictor from each subsample, and decide by combining. By bagging an n-sample, the crude nearest neighbor regression estimate is turned into a consistent weighted nearest neighbor regression estimate, which is amenable to statistical analysis. Letting the resampling size kn grows appropriately with n, it is shown that this estimate may achieve optimal rate of convergence, independently from the fact that resampling is done with or without replacement. Since the estimate with the optimal rate of convergence depends on the unknown distribution of the observations, adaptation results by data-splitting are presented. Keywords: bagging, resampling, nearest neighbor estimate, rates of convergence
2 0.87796491 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
Author: Tong Zhang
Abstract: We consider learning formulations with non-convex objective functions that often occur in practical applications. There are two approaches to this problem: • Heuristic methods such as gradient descent that only find a local minimum. A drawback of this approach is the lack of theoretical guarantee showing that the local minimum gives a good solution. • Convex relaxation such as L1 -regularization that solves the problem under some conditions. However it often leads to a sub-optimal solution in reality. This paper tries to remedy the above gap between theory and practice. In particular, we present a multi-stage convex relaxation scheme for solving problems with non-convex objective functions. For learning formulations with sparse regularization, we analyze the behavior of a specific multistage relaxation scheme. Under appropriate conditions, we show that the local solution obtained by this procedure is superior to the global solution of the standard L1 convex relaxation for learning sparse targets. Keywords: sparsity, non-convex optimization, convex relaxation, multi-stage convex relaxation
3 0.84585655 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning
Author: James Henderson, Ivan Titov
Abstract: We propose a class of Bayesian networks appropriate for structured prediction problems where the Bayesian network’s model structure is a function of the predicted output structure. These incremental sigmoid belief networks (ISBNs) make decoding possible because inference with partial output structures does not require summing over the unboundedly many compatible model structures, due to their directed edges and incrementally specified model structure. ISBNs are specifically targeted at challenging structured prediction problems such as natural language parsing, where learning the domain’s complex statistical dependencies benefits from large numbers of latent variables. While exact inference in ISBNs with large numbers of latent variables is not tractable, we propose two efficient approximations. First, we demonstrate that a previous neural network parsing model can be viewed as a coarse mean-field approximation to inference with ISBNs. We then derive a more accurate but still tractable variational approximation, which proves effective in artificial experiments. We compare the effectiveness of these models on a benchmark natural language parsing task, where they achieve accuracy competitive with the state-of-the-art. The model which is a closer approximation to an ISBN has better parsing accuracy, suggesting that ISBNs are an appropriate abstract model of natural language grammar learning. Keywords: Bayesian networks, dynamic Bayesian networks, grammar learning, natural language parsing, neural networks
same-paper 4 0.7694602 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
Author: Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard, Chih-Jen Lin
Abstract: Kernel techniques have long been used in SVM to handle linearly inseparable problems by transforming data to a high dimensional space, but training and testing large data sets is often time consuming. In contrast, we can efficiently train and test much larger data sets using linear SVM without kernels. In this work, we apply fast linear-SVM methods to the explicit form of polynomially mapped data and investigate implementation issues. The approach enjoys fast training and testing, but may sometimes achieve accuracy close to that of using highly nonlinear kernels. Empirical experiments show that the proposed method is useful for certain large-scale data sets. We successfully apply the proposed method to a natural language processing (NLP) application by improving the testing accuracy under some training/testing speed requirements. Keywords: decomposition methods, low-degree polynomial mapping, kernel functions, support vector machines, dependency parsing, natural language processing
5 0.55815637 1 jmlr-2010-A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification
Author: Guo-Xun Yuan, Kai-Wei Chang, Cho-Jui Hsieh, Chih-Jen Lin
Abstract: Large-scale linear classification is widely used in many areas. The L1-regularized form can be applied for feature selection; however, its non-differentiability causes more difficulties in training. Although various optimization methods have been proposed in recent years, these have not yet been compared suitably. In this paper, we first broadly review existing methods. Then, we discuss state-of-the-art software packages in detail and propose two efficient implementations. Extensive comparisons indicate that carefully implemented coordinate descent methods are very suitable for training large document data. Keywords: L1 regularization, linear classification, optimization methods, logistic regression, support vector machines, document classification
6 0.5543983 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
7 0.55302167 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
8 0.53709877 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
9 0.53694427 40 jmlr-2010-Fast and Scalable Local Kernel Machines
10 0.53654444 43 jmlr-2010-Generalized Power Method for Sparse Principal Component Analysis
11 0.53566647 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking
12 0.53335619 109 jmlr-2010-Stochastic Composite Likelihood
13 0.53256375 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
14 0.53239405 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
15 0.53032368 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning
16 0.53026283 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
17 0.52832389 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
18 0.52784443 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems
19 0.52774477 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
20 0.52724379 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming