jmlr jmlr2006 jmlr2006-44 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ronan Collobert, Fabian Sinz, Jason Weston, Léon Bottou
Abstract: We show how the concave-convex procedure can be applied to transductive SVMs, which traditionally require solving a combinatorial search problem. This provides for the first time a highly scalable algorithm in the nonlinear case. Detailed experiments verify the utility of our approach. Software is available at http://www.kyb.tuebingen.mpg.de/bs/people/fabee/transduction.html. Keywords: transduction, transductive SVMs, semi-supervised learning, CCCP
Reference: text
sentIndex sentText sentNum sentScore
1 NEC Laboratories America, 4 Independence Way, Princeton, NJ 08540, USA. Editor: Thorsten Joachims. Abstract: We show how the concave-convex procedure can be applied to transductive SVMs, which traditionally require solving a combinatorial search problem. [sent-6, score-0.167]
2 Keywords: transduction, transductive SVMs, semi-supervised learning, CCCP 1. [sent-15, score-0.151]
3 TSVMs, like SVMs, learn a large margin hyperplane classifier using labeled training data, but simultaneously force this hyperplane to be far away from the unlabeled data. [sent-18, score-0.415]
4 It seems clear that algorithms such as TSVMs can give considerable improvement in generalization over SVMs, if the number of labeled points is small and the number of unlabeled points is large. [sent-25, score-0.413]
5 Unfortunately, TSVM algorithms (like other semi-supervised approaches) are often unable to deal with a large number of unlabeled examples. [sent-26, score-0.32]
6 Finally, Chapelle and Zien (2005) proposed a primal method, which turned out to show improved generalization performance over the previous approaches, but still scales as (L + U)^3, where L and U are the numbers of labeled and unlabeled examples. [sent-31, score-0.397]
7 , 2005) transform the non-convex transductive problem into a convex semi-definite programming problem that scales as (L + U)^4 or worse. [sent-34, score-0.182]
8 CCCP iteratively optimizes non-convex cost functions that can be expressed as the sum of a convex function and a concave function. [sent-37, score-0.126]
9 The optimization is carried out iteratively by solving a sequence of convex problems obtained by linearly approximating the concave function in the vicinity of the solution of the previous convex problem. [sent-38, score-0.168]
10 This provides what we believe is the best known method for implementing transductive SVMs, with an empirical scaling of (L + U)^2, which involves training a sequence of typically 1-10 conventional convex SVM optimization problems. [sent-40, score-0.245]
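To make the iteration concrete, here is a minimal sketch of a generic CCCP loop in Python. The callables `solve_convex` and `grad_concave` are hypothetical placeholders for the convex-subproblem solver and the gradient of the concave part; this is not the authors' actual interface.

```python
import numpy as np

def cccp(solve_convex, grad_concave, theta0, max_iter=50, tol=1e-6):
    """Generic CCCP loop for J(theta) = J_vex(theta) + J_cav(theta).

    solve_convex(g) is assumed to return argmin_theta J_vex(theta) + g @ theta,
    i.e. the convex problem obtained by replacing J_cav with its tangent at the
    current iterate; grad_concave(theta) returns the gradient of J_cav.
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        g = grad_concave(theta)          # linearize the concave part at theta
        theta_new = solve_convex(g)      # solve the resulting convex problem
        if np.linalg.norm(theta_new - theta) < tol:   # fixed point reached
            return theta_new
        theta = theta_new
    return theta
```

In the TSVM case each convex subproblem is a conventional SVM, which is what yields the sequence of 1-10 SVM trainings mentioned above.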
11 This is a combinatorial problem, but one can approximate it (see Vapnik, 1995) as finding an SVM separating the training set under constraints which force the unlabeled examples to be as far as possible from the margin. [sent-57, score-0.39]
12 This can be written as minimizing (1/2)||w||^2 + C ∑_{i=1}^{L} ξ_i + C∗ ∑_{i=L+1}^{L+U} ξ_i subject to y_i fθ(x_i) ≥ 1 − ξ_i, i = 1, . . . , L. [sent-75, score-0.101]
13 The loss function H1 (| · |) for the unlabeled examples can be seen in Figure 1, left. [sent-82, score-0.386]
14 For C∗ > 0 we penalize unlabeled data that is inside the margin. [sent-84, score-0.336]
15 This is equivalent to using the hinge loss on the unlabeled data as well, but where we assume the label for the unlabeled example is yi = sign( fθ (xi )). [sent-85, score-0.869]
16 As shown above, it assigns a Hinge Loss H1 (·) on the labeled examples (Figure 2, center) and a “Symmetric Hinge Loss” H1 (| · |) on the unlabeled examples (Figure 1, left). [sent-87, score-0.421]
17 More recently, Chapelle and Zien (2005) proposed to handle unlabeled examples with a smooth version of this loss (Figure 1, center). [sent-88, score-0.386]
18 While we also use the Hinge Loss for labeled examples, for unlabeled examples we use a slightly more general form of the Symmetric Hinge Loss, which we allow to be “non-peaky” (Figure 1, right). [sent-89, score-0.401]
19 Given an unlabeled example x and using the notation z = fθ (x), this loss can be written as z → Rs (z) + Rs (−z) + const. [sent-90, score-0.366]
20 The s parameter controls where we clip the Ramp Loss, and as a consequence it also controls the width of the flat part of the loss (2) we use for transduction: when s = 0, this reverts to the Symmetric Hinge H1 (| · |). [sent-97, score-0.108]
21 Figure 2: The Ramp Loss function Rs (t) = min(1 − s, max(0, 1 − t)) = H1 (t) − Hs (t) (left) can be decomposed into the sum of the convex Hinge Loss (center) and a concave loss (right), where Hs (t) = max(0, s − t). [sent-114, score-0.129]
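As a quick illustration, the Hinge, Ramp, and Symmetric Ramp losses above can be written in a few lines of Python; this is a sketch of the formulas only, and the default value of s is arbitrary.

```python
import numpy as np

def hinge(t, s=1.0):
    """H_s(t) = max(0, s - t); H_1 is the usual Hinge Loss."""
    return np.maximum(0.0, s - t)

def ramp(t, s=-0.3):
    """Ramp Loss R_s(t) = min(1 - s, max(0, 1 - t)) = H_1(t) - H_s(t)."""
    return hinge(t, 1.0) - hinge(t, s)

def symmetric_ramp(z, s=-0.3):
    """Loss put on an unlabeled point with z = f_theta(x): R_s(z) + R_s(-z).

    Up to an additive constant this is the Symmetric Hinge H_1(|z|) when s = 0;
    for s < 0 the flat part around z = 0 widens (the 'non-peaky' version).
    """
    return ramp(z, s) + ramp(-z, s)
```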
22 Training a TSVM using the loss function (2) is equivalent to training an SVM using the Hinge loss H1 (·) for labeled examples, and using the Ramp loss Rs (·) for unlabeled examples, where each unlabeled example appears as two examples labeled with both possible classes. [sent-116, score-0.954]
23 More formally, after introducing y_i = 1 for i ∈ [L + 1 . . . L + U] and y_i = −1 for i ∈ [L + U + 1 . . . L + 2U] (each unlabeled point thus appearing once with each putative label), . . . [sent-117, score-0.101]
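A minimal sketch of this construction (a hypothetical helper, assuming NumPy arrays): each unlabeled example is stacked twice, once per putative label.

```python
import numpy as np

def augment_unlabeled(X_lab, y_lab, X_unlab):
    """Build the L + 2U training set used above: each unlabeled example appears
    twice, once with each putative label (hypothetical helper)."""
    X_all = np.vstack([X_lab, X_unlab, X_unlab])
    y_all = np.concatenate([
        np.asarray(y_lab, dtype=float),
        np.ones(len(X_unlab)),    # copies with y_i = +1 (indices L+1 .. L+U)
        -np.ones(len(X_unlab)),   # copies with y_i = -1 (indices L+U+1 .. L+2U)
    ])
    return X_all, y_all
```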
24 Balancing constraint. One problem with TSVM as stated above is that in high dimensions with few training examples, it is possible to classify all the unlabeled examples as belonging to only one of the classes with a very large margin, which leads to poor performance. [sent-128, score-0.374]
25 To cure this problem, one further constrains the solution by introducing a balancing constraint that ensures the unlabeled data are assigned to both classes. [sent-129, score-0.373]
26 Joachims (1999b) directly enforces that the fraction of positive and negatives assigned to the unlabeled data should be the same fraction as found in the labeled data. [sent-130, score-0.397]
27 Chapelle and Zien (2005) use a similar but slightly relaxed constraint, which we also use in this work: (1/U) ∑_{i=L+1}^{L+U} fθ(x_i) = (1/L) ∑_{i=1}^{L} y_i. [sent-131, score-0.101]
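For illustration, this relaxed constraint can be monitored as a simple residual; the helper below is a sketch, not part of the authors' code.

```python
import numpy as np

def balancing_gap(f_unlab, y_lab):
    """Residual of the relaxed balancing constraint: mean decision value over
    the unlabeled points minus the mean label over the labeled points.
    A solver would drive this quantity (approximately) to zero."""
    return np.mean(f_unlab) - np.mean(y_lab)
```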
28 Assume that a cost function J(θ) can be rewritten as the sum of a convex part Jvex (θ) and a concave part Jcav (θ). [sent-142, score-0.101]
29 To simplify the first-order approximation of the concave part in the CCCP procedure (5), we denote β_i = y_i ∂J^s_cav(θ)/∂fθ(x_i), which equals C∗ if y_i fθ(x_i) < s and 0 otherwise (11), for unlabeled examples (that is, i ≥ L + 1). [sent-159, score-0.594]
30 The concave part Jcav does not depend on the labeled examples (i ≤ L), so we obviously have β_i = 0 for all i ≤ L. [sent-160, score-0.133]
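A sketch of the update implied by (11), assuming 0-based indexing with the labeled points stored first; the function name is illustrative only.

```python
import numpy as np

def compute_beta(decision_values, y, L, C_star, s):
    """beta_i as in (11): C* for points with y_i * f_theta(x_i) < s, 0 otherwise;
    the first L (labeled) entries always get beta_i = 0."""
    beta = np.where(y * decision_values < s, C_star, 0.0)
    beta[:L] = 0.0   # the concave part does not involve the labeled examples
    return beta
```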
31 Previous Work: SVMLight-TSVM. Like our work, the heuristic optimization algorithm implemented in SVMLight (Joachims, 1999b) solves successive SVM optimization problems, but on L + U instead of L + 2U data points. [sent-174, score-0.115]
32 It improves the objective function by iteratively switching the labels of two unlabeled points xi and x j with ξi + ξ j > 2. [sent-175, score-0.399]
33 The convergence proof of the inner loop relies on the fact that there is only a finite number 2U of labelings of U unlabeled points, even though it is unlikely that all of them are examined. [sent-177, score-0.32]
34 However, since the heuristic only swaps the labels of two unlabeled examples at each step in order to enforce the balancing constraint, it might need many iterations to reach a minimum, which makes it intractable for big data set sizes in practice (cf. [sent-178, score-0.434]
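For concreteness, a rough sketch of the pair-switching rule described above; SVMLight's actual implementation and bookkeeping differ.

```python
import numpy as np

def find_swap_pair(y_unlab, slack):
    """Return an (i, j) pair of unlabeled points with opposite current labels
    whose slacks satisfy slack_i + slack_j > 2, or None if no such pair exists."""
    pos = np.flatnonzero(y_unlab > 0)
    neg = np.flatnonzero(y_unlab < 0)
    for i in pos:
        for j in neg:
            if slack[i] + slack[j] > 2.0:
                return int(i), int(j)
    return None
```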
35 Following the formulation of transductive SVMs found in Bennett and Demiriz (1998), the authors consider transductive linear SVMs with a 1-norm regularizer, which allow them to decompose the corresponding loss function as a sum of a linear function and a concave function. [sent-205, score-0.4]
36 Bennett proposed the following formulation, which is similar to (1): minimize ||w||_1 + C ∑_{i=1}^{L} ξ_i + C∗ ∑_{i=L+1}^{L+U} min(ξ_i, ξ∗_i) subject to y_i fθ(x_i) ≥ 1 − ξ_i, i = 1, . . . , L. [sent-206, score-0.101]
37 Note also that the algorithm presented in their paper did not implement a balancing constraint for the labeling of the unlabeled examples as in (4). [sent-219, score-0.377]
38 Our transduction algorithm is nonlinear and, by using kernels and solving the optimization in the dual, allows for large-scale training with high dimensionality and large numbers of examples. [sent-220, score-0.139]
39 The uspst data set is the test part of the USPS handwritten digit data. [sent-265, score-0.101]
40 All data sets are split into ten parts, each having a small number of labeled examples, with the remaining examples used as unlabeled data. [sent-266, score-0.417]
41 We use the heuristic UC∗ = LC because it decreases C∗ when the number of unlabeled examples increases. [sent-274, score-0.377]
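The heuristic itself is a one-liner; the function below simply illustrates the scaling it implies (the name is hypothetical).

```python
def c_star_heuristic(C, L, U):
    """U * C_star = L * C, i.e. C_star = C * L / U: the total weight on the
    unlabeled part of the objective stays comparable to the labeled part."""
    return C * L / U
```

For example, c_star_heuristic(C=10, L=50, U=2000) gives C∗ = 0.25, so C∗ shrinks as more unlabeled examples are added.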
42 Figure 3 shows the training time on g50c and text for the three methods as we vary the number of unlabeled examples. [sent-290, score-0.404]
43 For each method we report the training times for the hyperparameters that give optimal performance as measured on the test set on the first split of the data (we use CCCP-TSVM with s = 0 and UC∗ = LC in these experiments). [sent-291, score-0.122]
44 Using all 2000 unlabeled examples on Text, CCCP-TSVMs are approximately 133 times faster than SVMLight-TSVM and 50 times faster than ∇TSVM. [sent-292, score-0.336]
45 For the Text data set, using 2000 unlabeled examples CCCP-TSVMs are 133x faster than SVMLight-TSVMs, and 50x faster than ∇TSVMs. [sent-307, score-0.356]
46 Figure 4: Value of the objective function and test error during the CCCP iterations of training TSVM on two data sets (single trial), g50c (left) and text (right). [sent-314, score-0.128]
47 We expect these differences to increase as the number of unlabeled examples increases further. [sent-316, score-0.34]
48 Effect of the parameter C∗. As mentioned before, both SVMLight-TSVM and ∇TSVM use an annealing heuristic for the hyperparameter C∗. [sent-364, score-0.118]
49 Comparing the heuristics C∗ = C and UC∗ = LC. Table 3 compares the C∗ = C and UC∗ = LC heuristics on the small-scale data sets. [sent-368, score-0.107]
50 Although the heuristic C∗ = C gives reasonable results for small amounts of unlabeled data, we prefer the heuristic UC∗ = LC. [sent-370, score-0.402]
51 When the number of unlabeled examples U becomes large, setting C∗ = C will mean the third term in the objective function (1) will dominate, resulting in almost no attention being paid to minimizing the training error. [sent-371, score-0.374]
52 We also conducted an experiment to compare these two heuristics for larger unlabeled data sizes U. [sent-373, score-0.371]
53 We took the same uspst data set (that is, the test part of the USPS data set) and we increased the number of unlabeled examples by adding up to 6000 additional unlabeled examples taken from the original USPS training set. [sent-374, score-0.831]
54 Figure 6 reports the best test error for both heuristics over possible choices of γ and C, taking the mean of the same 10 training splits with 50 labeled examples as before. [sent-375, score-0.2]
55 The heuristic UC∗ = LC (that is, C∗ = LC/U) maintains the balance between unlabeled points U and labeled points L as U and L change, and is close to the best possible choice of C∗. [sent-402, score-0.454]
56 We report the best test error for both heuristics over possible choices of γ and C, taking the mean of the same 10 training splits with 50 labeled examples as before. [sent-407, score-0.2]
57 As the number of unlabeled examples increases, the C∗ = C heuristic gives too much weight to the unlabeled data, resulting in no improvement in performance. [sent-408, score-0.701]
58 Intuitively, the UC∗ = LC heuristic balances the weight of the unlabeled and labeled data and empirically performs better. [sent-409, score-0.438]
59 Training our algorithm with a tuned value of s appears to give slightly improved results over using the Symmetric Hinge loss (s = 0, see Figure 1, left), especially on the text data set, as can be seen in Tables 2 and 3. [sent-423, score-0.112]
60 Furthermore, Figure 7 highlights the importance of the parameter s of the loss function (2) by showing the best test error over different choices of s for two data sets, text and g50c. [sent-424, score-0.162]
61 That is, the flat part of the loss far inside the margin prevents our algorithm from making erroneous early decisions regarding the labels of the unlabeled data that may be hard to undo later in the optimization. [sent-427, score-0.382]
62 Figure 7: Effect of the parameter s of the Symmetric Ramp loss (see Figure 1 and equation (2)) on the text data set (left) and the g50c data set (right). [sent-443, score-0.128]
63 Here, the authors are referring to the way SVMLight TSVM takes a discrete rather than continuous approach to assigning labels to unlabeled data. [sent-448, score-0.32]
64 However, we think that the smoothed loss function of ∇TSVM may help it to outperform the Symmetric Hinge loss of SVMLight TSVM, making it similar to the clipped loss when we use s < 0. [sent-449, score-0.156]
65 This may be sub-optimal: if we are unlucky enough that all unlabeled points lie in this region, we perform no updates at all. [sent-453, score-0.336]
66 Table 4: Comparing CCCP-TSVMs with SVMs on the RCV1 problem for different numbers of labeled and unlabeled examples. [sent-517, score-0.381]
67 This was done by training a TSVM using the validation set as the unlabeled data. [sent-523, score-0.373]
68 We then varied the number of unlabeled examples U, and reported the test error for each choice of U. [sent-525, score-0.368]
69 58% test error, when using 10500 unlabeled points, and for 100 training points was 10. [sent-529, score-0.398]
70 61% for the latter, this shows that unlabeled data can improve the results on this problem. [sent-533, score-0.336]
71 Figure 8 shows the training time of CCCP optimization as a function of the number of unlabeled examples. [sent-537, score-0.383]
72 On a 64-bit Opteron processor, the optimization time for 12,500 unlabeled examples was approximately 18 minutes using the 1,000 training examples and 69 minutes using 100 training examples. [sent-538, score-0.457]
73 Although the worst case complexity of SVMs is cubic and the optimization time seems to be dependent on the ratio of the number of labeled to unlabeled examples, the training times show a quadratic trend. [sent-539, score-0.444]
74 Table 5: Comparing CCCP-TSVMs with SVMs on the MNIST problem for different numbers of labeled and unlabeled examples. [sent-566, score-0.381]
75 We subsampled the training set for labeled points, and used the test set for unlabeled examples (or the test set plus remainder of the training set when using more than 10,000 unlabeled examples). [sent-571, score-0.845]
76 We then performed model selection using 2000 unlabeled examples to find the best choices of C∗ and s using the validation set. [sent-575, score-0.381]
77 Figure 9: Optimization time for the MNIST data set as a function of the number of unlabeled data (CCCP-TSVM timings in hours, with a quadratic fit). [sent-576, score-0.336]
78 The algorithm was trained on 1,000 labeled examples and up to 60,000 unlabeled examples. [sent-577, score-0.401]
79 When using more unlabeled data, we only reperformed model selection on C∗, as it appeared that this parameter was the most sensitive to changes in the unlabeled set, and kept the other parameters fixed. [sent-579, score-0.64]
80 For the larger labeled set we took 2000, 5000, 10000, 20000, 40000 and 60000 unlabeled examples. [sent-580, score-0.381]
81 The results show an improvement of CCCP-TSVMs over SVMs, which increases steadily as the number of unlabeled examples increases. [sent-583, score-0.34]
82 Most experiments in semi-supervised learning only use a few labeled examples and do not use as many unlabeled examples as described here. [sent-584, score-0.421]
83 Discussion and Conclusions. TSVMs are not the only means of using unlabeled data to improve generalization performance on classification tasks. [sent-587, score-0.336]
84 In the following we discuss some competing algorithms for utilizing unlabeled data, and also discuss the differences between the transductive and semi-supervised learning frameworks. [sent-588, score-0.471]
85 Cluster Kernels and Manifold-Learning. Transductive SVMs are not the only method of leveraging unlabeled data in a supervised learning task. [sent-591, score-0.355]
86 That is, one is given a set of unlabeled data and uses it to improve an inductive classifier's generalization on an unseen test set. [sent-613, score-0.348]
87 Transductive school. Vapnik (1982) describes transduction as a mathematical setup for describing learning algorithms that benefit from the prior knowledge of the unlabeled test patterns. [sent-614, score-0.419]
88 As a consequence, transductive bounds are purely derived from combinatorial arguments (Vapnik, 1982) and are more easily made data-dependent (Bottou et al. [sent-618, score-0.167]
89 Experiments. The following experiments attempt to determine whether the benefits of TSVMs are solely caused by the prior knowledge represented by the distribution of the unlabeled data. [sent-622, score-0.32]
90 If this is the case, the accuracy should not depend on the presence of the actual test patterns in the unlabeled data. [sent-623, score-0.348]
91 The following experiments consider three distinct subsets: a small labeled training set and two equally sized sets of unlabeled examples. [sent-624, score-0.415]
92 On the other hand, we run CCCP-TSVM using either the second or the third set as unlabeled data. [sent-626, score-0.32]
93 Table 6 shows that transductive TSVMs perform slightly better than semi-supervised TSVMs on these data sets. [sent-634, score-0.167]
94 On the other hand, when the test and training data are not identically distributed, we believe the concept of transduction could be particularly worthwhile. [sent-637, score-0.133]
95 Table 6 columns: SVM, semi-supervised TSVM, transductive TSVM (rows include Text). [sent-638, score-0.151]
96 (2005) to give a fast online transductive learning procedure. [sent-651, score-0.151]
97 We rewrite the problem here for convenience: minimizing J^s(θ) = (1/2)||w||^2 + C ∑_{i=1}^{L} H1(y_i fθ(x_i)) + C∗ ∑_{i=L+1}^{L+2U} Rs(y_i fθ(x_i)) (12) under the constraint (1/U) ∑_{i=L+1}^{L+U} fθ(x_i) = (1/L) ∑_{i=1}^{L} y_i (13). [sent-665, score-0.101]
98 Assume that a cost function J(θ) can be rewritten as the sum of a convex part Jvex (θ) and a concave part Jcav (θ). [sent-666, score-0.101]
99 Setting the derivatives to zero gives us w = ∑_{i=0}^{L+2U} y_i (α_i − β_i) Φ(x_i) (22) and ∑_{i=0}^{L+2U} y_i (α_i − β_i) = 0 (23), and also C − α_i − ν_i = 0 ∀ 1 ≤ i ≤ L, C∗ − α_i − ν_i = 0 ∀ L + 1 ≤ i ≤ L + 2U. [sent-678, score-0.202]
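For illustration, (22) corresponds to the usual kernel expansion of the decision function. The sketch below assumes a generic `kernel` callable and a bias b obtained separately from the KKT conditions; all names are illustrative, not the authors' code.

```python
import numpy as np

def decision_function(x, X_train, y, alpha, beta, b, kernel):
    """f_theta(x) = sum_i y_i (alpha_i - beta_i) k(x_i, x) + b, following (22).

    `kernel(a, b)` is any Mercer kernel, e.g. an RBF."""
    k = np.array([kernel(xi, x) for xi in X_train])
    return float(np.dot(y * (alpha - beta), k) + b)
```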
100 Beyond the point cloud: from transductive to semisupervised learning. [sent-900, score-0.151]
wordName wordTfidf (topN-words)
[('tsvm', 0.641), ('cccp', 0.331), ('unlabeled', 0.32), ('jcav', 0.189), ('lc', 0.161), ('tsvms', 0.16), ('transductive', 0.151), ('zien', 0.128), ('eston', 0.123), ('inz', 0.123), ('jvex', 0.123), ('ollobert', 0.123), ('ottou', 0.123), ('ransductive', 0.113), ('yi', 0.101), ('uc', 0.098), ('chapelle', 0.094), ('ramp', 0.088), ('arge', 0.086), ('cale', 0.086), ('svm', 0.08), ('hinge', 0.066), ('svmlight', 0.062), ('labeled', 0.061), ('annealing', 0.059), ('rs', 0.057), ('uspst', 0.057), ('transduction', 0.055), ('svms', 0.053), ('concave', 0.052), ('text', 0.05), ('hs', 0.049), ('loss', 0.046), ('secs', 0.046), ('hyperparameters', 0.044), ('heuristic', 0.041), ('xi', 0.038), ('balancing', 0.037), ('heuristics', 0.035), ('xl', 0.034), ('training', 0.034), ('yuille', 0.032), ('convex', 0.031), ('weston', 0.031), ('optimization', 0.029), ('symmetric', 0.029), ('szummer', 0.029), ('sinz', 0.028), ('test', 0.028), ('cluster', 0.027), ('collobert', 0.026), ('yl', 0.026), ('bottou', 0.025), ('mnist', 0.025), ('iteratively', 0.025), ('clip', 0.024), ('america', 0.024), ('nec', 0.024), ('ronan', 0.024), ('joachims', 0.024), ('kernel', 0.023), ('choices', 0.022), ('keerthi', 0.022), ('rangarajan', 0.022), ('fung', 0.021), ('scale', 0.021), ('examples', 0.02), ('shen', 0.02), ('usps', 0.02), ('controls', 0.019), ('validation', 0.019), ('descent', 0.019), ('derbeko', 0.019), ('leveraging', 0.019), ('thi', 0.019), ('vapnik', 0.018), ('nj', 0.018), ('hyperparameter', 0.018), ('smoothed', 0.018), ('demiriz', 0.018), ('laboratories', 0.018), ('cost', 0.018), ('lagrangian', 0.017), ('minimization', 0.016), ('combinatorial', 0.016), ('regularizer', 0.016), ('data', 0.016), ('points', 0.016), ('disappears', 0.016), ('fabian', 0.016), ('krause', 0.016), ('peaked', 0.016), ('sindhwani', 0.016), ('setup', 0.016), ('primal', 0.016), ('princeton', 0.015), ('losses', 0.015), ('gamma', 0.015), ('zhou', 0.015), ('tuebingen', 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 44 jmlr-2006-Large Scale Transductive SVMs
Author: Ronan Collobert, Fabian Sinz, Jason Weston, Léon Bottou
Abstract: We show how the concave-convex procedure can be applied to transductive SVMs, which traditionally require solving a combinatorial search problem. This provides for the first time a highly scalable algorithm in the nonlinear case. Detailed experiments verify the utility of our approach. Software is available at http://www.kyb.tuebingen.mpg.de/bs/people/fabee/transduction. html. Keywords: transduction, transductive SVMs, semi-supervised learning, CCCP
2 0.31287616 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples
Author: Mikhail Belkin, Partha Niyogi, Vikas Sindhwani
Abstract: We propose a family of learning algorithms based on a new form of regularization that allows us to exploit the geometry of the marginal distribution. We focus on a semi-supervised framework that incorporates labeled and unlabeled data in a general-purpose learner. Some transductive graph learning algorithms and standard methods including support vector machines and regularized least squares can be obtained as special cases. We use properties of reproducing kernel Hilbert spaces to prove new Representer theorems that provide theoretical basis for the algorithms. As a result (in contrast to purely graph-based approaches) we obtain a natural out-of-sample extension to novel examples and so are able to handle both transductive and truly semi-supervised settings. We present experimental evidence suggesting that our semi-supervised algorithms are able to use unlabeled data effectively. Finally we have a brief discussion of unsupervised and fully supervised learning within our general framework. Keywords: semi-supervised learning, graph transduction, regularization, kernel methods, manifold learning, spectral graph theory, unlabeled data, support vector machines
Author: S. Sathiya Keerthi, Olivier Chapelle, Dennis DeCoste
Abstract: Support vector machines (SVMs), though accurate, are not preferred in applications requiring great classification speed, due to the number of support vectors being large. To overcome this problem we devise a primal method with the following properties: (1) it decouples the idea of basis functions from the concept of support vectors; (2) it greedily finds a set of kernel basis functions of a specified maximum size (dmax ) to approximate the SVM primal cost function well; (3) it is 2 efficient and roughly scales as O(ndmax ) where n is the number of training examples; and, (4) the number of basis functions it requires to achieve an accuracy close to the SVM accuracy is usually far less than the number of SVM support vectors. Keywords: SVMs, classification, sparse design
4 0.064831376 43 jmlr-2006-Large Scale Multiple Kernel Learning (Special Topic on Machine Learning and Optimization)
Author: Sören Sonnenburg, Gunnar Rätsch, Christin Schäfer, Bernhard Schölkopf
Abstract: While classical kernel-based learning algorithms are based on a single kernel, in practice it is often desirable to use multiple kernels. Lanckriet et al. (2004) considered conic combinations of kernel matrices for classification, leading to a convex quadratically constrained quadratic program. We show that it can be rewritten as a semi-infinite linear program that can be efficiently solved by recycling the standard SVM implementations. Moreover, we generalize the formulation and our method to a larger class of problems, including regression and one-class classification. Experimental results show that the proposed algorithm works for hundred thousands of examples or hundreds of kernels to be combined, and helps for automatic model selection, improving the interpretability of the learning result. In a second part we discuss general speed up mechanism for SVMs, especially when used with sparse feature maps as appear for string kernels, allowing us to train a string kernel SVM on a 10 million real-world splice data set from computational biology. We integrated multiple kernel learning in our machine learning toolbox SHOGUN for which the source code is publicly available at http://www.fml.tuebingen.mpg.de/raetsch/projects/shogun. Keywords: multiple kernel learning, string kernels, large scale optimization, support vector machines, support vector regression, column generation, semi-infinite linear programming
Author: Luca Zanni, Thomas Serafini, Gaetano Zanghirati
Abstract: Parallel software for solving the quadratic program arising in training support vector machines for classification problems is introduced. The software implements an iterative decomposition technique and exploits both the storage and the computing resources available on multiprocessor systems, by distributing the heaviest computational tasks of each decomposition iteration. Based on a wide range of recent theoretical advances, relevant decomposition issues, such as the quadratic subproblem solution, the gradient updating, the working set selection, are systematically described and their careful combination to get an effective parallel tool is discussed. A comparison with state-ofthe-art packages on benchmark problems demonstrates the good accuracy and the remarkable time saving achieved by the proposed software. Furthermore, challenging experiments on real-world data sets with millions training samples highlight how the software makes large scale standard nonlinear support vector machines effectively tractable on common multiprocessor systems. This feature is not shown by any of the available codes. Keywords: support vector machines, large scale quadratic programs, decomposition techniques, gradient projection methods, parallel computation
6 0.056376047 91 jmlr-2006-The Interplay of Optimization and Machine Learning Research (Special Topic on Machine Learning and Optimization)
8 0.051739488 12 jmlr-2006-Active Learning with Feedback on Features and Instances
10 0.042179119 16 jmlr-2006-Bounds for Linear Multi-Task Learning
12 0.039609041 81 jmlr-2006-Some Discriminant-Based PAC Algorithms
14 0.036970589 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
15 0.036335133 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms
17 0.034204349 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization
18 0.032838881 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space
19 0.030966878 61 jmlr-2006-Maximum-Gain Working Set Selection for SVMs (Special Topic on Machine Learning and Optimization)
20 0.030627282 11 jmlr-2006-Active Learning in Approximately Linear Regression Based on Conditional Expectation of Generalization Error
topicId topicWeight
[(0, 0.193), (1, -0.165), (2, 0.183), (3, -0.023), (4, 0.207), (5, -0.062), (6, -0.016), (7, -0.058), (8, -0.336), (9, -0.099), (10, 0.004), (11, 0.336), (12, -0.272), (13, -0.232), (14, -0.113), (15, -0.019), (16, -0.135), (17, -0.039), (18, 0.003), (19, 0.142), (20, 0.017), (21, 0.209), (22, 0.049), (23, -0.064), (24, -0.017), (25, -0.085), (26, -0.137), (27, 0.024), (28, 0.012), (29, -0.07), (30, -0.037), (31, 0.003), (32, -0.006), (33, -0.047), (34, -0.029), (35, 0.033), (36, 0.008), (37, -0.062), (38, -0.012), (39, 0.005), (40, -0.002), (41, 0.011), (42, 0.074), (43, 0.022), (44, -0.061), (45, 0.074), (46, 0.007), (47, -0.004), (48, 0.033), (49, 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.92848271 44 jmlr-2006-Large Scale Transductive SVMs
Author: Ronan Collobert, Fabian Sinz, Jason Weston, Léon Bottou
Abstract: We show how the concave-convex procedure can be applied to transductive SVMs, which traditionally require solving a combinatorial search problem. This provides for the first time a highly scalable algorithm in the nonlinear case. Detailed experiments verify the utility of our approach. Software is available at http://www.kyb.tuebingen.mpg.de/bs/people/fabee/transduction. html. Keywords: transduction, transductive SVMs, semi-supervised learning, CCCP
2 0.81707013 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples
Author: Mikhail Belkin, Partha Niyogi, Vikas Sindhwani
Abstract: We propose a family of learning algorithms based on a new form of regularization that allows us to exploit the geometry of the marginal distribution. We focus on a semi-supervised framework that incorporates labeled and unlabeled data in a general-purpose learner. Some transductive graph learning algorithms and standard methods including support vector machines and regularized least squares can be obtained as special cases. We use properties of reproducing kernel Hilbert spaces to prove new Representer theorems that provide theoretical basis for the algorithms. As a result (in contrast to purely graph-based approaches) we obtain a natural out-of-sample extension to novel examples and so are able to handle both transductive and truly semi-supervised settings. We present experimental evidence suggesting that our semi-supervised algorithms are able to use unlabeled data effectively. Finally we have a brief discussion of unsupervised and fully supervised learning within our general framework. Keywords: semi-supervised learning, graph transduction, regularization, kernel methods, manifold learning, spectral graph theory, unlabeled data, support vector machines
Author: S. Sathiya Keerthi, Olivier Chapelle, Dennis DeCoste
Abstract: Support vector machines (SVMs), though accurate, are not preferred in applications requiring great classification speed, due to the number of support vectors being large. To overcome this problem we devise a primal method with the following properties: (1) it decouples the idea of basis functions from the concept of support vectors; (2) it greedily finds a set of kernel basis functions of a specified maximum size (dmax ) to approximate the SVM primal cost function well; (3) it is 2 efficient and roughly scales as O(ndmax ) where n is the number of training examples; and, (4) the number of basis functions it requires to achieve an accuracy close to the SVM accuracy is usually far less than the number of SVM support vectors. Keywords: SVMs, classification, sparse design
Author: Luca Zanni, Thomas Serafini, Gaetano Zanghirati
Abstract: Parallel software for solving the quadratic program arising in training support vector machines for classification problems is introduced. The software implements an iterative decomposition technique and exploits both the storage and the computing resources available on multiprocessor systems, by distributing the heaviest computational tasks of each decomposition iteration. Based on a wide range of recent theoretical advances, relevant decomposition issues, such as the quadratic subproblem solution, the gradient updating, the working set selection, are systematically described and their careful combination to get an effective parallel tool is discussed. A comparison with state-ofthe-art packages on benchmark problems demonstrates the good accuracy and the remarkable time saving achieved by the proposed software. Furthermore, challenging experiments on real-world data sets with millions training samples highlight how the software makes large scale standard nonlinear support vector machines effectively tractable on common multiprocessor systems. This feature is not shown by any of the available codes. Keywords: support vector machines, large scale quadratic programs, decomposition techniques, gradient projection methods, parallel computation
5 0.23255612 43 jmlr-2006-Large Scale Multiple Kernel Learning (Special Topic on Machine Learning and Optimization)
Author: Sören Sonnenburg, Gunnar Rätsch, Christin Schäfer, Bernhard Schölkopf
Abstract: While classical kernel-based learning algorithms are based on a single kernel, in practice it is often desirable to use multiple kernels. Lanckriet et al. (2004) considered conic combinations of kernel matrices for classification, leading to a convex quadratically constrained quadratic program. We show that it can be rewritten as a semi-infinite linear program that can be efficiently solved by recycling the standard SVM implementations. Moreover, we generalize the formulation and our method to a larger class of problems, including regression and one-class classification. Experimental results show that the proposed algorithm works for hundred thousands of examples or hundreds of kernels to be combined, and helps for automatic model selection, improving the interpretability of the learning result. In a second part we discuss general speed up mechanism for SVMs, especially when used with sparse feature maps as appear for string kernels, allowing us to train a string kernel SVM on a 10 million real-world splice data set from computational biology. We integrated multiple kernel learning in our machine learning toolbox SHOGUN for which the source code is publicly available at http://www.fml.tuebingen.mpg.de/raetsch/projects/shogun. Keywords: multiple kernel learning, string kernels, large scale optimization, support vector machines, support vector regression, column generation, semi-infinite linear programming
6 0.20100245 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization
9 0.18064453 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms
10 0.16989012 91 jmlr-2006-The Interplay of Optimization and Machine Learning Research (Special Topic on Machine Learning and Optimization)
11 0.16962275 12 jmlr-2006-Active Learning with Feedback on Features and Instances
12 0.15758912 11 jmlr-2006-Active Learning in Approximately Linear Regression Based on Conditional Expectation of Generalization Error
13 0.15645519 16 jmlr-2006-Bounds for Linear Multi-Task Learning
14 0.14989333 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space
16 0.14957683 65 jmlr-2006-Nonparametric Quantile Estimation
17 0.14324695 81 jmlr-2006-Some Discriminant-Based PAC Algorithms
18 0.13897513 61 jmlr-2006-Maximum-Gain Working Set Selection for SVMs (Special Topic on Machine Learning and Optimization)
20 0.1314813 85 jmlr-2006-Statistical Comparisons of Classifiers over Multiple Data Sets
topicId topicWeight
[(8, 0.02), (17, 0.363), (35, 0.015), (36, 0.058), (45, 0.011), (50, 0.043), (63, 0.061), (76, 0.019), (78, 0.036), (81, 0.032), (84, 0.033), (90, 0.03), (91, 0.058), (96, 0.102)]
simIndex simValue paperId paperTitle
same-paper 1 0.67410803 44 jmlr-2006-Large Scale Transductive SVMs
Author: Ronan Collobert, Fabian Sinz, Jason Weston, Léon Bottou
Abstract: We show how the concave-convex procedure can be applied to transductive SVMs, which traditionally require solving a combinatorial search problem. This provides for the first time a highly scalable algorithm in the nonlinear case. Detailed experiments verify the utility of our approach. Software is available at http://www.kyb.tuebingen.mpg.de/bs/people/fabee/transduction. html. Keywords: transduction, transductive SVMs, semi-supervised learning, CCCP
Author: Juho Rousu, Craig Saunders, Sandor Szedmak, John Shawe-Taylor
Abstract: We present a kernel-based algorithm for hierarchical text classification where the documents are allowed to belong to more than one category at a time. The classification model is a variant of the Maximum Margin Markov Network framework, where the classification hierarchy is represented as a Markov tree equipped with an exponential family defined on the edges. We present an efficient optimization algorithm based on incremental conditional gradient ascent in single-example subspaces spanned by the marginal dual variables. The optimization is facilitated with a dynamic programming based algorithm that computes best update directions in the feasible set. Experiments show that the algorithm can feasibly optimize training sets of thousands of examples and classification hierarchies consisting of hundreds of nodes. Training of the full hierarchical model is as efficient as training independent SVM-light classifiers for each node. The algorithm’s predictive accuracy was found to be competitive with other recently introduced hierarchical multicategory or multilabel classification learning algorithms. Keywords: kernel methods, hierarchical classification, text categorization, convex optimization, structured outputs
3 0.38256818 43 jmlr-2006-Large Scale Multiple Kernel Learning (Special Topic on Machine Learning and Optimization)
Author: Sören Sonnenburg, Gunnar Rätsch, Christin Schäfer, Bernhard Schölkopf
Abstract: While classical kernel-based learning algorithms are based on a single kernel, in practice it is often desirable to use multiple kernels. Lanckriet et al. (2004) considered conic combinations of kernel matrices for classification, leading to a convex quadratically constrained quadratic program. We show that it can be rewritten as a semi-infinite linear program that can be efficiently solved by recycling the standard SVM implementations. Moreover, we generalize the formulation and our method to a larger class of problems, including regression and one-class classification. Experimental results show that the proposed algorithm works for hundred thousands of examples or hundreds of kernels to be combined, and helps for automatic model selection, improving the interpretability of the learning result. In a second part we discuss general speed up mechanism for SVMs, especially when used with sparse feature maps as appear for string kernels, allowing us to train a string kernel SVM on a 10 million real-world splice data set from computational biology. We integrated multiple kernel learning in our machine learning toolbox SHOGUN for which the source code is publicly available at http://www.fml.tuebingen.mpg.de/raetsch/projects/shogun. Keywords: multiple kernel learning, string kernels, large scale optimization, support vector machines, support vector regression, column generation, semi-infinite linear programming
Author: Matthias Heiler, Christoph Schnörr
Abstract: We exploit the biconvex nature of the Euclidean non-negative matrix factorization (NMF) optimization problem to derive optimization schemes based on sequential quadratic and second order cone programming. We show that for ordinary NMF, our approach performs as well as existing stateof-the-art algorithms, while for sparsity-constrained NMF, as recently proposed by P. O. Hoyer in JMLR 5 (2004), it outperforms previous methods. In addition, we show how to extend NMF learning within the same optimization framework in order to make use of class membership information in supervised learning problems. Keywords: non-negative matrix factorization, second-order cone programming, sequential convex optimization, reverse-convex programming, sparsity
5 0.37881213 61 jmlr-2006-Maximum-Gain Working Set Selection for SVMs (Special Topic on Machine Learning and Optimization)
Author: Tobias Glasmachers, Christian Igel
Abstract: Support vector machines are trained by solving constrained quadratic optimization problems. This is usually done with an iterative decomposition algorithm operating on a small working set of variables in every iteration. The training time strongly depends on the selection of these variables. We propose the maximum-gain working set selection algorithm for large scale quadratic programming. It is based on the idea to greedily maximize the progress in each single iteration. The algorithm takes second order information from cached kernel matrix entries into account. We prove the convergence to an optimal solution of a variant termed hybrid maximum-gain working set selection. This method is empirically compared to the prominent most violating pair selection and the latest algorithm using second order information. For large training sets our new selection scheme is significantly faster. Keywords: working set selection, sequential minimal optimization, quadratic programming, support vector machines, large scale optimization
6 0.37678909 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation
7 0.37282115 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms
9 0.36603671 70 jmlr-2006-Online Passive-Aggressive Algorithms
11 0.36427593 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples
13 0.36264569 94 jmlr-2006-Using Machine Learning to Guide Architecture Simulation
16 0.35474694 53 jmlr-2006-Learning a Hidden Hypergraph
17 0.35148558 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization
19 0.34910277 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix
20 0.34907183 13 jmlr-2006-Adaptive Prototype Learning Algorithms: Theoretical and Experimental Studies