jmlr jmlr2011 jmlr2011-51 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Stefano Melacci, Mikhail Belkin
Abstract: In the last few years, due to the growing ubiquity of unlabeled data, much effort has been spent by the machine learning community to develop better understanding and improve the quality of classifiers exploiting unlabeled data. Following the manifold regularization approach, Laplacian Support Vector Machines (LapSVMs) have shown state-of-the-art performance in semi-supervised classification. In this paper we present two strategies to solve the primal LapSVM problem, in order to overcome some issues of the original dual formulation. In particular, training a LapSVM in the primal can be efficiently performed with preconditioned conjugate gradient. We speed up training by using an early stopping strategy based on the prediction on unlabeled data or, if available, on labeled validation examples. This allows the algorithm to quickly compute approximate solutions with roughly the same classification accuracy as the optimal ones, considerably reducing the training time. The computational complexity of the training algorithm is reduced from O(n³) to O(kn²), where n is the combined number of labeled and unlabeled examples and k is empirically evaluated to be significantly smaller than n. Due to its simplicity, training LapSVM in the primal can be the starting point for additional enhancements of the original LapSVM formulation, such as those for dealing with large data sets. We present an extensive experimental evaluation on real world data showing the benefits of the proposed approach. Keywords: Laplacian support vector machines, manifold regularization, semi-supervised learning, classification, optimization
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper we present two strategies to solve the primal LapSVM problem, in order to overcome some issues of the original dual formulation. [sent-7, score-0.189]
2 In particular, training a LapSVM in the primal can be efficiently performed with preconditioned conjugate gradient. [sent-8, score-0.29]
3 We speed up training by using an early stopping strategy based on the prediction on unlabeled data or, if available, on labeled validation examples. [sent-9, score-0.26]
4 The computational complexity of the training algorithm is reduced from O(n³) to O(kn²), where n is the combined number of labeled and unlabeled examples and k is empirically evaluated to be significantly smaller than n. [sent-11, score-0.13]
5 Due to its simplicity, training LapSVM in the primal can be the starting point for additional enhancements of the original LapSVM formulation, such as those for dealing with large data sets. [sent-12, score-0.166]
6 Introduction. In semi-supervised learning one estimates a target classification/regression function from a few labeled examples together with a large collection of unlabeled data. [sent-15, score-0.093]
7 This dual problem is defined on a number of dual variables equal to l, the number of labeled points. [sent-36, score-0.161]
8 If the total number of labeled and unlabeled points is n, the relationship between the l variables and the final n coefficients is given by a linear system of n equations and variables. [sent-37, score-0.105]
9 Motivated by the recent interest in solving the SVM problem in the primal (Keerthi and DeCoste, 2005; Joachims, 2006; Chapelle, 2007; Shalev-Shwartz et al., 2007), [sent-39, score-0.119]
10 we present a solution to the primal LapSVM problem that can significantly reduce training times and overcome some issues of the original training algorithm. [sent-40, score-0.22]
11 We propose two methods for solving the LapSVM problem in the primal form (not limited to the linear case), following the ideas presented by Chapelle (2007) for SVMs and pointing out some important differences resulting from an additional regularization term. [sent-42, score-0.119]
12 Interestingly, it turns out that the advantages of Newton's method for the SVM problem are lost in LapSVM due to the intrinsic norm regularizer, and the complexity of this solution is still O(n³), the same as in the original dual formulation. [sent-45, score-0.139]
13 The second method is preconditioned conjugate gradient, which seems better suited to the LapSVM optimization problem. [sent-46, score-0.134]
14 We note that the quality of an approximate solution of the traditional dual form and the resulting approximation of the target optimal function are hard to relate due to the change of variables when passing to the dual problem. [sent-51, score-0.137]
15 Training LapSVMs in the primal overcomes this issue, and it allows us to directly compute approximate solutions by controlling the number of conjugate gradient iterations. [sent-52, score-0.191]
16 An approximation of the target function with roughly the same classification accuracy as the optimal one can be achieved with a small number of iterations due to the influence of the intrinsic norm regularizer of LapSVMs on the training process. [sent-54, score-0.15]
17 We investigate those effects, showing that they make common stopping conditions for iterative gradient-based algorithms hard to tune, often leading either to a premature stopping of the iteration or to a large number of unnecessary iterations, which do not improve classification accuracy. [sent-55, score-0.105]
18 This criterion exploits the stability of the prediction on the unlabeled data, or the classification accuracy on validation data (if available). [sent-57, score-0.191]
19 The primal solution of the LapSVM problem is based on an L2 hinge loss, which establishes a direct connection to the Laplacian Regularized Least Squares Classifier (LapRLSC) (Belkin et al., 2006). [sent-60, score-0.151]
20 We discuss the similarities between primal LapSVM and LapRLSC and we show that the proposed fast solution can be straightforwardly applied to LapRLSC. [sent-62, score-0.136]
21 Section 2 describes the LapSVM algorithm in its original formulation, while in Section 3 we discuss in detail the proposed solutions in the primal form. [sent-66, score-0.129]
22 In Section 5 a parallel between the primal solution of LapSVM and the solution of LapRLSC (Regularized Least Squares) is drawn, describing some possible future work. [sent-68, score-0.153]
23 This constraint is enforced in the learning process by an intrinsic regularizer ‖f‖²_I that is empirically estimated from the point cloud of labeled and unlabeled data using the graph Laplacian associated with them, since M is truly unknown. [sent-99, score-0.149]
24 The supervised approach defines the classification hyperplane just by considering the two labeled examples, and it does not benefit from unlabeled data (Figure 1(b)). [sent-111, score-0.093]
25 The intrinsic norm of Equation 1 actually performs a transduction along the manifold, enforcing the values of f at points that are nearby with respect to geodesic distances on M to be the “same”. [sent-113, score-0.122]
26 A large set of unlabeled examples (black squares) and only one labeled example per class (red diamond, blue circle) are selected. [sent-133, score-0.093]
27 (b) The result of a maximum-margin supervised classification. (c) The result of a semi-supervised classification with the intrinsic norm from manifold regularization. [sent-134, score-0.1]
28 By introducing the slack variables ξi, the unconstrained primal problem can be written as a constrained one: min over α ∈ IR^n, ξ ∈ IR^l of ∑_{i=1}^l ξi + γA αᵀKα + γI αᵀKLKα, subject to yi(∑_{j=1}^n αj k(xi, xj) + b) ≥ 1 − ξi, i = 1, …, l. [sent-147, score-0.133]
29 Training in the Primal. In this section we analyze the optimization of the primal form of the nonlinear LapSVM problem, following the growing interest of the last few years in training SVMs in the primal (Keerthi and DeCoste, 2005; Joachims, 2006; Chapelle, 2007; Shalev-Shwartz et al., 2007). [sent-172, score-0.286]
30 The focus of researchers has been mainly on the solution of the linear SVM primal problem, showing how it can be solved quickly and efficiently. [sent-175, score-0.136]
31 In the Modified Finite Newton method of Keerthi and DeCoste (2005) the SVM problem is optimized in the primal by a numerically robust conjugate gradient technique that implements the Newton iterations. [sent-176, score-0.191]
32 Why, then, should a solution of the primal problem be useful in the case of LapSVM? [sent-184, score-0.136]
33 We believe that studying the primal LapSVM problem is the basis for future investigations and improvements of this classifier. [sent-189, score-0.119]
34 min over α ∈ IR^n, b ∈ IR of (1/2)[∑_{i=1}^l max(0, 1 − yi f(xi))² + γA αᵀKα + γI (Kα + 1b)ᵀL(Kα + 1b)]  (7). We solved this convex problem by Newton's method and by preconditioned conjugate gradient, comparing their complexities with the complexity of the original LapSVM solution, and showing a parallel with the SVM case. [sent-198, score-0.144]
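As an illustration of how this objective can be evaluated in practice, here is a minimal NumPy sketch. It follows the reconstruction of Equation 7 given above (the 1/2 scaling is our reading and may differ from the paper by a constant factor), and all names (primal_objective, L_graph, and so on) are ours rather than the authors':

```python
import numpy as np

def primal_objective(alpha, b, K, L_graph, y, l, gamma_A, gamma_I):
    """L2-hinge primal LapSVM objective (hedged reconstruction of Equation 7).

    K       : (n, n) kernel matrix over the l labeled + u unlabeled points
    L_graph : (n, n) graph Laplacian built from labeled and unlabeled data
    y       : (l,) labels in {-1, +1} for the first l (labeled) points
    """
    f = K @ alpha + b                                 # function values on all n points
    margins = 1.0 - y * f[:l]                         # hinge margins on labeled points
    hinge = np.sum(np.maximum(0.0, margins) ** 2)     # squared (L2) hinge loss
    ambient = gamma_A * alpha @ (K @ alpha)           # ambient-norm regularizer
    intrinsic = gamma_I * f @ (L_graph @ f)           # intrinsic (manifold) regularizer
    return 0.5 * (hinge + ambient + intrinsic)
```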
35 We have ∇ = [∇b; ∇α], with ∇b = 1ᵀI_E(Kα + 1b) − 1ᵀI_E y + γI 1ᵀL(Kα + 1b) and ∇α = K I_E(Kα + 1b) − K I_E y + γA Kα + γI K L(Kα + 1b)  (9), where y ∈ {−1, 0, 1}ⁿ is the vector that collects the l labels yi of the labeled training points and a set of u zeros. [sent-209, score-0.104]
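The gradient of Equation 9 translates almost line by line into code. The sketch below is our hedged reading of it, with the error set E represented by a boolean mask over the labeled points and y padded with u zeros as stated above:

```python
import numpy as np

def primal_gradient(alpha, b, K, L_graph, y, l, gamma_A, gamma_I):
    """Gradient (grad_b, grad_alpha) of the primal objective, following Equation 9."""
    n = K.shape[0]
    f = K @ alpha + b
    # Error set E: labeled points violating the margin (the L2 hinge is active there).
    err = np.zeros(n, dtype=bool)
    err[:l] = y * f[:l] < 1.0
    y_full = np.zeros(n)
    y_full[:l] = y                              # l labels padded with u zeros
    IE_f = np.where(err, f, 0.0)                # I_E (K alpha + 1 b)
    IE_y = np.where(err, y_full, 0.0)           # I_E y
    Lf = L_graph @ f
    grad_b = np.sum(IE_f) - np.sum(IE_y) + gamma_I * np.sum(Lf)
    grad_alpha = K @ (IE_f - IE_y) + gamma_A * (K @ alpha) + gamma_I * (K @ Lf)
    return grad_b, grad_alpha
```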
36 Exactly as in the case of primal SVMs (Chapelle, 2007), in our experiments setting s = 1 did not result in any convergence problems. [sent-219, score-0.119]
37 Complexity Analysis. Updating the α coefficients with Newton's method costs O(n³), due to the matrix inversion in the update rule, the same complexity as the original LapSVM solution based on the dual problem discussed in Section 2. [sent-222, score-0.107]
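To make the O(n³) step concrete, the following is a hedged sketch of one Newton-style update. With the error set E held fixed, setting the gradient of Equation 9 to zero and factoring K out of ∇α (which assumes K is invertible) yields a dense (n+1)×(n+1) linear system in (b, α); solving it directly is the cubic-cost operation. This is our reconstruction of the update, not the authors' code:

```python
import numpy as np

def newton_step(alpha, b, K, L_graph, y, l, gamma_A, gamma_I, step=1.0):
    """One Newton-style update with the error set E frozen (hedged sketch)."""
    n = K.shape[0]
    f = K @ alpha + b
    err = np.zeros(n, dtype=bool)
    err[:l] = y * f[:l] < 1.0
    y_full = np.zeros(n)
    y_full[:l] = y

    IE = np.diag(err.astype(float))     # indicator matrix of the error set
    M = IE + gamma_I * L_graph          # shorthand appearing in both block rows
    ones = np.ones(n)

    # With E fixed (and K assumed invertible so it can be factored out of the
    # alpha block), the stationarity conditions are linear in (b, alpha):
    #   [ 1^T M 1     1^T M K          ] [b    ]   [ 1^T I_E y ]
    #   [   M 1       M K + gamma_A I  ] [alpha] = [   I_E y   ]
    A = np.zeros((n + 1, n + 1))
    A[0, 0] = ones @ (M @ ones)
    A[0, 1:] = ones @ (M @ K)
    A[1:, 0] = M @ ones
    A[1:, 1:] = M @ K + gamma_A * np.eye(n)
    rhs = np.concatenate(([np.sum(y_full[err])], np.where(err, y_full, 0.0)))

    z = np.linalg.solve(A, rhs)         # dense solve: the O(n^3) bottleneck
    b_new = (1.0 - step) * b + step * z[0]
    alpha_new = (1.0 - step) * alpha + step * z[1:]
    return alpha_new, b_new
```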
38 It is interesting to compare the training of SVMs in the primal with that of LapSVMs, for better insight into the solution based on Newton's method. [sent-228, score-0.156]
39 We can see that P is a factor of H and c, hence the terms Ĥ and ĉ (and, consequently, the preconditioned gradient ∇̂, given by ∇̂ = P⁻¹∇ = Ĥz − ĉ) can be trivially computed without explicitly performing any matrix inversions. [sent-245, score-0.14]
40 The condition number of the preconditioned system is sensibly decreased with respect to the one of Equation 11, since K I_E K and KLK are reduced to I_E K and LK. [sent-246, score-0.119]
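Reading the preconditioner as the block-diagonal operator that leaves the bias coordinate untouched and multiplies the α block by K (our interpretation of the sentence above), the preconditioned gradient is obtained simply by not multiplying the α block of Equation 9 by K, so no inversion is ever needed. A hedged NumPy sketch with our own names:

```python
import numpy as np

def preconditioned_gradient(alpha, b, K, L_graph, y, l, gamma_A, gamma_I):
    """Preconditioned gradient: the alpha block of Equation 9 with the leading K dropped."""
    n = K.shape[0]
    f = K @ alpha + b
    err = np.zeros(n, dtype=bool)
    err[:l] = y * f[:l] < 1.0
    y_full = np.zeros(n)
    y_full[:l] = y
    IE_f = np.where(err, f, 0.0)
    IE_y = np.where(err, y_full, 0.0)
    Lf = L_graph @ f
    pg_b = np.sum(IE_f) - np.sum(IE_y) + gamma_I * np.sum(Lf)     # bias block unchanged
    pg_alpha = (IE_f - IE_y) + gamma_A * alpha + gamma_I * Lf     # K factored out
    return pg_b, pg_alpha
```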
41 For an easier comparison with the standard formulation of PCG, consider that the residual vectors of the original and preconditioned systems correspond to −∇ and −∇̂, respectively. [sent-251, score-0.111]
42 After several iterations the conjugacy of the descent directions tends to get lost due to floating-point round-off error, so a restart of the preconditioned conjugate gradient algorithm is required. [sent-256, score-0.212]
43 Convergence is usually declared when the norm of the preconditioned gradient falls below a given threshold (Chapelle, 2007), or when the current preconditioned gradient is roughly orthogonal to the true gradient (Shewchuk, 1994). [sent-263, score-0.348]
44 Algorithm 1 Preconditioned Conjugate Gradient (PCG) for primal LapSVMs. [sent-265, score-0.119]
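A compact sketch of the overall loop of Algorithm 1 is given below. It is a generic preconditioned nonlinear conjugate gradient driver: the β rule (Fletcher-Reeves on the mixed product), the restart schedule, and the callback-based interface are our own choices and are not taken from the paper, which may differ in these details. The variable x is meant to pack (b, α) into a single vector, with the callbacks handling the problem-specific computations (for instance, the gradient, preconditioned gradient, and line-search sketches shown elsewhere in this section):

```python
import numpy as np

def pcg_primal(x0, grad_fn, pgrad_fn, line_search_fn, should_stop_fn,
               max_iter=500, restart_every=50):
    """Skeleton of a preconditioned nonlinear conjugate gradient loop.

    grad_fn(x)           -> gradient of the primal objective at x
    pgrad_fn(x)          -> preconditioned gradient at x
    line_search_fn(x, d) -> step length along the direction d
    should_stop_fn(x, t) -> True to stop early (e.g., stability/validation check)
    x packs (b, alpha) into a single vector; the callbacks unpack it as needed.
    """
    x = np.asarray(x0, dtype=float).copy()
    g, pg = grad_fn(x), pgrad_fn(x)
    d = -pg                                 # initial search direction
    mixed = g @ pg                          # mixed product grad^T precond_grad
    for t in range(max_iter):
        step = line_search_fn(x, d)
        x = x + step * d
        if should_stop_fn(x, t):
            break
        g, pg = grad_fn(x), pgrad_fn(x)
        mixed_new = g @ pg
        beta = mixed_new / max(mixed, 1e-300)
        mixed = mixed_new
        if (t + 1) % restart_every == 0 or beta < 0.0:
            d = -pg                         # restart: drop the accumulated conjugacy
        else:
            d = -pg + beta * d
    return x
```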
45 We have ψ_j(0) = ∑_{xi∈E_j} (f(xi) − yi) f_d(xi) + γA αᵀK d_α + γI fᵀL f_d and ψ_j(1) = ∑_{xi∈E_j} (f(xi) + f_d(xi) − yi) f_d(xi) + γA (α + d_α)ᵀK d_α + γI f_dᵀL(f + f_d), where E_j is the set of error vectors for the j-th interval. [sent-296, score-0.106]
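The exact line search used here exploits the fact that ψ(t), the directional derivative of the objective along the search direction, is piecewise linear in the step length t, with breakpoints where labeled points enter or leave the error set. As a simpler hedged substitute, the sketch below evaluates ψ(t) directly and locates its zero by bisection; this is correct for the convex objective but is not the paper's closed-form interval scheme, and all names are ours:

```python
import numpy as np

def line_search_bisection(alpha, d_alpha, b, d_b, K, L_graph, y, l,
                          gamma_A, gamma_I, tol=1e-8):
    """Step length minimizing obj(alpha + t*d_alpha, b + t*d_b) by bisection on psi(t)."""
    f = K @ alpha + b
    f_d = K @ d_alpha + d_b                 # change of the function values along d
    Kd = K @ d_alpha
    Lfd = L_graph @ f_d

    def psi(t):
        # Directional derivative of the objective at step length t (cf. psi_j above).
        ft = f + t * f_d
        err = y * ft[:l] < 1.0              # error set at this step length
        data = np.sum((ft[:l][err] - y[err]) * f_d[:l][err])
        return (data + gamma_A * (alpha + t * d_alpha) @ Kd
                     + gamma_I * ft @ Lfd)

    lo, hi = 0.0, 1.0
    while psi(hi) < 0.0 and hi < 1e12:      # expand until the minimizer is bracketed
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if psi(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```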
46 Convergence of the conjugate gradient algorithm is theoretically guaranteed in O(n) steps, but a solution very close to the optimal one can be computed with far fewer iterations. [sent-309, score-0.1]
47 The common goal conditions for the PCG algorithm and, more generally, for gradient-based iterative algorithms, rely on the norm of the gradient ‖∇‖ (Boyd and Vandenberghe, 2004), on the norm of the preconditioned gradient ‖∇̂‖ (Chapelle, 2007), or on the mixed product ∇̂ᵀ∇ (Shewchuk, 1994). [sent-327, score-0.304]
48 Fixing in advance the values of τ and tmax may cause an early stop too far from the optimal solution, or it may result in the execution of a large number of iterations without a significant improvement in the classification accuracy. [sent-338, score-0.106]
49 All the remaining iterations until complete convergence are used to slightly adjust the coherence along the manifold required by the intrinsic norm and its balance with the smoothness of the function, as can be observed by looking at the function values after 25 iterations. [sent-349, score-0.153]
50 Due to the large number of unlabeled training points in the semi-supervised learning framework, the stability of the decision y(x) = sign(f(x)), x ∈ U, can be used as a reference to stop the gradient descent early (stability check). [sent-357, score-0.266]
51 Moreover, labeled validation data in the semi-supervised setting is usually small with respect to the whole training data, labeled and unlabeled, and it may not be enough to represent the structure of the data set. [sent-363, score-0.181]
52 Differently, computing the accuracy on validation data requires the evaluation of the kernel function on the validation examples. [sent-371, score-0.1]
53 The norm of the gradient ‖∇‖, the norm of the preconditioned gradient ‖∇̂‖, the value of the objective function obj, and the mixed product ∇̂ᵀ∇ are displayed as a function of the number of PCG iterations. [sent-377, score-0.2]
54 Algorithm 2 The stability check for PCG stopping. [sent-379, score-0.223]
55 d = [y(x1), …, y(xu)]ᵀ; τ = (100 · ‖d − d_old‖₁ / u)%; if τ < η then stop PCG, else d_old = d. Algorithm 3 The validation check for PCG stopping. [sent-384, score-0.228]
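A hedged sketch of these two goal conditions follows. The stability check mirrors the pseudocode of Algorithm 2 above (fraction of unlabeled decisions that changed since the last check, compared against a threshold η); the body of Algorithm 3 is not reproduced in this excerpt, so the validation check below is only our guess at its logic (stop once the validation error stops improving). All function names and thresholds are ours:

```python
import numpy as np

def stability_check(f_unlabeled, d_old, eta_percent=1.5):
    """Stability check (cf. Algorithm 2): stop when unlabeled decisions stop changing.

    f_unlabeled : current function values on the u unlabeled points
    d_old       : previous decision vector (entries in {-1, +1}), or None at the start
    Returns (stop, d) with d the current decision vector, to be stored as d_old.
    """
    d = np.sign(f_unlabeled)
    if d_old is None:
        return False, d
    u = d.size
    tau = 100.0 * np.sum(np.abs(d - d_old)) / u   # each flipped sign contributes 2/u
    return tau < eta_percent, d

def validation_check(f_val, y_val, best_err):
    """Validation check (our guess at Algorithm 3): stop when validation error stops improving."""
    err = np.mean(np.sign(f_val) != y_val)
    stop = err >= best_err
    return stop, min(err, best_err)
```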
56 In a hypothetical situation where all the labeled examples always belong to E during the training of a LapSVM classifier in the primal, the solution will be the same as that of LapRLSC. [sent-399, score-0.095]
57 As shown by Belkin et al. (2006), Sindhwani and Rosenberg (2008), and in the experimental section of this paper, LapRLSC, LapSVM, and primal LapSVM achieve similar classification performance. [sent-404, score-0.119]
58 Experimental Results We ran a wide set of experiments to analyze the proposed solution strategies of the primal LapSVM problem. [sent-408, score-0.136]
59 We compare the quality of the L2 hinge loss LapSVMs trained in the primal by Newton's method with that of the L1 hinge loss dual formulation and of LapRLSCs. [sent-411, score-0.234]
60 Training data have been divided into labeled (L), unlabeled (U), and validation (V) sets, where the last is used only to tune the classifier parameters. [sent-451, score-0.155]
61 The labeled and validation sets have been randomly selected from the training data such that at least one example per class is guaranteed to be present in each of them, without any additional balancing constraints. [sent-452, score-0.164]
62 The MNIST3VS8 and FACEMIT data sets are already divided into training and test data, so the 4-fold generation process was not necessary, and only the random subdivision of the training data was performed (balancing the class labels on training and validation data). [sent-454, score-0.173]
63 In this case our goal is just to show how we were able to handle a large amount of training data using the proposed primal solution with PCG, whereas it was not possible to do so with the original dual formulation of LapSVM. [sent-456, score-0.243]
64 The weights of the ambient and intrinsic norms, γA and γI, were determined by varying them on the grid {10⁻⁶, 10⁻⁴, 10⁻², 10⁻¹, 1, 10, 100} and chosen with respect to the validation error. [sent-467, score-0.112]
65 Results. Before going into further detail, in Table 3 we report the training times of LapSVMs using the original dual formulation and the primal training approach. [sent-471, score-0.263]
66 As expected, training in the primal by Newton's method requires training times similar to those for the dual formulation. [sent-473, score-0.253]
67 On the other hand, training by PCG with the proposed early stopping conditions shows an appreciable reduction of training times for all data sets. [sent-474, score-0.142]
68 As the number of labeled and unlabeled points increases, the improvement becomes very evident. [sent-475, score-0.105]
69 Both in the dual formulation of LapSVMs and in the primal one solved by means of Newton's method, much of the time is spent computing the LK matrix product. [sent-477, score-0.179]
70 On the other hand, the high memory requirements of dual LapSVM and of primal LapSVM solved with Newton's method, coupled with the high computational cost, made those methods impossible to run on our machine. [sent-482, score-0.179]
71 We now investigate the details of the solution of the primal LapSVM problem. [sent-483, score-0.136]
72 For this comparison, the solution of primal LapSVMs is computed by means of Newton's method. [sent-485, score-0.136]
73 The times required to solve the original dual formulation and to compute the primal solution with Newton's method are comparable, whereas solving the Laplacian SVM problem in the primal with early-stopped preconditioned conjugate gradient (PCG) offers a noticeable speedup. [sent-534, score-0.533]
74 The error rates of primal LapSVMs and LapRLSCs are quite close, due to the described relationship of the L2 hinge loss and the squared loss. [sent-536, score-0.146]
75 We compared the error rates of LapSVMs trained in the primal by Newton's method with those of PCG training, as a function of the number of gradient steps t. [sent-539, score-0.195]
76 In Figure 8-9 we collected the values of the gradient norm ‖∇‖, of the preconditioned gradient norm ‖∇̂‖, of the mixed product ∇̂ᵀ∇, and of the objective function obj for each data set, normalized by their respective values at t = 0. [sent-549, score-0.333]
77 The vertical line is an indicative index of the number of iterations after which the error rate on all partitions (L , U , V , T ) becomes equal to the one at the stationary point (when the gradient of the objective function is zero). [sent-550, score-0.103]
78 Table 4: Comparison of the accuracy of LapSVMs trained by solving the primal (Newton's method) or the dual problem. [sent-772, score-0.204]
79 Results on the labeled training set L are omitted since all algorithms correctly classify such few labeled training points. [sent-777, score-0.156]
80 Table 5: Newton's steps required to compute the solution of the primal Laplacian SVM problem. [sent-788, score-0.136]
81 The error rate of the primal solution computed by means of Newton’s method is reported as a horizontal line. [sent-790, score-0.173]
82 In the MNIST data, the elements of the kernel matrix not belonging to the main diagonal are very small due to the high degree of the polynomial kernel, so that the gradient and the preconditioned gradient are close. [sent-798, score-0.19]
83 The error rate of the primal solution computed by means of Newton’s method is reported as a horizontal line. [sent-800, score-0.173]
84 The error rate of the primal solution computed by means of Newton’s method is reported as a horizontal line. [sent-802, score-0.173]
85 Using the proposed PCG goal conditions (Section 4), we cross-validated the primal LapSVM classifier trained by PCG, and the selected parameters are reported in Table 10 of Appendix A. [sent-803, score-0.166]
86 The value of the objective function obj, of the gradient norm ‖∇‖, of the preconditioned gradient norm ‖∇̂‖, and of the mixed product ∇̂ᵀ∇ are displayed as a function of the number of PCG iterations (t). [sent-840, score-0.358]
87 In Table 6 the training times and the numbers of PCG and line search iterations are collected, whereas in Table 7 the corresponding classification error rates are reported, for a comparison with the non-approximated solution computed using Newton's method. [sent-843, score-0.105]
88 As already stressed, the training times drop appreciably when training a LapSVM in the primal using PCG and our goal conditions, independently of the data set. [sent-844, score-0.193]
89 Early stopping allows us to obtain results comparable to those of Newton's method or of the original two-step dual formulation, showing a direct correlation between the proposed goal conditions and the quality of the classifier. [sent-845, score-0.103]
90 Table 7: Average classification error (standard deviation is reported in brackets) of Laplacian SVMs trained in the primal by means of Newton's method (Newton) and of preconditioned conjugate gradient (PCG) with the proposed early stopping conditions (in square brackets). [sent-1175, score-0.409]
91 Results on the labeled training set L are omitted since all algorithms correctly classify such few labeled training points. [sent-1178, score-0.156]
92 The value of the objective function obj, of the gradient norm ‖∇‖, of the preconditioned gradient norm ‖∇̂‖, and of the mixed product ∇̂ᵀ∇ are displayed as a function of the number of PCG iterations (t). [sent-1214, score-0.358]
93 A better tuning of the goal conditions, or a different formulation of them, could move the accuracy closer to that of primal LapSVM trained with Newton's method, but this is beyond the scope of this paper. [sent-1217, score-0.144]
94 The number of iterations from the stability check is sometimes larger than the one from the validation check (COIL20(B), USPST, COIL20). [sent-1220, score-0.47]
95 As a matter of fact, labeled validation data is more informative than a stable, but unknown, decision on the unlabeled data. [sent-1221, score-0.155]
96 Table 8: Training time comparison among the Laplacian RLSCs trained by solving Equation 14 with matrix inversion and by means of preconditioned conjugate gradient (PCG) with the proposed early stopping conditions (in square brackets). [sent-1241, score-0.286]
97 A very fast solution can be achieved using preconditioned conjugate gradient coupled with an early stopping criterion based on the stability of the classifier decision. [sent-1251, score-0.335]
98 Other recently proposed optimization techniques could be investigated to solve the primal LapSVM problem, which we will address in future work. [sent-1257, score-0.119]
99 Table 12: Average classification error (standard deviation is reported in brackets) of Laplacian SVMs trained in the primal by means of Newton's method and of preconditioned conjugate gradient (PCG) with the proposed early stopping conditions (in square brackets). [sent-1690, score-0.409]
100 Results on the labeled training set L are omitted since all classifiers perfectly fit such few labeled training points. [sent-1694, score-0.156]
wordName wordTfidf (topN-words)
[('pcg', 0.79), ('lapsvm', 0.293), ('newton', 0.272), ('check', 0.146), ('uspst', 0.145), ('primal', 0.119), ('laprlsc', 0.102), ('preconditioned', 0.101), ('rlsc', 0.085), ('lapsvms', 0.081), ('ie', 0.079), ('stability', 0.077), ('elacci', 0.077), ('elkin', 0.077), ('facemit', 0.077), ('aplacian', 0.072), ('pcmac', 0.072), ('laplacian', 0.069), ('mixed', 0.068), ('validation', 0.062), ('rained', 0.062), ('rimal', 0.062), ('dual', 0.06), ('svm', 0.058), ('unlabeled', 0.052), ('manifold', 0.048), ('belkin', 0.048), ('klk', 0.043), ('chapelle', 0.041), ('labeled', 0.041), ('iterations', 0.039), ('gradient', 0.039), ('kie', 0.038), ('obj', 0.038), ('training', 0.037), ('ob', 0.036), ('early', 0.035), ('er', 0.035), ('intrinsic', 0.034), ('decoste', 0.034), ('keerthi', 0.033), ('conjugate', 0.033), ('stopping', 0.033), ('lk', 0.031), ('classi', 0.03), ('sindhwani', 0.029), ('hz', 0.027), ('shewchuk', 0.026), ('svms', 0.025), ('zt', 0.025), ('trained', 0.025), ('fd', 0.022), ('regularizer', 0.022), ('brackets', 0.022), ('iters', 0.021), ('inversion', 0.02), ('norm', 0.018), ('sensibly', 0.018), ('tmax', 0.018), ('solution', 0.017), ('moons', 0.017), ('lg', 0.017), ('ambient', 0.016), ('rosenberg', 0.016), ('err', 0.016), ('hessian', 0.015), ('hinge', 0.015), ('balancing', 0.014), ('stop', 0.014), ('normalized', 0.014), ('yi', 0.014), ('rate', 0.013), ('discriminates', 0.013), ('errv', 0.013), ('laprlscs', 0.013), ('rlscs', 0.013), ('error', 0.012), ('ls', 0.012), ('points', 0.012), ('reported', 0.012), ('piecewise', 0.012), ('non', 0.011), ('xi', 0.011), ('ir', 0.011), ('ers', 0.011), ('cg', 0.011), ('declared', 0.011), ('dii', 0.011), ('transductive', 0.011), ('niyogi', 0.01), ('original', 0.01), ('table', 0.01), ('expansion', 0.01), ('old', 0.01), ('joachims', 0.01), ('tsang', 0.01), ('selected', 0.01), ('switches', 0.01), ('clock', 0.01), ('transduction', 0.01)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 51 jmlr-2011-Laplacian Support Vector Machines Trained in the Primal
Author: Stefano Melacci, Mikhail Belkin
Abstract: In the last few years, due to the growing ubiquity of unlabeled data, much effort has been spent by the machine learning community to develop better understanding and improve the quality of classifiers exploiting unlabeled data. Following the manifold regularization approach, Laplacian Support Vector Machines (LapSVMs) have shown the state of the art performance in semi-supervised classification. In this paper we present two strategies to solve the primal LapSVM problem, in order to overcome some issues of the original dual formulation. In particular, training a LapSVM in the primal can be efficiently performed with preconditioned conjugate gradient. We speed up training by using an early stopping strategy based on the prediction on unlabeled data or, if available, on labeled validation examples. This allows the algorithm to quickly compute approximate solutions with roughly the same classification accuracy as the optimal ones, considerably reducing the training time. The computational complexity of the training algorithm is reduced from O(n3 ) to O(kn2 ), where n is the combined number of labeled and unlabeled examples and k is empirically evaluated to be significantly smaller than n. Due to its simplicity, training LapSVM in the primal can be the starting point for additional enhancements of the original LapSVM formulation, such as those for dealing with large data sets. We present an extensive experimental evaluation on real world data showing the benefits of the proposed approach. Keywords: Laplacian support vector machines, manifold regularization, semi-supervised learning, classification, optimization
2 0.055324234 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation
Author: Ryota Tomioka, Taiji Suzuki, Masashi Sugiyama
Abstract: We analyze the convergence behaviour of a recently proposed algorithm for regularized estimation called Dual Augmented Lagrangian (DAL). Our analysis is based on a new interpretation of DAL as a proximal minimization algorithm. We theoretically show under some conditions that DAL converges super-linearly in a non-asymptotic and global sense. Due to a special modelling of sparse estimation problems in the context of machine learning, the assumptions we make are milder and more natural than those made in conventional analysis of augmented Lagrangian algorithms. In addition, the new interpretation enables us to generalize DAL to wide varieties of sparse estimation problems. We experimentally confirm our analysis in a large scale ℓ1 -regularized logistic regression problem and extensively compare the efficiency of DAL algorithm to previously proposed algorithms on both synthetic and benchmark data sets. Keywords: dual augmented Lagrangian, proximal minimization, global convergence, sparse estimation, convex optimization
3 0.037812635 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
Author: Philippe Rigollet, Xin Tong
Abstract: Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss ϕ. Given a finite collection of classifiers, we combine them and obtain a new classifier that satisfies simultaneously the two following properties with high probability: (i) its ϕ-type I error is below a pre-specified level and (ii), it has ϕ-type II error close to the minimum possible. The proposed classifier is obtained by minimizing an empirical convex objective with an empirical convex constraint. The novelty of the method is that the classifier output by this computationally feasible program is shown to satisfy the original constraint on type I error. New techniques to handle such problems are developed and they have consequences on chance constrained programming. We also evaluate the price to pay in terms of type II error for being conservative on type I error. Keywords: binary classification, Neyman-Pearson paradigm, anomaly detection, empirical constraint, empirical risk minimization, chance constrained optimization
4 0.034951262 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
Author: Krishnakumar Balasubramanian, Pinar Donmez, Guy Lebanon
Abstract: Many popular linear classifiers, such as logistic regression, boosting, or SVM, are trained by optimizing a margin-based risk function. Traditionally, these risk functions are computed based on a labeled data set. We develop a novel technique for estimating such risks using only unlabeled data and the marginal label distribution. We prove that the proposed risk estimator is consistent on high-dimensional data sets and demonstrate it on synthetic and real-world data. In particular, we show how the estimate is used for evaluating classifiers in transfer learning, and for training classifiers with no labeled data whatsoever. Keywords: classification, large margin, maximum likelihood
5 0.033954613 105 jmlr-2011-lp-Norm Multiple Kernel Learning
Author: Marius Kloft, Ulf Brefeld, Sören Sonnenburg, Alexander Zien
Abstract: Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability and scalability. Unfortunately, this ℓ1 -norm MKL is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures that generalize well, we extend MKL to arbitrary norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary norms, that is ℓ p -norms with p ≥ 1. This interleaved optimization is much faster than the commonly used wrapper approaches, as demonstrated on several data sets. A theoretical analysis and an experiment on controlled artificial data shed light on the appropriateness of sparse, non-sparse and ℓ∞ -norm MKL in various scenarios. Importantly, empirical applications of ℓ p -norm MKL to three real-world problems from computational biology show that non-sparse MKL achieves accuracies that surpass the state-of-the-art. Data sets, source code to reproduce the experiments, implementations of the algorithms, and further information are available at http://doc.ml.tu-berlin.de/nonsparse_mkl/. Keywords: multiple kernel learning, learning kernels, non-sparse, support vector machine, convex conjugate, block coordinate descent, large scale optimization, bioinformatics, generalization bounds, Rademacher complexity ∗. Also at Machine Learning Group, Technische Universit¨ t Berlin, 10587 Berlin, Germany. a †. Parts of this work were done while SS was at the Friedrich Miescher Laboratory, Max Planck Society, 72076 T¨ bingen, Germany. u ‡. Most contributions by AZ were done at the Fraunhofer Institute FIRST, 12489 Berlin, Germany. c 2011 Marius Kloft, Ulf Brefeld, S¨ ren Sonnenburg and Alexander Zien. o K LOFT, B REFELD , S ONNENBURG AND Z IEN
6 0.026021695 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
7 0.0257826 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation
8 0.023505574 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
9 0.02309379 101 jmlr-2011-Variable Sparsity Kernel Learning
10 0.022865934 12 jmlr-2011-Bayesian Co-Training
11 0.022845032 66 jmlr-2011-Multiple Kernel Learning Algorithms
12 0.022226473 36 jmlr-2011-Generalized TD Learning
13 0.018848836 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership
14 0.018671269 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
15 0.01826629 58 jmlr-2011-Learning from Partial Labels
16 0.018213384 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
17 0.017245542 60 jmlr-2011-Locally Defined Principal Curves and Surfaces
18 0.01648761 35 jmlr-2011-Forest Density Estimation
19 0.016045352 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity
20 0.015701056 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
topicId topicWeight
[(0, 0.093), (1, 0.025), (2, 0.029), (3, -0.046), (4, -0.017), (5, -0.043), (6, -0.048), (7, -0.021), (8, 0.001), (9, -0.04), (10, -0.061), (11, 0.005), (12, 0.02), (13, -0.016), (14, -0.023), (15, 0.047), (16, -0.107), (17, -0.068), (18, -0.062), (19, 0.045), (20, 0.04), (21, 0.034), (22, 0.059), (23, 0.053), (24, -0.041), (25, 0.046), (26, -0.211), (27, 0.081), (28, -0.096), (29, 0.151), (30, 0.054), (31, 0.021), (32, -0.174), (33, 0.307), (34, -0.095), (35, 0.087), (36, -0.074), (37, 0.254), (38, 0.189), (39, 0.053), (40, -0.089), (41, 0.17), (42, -0.196), (43, -0.045), (44, 0.14), (45, -0.238), (46, 0.241), (47, -0.053), (48, -0.137), (49, 0.007)]
simIndex simValue paperId paperTitle
same-paper 1 0.96273541 51 jmlr-2011-Laplacian Support Vector Machines Trained in the Primal
Author: Stefano Melacci, Mikhail Belkin
Abstract: In the last few years, due to the growing ubiquity of unlabeled data, much effort has been spent by the machine learning community to develop better understanding and improve the quality of classifiers exploiting unlabeled data. Following the manifold regularization approach, Laplacian Support Vector Machines (LapSVMs) have shown the state of the art performance in semi-supervised classification. In this paper we present two strategies to solve the primal LapSVM problem, in order to overcome some issues of the original dual formulation. In particular, training a LapSVM in the primal can be efficiently performed with preconditioned conjugate gradient. We speed up training by using an early stopping strategy based on the prediction on unlabeled data or, if available, on labeled validation examples. This allows the algorithm to quickly compute approximate solutions with roughly the same classification accuracy as the optimal ones, considerably reducing the training time. The computational complexity of the training algorithm is reduced from O(n3 ) to O(kn2 ), where n is the combined number of labeled and unlabeled examples and k is empirically evaluated to be significantly smaller than n. Due to its simplicity, training LapSVM in the primal can be the starting point for additional enhancements of the original LapSVM formulation, such as those for dealing with large data sets. We present an extensive experimental evaluation on real world data showing the benefits of the proposed approach. Keywords: Laplacian support vector machines, manifold regularization, semi-supervised learning, classification, optimization
2 0.36175454 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
Author: Philippe Rigollet, Xin Tong
Abstract: Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss ϕ. Given a finite collection of classifiers, we combine them and obtain a new classifier that satisfies simultaneously the two following properties with high probability: (i) its ϕ-type I error is below a pre-specified level and (ii), it has ϕ-type II error close to the minimum possible. The proposed classifier is obtained by minimizing an empirical convex objective with an empirical convex constraint. The novelty of the method is that the classifier output by this computationally feasible program is shown to satisfy the original constraint on type I error. New techniques to handle such problems are developed and they have consequences on chance constrained programming. We also evaluate the price to pay in terms of type II error for being conservative on type I error. Keywords: binary classification, Neyman-Pearson paradigm, anomaly detection, empirical constraint, empirical risk minimization, chance constrained optimization
3 0.30337399 62 jmlr-2011-MSVMpack: A Multi-Class Support Vector Machine Package
Author: Fabien Lauer, Yann Guermeur
Abstract: This paper describes MSVMpack, an open source software package dedicated to our generic model of multi-class support vector machine. All four multi-class support vector machines (M-SVMs) proposed so far in the literature appear as instances of this model. MSVMpack provides for them the first unified implementation and offers a convenient basis to develop other instances. This is also the first parallel implementation for M-SVMs. The package consists in a set of command-line tools with a callable library. The documentation includes a tutorial, a user’s guide and a developer’s guide. Keywords: multi-class support vector machines, open source, C
4 0.2502785 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
Author: Krishnakumar Balasubramanian, Pinar Donmez, Guy Lebanon
Abstract: Many popular linear classifiers, such as logistic regression, boosting, or SVM, are trained by optimizing a margin-based risk function. Traditionally, these risk functions are computed based on a labeled data set. We develop a novel technique for estimating such risks using only unlabeled data and the marginal label distribution. We prove that the proposed risk estimator is consistent on high-dimensional data sets and demonstrate it on synthetic and real-world data. In particular, we show how the estimate is used for evaluating classifiers in transfer learning, and for training classifiers with no labeled data whatsoever. Keywords: classification, large margin, maximum likelihood
5 0.21629776 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation
Author: Amarnag Subramanya, Jeff Bilmes
Abstract: We describe a new objective for graph-based semi-supervised learning based on minimizing the Kullback-Leibler divergence between discrete probability measures that encode class membership probabilities. We show how the proposed objective can be efficiently optimized using alternating minimization. We prove that the alternating minimization procedure converges to the correct optimum and derive a simple test for convergence. In addition, we show how this approach can be scaled to solve the semi-supervised learning problem on very large data sets, for example, in one instance we use a data set with over 108 samples. In this context, we propose a graph node ordering algorithm that is also applicable to other graph-based semi-supervised learning approaches. We compare the proposed approach against other standard semi-supervised learning algorithms on the semi-supervised learning benchmark data sets (Chapelle et al., 2007), and other real-world tasks such as text classification on Reuters and WebKB, speech phone classification on TIMIT and Switchboard, and linguistic dialog-act tagging on Dihana and Switchboard. In each case, the proposed approach outperforms the state-of-the-art. Lastly, we show that our objective can be generalized into a form that includes the standard squared-error loss, and we prove a geometric rate of convergence in that case. Keywords: graph-based semi-supervised learning, transductive inference, large-scale semi-supervised learning, non-parametric models
6 0.19788922 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
7 0.19030756 94 jmlr-2011-Theoretical Analysis of Bayesian Matrix Factorization
8 0.17598201 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation
9 0.16324043 12 jmlr-2011-Bayesian Co-Training
10 0.1621269 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity
11 0.16096845 76 jmlr-2011-Parameter Screening and Optimisation for ILP using Designed Experiments
12 0.14881302 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
13 0.14823778 32 jmlr-2011-Exploitation of Machine Learning Techniques in Modelling Phrase Movements for Machine Translation
14 0.13804445 27 jmlr-2011-Domain Decomposition Approach for Fast Gaussian Process Regression of Large Spatial Data Sets
15 0.12776132 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
16 0.12711166 105 jmlr-2011-lp-Norm Multiple Kernel Learning
17 0.12597139 81 jmlr-2011-Robust Approximate Bilinear Programming for Value Function Approximation
18 0.11559691 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms
19 0.10786408 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
20 0.10760689 36 jmlr-2011-Generalized TD Learning
topicId topicWeight
[(4, 0.033), (9, 0.058), (10, 0.029), (11, 0.011), (12, 0.423), (24, 0.03), (31, 0.117), (32, 0.015), (41, 0.023), (60, 0.013), (71, 0.012), (73, 0.031), (78, 0.062), (90, 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.71693552 51 jmlr-2011-Laplacian Support Vector Machines Trained in the Primal
Author: Stefano Melacci, Mikhail Belkin
Abstract: In the last few years, due to the growing ubiquity of unlabeled data, much effort has been spent by the machine learning community to develop better understanding and improve the quality of classifiers exploiting unlabeled data. Following the manifold regularization approach, Laplacian Support Vector Machines (LapSVMs) have shown the state of the art performance in semi-supervised classification. In this paper we present two strategies to solve the primal LapSVM problem, in order to overcome some issues of the original dual formulation. In particular, training a LapSVM in the primal can be efficiently performed with preconditioned conjugate gradient. We speed up training by using an early stopping strategy based on the prediction on unlabeled data or, if available, on labeled validation examples. This allows the algorithm to quickly compute approximate solutions with roughly the same classification accuracy as the optimal ones, considerably reducing the training time. The computational complexity of the training algorithm is reduced from O(n3 ) to O(kn2 ), where n is the combined number of labeled and unlabeled examples and k is empirically evaluated to be significantly smaller than n. Due to its simplicity, training LapSVM in the primal can be the starting point for additional enhancements of the original LapSVM formulation, such as those for dealing with large data sets. We present an extensive experimental evaluation on real world data showing the benefits of the proposed approach. Keywords: Laplacian support vector machines, manifold regularization, semi-supervised learning, classification, optimization
2 0.56685925 59 jmlr-2011-Learning with Structured Sparsity
Author: Junzhou Huang, Tong Zhang, Dimitris Metaxas
Abstract: This paper investigates a learning formulation called structured sparsity, which is a natural extension of the standard sparsity concept in statistical learning and compressive sensing. By allowing arbitrary structures on the feature set, this concept generalizes the group sparsity idea that has become popular in recent years. A general theory is developed for learning with structured sparsity, based on the notion of coding complexity associated with the structure. It is shown that if the coding complexity of the target signal is small, then one can achieve improved performance by using coding complexity regularization methods, which generalize the standard sparse regularization. Moreover, a structured greedy algorithm is proposed to efficiently solve the structured sparsity problem. It is shown that the greedy algorithm approximately solves the coding complexity optimization problem under appropriate conditions. Experiments are included to demonstrate the advantage of structured sparsity over standard sparsity on some real applications. Keywords: structured sparsity, standard sparsity, group sparsity, tree sparsity, graph sparsity, sparse learning, feature selection, compressive sensing
3 0.33191001 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
Author: Graham W. Taylor, Geoffrey E. Hinton, Sam T. Roweis
Abstract: In this paper we develop a class of nonlinear generative models for high-dimensional time series. We first propose a model based on the restricted Boltzmann machine (RBM) that uses an undirected model with binary latent variables and real-valued “visible” variables. The latent and visible variables at each time step receive directed connections from the visible variables at the last few time-steps. This “conditional” RBM (CRBM) makes on-line inference efficient and allows us to use a simple approximate learning procedure. We demonstrate the power of our approach by synthesizing various sequences from a model trained on motion capture data and by performing on-line filling in of data lost during capture. We extend the CRBM in a way that preserves its most important computational properties and introduces multiplicative three-way interactions that allow the effective interaction weight between two variables to be modulated by the dynamic state of a third variable. We introduce a factoring of the implied three-way weight tensor to permit a more compact parameterization. The resulting model can capture diverse styles of motion with a single set of parameters, and the three-way interactions greatly improve its ability to blend motion styles or to transition smoothly among them. Videos and source code can be found at http://www.cs.nyu.edu/˜gwtaylor/publications/ jmlr2011. Keywords: unsupervised learning, restricted Boltzmann machines, time series, generative models, motion capture
4 0.32981172 101 jmlr-2011-Variable Sparsity Kernel Learning
Author: Jonathan Aflalo, Aharon Ben-Tal, Chiranjib Bhattacharyya, Jagarlapudi Saketha Nath, Sankaran Raman
Abstract: paper1 This presents novel algorithms and applications for a particular class of mixed-norm regularization based Multiple Kernel Learning (MKL) formulations. The formulations assume that the given kernels are grouped and employ l1 norm regularization for promoting sparsity within RKHS norms of each group and ls , s ≥ 2 norm regularization for promoting non-sparse combinations across groups. Various sparsity levels in combining the kernels can be achieved by varying the grouping of kernels—hence we name the formulations as Variable Sparsity Kernel Learning (VSKL) formulations. While previous attempts have a non-convex formulation, here we present a convex formulation which admits efficient Mirror-Descent (MD) based solving techniques. The proposed MD based algorithm optimizes over product of simplices and has a computational complexity of O m2 ntot log nmax /ε2 where m is no. training data points, nmax , ntot are the maximum no. kernels in any group, total no. kernels respectively and ε is the error in approximating the objective. A detailed proof of convergence of the algorithm is also presented. Experimental results show that the VSKL formulations are well-suited for multi-modal learning tasks like object categorization. Results also show that the MD based algorithm outperforms state-of-the-art MKL solvers in terms of computational efficiency. Keywords: multiple kernel learning, mirror descent, mixed-norm, object categorization, scalability 1. All authors contributed equally. The author names appear in alphabetical order. c 2011 Jonathan Aflalo, Aharon Ben-Tal, Chiranjib Bhattacharyya, Jagarlapudi Saketha Nath and Sankaran Raman. A FLALO , B EN -TAL , B HATTACHARYYA , NATH AND R AMAN
5 0.32566851 95 jmlr-2011-Training SVMs Without Offset
Author: Ingo Steinwart, Don Hush, Clint Scovel
Abstract: We develop, analyze, and test a training algorithm for support vector machine classifiers without offset. Key features of this algorithm are a new, statistically motivated stopping criterion, new warm start options, and a set of inexpensive working set selection strategies that significantly reduce the number of iterations. For these working set strategies, we establish convergence rates that, not surprisingly, coincide with the best known rates for SVMs with offset. We further conduct various experiments that investigate both the run time behavior and the performed iterations of the new training algorithm. It turns out, that the new algorithm needs significantly less iterations and also runs substantially faster than standard training algorithms for SVMs with offset. Keywords: support vector machines, decomposition algorithms
6 0.32352495 105 jmlr-2011-lp-Norm Multiple Kernel Learning
7 0.32278147 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
8 0.31962052 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
9 0.3176145 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
10 0.31752872 12 jmlr-2011-Bayesian Co-Training
11 0.3173027 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
12 0.31532478 66 jmlr-2011-Multiple Kernel Learning Algorithms
13 0.31409317 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation
14 0.31371635 26 jmlr-2011-Distance Dependent Chinese Restaurant Processes
15 0.31295156 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
16 0.31212258 91 jmlr-2011-The Sample Complexity of Dictionary Learning
17 0.31204909 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
18 0.31108204 16 jmlr-2011-Clustering Algorithms for Chains
19 0.30920595 48 jmlr-2011-Kernel Analysis of Deep Networks
20 0.30846488 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity