jmlr jmlr2011 jmlr2011-51 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Stefano Melacci, Mikhail Belkin
Abstract: In the last few years, due to the growing ubiquity of unlabeled data, much effort has been spent by the machine learning community to develop better understanding and improve the quality of classifiers exploiting unlabeled data. Following the manifold regularization approach, Laplacian Support Vector Machines (LapSVMs) have shown state-of-the-art performance in semi-supervised classification. In this paper we present two strategies to solve the primal LapSVM problem, in order to overcome some issues of the original dual formulation. In particular, training a LapSVM in the primal can be efficiently performed with preconditioned conjugate gradient. We speed up training by using an early stopping strategy based on the prediction on unlabeled data or, if available, on labeled validation examples. This allows the algorithm to quickly compute approximate solutions with roughly the same classification accuracy as the optimal ones, considerably reducing the training time. The computational complexity of the training algorithm is reduced from O(n³) to O(kn²), where n is the combined number of labeled and unlabeled examples and k is empirically evaluated to be significantly smaller than n. Due to its simplicity, training LapSVM in the primal can be the starting point for additional enhancements of the original LapSVM formulation, such as those for dealing with large data sets. We present an extensive experimental evaluation on real world data showing the benefits of the proposed approach. Keywords: Laplacian support vector machines, manifold regularization, semi-supervised learning, classification, optimization
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper we present two strategies to solve the primal LapSVM problem, in order to overcome some issues of the original dual formulation. [sent-7, score-0.189]
2 In particular, training a LapSVM in the primal can be efficiently performed with preconditioned conjugate gradient. [sent-8, score-0.29]
3 We speed up training by using an early stopping strategy based on the prediction on unlabeled data or, if available, on labeled validation examples. [sent-9, score-0.26]
4 The computational complexity of the training algorithm is reduced from O(n³) to O(kn²), where n is the combined number of labeled and unlabeled examples and k is empirically evaluated to be significantly smaller than n. [sent-11, score-0.13]
5 Due to its simplicity, training LapSVM in the primal can be the starting point for additional enhancements of the original LapSVM formulation, such as those for dealing with large data sets. [sent-12, score-0.166]
6 Introduction. In semi-supervised learning one estimates a target classification/regression function from a few labeled examples together with a large collection of unlabeled data. [sent-15, score-0.093]
7 This dual problem is defined on a number of dual variables equal to l, the number of labeled points. [sent-36, score-0.161]
8 If the total number of labeled and unlabeled points is n, the relationship between the l variables and the final n coefficients is given by a linear system of n equations and variables. [sent-37, score-0.105]
9 Motivated by the recent interest in solving the SVM problem in the primal (Keerthi and DeCoste, 2005; Joachims, 2006; Chapelle, 2007; Shalev-Shwartz et al., 2007), [sent-39, score-0.119]
10 we present a solution to the primal LapSVM problem that can significantly reduce training times and overcome some issues of the original training algorithm. [sent-40, score-0.22]
11 We propose two methods for solving the LapSVM problem in the primal form (not limited to the linear case), following the ideas presented by Chapelle (2007) for SVMs and pointing out some important differences resulting from an additional regularization term. [sent-42, score-0.119]
12 Interestingly, it turns out that the advantages of Newton's method for the SVM problem are lost in LapSVM due to the intrinsic norm regularizer, and the complexity of this solution is still O(n³), the same as in the original dual formulation. [sent-45, score-0.139]
13 The second method is preconditioned conjugate gradient, which seems better suited to the LapSVM optimization problem. [sent-46, score-0.134]
14 We note that the quality of an approximate solution of the traditional dual form and the resulting approximation of the target optimal function are hard to relate due to the change of variables when passing to the dual problem. [sent-51, score-0.137]
15 Training LapSVMs in the primal overcomes this issue, and it allows us to directly compute approximate solutions by controlling the number of conjugate gradient iterations. [sent-52, score-0.191]
16 An approximation of the target function with roughly the same classification accuracy as the optimal one can be achieved with a small number of iterations due to the influence of the intrinsic norm regularizer of LapSVMs on the training process. [sent-54, score-0.15]
17 We investigate those effects, showing that they make common stopping conditions for iterative gradient-based algorithms hard to tune, often leading either to a premature stopping of the iteration or to a large number of unnecessary iterations, which do not improve classification accuracy. [sent-55, score-0.105]
18 This criterion exploits the stability of the prediction on the unlabeled data, or the classification accuracy on validation data (if available). [sent-57, score-0.191]
19 The primal solution of the LapSVM problem is based on an L2 hinge loss, which establishes a direct connection to the Laplacian Regularized Least Squares Classifier (LapRLSC) (Belkin et al., 2006). [sent-60, score-0.151]
20 We discuss the similarities between primal LapSVM and LapRLSC and we show that the proposed fast solution can be straightforwardly applied to LapRLSC. [sent-62, score-0.136]
21 Section 2 describes the LapSVM algorithm in its original formulation, while in Section 3 we discuss in detail the proposed solutions in the primal form. [sent-66, score-0.129]
22 In Section 5 a parallel between the primal solution of LapSVM and the solution of LapRLSC (Regularized Least Squares) is drawn, describing some possible future work. [sent-68, score-0.153]
23 This constraint is enforced in the learning process by an intrinsic regularizer ‖f‖²_I that is empirically estimated from the point cloud of labeled and unlabeled data using the graph Laplacian associated with them, since M is truly unknown. [sent-99, score-0.149]
24 The supervised approach defines the classification hyperplane just by considering the two labeled examples, and it does not benefit from unlabeled data (Figure 1(b)). [sent-111, score-0.093]
25 The intrinsic norm of Equation 1 actually performs a transduction along the manifold, enforcing the values of f at points that are nearby with respect to geodesic distances on M to be the “same”. [sent-113, score-0.122]
26 A large set of unlabeled examples (black squares) and only one labeled example per class (red diamond, blue circle) are selected. [sent-133, score-0.093]
27 (b) The result of a maximum-margin supervised classification. (c) The result of a semi-supervised classification with the intrinsic norm from manifold regularization. [sent-134, score-0.1]
28 By introducing the slack variables ξi, the unconstrained primal problem can be written as a constrained one: min over α ∈ IR^n, ξ ∈ IR^l of ∑_{i=1}^l ξi + γA αᵀKα + γI αᵀKLKα, subject to yi(∑_{j=1}^n αj k(xi, xj) + b) ≥ 1 − ξi, i = 1, …, l. [sent-147, score-0.133]
29 Training in the Primal. In this section we analyze the optimization of the primal form of the nonlinear LapSVM problem, following the growing interest of the last few years in training SVMs in the primal (Keerthi and DeCoste, 2005; Joachims, 2006; Chapelle, 2007; Shalev-Shwartz et al., 2007). [sent-172, score-0.286]
30 The focus of researchers has been mainly on the solution of the linear SVM primal problem, showing how it can be solved quickly and efficiently. [sent-175, score-0.136]
31 In the Modified Finite Newton method of Keerthi and DeCoste (2005) the SVM problem is optimized in the primal by a numerically robust conjugate gradient technique that implements the Newton iterations. [sent-176, score-0.191]
32 Why, then, should a solution of the primal problem be useful in the case of LapSVM? [sent-184, score-0.136]
33 We believe that studying the primal LapSVM problem is the basis for future investigations and improvements of this classifier. [sent-189, score-0.119]
34 min over α ∈ IR^n, b ∈ IR of (1/2)[∑_{i=1}^l max(0, 1 − yi f(xi))² + γA αᵀKα + γI (Kα + 1b)ᵀL(Kα + 1b)]  (7). We solved this convex problem by Newton's method and by preconditioned conjugate gradient, comparing their complexities with the complexity of the original LapSVM solution, and showing a parallel with the SVM case. [sent-198, score-0.144]
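As an illustration of how this objective can be evaluated in practice, here is a minimal NumPy sketch. It follows the reconstruction of Equation 7 given above (the 1/2 scaling is our reading and may differ from the paper by a constant factor), and all names (primal_objective, L_graph, and so on) are ours rather than the authors':

```python
import numpy as np

def primal_objective(alpha, b, K, L_graph, y, l, gamma_A, gamma_I):
    """L2-hinge primal LapSVM objective (hedged reconstruction of Equation 7).

    K       : (n, n) kernel matrix over the l labeled + u unlabeled points
    L_graph : (n, n) graph Laplacian built from labeled and unlabeled data
    y       : (l,) labels in {-1, +1} for the first l (labeled) points
    """
    f = K @ alpha + b                                 # function values on all n points
    margins = 1.0 - y * f[:l]                         # hinge margins on labeled points
    hinge = np.sum(np.maximum(0.0, margins) ** 2)     # squared (L2) hinge loss
    ambient = gamma_A * alpha @ (K @ alpha)           # ambient-norm regularizer
    intrinsic = gamma_I * f @ (L_graph @ f)           # intrinsic (manifold) regularizer
    return 0.5 * (hinge + ambient + intrinsic)
```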
35 We have ∇ = [∇b; ∇α], with ∇b = 1ᵀI_E(Kα + 1b) − 1ᵀI_E y + γI 1ᵀL(Kα + 1b) and ∇α = K I_E(Kα + 1b) − K I_E y + γA Kα + γI K L(Kα + 1b)  (9), where y ∈ {−1, 0, 1}ⁿ is the vector that collects the l labels yi of the labeled training points and a set of u zeros. [sent-209, score-0.104]
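The gradient of Equation 9 translates almost line by line into code. The sketch below is our hedged reading of it, with the error set E represented by a boolean mask over the labeled points and y padded with u zeros as stated above:

```python
import numpy as np

def primal_gradient(alpha, b, K, L_graph, y, l, gamma_A, gamma_I):
    """Gradient (grad_b, grad_alpha) of the primal objective, following Equation 9."""
    n = K.shape[0]
    f = K @ alpha + b
    # Error set E: labeled points violating the margin (the L2 hinge is active there).
    err = np.zeros(n, dtype=bool)
    err[:l] = y * f[:l] < 1.0
    y_full = np.zeros(n)
    y_full[:l] = y                              # l labels padded with u zeros
    IE_f = np.where(err, f, 0.0)                # I_E (K alpha + 1 b)
    IE_y = np.where(err, y_full, 0.0)           # I_E y
    Lf = L_graph @ f
    grad_b = np.sum(IE_f) - np.sum(IE_y) + gamma_I * np.sum(Lf)
    grad_alpha = K @ (IE_f - IE_y) + gamma_A * (K @ alpha) + gamma_I * (K @ Lf)
    return grad_b, grad_alpha
```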
36 Exactly as in the case of primal SVMs (Chapelle, 2007), in our experiments setting s = 1 did not result in any convergence problems. [sent-219, score-0.119]
37 Complexity Analysis. Updating the α coefficients with Newton's method costs O(n³), due to the matrix inversion in the update rule, the same complexity as the original LapSVM solution based on the dual problem discussed in Section 2. [sent-222, score-0.107]
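To make the O(n³) step concrete, the following is a hedged sketch of one Newton-style update. With the error set E held fixed, setting the gradient of Equation 9 to zero and factoring K out of ∇α (which assumes K is invertible) yields a dense (n+1)×(n+1) linear system in (b, α); solving it directly is the cubic-cost operation. This is our reconstruction of the update, not the authors' code:

```python
import numpy as np

def newton_step(alpha, b, K, L_graph, y, l, gamma_A, gamma_I, step=1.0):
    """One Newton-style update with the error set E frozen (hedged sketch)."""
    n = K.shape[0]
    f = K @ alpha + b
    err = np.zeros(n, dtype=bool)
    err[:l] = y * f[:l] < 1.0
    y_full = np.zeros(n)
    y_full[:l] = y

    IE = np.diag(err.astype(float))     # indicator matrix of the error set
    M = IE + gamma_I * L_graph          # shorthand appearing in both block rows
    ones = np.ones(n)

    # With E fixed (and K assumed invertible so it can be factored out of the
    # alpha block), the stationarity conditions are linear in (b, alpha):
    #   [ 1^T M 1     1^T M K          ] [b    ]   [ 1^T I_E y ]
    #   [   M 1       M K + gamma_A I  ] [alpha] = [   I_E y   ]
    A = np.zeros((n + 1, n + 1))
    A[0, 0] = ones @ (M @ ones)
    A[0, 1:] = ones @ (M @ K)
    A[1:, 0] = M @ ones
    A[1:, 1:] = M @ K + gamma_A * np.eye(n)
    rhs = np.concatenate(([np.sum(y_full[err])], np.where(err, y_full, 0.0)))

    z = np.linalg.solve(A, rhs)         # dense solve: the O(n^3) bottleneck
    b_new = (1.0 - step) * b + step * z[0]
    alpha_new = (1.0 - step) * alpha + step * z[1:]
    return alpha_new, b_new
```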
38 It is interesting to compare the training of SVMs in the primal with that of LapSVMs, for better insight into the solution based on Newton's method. [sent-228, score-0.156]
39 We can see that P is a factor of H and c, hence the terms Ĥ and ĉ (and, consequently, the preconditioned gradient ∇̂, given by ∇̂ = P⁻¹∇ = Ĥz − ĉ) can be trivially computed without explicitly performing any matrix inversions. [sent-245, score-0.14]
40 The condition number of the preconditioned system is sensibly decreased with respect to the one of Equation 11, since K I_E K and KLK are reduced to I_E K and LK. [sent-246, score-0.119]
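Reading the preconditioner as the block-diagonal operator that leaves the bias coordinate untouched and multiplies the α block by K (our interpretation of the sentence above), the preconditioned gradient is obtained simply by not multiplying the α block of Equation 9 by K, so no inversion is ever needed. A hedged NumPy sketch with our own names:

```python
import numpy as np

def preconditioned_gradient(alpha, b, K, L_graph, y, l, gamma_A, gamma_I):
    """Preconditioned gradient: the alpha block of Equation 9 with the leading K dropped."""
    n = K.shape[0]
    f = K @ alpha + b
    err = np.zeros(n, dtype=bool)
    err[:l] = y * f[:l] < 1.0
    y_full = np.zeros(n)
    y_full[:l] = y
    IE_f = np.where(err, f, 0.0)
    IE_y = np.where(err, y_full, 0.0)
    Lf = L_graph @ f
    pg_b = np.sum(IE_f) - np.sum(IE_y) + gamma_I * np.sum(Lf)     # bias block unchanged
    pg_alpha = (IE_f - IE_y) + gamma_A * alpha + gamma_I * Lf     # K factored out
    return pg_b, pg_alpha
```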
41 For an easier comparison with the standard formulation of PCG, consider that the residual vectors of the original and preconditioned systems correspond to −∇ and −∇̂, respectively. [sent-251, score-0.111]
42 After several iterations the conjugacy of the descent directions tends to get lost due to floating-point round-off error, so a restart of the preconditioned conjugate gradient algorithm is required. [sent-256, score-0.212]
43 Convergence is usually declared when the norm of the preconditioned gradient falls below a given threshold (Chapelle, 2007), or when the current preconditioned gradient is roughly orthogonal to the true gradient (Shewchuk, 1994). [sent-263, score-0.348]
44 Algorithm 1 Preconditioned Conjugate Gradient (PCG) for primal LapSVMs. [sent-265, score-0.119]
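A compact sketch of the overall loop of Algorithm 1 is given below. It is a generic preconditioned nonlinear conjugate gradient driver: the β rule (Fletcher-Reeves on the mixed product), the restart schedule, and the callback-based interface are our own choices and are not taken from the paper, which may differ in these details. The variable x is meant to pack (b, α) into a single vector, with the callbacks handling the problem-specific computations (for instance, the gradient, preconditioned gradient, and line-search sketches shown elsewhere in this section):

```python
import numpy as np

def pcg_primal(x0, grad_fn, pgrad_fn, line_search_fn, should_stop_fn,
               max_iter=500, restart_every=50):
    """Skeleton of a preconditioned nonlinear conjugate gradient loop.

    grad_fn(x)           -> gradient of the primal objective at x
    pgrad_fn(x)          -> preconditioned gradient at x
    line_search_fn(x, d) -> step length along the direction d
    should_stop_fn(x, t) -> True to stop early (e.g., stability/validation check)
    x packs (b, alpha) into a single vector; the callbacks unpack it as needed.
    """
    x = np.asarray(x0, dtype=float).copy()
    g, pg = grad_fn(x), pgrad_fn(x)
    d = -pg                                 # initial search direction
    mixed = g @ pg                          # mixed product grad^T precond_grad
    for t in range(max_iter):
        step = line_search_fn(x, d)
        x = x + step * d
        if should_stop_fn(x, t):
            break
        g, pg = grad_fn(x), pgrad_fn(x)
        mixed_new = g @ pg
        beta = mixed_new / max(mixed, 1e-300)
        mixed = mixed_new
        if (t + 1) % restart_every == 0 or beta < 0.0:
            d = -pg                         # restart: drop the accumulated conjugacy
        else:
            d = -pg + beta * d
    return x
```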
45 We have ψ_j(0) = ∑_{xi∈E_j} (f(xi) − yi) f_d(xi) + γA αᵀK d_α + γI fᵀL f_d and ψ_j(1) = ∑_{xi∈E_j} (f(xi) + f_d(xi) − yi) f_d(xi) + γA (α + d_α)ᵀK d_α + γI f_dᵀL(f + f_d), where E_j is the set of error vectors for the j-th interval. [sent-296, score-0.106]
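The exact line search used here exploits the fact that ψ(t), the directional derivative of the objective along the search direction, is piecewise linear in the step length t, with breakpoints where labeled points enter or leave the error set. As a simpler hedged substitute, the sketch below evaluates ψ(t) directly and locates its zero by bisection; this is correct for the convex objective but is not the paper's closed-form interval scheme, and all names are ours:

```python
import numpy as np

def line_search_bisection(alpha, d_alpha, b, d_b, K, L_graph, y, l,
                          gamma_A, gamma_I, tol=1e-8):
    """Step length minimizing obj(alpha + t*d_alpha, b + t*d_b) by bisection on psi(t)."""
    f = K @ alpha + b
    f_d = K @ d_alpha + d_b                 # change of the function values along d
    Kd = K @ d_alpha
    Lfd = L_graph @ f_d

    def psi(t):
        # Directional derivative of the objective at step length t (cf. psi_j above).
        ft = f + t * f_d
        err = y * ft[:l] < 1.0              # error set at this step length
        data = np.sum((ft[:l][err] - y[err]) * f_d[:l][err])
        return (data + gamma_A * (alpha + t * d_alpha) @ Kd
                     + gamma_I * ft @ Lfd)

    lo, hi = 0.0, 1.0
    while psi(hi) < 0.0 and hi < 1e12:      # expand until the minimizer is bracketed
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if psi(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```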
46 Convergence of the conjugate gradient algorithm is theoretically guaranteed in O(n) steps, but a solution very close to the optimal one can be computed with far fewer iterations. [sent-309, score-0.1]
47 The common goal conditions for the PCG algorithm and, more generally, for gradient-based iterative algorithms, rely on the norm of the gradient ‖∇‖ (Boyd and Vandenberghe, 2004), on the norm of the preconditioned gradient ‖∇̂‖ (Chapelle, 2007), or on the mixed product ∇̂ᵀ∇ (Shewchuk, 1994). [sent-327, score-0.304]
48 Fixing in advance the values of τ and tmax may cause an early stop too far from the optimal solution, or it may result in the execution of a large number of iterations without a significant improvement in the classification accuracy. [sent-338, score-0.106]
49 All the remaining iterations until complete convergence are used to slightly adjust the coherence along the manifold required by the intrinsic norm and its balance with the smoothness of the function, as can be observed by looking at the function values after 25 iterations. [sent-349, score-0.153]
50 Due to the large number of unlabeled training points in the semi-supervised learning framework, the stability of the decision y(x) = sign(f(x)), x ∈ U, can be used as a reference to stop the gradient descent early (stability check). [sent-357, score-0.266]
51 Moreover, labeled validation data in the semi-supervised setting is usually small with respect to the whole training data, labeled and unlabeled, and it may not be enough to represent the structure of the data set. [sent-363, score-0.181]
52 Differently, computing the accuracy on validation data requires the evaluation of the kernel function on the validation examples. [sent-371, score-0.1]
53 The norm of the gradient ‖∇‖, the norm of the preconditioned gradient ‖∇̂‖, the value of the objective function obj, and the mixed product ∇̂ᵀ∇ are displayed as a function of the number of PCG iterations. [sent-377, score-0.2]
54 Algorithm 2 The stability check for PCG stopping. [sent-379, score-0.223]
55 d = [y(x1), …, y(xu)]ᵀ; τ = (100 · ‖d − d_old‖₁ / u)%; if τ < η then stop PCG, else d_old = d. Algorithm 3 The validation check for PCG stopping. [sent-384, score-0.228]
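A hedged sketch of these two goal conditions follows. The stability check mirrors the pseudocode of Algorithm 2 above (fraction of unlabeled decisions that changed since the last check, compared against a threshold η); the body of Algorithm 3 is not reproduced in this excerpt, so the validation check below is only our guess at its logic (stop once the validation error stops improving). All function names and thresholds are ours:

```python
import numpy as np

def stability_check(f_unlabeled, d_old, eta_percent=1.5):
    """Stability check (cf. Algorithm 2): stop when unlabeled decisions stop changing.

    f_unlabeled : current function values on the u unlabeled points
    d_old       : previous decision vector (entries in {-1, +1}), or None at the start
    Returns (stop, d) with d the current decision vector, to be stored as d_old.
    """
    d = np.sign(f_unlabeled)
    if d_old is None:
        return False, d
    u = d.size
    tau = 100.0 * np.sum(np.abs(d - d_old)) / u   # each flipped sign contributes 2/u
    return tau < eta_percent, d

def validation_check(f_val, y_val, best_err):
    """Validation check (our guess at Algorithm 3): stop when validation error stops improving."""
    err = np.mean(np.sign(f_val) != y_val)
    stop = err >= best_err
    return stop, min(err, best_err)
```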
56 In a hypothetical situation where all the labeled examples always belong to E during the training of a LapSVM classifier in the primal, the solution will be the same as that of LapRLSC. [sent-399, score-0.095]
57 As shown by Belkin et al. (2006), Sindhwani and Rosenberg (2008), and in the experimental section of this paper, LapRLSC, LapSVM, and primal LapSVM achieve similar classification performance. [sent-404, score-0.119]
58 Experimental Results We ran a wide set of experiments to analyze the proposed solution strategies of the primal LapSVM problem. [sent-408, score-0.136]
59 We compare the quality of the L2 hinge loss LapSVMs trained in the primal by Newton's method with that of the L1 hinge loss dual formulation and of LapRLSCs. [sent-411, score-0.234]
60 Training data have been divided into labeled (L), unlabeled (U), and validation (V) sets, where the last is used only to tune the classifier parameters. [sent-451, score-0.155]
61 The labeled and validation sets have been randomly selected from the training data such that at least one example per class is guaranteed to be present in each of them, without any additional balancing constraints. [sent-452, score-0.164]
62 The MNIST3VS8 and FACEMIT data sets are already divided into training and test data, so the 4-fold generation process was not necessary, and only the random subdivision of the training data was performed (balancing the class labels on training and validation data). [sent-454, score-0.173]
63 In this case our goal is just to show how we were able to handle a large amount of training data using the proposed primal solution with PCG, whereas it was not possible to do so with the original dual formulation of LapSVM. [sent-456, score-0.243]
64 The weights of the ambient and intrinsic norms, γA and γI, were determined by varying them on the grid {10⁻⁶, 10⁻⁴, 10⁻², 10⁻¹, 1, 10, 100} and chosen with respect to the validation error. [sent-467, score-0.112]
65 Results. Before going into further detail, in Table 3 we report the training times of LapSVMs using the original dual formulation and the primal training approach. [sent-471, score-0.263]
66 As expected, training in the primal by Newton's method requires training times similar to those for the dual formulation. [sent-473, score-0.253]
67 On the other hand, training by PCG with the proposed early stopping conditions shows an appreciable reduction of training times for all data sets. [sent-474, score-0.142]
68 As the number of labeled and unlabeled points increases, the improvement becomes very evident. [sent-475, score-0.105]
69 Both in the dual formulation of LapSVMs and in the primal one solved by means of Newton's method, much of the time is spent computing the LK matrix product. [sent-477, score-0.179]
70 On the other hand, the high memory requirements of dual LapSVM and of primal LapSVM solved with Newton's method, coupled with the high computational cost, made those methods impossible to run on our machine. [sent-482, score-0.179]
71 We now investigate the details of the solution of the primal LapSVM problem. [sent-483, score-0.136]
72 For this comparison, the solution of primal LapSVMs is computed by means of Newton's method. [sent-485, score-0.136]
73 The times required to solve the original dual formulation and to compute the primal solution with Newton's method are comparable, whereas solving the Laplacian SVM problem in the primal with early-stopped preconditioned conjugate gradient (PCG) offers a noticeable speedup. [sent-534, score-0.533]
74 The error rates of primal LapSVMs and LapRLSCs are quite close, due to the described relationship of the L2 hinge loss and the squared loss. [sent-536, score-0.146]
75 We compared the error rates of LapSVMs trained in the primal by Newton's method with those of PCG training, as a function of the number of gradient steps t. [sent-539, score-0.195]
76 In Figure 8-9 we collected the values of the gradient norm ‖∇‖, of the preconditioned gradient norm ‖∇̂‖, of the mixed product ∇̂ᵀ∇, and of the objective function obj for each data set, normalized by their respective values at t = 0. [sent-549, score-0.333]
77 The vertical line is an indicative index of the number of iterations after which the error rate on all partitions (L , U , V , T ) becomes equal to the one at the stationary point (when the gradient of the objective function is zero). [sent-550, score-0.103]
78 Table 4: Comparison of the accuracy of LapSVMs trained by solving the primal (Newton's method) or the dual problem. [sent-772, score-0.204]
79 Results on the labeled training set L are omitted since all algorithms correctly classify such few labeled training points. [sent-777, score-0.156]
80 Table 5: Newton's steps required to compute the solution of the primal Laplacian SVM problem. [sent-788, score-0.136]
81 The error rate of the primal solution computed by means of Newton’s method is reported as a horizontal line. [sent-790, score-0.173]
82 In the MNIST data, the elements of the kernel matrix not belonging to the main diagonal are very small due to the high degree of the polynomial kernel, so that the gradient and the preconditioned gradient are close. [sent-798, score-0.19]
83 The error rate of the primal solution computed by means of Newton’s method is reported as a horizontal line. [sent-800, score-0.173]
84 The error rate of the primal solution computed by means of Newton’s method is reported as a horizontal line. [sent-802, score-0.173]
85 Using the proposed PCG goal conditions (Section 4), we cross-validated the primal LapSVM classifier trained by PCG, and the selected parameters are reported in Table 10 of Appendix A. [sent-803, score-0.166]
86 The value of the objective function obj, of the gradient norm ‖∇‖, of the preconditioned gradient norm ‖∇̂‖, and of the mixed product ∇̂ᵀ∇ are displayed as a function of the number of PCG iterations (t). [sent-840, score-0.358]
87 In Table 6 the training times and the numbers of PCG and line search iterations are collected, whereas in Table 7 the corresponding classification error rates are reported, for a comparison with the non-approximated solution computed using Newton's method. [sent-843, score-0.105]
88 As already stressed, the training times drop appreciably when training a LapSVM in the primal using PCG and our goal conditions, independently of the data set. [sent-844, score-0.193]
89 Early stopping allows us to obtain results comparable to those of Newton's method or of the original two-step dual formulation, showing a direct correlation between the proposed goal conditions and the quality of the classifier. [sent-845, score-0.103]
90 Table 7: Average classification error (standard deviation is reported in brackets) of Laplacian SVMs trained in the primal by means of Newton's method (Newton) and of preconditioned conjugate gradient (PCG) with the proposed early stopping conditions (in square brackets). [sent-1175, score-0.409]
91 Results on the labeled training set L are omitted since all algorithms correctly classify such few labeled training points. [sent-1178, score-0.156]
92 The value of the objective function obj, of the gradient norm ‖∇‖, of the preconditioned gradient norm ‖∇̂‖, and of the mixed product ∇̂ᵀ∇ are displayed as a function of the number of PCG iterations (t). [sent-1214, score-0.358]
93 A better tuning of the goal conditions, or a different formulation of them, could move the accuracy closer to that of primal LapSVM trained with Newton's method, but this is beyond the scope of this paper. [sent-1217, score-0.144]
94 The number of iterations from the stability check is sometimes larger than the one from the validation check (COIL20(B), USPST, COIL20). [sent-1220, score-0.47]
95 As a matter of fact, labeled validation data is more informative than a stable, but unknown, decision on the unlabeled data. [sent-1221, score-0.155]
96 Table 8: Training time comparison among the Laplacian RLSCs trained by solving Equation 14 with matrix inversion and by means of preconditioned conjugate gradient (PCG) with the proposed early stopping conditions (in square brackets). [sent-1241, score-0.286]
97 A very fast solution can be achieved using preconditioned conjugate gradient coupled with an early stopping criterion based on the stability of the classifier decision. [sent-1251, score-0.335]
98 Other recently proposed optimization techniques could be investigated to solve the primal LapSVM problem, which we will address in future work. [sent-1257, score-0.119]
99 Table 12: Average classification error (standard deviation is reported in brackets) of Laplacian SVMs trained in the primal by means of Newton's method and of preconditioned conjugate gradient (PCG) with the proposed early stopping conditions (in square brackets). [sent-1690, score-0.409]
100 Results on the labeled training set L are omitted since all classifiers perfectly fit such few labeled training points. [sent-1694, score-0.156]
wordName wordTfidf (topN-words)
[('pcg', 0.79), ('lapsvm', 0.293), ('newton', 0.272), ('check', 0.146), ('uspst', 0.145), ('primal', 0.119), ('laprlsc', 0.102), ('preconditioned', 0.101), ('rlsc', 0.085), ('lapsvms', 0.081), ('ie', 0.079), ('stability', 0.077), ('elacci', 0.077), ('elkin', 0.077), ('facemit', 0.077), ('aplacian', 0.072), ('pcmac', 0.072), ('laplacian', 0.069), ('mixed', 0.068), ('validation', 0.062), ('rained', 0.062), ('rimal', 0.062), ('dual', 0.06), ('svm', 0.058), ('unlabeled', 0.052), ('manifold', 0.048), ('belkin', 0.048), ('klk', 0.043), ('chapelle', 0.041), ('labeled', 0.041), ('iterations', 0.039), ('gradient', 0.039), ('kie', 0.038), ('obj', 0.038), ('training', 0.037), ('ob', 0.036), ('early', 0.035), ('er', 0.035), ('intrinsic', 0.034), ('decoste', 0.034), ('keerthi', 0.033), ('conjugate', 0.033), ('stopping', 0.033), ('lk', 0.031), ('classi', 0.03), ('sindhwani', 0.029), ('hz', 0.027), ('shewchuk', 0.026), ('svms', 0.025), ('zt', 0.025), ('trained', 0.025), ('fd', 0.022), ('regularizer', 0.022), ('brackets', 0.022), ('iters', 0.021), ('inversion', 0.02), ('norm', 0.018), ('sensibly', 0.018), ('tmax', 0.018), ('solution', 0.017), ('moons', 0.017), ('lg', 0.017), ('ambient', 0.016), ('rosenberg', 0.016), ('err', 0.016), ('hessian', 0.015), ('hinge', 0.015), ('balancing', 0.014), ('stop', 0.014), ('normalized', 0.014), ('yi', 0.014), ('rate', 0.013), ('discriminates', 0.013), ('errv', 0.013), ('laprlscs', 0.013), ('rlscs', 0.013), ('error', 0.012), ('ls', 0.012), ('points', 0.012), ('reported', 0.012), ('piecewise', 0.012), ('non', 0.011), ('xi', 0.011), ('ir', 0.011), ('ers', 0.011), ('cg', 0.011), ('declared', 0.011), ('dii', 0.011), ('transductive', 0.011), ('niyogi', 0.01), ('original', 0.01), ('table', 0.01), ('expansion', 0.01), ('old', 0.01), ('joachims', 0.01), ('tsang', 0.01), ('selected', 0.01), ('switches', 0.01), ('clock', 0.01), ('transduction', 0.01)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 51 jmlr-2011-Laplacian Support Vector Machines Trained in the Primal
Author: Stefano Melacci, Mikhail Belkin
Abstract: In the last few years, due to the growing ubiquity of unlabeled data, much effort has been spent by the machine learning community to develop better understanding and improve the quality of classifiers exploiting unlabeled data. Following the manifold regularization approach, Laplacian Support Vector Machines (LapSVMs) have shown the state of the art performance in semi-supervised classification. In this paper we present two strategies to solve the primal LapSVM problem, in order to overcome some issues of the original dual formulation. In particular, training a LapSVM in the primal can be efficiently performed with preconditioned conjugate gradient. We speed up training by using an early stopping strategy based on the prediction on unlabeled data or, if available, on labeled validation examples. This allows the algorithm to quickly compute approximate solutions with roughly the same classification accuracy as the optimal ones, considerably reducing the training time. The computational complexity of the training algorithm is reduced from O(n3 ) to O(kn2 ), where n is the combined number of labeled and unlabeled examples and k is empirically evaluated to be significantly smaller than n. Due to its simplicity, training LapSVM in the primal can be the starting point for additional enhancements of the original LapSVM formulation, such as those for dealing with large data sets. We present an extensive experimental evaluation on real world data showing the benefits of the proposed approach. Keywords: Laplacian support vector machines, manifold regularization, semi-supervised learning, classification, optimization
2 0.055324234 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation
Author: Ryota Tomioka, Taiji Suzuki, Masashi Sugiyama
Abstract: We analyze the convergence behaviour of a recently proposed algorithm for regularized estimation called Dual Augmented Lagrangian (DAL). Our analysis is based on a new interpretation of DAL as a proximal minimization algorithm. We theoretically show under some conditions that DAL converges super-linearly in a non-asymptotic and global sense. Due to a special modelling of sparse estimation problems in the context of machine learning, the assumptions we make are milder and more natural than those made in conventional analysis of augmented Lagrangian algorithms. In addition, the new interpretation enables us to generalize DAL to wide varieties of sparse estimation problems. We experimentally confirm our analysis in a large scale ℓ1 -regularized logistic regression problem and extensively compare the efficiency of DAL algorithm to previously proposed algorithms on both synthetic and benchmark data sets. Keywords: dual augmented Lagrangian, proximal minimization, global convergence, sparse estimation, convex optimization
3 0.037812635 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
Author: Philippe Rigollet, Xin Tong
Abstract: Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss ϕ. Given a finite collection of classifiers, we combine them and obtain a new classifier that satisfies simultaneously the two following properties with high probability: (i) its ϕ-type I error is below a pre-specified level and (ii), it has ϕ-type II error close to the minimum possible. The proposed classifier is obtained by minimizing an empirical convex objective with an empirical convex constraint. The novelty of the method is that the classifier output by this computationally feasible program is shown to satisfy the original constraint on type I error. New techniques to handle such problems are developed and they have consequences on chance constrained programming. We also evaluate the price to pay in terms of type II error for being conservative on type I error. Keywords: binary classification, Neyman-Pearson paradigm, anomaly detection, empirical constraint, empirical risk minimization, chance constrained optimization
4 0.034951262 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
Author: Krishnakumar Balasubramanian, Pinar Donmez, Guy Lebanon
Abstract: Many popular linear classifiers, such as logistic regression, boosting, or SVM, are trained by optimizing a margin-based risk function. Traditionally, these risk functions are computed based on a labeled data set. We develop a novel technique for estimating such risks using only unlabeled data and the marginal label distribution. We prove that the proposed risk estimator is consistent on high-dimensional data sets and demonstrate it on synthetic and real-world data. In particular, we show how the estimate is used for evaluating classifiers in transfer learning, and for training classifiers with no labeled data whatsoever. Keywords: classification, large margin, maximum likelihood
5 0.033954613 105 jmlr-2011-lp-Norm Multiple Kernel Learning
Author: Marius Kloft, Ulf Brefeld, Sören Sonnenburg, Alexander Zien
Abstract: Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability and scalability. Unfortunately, this ℓ1 -norm MKL is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures that generalize well, we extend MKL to arbitrary norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary norms, that is ℓ p -norms with p ≥ 1. This interleaved optimization is much faster than the commonly used wrapper approaches, as demonstrated on several data sets. A theoretical analysis and an experiment on controlled artificial data shed light on the appropriateness of sparse, non-sparse and ℓ∞ -norm MKL in various scenarios. Importantly, empirical applications of ℓ p -norm MKL to three real-world problems from computational biology show that non-sparse MKL achieves accuracies that surpass the state-of-the-art. Data sets, source code to reproduce the experiments, implementations of the algorithms, and further information are available at http://doc.ml.tu-berlin.de/nonsparse_mkl/. Keywords: multiple kernel learning, learning kernels, non-sparse, support vector machine, convex conjugate, block coordinate descent, large scale optimization, bioinformatics, generalization bounds, Rademacher complexity ∗. Also at Machine Learning Group, Technische Universit¨ t Berlin, 10587 Berlin, Germany. a †. Parts of this work were done while SS was at the Friedrich Miescher Laboratory, Max Planck Society, 72076 T¨ bingen, Germany. u ‡. Most contributions by AZ were done at the Fraunhofer Institute FIRST, 12489 Berlin, Germany. c 2011 Marius Kloft, Ulf Brefeld, S¨ ren Sonnenburg and Alexander Zien. o K LOFT, B REFELD , S ONNENBURG AND Z IEN
6 0.026021695 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
7 0.0257826 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation
8 0.023505574 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
9 0.02309379 101 jmlr-2011-Variable Sparsity Kernel Learning
10 0.022865934 12 jmlr-2011-Bayesian Co-Training
11 0.022845032 66 jmlr-2011-Multiple Kernel Learning Algorithms
12 0.022226473 36 jmlr-2011-Generalized TD Learning
13 0.018848836 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership
14 0.018671269 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
15 0.01826629 58 jmlr-2011-Learning from Partial Labels
16 0.018213384 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
17 0.017245542 60 jmlr-2011-Locally Defined Principal Curves and Surfaces
18 0.01648761 35 jmlr-2011-Forest Density Estimation
19 0.016045352 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity
20 0.015701056 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
topicId topicWeight
[(0, 0.093), (1, 0.025), (2, 0.029), (3, -0.046), (4, -0.017), (5, -0.043), (6, -0.048), (7, -0.021), (8, 0.001), (9, -0.04), (10, -0.061), (11, 0.005), (12, 0.02), (13, -0.016), (14, -0.023), (15, 0.047), (16, -0.107), (17, -0.068), (18, -0.062), (19, 0.045), (20, 0.04), (21, 0.034), (22, 0.059), (23, 0.053), (24, -0.041), (25, 0.046), (26, -0.211), (27, 0.081), (28, -0.096), (29, 0.151), (30, 0.054), (31, 0.021), (32, -0.174), (33, 0.307), (34, -0.095), (35, 0.087), (36, -0.074), (37, 0.254), (38, 0.189), (39, 0.053), (40, -0.089), (41, 0.17), (42, -0.196), (43, -0.045), (44, 0.14), (45, -0.238), (46, 0.241), (47, -0.053), (48, -0.137), (49, 0.007)]
simIndex simValue paperId paperTitle
same-paper 1 0.96273541 51 jmlr-2011-Laplacian Support Vector Machines Trained in the Primal
Author: Stefano Melacci, Mikhail Belkin
Abstract: In the last few years, due to the growing ubiquity of unlabeled data, much effort has been spent by the machine learning community to develop better understanding and improve the quality of classifiers exploiting unlabeled data. Following the manifold regularization approach, Laplacian Support Vector Machines (LapSVMs) have shown the state of the art performance in semi-supervised classification. In this paper we present two strategies to solve the primal LapSVM problem, in order to overcome some issues of the original dual formulation. In particular, training a LapSVM in the primal can be efficiently performed with preconditioned conjugate gradient. We speed up training by using an early stopping strategy based on the prediction on unlabeled data or, if available, on labeled validation examples. This allows the algorithm to quickly compute approximate solutions with roughly the same classification accuracy as the optimal ones, considerably reducing the training time. The computational complexity of the training algorithm is reduced from O(n3 ) to O(kn2 ), where n is the combined number of labeled and unlabeled examples and k is empirically evaluated to be significantly smaller than n. Due to its simplicity, training LapSVM in the primal can be the starting point for additional enhancements of the original LapSVM formulation, such as those for dealing with large data sets. We present an extensive experimental evaluation on real world data showing the benefits of the proposed approach. Keywords: Laplacian support vector machines, manifold regularization, semi-supervised learning, classification, optimization
2 0.36175454 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
Author: Philippe Rigollet, Xin Tong
Abstract: Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss ϕ. Given a finite collection of classifiers, we combine them and obtain a new classifier that satisfies simultaneously the two following properties with high probability: (i) its ϕ-type I error is below a pre-specified level and (ii), it has ϕ-type II error close to the minimum possible. The proposed classifier is obtained by minimizing an empirical convex objective with an empirical convex constraint. The novelty of the method is that the classifier output by this computationally feasible program is shown to satisfy the original constraint on type I error. New techniques to handle such problems are developed and they have consequences on chance constrained programming. We also evaluate the price to pay in terms of type II error for being conservative on type I error. Keywords: binary classification, Neyman-Pearson paradigm, anomaly detection, empirical constraint, empirical risk minimization, chance constrained optimization
3 0.30337399 62 jmlr-2011-MSVMpack: A Multi-Class Support Vector Machine Package
Author: Fabien Lauer, Yann Guermeur
Abstract: This paper describes MSVMpack, an open source software package dedicated to our generic model of multi-class support vector machine. All four multi-class support vector machines (M-SVMs) proposed so far in the literature appear as instances of this model. MSVMpack provides for them the first unified implementation and offers a convenient basis to develop other instances. This is also the first parallel implementation for M-SVMs. The package consists in a set of command-line tools with a callable library. The documentation includes a tutorial, a user’s guide and a developer’s guide. Keywords: multi-class support vector machines, open source, C
4 0.2502785 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
Author: Krishnakumar Balasubramanian, Pinar Donmez, Guy Lebanon
Abstract: Many popular linear classifiers, such as logistic regression, boosting, or SVM, are trained by optimizing a margin-based risk function. Traditionally, these risk functions are computed based on a labeled data set. We develop a novel technique for estimating such risks using only unlabeled data and the marginal label distribution. We prove that the proposed risk estimator is consistent on high-dimensional data sets and demonstrate it on synthetic and real-world data. In particular, we show how the estimate is used for evaluating classifiers in transfer learning, and for training classifiers with no labeled data whatsoever. Keywords: classification, large margin, maximum likelihood
5 0.21629776 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation
Author: Amarnag Subramanya, Jeff Bilmes
Abstract: We describe a new objective for graph-based semi-supervised learning based on minimizing the Kullback-Leibler divergence between discrete probability measures that encode class membership probabilities. We show how the proposed objective can be efficiently optimized using alternating minimization. We prove that the alternating minimization procedure converges to the correct optimum and derive a simple test for convergence. In addition, we show how this approach can be scaled to solve the semi-supervised learning problem on very large data sets, for example, in one instance we use a data set with over 108 samples. In this context, we propose a graph node ordering algorithm that is also applicable to other graph-based semi-supervised learning approaches. We compare the proposed approach against other standard semi-supervised learning algorithms on the semi-supervised learning benchmark data sets (Chapelle et al., 2007), and other real-world tasks such as text classification on Reuters and WebKB, speech phone classification on TIMIT and Switchboard, and linguistic dialog-act tagging on Dihana and Switchboard. In each case, the proposed approach outperforms the state-of-the-art. Lastly, we show that our objective can be generalized into a form that includes the standard squared-error loss, and we prove a geometric rate of convergence in that case. Keywords: graph-based semi-supervised learning, transductive inference, large-scale semi-supervised learning, non-parametric models
6 0.19788922 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
7 0.19030756 94 jmlr-2011-Theoretical Analysis of Bayesian Matrix Factorization
8 0.17598201 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation
9 0.16324043 12 jmlr-2011-Bayesian Co-Training
10 0.1621269 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity
11 0.16096845 76 jmlr-2011-Parameter Screening and Optimisation for ILP using Designed Experiments
12 0.14881302 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
13 0.14823778 32 jmlr-2011-Exploitation of Machine Learning Techniques in Modelling Phrase Movements for Machine Translation
14 0.13804445 27 jmlr-2011-Domain Decomposition Approach for Fast Gaussian Process Regression of Large Spatial Data Sets
15 0.12776132 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
16 0.12711166 105 jmlr-2011-lp-Norm Multiple Kernel Learning
17 0.12597139 81 jmlr-2011-Robust Approximate Bilinear Programming for Value Function Approximation
18 0.11559691 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms
19 0.10786408 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
20 0.10760689 36 jmlr-2011-Generalized TD Learning
topicId topicWeight
[(4, 0.033), (9, 0.058), (10, 0.029), (11, 0.011), (12, 0.423), (24, 0.03), (31, 0.117), (32, 0.015), (41, 0.023), (60, 0.013), (71, 0.012), (73, 0.031), (78, 0.062), (90, 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.71693552 51 jmlr-2011-Laplacian Support Vector Machines Trained in the Primal
Author: Stefano Melacci, Mikhail Belkin
Abstract: In the last few years, due to the growing ubiquity of unlabeled data, much effort has been spent by the machine learning community to develop better understanding and improve the quality of classifiers exploiting unlabeled data. Following the manifold regularization approach, Laplacian Support Vector Machines (LapSVMs) have shown the state of the art performance in semi-supervised classification. In this paper we present two strategies to solve the primal LapSVM problem, in order to overcome some issues of the original dual formulation. In particular, training a LapSVM in the primal can be efficiently performed with preconditioned conjugate gradient. We speed up training by using an early stopping strategy based on the prediction on unlabeled data or, if available, on labeled validation examples. This allows the algorithm to quickly compute approximate solutions with roughly the same classification accuracy as the optimal ones, considerably reducing the training time. The computational complexity of the training algorithm is reduced from O(n3 ) to O(kn2 ), where n is the combined number of labeled and unlabeled examples and k is empirically evaluated to be significantly smaller than n. Due to its simplicity, training LapSVM in the primal can be the starting point for additional enhancements of the original LapSVM formulation, such as those for dealing with large data sets. We present an extensive experimental evaluation on real world data showing the benefits of the proposed approach. Keywords: Laplacian support vector machines, manifold regularization, semi-supervised learning, classification, optimization
2 0.56685925 59 jmlr-2011-Learning with Structured Sparsity
Author: Junzhou Huang, Tong Zhang, Dimitris Metaxas
Abstract: This paper investigates a learning formulation called structured sparsity, which is a natural extension of the standard sparsity concept in statistical learning and compressive sensing. By allowing arbitrary structures on the feature set, this concept generalizes the group sparsity idea that has become popular in recent years. A general theory is developed for learning with structured sparsity, based on the notion of coding complexity associated with the structure. It is shown that if the coding complexity of the target signal is small, then one can achieve improved performance by using coding complexity regularization methods, which generalize the standard sparse regularization. Moreover, a structured greedy algorithm is proposed to efficiently solve the structured sparsity problem. It is shown that the greedy algorithm approximately solves the coding complexity optimization problem under appropriate conditions. Experiments are included to demonstrate the advantage of structured sparsity over standard sparsity on some real applications. Keywords: structured sparsity, standard sparsity, group sparsity, tree sparsity, graph sparsity, sparse learning, feature selection, compressive sensing
3 0.33191001 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
Author: Graham W. Taylor, Geoffrey E. Hinton, Sam T. Roweis
Abstract: In this paper we develop a class of nonlinear generative models for high-dimensional time series. We first propose a model based on the restricted Boltzmann machine (RBM) that uses an undirected model with binary latent variables and real-valued “visible” variables. The latent and visible variables at each time step receive directed connections from the visible variables at the last few time-steps. This “conditional” RBM (CRBM) makes on-line inference efficient and allows us to use a simple approximate learning procedure. We demonstrate the power of our approach by synthesizing various sequences from a model trained on motion capture data and by performing on-line filling in of data lost during capture. We extend the CRBM in a way that preserves its most important computational properties and introduces multiplicative three-way interactions that allow the effective interaction weight between two variables to be modulated by the dynamic state of a third variable. We introduce a factoring of the implied three-way weight tensor to permit a more compact parameterization. The resulting model can capture diverse styles of motion with a single set of parameters, and the three-way interactions greatly improve its ability to blend motion styles or to transition smoothly among them. Videos and source code can be found at http://www.cs.nyu.edu/˜gwtaylor/publications/ jmlr2011. Keywords: unsupervised learning, restricted Boltzmann machines, time series, generative models, motion capture
4 0.32981172 101 jmlr-2011-Variable Sparsity Kernel Learning
Author: Jonathan Aflalo, Aharon Ben-Tal, Chiranjib Bhattacharyya, Jagarlapudi Saketha Nath, Sankaran Raman
Abstract: paper1 This presents novel algorithms and applications for a particular class of mixed-norm regularization based Multiple Kernel Learning (MKL) formulations. The formulations assume that the given kernels are grouped and employ l1 norm regularization for promoting sparsity within RKHS norms of each group and ls , s ≥ 2 norm regularization for promoting non-sparse combinations across groups. Various sparsity levels in combining the kernels can be achieved by varying the grouping of kernels—hence we name the formulations as Variable Sparsity Kernel Learning (VSKL) formulations. While previous attempts have a non-convex formulation, here we present a convex formulation which admits efficient Mirror-Descent (MD) based solving techniques. The proposed MD based algorithm optimizes over product of simplices and has a computational complexity of O m2 ntot log nmax /ε2 where m is no. training data points, nmax , ntot are the maximum no. kernels in any group, total no. kernels respectively and ε is the error in approximating the objective. A detailed proof of convergence of the algorithm is also presented. Experimental results show that the VSKL formulations are well-suited for multi-modal learning tasks like object categorization. Results also show that the MD based algorithm outperforms state-of-the-art MKL solvers in terms of computational efficiency. Keywords: multiple kernel learning, mirror descent, mixed-norm, object categorization, scalability 1. All authors contributed equally. The author names appear in alphabetical order. c 2011 Jonathan Aflalo, Aharon Ben-Tal, Chiranjib Bhattacharyya, Jagarlapudi Saketha Nath and Sankaran Raman. A FLALO , B EN -TAL , B HATTACHARYYA , NATH AND R AMAN
5 0.32566851 95 jmlr-2011-Training SVMs Without Offset
Author: Ingo Steinwart, Don Hush, Clint Scovel
Abstract: We develop, analyze, and test a training algorithm for support vector machine classifiers without offset. Key features of this algorithm are a new, statistically motivated stopping criterion, new warm start options, and a set of inexpensive working set selection strategies that significantly reduce the number of iterations. For these working set strategies, we establish convergence rates that, not surprisingly, coincide with the best known rates for SVMs with offset. We further conduct various experiments that investigate both the run time behavior and the performed iterations of the new training algorithm. It turns out, that the new algorithm needs significantly less iterations and also runs substantially faster than standard training algorithms for SVMs with offset. Keywords: support vector machines, decomposition algorithms
6 0.32352495 105 jmlr-2011-lp-Norm Multiple Kernel Learning
7 0.32278147 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
8 0.31962052 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
9 0.3176145 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
10 0.31752872 12 jmlr-2011-Bayesian Co-Training
11 0.3173027 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
12 0.31532478 66 jmlr-2011-Multiple Kernel Learning Algorithms
13 0.31409317 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation
14 0.31371635 26 jmlr-2011-Distance Dependent Chinese Restaurant Processes
15 0.31295156 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
16 0.31212258 91 jmlr-2011-The Sample Complexity of Dictionary Learning
17 0.31204909 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
18 0.31108204 16 jmlr-2011-Clustering Algorithms for Chains
19 0.30920595 48 jmlr-2011-Kernel Analysis of Deep Networks
20 0.30846488 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity