jmlr jmlr2008 jmlr2008-86 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Alain Rakotomamonjy, Francis R. Bach, Stéphane Canu, Yves Grandvalet
Abstract: Multiple kernel learning (MKL) aims at simultaneously learning a kernel and the associated predictor in supervised learning settings. For the support vector machine, an efficient and general multiple kernel learning algorithm, based on semi-infinite linear programming, has been recently proposed. This approach has opened new perspectives since it makes MKL tractable for large-scale problems, by iteratively using existing support vector machine code. However, it turns out that this iterative algorithm needs numerous iterations for converging towards a reasonable solution. In this paper, we address the MKL problem through a weighted 2-norm regularization formulation with an additional constraint on the weights that encourages sparse kernel combinations. Apart from learning the combination, we solve a standard SVM optimization problem, where the kernel is defined as a linear combination of multiple kernels. We propose an algorithm, named SimpleMKL, for solving this MKL problem and provide a new insight on MKL algorithms based on mixed-norm regularization by showing that the two approaches are equivalent. We show how SimpleMKL can be applied beyond binary classification, for problems like regression, clustering (one-class classification) or multiclass classification. Experimental results show that the proposed algorithm converges rapidly and that its efficiency compares favorably to other MKL algorithms. Finally, we illustrate the usefulness of MKL for some regressors based on wavelet kernels and on some model selection problems related to multiclass classification problems. Keywords: multiple kernel learning, support vector machines, support vector regression, multiclass SVM, gradient descent. ∗ Also at Heudiasyc, CNRS/Université de Technologie de Compiègne (UMR 6599), 60205 Compiègne, France. © 2008 Alain Rakotomamonjy, Francis R. Bach, Stéphane Canu and Yves Grandvalet.
Reference: text
sentIndex sentText sentNum sentScore
1 Idiap Research Institute, Centre du Parc, 1920 Martigny, Switzerland.∗ Editor: Nello Cristianini. Abstract: Multiple kernel learning (MKL) aims at simultaneously learning a kernel and the associated predictor in supervised learning settings. [sent-10, score-0.252]
2 In this paper, we address the MKL problem through a weighted 2-norm regularization formulation with an additional constraint on the weights that encourages sparse kernel combinations. [sent-14, score-0.236]
3 Apart from learning the combination, we solve a standard SVM optimization problem, where the kernel is defined as a linear combination of multiple kernels. [sent-15, score-0.175]
4 Finally, we illustrate the usefulness of MKL for some regressors based on wavelet kernels and on some model selection problems related to multiclass classification problems. [sent-19, score-0.277]
5 Keywords: multiple kernel learning, support vector machines, support vector regression, multiclass SVM, gradient descent ∗. [sent-20, score-0.409]
6 In kernel methods, the data representation is implicitly chosen through the so-called kernel K(x, x′). [sent-27, score-0.252]
7 This kernel actually plays two roles: it defines the similarity between two examples x and x′, while defining an appropriate regularization term for the learning problem. [sent-28, score-0.182]
8 For kernel algorithms, the solution of the learning problem is of the form f(x) = ∑_{i=1}^{n} α_i K(x, x_i) + b, (1) where α_i and b are some coefficients to be learned from examples, while K(·, ·) is a given positive definite kernel associated with a reproducing kernel Hilbert space (RKHS) H. [sent-30, score-0.396]
9 Recent applications have shown that using multiple kernels instead of a single one can enhance the interpretability of the decision function and improve performances (Lanckriet et al. [sent-32, score-0.18]
10 In such cases, a convenient approach is to consider that the kernel K(x, x′) is actually a convex combination of basis kernels: K(x, x′) = ∑_{m=1}^{M} d_m K_m(x, x′), with d_m ≥ 0 and ∑_{m=1}^{M} d_m = 1, where M is the total number of kernels. [sent-34, score-1.573]
11 Alternatively, the kernels Km can simply be classical kernels (such as Gaussian kernels) with different parameters. [sent-37, score-0.26]
12 Within this framework, the problem of data representation through the kernel is then transferred to the choice of weights dm . [sent-38, score-0.62]
13 Learning both the coefficients αi and the weights dm in a single optimization problem is known as the multiple kernel learning (MKL) problem. [sent-39, score-0.669]
14 Indeed, we replace the mixed-norm regularization by a weighted ℓ2-norm regularization, where the sparsity of the linear combination of kernels is controlled by an ℓ1-norm constraint on the kernel weights. [sent-61, score-0.312]
15 Indeed, our algorithm is simple, essentially based on a gradient descent on the SVM objective value. [sent-68, score-0.208]
16 We iteratively determine the combination of kernels by a gradient descent wrapping a standard SVM solver, which is SimpleSVM in our case. [sent-69, score-0.288]
17 However, they differ in that we use reduced gradient descent in the primal, whereas Sonnenburg et al. [sent-72, score-0.158]
18 For instance, in addition to the empirical efficiency comparison, we also show, in an SVM regression problem involving wavelet kernels, that automatic learning of the kernels leads to far better performance. [sent-79, score-0.204]
19 1 Functional Framework Before entering into the details of the MKL optimization problem, we first present the functional framework adopted for multiple kernel learning. [sent-95, score-0.175]
20 For any m, let d_m be a non-negative coefficient and H′_m be the Hilbert space derived from H_m as follows: H′_m = { f | f ∈ H_m : ‖f‖_{H_m} / d_m < ∞ }, endowed with the inner product ⟨f, g⟩_{H′_m} = (1/d_m) ⟨f, g⟩_{H_m}. [sent-100, score-1.425]
21 This means that, if d_m = 0, then a function f belongs to the Hilbert space H′_m only if f = 0. [sent-102, score-0.475]
22 Within this framework, H′_m is an RKHS with kernel K(x, x′) = d_m K_m(x, x′), since ∀ f ∈ H′_m ⊆ H_m, f(x) = ⟨f(·), K_m(x, ·)⟩_{H_m} = (1/d_m) ⟨f(·), d_m K_m(x, ·)⟩_{H_m} = ⟨f(·), d_m K_m(x, ·)⟩_{H′_m}. [sent-104, score-2.026]
23 Now, if we define H as the direct sum of the spaces H′_m, that is, H = ⊕_{m=1}^{M} H′_m, then a classical result on RKHS (Aronszajn, 1950) says that H is an RKHS with kernel K(x, x′) = ∑_{m=1}^{M} d_m K_m(x, x′). [sent-105, score-0.601]
24 In the MKL framework, one looks for a decision function of the form f (x) + b = ∑m fm (x) + b, where each function f m belongs to a different RKHS Hm associated with a kernel Km . [sent-114, score-0.208]
25 min_{{f_m}, b, ξ, d} (1/2) ∑_m (1/d_m) ‖f_m‖²_{H_m} + C ∑_i ξ_i, subject to y_i ∑_m f_m(x_i) + y_i b ≥ 1 − ξ_i ∀i, ξ_i ≥ 0 ∀i, ∑_m d_m = 1, d_m ≥ 0 ∀m, (2) where each d_m controls the squared norm of f_m in the objective function. [sent-118, score-2.176]
26 The smaller d_m is, the smoother f_m (as measured by ‖f_m‖_{H_m}) should be. [sent-119, score-0.475]
27 When d_m = 0, ‖f_m‖_{H_m} also has to be equal to zero to yield a finite objective value. [sent-120, score-0.607]
28 The ℓ1-norm constraint on the vector d is a sparsity constraint that will force some d_m to be zero, thus encouraging sparse basis kernel expansions. [sent-121, score-0.601]
29 min_{{f_m}, b, ξ} (1/2) (∑_m ‖f_m‖_{H_m})² + C ∑_i ξ_i, subject to y_i ∑_m f_m(x_i) + y_i b ≥ 1 − ξ_i ∀i, ξ_i ≥ 0 ∀i. (3) [sent-127, score-0.226]
30 Note that the objective function of this problem is not smooth since ‖f_m‖_{H_m} is not differentiable at f_m = 0. [sent-128, score-0.168]
31 However, what makes this formulation interesting is that the mixed-norm penalization of f = ∑m fm is a soft-thresholding penalizer that leads to a sparse solution, for which the algorithm performs kernel selection (Bach, 2008). [sent-129, score-0.243]
32 When setting to zero the gradient of the Lagrangian with respect to the primal variables, we get the following: (a) (1/d_m) f_m(·) = ∑_i α_i y_i K_m(·, x_i) ∀m; (b) ∑_i α_i y_i = 0; (c) C − α_i − ν_i = 0 ∀i; (d) −‖f_m‖²_{H_m} / (2 d_m²) + λ − η_m = 0 ∀m. (5) [sent-141, score-1.237]
33 We note again here that f m (·) has to go to 0 if the coefficient dm vanishes. [sent-142, score-0.475]
34 Instead of using an alternating optimization algorithm, we prefer to consider here the following constrained optimization problem: min_d J(d) such that ∑_{m=1}^{M} d_m = 1, d_m ≥ 0, (7) where J(d) is the optimal value of the inner problem below. [sent-160, score-0.988]
35 J(d) = min_{{f_m}, b, ξ} (1/2) ∑_m (1/d_m) ‖f_m‖²_{H_m} + C ∑_i ξ_i, subject to y_i ∑_m f_m(x_i) + y_i b ≥ 1 − ξ_i ∀i, ξ_i ≥ 0 ∀i. (8) [sent-162, score-0.701]
36 By setting to zero the derivatives of this Lagrangian according to the primal variables, we get conditions (5) (a) to (c), from which we derive the associated dual problem: max_α −(1/2) ∑_{i,j} α_i α_j y_i y_j ∑_m d_m K_m(x_i, x_j) + ∑_i α_i, (9) with ∑_i α_i y_i = 0 and C ≥ α_i ≥ 0 ∀i. [sent-168, score-0.634]
37 The Lanckriet et al. (2004a) formulation differs slightly, in that the kernels are weighted by some pre-defined coefficients that were not considered here. [sent-170, score-0.165]
38 which is identified as the standard SVM dual formulation using the combined kernel K(x_i, x_j) = ∑_m d_m K_m(x_i, x_j). [sent-171, score-0.684]
39 Because of strong duality, J(d) is also the objective value of the dual problem: J(d) = −(1/2) ∑_{i,j} α⋆_i α⋆_j y_i y_j ∑_m d_m K_m(x_i, x_j) + ∑_i α⋆_i, (10) where α⋆ maximizes (9). [sent-173, score-0.604]
40 Thus, by simple differentiation of the dual function (9) with respect to d_m, we have: ∂J/∂d_m = −(1/2) ∑_{i,j} α_i α_j y_i y_j K_m(x_i, x_j) ∀m. [sent-188, score-0.554]
41 With our positivity assumption on the kernel matrices, J(·) is convex and differentiable with Lipschitz gradient (Lemaréchal and Sagastizabal, 1997). [sent-193, score-0.285]
42 Once the gradient of J(d) is computed, d is updated by using a descent direction ensuring that the equality constraint and the non-negativity constraints on d are satisfied. [sent-195, score-0.177]
43 However, if there is an index m such that dm = 0 and [∇red J]m > 0, using this direction would violate the positivity constraint for d m . [sent-202, score-0.494]
44 This gives the descent direction for updating d as: D_m = 0 if d_m = 0 and ∂J/∂d_m − ∂J/∂d_µ > 0; D_m = −∂J/∂d_m + ∂J/∂d_µ if d_m > 0 and m ≠ µ; D_µ = ∑_{ν≠µ, d_ν>0} (∂J/∂d_ν − ∂J/∂d_µ) for m = µ. (12) [sent-204, score-1.051]
45 Here, as detailed in Algorithm 1, we go one step beyond: once a descent direction D has been computed, we first look for the maximal admissible step size in that direction and check whether the objective value decreases or not. [sent-206, score-0.17]
46 From the primal and dual objectives provided respectively in (2) and (6), the MKL duality gap is DualGap = J(d⋆) − ∑_i α⋆_i + (1/2) max_m ∑_{i,j} α⋆_i α⋆_j y_i y_j K_m(x_i, x_j). Algorithm 1 (SimpleMKL) initializes d_m = 1/M for m = 1, ..., M. [sent-224, score-0.713]
47 Then, while the stopping criterion is not met, it computes J(d) by using an SVM solver with K = ∑_m d_m K_m and computes ∂J/∂d_m for m = 1, ..., M. [sent-227, score-0.551]
48 If J(d⋆) has been obtained through the dual problem (9), then this MKL duality gap can also be computed from the single-kernel SVM algorithm duality gap DG_SVM. [sent-232, score-0.394]
49 Otherwise, we have DG_SVM = J(d⋆) + (1/2) ∑_{i,j} α_i α_j y_i y_j ∑_m d_m K_m(x_i, x_j) − ∑_i α_i, and then the MKL duality gap becomes DualGap = DG_SVM − (1/2) ∑_{i,j} α_i α_j y_i y_j ∑_m d_m K_m(x_i, x_j) + (1/2) max_m ∑_{i,j} α_i α_j y_i y_j K_m(x_i, x_j). [sent-234, score-1.153]
50 Consequently, we terminate the algorithm when max_m ∑_{i,j} α_i α_j y_i y_j K_m(x_i, x_j) − ∑_{i,j} α_i α_j y_i y_j ∑_m d_m K_m(x_i, x_j) ≤ ε. [sent-238, score-0.537]
51 For the general MKL problem (7), the first-order optimality conditions are obtained through the KKT conditions: ∂J/∂d_m + λ − η_m = 0 ∀m, and η_m · d_m = 0 ∀m, where λ and {η_m} are respectively the Lagrange multipliers for the equality and inequality constraints of (7). [sent-241, score-0.498]
52 These KKT conditions imply ∂J/∂d_m = −λ if d_m > 0, and ∂J/∂d_m ≥ −λ if d_m = 0. [sent-242, score-0.95]
53 Let’s define dJ_min = min_{m : d_m > 0} ∂J/∂d_m and dJ_max = max_{m : d_m > 0} ∂J/∂d_m; then, the necessary optimality conditions are approximated by the following termination conditions: |dJ_min − dJ_max| ≤ ε and ∂J/∂d_m ≥ dJ_max if d_m = 0. [sent-245, score-0.498]
54 They, however, differ in computational efficiency, because the kernel weights dm are optimized in quite different ways, as detailed below. [sent-256, score-0.62]
55 Let us first recall that our differentiable function J(d) is defined as: J(d) = max_α −(1/2) ∑_{i,j} α_i α_j y_i y_j ∑_m d_m K_m(x_i, x_j) + ∑_i α_i, with ∑_i α_i y_i = 0 and C ≥ α_i ≥ 0 ∀i; both algorithms aim at minimizing this differentiable function. [sent-257, score-0.609]
56 Figure 1: Illustrating three iterations of the SILP algorithm and a gradient descent algorithm for a one-dimensional problem. [sent-259, score-0.158]
57 Provided the step size is chosen correctly, gradient descent converges directly towards the optimal solution without overshooting (from d0 to d ). [sent-262, score-0.158]
58 Figure 1 illustrates the behaviour of both algorithms in a simple case, with oscillations for cutting planes and direct convergence for gradient descent. [sent-270, score-0.197]
59 , 1997; Schölkopf and Smola, 2001), we have J(d) = min_{{f_m}, b, ξ, ξ*} (1/2) ∑_m (1/d_m) ‖f_m‖²_{H_m} + C ∑_i (ξ_i + ξ*_i), subject to [sent-293, score-0.639]
60 y_i − ∑_m f_m(x_i) − b ≤ ε + ξ_i ∀i, ∑_m f_m(x_i) + b − y_i ≤ ε + ξ*_i ∀i, ξ_i ≥ 0, ξ*_i ≥ 0 ∀i, (14) and for one-class SVMs (Schölkopf and Smola, 2001), we have: [sent-295, score-0.226]
61 J(d) = min_{{f_m}, b, ξ} (1/2) ∑_m (1/d_m) ‖f_m‖²_{H_m} + (1/(ν n)) ∑_i ξ_i − b, subject to ∑_m f_m(x_i) ≥ b − ξ_i ∀i, ξ_i ≥ 0 ∀i. [sent-297, score-0.557]
62 For instance, multiple kernel learning has been transposed to least-square fitting and logistic regression (Bach et al. [sent-332, score-0.178]
63 Some of them are based on projected gradient or coordinate descent algorithms. [sent-337, score-0.158]
64 The first point is to show that our gradient descent algorithm is efficient. [sent-361, score-0.158]
65 The examples we use are based on wavelet-based regression in which the multiple kernel learning framework naturally fits. [sent-365, score-0.178]
66 The final experiment aims at evaluating the multiple kernel approach in a model selection problem for some multiclass problems. [sent-366, score-0.251]
67 The candidate kernels are: • Gaussian kernels with 10 different bandwidths σ, on all variables and on each single variable; • polynomial kernels of degree 1 to 3, again on all and each single variable. [sent-378, score-0.39]
68 Figure 2: Evolution of the three largest weights d_m for SimpleMKL and SILP; left row: Pima; right row: Ionosphere. [sent-400, score-0.494]
69 The updates of dm based on the descent algorithm of SimpleMKL are rather conservative (small steps departing from 1/M for all dm ), whereas the oscillations of cutting planes are likely to favor extreme solutions, hitting the edges of the simplex. [sent-417, score-1.153]
70 5 Table 1: Average performance measures for the two MKL algorithms and a plain gradient descent algorithm. [sent-512, score-0.158]
71 4, when the number of kernels is large, computing the gradient may be expensive compared to SVM retraining with warm-start techniques. [sent-517, score-0.206]
72 On the one hand, the large variations in successive d_m values for SILP entail that subsequent SVM problems are not likely to have similar solutions: a warm-start call to the SVM solver does not help much. [sent-519, score-0.521]
73 On the other hand, with the smooth trajectories of d_m in SimpleMKL, the previous SVM solution is often a good guess for the next SVM problem. [sent-520, score-0.475]
74 Table 1 also shows the results obtained when replacing the update scheme described in Algorithm 1 by a usual reduced gradient update, which, at each iteration, modifies d by computing the optimal step size on the descent direction D (12). [sent-524, score-0.177]
75 We see that the gradient descent updates require many more calls to the SVM solver and a number of gradient computations comparable with SILP. [sent-526, score-0.299]
76 Figure 4: Regularization paths of some d_m and the number of selected kernels versus C; left row: Pima; right row: Wpbc. [sent-556, score-0.605]
77 Figure 4 shows the variations of the number of selected kernels and the values of d along the regularization path for the Pima and Wpbc data sets. [sent-563, score-0.206]
78 The number of kernels is not a monotone function of C: for small values of C, the number of kernels is somewhat constant, then, it rises rapidly. [sent-564, score-0.26]
79 We first look at the situation where kernel matrices can be pre-computed and stored in memory, before reporting experiments where the memory requirements are too high, leading to repeated kernel evaluations. [sent-583, score-0.252]
80 For these experiments, the stopping criterion is based on the variation of the weights dm . [sent-600, score-0.524]
81 As shown in Figure 2, the kernel weights rapidly reach a steady state, and many iterations are spent fine-tuning the weights to reach the duality gap tolerance. [sent-601, score-0.255]
82 2 Multiple Kernel Regression Examples Several research papers have already claimed that using multiple kernel learning can lead to better generalization performances in some classification problems (Lanckriet et al. [sent-612, score-0.176]
83 Furthermore, according to Equation (14), we look for the regression function f (x) as a linear combination of functions each belonging to a wavelet based reproducing kernel Hilbert space. [sent-620, score-0.2]
84 The algorithm we use is a classical SVM regression algorithm with multiple kernels where each kernel is built from a set of wavelets. [sent-621, score-0.308]
85 These kernels have been obtained according to the expression: K(x, x′) = ∑_j ∑_s (1/2^j) ψ_{j,s}(x) ψ_{j,s}(x′), where ψ(·) is a mother wavelet and j, s are respectively the dilation and translation parameters of the wavelet ψ_{j,s}(·). [sent-622, score-0.279]
86 Our hope when using multiple kernel learning in this context is to capture the multiscale structure of the target function. [sent-626, score-0.187]
87 Furthermore, such a kernel has to be built according to the multiscale structure we wish to capture. [sent-628, score-0.157]
88 In this experiment, we have used three different choices of multiple kernels setting. [sent-629, score-0.16]
89 According to these settings, we have 10 dilation kernels and 48 dilation-translation kernels. [sent-636, score-0.175]
90 The first column gives the performance of an SVM regression using a single kernel, which is the average of all the kernels used for the two other results. [sent-748, score-0.278]
91 As expected, using a multiple kernel learning setting outperforms the single kernel setting especially when the target function presents multiscale structure. [sent-764, score-0.313]
92 Interestingly, for these two signals, performances of the multiple kernel settings also depend on the signal structure. [sent-766, score-0.202]
93 We also notice that for the Blocks signal using multiple kernels only slightly improves performance compared to a single kernel. [sent-772, score-0.186]
94 3 Multiclass Problem. For selecting the kernel and regularization parameter of an SVM, one usually tries all pairs of parameters and picks the pair that achieves the best cross-validation performance. [sent-810, score-0.182]
95 Using an MKL approach, one can instead let the algorithm combine all available kernels (obtained by sampling the parameter) and just select the regularization parameter by cross-validation. [sent-811, score-0.186]
96 In our MKL one-against-all approach, we have used a polynomial kernel of degree 1 to 3 and Gaussian kernel for which σ belongs to [0. [sent-822, score-0.252]
97 We can see that the generalization performance of an MKL approach is either similar to or better than the performance obtained when selecting the kernel through cross-validation, even though we have roughly searched the kernel and regularization parameter space. [sent-831, score-0.328]
98 The main added value of the smoothness of our new objective function is that descent methods become practical and efficient means to solve the optimization problem that wraps a single kernel SVM solver. [sent-842, score-0.277]
99 This efficiency makes it possible to demonstrate the usefulness of our algorithm on wavelet-kernel-based regression. [sent-847, score-0.178]
100 Possible extensions of this work include other learning problems, such as semi-supervised learning or kernel eigenvalue problems like kernel Fisher discriminant analysis. [sent-849, score-0.252]
wordName wordTfidf (topN-words)
[('mkl', 0.478), ('dm', 0.475), ('simplemkl', 0.389), ('silp', 0.248), ('svm', 0.166), ('bach', 0.148), ('km', 0.133), ('hm', 0.132), ('kernels', 0.13), ('kernel', 0.126), ('akotomamonjy', 0.112), ('anu', 0.112), ('randvalet', 0.112), ('imple', 0.102), ('multiclass', 0.095), ('descent', 0.082), ('fm', 0.082), ('gradient', 0.076), ('eval', 0.075), ('lanckriet', 0.073), ('canu', 0.068), ('duality', 0.067), ('rakotomamonjy', 0.063), ('bonnans', 0.057), ('sonnenburg', 0.057), ('regularization', 0.056), ('planes', 0.052), ('wavelet', 0.052), ('objective', 0.05), ('primal', 0.049), ('dual', 0.048), ('solver', 0.046), ('grandvalet', 0.045), ('dilation', 0.045), ('linchirp', 0.045), ('rouen', 0.045), ('pima', 0.043), ('gap', 0.043), ('wave', 0.04), ('dgsvm', 0.037), ('wpbc', 0.037), ('cutting', 0.037), ('differentiability', 0.037), ('differentiable', 0.036), ('formulation', 0.035), ('oscillations', 0.032), ('duan', 0.031), ('multiscale', 0.031), ('yi', 0.031), ('stopping', 0.03), ('alain', 0.03), ('djmax', 0.03), ('insa', 0.03), ('loosli', 0.03), ('multiple', 0.03), ('kkt', 0.028), ('zien', 0.027), ('signal', 0.026), ('ionosphere', 0.026), ('ong', 0.026), ('jmin', 0.025), ('chal', 0.025), ('lemar', 0.025), ('dk', 0.024), ('rkhs', 0.024), ('optimality', 0.023), ('abe', 0.022), ('amato', 0.022), ('antoniadis', 0.022), ('djmin', 0.022), ('litis', 0.022), ('luenberger', 0.022), ('simplesvm', 0.022), ('spamdata', 0.022), ('yves', 0.022), ('regression', 0.022), ('lagrange', 0.022), ('lagrangian', 0.022), ('convex', 0.022), ('cpu', 0.021), ('performances', 0.02), ('path', 0.02), ('sonar', 0.019), ('convexity', 0.019), ('direction', 0.019), ('phane', 0.019), ('jmax', 0.019), ('harchaoui', 0.019), ('liver', 0.019), ('owing', 0.019), ('optimization', 0.019), ('calls', 0.019), ('weights', 0.019), ('lasso', 0.019), ('xi', 0.018), ('francis', 0.018), ('wavelets', 0.018), ('keerthi', 0.018), ('trend', 0.017), ('forthcoming', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 86 jmlr-2008-SimpleMKL
2 0.24565193 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
Author: Jieping Ye, Shuiwang Ji, Jianhui Chen
Abstract: Regularized kernel discriminant analysis (RKDA) performs linear discriminant analysis in the feature space via the kernel trick. Its performance depends on the selection of kernels. In this paper, we consider the problem of multiple kernel learning (MKL) for RKDA, in which the optimal kernel matrix is obtained as a linear combination of pre-specified kernel matrices. We show that the kernel learning problem in RKDA can be formulated as convex programs. First, we show that this problem can be formulated as a semidefinite program (SDP). Based on the equivalence relationship between RKDA and least square problems in the binary-class case, we propose a convex quadratically constrained quadratic programming (QCQP) formulation for kernel learning in RKDA. A semi-infinite linear programming (SILP) formulation is derived to further improve the efficiency. We extend these formulations to the multi-class case based on a key result established in this paper. That is, the multi-class RKDA kernel learning problem can be decomposed into a set of binary-class kernel learning problems which are constrained to share a common kernel. Based on this decomposition property, SDP formulations are proposed for the multi-class case. Furthermore, it leads naturally to QCQP and SILP formulations. As the performance of RKDA depends on the regularization parameter, we show that this parameter can also be optimized in a joint framework with the kernel. Extensive experiments have been conducted and analyzed, and connections to other algorithms are discussed. Keywords: model selection, kernel discriminant analysis, semidefinite programming, quadratically constrained quadratic programming, semi-infinite linear programming
3 0.085213639 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
Author: Mikio L. Braun, Joachim M. Buhmann, Klaus-Robert Müller
Abstract: We show that the relevant information of a supervised learning problem is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem in the sense that it can asymptotically represent the function to be learned and is sufficiently smooth. Thus, kernels do not only transform data sets such that good generalization can be achieved using only linear discriminant functions, but this transformation is also performed in a manner which makes economical use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data for supervised learning problems. Practically, we propose an algorithm which enables us to recover the number of leading kernel PCA components relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to aid in model selection, and (3) to denoise in feature space in order to yield better classification results. Keywords: kernel methods, feature space, dimension reduction, effective dimensionality
4 0.080545776 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
Author: Matthias W. Seeger
Abstract: We propose a highly efficient framework for penalized likelihood kernel methods applied to multiclass models with a large, structured set of classes. As opposed to many previous approaches which try to decompose the fitting problem into many smaller ones, we focus on a Newton optimization of the complete model, making use of model structure and linear conjugate gradients in order to approximate Newton search directions. Crucially, our learning method is based entirely on matrix-vector multiplication primitives with the kernel matrices and their derivatives, allowing straightforward specialization to new kernels, and focusing code optimization efforts to these primitives only. Kernel parameters are learned automatically, by maximizing the cross-validation log likelihood in a gradient-based way, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical structure on thousands of classes, achieving state-of-the-art results in an order of magnitude less time than previous work. Parts of this work appeared in the conference paper Seeger (2007). Keywords: multi-way classification, kernel logistic regression, hierarchical classification, cross validation optimization, Newton-Raphson optimization
5 0.080480918 92 jmlr-2008-Universal Multi-Task Kernels
Author: Andrea Caponnetto, Charles A. Micchelli, Massimiliano Pontil, Yiming Ying
Abstract: In this paper we are concerned with reproducing kernel Hilbert spaces HK of functions from an input space into a Hilbert space Y , an environment appropriate for multi-task learning. The reproducing kernel K associated to HK has its values as operators on Y . Our primary goal here is to derive conditions which ensure that the kernel K is universal. This means that on every compact subset of the input space, every continuous function with values in Y can be uniformly approximated by sections of the kernel. We provide various characterizations of universal kernels and highlight them with several concrete examples of some practical importance. Our analysis uses basic principles of functional analysis and especially the useful notion of vector measures which we describe in sufficient detail to clarify our results. Keywords: multi-task learning, multi-task kernels, universal approximation, vector-valued reproducing kernel Hilbert spaces
6 0.079235457 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
7 0.077130452 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
8 0.071084417 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
9 0.053907942 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
10 0.044865213 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
11 0.044085931 58 jmlr-2008-Max-margin Classification of Data with Absent Features
12 0.041988019 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine
13 0.041557841 28 jmlr-2008-Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines
14 0.041356325 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
15 0.041222371 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
16 0.040444527 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
17 0.03861187 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
18 0.038603973 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning
19 0.036856253 15 jmlr-2008-An Information Criterion for Variable Selection in Support Vector Machines (Special Topic on Model Selection)
20 0.034779966 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
topicId topicWeight
[(0, 0.2), (1, -0.144), (2, 0.128), (3, 0.064), (4, 0.209), (5, -0.121), (6, 0.063), (7, 0.319), (8, 0.044), (9, 0.052), (10, 0.008), (11, -0.082), (12, 0.107), (13, -0.027), (14, -0.092), (15, -0.182), (16, 0.039), (17, 0.241), (18, 0.086), (19, -0.017), (20, 0.011), (21, -0.211), (22, 0.13), (23, 0.022), (24, -0.14), (25, 0.18), (26, -0.092), (27, 0.085), (28, 0.198), (29, -0.071), (30, 0.114), (31, 0.053), (32, 0.005), (33, 0.04), (34, -0.116), (35, -0.055), (36, -0.042), (37, -0.027), (38, 0.044), (39, -0.005), (40, 0.01), (41, -0.105), (42, -0.07), (43, -0.029), (44, 0.032), (45, 0.049), (46, -0.032), (47, -0.02), (48, -0.047), (49, 0.065)]
simIndex simValue paperId paperTitle
same-paper 1 0.92043835 86 jmlr-2008-SimpleMKL
2 0.87215269 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
3 0.33990183 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
Author: Hsuan-Tien Lin, Ling Li
Abstract: Ensemble learning algorithms such as boosting can achieve better performance by averaging over the predictions of some base hypotheses. Nevertheless, most existing algorithms are limited to combining only a finite number of hypotheses, and the generated ensemble is usually sparse. Thus, it is not clear whether we should construct an ensemble classifier with a larger or even an infinite number of hypotheses. In addition, constructing an infinite ensemble itself is a challenging task. In this paper, we formulate an infinite ensemble learning framework based on the support vector machine (SVM). The framework can output an infinite and nonsparse ensemble through embedding infinitely many hypotheses into an SVM kernel. We use the framework to derive two novel kernels, the stump kernel and the perceptron kernel. The stump kernel embodies infinitely many decision stumps, and the perceptron kernel embodies infinitely many perceptrons. We also show that the Laplacian radial basis function kernel embodies infinitely many decision trees, and can thus be explained through infinite ensemble learning. Experimental results show that SVM with these kernels is superior to boosting with the same base hypothesis set. In addition, SVM with the stump kernel or the perceptron kernel performs similarly to SVM with the Gaussian radial basis function kernel, but enjoys the benefit of faster parameter selection. These properties make the novel kernels favorable choices in practice. Keywords: ensemble learning, boosting, support vector machine, kernel
4 0.30439097 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
5 0.25803611 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
Author: Bo Jiang, Xuegong Zhang, Tianxi Cai
Abstract: Support vector machine (SVM) is one of the most popular and promising classification algorithms. After a classification rule is constructed via the SVM, it is essential to evaluate its prediction accuracy. In this paper, we develop procedures for obtaining both point and interval estimators for the prediction error. Under mild regularity conditions, we derive the consistency and asymptotic normality of the prediction error estimators for SVM with finite-dimensional kernels. A perturbationresampling procedure is proposed to obtain interval estimates for the prediction error in practice. With numerical studies on simulated data and a benchmark repository, we recommend the use of interval estimates centered at the cross-validated point estimates for the prediction error. Further applications of the proposed procedure in model evaluation and feature selection are illustrated with two examples. Keywords: k-fold cross-validation, model evaluation, perturbation-resampling, prediction errors, support vector machine
6 0.25598839 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
7 0.22560228 92 jmlr-2008-Universal Multi-Task Kernels
8 0.21847191 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
9 0.21081519 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
10 0.1855955 58 jmlr-2008-Max-margin Classification of Data with Absent Features
11 0.18081379 15 jmlr-2008-An Information Criterion for Variable Selection in Support Vector Machines (Special Topic on Model Selection)
12 0.17361462 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
13 0.17319584 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
14 0.16906187 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting
15 0.16269414 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
16 0.16197921 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
17 0.16049895 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
18 0.16037408 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning
19 0.15932651 85 jmlr-2008-Shark (Machine Learning Open Source Software Paper)
20 0.15861115 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
topicId topicWeight
[(0, 0.031), (5, 0.026), (9, 0.014), (31, 0.017), (40, 0.041), (54, 0.051), (58, 0.071), (66, 0.048), (67, 0.351), (76, 0.016), (78, 0.023), (88, 0.066), (92, 0.038), (93, 0.014), (94, 0.076), (99, 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 0.74327856 86 jmlr-2008-SimpleMKL
2 0.38658309 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
3 0.38237566 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
Author: Matthias W. Seeger
Abstract: The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007). Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing
4 0.3786377 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
Author: Michael Collins, Amir Globerson, Terry Koo, Xavier Carreras, Peter L. Bartlett
Abstract: Log-linear and maximum-margin models are two commonly-used methods in supervised machine learning, and are frequently used in structured prediction problems. Efficient learning of parameters in these models is therefore an important problem, and becomes a key factor when learning from very large data sets. This paper describes exponentiated gradient (EG) algorithms for training such models, where EG updates are applied to the convex dual of either the log-linear or maxmargin objective function; the dual in both the log-linear and max-margin cases corresponds to minimizing a convex function with simplex constraints. We study both batch and online variants of the algorithm, and provide rates of convergence for both cases. In the max-margin case, O( 1 ) EG ε updates are required to reach a given accuracy ε in the dual; in contrast, for log-linear models only O(log( 1 )) updates are required. For both the max-margin and log-linear cases, our bounds suggest ε that the online EG algorithm requires a factor of n less computation to reach a desired accuracy than the batch EG algorithm, where n is the number of training examples. Our experiments confirm that the online algorithms are much faster than the batch algorithms in practice. We describe how the EG updates factor in a convenient way for structured prediction problems, allowing the algorithms to be efficiently applied to problems such as sequence learning or natural language parsing. We perform extensive evaluation of the algorithms, comparing them to L-BFGS and stochastic gradient descent for log-linear models, and to SVM-Struct for max-margin models. The algorithms are applied to a multi-class problem as well as to a more complex large-scale parsing task. In all these settings, the EG algorithms presented here outperform the other methods. Keywords: exponentiated gradient, log-linear models, maximum-margin models, structured prediction, conditional random fields ∗. These authors contributed equally. c 2008 Michael Col
5 0.3681089 28 jmlr-2008-Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines
Author: Kai-Wei Chang, Cho-Jui Hsieh, Chih-Jen Lin
Abstract: Linear support vector machines (SVM) are useful for classifying large-scale sparse data. Problems with sparse features are common in applications such as document classification and natural language processing. In this paper, we propose a novel coordinate descent algorithm for training linear SVM with the L2-loss function. At each step, the proposed method minimizes a one-variable sub-problem while fixing other variables. The sub-problem is solved by Newton steps with the line search technique. The procedure globally converges at the linear rate. As each sub-problem involves only values of a corresponding feature, the proposed approach is suitable when accessing a feature is more convenient than accessing an instance. Experiments show that our method is more efficient and stable than state of the art methods such as Pegasos and TRON. Keywords: linear support vector machines, document classification, coordinate descent
6 0.36670142 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
7 0.36585951 56 jmlr-2008-Magic Moments for Structured Output Prediction
8 0.36379799 83 jmlr-2008-Robust Submodular Observation Selection
9 0.36333951 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
10 0.36321083 16 jmlr-2008-Approximations for Binary Gaussian Process Classification
11 0.3631978 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
12 0.36259958 58 jmlr-2008-Max-margin Classification of Data with Absent Features
13 0.36236224 57 jmlr-2008-Manifold Learning: The Price of Normalization
14 0.36087948 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
15 0.35701835 9 jmlr-2008-Active Learning by Spherical Subdivision
16 0.35096052 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
17 0.3467223 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
18 0.34522292 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection
20 0.34480646 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks