jmlr jmlr2008 jmlr2008-86 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Alain Rakotomamonjy, Francis R. Bach, Stéphane Canu, Yves Grandvalet
Abstract: Multiple kernel learning (MKL) aims at simultaneously learning a kernel and the associated predictor in supervised learning settings. For the support vector machine, an efficient and general multiple kernel learning algorithm, based on semi-infinite linear programming, has been recently proposed. This approach has opened new perspectives since it makes MKL tractable for large-scale problems, by iteratively using existing support vector machine code. However, it turns out that this iterative algorithm needs numerous iterations for converging towards a reasonable solution. In this paper, we address the MKL problem through a weighted 2-norm regularization formulation with an additional constraint on the weights that encourages sparse kernel combinations. Apart from learning the combination, we solve a standard SVM optimization problem, where the kernel is defined as a linear combination of multiple kernels. We propose an algorithm, named SimpleMKL, for solving this MKL problem and provide a new insight on MKL algorithms based on mixed-norm regularization by showing that the two approaches are equivalent. We show how SimpleMKL can be applied beyond binary classification, for problems like regression, clustering (one-class classification) or multiclass classification. Experimental results show that the proposed algorithm converges rapidly and that its efficiency compares favorably to other MKL algorithms. Finally, we illustrate the usefulness of MKL for some regressors based on wavelet kernels and on some model selection problems related to multiclass classification problems. Keywords: multiple kernel learning, support vector machines, support vector regression, multiclass SVM, gradient descent. ∗ Also at Heudiasyc, CNRS/Université de Technologie de Compiègne (UMR 6599), 60205 Compiègne, France. © 2008 Alain Rakotomamonjy, Francis R. Bach, Stéphane Canu and Yves Grandvalet.
Reference: text
sentIndex sentText sentNum sentScore
1 Idiap Research Institute, Centre du Parc, 1920 Martigny, Switzerland.∗ Editor: Nello Cristianini. Abstract: Multiple kernel learning (MKL) aims at simultaneously learning a kernel and the associated predictor in supervised learning settings. [sent-10, score-0.252]
2 In this paper, we address the MKL problem through a weighted 2-norm regularization formulation with an additional constraint on the weights that encourages sparse kernel combinations. [sent-14, score-0.236]
3 Apart from learning the combination, we solve a standard SVM optimization problem, where the kernel is defined as a linear combination of multiple kernels. [sent-15, score-0.175]
4 Finally, we illustrate the usefulness of MKL for some regressors based on wavelet kernels and on some model selection problems related to multiclass classification problems. [sent-19, score-0.277]
5 Keywords: multiple kernel learning, support vector machines, support vector regression, multiclass SVM, gradient descent ∗. [sent-20, score-0.409]
6 In kernel methods, the data representation is implicitly chosen through the so-called kernel K(x, x′). [sent-27, score-0.252]
7 This kernel actually plays two roles: it defines the similarity between two examples x and x′, while defining an appropriate regularization term for the learning problem. [sent-28, score-0.182]
8 For kernel algorithms, the solution of the learning problem is of the form f(x) = ∑_{i=1}^{n} α_i K(x, x_i) + b, (1) where α_i and b are some coefficients to be learned from examples, while K(·, ·) is a given positive definite kernel associated with a reproducing kernel Hilbert space (RKHS) H. [sent-30, score-0.396]
9 Recent applications have shown that using multiple kernels instead of a single one can enhance the interpretability of the decision function and improve performances (Lanckriet et al. [sent-32, score-0.18]
10 In such cases, a convenient approach is to consider that the kernel K(x, x′) is actually a convex combination of basis kernels: K(x, x′) = ∑_{m=1}^{M} d_m K_m(x, x′), with d_m ≥ 0 and ∑_{m=1}^{M} d_m = 1, where M is the total number of kernels. [sent-34, score-1.573]
11 Alternatively, the kernels Km can simply be classical kernels (such as Gaussian kernels) with different parameters. [sent-37, score-0.26]
12 Within this framework, the problem of data representation through the kernel is then transferred to the choice of weights dm . [sent-38, score-0.62]
13 Learning both the coefficients αi and the weights dm in a single optimization problem is known as the multiple kernel learning (MKL) problem. [sent-39, score-0.669]
14 Indeed, we replace the mixed-norm regularization by a weighted ℓ2-norm regularization, where the sparsity of the linear combination of kernels is controlled by an ℓ1-norm constraint on the kernel weights. [sent-61, score-0.312]
15 Indeed, our algorithm is simple, essentially based on a gradient descent on the SVM objective value. [sent-68, score-0.208]
16 We iteratively determine the combination of kernels by a gradient descent wrapping a standard SVM solver, which is SimpleSVM in our case. [sent-69, score-0.288]
17 However, they differ in that we use reduced gradient descent in the primal, whereas Sonnenburg et al. [sent-72, score-0.158]
18 For instance, in addition to the empirical efficiency comparison, we also show, in an SVM regression problem involving wavelet kernels, that automatic learning of the kernels leads to far better performance. [sent-79, score-0.204]
19 1 Functional Framework Before entering into the details of the MKL optimization problem, we first present the functional framework adopted for multiple kernel learning. [sent-95, score-0.175]
20 For any m, let d_m be a non-negative coefficient and H′_m be the Hilbert space derived from H_m as follows: H′_m = { f | f ∈ H_m : ‖f‖_{H_m} / d_m < ∞ }, endowed with the inner product ⟨f, g⟩_{H′_m} = (1/d_m) ⟨f, g⟩_{H_m}. [sent-100, score-1.425]
21 This means that, if d_m = 0, then a function f belongs to the Hilbert space H′_m only if f = 0. [sent-102, score-0.475]
22 Within this framework, H′_m is an RKHS with kernel K(x, x′) = d_m K_m(x, x′), since ∀ f ∈ H′_m ⊆ H_m, f(x) = ⟨f(·), K_m(x, ·)⟩_{H_m} = (1/d_m) ⟨f(·), d_m K_m(x, ·)⟩_{H_m} = ⟨f(·), d_m K_m(x, ·)⟩_{H′_m}. [sent-104, score-2.026]
23 Now, if we define H as the direct sum of the spaces H′_m, that is, H = ⊕_{m=1}^{M} H′_m, then a classical result on RKHS (Aronszajn, 1950) says that H is an RKHS with kernel K(x, x′) = ∑_{m=1}^{M} d_m K_m(x, x′). [sent-105, score-0.601]
24 In the MKL framework, one looks for a decision function of the form f (x) + b = ∑m fm (x) + b, where each function f m belongs to a different RKHS Hm associated with a kernel Km . [sent-114, score-0.208]
25 min_{{f_m}, b, ξ, d} (1/2) ∑_m (1/d_m) ‖f_m‖²_{H_m} + C ∑_i ξ_i, subject to y_i ∑_m f_m(x_i) + y_i b ≥ 1 − ξ_i ∀i, ξ_i ≥ 0 ∀i, ∑_m d_m = 1, d_m ≥ 0 ∀m, (2) where each d_m controls the squared norm of f_m in the objective function. [sent-118, score-2.176]
26 The smaller d_m is, the smoother f_m (as measured by ‖f_m‖_{H_m}) should be. [sent-119, score-0.475]
27 When d_m = 0, ‖f_m‖_{H_m} also has to be equal to zero to yield a finite objective value. [sent-120, score-0.607]
28 The ℓ1-norm constraint on the vector d is a sparsity constraint that will force some d_m to be zero, thus encouraging sparse basis kernel expansions. [sent-121, score-0.601]
29 min_{{f_m}, b, ξ} (1/2) (∑_m ‖f_m‖_{H_m})² + C ∑_i ξ_i, subject to y_i ∑_m f_m(x_i) + y_i b ≥ 1 − ξ_i ∀i, ξ_i ≥ 0 ∀i. (3) [sent-127, score-0.226]
30 Note that the objective function of this problem is not smooth since ‖f_m‖_{H_m} is not differentiable at f_m = 0. [sent-128, score-0.168]
31 However, what makes this formulation interesting is that the mixed-norm penalization of f = ∑m fm is a soft-thresholding penalizer that leads to a sparse solution, for which the algorithm performs kernel selection (Bach, 2008). [sent-129, score-0.243]
32 When setting to zero the gradient of the Lagrangian with respect to the primal variables, we get the following: (a) (1/d_m) f_m(·) = ∑_i α_i y_i K_m(·, x_i) ∀m; (b) ∑_i α_i y_i = 0; (c) C − α_i − ν_i = 0 ∀i; (d) −‖f_m‖²_{H_m} / (2 d_m²) + λ − η_m = 0 ∀m. (5) [sent-141, score-1.237]
33 We note again here that f m (·) has to go to 0 if the coefficient dm vanishes. [sent-142, score-0.475]
34 Instead of using an alternating optimization algorithm, we prefer to consider here the following constrained optimization problem: min_d J(d) such that ∑_{m=1}^{M} d_m = 1, d_m ≥ 0, (7) where J(d) is the optimal value of the inner problem below. [sent-160, score-0.988]
35 J(d) = min_{{f_m}, b, ξ} (1/2) ∑_m (1/d_m) ‖f_m‖²_{H_m} + C ∑_i ξ_i, subject to y_i ∑_m f_m(x_i) + y_i b ≥ 1 − ξ_i ∀i, ξ_i ≥ 0 ∀i. (8) [sent-162, score-0.701]
36 By setting to zero the derivatives of this Lagrangian according to the primal variables, we get conditions (5) (a) to (c), from which we derive the associated dual problem: max_α −(1/2) ∑_{i,j} α_i α_j y_i y_j ∑_m d_m K_m(x_i, x_j) + ∑_i α_i, (9) with ∑_i α_i y_i = 0 and C ≥ α_i ≥ 0 ∀i. [sent-168, score-0.634]
37 The Lanckriet et al. (2004a) formulation differs slightly, in that the kernels are weighted by some pre-defined coefficients that were not considered here. [sent-170, score-0.165]
38 which is identified as the standard SVM dual formulation using the combined kernel K(x_i, x_j) = ∑_m d_m K_m(x_i, x_j). [sent-171, score-0.684]
39 Because of strong duality, J(d) is also the objective value of the dual problem: J(d) = −(1/2) ∑_{i,j} α⋆_i α⋆_j y_i y_j ∑_m d_m K_m(x_i, x_j) + ∑_i α⋆_i, (10) where α⋆ maximizes (9). [sent-173, score-0.604]
40 Thus, by simple differentiation of the dual function (9) with respect to d_m, we have: ∂J/∂d_m = −(1/2) ∑_{i,j} α_i α_j y_i y_j K_m(x_i, x_j) ∀m. [sent-188, score-0.554]
41 With our positivity assumption on the kernel matrices, J(·) is convex and differentiable with Lipschitz gradient (Lemaréchal and Sagastizabal, 1997). [sent-193, score-0.285]
42 Once the gradient of J(d) is computed, d is updated by using a descent direction ensuring that the equality constraint and the non-negativity constraints on d are satisfied. [sent-195, score-0.177]
43 However, if there is an index m such that dm = 0 and [∇red J]m > 0, using this direction would violate the positivity constraint for d m . [sent-202, score-0.494]
44 This gives the descent direction for updating d as: D_m = 0 if d_m = 0 and ∂J/∂d_m − ∂J/∂d_µ > 0; D_m = −∂J/∂d_m + ∂J/∂d_µ if d_m > 0 and m ≠ µ; D_µ = ∑_{ν≠µ, d_ν>0} (∂J/∂d_ν − ∂J/∂d_µ) for m = µ. (12) [sent-204, score-1.051]
45 Here, as detailed in Algorithm 1, we go one step beyond: once a descent direction D has been computed, we first look for the maximal admissible step size in that direction and check whether the objective value decreases or not. [sent-206, score-0.17]
46 From the primal and dual objectives provided respectively in (2) and (6), the MKL duality gap is DualGap = J(d⋆) − ∑_i α⋆_i + (1/2) max_m ∑_{i,j} α⋆_i α⋆_j y_i y_j K_m(x_i, x_j). Algorithm 1 (SimpleMKL) initializes d_m = 1/M for m = 1, ..., M. [sent-224, score-0.713]
47 Then, while the stopping criterion is not met, it computes J(d) by using an SVM solver with K = ∑_m d_m K_m and computes ∂J/∂d_m for m = 1, ..., M. [sent-227, score-0.551]
48 If J(d⋆) has been obtained through the dual problem (9), then this MKL duality gap can also be computed from the single-kernel SVM algorithm duality gap DG_SVM. [sent-232, score-0.394]
49 Otherwise, we have DG_SVM = J(d⋆) + (1/2) ∑_{i,j} α_i α_j y_i y_j ∑_m d_m K_m(x_i, x_j) − ∑_i α_i, and then the MKL duality gap becomes DualGap = DG_SVM − (1/2) ∑_{i,j} α_i α_j y_i y_j ∑_m d_m K_m(x_i, x_j) + (1/2) max_m ∑_{i,j} α_i α_j y_i y_j K_m(x_i, x_j). [sent-234, score-1.153]
50 Consequently, we terminate the algorithm when max_m ∑_{i,j} α_i α_j y_i y_j K_m(x_i, x_j) − ∑_{i,j} α_i α_j y_i y_j ∑_m d_m K_m(x_i, x_j) ≤ ε. [sent-238, score-0.537]
51 For the general MKL problem (7), the first-order optimality conditions are obtained through the KKT conditions: ∂J/∂d_m + λ − η_m = 0 ∀m, and η_m · d_m = 0 ∀m, where λ and {η_m} are respectively the Lagrange multipliers for the equality and inequality constraints of (7). [sent-241, score-0.498]
52 These KKT conditions imply ∂J/∂d_m = −λ if d_m > 0, and ∂J/∂d_m ≥ −λ if d_m = 0. [sent-242, score-0.95]
53 Let’s define dJ_min = min_{m : d_m > 0} ∂J/∂d_m and dJ_max = max_{m : d_m > 0} ∂J/∂d_m; then, the necessary optimality conditions are approximated by the following termination conditions: |dJ_min − dJ_max| ≤ ε and ∂J/∂d_m ≥ dJ_max if d_m = 0. [sent-245, score-0.498]
54 They, however, differ in computational efficiency, because the kernel weights dm are optimized in quite different ways, as detailed below. [sent-256, score-0.62]
55 Let us first recall that our differentiable function J(d) is defined as: J(d) = max_α −(1/2) ∑_{i,j} α_i α_j y_i y_j ∑_m d_m K_m(x_i, x_j) + ∑_i α_i, with ∑_i α_i y_i = 0 and C ≥ α_i ≥ 0 ∀i; both algorithms aim at minimizing this differentiable function. [sent-257, score-0.609]
56 Figure 1: Illustrating three iterations of the SILP algorithm and a gradient descent algorithm for a one-dimensional problem. [sent-259, score-0.158]
57 Provided the step size is chosen correctly, gradient descent converges directly towards the optimal solution without overshooting (from d0 to d ). [sent-262, score-0.158]
58 Figure 1 illustrates the behaviour of both algorithms in a simple case, with oscillations for cutting planes and direct convergence for gradient descent. [sent-270, score-0.197]
59 , 1997; Schölkopf and Smola, 2001), we have J(d) = min_{{f_m}, b, ξ, ξ*} (1/2) ∑_m (1/d_m) ‖f_m‖²_{H_m} + C ∑_i (ξ_i + ξ*_i), subject to [sent-293, score-0.639]
60 y_i − ∑_m f_m(x_i) − b ≤ ε + ξ_i ∀i, ∑_m f_m(x_i) + b − y_i ≤ ε + ξ*_i ∀i, ξ_i ≥ 0, ξ*_i ≥ 0 ∀i, (14) and for one-class SVMs (Schölkopf and Smola, 2001), we have: [sent-295, score-0.226]
61 J(d) = min_{{f_m}, b, ξ} (1/2) ∑_m (1/d_m) ‖f_m‖²_{H_m} + (1/(ν n)) ∑_i ξ_i − b, subject to ∑_m f_m(x_i) ≥ b − ξ_i ∀i, ξ_i ≥ 0 ∀i. [sent-297, score-0.557]
62 For instance, multiple kernel learning has been transposed to least-square fitting and logistic regression (Bach et al. [sent-332, score-0.178]
63 Some of them are based on projected gradient or coordinate descent algorithms. [sent-337, score-0.158]
64 The first point is to show that our gradient descent algorithm is efficient. [sent-361, score-0.158]
65 The examples we use are based on wavelet-based regression in which the multiple kernel learning framework naturally fits. [sent-365, score-0.178]
66 The final experiment aims at evaluating the multiple kernel approach in a model selection problem for some multiclass problems. [sent-366, score-0.251]
67 The candidate kernels are: • Gaussian kernels with 10 different bandwidths σ, on all variables and on each single variable; • polynomial kernels of degree 1 to 3, again on all and each single variable. [sent-378, score-0.39]
68 Figure 2: Evolution of the three largest weights d_m for SimpleMKL and SILP; left row: Pima; right row: Ionosphere. [sent-400, score-0.494]
69 The updates of dm based on the descent algorithm of SimpleMKL are rather conservative (small steps departing from 1/M for all dm ), whereas the oscillations of cutting planes are likely to favor extreme solutions, hitting the edges of the simplex. [sent-417, score-1.153]
70 5 Table 1: Average performance measures for the two MKL algorithms and a plain gradient descent algorithm. [sent-512, score-0.158]
71 4, when the number of kernels is large, computing the gradient may be expensive compared to SVM retraining with warm-start techniques. [sent-517, score-0.206]
72 On the one hand, the large variations in successive d_m values for SILP entail that subsequent SVM problems are not likely to have similar solutions: a warm-start call to the SVM solver does not help much. [sent-519, score-0.521]
73 On the other hand, with the smooth trajectories of d_m in SimpleMKL, the previous SVM solution is often a good guess for the next SVM problem. [sent-520, score-0.475]
74 Table 1 also shows the results obtained when replacing the update scheme described in Algorithm 1 by a usual reduced gradient update, which, at each iteration, modifies d by computing the optimal step size on the descent direction D (12). [sent-524, score-0.177]
75 We see that the gradient descent updates require many more calls to the SVM solver and a number of gradient computations comparable with SILP. [sent-526, score-0.299]
76 Figure 4: Regularization paths of some d_m and the number of selected kernels versus C; left row: Pima; right row: Wpbc. [sent-556, score-0.605]
77 Figure 4 shows the variations of the number of selected kernels and the values of d along the regularization path for the Pima and Wpbc data sets. [sent-563, score-0.206]
78 The number of kernels is not a monotone function of C: for small values of C, the number of kernels is somewhat constant, then, it rises rapidly. [sent-564, score-0.26]
79 We first look at the situation where kernel matrices can be pre-computed and stored in memory, before reporting experiments where the memory requirements are too high, leading to repeated kernel evaluations. [sent-583, score-0.252]
80 For these experiments, the stopping criterion is based on the variation of the weights dm . [sent-600, score-0.524]
81 As shown in Figure 2, the kernel weights rapidly reach a steady state, and many iterations are spent fine-tuning the weights to reach the duality gap tolerance. [sent-601, score-0.255]
82 2 Multiple Kernel Regression Examples Several research papers have already claimed that using multiple kernel learning can lead to better generalization performances in some classification problems (Lanckriet et al. [sent-612, score-0.176]
83 Furthermore, according to Equation (14), we look for the regression function f (x) as a linear combination of functions each belonging to a wavelet based reproducing kernel Hilbert space. [sent-620, score-0.2]
84 The algorithm we use is a classical SVM regression algorithm with multiple kernels where each kernel is built from a set of wavelets. [sent-621, score-0.308]
85 These kernels have been obtained according to the expression: K(x, x′) = ∑_j ∑_s (1/2^j) ψ_{j,s}(x) ψ_{j,s}(x′), where ψ(·) is a mother wavelet and j, s are respectively the dilation and translation parameters of the wavelet ψ_{j,s}(·). [sent-622, score-0.279]
86 Our hope when using multiple kernel learning in this context is to capture the multiscale structure of the target function. [sent-626, score-0.187]
87 Furthermore, such a kernel has to be built according to the multiscale structure we wish to capture. [sent-628, score-0.157]
88 In this experiment, we have used three different choices of multiple kernels setting. [sent-629, score-0.16]
89 According to these settings, we have 10 dilation kernels and 48 dilation-translation kernels. [sent-636, score-0.175]
90 The first column gives the performance of an SVM regression using a single kernel, which is the average of all the kernels used for the two other results. [sent-748, score-0.278]
91 As expected, using a multiple kernel learning setting outperforms the single kernel setting especially when the target function presents multiscale structure. [sent-764, score-0.313]
92 Interestingly, for these two signals, performances of the multiple kernel settings also depend on the signal structure. [sent-766, score-0.202]
93 We also notice that for the Blocks signal using multiple kernels only slightly improves performance compared to a single kernel. [sent-772, score-0.186]
94 3 Multiclass Problem. For selecting the kernel and regularization parameter of an SVM, one usually tries all pairs of parameters and picks the pair that achieves the best cross-validation performance. [sent-810, score-0.182]
95 Using an MKL approach, one can instead let the algorithm combine all available kernels (obtained by sampling the parameter) and just select the regularization parameter by cross-validation. [sent-811, score-0.186]
96 In our MKL one-against-all approach, we have used a polynomial kernel of degree 1 to 3 and Gaussian kernel for which σ belongs to [0. [sent-822, score-0.252]
97 We can see that the generalization performance of an MKL approach is either similar to or better than the performance obtained when selecting the kernel through cross-validation, even though we have roughly searched the kernel and regularization parameter space. [sent-831, score-0.328]
98 The main added value of the smoothness of our new objective function is that descent methods become practical and efficient means to solve the optimization problem that wraps a single kernel SVM solver. [sent-842, score-0.277]
99 This efficiency makes it possible to demonstrate the usefulness of our algorithm on wavelet-kernel-based regression. [sent-847, score-0.178]
100 Possible extensions of this work include other learning problems, such as semi-supervised learning or kernel eigenvalue problems like kernel Fisher discriminant analysis. [sent-849, score-0.252]
wordName wordTfidf (topN-words)
[('mkl', 0.478), ('dm', 0.475), ('simplemkl', 0.389), ('silp', 0.248), ('svm', 0.166), ('bach', 0.148), ('km', 0.133), ('hm', 0.132), ('kernels', 0.13), ('kernel', 0.126), ('akotomamonjy', 0.112), ('anu', 0.112), ('randvalet', 0.112), ('imple', 0.102), ('multiclass', 0.095), ('descent', 0.082), ('fm', 0.082), ('gradient', 0.076), ('eval', 0.075), ('lanckriet', 0.073), ('canu', 0.068), ('duality', 0.067), ('rakotomamonjy', 0.063), ('bonnans', 0.057), ('sonnenburg', 0.057), ('regularization', 0.056), ('planes', 0.052), ('wavelet', 0.052), ('objective', 0.05), ('primal', 0.049), ('dual', 0.048), ('solver', 0.046), ('grandvalet', 0.045), ('dilation', 0.045), ('linchirp', 0.045), ('rouen', 0.045), ('pima', 0.043), ('gap', 0.043), ('wave', 0.04), ('dgsvm', 0.037), ('wpbc', 0.037), ('cutting', 0.037), ('differentiability', 0.037), ('differentiable', 0.036), ('formulation', 0.035), ('oscillations', 0.032), ('duan', 0.031), ('multiscale', 0.031), ('yi', 0.031), ('stopping', 0.03), ('alain', 0.03), ('djmax', 0.03), ('insa', 0.03), ('loosli', 0.03), ('multiple', 0.03), ('kkt', 0.028), ('zien', 0.027), ('signal', 0.026), ('ionosphere', 0.026), ('ong', 0.026), ('jmin', 0.025), ('chal', 0.025), ('lemar', 0.025), ('dk', 0.024), ('rkhs', 0.024), ('optimality', 0.023), ('abe', 0.022), ('amato', 0.022), ('antoniadis', 0.022), ('djmin', 0.022), ('litis', 0.022), ('luenberger', 0.022), ('simplesvm', 0.022), ('spamdata', 0.022), ('yves', 0.022), ('regression', 0.022), ('lagrange', 0.022), ('lagrangian', 0.022), ('convex', 0.022), ('cpu', 0.021), ('performances', 0.02), ('path', 0.02), ('sonar', 0.019), ('convexity', 0.019), ('direction', 0.019), ('phane', 0.019), ('jmax', 0.019), ('harchaoui', 0.019), ('liver', 0.019), ('owing', 0.019), ('optimization', 0.019), ('calls', 0.019), ('weights', 0.019), ('lasso', 0.019), ('xi', 0.018), ('francis', 0.018), ('wavelets', 0.018), ('keerthi', 0.018), ('trend', 0.017), ('forthcoming', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 86 jmlr-2008-SimpleMKL
2 0.24565193 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
Author: Jieping Ye, Shuiwang Ji, Jianhui Chen
Abstract: Regularized kernel discriminant analysis (RKDA) performs linear discriminant analysis in the feature space via the kernel trick. Its performance depends on the selection of kernels. In this paper, we consider the problem of multiple kernel learning (MKL) for RKDA, in which the optimal kernel matrix is obtained as a linear combination of pre-specified kernel matrices. We show that the kernel learning problem in RKDA can be formulated as convex programs. First, we show that this problem can be formulated as a semidefinite program (SDP). Based on the equivalence relationship between RKDA and least square problems in the binary-class case, we propose a convex quadratically constrained quadratic programming (QCQP) formulation for kernel learning in RKDA. A semi-infinite linear programming (SILP) formulation is derived to further improve the efficiency. We extend these formulations to the multi-class case based on a key result established in this paper. That is, the multi-class RKDA kernel learning problem can be decomposed into a set of binary-class kernel learning problems which are constrained to share a common kernel. Based on this decomposition property, SDP formulations are proposed for the multi-class case. Furthermore, it leads naturally to QCQP and SILP formulations. As the performance of RKDA depends on the regularization parameter, we show that this parameter can also be optimized in a joint framework with the kernel. Extensive experiments have been conducted and analyzed, and connections to other algorithms are discussed. Keywords: model selection, kernel discriminant analysis, semidefinite programming, quadratically constrained quadratic programming, semi-infinite linear programming
3 0.085213639 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
Author: Mikio L. Braun, Joachim M. Buhmann, Klaus-Robert Müller
Abstract: We show that the relevant information of a supervised learning problem is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem in the sense that it can asymptotically represent the function to be learned and is sufficiently smooth. Thus, kernels do not only transform data sets such that good generalization can be achieved using only linear discriminant functions, but this transformation is also performed in a manner which makes economical use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data for supervised learning problems. Practically, we propose an algorithm which enables us to recover the number of leading kernel PCA components relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to aid in model selection, and (3) to denoise in feature space in order to yield better classification results. Keywords: kernel methods, feature space, dimension reduction, effective dimensionality
4 0.080545776 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
Author: Matthias W. Seeger
Abstract: We propose a highly efficient framework for penalized likelihood kernel methods applied to multiclass models with a large, structured set of classes. As opposed to many previous approaches which try to decompose the fitting problem into many smaller ones, we focus on a Newton optimization of the complete model, making use of model structure and linear conjugate gradients in order to approximate Newton search directions. Crucially, our learning method is based entirely on matrix-vector multiplication primitives with the kernel matrices and their derivatives, allowing straightforward specialization to new kernels, and focusing code optimization efforts to these primitives only. Kernel parameters are learned automatically, by maximizing the cross-validation log likelihood in a gradient-based way, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical structure on thousands of classes, achieving state-of-the-art results in an order of magnitude less time than previous work. Parts of this work appeared in the conference paper Seeger (2007). Keywords: multi-way classification, kernel logistic regression, hierarchical classification, cross validation optimization, Newton-Raphson optimization
5 0.080480918 92 jmlr-2008-Universal Multi-Task Kernels
Author: Andrea Caponnetto, Charles A. Micchelli, Massimiliano Pontil, Yiming Ying
Abstract: In this paper we are concerned with reproducing kernel Hilbert spaces HK of functions from an input space into a Hilbert space Y , an environment appropriate for multi-task learning. The reproducing kernel K associated to HK has its values as operators on Y . Our primary goal here is to derive conditions which ensure that the kernel K is universal. This means that on every compact subset of the input space, every continuous function with values in Y can be uniformly approximated by sections of the kernel. We provide various characterizations of universal kernels and highlight them with several concrete examples of some practical importance. Our analysis uses basic principles of functional analysis and especially the useful notion of vector measures which we describe in sufficient detail to clarify our results. Keywords: multi-task learning, multi-task kernels, universal approximation, vector-valued reproducing kernel Hilbert spaces
6 0.079235457 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
7 0.077130452 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
8 0.071084417 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
9 0.053907942 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
10 0.044865213 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
11 0.044085931 58 jmlr-2008-Max-margin Classification of Data with Absent Features
12 0.041988019 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine
13 0.041557841 28 jmlr-2008-Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines
14 0.041356325 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
15 0.041222371 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
16 0.040444527 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
17 0.03861187 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
18 0.038603973 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning
19 0.036856253 15 jmlr-2008-An Information Criterion for Variable Selection in Support Vector Machines (Special Topic on Model Selection)
20 0.034779966 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
topicId topicWeight
[(0, 0.2), (1, -0.144), (2, 0.128), (3, 0.064), (4, 0.209), (5, -0.121), (6, 0.063), (7, 0.319), (8, 0.044), (9, 0.052), (10, 0.008), (11, -0.082), (12, 0.107), (13, -0.027), (14, -0.092), (15, -0.182), (16, 0.039), (17, 0.241), (18, 0.086), (19, -0.017), (20, 0.011), (21, -0.211), (22, 0.13), (23, 0.022), (24, -0.14), (25, 0.18), (26, -0.092), (27, 0.085), (28, 0.198), (29, -0.071), (30, 0.114), (31, 0.053), (32, 0.005), (33, 0.04), (34, -0.116), (35, -0.055), (36, -0.042), (37, -0.027), (38, 0.044), (39, -0.005), (40, 0.01), (41, -0.105), (42, -0.07), (43, -0.029), (44, 0.032), (45, 0.049), (46, -0.032), (47, -0.02), (48, -0.047), (49, 0.065)]
simIndex simValue paperId paperTitle
same-paper 1 0.92043835 86 jmlr-2008-SimpleMKL
2 0.87215269 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
3 0.33990183 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
Author: Hsuan-Tien Lin, Ling Li
Abstract: Ensemble learning algorithms such as boosting can achieve better performance by averaging over the predictions of some base hypotheses. Nevertheless, most existing algorithms are limited to combining only a finite number of hypotheses, and the generated ensemble is usually sparse. Thus, it is not clear whether we should construct an ensemble classifier with a larger or even an infinite number of hypotheses. In addition, constructing an infinite ensemble itself is a challenging task. In this paper, we formulate an infinite ensemble learning framework based on the support vector machine (SVM). The framework can output an infinite and nonsparse ensemble through embedding infinitely many hypotheses into an SVM kernel. We use the framework to derive two novel kernels, the stump kernel and the perceptron kernel. The stump kernel embodies infinitely many decision stumps, and the perceptron kernel embodies infinitely many perceptrons. We also show that the Laplacian radial basis function kernel embodies infinitely many decision trees, and can thus be explained through infinite ensemble learning. Experimental results show that SVM with these kernels is superior to boosting with the same base hypothesis set. In addition, SVM with the stump kernel or the perceptron kernel performs similarly to SVM with the Gaussian radial basis function kernel, but enjoys the benefit of faster parameter selection. These properties make the novel kernels favorable choices in practice. Keywords: ensemble learning, boosting, support vector machine, kernel
4 0.30439097 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
5 0.25803611 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
Author: Bo Jiang, Xuegong Zhang, Tianxi Cai
Abstract: Support vector machine (SVM) is one of the most popular and promising classification algorithms. After a classification rule is constructed via the SVM, it is essential to evaluate its prediction accuracy. In this paper, we develop procedures for obtaining both point and interval estimators for the prediction error. Under mild regularity conditions, we derive the consistency and asymptotic normality of the prediction error estimators for SVM with finite-dimensional kernels. A perturbationresampling procedure is proposed to obtain interval estimates for the prediction error in practice. With numerical studies on simulated data and a benchmark repository, we recommend the use of interval estimates centered at the cross-validated point estimates for the prediction error. Further applications of the proposed procedure in model evaluation and feature selection are illustrated with two examples. Keywords: k-fold cross-validation, model evaluation, perturbation-resampling, prediction errors, support vector machine
6 0.25598839 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
7 0.22560228 92 jmlr-2008-Universal Multi-Task Kernels
8 0.21847191 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
9 0.21081519 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
10 0.1855955 58 jmlr-2008-Max-margin Classification of Data with Absent Features
11 0.18081379 15 jmlr-2008-An Information Criterion for Variable Selection in Support Vector Machines (Special Topic on Model Selection)
12 0.17361462 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
13 0.17319584 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
14 0.16906187 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting
15 0.16269414 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
16 0.16197921 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
17 0.16049895 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
18 0.16037408 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning
19 0.15932651 85 jmlr-2008-Shark (Machine Learning Open Source Software Paper)
20 0.15861115 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
topicId topicWeight
[(0, 0.031), (5, 0.026), (9, 0.014), (31, 0.017), (40, 0.041), (54, 0.051), (58, 0.071), (66, 0.048), (67, 0.351), (76, 0.016), (78, 0.023), (88, 0.066), (92, 0.038), (93, 0.014), (94, 0.076), (99, 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 0.74327856 86 jmlr-2008-SimpleMKL
2 0.38658309 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
3 0.38237566 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
Author: Matthias W. Seeger
Abstract: The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007). Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing
4 0.3786377 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
Author: Michael Collins, Amir Globerson, Terry Koo, Xavier Carreras, Peter L. Bartlett
Abstract: Log-linear and maximum-margin models are two commonly-used methods in supervised machine learning, and are frequently used in structured prediction problems. Efficient learning of parameters in these models is therefore an important problem, and becomes a key factor when learning from very large data sets. This paper describes exponentiated gradient (EG) algorithms for training such models, where EG updates are applied to the convex dual of either the log-linear or maxmargin objective function; the dual in both the log-linear and max-margin cases corresponds to minimizing a convex function with simplex constraints. We study both batch and online variants of the algorithm, and provide rates of convergence for both cases. In the max-margin case, O( 1 ) EG ε updates are required to reach a given accuracy ε in the dual; in contrast, for log-linear models only O(log( 1 )) updates are required. For both the max-margin and log-linear cases, our bounds suggest ε that the online EG algorithm requires a factor of n less computation to reach a desired accuracy than the batch EG algorithm, where n is the number of training examples. Our experiments confirm that the online algorithms are much faster than the batch algorithms in practice. We describe how the EG updates factor in a convenient way for structured prediction problems, allowing the algorithms to be efficiently applied to problems such as sequence learning or natural language parsing. We perform extensive evaluation of the algorithms, comparing them to L-BFGS and stochastic gradient descent for log-linear models, and to SVM-Struct for max-margin models. The algorithms are applied to a multi-class problem as well as to a more complex large-scale parsing task. In all these settings, the EG algorithms presented here outperform the other methods. Keywords: exponentiated gradient, log-linear models, maximum-margin models, structured prediction, conditional random fields ∗. These authors contributed equally. c 2008 Michael Col
5 0.3681089 28 jmlr-2008-Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines
Author: Kai-Wei Chang, Cho-Jui Hsieh, Chih-Jen Lin
Abstract: Linear support vector machines (SVM) are useful for classifying large-scale sparse data. Problems with sparse features are common in applications such as document classification and natural language processing. In this paper, we propose a novel coordinate descent algorithm for training linear SVM with the L2-loss function. At each step, the proposed method minimizes a one-variable sub-problem while fixing other variables. The sub-problem is solved by Newton steps with the line search technique. The procedure globally converges at the linear rate. As each sub-problem involves only values of a corresponding feature, the proposed approach is suitable when accessing a feature is more convenient than accessing an instance. Experiments show that our method is more efficient and stable than state of the art methods such as Pegasos and TRON. Keywords: linear support vector machines, document classification, coordinate descent
6 0.36670142 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
7 0.36585951 56 jmlr-2008-Magic Moments for Structured Output Prediction
8 0.36379799 83 jmlr-2008-Robust Submodular Observation Selection
9 0.36333951 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
10 0.36321083 16 jmlr-2008-Approximations for Binary Gaussian Process Classification
11 0.3631978 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
12 0.36259958 58 jmlr-2008-Max-margin Classification of Data with Absent Features
13 0.36236224 57 jmlr-2008-Manifold Learning: The Price of Normalization
14 0.36087948 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
15 0.35701835 9 jmlr-2008-Active Learning by Spherical Subdivision
16 0.35096052 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
17 0.3467223 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
18 0.34522292 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection
20 0.34480646 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks