jmlr jmlr2011 jmlr2011-105 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Marius Kloft, Ulf Brefeld, Sören Sonnenburg, Alexander Zien
Abstract: Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability and scalability. Unfortunately, this ℓ1 -norm MKL is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures that generalize well, we extend MKL to arbitrary norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary norms, that is ℓ p -norms with p ≥ 1. This interleaved optimization is much faster than the commonly used wrapper approaches, as demonstrated on several data sets. A theoretical analysis and an experiment on controlled artificial data shed light on the appropriateness of sparse, non-sparse and ℓ∞ -norm MKL in various scenarios. Importantly, empirical applications of ℓ p -norm MKL to three real-world problems from computational biology show that non-sparse MKL achieves accuracies that surpass the state-of-the-art. Data sets, source code to reproduce the experiments, implementations of the algorithms, and further information are available at http://doc.ml.tu-berlin.de/nonsparse_mkl/. Keywords: multiple kernel learning, learning kernels, non-sparse, support vector machine, convex conjugate, block coordinate descent, large scale optimization, bioinformatics, generalization bounds, Rademacher complexity ∗. Also at Machine Learning Group, Technische Universität Berlin, 10587 Berlin, Germany. †. Parts of this work were done while SS was at the Friedrich Miescher Laboratory, Max Planck Society, 72076 Tübingen, Germany. ‡. Most contributions by AZ were done at the Fraunhofer Institute FIRST, 12489 Berlin, Germany. © 2011 Marius Kloft, Ulf Brefeld, Sören Sonnenburg and Alexander Zien.
Reference: text
sentIndex sentText sentNum sentScore
1 Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability and scalability. [sent-10, score-0.359]
2 This interleaved optimization is much faster than the commonly used wrapper approaches, as demonstrated on several data sets. [sent-14, score-0.246]
3 Keywords: multiple kernel learning, learning kernels, non-sparse, support vector machine, convex conjugate, block coordinate descent, large scale optimization, bioinformatics, generalization bounds, Rademacher complexity ∗. [sent-21, score-0.229]
4 (2005) study hyperkernels on the space of kernels and alternative approaches include selecting kernels by DC programming (Argyriou et al. [sent-42, score-0.236]
5 Although finding non-linear kernel mixtures (Gönen and Alpaydin, 2008; Varma and Babu, 2009) generally results in non-convex optimization problems, Cortes et al. [sent-44, score-0.231]
6 This spawned the new field of multiple kernel learning (MKL), the automatic combination of several kernel functions. [sent-49, score-0.327]
7 All the above mentioned multiple kernel learning formulations promote sparse solutions in terms of the mixing coefficients. [sent-88, score-0.237]
8 1 Outline of the Presented Achievements On the theoretical side, we cast multiple kernel learning as a general regularized risk minimization problem for arbitrary convex loss functions, Hilbertian regularizers, and arbitrary norm-penalties on θ. [sent-101, score-0.241]
9 Our optimization problem subsumes state-of-the-art approaches to multiple kernel learning, covering sparse and nonsparse MKL by arbitrary p-norm regularization (1 ≤ p ≤ ∞) on the mixing coefficients as well as the incorporation of prior knowledge by allowing for non-isotropic regularizers. [sent-104, score-0.319]
10 Moreover, our implementation employs kernel caching techniques, which enables training on tens of thousands of data points or thousands of kernels, respectively. [sent-108, score-0.299]
11 The results demonstrate (i) that combining kernels is now tractable on large data sets, (ii) that it can provide cutting edge classification accuracy, and (iii) that depending on the task at hand, different kernel mixture regularizations are required for achieving optimal performance. [sent-120, score-0.269]
12 (2010) present a more general dual view of ℓ2 -norm MKL and show advantages of ℓ2 -norm over an unweighted-sum kernel SVM on six bioinformatics data sets. [sent-143, score-0.248]
13 Multiple Kernel Learning—A Unifying View In this section we cast multiple kernel learning into a unified framework: we present a regularized loss minimization formulation with additional norm constraints on the kernel mixing coefficients. [sent-152, score-0.444]
14 We derive generalized dual optimization problems without making specific assumptions on the norm regularizers or the loss function, besides requiring that the latter be convex. [sent-154, score-0.251]
15 Prior knowledge on kernel mixtures and kernel asymmetries can be incorporated by non-isotropic norm regularizers. [sent-158, score-0.39]
16 M, each giving rise to a reproducing kernel km of Hm . [sent-179, score-0.26]
17 Convex approaches to multiple kernel learning consider linear kernel mixtures kθ = ∑ θm km , θm ≥ 0. [sent-180, score-0.457]
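For concreteness, here is a minimal sketch of forming such a linear kernel mixture from precomputed Gram matrices; the function and variable names (combine_kernels, kernel_list, theta) are illustrative and not taken from the paper's implementation.

```python
import numpy as np

def combine_kernels(kernel_list, theta):
    """Return the mixture kernel K_theta = sum_m theta_m * K_m.

    kernel_list : list of (n, n) numpy arrays, the base Gram matrices K_m
    theta       : sequence of M non-negative mixing coefficients theta_m
    """
    theta = np.asarray(theta, dtype=float)
    assert np.all(theta >= 0), "mixing coefficients must be non-negative"
    K = np.zeros_like(kernel_list[0], dtype=float)
    for theta_m, K_m in zip(theta, kernel_list):
        K += theta_m * K_m
    return K
```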
18 Compared to Equation (1), the primal model for learning with multiple kernels is extended to $h_{\tilde{w},b,\theta}(x) = \sum_{m=1}^{M}\sqrt{\theta_m}\,\langle \tilde{w}_m, \psi_m(x)\rangle_{H_m} + b = \langle \tilde{w}, \psi_\theta(x)\rangle_{H} + b$, where the parameter vector $\tilde{w}$ and the composite feature map $\psi_\theta$ have a block structure $\tilde{w} = (\tilde{w}_1^\top, \ldots, \tilde{w}_M^\top)^\top$. [sent-181, score-0.492]
19 the optimal kernel mixture $\sum_{m=1}^{M}\theta_m k_m$ in addition to regularizing θ to avoid overfitting. [sent-191, score-0.26]
20 Hence, in terms of regularized risk minimization, the optimization problem becomes $\inf_{\tilde{w},b,\theta:\theta\ge 0}\ \frac{1}{n}\sum_{i=1}^{n} V\big(\sum_{m=1}^{M}\sqrt{\theta_m}\,\langle \tilde{w}_m,\psi_m(x_i)\rangle_{H_m}+b,\ y_i\big) + \frac{\lambda}{2}\sum_{m=1}^{M}\|\tilde{w}_m\|_{H_m}^2 + \mu\,\Omega[\theta]$ (2), for µ > 0. [sent-192, score-0.701]
21 Previous approaches to multiple kernel learning employ regularizers of the form $\Omega(\theta) = \|\theta\|_1$ to promote sparse kernel mixtures. [sent-194, score-0.419]
22 The non-convexity arising from the $\sqrt{\theta_m}\,\tilde{w}_m$ product in the loss term of Equation (2) is not inherent and can be resolved by substituting $w_m \leftarrow \sqrt{\theta_m}\,\tilde{w}_m$. [sent-196, score-0.843]
23 $\inf_{w,b,\theta:\theta\ge 0}\ C\sum_{i=1}^{n} V\big(\sum_{m=1}^{M}\langle w_m,\psi_m(x_i)\rangle_{H_m}+b,\ y_i\big) + \frac{1}{2}\sum_{m=1}^{M}\frac{\|w_m\|_{H_m}^2}{\theta_m}$, s.t. $\|\theta\|^2 \le 1$. (4) [sent-207, score-0.311]
24 In particular, we have $\|\theta^*\|_\infty < \infty$, from which we conclude by (5) that $w_m = 0$ holds for all m, which contradicts our assumption. [sent-228, score-0.274]
25 $\sum_{i=1}^{n} V\big(\sum_{m=1}^{M}\langle w_m,\psi_m(x_i)\rangle + b,\ y_i\big) + \frac{1}{2}\sum_{m=1}^{M}\frac{\|w_m\|_2^2}{\theta_m} \le \tau$, for some τ > 0. [sent-232, score-0.585]
26 Let us begin with rewriting Optimization Problem (4) by expanding the decision values into slack variables as follows: $\inf_{w,b,t,\theta}\ C\sum_{i=1}^{n} V(t_i, y_i) + \frac{1}{2}\sum_{m=1}^{M}\frac{\|w_m\|_{H_m}^2}{\theta_m}$, s.t. [sent-248, score-0.345]
27 $\forall i:\ \sum_{m=1}^{M}\langle w_m,\psi_m(x_i)\rangle_{H_m} + b = t_i;\ \ \|\theta\|^2 \le 1;\ \ \theta \ge 0$ (6), where $\|\cdot\|$ is an arbitrary norm in $\mathbb{R}^M$ and $\|\cdot\|_{H_m}$ denotes the Hilbertian norm of $H_m$. [sent-250, score-0.408]
28 The Lagrangian saddle point problem is then given by $\sup_{\alpha,\beta,\gamma:\ \beta\ge 0,\ \gamma\ge 0}\ \inf_{w,b,t,\theta}\ C\sum_{i=1}^{n} V(t_i,y_i) + \frac{1}{2}\sum_{m=1}^{M}\frac{\|w_m\|_{H_m}^2}{\theta_m} - \sum_{i=1}^{n}\alpha_i\big(\sum_{m=1}^{M}\langle w_m,\psi_m(x_i)\rangle_{H_m} + b - t_i\big) + \frac{\beta}{2}\big(\|\theta\|^2 - 1\big) - \gamma^\top\theta$. [sent-255, score-0.371]
29 Denoting the Lagrangian by L and setting its first partial derivatives with respect to w and b to 0 reveals the optimality conditions $\mathbf{1}^\top\alpha = 0$ and $w_m = \theta_m\sum_{i=1}^{n}\alpha_i\,\psi_m(x_i)$, ∀ m = 1, …, M. [sent-256, score-0.274]
30 The above dual (9) generalizes multiple kernel learning to arbitrary convex loss functions and norms. [sent-275, score-0.284]
31 Hence we have a decomposition of the dual into a loss term (in terms of the dual loss) and a regularizer (in terms of the dual norm). [sent-285, score-0.251]
32 A similar dualization technique yielding singly-valued dual loss terms is presented in Rifkin and Lippert (2007); it is based on Fenchel duality and limited to strictly positive definite kernel matrices. [sent-291, score-0.266]
33 (2004) A common approach in multiple kernel learning is to employ regularizers of the form $\Omega(\theta) = \|\theta\|_1$. [sent-314, score-0.236]
34 3 A Smooth Variant of Group Lasso. Yuan and Lin (2006) studied the following optimization problem for the special case $H_m = \mathbb{R}^{d_m}$ and $\psi_m = \mathrm{id}_{\mathbb{R}^{d_m}}$, also known as the group lasso: $\min_{w}\ \frac{C}{2}\sum_{i=1}^{n}\big(y_i - \sum_{m=1}^{M}\langle w_m,\psi_m(x_i)\rangle_{H_m}\big)^2 + \frac{1}{2}\sum_{m=1}^{M}\|w_m\|_{H_m}$. (13) [sent-327, score-0.37]
35 First, we note that Equation (13) can be equivalently expressed as (Micchelli and Pontil, 2005, Lemma 26) $\inf_{w,\theta:\theta\ge 0}\ \frac{C}{2}\sum_{i=1}^{n}\big(y_i - \sum_{m=1}^{M}\langle w_m,\psi_m(x_i)\rangle_{H_m}\big)^2 + \frac{1}{2}\sum_{m=1}^{M}\frac{\|w_m\|_{H_m}^2}{\theta_m}$, s.t. [sent-330, score-0.345]
36 This gives rise to the primal $\inf_{w,\theta:\theta\ge 0}\ C\sum_{i=1}^{n}\max\big(0,\ 1 - \sum_{m=1}^{M}\langle w_m,\psi_m(x_i)\rangle_{H_m}\big) + \frac{1}{2}\sum_{m=1}^{M}\frac{\|w_m\|_{H_m}^2}{\theta_m}$, s.t. [sent-344, score-0.625]
37 Alternatively, E could be obtained by computing pairwise kernel alignments $E_{ij} = \frac{\langle K_i, K_j\rangle}{\|K_i\|\,\|K_j\|}$ given a dot product on the space of kernels such as the Frobenius dot product (Ong et al. [sent-359, score-0.304]
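A small sketch of how such an alignment matrix E could be computed under the Frobenius inner product; this is our illustration (plain, uncentered alignment), not the authors' code.

```python
import numpy as np

def alignment_matrix(kernel_list):
    """E[i, j] = <K_i, K_j>_F / (||K_i||_F * ||K_j||_F) for all kernel pairs."""
    M = len(kernel_list)
    E = np.zeros((M, M))
    for i in range(M):
        for j in range(M):
            inner = np.sum(kernel_list[i] * kernel_list[j])   # Frobenius inner product
            norms = np.linalg.norm(kernel_list[i]) * np.linalg.norm(kernel_list[j])
            E[i, j] = inner / norms
    return E
```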
38 To this end, note that for the dual norm it holds $\big(\tfrac{1}{2}\|\cdot\|_{E^{-1}}^{2}\big)^{*} = \tfrac{1}{2}\|\cdot\|_{E}^{2}$, so that we obtain from (9) the following dual $\sup_{\alpha,\gamma:\ \mathbf{1}^\top\alpha=0,\ \gamma\ge 0}\ -C\sum_{i=1}^{n} V^{*}\big(-\tfrac{\alpha_i}{C},\ y_i\big) - \big\|\big(\tfrac{1}{2}\alpha^\top K_m\alpha + \gamma_m\big)_{m=1}^{M}\big\|_{E}$, which is the desired non-isotropic MKL problem. [sent-365, score-0.262]
39 $\inf_{w,b,\theta:\theta\ge 0}\ C\sum_{i=1}^{n} V\big(\sum_{m=1}^{M}\langle w_m,\psi_m(x_i)\rangle_{H_m}+b,\ y_i\big) + \frac{1}{2}\sum_{m=1}^{M}\frac{\|w_m\|_{H_m}^2}{\theta_m}$, s.t. $\|\theta\|_p^2 \le 1$. (14) [sent-373, score-0.311]
40 Using that the dual norm of the ℓ p -norm is the ℓ p∗ -norm, where $p^* := \frac{p}{p-1}$, and noting that $\gamma^* = 0$ in the optimal point, we obtain from Optimization Problem (9) the following ℓ p -norm MKL dual: [sent-374, score-0.407]
41 , 2009) do so by setting up a two-layer optimization procedure: a master problem, which is parameterized only by θ, is solved to determine the kernel mixture; to solve this master problem, a slave problem is solved repeatedly, which amounts to training a standard SVM on a mixture kernel. [sent-403, score-0.267]
42 Albeit appearing advantageous, wrapper methods suffer from two shortcomings: (i) Due to kernel cache limitations, the kernel matrices have to be pre-computed and stored or many kernel computations have to be carried out repeatedly, inducing heavy wastage of either memory or time. [sent-406, score-0.549]
43 (2006a) and extends it to the optimization of non-sparse kernel mixtures induced by an ℓ p -norm penalty. [sent-413, score-0.231]
44 4.1 A Simple Wrapper Approach Based on an Analytical Update. We first present an easy-to-implement wrapper version of our optimization approach to multiple kernel learning. [sent-428, score-0.307]
45 Given fixed (possibly suboptimal) w ≠ 0 and b, the minimal θ in Optimization Problem (14) is attained for $\theta_m = \frac{\|w_m\|_{H_m}^{2/(p+1)}}{\big(\sum_{m'=1}^{M}\|w_{m'}\|_{H_{m'}}^{2p/(p+1)}\big)^{1/p}}$, ∀ m = 1, …, M. (16) [sent-438, score-0.274]
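The closed-form update (16) translates directly into a few lines of code; a sketch assuming the RKHS norms ‖w_m‖ have already been computed and that p is finite (the helper name update_theta is ours).

```python
import numpy as np

def update_theta(w_norms, p):
    """Analytical theta update of Equation (16) for finite p >= 1.

    w_norms : array of length M holding the current values of ||w_m||_{H_m}
    p       : norm parameter of the ell_p constraint
    """
    w_norms = np.asarray(w_norms, dtype=float)
    numer = w_norms ** (2.0 / (p + 1.0))
    denom = np.sum(w_norms ** (2.0 * p / (p + 1.0))) ** (1.0 / p)
    return numer / denom
```

By construction the returned vector satisfies ‖θ‖_p = 1, so the ℓ p -norm constraint of (14) is met with equality.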
46 Proof. We start the derivation by equivalently translating Optimization Problem (14) via Theorem 1 into $\inf_{w,b,\theta:\theta\ge 0}\ \tilde{C}\sum_{i=1}^{n} V\big(\sum_{m=1}^{M}\langle w_m,\psi_m(x_i)\rangle_{H_m}+b,\ y_i\big) + \frac{1}{2}\sum_{m=1}^{M}\frac{\|w_m\|_{H_m}^2}{\theta_m} + \frac{\mu}{2}\|\theta\|_p^2$ (17), with µ > 0. [sent-442, score-0.345]
47 Setting the partial derivatives of (17) with respect to θ to zero yields the following condition on the optimality of θ: $-\frac{\|w_m\|_{H_m}^2}{2\theta_m^2} + \mu\cdot\frac{\partial}{\partial\theta_m}\big(\tfrac{1}{2}\|\theta\|_p^2\big) = 0$, ∀ m = 1, …, M. [sent-446, score-0.274]
48 Inserting (19) in the latter equation leads to $\zeta = \big(\sum_{m=1}^{M}\|w_m\|_{H_m}^{2p/(p+1)}\big)^{1/p}$. [sent-455, score-0.274]
49 The resulting dual problem is of the form (detailed derivations omitted) $\sup_{\alpha:\ \mathbf{1}^\top\alpha=0}\ -C\sum_{i=1}^{n} V^{*}\big(-\tfrac{\alpha_i}{C},\ y_i\big) - \frac{1}{2}\sum_{m=1}^{M}\theta_m\,\alpha^\top K_m\alpha$ (20), and the KKT conditions yield $w_m = \theta_m\sum_{i=1}^{n}\alpha_i\,\psi_m(x_i)$ in the optimal point, hence $\|w_m\|^2 = \theta_m^2\,\alpha^\top K_m\alpha$, ∀ m = 1, …, M. [sent-467, score-0.677]
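Combining (16) with the identity ‖w_m‖² = θ_m² αᵀK_mα yields the simple wrapper of this subsection. The sketch below is our approximation for the hinge loss, built on scikit-learn's SVC with a precomputed kernel and the update_theta helper from the sketch above; it is not the authors' SHOGUN implementation and assumes finite p ≥ 1.

```python
import numpy as np
from sklearn.svm import SVC

def wrapper_mkl(kernel_list, y, p, C=1.0, n_iter=50, tol=1e-4):
    """Alternate between an SVM on the mixture kernel and the analytical theta update."""
    M = len(kernel_list)
    theta = np.full(M, M ** (-1.0 / p))          # uniform start with ||theta||_p = 1
    for _ in range(n_iter):
        K = sum(t * Km for t, Km in zip(theta, kernel_list))
        svm = SVC(C=C, kernel="precomputed").fit(K, y)
        sv, dc = svm.support_, svm.dual_coef_.ravel()   # dc[i] = y_i * alpha_i
        # ||w_m||^2 = theta_m^2 * alpha^T K_m alpha, evaluated on the support vectors
        w_norms = np.array([
            theta[m] * np.sqrt(max(dc @ kernel_list[m][np.ix_(sv, sv)] @ dc, 0.0))
            for m in range(M)
        ])
        theta_new = update_theta(w_norms, p)            # closed-form update, Eq. (16)
        if np.linalg.norm(theta_new - theta) < tol:
            return theta_new, svm
        theta = theta_new
    return theta, svm
```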
50 4.2 Towards Large-Scale MKL—Interleaving SVM and MKL Optimization. However, a disadvantage of the above wrapper approach is still that it deploys a full-blown kernel matrix. [sent-489, score-0.223]
51 We start by expanding Optimization Problem (14) into $\min_{w,b,\xi,\theta}\ C\sum_{i=1}^{n}\xi_i + \frac{1}{2}\sum_{m=1}^{M}\frac{\|w_m\|_{H_m}^2}{\theta_m}$, s.t. [sent-559, score-0.274]
52 $\forall i:\ \sum_{m=1}^{M}\langle w_m,\psi_m(x_i)\rangle_{H_m} + b \ge 1 - \xi_i;\ \ \xi \ge 0;\ \ \|\theta\|_p^2 \le 1;\ \ \theta \ge 0$, thereby extending the second block of variables, (w, b), into (w, b, ξ). [sent-561, score-0.33]
53 In the problem’s current form, the possibility of an optimal θm = 0 while wm = 0 renders the objective function nondifferentiable. [sent-563, score-0.274]
54 $\forall i:\ \sum_{m=1}^{M}\langle w_m,\psi_m(x_i)\rangle_{\mathbb{R}^n} + b \ge 1 - \xi_i;\ \ \xi \ge 0;\ \ \|\exp(\theta)\|_p^2 \le 1$, with the regularization term now reading $\frac{1}{2}\sum_{m=1}^{M}\exp(-\theta_m)\,\|w_m\|_{\mathbb{R}^n}^2$, where we employ the notation exp(θ) = (exp(θ1), · · · , exp(θM))⊤. [sent-571, score-0.57]
55 First, the standard Hilbert space setup necessarily implies that $\|w_m\| \ge 0$ for all m. [sent-583, score-0.274]
56 However, for any $\|w_m\| = 0$ it also follows that $\theta_m^\star = 0$ as long as at least one strictly positive $\|w_{m'}\| > 0$ exists. [sent-585, score-0.274]
57 In addition one can choose the optimization scheme, that is, decide whether the interleaved optimization algorithm or the wrapper algorithm should be applied. [sent-612, score-0.305]
58 In the more conventional family of approaches, the wrapper algorithms, an optimization scheme on θ wraps around a single kernel SVM. [sent-617, score-0.282]
59 The second, much faster approach performs interleaved optimization and thus requires modification of the core SVM optimization algorithm. [sent-625, score-0.233]
60 To reduce the implementation effort, we implement a single function perform_mkl_step($\Sigma_\alpha$, obj$_m$), that has the arguments $\Sigma_\alpha := \sum_{i=1}^{n}\alpha_i$ and $\mathrm{obj}_m = \frac{1}{2}\alpha^\top K_m\alpha$, that is, the current linear α-term and the SVM objectives for each kernel. [sent-627, score-0.639]
61 This function is either, in the interleaved optimization case, called as a callback function (after each chunking step or a couple of SMO steps), or it is called by the wrapper algorithm (after each SVM optimization to full precision). [sent-636, score-0.328]
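A rough sketch of what such a perform_mkl_step routine could look like for the hinge loss, reusing the update_theta helper from above; the actual SHOGUN routine differs in its bookkeeping and stopping criteria, so this is only an illustration of the interface described here.

```python
import numpy as np

def perform_mkl_step(sum_alpha, obj, theta, p):
    """One MKL step from the per-kernel SVM objectives obj[m] = 0.5 * alpha^T K_m alpha.

    Returns the updated mixing coefficients together with the current (hinge-loss)
    MKL objective sum_alpha - sum_m theta_m * obj[m], useful for a convergence check.
    """
    obj = np.asarray(obj, dtype=float)
    # ||w_m||^2 = theta_m^2 * alpha^T K_m alpha = 2 * theta_m^2 * obj[m]
    w_norms = theta * np.sqrt(np.maximum(2.0 * obj, 0.0))
    theta_new = update_theta(w_norms, p)        # analytical update, Eq. (16)
    mkl_objective = sum_alpha - float(theta_new @ obj)
    return theta_new, mkl_objective
```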
62 While for the wrapper algorithms only a single kernel SVM needs to be solved and thus a single large kernel cache should be used, the story is different for interleaved optimization. [sent-645, score-0.489]
63 This implies that the kernels have to be normalized in a sensible way in order to represent an “uninformative prior” as to which kernels are useful. [sent-654, score-0.236]
64 Formally, we find a positive rescaling $\rho_m$ of the kernel, such that the rescaled kernel $\tilde{k}_m(\cdot,\cdot) = \rho_m k_m(\cdot,\cdot)$ and the corresponding feature map $\tilde{\Phi}_m(\cdot) = \sqrt{\rho_m}\,\Phi_m(\cdot)$ satisfy $\frac{1}{n}\sum_{i=1}^{n}\big\|\tilde{\Phi}_m(x_i) - \frac{1}{n}\sum_{j=1}^{n}\tilde{\Phi}_m(x_j)\big\|^2 = 1$. [sent-662, score-0.369]
65 The above equation can equivalently be expressed in terms of kernel functions as $\frac{1}{n}\sum_{i=1}^{n}\tilde{k}_m(x_i,x_i) - \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\tilde{k}_m(x_i,x_j) = 1$, so that the final normalization rule is $k(x,\bar{x}) \longrightarrow \frac{k(x,\bar{x})}{\frac{1}{n}\sum_{i=1}^{n}k(x_i,x_i) - \frac{1}{n^2}\sum_{i,j=1}^{n}k(x_i,x_j)}$. [sent-669, score-0.39]
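The normalization rule depends only on the training Gram matrix, so it amounts to dividing each kernel matrix by a single scalar; a short sketch (ours, not the paper's code). The same scale factor would also have to be applied to the corresponding test-kernel rows.

```python
import numpy as np

def multiplicative_normalization(K):
    """Rescale a training Gram matrix K so that the implicit feature points
    have unit average squared distance to their mean, per the rule above."""
    n = K.shape[0]
    scale = np.mean(np.diag(K)) - np.sum(K) / (n * n)
    return K / scale
```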
66 To this aim, let us recall the primal MKL problem (14) and consider the special case of ℓ p -norm MKL given by $\inf_{w,b,\theta:\theta\ge 0}\ C\sum_{i=1}^{n} V\big(\sum_{m=1}^{M}\langle w_m,\psi_m(x_i)\rangle_{H_m}+b,\ y_i\big) + \frac{1}{2}\sum_{m=1}^{M}\frac{\|w_m\|_{H_m}^2}{\theta_m}$, s.t. $\|\theta\|_p^2 \le 1$. (24) [sent-682, score-0.662]
67 The subsequent proposition shows that (24) can equivalently be translated into the following mixed-norm formulation, $\inf_{w,b}\ \tilde{C}\sum_{i=1}^{n} V\big(\sum_{m=1}^{M}\langle w_m,\psi_m(x_i)\rangle_{H_m}+b,\ y_i\big) + \frac{1}{2}\sum_{m=1}^{M}\|w_m\|_{H_m}^{q}$ (25), where $q = \frac{2p}{p+1}$ and $\tilde{C}$ is a constant. [sent-685, score-0.619]
68 It follows that for any fixed w in (24) it holds for the w-optimal θ: $\exists\,\zeta:\ \theta_m = \zeta\,\|w_m\|_{H_m}^{2/(p+1)}$, ∀ m = 1, …, M. [sent-694, score-0.274]
69 Plugging the above equation into (24) yields $\inf_{w,b}\ C\sum_{i=1}^{n} V\big(\sum_{m=1}^{M}\langle w_m,\psi_m(x_i)\rangle_{H_m}+b,\ y_i\big) + \frac{1}{2}\big(\sum_{m=1}^{M}\|w_m\|_{H_m}^{2p/(p+1)}\big)^{(p+1)/p}$; defining $q := \frac{2p}{p+1}$ then recovers the block-norm formulation (25). [sent-698, score-0.619]
70 (2011) showed in particular that for q ≥ 2, Equation (25) is equivalent to $\sup_{\theta:\theta\ge 0,\ \|\theta\|_r \le 1}\ \inf_{w,b}\ \tilde{C}\sum_{i=1}^{n} V\big(\sum_{m=1}^{M}\langle w_m,\psi_m(x_i)\rangle_{H_m}+b,\ y_i\big) + \frac{1}{2}\sum_{m=1}^{M}\theta_m\,\|w_m\|_{H_m}^2$, where $r := \frac{q}{q-2}$. [sent-704, score-0.645]
71 It is straightforward to show that for every fixed (possibly suboptimal) pair (w, b) the optimal θ is given by $\theta_m = \frac{\|w_m\|_{H_m}^{2/(r-1)}}{\big(\sum_{m'=1}^{M}\|w_{m'}\|_{H_{m'}}^{2r/(r-1)}\big)^{1/r}}$, ∀ m = 1, …, M. [sent-706, score-0.274]
72 Therefore, the corresponding update formula in our framework is $\theta_m = \frac{\|w_m\|_{H_m}^{-2/(r-1)}}{\big(\sum_{m'=1}^{M}\|w_{m'}\|_{H_{m'}}^{-2r/(r-1)}\big)^{1/r}}$, ∀ m = 1, …, M. [sent-713, score-0.274]
73 (2010a), we consider the following hypothesis class for p ∈ [1, ∞]: $H_M^p := \big\{\, h : X \to \mathbb{R} \mid h(x) = \sum_{m=1}^{M}\sqrt{\theta_m}\,\langle w_m,\psi_m(x)\rangle_{H_m},\ \|w\|_{H} \le 1,\ \|\theta\|_p \le 1 \big\}$. [sent-729, score-0.296]
74 $R(H_M^p) = \mathbb{E}\sup_{w:\|w\|_H\le 1,\ \theta:\|\theta\|_p\le 1}\ \frac{1}{n}\sum_{i=1}^{n}\sigma_i\sum_{m=1}^{M}\sqrt{\theta_m}\,\langle w_m,\psi_m(x_i)\rangle_{H_m} \le M^{\frac{1}{q}-\frac{1}{p}}\ \mathbb{E}\sup_{w:\|w\|_H\le 1,\ \theta:\|\theta\|_q\le 1}\ \frac{1}{n}\sum_{i=1}^{n}\sigma_i\sum_{m=1}^{M}\sqrt{\theta_m}\,\langle w_m,\psi_m(x_i)\rangle_{H_m} = M^{\frac{1}{q}-\frac{1}{p}}\,R(H_M^q)$. [sent-765, score-0.874]
75 It holds that $H_M^p = \big\{\, h : X \to \mathbb{R} \mid h(x) = \sum_{m=1}^{M}\langle w_m,\psi_m(x)\rangle_{H_m},\ \|w\|_{2,q} \le 1 \big\}$ with $q := 2p/(p+1)$, where $\|w\|_{2,q} = \big(\sum_{m=1}^{M}\|w_m\|_{H_m}^{q}\big)^{1/q}$ is the ℓ2,q-block norm. [sent-862, score-0.274]
76 Second, let us generalize the set by introducing an additional parameter C as follows: $H_M^{p,C} := \big\{\, h : X \to \mathbb{R} \mid h(x) = \sum_{m=1}^{M}\langle w_m,\psi_m(x)\rangle_{H_m},\ \|w\|_{2,q} \le C \big\}$, with $q := 2p/(p+1)$. [sent-864, score-0.274]
77 Then, each feature is input to a linear kernel and the resulting kernel matrices are multiplicatively normalized as described in Section 4. [sent-954, score-0.347]
78 Hence, ν(θ) gives the fraction of noise kernels in the working kernel set. [sent-957, score-0.269]
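To make this setup concrete, here is a toy construction in the spirit of the experiment — one linear kernel per feature, a fraction ν of which is built from pure-noise features — reusing multiplicative_normalization from the sketch above. This is our illustrative reconstruction, not the authors' exact data-generation protocol.

```python
import numpy as np

def toy_kernel_set(n=200, M=50, noise_fraction=0.5, seed=0):
    """Binary labels plus M one-feature linear kernels, noise_fraction of them uninformative."""
    rng = np.random.default_rng(seed)
    n_noise = int(round(noise_fraction * M))
    y = rng.choice([-1.0, 1.0], size=n)
    informative = y[:, None] + rng.normal(scale=2.0, size=(n, M - n_noise))
    noise = rng.normal(size=(n, n_noise))
    X = np.hstack([informative, noise])
    kernels = [multiplicative_normalization(np.outer(X[:, j], X[:, j]))
               for j in range(M)]
    return kernels, y
```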
79 In contrast, the vanilla SVM using an unweighted-sum kernel performs best when all kernels are equally informative; however, its performance does not approach the Bayes error rate. [sent-979, score-0.269]
80 [Figure: panels (a) and (b); x-axis: ν(θ) = fraction of noise kernels, in %.] [sent-1010, score-0.236]
81 As expected, sparse MKL performs best in sparse scenarios, while non-sparse MKL performs best in moderate or non-sparse scenarios, and for uniform scenarios the unweighted-sum kernel SVM performs best. [sent-1021, score-0.215]
82 There are four sets of 16 kernels each, in which each kernel picks up very similar information: they only differ in number and placing of gaps in all substrings of length 5 of a given part of the protein sequence. [sent-1139, score-0.269]
83 Further, Ong and Zien (2008) studied single kernel SVMs for each kernel individually and found that in most cases the 16 kernels from the same subset perform very similarly. [sent-1143, score-0.42]
84 (2006b), we employ five different kernels representing the TSS signal (weighted degree with shift), the promoter (spectrum), the 1st exon (spectrum), angles (linear), and energies (linear). [sent-1160, score-0.223]
85 Figure 5: Pairwise alignments of the kernel matrices are shown for the gene start recognition experiment (axes: kernel id). [sent-1217, score-0.361]
86 However, the energy kernel shows only a slight correlation with the remaining kernels, which is surprisingly little compared to its single kernel performance (AUC=0. [sent-1231, score-0.302]
87 We conclude that this kernel carries complementary and orthogonal information about the learning problem and should thus be included in the resulting kernel mixture. [sent-1233, score-0.302]
88 They provided kernel matrices capturing expression data (EXP), cellular localization (LOC), and the phylogenetic profile (PHY); additionally we use the integration of the former 3 kernels (INT) which matches our definition of an unweighted-sum kernel. [sent-1243, score-0.293]
89 Increasing the number of kernels by including recombined and product kernels does improve the results obtained by MKL for small values of p, but the maximal AUC values are not statistically significantly different from those of ℓ∞ -norm MKL. [sent-1252, score-0.236]
90 We conjecture that the performance of the unweighted-sum kernel SVM can be explained by all three kernels performing well individually. [sent-1253, score-0.269]
91 Figure 6: Pairwise alignments of the kernel matrices are shown for the metabolic gene network experiment (axes: kernel id). [sent-1314, score-0.237]
92 We experiment with MKL using precomputed kernels (excluding the kernel computation time from the timings) and MKL based on on-the-fly computed kernel matrices measuring training time including kernel computations. [sent-1334, score-0.672]
93 As expected, the SVM with the unweighted-sum kernel using precomputed kernel matrices is the fastest method. [sent-1350, score-0.373]
94 In contrast, kernel caching and interleaved optimization still make it possible to train our algorithm on kernel matrices of size 20000 × 20000, which would usually not completely fit into memory since they require about 149GB. [sent-1356, score-0.5]
95 The bandwidths of the kernels are scaled such that for M kernels 2σ2 ∈ 1. [sent-1366, score-0.236]
96 On the other hand, our interleaved optimizers which allow for effective caching can easily cope with 10,000 kernels of the same size (74GB). [sent-1392, score-0.233]
97 Using efficient kernel caching, they allow for truly large-scale multiple kernel learning well beyond the limits imposed by having to precompute and store the complete kernel matrices. [sent-1396, score-0.478]
98 Finally, we note that performing MKL with 1,000 precomputed kernel matrices of size 1,000 × 1,000 requires less than 3 minutes for the SILP. [sent-1397, score-0.222]
99 By proposing ℓ p -norm multiple kernel learning, conceiving an optimization scheme of unprecedented efficiency, and providing a really efficient implementation (http://doc. [sent-1402, score-0.235]
100 However, in general sparse MKL solutions are sensitive to kernel normalization, and in particular in the presence of strongly correlated kernels the selection of kernels may be somewhat arbitrary. [sent-1424, score-0.419]
wordName wordTfidf (topN-words)
[('mkl', 0.639), ('hm', 0.341), ('wm', 0.274), ('kernel', 0.151), ('ien', 0.151), ('loft', 0.151), ('onnenburg', 0.151), ('orm', 0.151), ('refeld', 0.151), ('kernels', 0.118), ('interleaved', 0.115), ('sonnenburg', 0.11), ('km', 0.109), ('ultiple', 0.106), ('kloft', 0.087), ('ernel', 0.08), ('svm', 0.079), ('wrapper', 0.072), ('zien', 0.068), ('norm', 0.067), ('dual', 0.066), ('cortes', 0.064), ('kmy', 0.062), ('promoter', 0.062), ('optimization', 0.059), ('rademacher', 0.056), ('sip', 0.055), ('ong', 0.052), ('norms', 0.052), ('analytical', 0.049), ('simplemkl', 0.048), ('bleakley', 0.048), ('precomputed', 0.047), ('ivanov', 0.047), ('silp', 0.047), ('conversion', 0.046), ('auc', 0.046), ('primal', 0.043), ('tss', 0.041), ('tikhonov', 0.041), ('rakotomamonjy', 0.04), ('regularizers', 0.038), ('earning', 0.038), ('yi', 0.037), ('alignments', 0.035), ('subcellular', 0.035), ('shogun', 0.035), ('fly', 0.034), ('mcc', 0.034), ('inf', 0.034), ('block', 0.032), ('sparse', 0.032), ('regularizer', 0.032), ('bioinformatics', 0.031), ('training', 0.03), ('varma', 0.03), ('transcription', 0.03), ('hessianmkl', 0.029), ('nath', 0.029), ('mixing', 0.029), ('sch', 0.028), ('xm', 0.028), ('duality', 0.028), ('metabolic', 0.027), ('slave', 0.027), ('scenario', 0.027), ('sup', 0.026), ('szafranski', 0.026), ('rm', 0.025), ('multiple', 0.025), ('sparsity', 0.025), ('thereby', 0.024), ('hinge', 0.024), ('cm', 0.024), ('matrices', 0.024), ('chunking', 0.023), ('chapelle', 0.023), ('risk', 0.023), ('vice', 0.023), ('regularization', 0.023), ('svms', 0.023), ('brefeld', 0.022), ('hypothesis', 0.022), ('employ', 0.022), ('lanckriet', 0.022), ('convex', 0.021), ('loss', 0.021), ('lkopf', 0.021), ('normalization', 0.021), ('berlin', 0.021), ('mixtures', 0.021), ('mohri', 0.021), ('multiplicatively', 0.021), ('rifkin', 0.021), ('genomic', 0.021), ('translates', 0.021), ('alternatingly', 0.021), ('bajic', 0.021), ('exon', 0.021), ('franc', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999875 105 jmlr-2011-lp-Norm Multiple Kernel Learning
Author: Marius Kloft, Ulf Brefeld, Sören Sonnenburg, Alexander Zien
Abstract: Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability and scalability. Unfortunately, this ℓ1 -norm MKL is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures that generalize well, we extend MKL to arbitrary norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary norms, that is ℓ p -norms with p ≥ 1. This interleaved optimization is much faster than the commonly used wrapper approaches, as demonstrated on several data sets. A theoretical analysis and an experiment on controlled artificial data shed light on the appropriateness of sparse, non-sparse and ℓ∞ -norm MKL in various scenarios. Importantly, empirical applications of ℓ p -norm MKL to three real-world problems from computational biology show that non-sparse MKL achieves accuracies that surpass the state-of-the-art. Data sets, source code to reproduce the experiments, implementations of the algorithms, and further information are available at http://doc.ml.tu-berlin.de/nonsparse_mkl/. Keywords: multiple kernel learning, learning kernels, non-sparse, support vector machine, convex conjugate, block coordinate descent, large scale optimization, bioinformatics, generalization bounds, Rademacher complexity ∗. Also at Machine Learning Group, Technische Universit¨ t Berlin, 10587 Berlin, Germany. a †. Parts of this work were done while SS was at the Friedrich Miescher Laboratory, Max Planck Society, 72076 T¨ bingen, Germany. u ‡. Most contributions by AZ were done at the Fraunhofer Institute FIRST, 12489 Berlin, Germany. c 2011 Marius Kloft, Ulf Brefeld, S¨ ren Sonnenburg and Alexander Zien. o K LOFT, B REFELD , S ONNENBURG AND Z IEN
2 0.33254707 66 jmlr-2011-Multiple Kernel Learning Algorithms
Author: Mehmet Gönen, Ethem Alpaydın
Abstract: In recent years, several methods have been proposed to combine multiple kernels instead of using a single one. These different kernels may correspond to using different notions of similarity or may be using information coming from multiple sources (different representations or different feature subsets). In trying to organize and highlight the similarities and differences between them, we give a taxonomy of and review several multiple kernel learning algorithms. We perform experiments on real data sets for better illustration and comparison of existing algorithms. We see that though there may not be large differences in terms of accuracy, there is difference between them in complexity as given by the number of stored support vectors, the sparsity of the solution as given by the number of used kernels, and training time complexity. We see that overall, using multiple kernels instead of a single one is useful and believe that combining kernels in a nonlinear or data-dependent way seems more promising than linear combination in fusing information provided by simple linear kernels, whereas linear methods are more reasonable when combining complex Gaussian kernels. Keywords: support vector machines, kernel machines, multiple kernel learning
3 0.21372634 101 jmlr-2011-Variable Sparsity Kernel Learning
Author: Jonathan Aflalo, Aharon Ben-Tal, Chiranjib Bhattacharyya, Jagarlapudi Saketha Nath, Sankaran Raman
Abstract: paper1 This presents novel algorithms and applications for a particular class of mixed-norm regularization based Multiple Kernel Learning (MKL) formulations. The formulations assume that the given kernels are grouped and employ l1 norm regularization for promoting sparsity within RKHS norms of each group and ls , s ≥ 2 norm regularization for promoting non-sparse combinations across groups. Various sparsity levels in combining the kernels can be achieved by varying the grouping of kernels—hence we name the formulations as Variable Sparsity Kernel Learning (VSKL) formulations. While previous attempts have a non-convex formulation, here we present a convex formulation which admits efficient Mirror-Descent (MD) based solving techniques. The proposed MD based algorithm optimizes over product of simplices and has a computational complexity of O m2 ntot log nmax /ε2 where m is no. training data points, nmax , ntot are the maximum no. kernels in any group, total no. kernels respectively and ε is the error in approximating the objective. A detailed proof of convergence of the algorithm is also presented. Experimental results show that the VSKL formulations are well-suited for multi-modal learning tasks like object categorization. Results also show that the MD based algorithm outperforms state-of-the-art MKL solvers in terms of computational efficiency. Keywords: multiple kernel learning, mirror descent, mixed-norm, object categorization, scalability 1. All authors contributed equally. The author names appear in alphabetical order. c 2011 Jonathan Aflalo, Aharon Ben-Tal, Chiranjib Bhattacharyya, Jagarlapudi Saketha Nath and Sankaran Raman. A FLALO , B EN -TAL , B HATTACHARYYA , NATH AND R AMAN
4 0.10440829 55 jmlr-2011-Learning Multi-modal Similarity
Author: Brian McFee, Gert Lanckriet
Abstract: In many applications involving multi-media data, the definition of similarity between items is integral to several key tasks, including nearest-neighbor retrieval, classification, and recommendation. Data in such regimes typically exhibits multiple modalities, such as acoustic and visual content of video. Integrating such heterogeneous data to form a holistic similarity space is therefore a key challenge to be overcome in many real-world applications. We present a novel multiple kernel learning technique for integrating heterogeneous data into a single, unified similarity space. Our algorithm learns an optimal ensemble of kernel transformations which conform to measurements of human perceptual similarity, as expressed by relative comparisons. To cope with the ubiquitous problems of subjectivity and inconsistency in multimedia similarity, we develop graph-based techniques to filter similarity measurements, resulting in a simplified and robust training procedure. Keywords: multiple kernel learning, metric learning, similarity
5 0.084254041 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels
Author: Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, Karsten M. Borgwardt
Abstract: In this article, we propose a family of efficient kernels for large graphs with discrete node labels. Key to our method is a rapid feature extraction scheme based on the Weisfeiler-Lehman test of isomorphism on graphs. It maps the original graph to a sequence of graphs, whose node attributes capture topological and label information. A family of kernels can be defined based on this Weisfeiler-Lehman sequence of graphs, including a highly efficient kernel comparing subtree-like patterns. Its runtime scales only linearly in the number of edges of the graphs and the length of the Weisfeiler-Lehman graph sequence. In our experimental evaluation, our kernels outperform state-of-the-art graph kernels on several graph classification benchmark data sets in terms of accuracy and runtime. Our kernels open the door to large-scale applications of graph kernels in various disciplines such as computational biology and social network analysis. Keywords: graph kernels, graph classification, similarity measures for graphs, Weisfeiler-Lehman algorithm c 2011 Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn and Karsten M. Borgwardt. S HERVASHIDZE , S CHWEITZER , VAN L EEUWEN , M EHLHORN AND B ORGWARDT
6 0.069871671 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
7 0.065041423 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
8 0.061666857 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods
9 0.057020463 98 jmlr-2011-Universality, Characteristic Kernels and RKHS Embedding of Measures
10 0.051774438 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors
11 0.051747199 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
12 0.051519055 48 jmlr-2011-Kernel Analysis of Deep Networks
13 0.046985783 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
14 0.04656012 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
15 0.043721411 91 jmlr-2011-The Sample Complexity of Dictionary Learning
16 0.037233915 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
17 0.035054453 31 jmlr-2011-Efficient and Effective Visual Codebook Generation Using Additive Kernels
18 0.035047598 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
19 0.034189425 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
20 0.033954613 51 jmlr-2011-Laplacian Support Vector Machines Trained in the Primal
topicId topicWeight
[(0, 0.242), (1, -0.066), (2, 0.258), (3, -0.385), (4, 0.125), (5, 0.067), (6, -0.125), (7, -0.044), (8, 0.006), (9, -0.301), (10, 0.081), (11, 0.03), (12, -0.136), (13, -0.004), (14, 0.082), (15, -0.09), (16, -0.223), (17, 0.09), (18, 0.008), (19, 0.103), (20, 0.146), (21, -0.026), (22, 0.204), (23, -0.066), (24, 0.03), (25, -0.043), (26, 0.009), (27, 0.062), (28, 0.077), (29, 0.002), (30, -0.031), (31, 0.083), (32, 0.048), (33, 0.023), (34, 0.037), (35, -0.012), (36, 0.129), (37, -0.029), (38, 0.025), (39, 0.015), (40, 0.02), (41, -0.01), (42, 0.06), (43, -0.043), (44, 0.036), (45, 0.032), (46, -0.011), (47, 0.006), (48, 0.005), (49, 0.005)]
simIndex simValue paperId paperTitle
same-paper 1 0.93530178 105 jmlr-2011-lp-Norm Multiple Kernel Learning
Author: Marius Kloft, Ulf Brefeld, Sören Sonnenburg, Alexander Zien
Abstract: Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability and scalability. Unfortunately, this ℓ1 -norm MKL is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures that generalize well, we extend MKL to arbitrary norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary norms, that is ℓ p -norms with p ≥ 1. This interleaved optimization is much faster than the commonly used wrapper approaches, as demonstrated on several data sets. A theoretical analysis and an experiment on controlled artificial data shed light on the appropriateness of sparse, non-sparse and ℓ∞ -norm MKL in various scenarios. Importantly, empirical applications of ℓ p -norm MKL to three real-world problems from computational biology show that non-sparse MKL achieves accuracies that surpass the state-of-the-art. Data sets, source code to reproduce the experiments, implementations of the algorithms, and further information are available at http://doc.ml.tu-berlin.de/nonsparse_mkl/. Keywords: multiple kernel learning, learning kernels, non-sparse, support vector machine, convex conjugate, block coordinate descent, large scale optimization, bioinformatics, generalization bounds, Rademacher complexity ∗. Also at Machine Learning Group, Technische Universit¨ t Berlin, 10587 Berlin, Germany. a †. Parts of this work were done while SS was at the Friedrich Miescher Laboratory, Max Planck Society, 72076 T¨ bingen, Germany. u ‡. Most contributions by AZ were done at the Fraunhofer Institute FIRST, 12489 Berlin, Germany. c 2011 Marius Kloft, Ulf Brefeld, S¨ ren Sonnenburg and Alexander Zien. o K LOFT, B REFELD , S ONNENBURG AND Z IEN
2 0.92341465 66 jmlr-2011-Multiple Kernel Learning Algorithms
Author: Mehmet Gönen, Ethem Alpaydın
Abstract: In recent years, several methods have been proposed to combine multiple kernels instead of using a single one. These different kernels may correspond to using different notions of similarity or may be using information coming from multiple sources (different representations or different feature subsets). In trying to organize and highlight the similarities and differences between them, we give a taxonomy of and review several multiple kernel learning algorithms. We perform experiments on real data sets for better illustration and comparison of existing algorithms. We see that though there may not be large differences in terms of accuracy, there is difference between them in complexity as given by the number of stored support vectors, the sparsity of the solution as given by the number of used kernels, and training time complexity. We see that overall, using multiple kernels instead of a single one is useful and believe that combining kernels in a nonlinear or data-dependent way seems more promising than linear combination in fusing information provided by simple linear kernels, whereas linear methods are more reasonable when combining complex Gaussian kernels. Keywords: support vector machines, kernel machines, multiple kernel learning
3 0.5244891 101 jmlr-2011-Variable Sparsity Kernel Learning
Author: Jonathan Aflalo, Aharon Ben-Tal, Chiranjib Bhattacharyya, Jagarlapudi Saketha Nath, Sankaran Raman
Abstract: paper1 This presents novel algorithms and applications for a particular class of mixed-norm regularization based Multiple Kernel Learning (MKL) formulations. The formulations assume that the given kernels are grouped and employ l1 norm regularization for promoting sparsity within RKHS norms of each group and ls , s ≥ 2 norm regularization for promoting non-sparse combinations across groups. Various sparsity levels in combining the kernels can be achieved by varying the grouping of kernels—hence we name the formulations as Variable Sparsity Kernel Learning (VSKL) formulations. While previous attempts have a non-convex formulation, here we present a convex formulation which admits efficient Mirror-Descent (MD) based solving techniques. The proposed MD based algorithm optimizes over product of simplices and has a computational complexity of O m2 ntot log nmax /ε2 where m is no. training data points, nmax , ntot are the maximum no. kernels in any group, total no. kernels respectively and ε is the error in approximating the objective. A detailed proof of convergence of the algorithm is also presented. Experimental results show that the VSKL formulations are well-suited for multi-modal learning tasks like object categorization. Results also show that the MD based algorithm outperforms state-of-the-art MKL solvers in terms of computational efficiency. Keywords: multiple kernel learning, mirror descent, mixed-norm, object categorization, scalability 1. All authors contributed equally. The author names appear in alphabetical order. c 2011 Jonathan Aflalo, Aharon Ben-Tal, Chiranjib Bhattacharyya, Jagarlapudi Saketha Nath and Sankaran Raman. A FLALO , B EN -TAL , B HATTACHARYYA , NATH AND R AMAN
4 0.35161996 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
Author: Jinfeng Zhuang, Ivor W. Tsang, Steven C.H. Hoi
Abstract: Previous studies of Non-Parametric Kernel Learning (NPKL) usually formulate the learning task as a Semi-Definite Programming (SDP) problem that is often solved by some general purpose SDP solvers. However, for N data examples, the time complexity of NPKL using a standard interiorpoint SDP solver could be as high as O(N 6.5 ), which prohibits NPKL methods applicable to real applications, even for data sets of moderate size. In this paper, we present a family of efficient NPKL algorithms, termed “SimpleNPKL”, which can learn non-parametric kernels from a large set of pairwise constraints efficiently. In particular, we propose two efficient SimpleNPKL algorithms. One is SimpleNPKL algorithm with linear loss, which enjoys a closed-form solution that can be efficiently computed by the Lanczos sparse eigen decomposition technique. Another one is SimpleNPKL algorithm with other loss functions (including square hinge loss, hinge loss, square loss) that can be re-formulated as a saddle-point optimization problem, which can be further resolved by a fast iterative algorithm. In contrast to the previous NPKL approaches, our empirical results show that the proposed new technique, maintaining the same accuracy, is significantly more efficient and scalable. Finally, we also demonstrate that the proposed new technique is also applicable to speed up many kernel learning tasks, including colored maximum variance unfolding, minimum volume embedding, and structure preserving embedding. Keywords: non-parametric kernel learning, semi-definite programming, semi-supervised learning, side information, pairwise constraints
5 0.34644657 55 jmlr-2011-Learning Multi-modal Similarity
Author: Brian McFee, Gert Lanckriet
Abstract: In many applications involving multi-media data, the definition of similarity between items is integral to several key tasks, including nearest-neighbor retrieval, classification, and recommendation. Data in such regimes typically exhibits multiple modalities, such as acoustic and visual content of video. Integrating such heterogeneous data to form a holistic similarity space is therefore a key challenge to be overcome in many real-world applications. We present a novel multiple kernel learning technique for integrating heterogeneous data into a single, unified similarity space. Our algorithm learns an optimal ensemble of kernel transformations which conform to measurements of human perceptual similarity, as expressed by relative comparisons. To cope with the ubiquitous problems of subjectivity and inconsistency in multimedia similarity, we develop graph-based techniques to filter similarity measurements, resulting in a simplified and robust training procedure. Keywords: multiple kernel learning, metric learning, similarity
6 0.33963639 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels
7 0.28218806 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
8 0.26772702 48 jmlr-2011-Kernel Analysis of Deep Networks
9 0.26030284 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors
10 0.2366723 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
11 0.22578888 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods
12 0.21111223 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
13 0.20977594 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
14 0.19983822 31 jmlr-2011-Efficient and Effective Visual Codebook Generation Using Additive Kernels
15 0.19451949 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
16 0.18369974 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership
17 0.17865708 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
18 0.1714929 98 jmlr-2011-Universality, Characteristic Kernels and RKHS Embedding of Measures
19 0.17011558 51 jmlr-2011-Laplacian Support Vector Machines Trained in the Primal
20 0.16937982 91 jmlr-2011-The Sample Complexity of Dictionary Learning
topicId topicWeight
[(4, 0.036), (6, 0.014), (9, 0.446), (10, 0.031), (24, 0.041), (31, 0.102), (32, 0.03), (41, 0.018), (60, 0.02), (66, 0.011), (71, 0.012), (73, 0.059), (78, 0.063), (90, 0.02)]
simIndex simValue paperId paperTitle
1 0.93261516 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
Author: Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, Francis Bach
Abstract: Sparse coding consists in representing signals as sparse linear combinations of atoms selected from a dictionary. We consider an extension of this framework where the atoms are further assumed to be embedded in a tree. This is achieved using a recently introduced tree-structured sparse regularization norm, which has proven useful in several applications. This norm leads to regularized problems that are difficult to optimize, and in this paper, we propose efficient algorithms for solving them. More precisely, we show that the proximal operator associated with this norm is computable exactly via a dual approach that can be viewed as the composition of elementary proximal operators. Our procedure has a complexity linear, or close to linear, in the number of atoms, and allows the use of accelerated gradient techniques to solve the tree-structured sparse approximation problem at the same computational cost as traditional ones using the ℓ1 -norm. Our method is efficient and scales gracefully to millions of variables, which we illustrate in two types of applications: first, we consider fixed hierarchical dictionaries of wavelets to denoise natural images. Then, we apply our optimization tools in the context of dictionary learning, where learned dictionary elements naturally self-organize in a prespecified arborescent structure, leading to better performance in reconstruction of natural image patches. When applied to text documents, our method learns hierarchies of topics, thus providing a competitive alternative to probabilistic topic models. Keywords: Proximal methods, dictionary learning, structured sparsity, matrix factorization
same-paper 2 0.88212729 105 jmlr-2011-lp-Norm Multiple Kernel Learning
Author: Marius Kloft, Ulf Brefeld, Sören Sonnenburg, Alexander Zien
Abstract: Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability and scalability. Unfortunately, this ℓ1 -norm MKL is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures that generalize well, we extend MKL to arbitrary norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary norms, that is ℓ p -norms with p ≥ 1. This interleaved optimization is much faster than the commonly used wrapper approaches, as demonstrated on several data sets. A theoretical analysis and an experiment on controlled artificial data shed light on the appropriateness of sparse, non-sparse and ℓ∞ -norm MKL in various scenarios. Importantly, empirical applications of ℓ p -norm MKL to three real-world problems from computational biology show that non-sparse MKL achieves accuracies that surpass the state-of-the-art. Data sets, source code to reproduce the experiments, implementations of the algorithms, and further information are available at http://doc.ml.tu-berlin.de/nonsparse_mkl/. Keywords: multiple kernel learning, learning kernels, non-sparse, support vector machine, convex conjugate, block coordinate descent, large scale optimization, bioinformatics, generalization bounds, Rademacher complexity ∗. Also at Machine Learning Group, Technische Universit¨ t Berlin, 10587 Berlin, Germany. a †. Parts of this work were done while SS was at the Friedrich Miescher Laboratory, Max Planck Society, 72076 T¨ bingen, Germany. u ‡. Most contributions by AZ were done at the Fraunhofer Institute FIRST, 12489 Berlin, Germany. c 2011 Marius Kloft, Ulf Brefeld, S¨ ren Sonnenburg and Alexander Zien. o K LOFT, B REFELD , S ONNENBURG AND Z IEN
3 0.83088189 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity
Author: Julien Mairal, Rodolphe Jenatton, Guillaume Obozinski, Francis Bach
Abstract: We consider a class of learning problems regularized by a structured sparsity-inducing norm defined as the sum of ℓ2 - or ℓ∞ -norms over groups of variables. Whereas much effort has been put in developing fast optimization techniques when the groups are disjoint or embedded in a hierarchy, we address here the case of general overlapping groups. To this end, we present two different strategies: On the one hand, we show that the proximal operator associated with a sum of ℓ∞ norms can be computed exactly in polynomial time by solving a quadratic min-cost flow problem, allowing the use of accelerated proximal gradient methods. On the other hand, we use proximal splitting techniques, and address an equivalent formulation with non-overlapping groups, but in higher dimension and with additional constraints. We propose efficient and scalable algorithms exploiting these two strategies, which are significantly faster than alternative approaches. We illustrate these methods with several problems such as CUR matrix factorization, multi-task learning of tree-structured dictionaries, background subtraction in video sequences, image denoising with wavelets, and topographic dictionary learning of natural image patches. Keywords: convex optimization, proximal methods, sparse coding, structured sparsity, matrix factorization, network flow optimization, alternating direction method of multipliers
4 0.67631775 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
Author: Rodolphe Jenatton, Jean-Yves Audibert, Francis Bach
Abstract: We consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms. These are defined as sums of Euclidean norms on certain subsets of variables, extending the usual ℓ1 -norm and the group ℓ1 -norm by allowing the subsets to overlap. This leads to a specific set of allowed nonzero patterns for the solutions of such problems. We first explore the relationship between the groups defining the norm and the resulting nonzero patterns, providing both forward and backward algorithms to go back and forth from groups to patterns. This allows the design of norms adapted to specific prior knowledge expressed in terms of nonzero patterns. We also present an efficient active set algorithm, and analyze the consistency of variable selection for least-squares linear regression in low and high-dimensional settings. Keywords: sparsity, consistency, variable selection, convex optimization, active set algorithm
5 0.60925484 66 jmlr-2011-Multiple Kernel Learning Algorithms
Author: Mehmet Gönen, Ethem Alpaydın
Abstract: In recent years, several methods have been proposed to combine multiple kernels instead of using a single one. These different kernels may correspond to using different notions of similarity or may be using information coming from multiple sources (different representations or different feature subsets). In trying to organize and highlight the similarities and differences between them, we give a taxonomy of and review several multiple kernel learning algorithms. We perform experiments on real data sets for better illustration and comparison of existing algorithms. We see that though there may not be large differences in terms of accuracy, there is difference between them in complexity as given by the number of stored support vectors, the sparsity of the solution as given by the number of used kernels, and training time complexity. We see that overall, using multiple kernels instead of a single one is useful and believe that combining kernels in a nonlinear or data-dependent way seems more promising than linear combination in fusing information provided by simple linear kernels, whereas linear methods are more reasonable when combining complex Gaussian kernels. Keywords: support vector machines, kernel machines, multiple kernel learning
6 0.55344951 62 jmlr-2011-MSVMpack: A Multi-Class Support Vector Machine Package
7 0.52729779 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation
8 0.52321744 91 jmlr-2011-The Sample Complexity of Dictionary Learning
9 0.51924354 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
10 0.51884276 59 jmlr-2011-Learning with Structured Sparsity
11 0.51760375 48 jmlr-2011-Kernel Analysis of Deep Networks
12 0.49835926 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
13 0.49797195 51 jmlr-2011-Laplacian Support Vector Machines Trained in the Primal
14 0.48339206 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
15 0.46406981 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
16 0.46073532 10 jmlr-2011-Anechoic Blind Source Separation Using Wigner Marginals
17 0.46013689 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership
18 0.46011442 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
19 0.45946217 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
20 0.45687738 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem