nips nips2011 nips2011-286 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Marius Kloft, Gilles Blanchard
Abstract: We derive an upper bound on the local Rademacher complexity of ℓp-norm multiple kernel learning, which yields a tighter excess risk bound than global approaches. Previous local approaches analyzed the case p = 1 only, while our analysis covers all cases 1 ≤ p ≤ ∞, assuming the different feature mappings corresponding to the different kernels to be uncorrelated. We also prove a lower bound that matches the upper bound, showing that our bound is tight, and derive consequences regarding excess loss, namely fast convergence rates of the order O(n^{-α/(1+α)}), where α is the minimum eigenvalue decay rate of the individual kernels.
Reference: text
sentIndex sentText sentNum sentScore
1 We derive an upper bound on the local Rademacher complexity of ℓp-norm multiple kernel learning, which yields a tighter excess risk bound than global approaches. [sent-5, score-0.569]
2 Previous local approaches analyzed the case p = 1 only while our analysis covers all cases 1 ≤ p ≤ ∞, assuming the different feature mappings corresponding to the different kernels to be uncorrelated. [sent-6, score-0.111]
3 We also prove a lower bound that matches the upper bound, showing that our bound is tight, and derive consequences regarding excess loss, namely fast convergence rates of the order O(n^{-α/(1+α)}), where α is the minimum eigenvalue decay rate of the individual kernels. [sent-7, score-0.261]
4 Nevertheless, after more than a decade of research it still remains an unsolved problem to find the best abstraction or kernel for a problem at hand. [sent-9, score-0.131]
5 Most frequently, the kernel is selected from a candidate set according to its generalization performance on a validation set. [sent-10, score-0.131]
6 Clearly, the performance of such an algorithm is limited by the best kernel in the set. [sent-11, score-0.131]
7 Unfortunately, in the current state of research, there is little hope that in the near future a machine will be able to automatically find—or even engineer—the best kernel for a particular problem at hand [25]. [sent-12, score-0.131]
8 However, by restricting to a less general problem, can we hope to achieve the automatic kernel selection? [sent-13, score-0.131]
9 In [18] it was shown that learning a support vector machine (SVM) [9] and a convex kernel combination at the same time is computationally feasible. [sent-15, score-0.131]
10 Subsequent studies (e.g., [10]) revealed further negative evidence and culminated in the provocative question "Can learning kernels help performance? [sent-22, score-0.096]
11 ” posed by Corinna Cortes in an invited talk at ICML 2009 [5]. [sent-23, score-0.058]
12 A first step towards a model of kernel learning (footnote: Marius Kloft is also with the Friedrich Miescher Laboratory, Max Planck Society, Tübingen) [sent-25, score-0.131]
13 Figure 1 (legend entries include 1.33-norm MKL, SVM, and the kernels C, H, P, Z, S, V, L1, L4, L14, L30, SW1, SW2): MKL experiment in terms of accuracy (LEFT) and kernel weights output by MKL (RIGHT). [sent-47, score-0.131]
14 that is useful for practical applications was made in [7, 13, 14]: by imposing an ℓq-norm penalty (q > 1) rather than an ℓ1-norm one on the kernel combination coefficients. [sent-48, score-0.151]
15 This ℓq-norm MKL is an empirical minimization algorithm that operates on the multi-kernel class consisting of functions f : x ↦ ⟨w, φ_k(x)⟩ with ‖w‖_k ≤ D, where φ_k is the kernel mapping into the reproducing kernel Hilbert space (RKHS) H_k with kernel k and norm ‖·‖_k. [sent-49, score-0.445]
16 The kernel k itself ranges over the set of possible kernels { k = Σ_{m=1}^M θ_m k_m : ‖θ‖_q ≤ 1, θ ≥ 0 }. [sent-50, score-0.274]
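For concreteness, the following minimal sketch (not code from the paper; the Gram matrices and weights are made up) forms the combined kernel k = Σ_m θ_m k_m for a weight vector θ restricted to the ℓq ball:

```python
import numpy as np

def combine_kernels(K_list, theta, q=2.0):
    """Combine base Gram matrices K_m into K = sum_m theta_m K_m,
    after projecting theta onto the feasible set {theta >= 0, ||theta||_q <= 1}."""
    theta = np.maximum(np.asarray(theta, dtype=float), 0.0)   # enforce theta >= 0
    norm_q = np.sum(theta ** q) ** (1.0 / q)
    if norm_q > 1.0:                                          # enforce ||theta||_q <= 1
        theta = theta / norm_q
    return sum(t * K for t, K in zip(theta, K_list)), theta

# toy example: M = 3 random PSD Gram matrices on n = 5 points
rng = np.random.default_rng(0)
Ks = [(lambda A: A @ A.T)(rng.standard_normal((5, 3))) for _ in range(3)]
K, theta = combine_kernels(Ks, theta=[0.7, 0.5, 0.3], q=2.0)
```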
17 A conceptual milestone going back to the work of [1] and [20] is that this multi-kernel class can equivalently be represented as a block-norm regularized linear class in the product RKHS: H_{p,D,M} = { f_w : x ↦ ⟨w, φ(x)⟩ : w = (w^{(1)}, …, w^{(M)}), ‖w‖_{2,p} ≤ D }   (1). [sent-51, score-0.541]
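The ‖·‖_{2,p} block norm used in (1) takes the ℓ2 norm of each block and then the ℓp norm of the results. A small illustrative sketch (the blocks and the radius D = 4 are made up) of checking membership in H_{p,D,M}:

```python
import numpy as np

def block_norm_2p(blocks, p):
    """l_{2,p} norm: l2 norm of each block w^(m), then the l_p norm of those values."""
    block_l2 = np.array([np.linalg.norm(b) for b in blocks])
    if np.isinf(p):
        return block_l2.max()
    return np.sum(block_l2 ** p) ** (1.0 / p)

w_blocks = [np.array([1.0, 2.0]), np.array([0.5]), np.array([3.0, 0.0, 1.0])]
for p in (1.0, 1.5, 2.0, np.inf):
    norm = block_norm_2p(w_blocks, p)
    print(p, round(norm, 3), norm <= 4.0)  # is w inside H_{p, D=4, M=3}?
```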
18 In Figure 1, we show exemplary results of an ℓp-norm MKL experiment, achieved on the protein fold prediction dataset used in [4] (see supplementary material A for experimental details). [sent-55, score-0.056]
19 We first observe that, as expected, ℓp-norm MKL enforces strong sparsity in the coefficients θ_m when p = 1, no sparsity at all for large p, and various degrees of soft sparsity for intermediate p. [sent-56, score-0.188]
20 Crucially, the performance (as measured by the test error) is not monotonic as a function of p; p = 1 (sparse MKL) yields the same performance as the regular SVM using a uniform kernel combination, but optimal performance is attained for some intermediate value of p (cf. Figure 1). [sent-57, score-0.223]
21 Clearly, the complexity of (1) will be greater than that of a class based on a single kernel only. [sent-60, score-0.174]
22 To this end, the main aim of this paper is to analyze the sample complexity of the hypothesis class (1). [sent-62, score-0.064]
23 In the present work, we base our main analysis on the theory of local Rademacher complexities, which allows us to derive improved and more precise rates of convergence that cover the whole range of p ∈ [1, ∞]. [sent-64, score-0.057]
24 • A lower bound is shown that, up to absolute constants, matches the upper bounds, showing that our results are tight. [sent-67, score-0.112]
25 • The generalization performance of ℓp-norm MKL as guaranteed by the excess risk bound is studied for varying values of p, shedding light on the appropriateness of a small/large p in various learning scenarios. [sent-68, score-0.218]
26 Furthermore, we also present a simpler, more general proof of the global Rademacher bound shown in [8] (at the expense of a slightly worse constant). [sent-69, score-0.156]
27 A comparison of the rates obtained with local and global Rademacher analysis is carried out in Section 3. [sent-70, score-0.088]
28 We denote the (normalized) kernel matrices corresponding to k and k_m by K and K_m, respectively. [sent-73, score-0.197]
29 We work with operators in Hilbert spaces and use, instead of the usual vector/matrix notation φ(x)φ(x)⊤, the tensor notation φ(x) ⊗ φ(x) ∈ HS(H), which is a Hilbert–Schmidt operator H → H defined as (φ(x) ⊗ φ(x))u = ⟨φ(x), u⟩ φ(x). [sent-90, score-0.048]
30 We denote by J = E φ(x) ⊗ φ(x) and J_m = E φ_m(x) ⊗ φ_m(x) the uncentered covariance operators corresponding to the variables φ(x) and φ_m(x), respectively; it holds that tr(J) = E ‖φ(x)‖₂² and tr(J_m) = E ‖φ_m(x)‖₂². [sent-92, score-0.095]
31 Global Rademacher Complexities. We first review global Rademacher complexities (GRC) in multiple kernel learning. [sent-93, score-0.203]
32 The global Rademacher complexity is defined as R(H_p) = E sup_{f_w ∈ H_p} ⟨w, (1/n) Σ_{i=1}^n σ_i φ(x_i)⟩, where (σ_i)_{1≤i≤n} is an i.i.d. sequence of Rademacher variables. [sent-101, score-0.124]
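Since sup_{‖w‖_{2,p} ≤ D} ⟨w, v⟩ = D ‖v‖_{2,p*} by Hölder duality, and ‖(1/n) Σ_i σ_i φ_m(x_i)‖₂² = σᵀK_m σ / n² for unnormalized Gram matrices K_m, the empirical version of this quantity can be approximated by Monte Carlo over the Rademacher signs. The sketch below is illustrative only (its function name and defaults are not from the paper):

```python
import numpy as np

def empirical_grc(K_list, D, p, n_mc=2000, rng=None):
    """Monte Carlo estimate of the empirical global Rademacher complexity of H_p,
    using sup_{||w||_{2,p} <= D} <w, v> = D * ||v||_{2,p*} with v_m = (1/n) sum_i sigma_i phi_m(x_i)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = K_list[0].shape[0]
    p_star = np.inf if p == 1.0 else p / (p - 1.0)
    vals = []
    for _ in range(n_mc):
        sigma = rng.choice([-1.0, 1.0], size=n)
        block_sq = np.array([sigma @ K @ sigma for K in K_list]) / n**2   # ||v_m||_2^2
        block = np.sqrt(np.maximum(block_sq, 0.0))
        norm = block.max() if np.isinf(p_star) else np.sum(block ** p_star) ** (1.0 / p_star)
        vals.append(D * norm)
    return float(np.mean(vals))
```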
33 This bound is tight and improves a series of loose results that were given for p = 1 in the past (see [8] and references therein). [sent-110, score-0.085]
34 In fact, the above result can be extended to the whole range of p ∈ [1, ∞] (in the supplementary material we present a quite simple proof using c = 1): Proposition 1 (GLOBAL RADEMACHER COMPLEXITY BOUND). [sent-111, score-0.075]
35 For any p ≥ 1 the empirical version of the global Rademacher complexity of the multi-kernel class H_p can be bounded as R̂(H_p) ≤ min_{t ∈ [p,∞]} D ( (t*/n) ‖ ( tr(K_m)/n )_{m=1}^M ‖_{t*/2} )^{1/2}. [sent-112, score-0.131]
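A numerical reading of the right-hand side of Proposition 1 is sketched below. The exact constants and normalization follow the display as reconstructed above and should be treated as an assumption; the grid minimization is only an approximation of the min over t ∈ [p, ∞]:

```python
import numpy as np

def grc_bound(K_list, D, p, t_grid=None):
    """Evaluate min over t >= p of D * sqrt( (t*/n) * || (tr(K_m)/n)_m ||_{t*/2} )
    on a finite grid of t values (unnormalized Gram matrices K_m assumed)."""
    n = K_list[0].shape[0]
    traces = np.array([np.trace(K) / n for K in K_list])
    t_grid = t_grid if t_grid is not None else np.linspace(max(p, 1.0 + 1e-6), 16.0, 200)
    best = np.inf
    for t in t_grid:
        t_star = t / (t - 1.0)
        r = t_star / 2.0
        # for very large r, || . ||_r tends to the max norm; use that limit to avoid overflow
        norm = traces.max() if r >= 50 else np.sum(traces ** r) ** (1.0 / r)
        best = min(best, D * np.sqrt(t_star / n * norm))
    return best
```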
36 Interestingly, the above GRC bound is not monotonic in p and thus the minimum is not always attained for t := p. [sent-113, score-0.13]
37 We define the local Rademacher complexity (LRC) of H_p as R_r(H_p) = E sup_{f_w ∈ H_p : P f_w² ≤ r} ⟨w, (1/n) Σ_{i=1}^n σ_i φ(x_i)⟩, where P f_w² := E (f_w(x))². [sent-121, score-1.082]
38 We will also use the following assumption in the bounds for the case p ∈ [1, 2]: Assumption (U) (no-correlation). [sent-123, score-0.042]
39 For example, if X = R^M, the above means that the input variable x ∈ X has independent coordinates, and the kernels k_1, …, k_M each act on a different coordinate. [sent-129, score-0.077]
40 To state the bounds, note that the covariance operators admit discrete eigenvalue–eigenvector decompositions J = E φ(x) ⊗ φ(x) = Σ_{j=1}^∞ λ_j u_j ⊗ u_j and J_m = E φ_m(x) ⊗ φ_m(x) = Σ_{j=1}^∞ λ_j^{(m)} u_j^{(m)} ⊗ u_j^{(m)}, where (u_j)_{j≥1} and (u_j^{(m)})_{j≥1} form orthonormal bases of H and H_m, respectively. [sent-134, score-0.761]
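In practice the eigenvalues λ_j^{(m)} of J_m can be approximated by the spectrum of the normalized Gram matrix K_m/n, and the decay rate α can be fitted from that spectrum. This is a standard plug-in, shown here only as an illustration (the fitting range is arbitrary):

```python
import numpy as np

def empirical_spectra(K_list):
    """For each kernel, the sorted eigenvalues of K_m / n as plug-in
    estimates of the eigenvalues lambda_j^(m) of the covariance operator J_m."""
    n = K_list[0].shape[0]
    return [np.sort(np.linalg.eigvalsh(K / n))[::-1] for K in K_list]

def estimate_decay(eigs, j_min=2, j_max=None):
    """Rough estimate of the polynomial decay rate alpha in lambda_j ~ d * j^(-alpha),
    via a least-squares fit of log(lambda_j) against log(j)."""
    j_max = j_max or len(eigs)
    j = np.arange(j_min, j_max + 1)
    lam = np.maximum(np.asarray(eigs)[j_min - 1:j_max], 1e-12)
    slope, _ = np.polyfit(np.log(j), np.log(lam), 1)
    return -slope
```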
41 Assume that the kernels are uniformly bounded (‖k‖_∞ ≤ B < ∞) and that Assumption (U) holds. [sent-136, score-0.077]
42 Theorem 2 (LOCAL RADEMACHER COMPLEXITY BOUND, p ∈ [1, 2]). The local Rademacher complexity of the multi-kernel class H_p can be bounded for any p ∈ [1, 2] as [sent-137, score-0.134]
43 R_r(H_p) ≤ min_{t∈[p,2]} [ ( (16/n) ‖ ( Σ_{j=1}^∞ min{ r M^{1−2/t*}, c e D² t*² λ_j^{(m)} } )_{m=1}^M ‖_{t*/2} )^{1/2} + (√(Be) D M^{1/t*} t*) / n ]. Theorem 3 (LOCAL RADEMACHER COMPLEXITY BOUND, p ∈ [2, ∞]). [sent-138, score-0.048]
44 For any p ∈ [2, ∞], R_r(H_p) ≤ min_{t∈[p,∞]} ( (2/n) Σ_{j=1}^∞ min( r, D² M^{2/t*−1} λ_j ) )^{1/2}. [sent-139, score-0.036]
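Both theorems are driven by truncated spectral sums of the form Σ_j min(r, c·λ_j). The helper below evaluates that quantity and a grid reading of the Theorem 3 bound; the scaling D² M^{2/t*−1} follows the reconstruction above and should be treated as an assumption:

```python
import numpy as np

def truncated_spectral_sum(eigs, r, c=1.0):
    """sum_j min(r, c * lambda_j): large eigenvalues contribute r, the tail contributes c*lambda_j."""
    return float(np.sum(np.minimum(r, c * np.asarray(eigs))))

def lrc_bound_p_ge_2(eigs_J, r, D, M, p, n, t_grid=None):
    """Grid reading of the Theorem 3 bound (as reconstructed above):
    min over t >= p of sqrt( (2/n) * sum_j min(r, D^2 * M^(2/t* - 1) * lambda_j) ),
    where eigs_J are the eigenvalues of the whole-space covariance operator J."""
    t_grid = t_grid if t_grid is not None else np.linspace(max(p, 2.0), 32.0, 200)
    best = np.inf
    for t in t_grid:
        t_star = t / (t - 1.0)
        s = truncated_spectral_sum(eigs_J, r, c=D**2 * M**(2.0 / t_star - 1.0))
        best = min(best, np.sqrt(2.0 * s / n))
    return best
```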
45 j=1 It is interesting to compare the above bounds for the special case p = 2 with the ones of Bartlett et al. [sent-140, score-0.042]
46 The main term of the bound of Theorem 3 (taking t = p = 2) is then essentially determined by a term of the form ( (1/n) Σ_{m=1}^M Σ_{j=1}^∞ min(r, λ_j^{(m)}) )^{1/2}. [sent-142, score-0.085]
47 It is also interesting to study the case p = 1: by using t = (log M)* in Theorem 2, we obtain the bound R_r(H_1) ≤ ( (16/n) Σ_{m=1}^M Σ_{j=1}^∞ min{ rM, e³ D² (log M)² λ_j^{(m)} } )^{1/2} + ( √(B e³) D log M ) / n, for all M ≥ e². [sent-144, score-0.121]
48 The proof of Theorem 3 is straightforward and is given in the supplementary material C. [sent-146, score-0.075]
49 In order to exploit the no-correlation assumption, we will work in large parts of the proof with the centered class H̃_p = { f̃_w : x ↦ ⟨w, φ̃(x)⟩ : ‖w‖_{2,p} ≤ D }, wherein φ̃(x) := φ(x) − Eφ(x). [sent-151, score-1.11]
50 We have ‖Eφ(x)‖_{2,p*} = ( Σ_{m=1}^M ⟨Eφ_m(x), Eφ_m(x)⟩^{p*/2} )^{1/p*} ≤ (Jensen) ( Σ_{m=1}^M ( E⟨φ_m(x), φ_m(x)⟩ )^{p*/2} )^{1/p*} = ‖ ( tr(J_m) )_{m=1}^M ‖_{p*/2}^{1/2}   (3), so that we can express the complexity of the centered class in terms of the uncentered one as follows: R_r(H_p) ≤ E sup_{f_w∈H_p, P f_w²≤r} ⟨w, (1/n) Σ_{i=1}^n σ_i φ̃(x_i)⟩ + E sup_{f_w∈H_p, P f_w²≤r} ⟨w, (1/n) Σ_{i=1}^n σ_i Eφ(x)⟩. [sent-153, score-0.269]
51 Concerning the first term of the above upper bound, using (2) we have P f̃_w² ≤ P f_w², and thus E sup_{f_w∈H_p, P f_w²≤r} ⟨w, (1/n) Σ_{i=1}^n σ_i φ̃(x_i)⟩ ≤ E sup_{f_w∈H_p, P f̃_w²≤r} ⟨w, (1/n) Σ_{i=1}^n σ_i φ̃(x_i)⟩ = R_r(H̃_p). [sent-154, score-5.129]
52 i=1 Now to bound the second term, we write n √ 1 E sup w, σi Eφ(x) ≤ n sup w, Eφ(x) . [sent-155, score-0.197]
53 Now observe that we have ⟨w, Eφ(x)⟩ ≤ (Hölder) ‖w‖_{2,p} ‖Eφ(x)‖_{2,p*} ≤ (by (3)) ‖w‖_{2,p} ‖ ( tr(J_m) )_{m=1}^M ‖_{p*/2}^{1/2},   (4) [sent-156, score-2.528]
54 as well as ⟨w, Eφ(x)⟩ = E f_w(x) ≤ √(P f_w²). This shows that there is no loss in working with the centered class instead of the uncentered one. [sent-158, score-0.114]
55 First we note that, since the (centered) covariance operator Eφ_m(x) ⊗ φ_m(x) is also a self-adjoint Hilbert–Schmidt operator on H_m, there exists an eigendecomposition Eφ_m(x) ⊗ φ_m(x) = Σ_{j=1}^∞ λ_j^{(m)} u_j^{(m)} ⊗ u_j^{(m)}, wherein (u_j^{(m)})_{j≥1} is an orthogonal basis of H_m. [sent-161, score-0.423]
56 As a consequence, for all j and m, we have P f_w² = E(f_w(x))² = E( Σ_{m=1}^M ⟨w_m, φ_m(x)⟩ )² = Σ_{m=1}^M Σ_{j=1}^∞ λ_j^{(m)} ⟨w_m, u_j^{(m)}⟩²   (5), and E⟨ (1/n) Σ_{i=1}^n σ_i φ_m(x_i), u_j^{(m)} ⟩² = (1/n²) Σ_{i=1}^n E⟨ (φ_m(x_i) ⊗ φ_m(x_i)) u_j^{(m)}, u_j^{(m)} ⟩ = λ_j^{(m)} / n. [sent-163, score-1.048]
57 We can express the LRC in terms of the eigendecompositon as follows Rr (Hp ) = E sup 1 n w, 2 fw ∈Hp :P fw ≤r M C. [sent-168, score-1.054]
58 2 in the supplementary material) to further bound the right ≤ p∗ M m=1 n σi φm (xi ), uj p∗ 2 (m) 2 n i=1 1 j>hm n E (m) n i=1 1 j>hm n term in the above expression as E φm (xi ), uj 1 p∗ (m) uj M m=1 2,p∗ . [sent-175, score-0.662]
59 Note that for p ≥ 2 it holds that p*/2 ≤ 1, and thus it suffices to employ Jensen's inequality once again to move the expectation operator inside the inner term. [sent-176, score-0.066]
60 Thus, by the subadditivity of the root function, R_r(H_p) ≤ ( (r/n) Σ_{m=1}^M h_m )^{1/2} + ( (e p*² D²/n) ‖ ( Σ_{j=h_m+1}^∞ λ_j^{(m)} )_{m=1}^M ‖_{p*/2} )^{1/2} + ( √B M^{1/p*} + √(Be) D M^{1/p*} p* ) / n. [sent-180, score-0.779]
61 Thus we obtain the following preliminary bound: R_r(H_p) ≤ ( (r/n) Σ_{m=1}^M h_m )^{1/2} + ( (4 e p*² D²/n) ‖ ( Σ_{j=h_m+1}^∞ λ_j^{(m)} )_{m=1}^M ‖_{p*/2} )^{1/2} + ( √(Be) D M^{1/p*} p* ) / n,   (8) for all nonnegative integers h_m ≥ 0. [sent-183, score-1.28]
62 (11) Since the above holds for all nonnegative integers hm , the result follows, completing the proof. [sent-187, score-0.471]
63 Lower and Excess Risk Bounds. To investigate the tightness of the presented upper bounds on the LRC of H_p, we consider the case where φ_1(x), …, φ_M(x) are i.i.d. [sent-189, score-0.069]
64 This shows that the upper bounds of the previous section are tight. [sent-208, score-0.069]
65 As an application of our results to prediction problems such as classification or regression, we also bound the excess loss of empirical minimization, f̂ := argmin_f (1/n) Σ_{i=1}^n l(f(x_i), y_i), w.r.t. [sent-209, score-0.21]
66 a loss function l: P(l_f̂ − l_{f*}) := E l(f̂(x), y) − E l(f*(x), y), where f* := argmin_f E l(f(x), y). [sent-212, score-0.171]
67 We build on [2] to show the following excess risk bound under the assumption of algebraically decreasing eigenvalues of the kernel matrices, i.e. [sent-214, score-0.349]
68 ∃ d > 0, α > 1, ∀ m, j: λ_j^{(m)} ≤ d j^{−α} (proof shown in the supplementary material E): Theorem 5. [sent-216, score-0.056]
69 Let l be a Lipschitz continuous loss with constant L and assume there is a positive constant F such that ∀f ∈ F : P (f − f ∗ )2 ≤ F P (lf − lf ∗ ). [sent-218, score-0.138]
70 Then for all z > 0, with probability at least 1 − e^{−z}, the excess loss of the multi-kernel class H_p can be bounded for p ∈ [1, 2] as P(l_f̂ − l_{f*}) ≤ min_{t∈[p,2]} [ c(α, d, F) (D L t*)^{2/(1+α)} M^{1 + (2/(1+α))(1/t* − 1)} n^{−α/(1+α)} + ( 47 √B D L M^{1/t*} t* + (22 √B D L M^{1/t*} t* + 27 F) z ) / n ], where c(α, d, F) is an explicit constant. [sent-219, score-0.287]
71 n n 1 1 We see from the above bound that convergence can be almost as slow as O p∗ M p∗ n− 2 (if α ≈ 1 is small ) and almost as fast as O n−1 (if α is large). [sent-220, score-0.085]
72 3 Interpretation of Bounds. In this section, we discuss the rates of Theorem 5 obtained by the local analysis, that is, ∀t ∈ [p, 2]: P(l_f̂ − l_{f*}) = O( (t* D)^{2/(1+α)} M^{1 + (2/(1+α))(1/t* − 1)} n^{−α/(1+α)} )   (13). [sent-221, score-0.195]
73 On the other hand, the global Rademacher complexity directly leads to a bound of the form [8]: ∀t ∈ [p, 2]: P(l_f̂ − l_{f*}) = O( t* D M^{1/t*} n^{−1/2} )   (14). [sent-222, score-0.297]
74 Clearly, the rate obtained through local analysis is better in n since α > 1. [sent-224, score-0.054]
75 Regarding the rate in the number of kernels M and the radius D, a straightforward calculation shows that the local analysis improves over the global one whenever M^{1/p}/D = O(√n). [sent-225, score-0.162]
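Taking the two displayed orders at face value, one can compare them numerically; note that the local expression encodes the reconstruction of (13) above and is therefore an assumption of this sketch, not a statement from the paper:

```python
import numpy as np

def local_rate(n, M, D, alpha, t):
    # reading of (13) as reconstructed above -- treat the exponents as an assumption
    t_star = t / (t - 1.0)
    return (t_star * D) ** (2.0 / (1.0 + alpha)) \
        * M ** (1.0 + 2.0 / (1.0 + alpha) * (1.0 / t_star - 1.0)) \
        * n ** (-alpha / (1.0 + alpha))

def global_rate(n, M, D, t):
    # order of the global bound (14): t* D M^(1/t*) n^(-1/2)
    t_star = t / (t - 1.0)
    return t_star * D * M ** (1.0 / t_star) * n ** -0.5

M, D, alpha, t = 50, 1.0, 2.0, 2.0      # arbitrary illustration values
for n in (10**2, 10**3, 10**4, 10**5, 10**6):
    print(n, f"local~{local_rate(n, M, D, alpha, t):.3e}",
          f"global~{global_rate(n, M, D, t):.3e}")
```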
76 Second, if p ≤ (log M)*, the best choice in (13) and (14) is t = (log M)*, so that P(l_f̂ − l_{f*}) ≤ O( min{ M n^{−1}, min_{t∈[p,2]} t* D M^{1/t*} n^{−1/2} } )   (15), and the phase transition occurs for D log M = O(√n). [sent-229, score-0.228]
77 This situation is to be compared to the sharp analysis of the optimal convergence rate of convex aggregation of M functions obtained by [27] in the framework of squared error loss regression, which is shown to be O( min{ M/n, ( (1/n) log(M/√n) )^{1/2} } ). [sent-231, score-0.097]
78 This corresponds to the setting studied here with D = 1, p = 1, and α → ∞, and we see that in this case our bound recovers (up to log factors) this sharp bound and the related phase transition phenomenon. [sent-232, score-0.21]
79 Note that Assumption (U), which enters through (5) and of which a similar version was also used in [23], can be relaxed to a more general, RIP-like assumption as used in [16]; this comes at the expense of an additional factor in the bounds (details omitted here). [sent-234, score-0.063]
80 As a practical application of the presented bounds, we analyze the impact of the norm parameter p on the accuracy of ℓp-norm MKL in various learning scenarios, showing why an intermediate p often turns out to be optimal in practical applications. [sent-236, score-0.087]
81 As indicated in the introduction, there is empirical evidence that the performance of ℓp-norm MKL crucially depends on the choice of the norm parameter p (for example, cf. [sent-237, score-0.05]
82 Figure 1). [sent-238, score-0.255]
83 Figure 2: Illustration of the three analyzed learning scenarios (TOP), differing in the soft sparsity of the Bayes hypothesis w* (parametrized by β), and corresponding values of the bound factor ν_t as a function of p (BOTTOM). [sent-257, score-0.162]
84 A soft sparse (LEFT), an intermediate non-sparse (CENTER), and an almost uniform w* (RIGHT). [sent-258, score-0.092]
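The three scenarios differ only in how quickly the blocks of w* decay. The sketch below generates such vectors with a geometric parametrization by β (this parametrization is a guess at the construction, used purely for illustration) and shows how the block norms ‖w*‖_{2,p} behave across p:

```python
import numpy as np

def soft_sparse_w(M, beta, block_dim=3, rng=None):
    """Blocks w*_m with geometrically decaying norms ~ beta^m; beta near 1 means nearly uniform."""
    rng = rng if rng is not None else np.random.default_rng(0)
    return [beta ** m * rng.standard_normal(block_dim) for m in range(M)]

def norm_2p(blocks, p):
    b = np.array([np.linalg.norm(x) for x in blocks])
    return b.max() if np.isinf(p) else np.sum(b ** p) ** (1.0 / p)

M = 30
for beta in (0.6, 0.85, 0.98):          # soft sparse -> intermediate -> almost uniform
    w = soft_sparse_w(M, beta)
    print(beta, [round(norm_2p(w, p), 2) for p in (1.0, 1.33, 2.0)])
```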
85 To start with, first note that the choice of p only affects the excess risk bound in the factor (cf. [sent-261, score-0.218]
86 Plugging in this value for Dp , the bound factor νp becomes νp := w∗ 2 1+α p 2 2 min t∗ 1+α M 1+ 1+α 1 t∗ −1 . [sent-270, score-0.121]
87 Note that the soft sparsity of w∗ is increased from the left hand to the right hand side. [sent-275, score-0.077]
88 2, while for the intermediate case (CENTER) p = 1. [sent-277, score-0.047]
89 This means that if the true Bayes hypothesis has an intermediately dense representation (which is frequently encountered in practical applications), our bound gives the strongest generalization guarantees to p -norm MKL using an intermediate choice of p. [sent-281, score-0.173]
90 4 Conclusion. We derived a sharp upper bound on the local Rademacher complexity of ℓp-norm multiple kernel learning. [sent-282, score-0.342]
91 We also proved a lower bound that matches the upper one and shows that our result is tight. [sent-283, score-0.112]
92 Using the local Rademacher complexity bound, we derived an excess risk bound that attains the fast rate of O(n^{−α/(1+α)}), where α is the minimum eigenvalue decay rate of the individual kernels. [sent-284, score-0.383]
93 In a practical case study, we found that the optimal value of that bound depends on the true Bayes-optimal kernel weights. [sent-285, score-0.236]
94 If the true weights exhibit soft sparsity but are not strongly sparse, then the generalization bound is minimized for an intermediate p. [sent-286, score-0.209]
95 This is not only intuitive but also supports empirical studies showing that sparse MKL (p = 1) rarely works in practice, while some intermediate choice of p can improve performance. [sent-287, score-0.047]
96 Multiple kernel learning, conic duality, and the SMO algorithm. [sent-303, score-0.131]
97 Let the kernel figure it out: Principled learning of pre-processing for kernel classifiers. [sent-373, score-0.262]
98 The best constant in the Rosenthal inequality for nonnegative random variables. [sent-378, score-0.104]
99 Local Rademacher complexities and oracle inequalities in risk minimization. [sent-412, score-0.082]
100 Minimax-optimal rates for sparse additive models over kernel classes via convex programming. [sent-472, score-0.154]
wordName wordTfidf (topN-words)
[('fw', 0.499), ('hp', 0.456), ('hm', 0.375), ('mkl', 0.298), ('uj', 0.183), ('rademacher', 0.169), ('rr', 0.141), ('lf', 0.138), ('kernel', 0.131), ('jm', 0.106), ('excess', 0.092), ('bound', 0.085), ('efw', 0.083), ('kloft', 0.081), ('kernels', 0.077), ('tep', 0.073), ('bedm', 0.067), ('lrc', 0.067), ('km', 0.066), ('cortes', 0.057), ('sup', 0.056), ('tr', 0.054), ('centered', 0.053), ('ep', 0.053), ('ademacher', 0.05), ('eft', 0.05), ('marius', 0.05), ('supfw', 0.05), ('intermediate', 0.047), ('soft', 0.045), ('ight', 0.044), ('rosenthal', 0.044), ('complexity', 0.043), ('bounds', 0.042), ('complexities', 0.041), ('risk', 0.041), ('rm', 0.04), ('uncentered', 0.04), ('lanckriet', 0.04), ('nonnegative', 0.039), ('xi', 0.039), ('sonnenburg', 0.038), ('jensen', 0.037), ('min', 0.036), ('local', 0.034), ('bartlett', 0.034), ('argminf', 0.033), ('elating', 0.033), ('grc', 0.033), ('invited', 0.033), ('ocal', 0.033), ('ounding', 0.033), ('ller', 0.033), ('lder', 0.033), ('sch', 0.032), ('sparsity', 0.032), ('norm', 0.031), ('hs', 0.031), ('integers', 0.031), ('global', 0.031), ('mohri', 0.03), ('subadditivity', 0.029), ('aq', 0.029), ('spec', 0.029), ('operators', 0.029), ('hilbert', 0.028), ('material', 0.028), ('theorem', 0.028), ('supplementary', 0.028), ('svm', 0.027), ('upper', 0.027), ('blanchard', 0.027), ('brefeld', 0.027), ('holds', 0.026), ('talk', 0.025), ('spectra', 0.025), ('gilles', 0.025), ('eigenvalue', 0.025), ('attained', 0.024), ('decay', 0.023), ('tsch', 0.023), ('rates', 0.023), ('dp', 0.023), ('sharp', 0.022), ('monotonic', 0.021), ('bm', 0.021), ('class', 0.021), ('frequently', 0.021), ('expense', 0.021), ('inequality', 0.021), ('rate', 0.02), ('practical', 0.02), ('wherein', 0.019), ('proof', 0.019), ('operator', 0.019), ('aggregation', 0.019), ('rkhs', 0.019), ('evidence', 0.019), ('phase', 0.018), ('enter', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 286 nips-2011-The Local Rademacher Complexity of Lp-Norm Multiple Kernel Learning
Author: Marius Kloft, Gilles Blanchard
Abstract: We derive an upper bound on the local Rademacher complexity of ℓp-norm multiple kernel learning, which yields a tighter excess risk bound than global approaches. Previous local approaches analyzed the case p = 1 only, while our analysis covers all cases 1 ≤ p ≤ ∞, assuming the different feature mappings corresponding to the different kernels to be uncorrelated. We also prove a lower bound that matches the upper bound, showing that our bound is tight, and derive consequences regarding excess loss, namely fast convergence rates of the order O(n^{-α/(1+α)}), where α is the minimum eigenvalue decay rate of the individual kernels.
2 0.32804143 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning
Author: Taiji Suzuki
Abstract: In this paper, we give a new generalization error bound of Multiple Kernel Learning (MKL) for a general class of regularizations. Our main target in this paper is dense type regularizations including ℓp -MKL that imposes ℓp -mixed-norm regularization instead of ℓ1 -mixed-norm regularization. According to the recent numerical experiments, the sparse regularization does not necessarily show a good performance compared with dense type regularizations. Motivated by this fact, this paper gives a general theoretical tool to derive fast learning rates that is applicable to arbitrary mixed-norm-type regularizations in a unifying manner. As a by-product of our general result, we show a fast learning rate of ℓp -MKL that is tightest among existing bounds. We also show that our general learning rate achieves the minimax lower bound. Finally, we show that, when the complexities of candidate reproducing kernel Hilbert spaces are inhomogeneous, dense type regularization shows better learning rate compared with sparse ℓ1 regularization. 1
3 0.2284397 189 nips-2011-Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels
Author: Vikas Sindhwani, Aurelie C. Lozano
Abstract: We consider regularized risk minimization in a large dictionary of Reproducing kernel Hilbert Spaces (RKHSs) over which the target function has a sparse representation. This setting, commonly referred to as Sparse Multiple Kernel Learning (MKL), may be viewed as the non-parametric extension of group sparsity in linear models. While the two dominant algorithmic strands of sparse learning, namely convex relaxations using l1 norm (e.g., Lasso) and greedy methods (e.g., OMP), have both been rigorously extended for group sparsity, the sparse MKL literature has so far mainly adopted the former with mild empirical success. In this paper, we close this gap by proposing a Group-OMP based framework for sparse MKL. Unlike l1 -MKL, our approach decouples the sparsity regularizer (via a direct l0 constraint) from the smoothness regularizer (via RKHS norms), which leads to better empirical performance and a simpler optimization procedure that only requires a black-box single-kernel solver. The algorithmic development and empirical studies are complemented by theoretical analyses in terms of Rademacher generalization bounds and sparse recovery conditions analogous to those for OMP [27] and Group-OMP [16]. 1
4 0.11776754 233 nips-2011-Rapid Deformable Object Detection using Dual-Tree Branch-and-Bound
Author: Iasonas Kokkinos
Abstract: In this work we use Branch-and-Bound (BB) to efficiently detect objects with deformable part models. Instead of evaluating the classifier score exhaustively over image locations and scales, we use BB to focus on promising image locations. The core problem is to compute bounds that accommodate part deformations; for this we adapt the Dual Trees data structure [7] to our problem. We evaluate our approach using Mixture-of-Deformable Part Models [4]. We obtain exactly the same results but are 10-20 times faster on average. We also develop a multiple-object detection variation of the system, where hypotheses for 20 categories are inserted in a common priority queue. For the problem of finding the strongest category in an image this results in a 100-fold speedup.
5 0.10455104 171 nips-2011-Metric Learning with Multiple Kernels
Author: Jun Wang, Huyen T. Do, Adam Woznica, Alexandros Kalousis
Abstract: Metric learning has become a very active research field. The most popular representative–Mahalanobis metric learning–can be seen as learning a linear transformation and then computing the Euclidean metric in the transformed space. Since a linear transformation might not always be appropriate for a given learning problem, kernelized versions of various metric learning algorithms exist. However, the problem then becomes finding the appropriate kernel function. Multiple kernel learning addresses this limitation by learning a linear combination of a number of predefined kernels; this approach can be also readily used in the context of multiple-source learning to fuse different data sources. Surprisingly, and despite the extensive work on multiple kernel learning for SVMs, there has been no work in the area of metric learning with multiple kernel learning. In this paper we fill this gap and present a general approach for metric learning with multiple kernel learning. Our approach can be instantiated with different metric learning algorithms provided that they satisfy some constraints. Experimental evidence suggests that our approach outperforms metric learning with an unweighted kernel combination and metric learning with cross-validation based kernel selection. 1
6 0.084526099 198 nips-2011-On U-processes and clustering performance
7 0.063565947 288 nips-2011-Thinning Measurement Models and Questionnaire Design
8 0.063459143 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
9 0.062291425 284 nips-2011-The Impact of Unlabeled Patterns in Rademacher Complexity Theory for Kernel Classifiers
10 0.061237551 31 nips-2011-An Application of Tree-Structured Expectation Propagation for Channel Decoding
11 0.056812644 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
12 0.055410855 26 nips-2011-Additive Gaussian Processes
13 0.054051738 207 nips-2011-Optimal learning rates for least squares SVMs using Gaussian kernels
14 0.053585127 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
15 0.05329347 139 nips-2011-Kernel Bayes' Rule
16 0.048453379 12 nips-2011-A Two-Stage Weighting Framework for Multi-Source Domain Adaptation
17 0.048153669 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
18 0.048132118 162 nips-2011-Lower Bounds for Passive and Active Learning
19 0.044653263 145 nips-2011-Learning Eigenvectors for Free
20 0.044302151 152 nips-2011-Learning in Hilbert vs. Banach Spaces: A Measure Embedding Viewpoint
topicId topicWeight
[(0, 0.136), (1, -0.034), (2, -0.05), (3, -0.109), (4, -0.045), (5, 0.067), (6, 0.047), (7, -0.081), (8, 0.02), (9, 0.051), (10, -0.162), (11, 0.16), (12, 0.125), (13, 0.158), (14, -0.105), (15, 0.015), (16, 0.097), (17, 0.018), (18, -0.017), (19, 0.229), (20, -0.03), (21, -0.205), (22, -0.119), (23, -0.007), (24, 0.05), (25, -0.03), (26, -0.047), (27, -0.01), (28, 0.183), (29, 0.027), (30, -0.164), (31, -0.175), (32, -0.059), (33, 0.072), (34, -0.067), (35, 0.076), (36, -0.113), (37, 0.038), (38, 0.007), (39, 0.044), (40, -0.047), (41, 0.079), (42, 0.006), (43, -0.004), (44, 0.079), (45, -0.02), (46, 0.019), (47, 0.028), (48, 0.097), (49, -0.131)]
simIndex simValue paperId paperTitle
same-paper 1 0.94531029 286 nips-2011-The Local Rademacher Complexity of Lp-Norm Multiple Kernel Learning
Author: Marius Kloft, Gilles Blanchard
Abstract: We derive an upper bound on the local Rademacher complexity of ℓp-norm multiple kernel learning, which yields a tighter excess risk bound than global approaches. Previous local approaches analyzed the case p = 1 only, while our analysis covers all cases 1 ≤ p ≤ ∞, assuming the different feature mappings corresponding to the different kernels to be uncorrelated. We also prove a lower bound that matches the upper bound, showing that our bound is tight, and derive consequences regarding excess loss, namely fast convergence rates of the order O(n^{-α/(1+α)}), where α is the minimum eigenvalue decay rate of the individual kernels.
2 0.88941371 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning
Author: Taiji Suzuki
Abstract: In this paper, we give a new generalization error bound of Multiple Kernel Learning (MKL) for a general class of regularizations. Our main target in this paper is dense type regularizations including ℓp -MKL that imposes ℓp -mixed-norm regularization instead of ℓ1 -mixed-norm regularization. According to the recent numerical experiments, the sparse regularization does not necessarily show a good performance compared with dense type regularizations. Motivated by this fact, this paper gives a general theoretical tool to derive fast learning rates that is applicable to arbitrary mixed-norm-type regularizations in a unifying manner. As a by-product of our general result, we show a fast learning rate of ℓp -MKL that is tightest among existing bounds. We also show that our general learning rate achieves the minimax lower bound. Finally, we show that, when the complexities of candidate reproducing kernel Hilbert spaces are inhomogeneous, dense type regularization shows better learning rate compared with sparse ℓ1 regularization. 1
3 0.69598824 189 nips-2011-Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels
Author: Vikas Sindhwani, Aurelie C. Lozano
Abstract: We consider regularized risk minimization in a large dictionary of Reproducing kernel Hilbert Spaces (RKHSs) over which the target function has a sparse representation. This setting, commonly referred to as Sparse Multiple Kernel Learning (MKL), may be viewed as the non-parametric extension of group sparsity in linear models. While the two dominant algorithmic strands of sparse learning, namely convex relaxations using l1 norm (e.g., Lasso) and greedy methods (e.g., OMP), have both been rigorously extended for group sparsity, the sparse MKL literature has so far mainly adopted the former with mild empirical success. In this paper, we close this gap by proposing a Group-OMP based framework for sparse MKL. Unlike l1 -MKL, our approach decouples the sparsity regularizer (via a direct l0 constraint) from the smoothness regularizer (via RKHS norms), which leads to better empirical performance and a simpler optimization procedure that only requires a black-box single-kernel solver. The algorithmic development and empirical studies are complemented by theoretical analyses in terms of Rademacher generalization bounds and sparse recovery conditions analogous to those for OMP [27] and Group-OMP [16]. 1
4 0.42857859 171 nips-2011-Metric Learning with Multiple Kernels
Author: Jun Wang, Huyen T. Do, Adam Woznica, Alexandros Kalousis
Abstract: Metric learning has become a very active research field. The most popular representative–Mahalanobis metric learning–can be seen as learning a linear transformation and then computing the Euclidean metric in the transformed space. Since a linear transformation might not always be appropriate for a given learning problem, kernelized versions of various metric learning algorithms exist. However, the problem then becomes finding the appropriate kernel function. Multiple kernel learning addresses this limitation by learning a linear combination of a number of predefined kernels; this approach can be also readily used in the context of multiple-source learning to fuse different data sources. Surprisingly, and despite the extensive work on multiple kernel learning for SVMs, there has been no work in the area of metric learning with multiple kernel learning. In this paper we fill this gap and present a general approach for metric learning with multiple kernel learning. Our approach can be instantiated with different metric learning algorithms provided that they satisfy some constraints. Experimental evidence suggests that our approach outperforms metric learning with an unweighted kernel combination and metric learning with cross-validation based kernel selection. 1
5 0.42529973 152 nips-2011-Learning in Hilbert vs. Banach Spaces: A Measure Embedding Viewpoint
Author: Kenji Fukumizu, Gert R. Lanckriet, Bharath K. Sriperumbudur
Abstract: The goal of this paper is to investigate the advantages and disadvantages of learning in Banach spaces over Hilbert spaces. While many works have been carried out in generalizing Hilbert methods to Banach spaces, in this paper, we consider the simple problem of learning a Parzen window classifier in a reproducing kernel Banach space (RKBS)—which is closely related to the notion of embedding probability measures into an RKBS—in order to carefully understand its pros and cons over the Hilbert space classifier. We show that while this generalization yields richer distance measures on probabilities compared to its Hilbert space counterpart, it however suffers from serious computational drawback limiting its practical applicability, which therefore demonstrates the need for developing efficient learning algorithms in Banach spaces.
6 0.39549583 139 nips-2011-Kernel Bayes' Rule
7 0.36935413 284 nips-2011-The Impact of Unlabeled Patterns in Rademacher Complexity Theory for Kernel Classifiers
8 0.33564201 288 nips-2011-Thinning Measurement Models and Questionnaire Design
9 0.31352693 198 nips-2011-On U-processes and clustering performance
10 0.31205451 216 nips-2011-Portmanteau Vocabularies for Multi-Cue Image Representation
11 0.27855071 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
12 0.2766338 233 nips-2011-Rapid Deformable Object Detection using Dual-Tree Branch-and-Bound
13 0.27504674 26 nips-2011-Additive Gaussian Processes
14 0.25261921 238 nips-2011-Relative Density-Ratio Estimation for Robust Distribution Comparison
15 0.25252241 162 nips-2011-Lower Bounds for Passive and Active Learning
16 0.24588923 207 nips-2011-Optimal learning rates for least squares SVMs using Gaussian kernels
17 0.23909184 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
18 0.23824403 282 nips-2011-The Fast Convergence of Boosting
19 0.23061235 61 nips-2011-Contextual Gaussian Process Bandit Optimization
20 0.22613801 202 nips-2011-On the Universality of Online Mirror Descent
topicId topicWeight
[(0, 0.034), (4, 0.047), (17, 0.287), (20, 0.039), (26, 0.042), (31, 0.063), (33, 0.015), (43, 0.086), (45, 0.113), (48, 0.024), (57, 0.024), (65, 0.062), (74, 0.036), (83, 0.025), (99, 0.024)]
simIndex simValue paperId paperTitle
1 0.79384774 27 nips-2011-Advice Refinement in Knowledge-Based SVMs
Author: Gautam Kunapuli, Richard Maclin, Jude W. Shavlik
Abstract: Knowledge-based support vector machines (KBSVMs) incorporate advice from domain experts, which can improve generalization significantly. A major limitation that has not been fully addressed occurs when the expert advice is imperfect, which can lead to poorer models. We propose a model that extends KBSVMs and is able to not only learn from data and advice, but also simultaneously improves the advice. The proposed approach is particularly effective for knowledge discovery in domains with few labeled examples. The proposed model contains bilinear constraints, and is solved using two iterative approaches: successive linear programming and a constrained concave-convex approach. Experimental results demonstrate that these algorithms yield useful refinements to expert advice, as well as improve the performance of the learning algorithm overall.
same-paper 2 0.75348407 286 nips-2011-The Local Rademacher Complexity of Lp-Norm Multiple Kernel Learning
Author: Marius Kloft, Gilles Blanchard
Abstract: We derive an upper bound on the local Rademacher complexity of ℓp-norm multiple kernel learning, which yields a tighter excess risk bound than global approaches. Previous local approaches analyzed the case p = 1 only, while our analysis covers all cases 1 ≤ p ≤ ∞, assuming the different feature mappings corresponding to the different kernels to be uncorrelated. We also prove a lower bound that matches the upper bound, showing that our bound is tight, and derive consequences regarding excess loss, namely fast convergence rates of the order O(n^{-α/(1+α)}), where α is the minimum eigenvalue decay rate of the individual kernels.
3 0.67389423 276 nips-2011-Structured sparse coding via lateral inhibition
Author: Arthur D. Szlam, Karol Gregor, Yann L. Cun
Abstract: This work describes a conceptually simple method for structured sparse coding and dictionary design. Supposing a dictionary with K atoms, we introduce a structure as a set of penalties or interactions between every pair of atoms. We describe modifications of standard sparse coding algorithms for inference in this setting, and describe experiments showing that these algorithms are efficient. We show that interesting dictionaries can be learned for interactions that encode tree structures or locally connected structures. Finally, we show that our framework allows us to learn the values of the interactions from the data, rather than having them pre-specified. 1
4 0.5405398 189 nips-2011-Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels
Author: Vikas Sindhwani, Aurelie C. Lozano
Abstract: We consider regularized risk minimization in a large dictionary of Reproducing kernel Hilbert Spaces (RKHSs) over which the target function has a sparse representation. This setting, commonly referred to as Sparse Multiple Kernel Learning (MKL), may be viewed as the non-parametric extension of group sparsity in linear models. While the two dominant algorithmic strands of sparse learning, namely convex relaxations using l1 norm (e.g., Lasso) and greedy methods (e.g., OMP), have both been rigorously extended for group sparsity, the sparse MKL literature has so far mainly adopted the former with mild empirical success. In this paper, we close this gap by proposing a Group-OMP based framework for sparse MKL. Unlike l1 -MKL, our approach decouples the sparsity regularizer (via a direct l0 constraint) from the smoothness regularizer (via RKHS norms), which leads to better empirical performance and a simpler optimization procedure that only requires a black-box single-kernel solver. The algorithmic development and empirical studies are complemented by theoretical analyses in terms of Rademacher generalization bounds and sparse recovery conditions analogous to those for OMP [27] and Group-OMP [16]. 1
5 0.53127521 77 nips-2011-Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression
Author: Sham M. Kakade, Varun Kanade, Ohad Shamir, Adam Kalai
Abstract: Generalized Linear Models (GLMs) and Single Index Models (SIMs) provide powerful generalizations of linear regression, where the target variable is assumed to be a (possibly unknown) 1-dimensional function of a linear predictor. In general, these problems entail non-convex estimation procedures, and, in practice, iterative local search heuristics are often used. Kalai and Sastry (2009) provided the first provably efficient method, the Isotron algorithm, for learning SIMs and GLMs, under the assumption that the data is in fact generated under a GLM and under certain monotonicity and Lipschitz (bounded slope) constraints. The Isotron algorithm interleaves steps of perceptron-like updates with isotonic regression (fitting a one-dimensional non-decreasing function). However, to obtain provable performance, the method requires a fresh sample every iteration. In this paper, we provide algorithms for learning GLMs and SIMs, which are both computationally and statistically efficient. We modify the isotonic regression step in Isotron to fit a Lipschitz monotonic function, and also provide an efficient O(n log(n)) algorithm for this step, improving upon the previous O(n2 ) algorithm. We provide a brief empirical study, demonstrating the feasibility of our algorithms in practice. 1
6 0.53018838 244 nips-2011-Selecting Receptive Fields in Deep Networks
7 0.52363783 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
8 0.5205344 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning
9 0.5187574 22 nips-2011-Active Ranking using Pairwise Comparisons
10 0.51829642 265 nips-2011-Sparse recovery by thresholded non-negative least squares
11 0.5178104 198 nips-2011-On U-processes and clustering performance
12 0.5173642 124 nips-2011-ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning
13 0.51714158 106 nips-2011-Generalizing from Several Related Classification Tasks to a New Unlabeled Sample
14 0.51697624 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
15 0.51599509 80 nips-2011-Efficient Online Learning via Randomized Rounding
16 0.51567721 231 nips-2011-Randomized Algorithms for Comparison-based Search
17 0.51513082 258 nips-2011-Sparse Bayesian Multi-Task Learning
18 0.51434124 199 nips-2011-On fast approximate submodular minimization
19 0.51339281 239 nips-2011-Robust Lasso with missing and grossly corrupted observations
20 0.51337802 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning