nips nips2011 nips2011-286 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Marius Kloft, Gilles Blanchard
Abstract: We derive an upper bound on the local Rademacher complexity of ℓp-norm multiple kernel learning, which yields a tighter excess risk bound than global approaches. Previous local approaches analyzed the case p = 1 only, while our analysis covers all cases 1 ≤ p ≤ ∞, assuming the different feature mappings corresponding to the different kernels to be uncorrelated. We also prove a lower bound that matches the upper bound, showing that our bound is tight, and derive consequences regarding excess loss, namely fast convergence rates of the order O(n^{-α/(1+α)}), where α is the minimum eigenvalue decay rate of the individual kernels.
Reference: text
sentIndex sentText sentNum sentScore
1 We derive an upper bound on the local Rademacher complexity of ℓp-norm multiple kernel learning, which yields a tighter excess risk bound than global approaches. [sent-5, score-0.569]
2 Previous local approaches analyzed the case p = 1 only while our analysis covers all cases 1 ≤ p ≤ ∞, assuming the different feature mappings corresponding to the different kernels to be uncorrelated. [sent-6, score-0.111]
3 We also prove a lower bound that matches the upper bound, showing that our bound is tight, and derive consequences regarding excess loss, namely fast convergence rates of the order O(n^{-α/(1+α)}), where α is the minimum eigenvalue decay rate of the individual kernels. [sent-7, score-0.261]
4 Nevertheless, after more than a decade of research it still remains an unsolved problem to find the best abstraction or kernel for a problem at hand. [sent-9, score-0.131]
5 Most frequently, the kernel is selected from a candidate set according to its generalization performance on a validation set. [sent-10, score-0.131]
6 Clearly, the performance of such an algorithm is limited by the best kernel in the set. [sent-11, score-0.131]
7 Unfortunately, in the current state of research, there is little hope that in the near future a machine will be able to automatically find—or even engineer—the best kernel for a particular problem at hand [25]. [sent-12, score-0.131]
8 However, by restricting to a less general problem, can we hope to achieve the automatic kernel selection? [sent-13, score-0.131]
9 In [18] it was shown that learning a support vector machine (SVM) [9] and a convex kernel combination at the same time is computationally feasible. [sent-15, score-0.131]
10 Subsequent studies (e.g., [10]) revealed further negative evidence and culminated in the provocative question "Can learning kernels help performance? [sent-22, score-0.096]
11 ” posed by Corinna Cortes in an invited talk at ICML 2009 [5]. [sent-23, score-0.058]
12 A first step towards a model of kernel learning (footnote: Marius Kloft is also with the Friedrich Miescher Laboratory, Max Planck Society, Tübingen) [sent-25, score-0.131]
13 Figure 1 (legend entries include 1.33-norm MKL, SVM, and the kernels C, H, P, Z, S, V, L1, L4, L14, L30, SW1, SW2): MKL experiment in terms of accuracy (LEFT) and kernel weights output by MKL (RIGHT). [sent-47, score-0.131]
14 that is useful for practical applications was made in [7, 13, 14]: by imposing an ℓq-norm penalty (q > 1) rather than an ℓ1-norm one on the kernel combination coefficients. [sent-48, score-0.151]
15 This ℓq-norm MKL is an empirical minimization algorithm that operates on the multi-kernel class consisting of functions f : x ↦ ⟨w, φ_k(x)⟩ with ‖w‖_k ≤ D, where φ_k is the kernel mapping into the reproducing kernel Hilbert space (RKHS) H_k with kernel k and norm ‖·‖_k. [sent-49, score-0.445]
16 The kernel k itself ranges over the set of possible kernels { k = Σ_{m=1}^M θ_m k_m : ‖θ‖_q ≤ 1, θ ≥ 0 }. [sent-50, score-0.274]
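For concreteness, the following minimal sketch (not code from the paper; the Gram matrices and weights are made up) forms the combined kernel k = Σ_m θ_m k_m for a weight vector θ restricted to the ℓq ball:

```python
import numpy as np

def combine_kernels(K_list, theta, q=2.0):
    """Combine base Gram matrices K_m into K = sum_m theta_m K_m,
    after projecting theta onto the feasible set {theta >= 0, ||theta||_q <= 1}."""
    theta = np.maximum(np.asarray(theta, dtype=float), 0.0)   # enforce theta >= 0
    norm_q = np.sum(theta ** q) ** (1.0 / q)
    if norm_q > 1.0:                                          # enforce ||theta||_q <= 1
        theta = theta / norm_q
    return sum(t * K for t, K in zip(theta, K_list)), theta

# toy example: M = 3 random PSD Gram matrices on n = 5 points
rng = np.random.default_rng(0)
Ks = [(lambda A: A @ A.T)(rng.standard_normal((5, 3))) for _ in range(3)]
K, theta = combine_kernels(Ks, theta=[0.7, 0.5, 0.3], q=2.0)
```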
17 A conceptual milestone going back to the work of [1] and [20] is that this multi-kernel class can equivalently be represented as a block-norm regularized linear class in the product RKHS: H_{p,D,M} = { f_w : x ↦ ⟨w, φ(x)⟩ : w = (w^{(1)}, …, w^{(M)}), ‖w‖_{2,p} ≤ D }   (1). [sent-51, score-0.541]
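The ‖·‖_{2,p} block norm used in (1) takes the ℓ2 norm of each block and then the ℓp norm of the results. A small illustrative sketch (the blocks and the radius D = 4 are made up) of checking membership in H_{p,D,M}:

```python
import numpy as np

def block_norm_2p(blocks, p):
    """l_{2,p} norm: l2 norm of each block w^(m), then the l_p norm of those values."""
    block_l2 = np.array([np.linalg.norm(b) for b in blocks])
    if np.isinf(p):
        return block_l2.max()
    return np.sum(block_l2 ** p) ** (1.0 / p)

w_blocks = [np.array([1.0, 2.0]), np.array([0.5]), np.array([3.0, 0.0, 1.0])]
for p in (1.0, 1.5, 2.0, np.inf):
    norm = block_norm_2p(w_blocks, p)
    print(p, round(norm, 3), norm <= 4.0)  # is w inside H_{p, D=4, M=3}?
```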
18 In Figure 1, we show exemplary results of an ℓp-norm MKL experiment, achieved on the protein fold prediction dataset used in [4] (see supplementary material A for experimental details). [sent-55, score-0.056]
19 We first observe that, as expected, ℓp-norm MKL enforces strong sparsity in the coefficients θ_m when p = 1, no sparsity at all for large p, and various degrees of soft sparsity for intermediate p. [sent-56, score-0.188]
20 Crucially, the performance (as measured by the test error) is not monotonic as a function of p; p = 1 (sparse MKL) yields the same performance as the regular SVM using a uniform kernel combination, but optimal performance is attained for some intermediate value of p (cf. Figure 1). [sent-57, score-0.223]
21 Clearly, the complexity of (1) will be greater than that of a class based on a single kernel only. [sent-60, score-0.174]
22 To this end, the main aim of this paper is to analyze the sample complexity of the hypothesis class (1). [sent-62, score-0.064]
23 In the present work, we base our main analysis on the theory of local Rademacher complexities, which allows us to derive improved and more precise rates of convergence that cover the whole range of p ∈ [1, ∞]. [sent-64, score-0.057]
24 • A lower bound is shown that, up to absolute constants, matches the upper bounds, showing that our results are tight. [sent-67, score-0.112]
25 • The generalization performance of ℓp-norm MKL as guaranteed by the excess risk bound is studied for varying values of p, shedding light on the appropriateness of a small/large p in various learning scenarios. [sent-68, score-0.218]
26 Furthermore, we also present a simpler, more general proof of the global Rademacher bound shown in [8] (at the expense of a slightly worse constant). [sent-69, score-0.156]
27 A comparison of the rates obtained with local and global Rademacher analysis is carried out in Section 3. [sent-70, score-0.088]
28 We denote the (normalized) kernel matrices corresponding to k and k_m by K and K_m, respectively. [sent-73, score-0.197]
29 We work with operators in Hilbert spaces and use, instead of the usual vector/matrix notation φ(x)φ(x)⊤, the tensor notation φ(x) ⊗ φ(x) ∈ HS(H), which is a Hilbert–Schmidt operator H → H defined as (φ(x) ⊗ φ(x))u = ⟨φ(x), u⟩ φ(x). [sent-90, score-0.048]
30 We denote by J = E φ(x) ⊗ φ(x) and J_m = E φ_m(x) ⊗ φ_m(x) the uncentered covariance operators corresponding to the variables φ(x) and φ_m(x), respectively; it holds that tr(J) = E ‖φ(x)‖₂² and tr(J_m) = E ‖φ_m(x)‖₂². [sent-92, score-0.095]
31 Global Rademacher Complexities. We first review global Rademacher complexities (GRC) in multiple kernel learning. [sent-93, score-0.203]
32 The global Rademacher complexity is defined as R(H_p) = E sup_{f_w ∈ H_p} ⟨w, (1/n) Σ_{i=1}^n σ_i φ(x_i)⟩, where (σ_i)_{1≤i≤n} is an i.i.d. sequence of Rademacher variables. [sent-101, score-0.124]
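Since sup_{‖w‖_{2,p} ≤ D} ⟨w, v⟩ = D ‖v‖_{2,p*} by Hölder duality, and ‖(1/n) Σ_i σ_i φ_m(x_i)‖₂² = σᵀK_m σ / n² for unnormalized Gram matrices K_m, the empirical version of this quantity can be approximated by Monte Carlo over the Rademacher signs. The sketch below is illustrative only (its function name and defaults are not from the paper):

```python
import numpy as np

def empirical_grc(K_list, D, p, n_mc=2000, rng=None):
    """Monte Carlo estimate of the empirical global Rademacher complexity of H_p,
    using sup_{||w||_{2,p} <= D} <w, v> = D * ||v||_{2,p*} with v_m = (1/n) sum_i sigma_i phi_m(x_i)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = K_list[0].shape[0]
    p_star = np.inf if p == 1.0 else p / (p - 1.0)
    vals = []
    for _ in range(n_mc):
        sigma = rng.choice([-1.0, 1.0], size=n)
        block_sq = np.array([sigma @ K @ sigma for K in K_list]) / n**2   # ||v_m||_2^2
        block = np.sqrt(np.maximum(block_sq, 0.0))
        norm = block.max() if np.isinf(p_star) else np.sum(block ** p_star) ** (1.0 / p_star)
        vals.append(D * norm)
    return float(np.mean(vals))
```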
33 This bound is tight and improves a series of loose results that were given for p = 1 in the past (see [8] and references therein). [sent-110, score-0.085]
34 In fact, the above result can be extended to the whole range of p ∈ [1, ∞] (in the supplementary material we present a quite simple proof using c = 1): Proposition 1 (GLOBAL RADEMACHER COMPLEXITY BOUND). [sent-111, score-0.075]
35 For any p ≥ 1 the empirical version of the global Rademacher complexity of the multi-kernel class H_p can be bounded as R̂(H_p) ≤ min_{t ∈ [p,∞]} D ( (t*/n) ‖ ( tr(K_m)/n )_{m=1}^M ‖_{t*/2} )^{1/2}. [sent-112, score-0.131]
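A numerical reading of the right-hand side of Proposition 1 is sketched below. The exact constants and normalization follow the display as reconstructed above and should be treated as an assumption; the grid minimization is only an approximation of the min over t ∈ [p, ∞]:

```python
import numpy as np

def grc_bound(K_list, D, p, t_grid=None):
    """Evaluate min over t >= p of D * sqrt( (t*/n) * || (tr(K_m)/n)_m ||_{t*/2} )
    on a finite grid of t values (unnormalized Gram matrices K_m assumed)."""
    n = K_list[0].shape[0]
    traces = np.array([np.trace(K) / n for K in K_list])
    t_grid = t_grid if t_grid is not None else np.linspace(max(p, 1.0 + 1e-6), 16.0, 200)
    best = np.inf
    for t in t_grid:
        t_star = t / (t - 1.0)
        r = t_star / 2.0
        # for very large r, || . ||_r tends to the max norm; use that limit to avoid overflow
        norm = traces.max() if r >= 50 else np.sum(traces ** r) ** (1.0 / r)
        best = min(best, D * np.sqrt(t_star / n * norm))
    return best
```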
36 Interestingly, the above GRC bound is not monotonic in p and thus the minimum is not always attained for t := p. [sent-113, score-0.13]
37 We define the local Rademacher complexity (LRC) of H_p as R_r(H_p) = E sup_{f_w ∈ H_p : P f_w² ≤ r} ⟨w, (1/n) Σ_{i=1}^n σ_i φ(x_i)⟩, where P f_w² := E (f_w(x))². [sent-121, score-1.082]
38 We will also use the following assumption in the bounds for the case p ∈ [1, 2]: Assumption (U) (no-correlation). [sent-123, score-0.042]
39 For example, if X = R^M, the above means that the input variable x ∈ X has independent coordinates, and the kernels k_1, …, k_M each act on a different coordinate. [sent-129, score-0.077]
40 To state the bounds, note that the covariance operators admit discrete eigenvalue–eigenvector decompositions J = E φ(x) ⊗ φ(x) = Σ_{j=1}^∞ λ_j u_j ⊗ u_j and J_m = E φ_m(x) ⊗ φ_m(x) = Σ_{j=1}^∞ λ_j^{(m)} u_j^{(m)} ⊗ u_j^{(m)}, where (u_j)_{j≥1} and (u_j^{(m)})_{j≥1} form orthonormal bases of H and H_m, respectively. [sent-134, score-0.761]
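In practice the eigenvalues λ_j^{(m)} of J_m can be approximated by the spectrum of the normalized Gram matrix K_m/n, and the decay rate α can be fitted from that spectrum. This is a standard plug-in, shown here only as an illustration (the fitting range is arbitrary):

```python
import numpy as np

def empirical_spectra(K_list):
    """For each kernel, the sorted eigenvalues of K_m / n as plug-in
    estimates of the eigenvalues lambda_j^(m) of the covariance operator J_m."""
    n = K_list[0].shape[0]
    return [np.sort(np.linalg.eigvalsh(K / n))[::-1] for K in K_list]

def estimate_decay(eigs, j_min=2, j_max=None):
    """Rough estimate of the polynomial decay rate alpha in lambda_j ~ d * j^(-alpha),
    via a least-squares fit of log(lambda_j) against log(j)."""
    j_max = j_max or len(eigs)
    j = np.arange(j_min, j_max + 1)
    lam = np.maximum(np.asarray(eigs)[j_min - 1:j_max], 1e-12)
    slope, _ = np.polyfit(np.log(j), np.log(lam), 1)
    return -slope
```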
41 Assume that the kernels are uniformly bounded (‖k‖_∞ ≤ B < ∞) and that Assumption (U) holds. [sent-136, score-0.077]
42 Theorem 2 (LOCAL RADEMACHER COMPLEXITY BOUND, p ∈ [1, 2]). The local Rademacher complexity of the multi-kernel class H_p can be bounded for any p ∈ [1, 2] as [sent-137, score-0.134]
43 R_r(H_p) ≤ min_{t∈[p,2]} [ ( (16/n) ‖ ( Σ_{j=1}^∞ min{ r M^{1−2/t*}, c e D² t*² λ_j^{(m)} } )_{m=1}^M ‖_{t*/2} )^{1/2} + (√(Be) D M^{1/t*} t*) / n ]. Theorem 3 (LOCAL RADEMACHER COMPLEXITY BOUND, p ∈ [2, ∞]). [sent-138, score-0.048]
44 For any p ∈ [2, ∞], R_r(H_p) ≤ min_{t∈[p,∞]} ( (2/n) Σ_{j=1}^∞ min( r, D² M^{2/t*−1} λ_j ) )^{1/2}. [sent-139, score-0.036]
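Both theorems are driven by truncated spectral sums of the form Σ_j min(r, c·λ_j). The helper below evaluates that quantity and a grid reading of the Theorem 3 bound; the scaling D² M^{2/t*−1} follows the reconstruction above and should be treated as an assumption:

```python
import numpy as np

def truncated_spectral_sum(eigs, r, c=1.0):
    """sum_j min(r, c * lambda_j): large eigenvalues contribute r, the tail contributes c*lambda_j."""
    return float(np.sum(np.minimum(r, c * np.asarray(eigs))))

def lrc_bound_p_ge_2(eigs_J, r, D, M, p, n, t_grid=None):
    """Grid reading of the Theorem 3 bound (as reconstructed above):
    min over t >= p of sqrt( (2/n) * sum_j min(r, D^2 * M^(2/t* - 1) * lambda_j) ),
    where eigs_J are the eigenvalues of the whole-space covariance operator J."""
    t_grid = t_grid if t_grid is not None else np.linspace(max(p, 2.0), 32.0, 200)
    best = np.inf
    for t in t_grid:
        t_star = t / (t - 1.0)
        s = truncated_spectral_sum(eigs_J, r, c=D**2 * M**(2.0 / t_star - 1.0))
        best = min(best, np.sqrt(2.0 * s / n))
    return best
```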
45 j=1 It is interesting to compare the above bounds for the special case p = 2 with the ones of Bartlett et al. [sent-140, score-0.042]
46 The main term of the bound of Theorem 3 (taking t = p = 2) is then essentially determined by a term of the form ( (1/n) Σ_{m=1}^M Σ_{j=1}^∞ min(r, λ_j^{(m)}) )^{1/2}. [sent-142, score-0.085]
47 It is also interesting to study the case p = 1: by using t = (log M)* in Theorem 2, we obtain the bound R_r(H_1) ≤ ( (16/n) Σ_{m=1}^M Σ_{j=1}^∞ min{ rM, e³ D² (log M)² λ_j^{(m)} } )^{1/2} + ( √(B e³) D log M ) / n, for all M ≥ e². [sent-144, score-0.121]
48 The proof of Theorem 3 is straightforward and is given in the supplementary material C. [sent-146, score-0.075]
49 In order to exploit the no-correlation assumption, we will work in large parts of the proof with the centered class H̃_p = { f̃_w : x ↦ ⟨w, φ̃(x)⟩ : ‖w‖_{2,p} ≤ D }, wherein φ̃(x) := φ(x) − Eφ(x). [sent-151, score-1.11]
50 We have ‖Eφ(x)‖_{2,p*} = ( Σ_{m=1}^M ⟨Eφ_m(x), Eφ_m(x)⟩^{p*/2} )^{1/p*} ≤ (Jensen) ( Σ_{m=1}^M ( E⟨φ_m(x), φ_m(x)⟩ )^{p*/2} )^{1/p*} = ‖ ( tr(J_m) )_{m=1}^M ‖_{p*/2}^{1/2}   (3), so that we can express the complexity of the centered class in terms of the uncentered one as follows: R_r(H_p) ≤ E sup_{f_w∈H_p, P f_w²≤r} ⟨w, (1/n) Σ_{i=1}^n σ_i φ̃(x_i)⟩ + E sup_{f_w∈H_p, P f_w²≤r} ⟨w, (1/n) Σ_{i=1}^n σ_i Eφ(x)⟩. [sent-153, score-0.269]
51 Concerning the first term of the above upper bound, using (2) we have P f̃_w² ≤ P f_w², and thus E sup_{f_w∈H_p, P f_w²≤r} ⟨w, (1/n) Σ_{i=1}^n σ_i φ̃(x_i)⟩ ≤ E sup_{f_w∈H_p, P f̃_w²≤r} ⟨w, (1/n) Σ_{i=1}^n σ_i φ̃(x_i)⟩ = R_r(H̃_p). [sent-154, score-5.129]
52 i=1 Now to bound the second term, we write n √ 1 E sup w, σi Eφ(x) ≤ n sup w, Eφ(x) . [sent-155, score-0.197]
53 Now observe that we have ⟨w, Eφ(x)⟩ ≤ (Hölder) ‖w‖_{2,p} ‖Eφ(x)‖_{2,p*} ≤ (by (3)) ‖w‖_{2,p} ‖ ( tr(J_m) )_{m=1}^M ‖_{p*/2}^{1/2},   (4) [sent-156, score-2.528]
54 as well as ⟨w, Eφ(x)⟩ = E f_w(x) ≤ √(P f_w²). This shows that there is no loss in working with the centered class instead of the uncentered one. [sent-158, score-0.114]
55 First we note that, since the (centered) covariance operator Eφ_m(x) ⊗ φ_m(x) is also a self-adjoint Hilbert–Schmidt operator on H_m, there exists an eigendecomposition Eφ_m(x) ⊗ φ_m(x) = Σ_{j=1}^∞ λ_j^{(m)} u_j^{(m)} ⊗ u_j^{(m)}, wherein (u_j^{(m)})_{j≥1} is an orthogonal basis of H_m. [sent-161, score-0.423]
56 As a consequence, for all j and m, we have P f_w² = E(f_w(x))² = E( Σ_{m=1}^M ⟨w_m, φ_m(x)⟩ )² = Σ_{m=1}^M Σ_{j=1}^∞ λ_j^{(m)} ⟨w_m, u_j^{(m)}⟩²   (5), and E⟨ (1/n) Σ_{i=1}^n σ_i φ_m(x_i), u_j^{(m)} ⟩² = (1/n²) Σ_{i=1}^n E⟨ (φ_m(x_i) ⊗ φ_m(x_i)) u_j^{(m)}, u_j^{(m)} ⟩ = λ_j^{(m)} / n. [sent-163, score-1.048]
57 We can express the LRC in terms of the eigendecompositon as follows Rr (Hp ) = E sup 1 n w, 2 fw ∈Hp :P fw ≤r M C. [sent-168, score-1.054]
58 2 in the supplementary material) to further bound the right ≤ p∗ M m=1 n σi φm (xi ), uj p∗ 2 (m) 2 n i=1 1 j>hm n E (m) n i=1 1 j>hm n term in the above expression as E φm (xi ), uj 1 p∗ (m) uj M m=1 2,p∗ . [sent-175, score-0.662]
59 Note that for p ≥ 2 it holds that p*/2 ≤ 1, and thus it suffices to employ Jensen's inequality once again to move the expectation operator inside the inner term. [sent-176, score-0.066]
60 Thus, by the subadditivity of the root function, R_r(H_p) ≤ ( (r/n) Σ_{m=1}^M h_m )^{1/2} + ( (e p*² D²/n) ‖ ( Σ_{j=h_m+1}^∞ λ_j^{(m)} )_{m=1}^M ‖_{p*/2} )^{1/2} + ( √B M^{1/p*} + √(Be) D M^{1/p*} p* ) / n. [sent-180, score-0.779]
61 Thus we obtain the following preliminary bound: R_r(H_p) ≤ ( (r/n) Σ_{m=1}^M h_m )^{1/2} + ( (4 e p*² D²/n) ‖ ( Σ_{j=h_m+1}^∞ λ_j^{(m)} )_{m=1}^M ‖_{p*/2} )^{1/2} + ( √(Be) D M^{1/p*} p* ) / n,   (8) for all nonnegative integers h_m ≥ 0. [sent-183, score-1.28]
62 (11) Since the above holds for all nonnegative integers hm , the result follows, completing the proof. [sent-187, score-0.471]
63 Lower and Excess Risk Bounds. To investigate the tightness of the presented upper bounds on the LRC of H_p, we consider the case where φ_1(x), …, φ_M(x) are i.i.d. [sent-189, score-0.069]
64 This shows that the upper bounds of the previous section are tight. [sent-208, score-0.069]
65 As an application of our results to prediction problems such as classification or regression, we also bound the excess loss of empirical minimization, f̂ := argmin_f (1/n) Σ_{i=1}^n l(f(x_i), y_i), w.r.t. [sent-209, score-0.21]
66 a loss function l: P(l_f̂ − l_{f*}) := E l(f̂(x), y) − E l(f*(x), y), where f* := argmin_f E l(f(x), y). [sent-212, score-0.171]
67 We build on [2] to show the following excess risk bound under the assumption of algebraically decreasing eigenvalues of the kernel matrices, i.e. [sent-214, score-0.349]
68 ∃ d > 0, α > 1, ∀ m, j: λ_j^{(m)} ≤ d j^{−α} (proof shown in the supplementary material E): Theorem 5. [sent-216, score-0.056]
69 Let l be a Lipschitz continuous loss with constant L and assume there is a positive constant F such that ∀f ∈ F : P (f − f ∗ )2 ≤ F P (lf − lf ∗ ). [sent-218, score-0.138]
70 Then for all z > 0, with probability at least 1 − e^{−z}, the excess loss of the multi-kernel class H_p can be bounded for p ∈ [1, 2] as P(l_f̂ − l_{f*}) ≤ min_{t∈[p,2]} [ c(α, d, F) (D L t*)^{2/(1+α)} M^{1 + (2/(1+α))(1/t* − 1)} n^{−α/(1+α)} + ( 47 √B D L M^{1/t*} t* + (22 √B D L M^{1/t*} t* + 27 F) z ) / n ], where c(α, d, F) is an explicit constant. [sent-219, score-0.287]
71 n n 1 1 We see from the above bound that convergence can be almost as slow as O p∗ M p∗ n− 2 (if α ≈ 1 is small ) and almost as fast as O n−1 (if α is large). [sent-220, score-0.085]
72 3 Interpretation of Bounds. In this section, we discuss the rates of Theorem 5 obtained by the local analysis, that is, ∀t ∈ [p, 2]: P(l_f̂ − l_{f*}) = O( (t* D)^{2/(1+α)} M^{1 + (2/(1+α))(1/t* − 1)} n^{−α/(1+α)} )   (13). [sent-221, score-0.195]
73 On the other hand, the global Rademacher complexity directly leads to a bound of the form [8]: ∀t ∈ [p, 2]: P(l_f̂ − l_{f*}) = O( t* D M^{1/t*} n^{−1/2} )   (14). [sent-222, score-0.297]
74 Clearly, the rate obtained through local analysis is better in n since α > 1. [sent-224, score-0.054]
75 Regarding the rate in the number of kernels M and the radius D, a straightforward calculation shows that the local analysis improves over the global one whenever M^{1/p}/D = O(√n). [sent-225, score-0.162]
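Taking the two displayed orders at face value, one can compare them numerically; note that the local expression encodes the reconstruction of (13) above and is therefore an assumption of this sketch, not a statement from the paper:

```python
import numpy as np

def local_rate(n, M, D, alpha, t):
    # reading of (13) as reconstructed above -- treat the exponents as an assumption
    t_star = t / (t - 1.0)
    return (t_star * D) ** (2.0 / (1.0 + alpha)) \
        * M ** (1.0 + 2.0 / (1.0 + alpha) * (1.0 / t_star - 1.0)) \
        * n ** (-alpha / (1.0 + alpha))

def global_rate(n, M, D, t):
    # order of the global bound (14): t* D M^(1/t*) n^(-1/2)
    t_star = t / (t - 1.0)
    return t_star * D * M ** (1.0 / t_star) * n ** -0.5

M, D, alpha, t = 50, 1.0, 2.0, 2.0      # arbitrary illustration values
for n in (10**2, 10**3, 10**4, 10**5, 10**6):
    print(n, f"local~{local_rate(n, M, D, alpha, t):.3e}",
          f"global~{global_rate(n, M, D, t):.3e}")
```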
76 Second, if p ≤ (log M)*, the best choice in (13) and (14) is t = (log M)*, so that P(l_f̂ − l_{f*}) ≤ O( min{ M n^{−1}, min_{t∈[p,2]} t* D M^{1/t*} n^{−1/2} } )   (15), and the phase transition occurs for D log M = O(√n). [sent-229, score-0.228]
77 This situation is to be compared to the sharp analysis of the optimal convergence rate of convex aggregation of M functions obtained by [27] in the framework of squared error loss regression, which is shown to be O( min{ M/n, ( (1/n) log(M/√n) )^{1/2} } ). [sent-231, score-0.097]
78 This corresponds to the setting studied here with D = 1, p = 1, and α → ∞, and we see that in this case our bound recovers (up to log factors) this sharp bound and the related phase transition phenomenon. [sent-232, score-0.21]
79 Note that Assumption (U), which enters through (5) and of which a similar version was also used in [23], can be relaxed to a more general, RIP-like assumption as used in [16]; this comes at the expense of an additional factor in the bounds (details omitted here). [sent-234, score-0.063]
80 As a practical application of the presented bounds, we analyze the impact of the norm parameter p on the accuracy of ℓp-norm MKL in various learning scenarios, showing why an intermediate p often turns out to be optimal in practical applications. [sent-236, score-0.087]
81 As indicated in the introduction, there is empirical evidence that the performance of ℓp-norm MKL crucially depends on the choice of the norm parameter p (for example, cf. [sent-237, score-0.05]
82 Figure 1). [sent-238, score-0.255]
83 Figure 2: Illustration of the three analyzed learning scenarios (TOP), differing in the soft sparsity of the Bayes hypothesis w* (parametrized by β), and corresponding values of the bound factor ν_t as a function of p (BOTTOM). [sent-257, score-0.162]
84 A soft sparse (LEFT), an intermediate non-sparse (CENTER), and an almost uniform w* (RIGHT). [sent-258, score-0.092]
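The three scenarios differ only in how quickly the blocks of w* decay. The sketch below generates such vectors with a geometric parametrization by β (this parametrization is a guess at the construction, used purely for illustration) and shows how the block norms ‖w*‖_{2,p} behave across p:

```python
import numpy as np

def soft_sparse_w(M, beta, block_dim=3, rng=None):
    """Blocks w*_m with geometrically decaying norms ~ beta^m; beta near 1 means nearly uniform."""
    rng = rng if rng is not None else np.random.default_rng(0)
    return [beta ** m * rng.standard_normal(block_dim) for m in range(M)]

def norm_2p(blocks, p):
    b = np.array([np.linalg.norm(x) for x in blocks])
    return b.max() if np.isinf(p) else np.sum(b ** p) ** (1.0 / p)

M = 30
for beta in (0.6, 0.85, 0.98):          # soft sparse -> intermediate -> almost uniform
    w = soft_sparse_w(M, beta)
    print(beta, [round(norm_2p(w, p), 2) for p in (1.0, 1.33, 2.0)])
```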
85 To start with, first note that the choice of p only affects the excess risk bound in the factor (cf. [sent-261, score-0.218]
86 Plugging in this value for Dp , the bound factor νp becomes νp := w∗ 2 1+α p 2 2 min t∗ 1+α M 1+ 1+α 1 t∗ −1 . [sent-270, score-0.121]
87 Note that the soft sparsity of w∗ is increased from the left hand to the right hand side. [sent-275, score-0.077]
88 2, while for the intermediate case (CENTER) p = 1. [sent-277, score-0.047]
89 This means that if the true Bayes hypothesis has an intermediately dense representation (which is frequently encountered in practical applications), our bound gives the strongest generalization guarantees to p -norm MKL using an intermediate choice of p. [sent-281, score-0.173]
90 4 Conclusion. We derived a sharp upper bound on the local Rademacher complexity of ℓp-norm multiple kernel learning. [sent-282, score-0.342]
91 We also proved a lower bound that matches the upper one and shows that our result is tight. [sent-283, score-0.112]
92 Using the local Rademacher complexity bound, we derived an excess risk bound that attains the fast rate of O(n^{−α/(1+α)}), where α is the minimum eigenvalue decay rate of the individual kernels. [sent-284, score-0.383]
93 In a practical case study, we found that the optimal value of that bound depends on the true Bayes-optimal kernel weights. [sent-285, score-0.236]
94 If the true weights exhibit soft sparsity but are not strongly sparse, then the generalization bound is minimized for an intermediate p. [sent-286, score-0.209]
95 This is not only intuitive but also supports empirical studies showing that sparse MKL (p = 1) rarely works in practice, while some intermediate choice of p can improve performance. [sent-287, score-0.047]
96 Multiple kernel learning, conic duality, and the SMO algorithm. [sent-303, score-0.131]
97 Let the kernel figure it out: Principled learning of pre-processing for kernel classifiers. [sent-373, score-0.262]
98 The best constant in the Rosenthal inequality for nonnegative random variables. [sent-378, score-0.104]
99 Local Rademacher complexities and oracle inequalities in risk minimization. [sent-412, score-0.082]
100 Minimax-optimal rates for sparse additive models over kernel classes via convex programming. [sent-472, score-0.154]
wordName wordTfidf (topN-words)
[('fw', 0.499), ('hp', 0.456), ('hm', 0.375), ('mkl', 0.298), ('uj', 0.183), ('rademacher', 0.169), ('rr', 0.141), ('lf', 0.138), ('kernel', 0.131), ('jm', 0.106), ('excess', 0.092), ('bound', 0.085), ('efw', 0.083), ('kloft', 0.081), ('kernels', 0.077), ('tep', 0.073), ('bedm', 0.067), ('lrc', 0.067), ('km', 0.066), ('cortes', 0.057), ('sup', 0.056), ('tr', 0.054), ('centered', 0.053), ('ep', 0.053), ('ademacher', 0.05), ('eft', 0.05), ('marius', 0.05), ('supfw', 0.05), ('intermediate', 0.047), ('soft', 0.045), ('ight', 0.044), ('rosenthal', 0.044), ('complexity', 0.043), ('bounds', 0.042), ('complexities', 0.041), ('risk', 0.041), ('rm', 0.04), ('uncentered', 0.04), ('lanckriet', 0.04), ('nonnegative', 0.039), ('xi', 0.039), ('sonnenburg', 0.038), ('jensen', 0.037), ('min', 0.036), ('local', 0.034), ('bartlett', 0.034), ('argminf', 0.033), ('elating', 0.033), ('grc', 0.033), ('invited', 0.033), ('ocal', 0.033), ('ounding', 0.033), ('ller', 0.033), ('lder', 0.033), ('sch', 0.032), ('sparsity', 0.032), ('norm', 0.031), ('hs', 0.031), ('integers', 0.031), ('global', 0.031), ('mohri', 0.03), ('subadditivity', 0.029), ('aq', 0.029), ('spec', 0.029), ('operators', 0.029), ('hilbert', 0.028), ('material', 0.028), ('theorem', 0.028), ('supplementary', 0.028), ('svm', 0.027), ('upper', 0.027), ('blanchard', 0.027), ('brefeld', 0.027), ('holds', 0.026), ('talk', 0.025), ('spectra', 0.025), ('gilles', 0.025), ('eigenvalue', 0.025), ('attained', 0.024), ('decay', 0.023), ('tsch', 0.023), ('rates', 0.023), ('dp', 0.023), ('sharp', 0.022), ('monotonic', 0.021), ('bm', 0.021), ('class', 0.021), ('frequently', 0.021), ('expense', 0.021), ('inequality', 0.021), ('rate', 0.02), ('practical', 0.02), ('wherein', 0.019), ('proof', 0.019), ('operator', 0.019), ('aggregation', 0.019), ('rkhs', 0.019), ('evidence', 0.019), ('phase', 0.018), ('enter', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 286 nips-2011-The Local Rademacher Complexity of Lp-Norm Multiple Kernel Learning
Author: Marius Kloft, Gilles Blanchard
Abstract: We derive an upper bound on the local Rademacher complexity of ℓp-norm multiple kernel learning, which yields a tighter excess risk bound than global approaches. Previous local approaches analyzed the case p = 1 only, while our analysis covers all cases 1 ≤ p ≤ ∞, assuming the different feature mappings corresponding to the different kernels to be uncorrelated. We also prove a lower bound that matches the upper bound, showing that our bound is tight, and derive consequences regarding excess loss, namely fast convergence rates of the order O(n^{-α/(1+α)}), where α is the minimum eigenvalue decay rate of the individual kernels.
2 0.32804143 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning
Author: Taiji Suzuki
Abstract: In this paper, we give a new generalization error bound of Multiple Kernel Learning (MKL) for a general class of regularizations. Our main target in this paper is dense type regularizations including ℓp -MKL that imposes ℓp -mixed-norm regularization instead of ℓ1 -mixed-norm regularization. According to the recent numerical experiments, the sparse regularization does not necessarily show a good performance compared with dense type regularizations. Motivated by this fact, this paper gives a general theoretical tool to derive fast learning rates that is applicable to arbitrary mixed-norm-type regularizations in a unifying manner. As a by-product of our general result, we show a fast learning rate of ℓp -MKL that is tightest among existing bounds. We also show that our general learning rate achieves the minimax lower bound. Finally, we show that, when the complexities of candidate reproducing kernel Hilbert spaces are inhomogeneous, dense type regularization shows better learning rate compared with sparse ℓ1 regularization. 1
3 0.2284397 189 nips-2011-Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels
Author: Vikas Sindhwani, Aurelie C. Lozano
Abstract: We consider regularized risk minimization in a large dictionary of Reproducing kernel Hilbert Spaces (RKHSs) over which the target function has a sparse representation. This setting, commonly referred to as Sparse Multiple Kernel Learning (MKL), may be viewed as the non-parametric extension of group sparsity in linear models. While the two dominant algorithmic strands of sparse learning, namely convex relaxations using l1 norm (e.g., Lasso) and greedy methods (e.g., OMP), have both been rigorously extended for group sparsity, the sparse MKL literature has so far mainly adopted the former with mild empirical success. In this paper, we close this gap by proposing a Group-OMP based framework for sparse MKL. Unlike l1 -MKL, our approach decouples the sparsity regularizer (via a direct l0 constraint) from the smoothness regularizer (via RKHS norms), which leads to better empirical performance and a simpler optimization procedure that only requires a black-box single-kernel solver. The algorithmic development and empirical studies are complemented by theoretical analyses in terms of Rademacher generalization bounds and sparse recovery conditions analogous to those for OMP [27] and Group-OMP [16]. 1
4 0.11776754 233 nips-2011-Rapid Deformable Object Detection using Dual-Tree Branch-and-Bound
Author: Iasonas Kokkinos
Abstract: In this work we use Branch-and-Bound (BB) to efficiently detect objects with deformable part models. Instead of evaluating the classifier score exhaustively over image locations and scales, we use BB to focus on promising image locations. The core problem is to compute bounds that accommodate part deformations; for this we adapt the Dual Trees data structure [7] to our problem. We evaluate our approach using Mixture-of-Deformable Part Models [4]. We obtain exactly the same results but are 10-20 times faster on average. We also develop a multiple-object detection variation of the system, where hypotheses for 20 categories are inserted in a common priority queue. For the problem of finding the strongest category in an image this results in a 100-fold speedup.
5 0.10455104 171 nips-2011-Metric Learning with Multiple Kernels
Author: Jun Wang, Huyen T. Do, Adam Woznica, Alexandros Kalousis
Abstract: Metric learning has become a very active research field. The most popular representative–Mahalanobis metric learning–can be seen as learning a linear transformation and then computing the Euclidean metric in the transformed space. Since a linear transformation might not always be appropriate for a given learning problem, kernelized versions of various metric learning algorithms exist. However, the problem then becomes finding the appropriate kernel function. Multiple kernel learning addresses this limitation by learning a linear combination of a number of predefined kernels; this approach can be also readily used in the context of multiple-source learning to fuse different data sources. Surprisingly, and despite the extensive work on multiple kernel learning for SVMs, there has been no work in the area of metric learning with multiple kernel learning. In this paper we fill this gap and present a general approach for metric learning with multiple kernel learning. Our approach can be instantiated with different metric learning algorithms provided that they satisfy some constraints. Experimental evidence suggests that our approach outperforms metric learning with an unweighted kernel combination and metric learning with cross-validation based kernel selection. 1
6 0.084526099 198 nips-2011-On U-processes and clustering performance
7 0.063565947 288 nips-2011-Thinning Measurement Models and Questionnaire Design
8 0.063459143 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
9 0.062291425 284 nips-2011-The Impact of Unlabeled Patterns in Rademacher Complexity Theory for Kernel Classifiers
10 0.061237551 31 nips-2011-An Application of Tree-Structured Expectation Propagation for Channel Decoding
11 0.056812644 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
12 0.055410855 26 nips-2011-Additive Gaussian Processes
13 0.054051738 207 nips-2011-Optimal learning rates for least squares SVMs using Gaussian kernels
14 0.053585127 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
15 0.05329347 139 nips-2011-Kernel Bayes' Rule
16 0.048453379 12 nips-2011-A Two-Stage Weighting Framework for Multi-Source Domain Adaptation
17 0.048153669 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
18 0.048132118 162 nips-2011-Lower Bounds for Passive and Active Learning
19 0.044653263 145 nips-2011-Learning Eigenvectors for Free
20 0.044302151 152 nips-2011-Learning in Hilbert vs. Banach Spaces: A Measure Embedding Viewpoint
topicId topicWeight
[(0, 0.136), (1, -0.034), (2, -0.05), (3, -0.109), (4, -0.045), (5, 0.067), (6, 0.047), (7, -0.081), (8, 0.02), (9, 0.051), (10, -0.162), (11, 0.16), (12, 0.125), (13, 0.158), (14, -0.105), (15, 0.015), (16, 0.097), (17, 0.018), (18, -0.017), (19, 0.229), (20, -0.03), (21, -0.205), (22, -0.119), (23, -0.007), (24, 0.05), (25, -0.03), (26, -0.047), (27, -0.01), (28, 0.183), (29, 0.027), (30, -0.164), (31, -0.175), (32, -0.059), (33, 0.072), (34, -0.067), (35, 0.076), (36, -0.113), (37, 0.038), (38, 0.007), (39, 0.044), (40, -0.047), (41, 0.079), (42, 0.006), (43, -0.004), (44, 0.079), (45, -0.02), (46, 0.019), (47, 0.028), (48, 0.097), (49, -0.131)]
simIndex simValue paperId paperTitle
same-paper 1 0.94531029 286 nips-2011-The Local Rademacher Complexity of Lp-Norm Multiple Kernel Learning
Author: Marius Kloft, Gilles Blanchard
Abstract: We derive an upper bound on the local Rademacher complexity of ℓp-norm multiple kernel learning, which yields a tighter excess risk bound than global approaches. Previous local approaches analyzed the case p = 1 only, while our analysis covers all cases 1 ≤ p ≤ ∞, assuming the different feature mappings corresponding to the different kernels to be uncorrelated. We also prove a lower bound that matches the upper bound, showing that our bound is tight, and derive consequences regarding excess loss, namely fast convergence rates of the order O(n^{-α/(1+α)}), where α is the minimum eigenvalue decay rate of the individual kernels.
2 0.88941371 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning
Author: Taiji Suzuki
Abstract: In this paper, we give a new generalization error bound of Multiple Kernel Learning (MKL) for a general class of regularizations. Our main target in this paper is dense type regularizations including ℓp -MKL that imposes ℓp -mixed-norm regularization instead of ℓ1 -mixed-norm regularization. According to the recent numerical experiments, the sparse regularization does not necessarily show a good performance compared with dense type regularizations. Motivated by this fact, this paper gives a general theoretical tool to derive fast learning rates that is applicable to arbitrary mixed-norm-type regularizations in a unifying manner. As a by-product of our general result, we show a fast learning rate of ℓp -MKL that is tightest among existing bounds. We also show that our general learning rate achieves the minimax lower bound. Finally, we show that, when the complexities of candidate reproducing kernel Hilbert spaces are inhomogeneous, dense type regularization shows better learning rate compared with sparse ℓ1 regularization. 1
3 0.69598824 189 nips-2011-Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels
Author: Vikas Sindhwani, Aurelie C. Lozano
Abstract: We consider regularized risk minimization in a large dictionary of Reproducing kernel Hilbert Spaces (RKHSs) over which the target function has a sparse representation. This setting, commonly referred to as Sparse Multiple Kernel Learning (MKL), may be viewed as the non-parametric extension of group sparsity in linear models. While the two dominant algorithmic strands of sparse learning, namely convex relaxations using l1 norm (e.g., Lasso) and greedy methods (e.g., OMP), have both been rigorously extended for group sparsity, the sparse MKL literature has so far mainly adopted the former with mild empirical success. In this paper, we close this gap by proposing a Group-OMP based framework for sparse MKL. Unlike l1 -MKL, our approach decouples the sparsity regularizer (via a direct l0 constraint) from the smoothness regularizer (via RKHS norms), which leads to better empirical performance and a simpler optimization procedure that only requires a black-box single-kernel solver. The algorithmic development and empirical studies are complemented by theoretical analyses in terms of Rademacher generalization bounds and sparse recovery conditions analogous to those for OMP [27] and Group-OMP [16]. 1
4 0.42857859 171 nips-2011-Metric Learning with Multiple Kernels
Author: Jun Wang, Huyen T. Do, Adam Woznica, Alexandros Kalousis
Abstract: Metric learning has become a very active research field. The most popular representative–Mahalanobis metric learning–can be seen as learning a linear transformation and then computing the Euclidean metric in the transformed space. Since a linear transformation might not always be appropriate for a given learning problem, kernelized versions of various metric learning algorithms exist. However, the problem then becomes finding the appropriate kernel function. Multiple kernel learning addresses this limitation by learning a linear combination of a number of predefined kernels; this approach can be also readily used in the context of multiple-source learning to fuse different data sources. Surprisingly, and despite the extensive work on multiple kernel learning for SVMs, there has been no work in the area of metric learning with multiple kernel learning. In this paper we fill this gap and present a general approach for metric learning with multiple kernel learning. Our approach can be instantiated with different metric learning algorithms provided that they satisfy some constraints. Experimental evidence suggests that our approach outperforms metric learning with an unweighted kernel combination and metric learning with cross-validation based kernel selection. 1
5 0.42529973 152 nips-2011-Learning in Hilbert vs. Banach Spaces: A Measure Embedding Viewpoint
Author: Kenji Fukumizu, Gert R. Lanckriet, Bharath K. Sriperumbudur
Abstract: The goal of this paper is to investigate the advantages and disadvantages of learning in Banach spaces over Hilbert spaces. While many works have been carried out in generalizing Hilbert methods to Banach spaces, in this paper, we consider the simple problem of learning a Parzen window classifier in a reproducing kernel Banach space (RKBS)—which is closely related to the notion of embedding probability measures into an RKBS—in order to carefully understand its pros and cons over the Hilbert space classifier. We show that while this generalization yields richer distance measures on probabilities compared to its Hilbert space counterpart, it however suffers from serious computational drawback limiting its practical applicability, which therefore demonstrates the need for developing efficient learning algorithms in Banach spaces.
6 0.39549583 139 nips-2011-Kernel Bayes' Rule
7 0.36935413 284 nips-2011-The Impact of Unlabeled Patterns in Rademacher Complexity Theory for Kernel Classifiers
8 0.33564201 288 nips-2011-Thinning Measurement Models and Questionnaire Design
9 0.31352693 198 nips-2011-On U-processes and clustering performance
10 0.31205451 216 nips-2011-Portmanteau Vocabularies for Multi-Cue Image Representation
11 0.27855071 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
12 0.2766338 233 nips-2011-Rapid Deformable Object Detection using Dual-Tree Branch-and-Bound
13 0.27504674 26 nips-2011-Additive Gaussian Processes
14 0.25261921 238 nips-2011-Relative Density-Ratio Estimation for Robust Distribution Comparison
15 0.25252241 162 nips-2011-Lower Bounds for Passive and Active Learning
16 0.24588923 207 nips-2011-Optimal learning rates for least squares SVMs using Gaussian kernels
17 0.23909184 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
18 0.23824403 282 nips-2011-The Fast Convergence of Boosting
19 0.23061235 61 nips-2011-Contextual Gaussian Process Bandit Optimization
20 0.22613801 202 nips-2011-On the Universality of Online Mirror Descent
topicId topicWeight
[(0, 0.034), (4, 0.047), (17, 0.287), (20, 0.039), (26, 0.042), (31, 0.063), (33, 0.015), (43, 0.086), (45, 0.113), (48, 0.024), (57, 0.024), (65, 0.062), (74, 0.036), (83, 0.025), (99, 0.024)]
simIndex simValue paperId paperTitle
1 0.79384774 27 nips-2011-Advice Refinement in Knowledge-Based SVMs
Author: Gautam Kunapuli, Richard Maclin, Jude W. Shavlik
Abstract: Knowledge-based support vector machines (KBSVMs) incorporate advice from domain experts, which can improve generalization significantly. A major limitation that has not been fully addressed occurs when the expert advice is imperfect, which can lead to poorer models. We propose a model that extends KBSVMs and is able to not only learn from data and advice, but also simultaneously improves the advice. The proposed approach is particularly effective for knowledge discovery in domains with few labeled examples. The proposed model contains bilinear constraints, and is solved using two iterative approaches: successive linear programming and a constrained concave-convex approach. Experimental results demonstrate that these algorithms yield useful refinements to expert advice, as well as improve the performance of the learning algorithm overall.
same-paper 2 0.75348407 286 nips-2011-The Local Rademacher Complexity of Lp-Norm Multiple Kernel Learning
Author: Marius Kloft, Gilles Blanchard
Abstract: We derive an upper bound on the local Rademacher complexity of ℓp-norm multiple kernel learning, which yields a tighter excess risk bound than global approaches. Previous local approaches analyzed the case p = 1 only, while our analysis covers all cases 1 ≤ p ≤ ∞, assuming the different feature mappings corresponding to the different kernels to be uncorrelated. We also prove a lower bound that matches the upper bound, showing that our bound is tight, and derive consequences regarding excess loss, namely fast convergence rates of the order O(n^{-α/(1+α)}), where α is the minimum eigenvalue decay rate of the individual kernels.
3 0.67389423 276 nips-2011-Structured sparse coding via lateral inhibition
Author: Arthur D. Szlam, Karol Gregor, Yann L. Cun
Abstract: This work describes a conceptually simple method for structured sparse coding and dictionary design. Supposing a dictionary with K atoms, we introduce a structure as a set of penalties or interactions between every pair of atoms. We describe modifications of standard sparse coding algorithms for inference in this setting, and describe experiments showing that these algorithms are efficient. We show that interesting dictionaries can be learned for interactions that encode tree structures or locally connected structures. Finally, we show that our framework allows us to learn the values of the interactions from the data, rather than having them pre-specified. 1
4 0.5405398 189 nips-2011-Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels
Author: Vikas Sindhwani, Aurelie C. Lozano
Abstract: We consider regularized risk minimization in a large dictionary of Reproducing kernel Hilbert Spaces (RKHSs) over which the target function has a sparse representation. This setting, commonly referred to as Sparse Multiple Kernel Learning (MKL), may be viewed as the non-parametric extension of group sparsity in linear models. While the two dominant algorithmic strands of sparse learning, namely convex relaxations using l1 norm (e.g., Lasso) and greedy methods (e.g., OMP), have both been rigorously extended for group sparsity, the sparse MKL literature has so far mainly adopted the former with mild empirical success. In this paper, we close this gap by proposing a Group-OMP based framework for sparse MKL. Unlike l1 -MKL, our approach decouples the sparsity regularizer (via a direct l0 constraint) from the smoothness regularizer (via RKHS norms), which leads to better empirical performance and a simpler optimization procedure that only requires a black-box single-kernel solver. The algorithmic development and empirical studies are complemented by theoretical analyses in terms of Rademacher generalization bounds and sparse recovery conditions analogous to those for OMP [27] and Group-OMP [16]. 1
5 0.53127521 77 nips-2011-Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression
Author: Sham M. Kakade, Varun Kanade, Ohad Shamir, Adam Kalai
Abstract: Generalized Linear Models (GLMs) and Single Index Models (SIMs) provide powerful generalizations of linear regression, where the target variable is assumed to be a (possibly unknown) 1-dimensional function of a linear predictor. In general, these problems entail non-convex estimation procedures, and, in practice, iterative local search heuristics are often used. Kalai and Sastry (2009) provided the first provably efficient method, the Isotron algorithm, for learning SIMs and GLMs, under the assumption that the data is in fact generated under a GLM and under certain monotonicity and Lipschitz (bounded slope) constraints. The Isotron algorithm interleaves steps of perceptron-like updates with isotonic regression (fitting a one-dimensional non-decreasing function). However, to obtain provable performance, the method requires a fresh sample every iteration. In this paper, we provide algorithms for learning GLMs and SIMs, which are both computationally and statistically efficient. We modify the isotonic regression step in Isotron to fit a Lipschitz monotonic function, and also provide an efficient O(n log(n)) algorithm for this step, improving upon the previous O(n2 ) algorithm. We provide a brief empirical study, demonstrating the feasibility of our algorithms in practice. 1
6 0.53018838 244 nips-2011-Selecting Receptive Fields in Deep Networks
7 0.52363783 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
8 0.5205344 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning
9 0.5187574 22 nips-2011-Active Ranking using Pairwise Comparisons
10 0.51829642 265 nips-2011-Sparse recovery by thresholded non-negative least squares
11 0.5178104 198 nips-2011-On U-processes and clustering performance
12 0.5173642 124 nips-2011-ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning
13 0.51714158 106 nips-2011-Generalizing from Several Related Classification Tasks to a New Unlabeled Sample
14 0.51697624 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
15 0.51599509 80 nips-2011-Efficient Online Learning via Randomized Rounding
16 0.51567721 231 nips-2011-Randomized Algorithms for Comparison-based Search
17 0.51513082 258 nips-2011-Sparse Bayesian Multi-Task Learning
18 0.51434124 199 nips-2011-On fast approximate submodular minimization
19 0.51339281 239 nips-2011-Robust Lasso with missing and grossly corrupted observations
20 0.51337802 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning