jmlr jmlr2012 jmlr2012-81 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Marius Kloft, Gilles Blanchard
Abstract: We derive an upper bound on the local Rademacher complexity of ℓp-norm multiple kernel learning, which yields a tighter excess risk bound than global approaches. Previous local approaches analyzed the case p = 1 only while our analysis covers all cases 1 ≤ p ≤ ∞, assuming the different feature mappings corresponding to the different kernels to be uncorrelated. We also show a lower bound demonstrating that our bound is tight, and derive consequences regarding the excess loss, namely fast convergence rates of the order O(n^{-α/(1+α)}), where α is the minimum eigenvalue decay rate of the individual kernels. Keywords: multiple kernel learning, learning kernels, generalization bounds, local Rademacher complexity
Reference: text
sentIndex sentText sentNum sentScore
1 We also show a lower bound demonstrating that our bound is tight, and derive consequences regarding the excess loss, namely fast convergence rates of the order O(n^{-α/(1+α)}), where α is the minimum eigenvalue decay rate of the individual kernels. [sent-8, score-0.377]
2 Keywords: multiple kernel learning, learning kernels, generalization bounds, local Rademacher complexity 1. [sent-9, score-0.188]
3 Given several base kernels to choose from, one might hope to achieve automatic kernel selection: clearly, cross-validation based model selection (Stone, 1974) can be applied if the number of base kernels is moderate. [sent-29, score-0.174]
4 In Lanckriet et al. (2004) it was shown that it is computationally feasible to simultaneously learn a support vector machine and a linear combination of kernels, if we require the so-formed kernel combinations to be positive definite and trace-norm normalized. [sent-32, score-0.174]
5 Empirical evidence has accumulated showing that sparse-MKL optimized kernel combinations rarely help in practice and are frequently outperformed by a regular SVM using an unweighted-sum kernel K = ∑_m K_m (Cortes et al.). [sent-37, score-0.184]
6 The ℓq-norm MKL is an empirical minimization algorithm that operates on the multi-kernel class consisting of functions f : x → ⟨w, φ_k(x)⟩ with ‖w‖_k ≤ D, where φ_k is the kernel mapping into the reproducing kernel Hilbert space (RKHS) H_k with kernel k and norm ‖·‖_k. [sent-45, score-0.306]
7 The kernel k itself ranges over the set of possible kernels { k = ∑_{m=1}^M θ_m k_m : ‖θ‖_q ≤ 1, θ ≥ 0 }. [sent-46, score-0.266]
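As a brief illustration of this kernel class (a sketch only; function and variable names are ours, not the paper's), the following Python snippet forms a combined kernel k = ∑_m θ_m k_m from precomputed kernel matrices, keeping the mixing weights nonnegative and rescaling them onto the ℓq unit ball:

```python
import numpy as np

def combine_kernels(K_list, theta, q=2.0):
    """Form K = sum_m theta_m * K_m with theta >= 0 and ||theta||_q <= 1."""
    theta = np.maximum(np.asarray(theta, dtype=float), 0.0)   # enforce theta >= 0
    norm_q = (theta ** q).sum() ** (1.0 / q)
    if norm_q > 1.0:                                          # rescale onto the l_q unit ball
        theta = theta / norm_q
    K = sum(t * K_m for t, K_m in zip(theta, K_list))
    return K, theta

# toy usage: three random positive semi-definite kernel matrices on n = 5 points
rng = np.random.default_rng(0)
K_list = []
for _ in range(3):
    A = rng.standard_normal((5, 5))
    K_list.append(A @ A.T)                                    # PSD by construction
K, theta = combine_kernels(K_list, theta=[0.5, 0.3, 0.9], q=1.5)
```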
8 A result going back to Bach et al. (2004) and Micchelli and Pontil (2005) is that the above multi-kernel class can equivalently be represented as a block-norm regularized linear class in the product Hilbert space H := H_1 × · · · × H_M, where H_m denotes the RKHS associated to kernel k_m, 1 ≤ m ≤ M. [sent-53, score-0.184]
9 More precisely, denoting by φ_m the kernel feature mapping associated to kernel k_m over input space X, and φ : x ∈ X → (φ_1(x), ..., φ_M(x)) ∈ H, the class of functions defined above coincides with [sent-54, score-0.298]
10 H_{p,D,M} = { f_w : x → ⟨w, φ(x)⟩ : w = (w^{(1)}, ..., w^{(M)}), ‖w‖_{2,p} ≤ D }, [sent-57, score-0.337]
11 where ‖w‖_{2,p} = ( ∑_{m=1}^M ‖w^{(m)}‖_{k_m}^p )^{1/p} denotes the ℓ_{2,p} block norm; for simplicity, we will frequently write ‖w^{(m)}‖_2 = ‖w^{(m)}‖_{k_m}. [sent-64, score-0.184]
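The ℓ_{2,p} block norm that defines H_{p,D,M} is easy to compute from the blocks; a minimal sketch (illustrative names, not the paper's code):

```python
import numpy as np

def block_norm(blocks, p):
    """l_{2,p} block norm: (sum_m ||w_m||_2^p)^(1/p); p = inf gives max_m ||w_m||_2."""
    norms = np.array([np.linalg.norm(w) for w in blocks])
    if np.isinf(p):
        return norms.max()
    return (norms ** p).sum() ** (1.0 / p)

w = [np.array([1.0, 2.0]), np.array([0.5]), np.array([3.0, 0.0, 1.0])]
print(block_norm(w, p=1), block_norm(w, p=2), block_norm(w, p=np.inf))
```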
12 • The generalization performance of ℓ p -norm MKL as guaranteed by the excess risk bound is studied for varying values of p, shedding light on the appropriateness of a small/large p in various learning scenarios. [sent-86, score-0.27]
13 (The assumption here is λ_j^{(m)} ≤ d j^{-α}, where λ_j^{(m)} is the jth eigenvalue of the mth kernel.) Furthermore, we also present a simple proof of a global Rademacher bound similar to the one shown in Cortes et al. (2010). [sent-92, score-0.155]
14 Correspondingly, the hypothesis class we are interested in reads H_{p,D,M} = { f_w : x → ⟨w, x⟩ : ‖w‖_{2,p} ≤ D }. [sent-101, score-0.408]
15 We denote the kernel matrices corresponding to k and km by K and Km , respectively. [sent-110, score-0.184]
16 The global Rademacher complexity is defined as R(H_p) = E sup_{f_w ∈ H_p} ⟨w, (1/n) ∑_{i=1}^n σ_i x_i⟩ (2), where (σ_i)_{1≤i≤n} is an i.i.d. family of Rademacher variables (independent random signs). [sent-129, score-0.505]
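For intuition, the empirical counterpart of (2) can be estimated by Monte Carlo over the signs σ_i: the supremum of a linear functional over the ‖·‖_{2,p} ball of radius D equals D times the ℓ_{2,p*} dual block norm, and ‖(1/n) ∑_i σ_i φ_m(x_i)‖² = σᵀK_mσ/n². The sketch below assumes precomputed kernel matrices and is an illustration, not the authors' code:

```python
import numpy as np

def empirical_global_rademacher(K_list, D=1.0, p=2.0, n_mc=2000, seed=0):
    """Monte Carlo estimate of E_sigma sup_{||w||_{2,p} <= D} <w, (1/n) sum_i sigma_i phi(x_i)>."""
    rng = np.random.default_rng(seed)
    n = K_list[0].shape[0]
    p_star = np.inf if p == 1.0 else p / (p - 1.0)
    vals = []
    for _ in range(n_mc):
        sigma = rng.choice([-1.0, 1.0], size=n)
        # squared block norms ||(1/n) sum_i sigma_i phi_m(x_i)||^2 = sigma^T K_m sigma / n^2
        sq = np.array([sigma @ K @ sigma for K in K_list]) / n ** 2
        block = np.sqrt(np.maximum(sq, 0.0))
        dual = block.max() if np.isinf(p_star) else (block ** p_star).sum() ** (1.0 / p_star)
        vals.append(D * dual)                      # sup over the l_{2,p} ball = D * dual block norm
    return float(np.mean(vals))
```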
17 The interest in the global Rademacher complexity stems from the fact that, if it is known, it can be used to bound the generalization error (Koltchinskii, 2001; Bartlett and Mendelson, 2002). [sent-137, score-0.184]
18 In Cortes et al. (2010) it was shown using a combinatorial argument that the empirical version of the global Rademacher complexity can be bounded as R(H_p) ≤ D √( (c p* / (2n)) ‖(tr(K_m))_{m=1}^M‖_{p*/2} ), where c = 23/22 and tr(K) denotes the trace of the kernel matrix K. [sent-139, score-0.182]
19 We instead use Lemma 1 to bound the empirical version of the global Rademacher complexity as follows. [sent-160, score-0.184]
20 R(H_p) = E_σ sup_{f_w ∈ H_p} ⟨w, (1/n) ∑_{i=1}^n σ_i x_i⟩ ≤ D E_σ ‖(1/n) ∑_{i=1}^n σ_i x_i‖_{2,p*} (Hölder) ≤ D ( E_σ ∑_{m=1}^M ‖(1/n) ∑_{i=1}^n σ_i x_i^{(m)}‖_2^{p*} )^{1/p*} (Jensen), which is then bounded in terms of p*, n, and the traces tr(K_m). [sent-161, score-0.468]
21 Note that there is a very good reason to state the above bound in terms of t ≥ p instead of solely in terms of p: the Rademacher complexity R(Hp ) is not monotonic in p and thus it is not always the best choice to take t := p in the above bound. [sent-164, score-0.145]
22 This can be readily seen, for example, for the easy case where all kernels have the same trace—in that case the bound translates into R(H_p) ≤ D √( t* M^{2/t*} tr(K_1) / (2n) ). [sent-165, score-0.176]
23 This has interesting consequences: for any p ≤ (2 log M)* we can take the bound R(H_p) ≤ D √( e log(M) tr(K_1) / n ), which has only a mild dependency on the number of kernels; note that in particular we can take this bound for the ℓ1-norm class R(H_1) for all M > 1. [sent-167, score-0.188]
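The effect of optimizing over t can be checked numerically. Using the equal-trace form reconstructed above, the t-dependent factor is t* M^{2/t*}; the sketch below (an illustration under that reconstruction) confirms that it is minimized at t* = 2 log M with value 2e log M, which is exactly where the e log(M) term comes from:

```python
import numpy as np

M = 64
s = np.linspace(1.01, 60.0, 20000)            # candidate values of the conjugate exponent t*
factor = s * M ** (2.0 / s)                   # the t-dependent factor t* M^(2/t*)
best = s[np.argmin(factor)]
print(best, 2 * np.log(M))                    # minimizer is t* = 2 log M (about 8.32 here)
print(factor.min(), 2 * np.e * np.log(M))     # minimal value 2 e log M
```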
24 For example, when the traces of the kernels are bounded, the above bound is essentially determined by O( √(p*) M^{1/p*} / √n ), respectively O( √(log M / n) ) for p ≤ (2 log M)*. [sent-189, score-0.176]
25 Cortes et al. (2010) observe that in the case p = 1 (traditional MKL), the bound of Proposition 2 grows logarithmically in the number of kernels M, and claim that the RCC approach would lead to a bound which is multiplicative in M. [sent-197, score-0.27]
26 This is because the RCC of a kernel class is the same as the RCC of its convex hull, and the RCC of the base class containing only the M individual kernels is logarithmic in M. [sent-199, score-0.174]
27 The Local Rademacher Complexity of Multiple Kernel Learning. We first give a gentle introduction to local Rademacher complexities in general and then present the main result of this paper: a lower and an upper bound on the local Rademacher complexity of ℓp-norm multiple kernel learning. [sent-202, score-0.36]
28 Then the local Rademacher complexity is defined as R_r(F) = E sup_{f ∈ F : P f² ≤ r} (1/n) ∑_{i=1}^n σ_i f(x_i)   (4), where P f² := E( f(x) )². [sent-211, score-0.174]
29 The latter is a byproduct of the application of McDiarmid’s inequality (McDiarmid, 1989)—a uniform version of Hoeffding’s inequality—which is used in Bartlett and Mendelson (2002) and Koltchinskii (2001) to relate the global Rademacher complexity with the excess risk. [sent-235, score-0.261]
30 Hereby σ²(F) := sup_{f ∈ F} E f² is a bound on the (uncentered) “variance” of the functions considered. [sent-238, score-0.172]
31 Now, denoting the right-hand side of (7) by δ, we obtain the following bound for the restricted class: there exists C > 0 such that, with probability larger than 1 − exp(−2t), it holds P f̂ − P f* ≤ C ( R(F_δ) + σ(F_δ) √(t/n) + t/n ). [sent-239, score-0.147]
32 For an i.i.d. sample x_1, ..., x_n drawn from P, the local Rademacher complexity is given by R_r(H_p) = E sup_{f_w ∈ H_p : P f_w² ≤ r} ⟨w, (1/n) ∑_{i=1}^n σ_i x_i⟩, where P f_w² := E( f_w(x) )². [sent-251, score-1.545]
33 Assumption (A): the feature components x^{(1)}, ..., x^{(M)} satisfy, for some constant c_δ > 0 and all w, c_δ ∑_{m=1}^M E⟨w_m, x^{(m)}⟩² ≤ E( ∑_{m=1}^M ⟨w_m, x^{(m)}⟩ )². [sent-256, score-0.22]
34 Since H_m, H_m′ are RKHSs with kernels k_m, k_m′, if we go back to the input random variable in the original space X ∈ 𝒳, the above property means that for any fixed t, t′ ∈ 𝒳, the variables k_m(X, t) and k_m′(X, t′) have a low correlation. [sent-257, score-0.266]
35 Proof of Theorem 5. The proof is based on first relating the complexity of the class H_p with its centered counterpart, that is, where all functions f_w ∈ H_p are centered around their expected value. [sent-286, score-0.482]
36 Then we compute the complexity of the centered class by decomposing the complexity into blocks, applying the no-correlation assumption, and using the inequalities of Hölder and Rosenthal. [sent-287, score-0.202]
37 We start the proof by introducing the centered class H̃_p = { f̃_w } and noting that f̃_w(x) = f_w(x) − ⟨w, Ex⟩ = f_w(x) − E⟨w, x⟩ = f_w(x) − E f_w(x), so that, by the bias-variance decomposition, it holds that P f_w² = E f_w(x)² = E( f_w(x) − E f_w(x) )² + (E f_w(x))² = P f̃_w² + (P f_w)². [sent-292, score-3.401]
38 Now, to bound the second term, we write E sup_{f_w ∈ H_p, P f_w² ≤ r} ⟨w, (1/n) ∑_{i=1}^n σ_i Ex⟩ = ( sup_{f_w ∈ H_p, P f_w² ≤ r} ⟨w, Ex⟩ ) E|(1/n) ∑_{i=1}^n σ_i| ≤ ( sup_{f_w ∈ H_p, P f_w² ≤ r} ⟨w, Ex⟩ ) ( E( (1/n) ∑_{i=1}^n σ_i )² )^{1/2} = (1/√n) sup_{f_w ∈ H_p, P f_w² ≤ r} ⟨w, Ex⟩. [sent-294, score-3.102]
39 Now observe finally that we have, by Hölder’s inequality (10), ⟨w, Ex⟩ ≤ ‖w‖_{2,p} ‖Ex‖_{2,p*} ≤ ‖w‖_{2,p} √( ‖(tr(J_m))_{m=1}^M‖_{p*/2} ), as well as ⟨w, Ex⟩ = E f_w(x) ≤ √(P f_w²). [sent-295, score-0.727]
40 As a consequence, P f̃_w² = E( f̃_w(x) )² = E( ∑_{m=1}^M ⟨w̃_m, x̃^{(m)}⟩ )² = ∑_{l,m=1}^M ⟨w̃_l, E x̃^{(l)} ⊗ x̃^{(m)} w̃_m⟩ ≥ (by (A)) c_δ ∑_{m=1}^M ⟨w̃_m, E x̃^{(m)} ⊗ x̃^{(m)} w̃_m⟩ = c_δ ∑_{m=1}^M ∑_{j=1}^∞ λ̃_j^{(m)} ⟨w̃_m, ũ_j^{(m)}⟩². [sent-300, score-0.887]
41 Thus the following preliminary bound follows from (11) by (15) and (16): R_r(H_p) ≤ √( 4 r c_δ^{-1} ∑_{m=1}^M h_m / n ) + √( (4 e p*² D² / n) ‖( ∑_{j=h_m+1}^∞ λ_j^{(m)} )_{m=1}^M‖_{p*/2} ) + √B e D M^{1/p*} p* / n   (17), for all nonnegative integers h_m ≥ 0. [sent-321, score-0.913]
42 We could stop here as the above bound is already the one that will be used in the subsequent section for the computation of the excess loss bounds. [sent-322, score-0.229]
43 The eigendecomposition Ex ⊗ x = ∑_{j=1}^∞ λ_j u_j ⊗ u_j yields P f_w² = E( f_w(x) )² = E⟨w, x⟩² = ⟨w, (Ex ⊗ x) w⟩ = ∑_{j=1}^∞ λ_j ⟨w, u_j⟩²   (21), and, for all j, E⟨(1/n) ∑_{i=1}^n σ_i x_i, u_j⟩² = (1/n²) E ∑_{i,l=1}^n σ_i σ_l ⟨x_i, u_j⟩⟨x_l, u_j⟩, which, since the σ_i are i.i.d. signs, equals (1/n²) ∑_{i=1}^n E⟨x_i, u_j⟩² = λ_j / n. [sent-331, score-0.698]
44 Since the above holds for all h, the result now follows from √A + √B ≤ √(2(A + B)) for all nonnegative real numbers A, B (which holds by the concavity of the square root function): R_r(H_p) ≤ √( (2/n) min_{0≤h≤n} ( r h + D² M^{2/p* − 1} ∑_{j=h+1}^∞ λ_j ) ) = √( (2/n) ∑_{j=1}^∞ min( r, D² M^{2/p* − 1} λ_j ) ). [sent-338, score-0.166]
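The final equality in this display is the elementary truncation identity min_{0≤h} ( r h + C ∑_{j>h} λ_j ) = ∑_j min(r, C λ_j) for a nonincreasing sequence (λ_j); a quick numerical sanity check (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
lam = np.sort(rng.exponential(size=200))[::-1]   # nonincreasing "eigenvalues"
r, C = 0.05, 2.0

# sum_{j>h} lam_j for h = 0, 1, ..., len(lam)
tails = np.concatenate(([lam.sum()], lam.sum() - np.cumsum(lam)))
lhs = min(r * h + C * tails[h] for h in range(len(lam) + 1))
rhs = np.minimum(r, C * lam).sum()
print(np.isclose(lhs, rhs))                      # the truncation identity used in the proof
```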
45 Lower Bound. In this subsection we investigate the tightness of our bound on the local Rademacher complexity of H_p. [sent-340, score-0.19]
46 Then, the following lower bound holds for the local Rademacher complexity of H_p for any p ≥ 1: R_r(H_{p,D,M}) ≥ R_{rM}(H_{1, D M^{1/p*}, 1}). [sent-359, score-0.221]
47 Proof. First note that, since the x^{(m)} are centered and uncorrelated, P f_w² = E( ∑_{m=1}^M ⟨w_m, x^{(m)}⟩ )² = ∑_{m=1}^M E⟨w_m, x^{(m)}⟩². [sent-360, score-0.604]
48 Then, the following lower bound holds for the local Rademacher complexity of Hp . [sent-370, score-0.221]
49 We would like to compare the above lower bound (23) with the upper bound of Theorem 5. [sent-372, score-0.188]
50 A similar comparison can be performed for the upper bound of Theorem 6: by Remark 8 the bound reads R_r(H_p) ≤ √( (2/n) ∑_{m=1}^M ∑_{j=1}^∞ min( r, D² M^{2/p* − 1} λ_j^{(m)} ) ), which for i.i.d. components can again be compared directly with the lower bound. [sent-374, score-0.223]
51 In order to evaluate the performance of this so-called empirical risk minimization (ERM) algorithm, we study the excess loss P(l_f̂ − l_f*) := E l( f̂(x), y ) − E l( f*(x), y ). [sent-394, score-0.176]
52 In Bartlett et al. (2005) and Koltchinskii (2006) it was shown that the rate of convergence of the excess risk is basically determined by the fixed point of the local Rademacher complexity. [sent-396, score-0.221]
53 Then, denoting by r* the fixed point of 2FL R_{r/(4L²)}(F), for all x > 0, with probability at least 1 − e^{−x} the excess loss can be bounded as P(l_f̂ − l_f*) ≤ 7 r*/F + ( 11 L (b − a) + 27 F ) x / n. [sent-403, score-0.157]
54 In general, the hinge loss does not satisfy (24) on an arbitrary convex class F ; for this reason, there is no direct, general “fast rate” excess loss analogue to the popular margin-radius bounds obtained through global Rademacher analysis. [sent-417, score-0.228]
55 In this sense, the local Rademacher complexity bound we have presented here can in principle be plugged into the SVM analysis of Blanchard et al. [sent-421, score-0.19]
56 Lemma 12 shows that in order to obtain an excess risk bound on the multi-kernel class Hp it suffices to compute the fixed point of our bound on the local Rademacher complexity presented in Section 3. [sent-424, score-0.46]
57 For the fixed point r* of the local Rademacher complexity 2FL R_{r/(4L²)}(H_p) it holds r* ≤ min_{0 ≤ h_m ≤ ∞} ( 4 c_δ^{-1} F² ∑_{m=1}^M h_m / n + 8FL √( (e p*² D² / n) ‖( ∑_{j=h_m+1}^∞ λ_j^{(m)} )_{m=1}^M‖_{p*/2} ) + 4 √B e D F L M^{1/p*} p* / n ). [sent-427, score-0.553]
58 Defining a = 4 c_δ^{-1} F² ∑_{m=1}^M h_m / n and b = 4FL √( (e p*² D² / n) ‖( ∑_{j=h_m+1}^∞ λ_j^{(m)} )_{m=1}^M‖_{p*/2} ) + 2 √B e D F L M^{1/p*} p* / n, in order to find a fixed point of (17) we need to solve r = √(a r) + b, which is equivalent to solving r² − (a + 2b) r + b² = 0 for a positive root. [sent-429, score-0.392]
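For completeness, the positive root of this quadratic, and the resulting bound r* ≤ a + 2b used in the sequel, can be checked directly (a small sketch with made-up values of a and b):

```python
import numpy as np

def fixed_point(a, b):
    """Largest solution of r = sqrt(a * r) + b, i.e. of r^2 - (a + 2b) r + b^2 = 0."""
    s = a + 2.0 * b
    return 0.5 * (s + np.sqrt(s * s - 4.0 * b * b))

a, b = 0.3, 0.05                                      # placeholder values for the two terms
r_star = fixed_point(a, b)
print(np.isclose(r_star, np.sqrt(a * r_star) + b))    # it is indeed a fixed point
print(r_star <= a + 2 * b)                            # and it never exceeds a + 2b
```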
59 We now address the issue of computing actual rates of convergence of the fixed point r* under the assumption of algebraically decreasing eigenvalues of the kernel matrices; that is, we assume that there exist d_m such that λ_j^{(m)} ≤ d_m j^{−α_m} for some α_m > 1. [sent-433, score-0.19]
60 This is a common assumption and is, for example, met for finite rank kernels and convolution kernels (Williamson et al.). [sent-434, score-0.184]
61 Notice that this implies ∑_{j=h_m+1}^∞ λ_j^{(m)} ≤ d_m ∑_{j=h_m+1}^∞ j^{−α_m} ≤ ( d_m / (α_m − 1) ) h_m^{1−α_m}. [sent-436, score-0.234]
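This tail estimate is the usual integral comparison; a quick numerical check (with arbitrary illustrative values of d, α and h_m):

```python
import numpy as np

d, alpha, h = 2.0, 1.7, 10
j = np.arange(h + 1, 2_000_000, dtype=float)   # truncate the infinite tail far out
tail = d * np.sum(j ** (-alpha))
bound = d * h ** (1.0 - alpha) / (alpha - 1.0)
print(tail <= bound, tail, bound)              # integral-comparison bound on the spectral tail
```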
62 Inserting the result of (25) into the above bound and setting the derivative with respect to h_m to zero, we find the optimal h_m as h_m = ( 4 c_δ d_m e p*² D² F^{−2} L² M^{2/p* − 2} n )^{1/(1+α_m)}. [sent-438, score-0.392]
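The scaling of the optimal truncation level can also be seen without the closed form: keeping only the h-dependent part of the bound, a grid search over h recovers h_opt ∝ n^{1/(1+α)} and a minimal value of order n^{−α/(1+α)}. The sketch below uses placeholder constants A and B in place of the actual factors, so it illustrates the scaling only:

```python
import numpy as np

def excess_terms(h, n, alpha, A=1.0, B=1.0):
    """h-dependent part of the bound (up to constants): A*h/n + B*sqrt(h^(1-alpha)/n)."""
    return A * h / n + B * np.sqrt(h ** (1.0 - alpha) / n)

alpha = 2.0
for n in [10**3, 10**4, 10**5, 10**6]:
    h = np.arange(1, 5000)
    vals = excess_terms(h, n, alpha)
    h_opt = h[np.argmin(vals)]
    # h_opt scales like n^(1/(1+alpha)) and the minimum like n^(-alpha/(1+alpha))
    print(n, h_opt, n ** (1 / (1 + alpha)), vals.min(), n ** (-alpha / (1 + alpha)))
```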
63 For example, the latter is the case if all kernels have finite rank; the convolution kernel is another example of this type. [sent-460, score-0.174]
64 Notice that we of course could repeat the above discussion to obtain excess risk bounds for the case p ≥ 2 as well, but since it is very questionable that this will lead to new insights, it is omitted for simplicity. [sent-461, score-0.211]
65 Discussion. In this section we compare the obtained local Rademacher bound with the global one, discuss related work as well as the assumption (A), and give a practical application of the bounds by studying the appropriateness of small/large p in various learning scenarios. [sent-463, score-0.233]
66 Local Rademacher Bounds. In this section, we discuss the rates obtained from the bound in Theorem 14 for the excess risk and compare them to the rates obtained using the global Rademacher complexity bound of Corollary 4. [sent-466, score-0.454]
67 On the other hand, the global Rademacher complexity directly leads to a bound on the supremum of the centered empirical process indexed by F and thus also provides a bound on the excess risk (see, e.g., Koltchinskii, 2001; Bartlett and Mendelson, 2002). [sent-469, score-0.501]
68 Therefore, using Corollary 4, wherein we upper bound the trace of each J_m by the constant B (and subsume it under the O-notation), we have a second bound on the excess risk of the form: for all t ∈ [p, 2], P(l_f̂ − l_f*) = O( √(t*) D M^{1/t*} n^{−1/2} ). [sent-473, score-0.385]
69 In general, we have a bound on the excess risk given by the minimum of (28) and (29); a straightforward calculation shows that the local Rademacher analysis improves over the global one whenever M^{1/p} = O(√n). [sent-477, score-0.354]
70 In this case taking the minimum of the two bounds reads: for all p ≤ (log M)*, P(l_f̂ − l_f*) ≤ O( min( D √(log M) n^{−1/2}, (D log M)^{2/(1+α)} M^{(α−1)/(1+α)} n^{−α/(1+α)} ) )   (30), and the phase transition when the local Rademacher bound improves over the global one occurs for M = O(√n). [sent-482, score-0.248]
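Treating D and the logarithmic factors as constants and using the exponents reconstructed above (so this is a scaling illustration only, not the paper's exact bound), one can locate the phase transition numerically:

```python
import numpy as np

# global rate ~ n^(-1/2); local rate ~ M^((alpha-1)/(1+alpha)) * n^(-alpha/(1+alpha))
alpha, n = 3.0, 10**6
M = np.arange(2, 20001)
global_rate = np.full_like(M, n ** -0.5, dtype=float)
local_rate = M ** ((alpha - 1) / (1 + alpha)) * n ** (-alpha / (1 + alpha))
crossover = M[np.argmax(local_rate > global_rate)]
print(crossover, np.sqrt(n))   # the local bound wins up to roughly M = sqrt(n)
```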
71 The global Rademacher bound, on the other hand, still depends crucially on the ℓp norm constraint. [sent-487, score-0.184]
72 This corresponds to the setting studied here with D = 1, p = 1 and α → ∞, and we see that in this case the bound (30) recovers this sharp bound (up to log factors) together with the related phase transition phenomenon. [sent-489, score-0.188]
73 However, a similarity to our setup is that an algebraic decay of the eigenvalues of the kernel matrices is assumed for the computation of the excess risk bounds and that a so-called incoherence assumption is imposed on the kernels, which is similar to our Assumption (A). [sent-495, score-0.375]
74 We now compare the excess risk bounds of Suzuki (2011) for the case of homogeneous eigenvalue decays, that is, P(l_f̂ − l_f*) = O( D^{2/(1+α)} M^{1 + (2/(1+α))(1/p* − 1)} n^{−α/(1+α)} ), to the ones shown in this paper, that is, (28); we thereby disregard constants and the O(n^{−1}) terms. [sent-497, score-0.233]
75 Regarding the dependency on p, we observe that our bound involves a factor of (t*)^{2/(1+α)} (for some t ∈ [p, 2]) that is not present in the bound of Suzuki (2011). [sent-501, score-0.188]
76 If the kernel spaces are one-dimensional, in which case ℓ1-penalized MKL reduces qualitatively to standard lasso-type methods, this assumption is known to be necessary to grant the validity of bounds taking into account the sparsity pattern of the Bayes function. [sent-512, score-0.208]
77 In other words, with one-dimensional kernel spaces, the two constraints (on the L2 (P)-norm of the function and on the ℓ p block-norm of the coefficients) appearing in the definition of local Rademacher complexity are essentially not active simultaneously. [sent-519, score-0.188]
78 We believe that this phenomenon can be (at least partly) explained on the basis of our excess risk bound obtained in the last section. [sent-534, score-0.27]
79 To this end we will analyze the dependency of the excess risk bounds on the chosen norm parameter p. [sent-535, score-0.241]
80 Since our excess risk bound is only formulated for p ≤ 2, we will limit the analysis to the range p ∈ [1, 2]. [sent-537, score-0.27]
81 To start with, note that the choice of p only affects the excess risk bound through a single factor (cf. (28)). [sent-538, score-0.27]
82 Each scenario has a Bayes hypothesis w∗ with a different soft sparsity (parametrized by β). [sent-542, score-0.165]
83 It might surprise the reader that we consider the term in D in the bound although it seems from the bound that it does not depend on p. [sent-546, score-0.188]
84 To illustrate this, let us assume that the Bayes hypothesis belongs to the space H and can be represented by w*; assume further that the block components satisfy ‖w*_m‖_2 = m^{−β}, m = 1, ..., M. [sent-549, score-0.146]
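The p-dependence entering through D can be made concrete: the smallest admissible radius for w* in H_{p,D,M} is ‖w*‖_{2,p} = ( ∑_m m^{−βp} )^{1/p}, which grows as p decreases and does so faster for small β. A small illustration (values of M and β are arbitrary):

```python
import numpy as np

# Block norms of a Bayes hypothesis with soft sparsity ||w*_m|| = m^(-beta).
M = 1000
p_grid = np.linspace(1.0, 2.0, 11)
for beta in [0.5, 1.0, 2.0]:
    block = np.arange(1, M + 1) ** (-beta)
    D_needed = [(block ** p).sum() ** (1.0 / p) for p in p_grid]
    # the smallest admissible D is ||w*||_{2,p}, which grows as p decreases
    print(beta, [round(float(d), 2) for d in D_needed])
```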
85 [Figure residue: three panels whose y-axis is labeled "bound"; the plots show the value of the bound across the considered scenarios.] [sent-571, score-0.40]
86 This means that if the true Bayes hypothesis has an intermediately dense representation, our bound gives the strongest generalization guarantees to ℓ p -norm MKL using an intermediate choice of p. [sent-587, score-0.154]
87 This is also intuitive: if the truth exhibits some soft sparsity but is not strongly sparse, we expect non-sparse MKL to perform better than strongly sparse MKL or the unweighted-sum kernel SVM. [sent-588, score-0.197]
88 Then, each feature is input into a linear kernel and the resulting kernel matrices are multiplicatively normalized as described in Kloft et al. [sent-620, score-0.184]
89 In contrast, the vanilla SVM using a uniform kernel combination performs best when all kernels are equally informative. [sent-633, score-0.174]
90 To further analyze whether the values of p that minimize the bound are reflected in the empirical results, we compute the test errors of the various MKL variants again, using the setup above except that we employ a local search for finding the optimal p. [sent-640, score-0.159]
91 We observe a striking coincidence of the optimal p as predicted by the bound and the p that worked best empirically: In the sparsest scenario (shown on the lower right-hand side), the bound predicts p ∈ [1, 1. [sent-642, score-0.237]
92 In the extreme case, that is, the uniform scenario, the bound indicates a p that lies well beyond the valid interval of the bound (i.e., p > 2). [sent-658, score-0.188]
93 For the setup of homogeneous eigenvalue decay of the kernels as considered in the toy experiment here, their bound is optimal for p = 1, regardless of the sparsity of the Bayes hypothesis. [sent-670, score-0.331]
94 Conclusion. We derived a sharp upper bound on the local Rademacher complexity of ℓp-norm multiple kernel learning under the assumption of uncorrelated kernels. [sent-675, score-0.338]
95 Using the local Rademacher complexity bound, we derived an excess risk bound that attains the fast rate of O(n^{−α/(1+α)}), where α is the minimum eigenvalue decay rate of the individual kernels. [sent-677, score-0.42]
96 In a practical case study, we found that the optimal value of that bound depends on the true Bayes-optimal kernel weights. [sent-678, score-0.186]
97 If the true weights exhibit soft sparsity but are not strongly sparse, then the generalization bound is minimized for an intermediate p. [sent-679, score-0.223]
98 For w = (w^{(1)}, ..., w^{(M)}) and any q ∈ [1, ∞], the hypothesis class { f : x → ∑_{m=1}^M ⟨w_m, √(θ_m) φ_m(x)⟩ : ‖w‖_2 ≤ D, ‖θ‖_q ≤ 1 }   (33) is identical to the block norm class H_{p,D,M} = { f : x → ⟨w, φ(x)⟩ : ‖w‖_{2,p} ≤ D }, where p := 2q/(q+1). [sent-709, score-0.176]
99 As in Kloft et al. (2011), by the definition of the dual norm ‖·‖* of a norm ‖·‖, that is, ‖x‖ = sup_y { ⟨x, y⟩ : ‖y‖* ≤ 1 }, it holds ‖w‖²_{2,p} = ‖( ‖w_m‖²_2 )_{m=1}^M‖_{p/2} = sup_{θ : ‖θ‖_{(p/2)*} ≤ 1} ∑_{m=1}^M θ_m ‖w_m‖²_2. [sent-721, score-0.389]
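Both representations hinge on extremal choices of θ. The equivalence in sentence 98 (under the √θ_m coupling assumed in our reconstruction) rests on the identity min_{θ ≥ 0, ‖θ‖_q ≤ 1} ∑_m a_m/θ_m = ‖a‖_{q/(q+1)}, which is what produces the block norm with p = 2q/(q+1); a numerical check (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
a = rng.uniform(0.1, 1.0, size=6)        # a_m plays the role of ||v_m||^2
q = 3.0
theta_opt = a ** (1.0 / (q + 1.0))
theta_opt /= (theta_opt ** q).sum() ** (1.0 / q)          # optimal weights on the l_q sphere
closed_form = (a ** (q / (q + 1.0))).sum() ** ((q + 1.0) / q)
print(np.isclose((a / theta_opt).sum(), closed_form))     # the minimizer attains the closed form
for _ in range(5):                                        # random feasible theta never do better
    t = rng.uniform(0.01, 1.0, size=6)
    t /= (t ** q).sum() ** (1.0 / q)
    print((a / t).sum() >= closed_form - 1e-12)
```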
100 Let the kernel figure it out: Principled learning of pre-processing for kernel classifiers. [sent-919, score-0.184]
wordName wordTfidf (topN-words)
[('mkl', 0.395), ('hm', 0.392), ('fw', 0.337), ('hp', 0.317), ('rademacher', 0.264), ('lanchard', 0.177), ('loft', 0.151), ('rr', 0.149), ('excess', 0.135), ('orm', 0.129), ('bedm', 0.121), ('ex', 0.113), ('wm', 0.11), ('onvergence', 0.105), ('ate', 0.105), ('ep', 0.101), ('bound', 0.094), ('jm', 0.092), ('kernel', 0.092), ('km', 0.092), ('kloft', 0.088), ('kernels', 0.082), ('rc', 0.079), ('sup', 0.078), ('dm', 0.078), ('rcc', 0.074), ('rosenthal', 0.074), ('suzuki', 0.062), ('sparsity', 0.061), ('bayes', 0.06), ('bedflm', 0.056), ('kq', 0.056), ('cq', 0.056), ('lder', 0.053), ('koltchinskii', 0.051), ('complexity', 0.051), ('tr', 0.048), ('uncentered', 0.048), ('exi', 0.048), ('cortes', 0.047), ('centered', 0.047), ('eq', 0.047), ('local', 0.045), ('soft', 0.044), ('bartlett', 0.043), ('dmax', 0.043), ('risk', 0.041), ('bousquet', 0.04), ('spectra', 0.04), ('tep', 0.04), ('young', 0.04), ('jensen', 0.039), ('global', 0.039), ('mw', 0.037), ('inequality', 0.036), ('uncorrelated', 0.036), ('hypothesis', 0.036), ('bounds', 0.035), ('rh', 0.035), ('talagrand', 0.035), ('reads', 0.035), ('scenarios', 0.035), ('nonnegative', 0.035), ('min', 0.034), ('complexities', 0.033), ('bm', 0.033), ('ez', 0.032), ('decay', 0.032), ('holds', 0.031), ('norm', 0.03), ('blanchard', 0.029), ('steele', 0.028), ('subadditivity', 0.028), ('raskutti', 0.026), ('rm', 0.026), ('sparsest', 0.025), ('scenario', 0.024), ('eigendecomposition', 0.024), ('intermediate', 0.024), ('mcdiarmid', 0.023), ('uj', 0.023), ('ying', 0.023), ('xi', 0.023), ('lemma', 0.023), ('eigenvalue', 0.022), ('denoting', 0.022), ('aq', 0.021), ('wherein', 0.021), ('gilles', 0.021), ('potsdam', 0.021), ('crucially', 0.021), ('mendelson', 0.021), ('remark', 0.02), ('rip', 0.02), ('setup', 0.02), ('assumption', 0.02), ('operators', 0.02), ('erm', 0.019), ('hinge', 0.019), ('alo', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999946 81 jmlr-2012-On the Convergence Rate of lp-Norm Multiple Kernel Learning
Author: Marius Kloft, Gilles Blanchard
Abstract: We derive an upper bound on the local Rademacher complexity of ℓp-norm multiple kernel learning, which yields a tighter excess risk bound than global approaches. Previous local approaches analyzed the case p = 1 only while our analysis covers all cases 1 ≤ p ≤ ∞, assuming the different feature mappings corresponding to the different kernels to be uncorrelated. We also show a lower bound demonstrating that our bound is tight, and derive consequences regarding the excess loss, namely fast convergence rates of the order O(n^{-α/(1+α)}), where α is the minimum eigenvalue decay rate of the individual kernels. Keywords: multiple kernel learning, learning kernels, generalization bounds, local Rademacher complexity
2 0.24925324 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization
Author: Francesco Orabona, Luo Jie, Barbara Caputo
Abstract: In recent years there has been a lot of interest in designing principled classification algorithms over multiple cues, based on the intuitive notion that using more features should lead to better performance. In the domain of kernel methods, a principled way to use multiple features is the Multi Kernel Learning (MKL) approach. Here we present a MKL optimization algorithm based on stochastic gradient descent that has a guaranteed convergence rate. We directly solve the MKL problem in the primal formulation. By having a p-norm formulation of MKL, we introduce a parameter that controls the level of sparsity of the solution, while leading to an easier optimization problem. We prove theoretically and experimentally that 1) our algorithm has a faster convergence rate as the number of kernels grows; 2) the training complexity is linear in the number of training examples; 3) very few iterations are sufficient to reach good solutions. Experiments on standard benchmark databases support our claims. Keywords: multiple kernel learning, learning kernels, online optimization, stochastic subgradient descent, convergence bounds, large scale
3 0.19405076 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
Author: Fei Yan, Josef Kittler, Krystian Mikolajczyk, Atif Tahir
Abstract: Sparsity-inducing multiple kernel Fisher discriminant analysis (MK-FDA) has been studied in the literature. Building on recent advances in non-sparse multiple kernel learning (MKL), we propose a non-sparse version of MK-FDA, which imposes a general ℓ p norm regularisation on the kernel weights. We formulate the associated optimisation problem as a semi-infinite program (SIP), and adapt an iterative wrapper algorithm to solve it. We then discuss, in light of latest advances in MKL optimisation techniques, several reformulations and optimisation strategies that can potentially lead to significant improvements in the efficiency and scalability of MK-FDA. We carry out extensive experiments on six datasets from various application areas, and compare closely the performance of ℓ p MK-FDA, fixed norm MK-FDA, and several variants of SVM-based MKL (MK-SVM). Our results demonstrate that ℓ p MK-FDA improves upon sparse MK-FDA in many practical situations. The results also show that on image categorisation problems, ℓ p MK-FDA tends to outperform its SVM counterpart. Finally, we also discuss the connection between (MK-)FDA and (MK-)SVM, under the unified framework of regularised kernel machines. Keywords: multiple kernel learning, kernel fisher discriminant analysis, regularised least squares, support vector machines
4 0.1045379 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
Author: Garvesh Raskutti, Martin J. Wainwright, Bin Yu
Abstract: Sparse additive models are families of d-variate functions with the additive decomposition f ∗ = ∑ j∈S f j∗ , where S is an unknown subset of cardinality s ≪ d. In this paper, we consider the case where each univariate component function f j∗ lies in a reproducing kernel Hilbert space (RKHS), and analyze a method for estimating the unknown function f ∗ based on kernels combined with ℓ1 -type convex regularization. Working within a high-dimensional framework that allows both the dimension d and sparsity s to increase with n, we derive convergence rates in the L2 (P) and L2 (Pn ) norms over the class Fd,s,H of sparse additive models with each univariate function f j∗ in the unit ball of a univariate RKHS with bounded kernel function. We complement our upper bounds by deriving minimax lower bounds on the L2 (P) error, thereby showing the optimality of our method. Thus, we obtain optimal minimax rates for many interesting classes of sparse additive models, including polynomials, splines, and Sobolev classes. We also show that if, in contrast to our univariate conditions, the d-variate function class is assumed to be globally bounded, then much √ faster estimation rates are possible for any sparsity s = Ω( n), showing that global boundedness is a significant restriction in the high-dimensional setting. Keywords: sparsity, kernel, non-parametric, convex, minimax
5 0.10405479 111 jmlr-2012-Structured Sparsity and Generalization
Author: Andreas Maurer, Massimiliano Pontil
Abstract: We present a data dependent generalization bound for a large class of regularized algorithms which implement structured sparsity constraints. The bound can be applied to standard squared-norm regularization, the Lasso, the group Lasso, some versions of the group Lasso with overlapping groups, multiple kernel learning and other regularization schemes. In all these cases competitive results are obtained. A novel feature of our bound is that it can be applied in an infinite dimensional setting such as the Lasso in a separable Hilbert space or multiple kernel learning with a countable number of kernels. Keywords: empirical processes, Rademacher average, sparse estimation.
6 0.099756442 80 jmlr-2012-On Ranking and Generalization Bounds
7 0.092473947 73 jmlr-2012-Multi-task Regression using Minimal Penalties
8 0.081841998 91 jmlr-2012-Plug-in Approach to Active Learning
9 0.081002191 97 jmlr-2012-Regularization Techniques for Learning with Matrices
10 0.07735756 4 jmlr-2012-A Kernel Two-Sample Test
11 0.074157171 16 jmlr-2012-Algorithms for Learning Kernels Based on Centered Alignment
12 0.069083899 110 jmlr-2012-Static Prediction Games for Adversarial Learning Problems
13 0.057568669 59 jmlr-2012-Linear Regression With Random Projections
14 0.054389946 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
15 0.050102394 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise
16 0.047989506 44 jmlr-2012-Feature Selection via Dependence Maximization
17 0.044638887 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation
18 0.044202834 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors
19 0.043760687 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
20 0.043452036 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
topicId topicWeight
[(0, -0.254), (1, -0.018), (2, -0.052), (3, 0.376), (4, 0.208), (5, -0.023), (6, -0.124), (7, -0.045), (8, 0.04), (9, 0.075), (10, -0.066), (11, -0.109), (12, 0.081), (13, 0.042), (14, -0.182), (15, -0.224), (16, -0.107), (17, -0.031), (18, -0.009), (19, -0.044), (20, -0.063), (21, -0.022), (22, 0.006), (23, 0.087), (24, 0.001), (25, 0.056), (26, -0.102), (27, 0.023), (28, -0.08), (29, -0.145), (30, 0.033), (31, 0.023), (32, 0.007), (33, 0.022), (34, 0.083), (35, -0.053), (36, 0.074), (37, 0.031), (38, 0.032), (39, 0.088), (40, 0.018), (41, 0.012), (42, 0.101), (43, -0.053), (44, 0.05), (45, 0.004), (46, 0.05), (47, 0.005), (48, 0.077), (49, -0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.918643 81 jmlr-2012-On the Convergence Rate oflp-Norm Multiple Kernel Learning
Author: Marius Kloft, Gilles Blanchard
Abstract: We derive an upper bound on the local Rademacher complexity of ℓ p -norm multiple kernel learning, which yields a tighter excess risk bound than global approaches. Previous local approaches analyzed the case p = 1 only while our analysis covers all cases 1 ≤ p ≤ ∞, assuming the different feature mappings corresponding to the different kernels to be uncorrelated. We also show a lower bound that shows that the bound is tight, and derive consequences regarding excess loss, namely α fast convergence rates of the order O(n− 1+α ), where α is the minimum eigenvalue decay rate of the individual kernels. Keywords: multiple kernel learning, learning kernels, generalization bounds, local Rademacher complexity
2 0.69886613 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
Author: Fei Yan, Josef Kittler, Krystian Mikolajczyk, Atif Tahir
Abstract: Sparsity-inducing multiple kernel Fisher discriminant analysis (MK-FDA) has been studied in the literature. Building on recent advances in non-sparse multiple kernel learning (MKL), we propose a non-sparse version of MK-FDA, which imposes a general ℓ p norm regularisation on the kernel weights. We formulate the associated optimisation problem as a semi-infinite program (SIP), and adapt an iterative wrapper algorithm to solve it. We then discuss, in light of latest advances in MKL optimisation techniques, several reformulations and optimisation strategies that can potentially lead to significant improvements in the efficiency and scalability of MK-FDA. We carry out extensive experiments on six datasets from various application areas, and compare closely the performance of ℓ p MK-FDA, fixed norm MK-FDA, and several variants of SVM-based MKL (MK-SVM). Our results demonstrate that ℓ p MK-FDA improves upon sparse MK-FDA in many practical situations. The results also show that on image categorisation problems, ℓ p MK-FDA tends to outperform its SVM counterpart. Finally, we also discuss the connection between (MK-)FDA and (MK-)SVM, under the unified framework of regularised kernel machines. Keywords: multiple kernel learning, kernel fisher discriminant analysis, regularised least squares, support vector machines
3 0.67621547 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization
Author: Francesco Orabona, Luo Jie, Barbara Caputo
Abstract: In recent years there has been a lot of interest in designing principled classification algorithms over multiple cues, based on the intuitive notion that using more features should lead to better performance. In the domain of kernel methods, a principled way to use multiple features is the Multi Kernel Learning (MKL) approach. Here we present a MKL optimization algorithm based on stochastic gradient descent that has a guaranteed convergence rate. We directly solve the MKL problem in the primal formulation. By having a p-norm formulation of MKL, we introduce a parameter that controls the level of sparsity of the solution, while leading to an easier optimization problem. We prove theoretically and experimentally that 1) our algorithm has a faster convergence rate as the number of kernels grows; 2) the training complexity is linear in the number of training examples; 3) very few iterations are sufficient to reach good solutions. Experiments on standard benchmark databases support our claims. Keywords: multiple kernel learning, learning kernels, online optimization, stochastic subgradient descent, convergence bounds, large scale
4 0.43953207 111 jmlr-2012-Structured Sparsity and Generalization
Author: Andreas Maurer, Massimiliano Pontil
Abstract: We present a data dependent generalization bound for a large class of regularized algorithms which implement structured sparsity constraints. The bound can be applied to standard squared-norm regularization, the Lasso, the group Lasso, some versions of the group Lasso with overlapping groups, multiple kernel learning and other regularization schemes. In all these cases competitive results are obtained. A novel feature of our bound is that it can be applied in an infinite dimensional setting such as the Lasso in a separable Hilbert space or multiple kernel learning with a countable number of kernels. Keywords: empirical processes, Rademacher average, sparse estimation.
5 0.43132076 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
Author: Garvesh Raskutti, Martin J. Wainwright, Bin Yu
Abstract: Sparse additive models are families of d-variate functions with the additive decomposition f ∗ = ∑ j∈S f j∗ , where S is an unknown subset of cardinality s ≪ d. In this paper, we consider the case where each univariate component function f j∗ lies in a reproducing kernel Hilbert space (RKHS), and analyze a method for estimating the unknown function f ∗ based on kernels combined with ℓ1 -type convex regularization. Working within a high-dimensional framework that allows both the dimension d and sparsity s to increase with n, we derive convergence rates in the L2 (P) and L2 (Pn ) norms over the class Fd,s,H of sparse additive models with each univariate function f j∗ in the unit ball of a univariate RKHS with bounded kernel function. We complement our upper bounds by deriving minimax lower bounds on the L2 (P) error, thereby showing the optimality of our method. Thus, we obtain optimal minimax rates for many interesting classes of sparse additive models, including polynomials, splines, and Sobolev classes. We also show that if, in contrast to our univariate conditions, the d-variate function class is assumed to be globally bounded, then much √ faster estimation rates are possible for any sparsity s = Ω( n), showing that global boundedness is a significant restriction in the high-dimensional setting. Keywords: sparsity, kernel, non-parametric, convex, minimax
6 0.3618674 80 jmlr-2012-On Ranking and Generalization Bounds
7 0.35752532 16 jmlr-2012-Algorithms for Learning Kernels Based on Centered Alignment
8 0.35473609 91 jmlr-2012-Plug-in Approach to Active Learning
9 0.35121655 73 jmlr-2012-Multi-task Regression using Minimal Penalties
10 0.33295792 110 jmlr-2012-Static Prediction Games for Adversarial Learning Problems
11 0.32466993 4 jmlr-2012-A Kernel Two-Sample Test
12 0.31508672 97 jmlr-2012-Regularization Techniques for Learning with Matrices
13 0.29591751 59 jmlr-2012-Linear Regression With Random Projections
14 0.28966522 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation
15 0.28378925 44 jmlr-2012-Feature Selection via Dependence Maximization
16 0.28317231 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
17 0.26976642 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
18 0.2553376 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization
19 0.25445935 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
20 0.23786113 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise
topicId topicWeight
[(7, 0.014), (21, 0.042), (26, 0.033), (29, 0.046), (35, 0.012), (49, 0.402), (56, 0.012), (57, 0.011), (64, 0.013), (69, 0.011), (75, 0.097), (77, 0.012), (79, 0.018), (92, 0.096), (96, 0.098)]
simIndex simValue paperId paperTitle
1 0.96792281 47 jmlr-2012-GPLP: A Local and Parallel Computation Toolbox for Gaussian Process Regression
Author: Chiwoo Park, Jianhua Z. Huang, Yu Ding
Abstract: This paper presents the Getting-started style documentation for the local and parallel computation toolbox for Gaussian process regression (GPLP), an open source software package written in Matlab (but also compatible with Octave). The working environment and the usage of the software package will be presented in this paper. Keywords: Gaussian process regression, domain decomposition method, partial independent conditional, bagging for Gaussian process, local probabilistic regression
2 0.87980151 35 jmlr-2012-EP-GIG Priors and Applications in Bayesian Sparse Learning
Author: Zhihua Zhang, Shusen Wang, Dehua Liu, Michael I. Jordan
Abstract: In this paper we propose a novel framework for the construction of sparsity-inducing priors. In particular, we define such priors as a mixture of exponential power distributions with a generalized inverse Gaussian density (EP-GIG). EP-GIG is a variant of generalized hyperbolic distributions, and the special cases include Gaussian scale mixtures and Laplace scale mixtures. Furthermore, Laplace scale mixtures can subserve a Bayesian framework for sparse learning with nonconvex penalization. The densities of EP-GIG can be explicitly expressed. Moreover, the corresponding posterior distribution also follows a generalized inverse Gaussian distribution. We exploit these properties to develop EM algorithms for sparse empirical Bayesian learning. We also show that these algorithms bear an interesting resemblance to iteratively reweighted ℓ2 or ℓ1 methods. Finally, we present two extensions for grouped variable selection and logistic regression. Keywords: sparsity priors, scale mixtures of exponential power distributions, generalized inverse Gaussian distributions, expectation-maximization algorithms, iteratively reweighted minimization methods
same-paper 3 0.81024951 81 jmlr-2012-On the Convergence Rate oflp-Norm Multiple Kernel Learning
Author: Marius Kloft, Gilles Blanchard
Abstract: We derive an upper bound on the local Rademacher complexity of ℓ p -norm multiple kernel learning, which yields a tighter excess risk bound than global approaches. Previous local approaches analyzed the case p = 1 only while our analysis covers all cases 1 ≤ p ≤ ∞, assuming the different feature mappings corresponding to the different kernels to be uncorrelated. We also show a lower bound that shows that the bound is tight, and derive consequences regarding excess loss, namely α fast convergence rates of the order O(n− 1+α ), where α is the minimum eigenvalue decay rate of the individual kernels. Keywords: multiple kernel learning, learning kernels, generalization bounds, local Rademacher complexity
4 0.50544447 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors
Author: Emilio Parrado-Hernández, Amiran Ambroladze, John Shawe-Taylor, Shiliang Sun
Abstract: This paper presents the prior PAC-Bayes bound and explores its capabilities as a tool to provide tight predictions of SVMs’ generalization. The computation of the bound involves estimating a prior of the distribution of classifiers from the available data, and then manipulating this prior in the usual PAC-Bayes generalization bound. We explore two alternatives: to learn the prior from a separate data set, or to consider an expectation prior that does not need this separate data set. The prior PAC-Bayes bound motivates two SVM-like classification algorithms, prior SVM and ηprior SVM, whose regularization term pushes towards the minimization of the prior PAC-Bayes bound. The experimental work illustrates that the new bounds can be significantly tighter than the original PAC-Bayes bound when applied to SVMs, and among them the combination of the prior PAC-Bayes bound and the prior SVM algorithm gives the tightest bound. Keywords: PAC-Bayes bound, support vector machine, generalization capability prediction, classification
5 0.47204024 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets
Author: Kay H. Brodersen, Christoph Mathys, Justin R. Chumbley, Jean Daunizeau, Cheng Soon Ong, Joachim M. Buhmann, Klaas E. Stephan
Abstract: Classification algorithms are frequently used on data with a natural hierarchical structure. For instance, classifiers are often trained and tested on trial-wise measurements, separately for each subject within a group. One important question is how classification outcomes observed in individual subjects can be generalized to the population from which the group was sampled. To address this question, this paper introduces novel statistical models that are guided by three desiderata. First, all models explicitly respect the hierarchical nature of the data, that is, they are mixed-effects models that simultaneously account for within-subjects (fixed-effects) and across-subjects (random-effects) variance components. Second, maximum-likelihood estimation is replaced by full Bayesian inference in order to enable natural regularization of the estimation problem and to afford conclusions in terms of posterior probability statements. Third, inference on classification accuracy is complemented by inference on the balanced accuracy, which avoids inflated accuracy estimates for imbalanced data sets. We introduce hierarchical models that satisfy these criteria and demonstrate their advantages over conventional methods using MCMC implementations for model inversion and model selection on both synthetic and empirical data. We envisage that our approach will improve the sensitivity and validity of statistical inference in future hierarchical classification studies. Keywords: beta-binomial, normal-binomial, balanced accuracy, Bayesian inference, group studies
6 0.46962315 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
7 0.4617137 118 jmlr-2012-Variational Multinomial Logit Gaussian Process
8 0.45835161 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
9 0.45647198 111 jmlr-2012-Structured Sparsity and Generalization
10 0.45285934 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
11 0.45163208 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
12 0.45139033 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
13 0.45122793 16 jmlr-2012-Algorithms for Learning Kernels Based on Centered Alignment
14 0.45103231 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
15 0.45091838 13 jmlr-2012-Active Learning via Perfect Selective Classification
16 0.449527 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
17 0.4485561 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality
18 0.44835263 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
19 0.44652116 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
20 0.44503915 69 jmlr-2012-Mixability is Bayes Risk Curvature Relative to Log Loss