jmlr jmlr2012 jmlr2012-77 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Fei Yan, Josef Kittler, Krystian Mikolajczyk, Atif Tahir
Abstract: Sparsity-inducing multiple kernel Fisher discriminant analysis (MK-FDA) has been studied in the literature. Building on recent advances in non-sparse multiple kernel learning (MKL), we propose a non-sparse version of MK-FDA, which imposes a general ℓ p norm regularisation on the kernel weights. We formulate the associated optimisation problem as a semi-infinite program (SIP), and adapt an iterative wrapper algorithm to solve it. We then discuss, in light of latest advances in MKL optimisation techniques, several reformulations and optimisation strategies that can potentially lead to significant improvements in the efficiency and scalability of MK-FDA. We carry out extensive experiments on six datasets from various application areas, and compare closely the performance of ℓ p MK-FDA, fixed norm MK-FDA, and several variants of SVM-based MKL (MK-SVM). Our results demonstrate that ℓ p MK-FDA improves upon sparse MK-FDA in many practical situations. The results also show that on image categorisation problems, ℓ p MK-FDA tends to outperform its SVM counterpart. Finally, we also discuss the connection between (MK-)FDA and (MK-)SVM, under the unified framework of regularised kernel machines. Keywords: multiple kernel learning, kernel fisher discriminant analysis, regularised least squares, support vector machines
Reference: text
sentIndex sentText sentNum sentScore
1 Building on recent advances in non-sparse multiple kernel learning (MKL), we propose a non-sparse version of MK-FDA, which imposes a general ℓ p norm regularisation on the kernel weights. [sent-14, score-0.664]
2 Finally, we also discuss the connection between (MK-)FDA and (MK-)SVM, under the unified framework of regularised kernel machines. [sent-20, score-0.325]
3 Keywords: multiple kernel learning, kernel fisher discriminant analysis, regularised least squares, support vector machines 1. [sent-21, score-0.559]
4 It is well recognised that in kernel methods, the choice of kernel function is critically important, since it completely determines the embedding of the data in the feature space. [sent-29, score-0.35]
5 (2002, 2004) study a binary classification problem, and their key idea is to learn a linear combination of a given set of base kernels by maximising the margin between the two classes or by maximising kernel alignment. [sent-39, score-0.422]
6 Moreover, with the interleaved algorithm, only columns of the kernel matrices that correspond to the “active” dual variables need to be loaded into memory, extending MKL’s applicability to large scale problems. [sent-72, score-0.34]
7 The learning problem in Equation (1) imposes an ℓ1 regularisation on the kernel weights. [sent-73, score-0.427]
8 It has been known that ℓ1 norm regularisation tends to produce sparse solutions (Rätsch, 2001), which means that during learning most kernels are assigned zero weights. [sent-74, score-0.496]
9 (2009), non-sparse versions of MKL are proposed, where an ℓ2 norm regularisation is imposed instead of ℓ1 norm. [sent-79, score-0.314]
10 (2008, 2009, 2011) show that the regularisation norm contributes significantly to the performance of MKL, and confirm that in general a smaller regularisation norm produces more sparse kernel weights. [sent-86, score-0.803]
11 (2008), a multiple kernel FDA (MK-FDA) is introduced, where an ℓ1 norm is used to regularise the kernel weights. [sent-102, score-0.412]
12 (2008) to a general ℓ p norm regularisation by bringing the latest advances in non-sparse MKL to MK-FDA. [sent-106, score-0.314]
13 Our contribution can be summarised as follows: • We provide a SIP formulation of ℓ p MK-FDA for both binary and multiclass problems, and adapt the wrapper algorithm in Sonnenburg et al. [sent-107, score-0.317]
14 We confirm that as in the case of ℓ p MK-SVM, in ℓ p MK-FDA, a smaller regularisation norm in general leads to more sparse kernel weights. [sent-111, score-0.489]
15 We also show that by selecting the regularisation norm p on an independent validation set, the “intrinsic sparsity” of the given set of base kernels can be learnt. [sent-112, score-0.575]
16 As a result, using the learnt optimal norm p in ℓ p MK-FDA offers better performance than fixed norm MK-FDAs. [sent-113, score-0.279]
17 In terms of efficiency, our wrapper-based ℓ p MK-FDA is comparable to the interleaved ℓ p MK-SVM on small/medium sized binary problems, but can be significantly faster on multiclass problems. [sent-115, score-0.321]
18 (Section 3) • Finally, we discuss the connection between (MK-)FDA and (MK-)SVM, from the perspectives of both loss function and version space, under the unified framework of regularised kernel machines. [sent-119, score-0.325]
19 ℓ p Norm Multiple Kernel FDA. In this section we first present our ℓ p regularised MK-FDA for binary problems and then for multiclass problems. [sent-129, score-0.347]
20 We define optimality in terms of the class separation criterion of FDA, that is, the learnt kernel weights β are optimal if the ratio of the projected between-class and within-class scatters is maximised. [sent-134, score-0.393]
21 With S_B the between-class scatter matrix defined in Equation (2), the objective of single kernel FDA is to find the projection direction w in the feature space that maximises (w^T S_B w) / (w^T S_W w), or equivalently (w^T S_B w) / (w^T S_T w), where S_T = S_B + S_W is the total scatter matrix; in practice a regularised objective function is used. [sent-144, score-0.356]
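As an illustration of this criterion, the following numpy sketch computes the regularised Fisher direction for a binary problem from an explicit feature matrix; the function name, variable names, and default regulariser are illustrative, and the kernelised variant used in the paper operates on kernel matrices instead.

```python
import numpy as np

def fda_direction(X, y, lam=1e-3):
    """Regularised Fisher discriminant direction for a binary problem.

    X: (m, d) feature matrix, y: (m,) labels in {0, 1}.
    Maximises (w^T S_B w) / (w^T (S_W + lam*I) w).
    """
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    d = m1 - m0                                  # S_B = d d^T has rank one
    S_W = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    S_W += lam * np.eye(X.shape[1])              # regularisation keeps the ratio well defined
    w = np.linalg.solve(S_W, d)                  # closed form: w is proportional to S_W^{-1}(m1 - m0)
    return w / np.linalg.norm(w)
```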
22 The kernel weights must be regularised somehow to make sure Equation (8) remains meaningful and does not become arbitrarily large. [sent-156, score-0.388]
23 In this paper, we propose to impose an ℓ p regularisation on the kernel weights for any p ≥ 1, following Kloft et al. [sent-157, score-0.49]
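As a small operational illustration of the ℓ p constraint on kernel weights (not the paper's optimisation routine), candidate weights can be clipped to be non-negative and rescaled onto the unit ℓ p ball before the base kernels are combined:

```python
import numpy as np

def combine_kernels(kernels, beta, p=2.0):
    """Combine base kernels with non-negative weights of unit l_p norm.

    kernels: list of (m, m) kernel matrices; beta: candidate weights (at least one > 0).
    """
    beta = np.maximum(np.asarray(beta, dtype=float), 0.0)   # enforce beta >= 0
    beta = beta / np.linalg.norm(beta, ord=p)                # enforce ||beta||_p = 1
    K = sum(b * Kj for b, Kj in zip(beta, kernels))
    return K, beta
```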
24 Theorem 1 Given a set of n kernel matrices K_1, · · · , K_n, the kernel weights β that optimise Equation (12) are given by solving the following SIP problem: max_{θ,β} θ (13), subject to the ℓ p norm constraint on β and the requirement that the single kernel dual objective in Equation (14) is at least θ for all dual variables α. [sent-181, score-0.413]
25 Note that the program in Equation (14) is nothing but the single kernel FDA/RLS dual problem using the current estimate β^(t) as kernel weights. [sent-194, score-0.391]
26 When choosing from linear combinations of a set of base kernels with kernel weights regularised with an ℓ p norm, the optimal kernel weights are given by: max_β min_{α_k} ∑_{k=1}^{c} ( −h_k^T α_k + (1/4) α_k^T α_k + (1/(4λ)) ∑_{j=1}^{n} β_j α_k^T K_j α_k ), s.t. β ≥ 0 and ||β||_p ≤ 1. [sent-227, score-0.84]
27 We again use a second order Taylor expansion to approximate the norm constraint and arrive at the multiclass ℓ p MK-FDA saddle point problem: max_β min_{α_k} ∑_{k=1}^{c} ( −h_k^T α_k + (1/4) α_k^T α_k + (1/(4λ)) ∑_{j=1}^{n} β_j α_k^T K_j α_k ), s.t. [sent-230, score-0.311]
28 β ≥ 0, ν(β) ≤ 1. Algorithm 2 (a wrapper algorithm for solving the multiclass ℓ p MK-FDA SIP in Equation (20)) takes as input K_1, · · · , K_n, a, θ^(1) = −∞, β_j = n^{−1/p} ∀j, and ε. [sent-232, score-0.253]
29 The iterative wrapper algorithm for solving the multiclass ℓ p MK-FDA SIP is summarised in Algorithm 2. [sent-245, score-0.253]
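The wrapper strategy can be sketched as an alternation between the single kernel subproblem, solved with the current combined kernel, and an update of the weights β; in the sketch below the β update is a simple heuristic placeholder rather than the paper's SIP subproblem, and the stopping test is likewise only indicative.

```python
import numpy as np

def lp_mkfda_wrapper(kernels, solve_subproblem, p=2.0, eps=1e-4, max_iter=100):
    """Schematic wrapper loop for l_p multiple kernel learning.

    kernels: list of n (m, m) kernel matrices.
    solve_subproblem: callable K -> (alpha, objective) solving the single-kernel
                      FDA/RLS dual for a fixed combined kernel K.
    """
    n = len(kernels)
    beta = np.full(n, n ** (-1.0 / p))            # beta_j = n^{-1/p}, as in Algorithm 2
    obj_old = -np.inf
    for _ in range(max_iter):
        K = sum(b * Kj for b, Kj in zip(beta, kernels))
        alpha, obj = solve_subproblem(K)
        # Placeholder beta update: weight each kernel by alpha^T K_j alpha,
        # then renormalise onto the unit l_p ball (not the paper's SIP step).
        s = np.array([alpha @ Kj @ alpha for Kj in kernels])
        beta = np.maximum(s, 1e-12) ** (1.0 / (p + 1.0))
        beta /= np.linalg.norm(beta, ord=p)
        if abs(obj - obj_old) <= eps * max(abs(obj), 1e-12):
            break
        obj_old = obj
    return beta
```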
30 We show that by exploiting the equivalence between kernel FDA and least squares SVM (LSSVM) (Suykens and Vandewalle, 1999), the interleaved method in Sonnenburg et al. [sent-249, score-0.299]
31 Both new formulations discussed in this section are equivalent to previous ones in terms of learnt kernel weights, but can potentially lead to significant efficiency improvement. [sent-254, score-0.367]
32 Interleaved Optimisation of the Saddle Point Problem. We consider the multiclass MKL problem for a general convex loss function V(ξ_ik, h_ik): min_{w_jk, ξ_ik, β} ∑_{k=1}^{c} ( (1/2) ∑_{j=1}^{n} ||w_jk||^2 / β_j + C ∑_{i=1}^{m} V(ξ_ik, h_ik) ), s.t. [sent-259, score-0.628]
33 When V(ξ_ik, h_ik) is the square loss V(ξ_ik, h_ik) = (1/2)(ξ_ik − h_ik)^2, Equation (23) is essentially multiclass multiple kernel regularised least squares (MK-RLS). [sent-262, score-0.987]
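For the square loss, the single kernel subproblem is plain kernel regularised least squares, which admits the closed form sketched below; the returned objective value is only indicative, since the exact constants depend on the formulation.

```python
import numpy as np

def kernel_rls(K, h, lam=1.0):
    """Kernel regularised least squares: alpha = (K + lam*I)^{-1} h."""
    m = K.shape[0]
    alpha = np.linalg.solve(K + lam * np.eye(m), h)
    obj = -0.5 * h @ alpha       # indicative objective; constants vary across formulations
    return alpha, obj
```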
34 (2011), such an algorithm has two disadvantages: solving the whole single kernel problem in the α step is unnecessary and therefore wasteful, and all kernels need to be loaded into memory. [sent-270, score-0.357]
35 Working Directly with the Dual. The MK-FDA algorithms considered so far, including the wrapper method and the interleaved method, are all based on the intermediate saddle point formulation in Equation (24), or equivalently, Equation (19). [sent-282, score-0.329]
36 (2010), this time we impose the norm constraint in the form of Tikhonov regularisation instead of Ivanov regularisation: min_{w_jk, ξ_ik, β} ∑_{k=1}^{c} ( (1/2) ∑_{j=1}^{n} ||w_jk||^2 / β_j + C ∑_{i=1}^{m} V(ξ_ik, h_ik) ) + (µ/2) ||β||_p^2 (25), s.t. [sent-286, score-0.612]
37 This is a direct result of the fact that the two formulations use Tikhonov regularisation and Ivanov regularisation over β, respectively. [sent-303, score-0.541]
38 We show that by selecting the regularisation norm p on an independent validation set, the intrinsic sparsity of the given set of base kernels can be learnt. [sent-317, score-0.646]
39 Among the six datasets used in the experiments, three of them (synthetic, VOC08, VOC07) are binary problems and the rest (Caltech101, Flower17, Protein) are multiclass ones. [sent-325, score-0.269]
40 Once the kernel weights have been learnt, we use a spectral regression based efficient kernel FDA implementation (Cai et al. [sent-328, score-0.413]
41 , 2010); while on multiclass problems, we compare ℓ p MK-FDA with two variants of multiclass ℓ p MK-SVM: MK-SVM Shogun and MK-SVM OBSCURE (Orabona et al. [sent-333, score-0.328]
42 For the first five datasets, due to the kernel functions used, the kernel matrices are by definition spherically normalised: all data points lie on the unit hypersphere in the feature space. [sent-361, score-0.35]
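Spherical normalisation divides each kernel entry by the feature-space norms of the two corresponding points, so that every point lies on the unit hypersphere; a minimal sketch (assumes a positive diagonal):

```python
import numpy as np

def spherical_normalise(K):
    """K_ij <- K_ij / sqrt(K_ii * K_jj), giving unit-norm points in feature space."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)
```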
43 For the protein localisation dataset, the kernels are multiplicatively normalised following Ong and Zien (2008) and Kloft et al. [sent-362, score-0.27]
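Multiplicative normalisation is commonly implemented by rescaling each kernel so that the data have unit variance in feature space; the sketch below follows that convention, which is assumed here since the exact constant is not spelled out in the text.

```python
import numpy as np

def multiplicative_normalise(K):
    """Rescale K so the centred feature-space variance of the data equals one."""
    m = K.shape[0]
    variance = np.trace(K) / m - K.sum() / (m * m)
    return K / variance
```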
44 The above process of mean/covariance generation, sampling, and kernel building is repeated n times, resulting in n training kernels (and n corresponding test kernels). [sent-379, score-0.389]
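A rough sketch of this kind of synthetic kernel generation: sample class means, draw two-class Gaussian data, and build one RBF kernel per repetition; the dimensionality, identity covariance, and RBF width below are placeholders rather than the paper's settings.

```python
import numpy as np

def make_synthetic_kernels(n_kernels, m_per_class=50, dim=10, gamma=0.1, seed=0):
    """Build n training kernels from independently sampled two-class Gaussian data."""
    rng = np.random.default_rng(seed)
    y = np.r_[np.zeros(m_per_class), np.ones(m_per_class)]
    kernels = []
    for _ in range(n_kernels):
        mu0, mu1 = rng.normal(size=dim), rng.normal(size=dim)
        X = np.vstack([rng.normal(mu0, 1.0, size=(m_per_class, dim)),
                       rng.normal(mu1, 1.0, size=(m_per_class, dim))])
        sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
        kernels.append(np.exp(-gamma * sq_dists))    # RBF kernel matrix
    return kernels, y
```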
45 Another observation is that ℓ1 MK-FDA slightly outperforms ℓ2 MK-FDA when the number of kernels is 5, and vice versa when the number of kernels is 10 or 15. [sent-388, score-0.364]
46 Two typical examples of such weights, learnt using n = 5 kernels and n = 30 kernels respectively, are plotted in Figure 1 (b). [sent-392, score-0.519]
47 It has been known that ℓ1 norm regularisation tends to produce sparse solutions (Rätsch, 2001; Kloft et al. [sent-393, score-0.314]
48 (b) Comparing the kernel weights learnt from ℓ1 MK-FDA and ℓ2 MK-FDA. [sent-401, score-0.393]
49 As the number of kernels increases, eventually there are enough of them for the over-selectiveness of ℓ1 regularisation to exhibit itself. [sent-406, score-0.434]
50 Finally, it is worth noting that the sparsity of learnt kernel weights, which captures the sparsity of information in the kernel set, is not to be confused with the numerical sparsity of the kernel matrices. [sent-410, score-0.791]
51 For example, when the RBF kernel function is used, the kernel matrices will not contain any zeros, regardless of the sparsity of the kernel weights. [sent-411, score-0.562]
52 These random kernels are then mixed with the informative ones, to study how the properties of kernels affect the performance of ℓ1 and ℓ2 MK-FDAs. [sent-435, score-0.364]
53 In the following runs the number of informative kernels is increased and that of random kernels decreased, until the 31st run, where all 30 kernels are informative. [sent-438, score-0.546]
54 In each run, we apply both ℓ1 and ℓ2 MK-FDAs to the 20 binary problems, compute the MAP for each algorithm, and record the learnt kernel weights. [sent-439, score-0.363]
55 Figure 2 (a) plots the kernel weights learnt from ℓ1 MK-FDA and ℓ2 MK-FDA. [sent-440, score-0.393]
56 Another interpretation of this observation is that when the “intrinsic” sparsity of the base kernels is high then ℓ1 norm regularisation is appropriate, and vice versa. [sent-451, score-0.565]
57 This suggests that if we can learn this intrinsic sparsity of base kernels on a validation set, we will be able to find the most appropriate regularisation norm p, and get improved performance over a fixed norm MK-FDA. [sent-452, score-0.708]
58 Finally, the two kernel functions are applied to the histograms to build kernel matrices. [sent-468, score-0.35]
59 We investigate the idea of learning the intrinsic sparsity of the base kernels by tuning the regularisation norm p on a validation set, using both ℓ p MK-SVM and ℓ p MK-FDA. [sent-469, score-0.646]
60 For ℓ p MK-SVM, the regularisation parameter C is learnt jointly with p from 10 values that are logarithmically spaced over 2^−2 to 2^7. [sent-471, score-0.407]
61 Similarly, for ℓ p MK-FDA, the regularisation parameter λ is learnt jointly with p from a set of 10 values that are logarithmically spaced over 4^−5 to 4^4. [sent-472, score-0.407]
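The grids described above are easy to reproduce; the sketch below builds 10 logarithmically spaced λ values over 4^−5 to 4^4 and selects {p, λ} jointly by a validation score, where the p grid and the scoring callable are placeholders.

```python
import numpy as np

lambdas = np.logspace(-5, 4, num=10, base=4.0)    # 10 values: 4^-5, 4^-4, ..., 4^4
p_grid = [1.0, 4.0 / 3.0, 2.0, 4.0, 8.0, 1e6]     # illustrative norms; 1e6 acts as l_inf

def select_p_lambda(validation_score, p_grid, lambdas):
    """Return the (p, lambda) pair with the highest validation score."""
    return max(((p, lam) for p in p_grid for lam in lambdas),
               key=lambda pair: validation_score(*pair))
```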
62 For ℓ p MK-FDA, for each p value, the weights learnt with the optimal λ value are plotted; while for ℓ p MK-SVM, for each p value, we show the weights learnt with the optimal C value. [sent-475, score-0.436]
63 It is clear that as p increases, in both MKL algorithms, the sparsity of the learnt weights decreases. [sent-476, score-0.255]
64 As expected, when p = 10^6 (practically infinity), the kernel weights become all ones, that is, ℓ∞ MK-FDA/MK-SVM produces uniform kernel weights. [sent-477, score-0.413]
65 Note that for the same norm p, the weights learnt in ℓ p MK-FDA and ℓ p MK-SVM can be different. [sent-478, score-0.28]
66 Table 1: VOC2007: average precisions of six MKL methods. The top row of Figure 4 shows the learnt kernel weights with the optimal {p, λ} combination on the training set and on the training + validation set, respectively. [sent-622, score-0.504]
67 For this particular binary problem, the intrinsic sparsity of the set of base kernels is medium. [sent-624, score-0.318]
68 However, for the “pottedplant” class, the optimal p on the validation set is found to be 8, which implies that the intrinsic sparsity of the kernels is low. [sent-627, score-0.3]
69 The results in Table 1 demonstrate that learning the regularisation norm p indeed improves the performance of MK-FDA. [sent-630, score-0.314]
70 However, it is worth noting that this improvement is achieved at a computational price of cross validating for an additional parameter, the regularisation norm p. [sent-631, score-0.314]
71 We argue, however, that kernel alignment by itself cannot reveal completely the sparsity of a kernel set. [sent-643, score-0.418]
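Kernel alignment, as referred to here, is the normalised Frobenius inner product between two kernel matrices; a minimal sketch (a centred variant would first centre each kernel):

```python
import numpy as np

def kernel_alignment(K1, K2):
    """A(K1, K2) = <K1, K2>_F / (||K1||_F * ||K2||_F)."""
    return np.sum(K1 * K2) / (np.linalg.norm(K1) * np.linalg.norm(K2))
```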
72 This means kernel alignment, which is class independent, by itself cannot be expected to identify the kernel set sparsity for all classes. [sent-655, score-0.387]
73 Caltech101 is a multiclass object recognition benchmark with 101 object categories. [sent-665, score-0.276]
74 Validation is omitted, as the training of multiclass MK-SVM Shogun on this dataset can be very time consuming. [sent-668, score-0.301]
75 Two multiclass MK-SVM implementations are compared against multiclass MK-FDA, namely, MK-SVM Shogun, and the recently proposed online MK-SVM algorithm OBSCURE (Orabona et al. [sent-674, score-0.328]
76 For OBSCURE, the parameters are set to default values, except for the MKL norm p and the regularisation parameter C. [sent-676, score-0.314]
77 Oxford Flower17. Oxford Flower17 is a multiclass dataset consisting of 17 categories of flowers, with 80 images per category. [sent-691, score-0.269]
78 Protein Subcellular Localisation. In the previous three sections, we have shown that on both binary and multiclass object recognition problems, ℓ p MK-FDA tends to outperform ℓ p MK-SVM by a small margin. [sent-729, score-0.253]
79 (2011), the multiclass problem associated with each dataset is decomposed into binary problems using the one-vs-rest strategy. [sent-913, score-0.302]
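One-vs-rest decomposition turns a c-class problem into c binary ones; a minimal sketch of the ±1 label encoding assumed here, with class labels taken to be 0, ..., c−1:

```python
import numpy as np

def one_vs_rest_labels(y, num_classes):
    """Return an (m, c) matrix H with H[i, k] = +1 if y[i] == k, else -1."""
    y = np.asarray(y, dtype=int)
    H = -np.ones((len(y), num_classes))
    H[np.arange(len(y)), y] = 1.0
    return H
```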
80 (2011), the regularisation constant C for MK-SVM is taken from a set of 9 values: {1/32, 1/8, 1/2, 1, 2, 4, 8, 32, 128}. [sent-918, score-0.252]
81 In our experiments, the regularisation constant λ for MK-FDA is also taken from a set of 9 values, and the values are logarithmically spaced over 10^−8 to 10^0. [sent-919, score-0.252]
82 Binary Case: VOC2007. We first compare on the VOC2007 dataset the training speed of three binary MKL methods: the wrapper based binary ℓ p MK-FDA in Section 2. [sent-947, score-0.292]
83 Multiclass Case: Caltech101. Next we compare three multiclass ℓ p MKL algorithms on the Caltech101 dataset, namely, the wrapper-based multiclass ℓ p MK-FDA in Section 2. [sent-976, score-0.328]
84 Note that the regularisation parameters (λ for MK-FDA and C for OBSCURE) are set to values that yield the highest classification accuracy. [sent-1006, score-0.252]
85 Learning machines of the form of Equation (27) are collectively termed regularised kernel machines, a name capturing their two key aspects: regularisation and kernel mapping. [sent-1022, score-0.5]
86 On the other hand, along with the success of SVM, regularised kernel machines using the square loss V ( f (φ(x)), y) = (y − f (φ(x)))2 have emerged several times under various names, including: regularised networks (RN) (Girosi et al. [sent-1042, score-0.475]
87 , 2000), regularised least squares (RLS) (Rifkin, 2002), kernel ridge regression (KRR) (Saunders et al. [sent-1044, score-0.325]
88 Finally, we have provided a discussion on the connection between (MK-)FDA and (MK-)SVM from the perspectives of both loss function and version space, under the unified framework of regularised kernel machines. [sent-1087, score-0.325]
89 Multiclass ℓ p MK-FDA Saddle Point Formulation In this appendix, we first derive the saddle point formulation of multiclass MKL for a general convex loss. [sent-1092, score-0.28]
90 Using the output encoding scheme in Equation (18), multiclass MKL for a general convex loss function V(ξ_ik, h_ik) can be stated as: min_{w_jk, ξ_ik, β} ∑_{k=1}^{c} ( (1/2) ∑_{j=1}^{n} ||w_jk||^2 / β_j + C ∑_{i=1}^{m} V(ξ_ik, h_ik) ), s.t. ∑_{j=1}^{n} w_jk^T φ_j(x_i) = ξ_ik, ∀i, ∀k. [sent-1094, score-0.694]
91 Using this fact we arrive at the multiclass MKL saddle point problem for a general loss function: min_{ξ_ik, β} max_{α_ik} ∑_{k=1}^{c} ( C ∑_{i=1}^{m} V(ξ_ik, h_ik) + ∑_{i=1}^{m} α_ik ξ_ik − (1/2) ∑_{j=1}^{n} β_j α_k^T K_j α_k ) (29), s.t. β ≥ 0; ||β||_p^2 ≤ 1. [sent-1104, score-0.415]
92 Take the square loss V(ξ_ik, h_ik) = (1/2)(ξ_ik − h_ik)^2 as an example. [sent-1108, score-0.332]
93 (2010), this time we impose the norm constraint in the form of Tikhonov regularisation instead of Ivanov regularisation: min_{w_jk, ξ_ik, β} ∑_{k=1}^{c} ( (1/2) ∑_{j=1}^{n} ||w_jk||^2 / β_j + C ∑_{i=1}^{m} V(ξ_ik, h_ik) ) + (µ/2) ||β||_p^2 (31), s.t. [sent-1120, score-0.612]
94 j=1 Note however that the switching from Ivanov to Tikhonov regularisation is not essential for the derivation in the following. [sent-1123, score-0.252]
95 The dual program for Ivanov regularisation in Equation (28) can be derived in a similar way. [sent-1124, score-0.293]
96 We again take the square loss V(ξ_ik, h_ik) = (1/2)(ξ_ik − h_ik)^2 as an example. [sent-1138, score-0.332]
97 Unlike the saddle point formulation in Equation (30), the kernel weights β have been eliminated from Equation (36). [sent-1144, score-0.354]
98 An automated combination of kernels for predicting protein subcellular localization. [sent-1428, score-0.261]
99 A comparison of l1 norm and l2 norm multiple kernel svms in image and video classification. [sent-1578, score-0.332]
100 Lp norm multiple kernel fisher discriminant analysis for object and image categorisation. [sent-1586, score-0.385]
wordName wordTfidf (topN-words)
[('fda', 0.336), ('mkl', 0.305), ('regularisation', 0.252), ('tahir', 0.186), ('kernels', 0.182), ('ik', 0.18), ('kernel', 0.175), ('kloft', 0.173), ('hik', 0.166), ('multiclass', 0.164), ('ikolajczyk', 0.159), ('ittler', 0.159), ('learnt', 0.155), ('iscriminant', 0.15), ('isher', 0.15), ('regularised', 0.15), ('obscure', 0.135), ('yan', 0.135), ('interleaved', 0.124), ('shogun', 0.123), ('optimisation', 0.111), ('dataset', 0.105), ('ultiple', 0.1), ('sip', 0.098), ('ernel', 0.089), ('wrapper', 0.089), ('ye', 0.088), ('categorisation', 0.088), ('saddle', 0.085), ('sonnenburg', 0.078), ('nalysis', 0.075), ('parse', 0.072), ('datasets', 0.072), ('mikolajczyk', 0.068), ('equation', 0.066), ('jk', 0.066), ('svm', 0.064), ('weights', 0.063), ('norm', 0.062), ('suykens', 0.062), ('vishwanathan', 0.06), ('discriminant', 0.059), ('subproblem', 0.058), ('object', 0.056), ('localisation', 0.053), ('sb', 0.053), ('tikhonov', 0.048), ('validation', 0.047), ('mika', 0.047), ('sher', 0.045), ('silp', 0.045), ('aps', 0.044), ('subcellular', 0.044), ('cai', 0.042), ('lanckriet', 0.041), ('pascal', 0.041), ('dual', 0.041), ('ivanov', 0.038), ('shevade', 0.038), ('gehler', 0.038), ('formulations', 0.037), ('ht', 0.037), ('zien', 0.037), ('sparsity', 0.037), ('wt', 0.036), ('aeroplane', 0.035), ('baudat', 0.035), ('gemert', 0.035), ('gestel', 0.035), ('jrls', 0.035), ('lssvm', 0.035), ('mkfda', 0.035), ('mksvm', 0.035), ('psortneg', 0.035), ('psortpos', 0.035), ('rls', 0.035), ('sande', 0.035), ('protein', 0.035), ('centred', 0.035), ('orabona', 0.035), ('ong', 0.035), ('plant', 0.034), ('rifkin', 0.034), ('intrinsic', 0.034), ('binary', 0.033), ('image', 0.033), ('lp', 0.033), ('ciency', 0.033), ('base', 0.032), ('training', 0.032), ('keerthi', 0.031), ('smo', 0.031), ('maximises', 0.031), ('nowozin', 0.031), ('kittler', 0.031), ('alignment', 0.031), ('formulation', 0.031), ('lopez', 0.03), ('centring', 0.03), ('sw', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
Author: Fei Yan, Josef Kittler, Krystian Mikolajczyk, Atif Tahir
Abstract: Sparsity-inducing multiple kernel Fisher discriminant analysis (MK-FDA) has been studied in the literature. Building on recent advances in non-sparse multiple kernel learning (MKL), we propose a non-sparse version of MK-FDA, which imposes a general ℓ p norm regularisation on the kernel weights. We formulate the associated optimisation problem as a semi-infinite program (SIP), and adapt an iterative wrapper algorithm to solve it. We then discuss, in light of latest advances in MKL optimisation techniques, several reformulations and optimisation strategies that can potentially lead to significant improvements in the efficiency and scalability of MK-FDA. We carry out extensive experiments on six datasets from various application areas, and compare closely the performance of ℓ p MK-FDA, fixed norm MK-FDA, and several variants of SVM-based MKL (MK-SVM). Our results demonstrate that ℓ p MK-FDA improves upon sparse MK-FDA in many practical situations. The results also show that on image categorisation problems, ℓ p MK-FDA tends to outperform its SVM counterpart. Finally, we also discuss the connection between (MK-)FDA and (MK-)SVM, under the unified framework of regularised kernel machines. Keywords: multiple kernel learning, kernel fisher discriminant analysis, regularised least squares, support vector machines
2 0.3859123 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization
Author: Francesco Orabona, Luo Jie, Barbara Caputo
Abstract: In recent years there has been a lot of interest in designing principled classification algorithms over multiple cues, based on the intuitive notion that using more features should lead to better performance. In the domain of kernel methods, a principled way to use multiple features is the Multi Kernel Learning (MKL) approach. Here we present a MKL optimization algorithm based on stochastic gradient descent that has a guaranteed convergence rate. We directly solve the MKL problem in the primal formulation. By having a p-norm formulation of MKL, we introduce a parameter that controls the level of sparsity of the solution, while leading to an easier optimization problem. We prove theoretically and experimentally that 1) our algorithm has a faster convergence rate as the number of kernels grows; 2) the training complexity is linear in the number of training examples; 3) very few iterations are sufficient to reach good solutions. Experiments on standard benchmark databases support our claims. Keywords: multiple kernel learning, learning kernels, online optimization, stochastic subgradient descent, convergence bounds, large scale
3 0.19405076 81 jmlr-2012-On the Convergence Rate oflp-Norm Multiple Kernel Learning
Author: Marius Kloft, Gilles Blanchard
Abstract: We derive an upper bound on the local Rademacher complexity of ℓ p -norm multiple kernel learning, which yields a tighter excess risk bound than global approaches. Previous local approaches analyzed the case p = 1 only while our analysis covers all cases 1 ≤ p ≤ ∞, assuming the different feature mappings corresponding to the different kernels to be uncorrelated. We also show a lower bound that shows that the bound is tight, and derive consequences regarding excess loss, namely α fast convergence rates of the order O(n− 1+α ), where α is the minimum eigenvalue decay rate of the individual kernels. Keywords: multiple kernel learning, learning kernels, generalization bounds, local Rademacher complexity
4 0.08324679 16 jmlr-2012-Algorithms for Learning Kernels Based on Centered Alignment
Author: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh
Abstract: This paper presents new and effective algorithms for learning kernels. In particular, as shown by our empirical results, these algorithms consistently outperform the so-called uniform combination solution that has proven to be difficult to improve upon in the past, as well as other algorithms for learning kernels based on convex combinations of base kernels in both classification and regression. Our algorithms are based on the notion of centered alignment which is used as a similarity measure between kernels or kernel matrices. We present a number of novel algorithmic, theoretical, and empirical results for learning kernels based on our notion of centered alignment. In particular, we describe efficient algorithms for learning a maximum alignment kernel by showing that the problem can be reduced to a simple QP and discuss a one-stage algorithm for learning both a kernel and a hypothesis based on that kernel using an alignment-based regularization. Our theoretical results include a novel concentration bound for centered alignment between kernel matrices, the proof of the existence of effective predictors for kernels with high alignment, both for classification and for regression, and the proof of stability-based generalization bounds for a broad family of algorithms for learning kernels based on centered alignment. We also report the results of experiments with our centered alignment-based algorithms in both classification and regression. Keywords: kernel methods, learning kernels, feature selection
5 0.080498174 89 jmlr-2012-Pairwise Support Vector Machines and their Application to Large Scale Problems
Author: Carl Brunner, Andreas Fischer, Klaus Luig, Thorsten Thies
Abstract: Pairwise classification is the task to predict whether the examples a, b of a pair (a, b) belong to the same class or to different classes. In particular, interclass generalization problems can be treated in this way. In pairwise classification, the order of the two input examples should not affect the classification result. To achieve this, particular kernels as well as the use of symmetric training sets in the framework of support vector machines were suggested. The paper discusses both approaches in a general way and establishes a strong connection between them. In addition, an efficient implementation is discussed which allows the training of several millions of pairs. The value of these contributions is confirmed by excellent results on the labeled faces in the wild benchmark. Keywords: pairwise support vector machines, interclass generalization, pairwise kernels, large scale problems
6 0.079525255 44 jmlr-2012-Feature Selection via Dependence Maximization
7 0.077980705 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation
8 0.068058684 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices
9 0.065026872 96 jmlr-2012-Refinement of Operator-valued Reproducing Kernels
10 0.057019837 97 jmlr-2012-Regularization Techniques for Learning with Matrices
11 0.056771994 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors
12 0.052888222 100 jmlr-2012-Robust Kernel Density Estimation
13 0.049155015 4 jmlr-2012-A Kernel Two-Sample Test
14 0.046647832 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
15 0.045138132 56 jmlr-2012-Learning Linear Cyclic Causal Models with Latent Variables
16 0.042938944 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
17 0.039921347 23 jmlr-2012-Breaking the Curse of Kernelization: Budgeted Stochastic Gradient Descent for Large-Scale SVM Training
18 0.039116528 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices
19 0.036871292 111 jmlr-2012-Structured Sparsity and Generalization
20 0.036013916 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
topicId topicWeight
[(0, -0.237), (1, -0.111), (2, 0.128), (3, 0.474), (4, 0.286), (5, 0.069), (6, -0.142), (7, -0.095), (8, 0.099), (9, 0.082), (10, 0.01), (11, -0.032), (12, 0.104), (13, 0.017), (14, -0.144), (15, -0.079), (16, -0.137), (17, 0.065), (18, 0.01), (19, -0.095), (20, -0.116), (21, -0.125), (22, 0.015), (23, 0.087), (24, 0.045), (25, 0.03), (26, -0.075), (27, -0.026), (28, 0.049), (29, -0.018), (30, 0.019), (31, -0.02), (32, -0.078), (33, -0.083), (34, 0.095), (35, -0.028), (36, -0.047), (37, -0.005), (38, 0.024), (39, 0.001), (40, -0.018), (41, -0.022), (42, -0.08), (43, 0.048), (44, -0.049), (45, -0.008), (46, -0.043), (47, -0.006), (48, -0.042), (49, 0.051)]
simIndex simValue paperId paperTitle
same-paper 1 0.92807204 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
Author: Fei Yan, Josef Kittler, Krystian Mikolajczyk, Atif Tahir
Abstract: Sparsity-inducing multiple kernel Fisher discriminant analysis (MK-FDA) has been studied in the literature. Building on recent advances in non-sparse multiple kernel learning (MKL), we propose a non-sparse version of MK-FDA, which imposes a general ℓ p norm regularisation on the kernel weights. We formulate the associated optimisation problem as a semi-infinite program (SIP), and adapt an iterative wrapper algorithm to solve it. We then discuss, in light of latest advances in MKL optimisation techniques, several reformulations and optimisation strategies that can potentially lead to significant improvements in the efficiency and scalability of MK-FDA. We carry out extensive experiments on six datasets from various application areas, and compare closely the performance of ℓ p MK-FDA, fixed norm MK-FDA, and several variants of SVM-based MKL (MK-SVM). Our results demonstrate that ℓ p MK-FDA improves upon sparse MK-FDA in many practical situations. The results also show that on image categorisation problems, ℓ p MK-FDA tends to outperform its SVM counterpart. Finally, we also discuss the connection between (MK-)FDA and (MK-)SVM, under the unified framework of regularised kernel machines. Keywords: multiple kernel learning, kernel fisher discriminant analysis, regularised least squares, support vector machines
2 0.85009873 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization
Author: Francesco Orabona, Luo Jie, Barbara Caputo
Abstract: In recent years there has been a lot of interest in designing principled classification algorithms over multiple cues, based on the intuitive notion that using more features should lead to better performance. In the domain of kernel methods, a principled way to use multiple features is the Multi Kernel Learning (MKL) approach. Here we present a MKL optimization algorithm based on stochastic gradient descent that has a guaranteed convergence rate. We directly solve the MKL problem in the primal formulation. By having a p-norm formulation of MKL, we introduce a parameter that controls the level of sparsity of the solution, while leading to an easier optimization problem. We prove theoretically and experimentally that 1) our algorithm has a faster convergence rate as the number of kernels grows; 2) the training complexity is linear in the number of training examples; 3) very few iterations are sufficient to reach good solutions. Experiments on standard benchmark databases support our claims. Keywords: multiple kernel learning, learning kernels, online optimization, stochastic subgradient descent, convergence bounds, large scale
3 0.65068877 81 jmlr-2012-On the Convergence Rate oflp-Norm Multiple Kernel Learning
Author: Marius Kloft, Gilles Blanchard
Abstract: We derive an upper bound on the local Rademacher complexity of ℓ p -norm multiple kernel learning, which yields a tighter excess risk bound than global approaches. Previous local approaches analyzed the case p = 1 only while our analysis covers all cases 1 ≤ p ≤ ∞, assuming the different feature mappings corresponding to the different kernels to be uncorrelated. We also show a lower bound that shows that the bound is tight, and derive consequences regarding excess loss, namely α fast convergence rates of the order O(n− 1+α ), where α is the minimum eigenvalue decay rate of the individual kernels. Keywords: multiple kernel learning, learning kernels, generalization bounds, local Rademacher complexity
4 0.38500029 16 jmlr-2012-Algorithms for Learning Kernels Based on Centered Alignment
Author: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh
Abstract: This paper presents new and effective algorithms for learning kernels. In particular, as shown by our empirical results, these algorithms consistently outperform the so-called uniform combination solution that has proven to be difficult to improve upon in the past, as well as other algorithms for learning kernels based on convex combinations of base kernels in both classification and regression. Our algorithms are based on the notion of centered alignment which is used as a similarity measure between kernels or kernel matrices. We present a number of novel algorithmic, theoretical, and empirical results for learning kernels based on our notion of centered alignment. In particular, we describe efficient algorithms for learning a maximum alignment kernel by showing that the problem can be reduced to a simple QP and discuss a one-stage algorithm for learning both a kernel and a hypothesis based on that kernel using an alignment-based regularization. Our theoretical results include a novel concentration bound for centered alignment between kernel matrices, the proof of the existence of effective predictors for kernels with high alignment, both for classification and for regression, and the proof of stability-based generalization bounds for a broad family of algorithms for learning kernels based on centered alignment. We also report the results of experiments with our centered alignment-based algorithms in both classification and regression. Keywords: kernel methods, learning kernels, feature selection
5 0.34530798 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation
Author: Prateek Jain, Brian Kulis, Jason V. Davis, Inderjit S. Dhillon
Abstract: Metric and kernel learning arise in several machine learning applications. However, most existing metric learning algorithms are limited to learning metrics over low-dimensional data, while existing kernel learning algorithms are often limited to the transductive setting and do not generalize to new data points. In this paper, we study the connections between metric learning and kernel learning that arise when studying metric learning as a linear transformation learning problem. In particular, we propose a general optimization framework for learning metrics via linear transformations, and analyze in detail a special case of our framework—that of minimizing the LogDet divergence subject to linear constraints. We then propose a general regularized framework for learning a kernel matrix, and show it to be equivalent to our metric learning framework. Our theoretical connections between metric and kernel learning have two main consequences: 1) the learned kernel matrix parameterizes a linear transformation kernel function and can be applied inductively to new data points, 2) our result yields a constructive method for kernelizing most existing Mahalanobis metric learning formulations. We demonstrate our learning approach by applying it to large-scale real world problems in computer vision, text mining and semi-supervised kernel dimensionality reduction. Keywords: divergence metric learning, kernel learning, linear transformation, matrix divergences, logdet
6 0.3405059 44 jmlr-2012-Feature Selection via Dependence Maximization
7 0.3268365 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices
8 0.326341 89 jmlr-2012-Pairwise Support Vector Machines and their Application to Large Scale Problems
9 0.26847085 96 jmlr-2012-Refinement of Operator-valued Reproducing Kernels
10 0.25709698 23 jmlr-2012-Breaking the Curse of Kernelization: Budgeted Stochastic Gradient Descent for Large-Scale SVM Training
11 0.2464599 100 jmlr-2012-Robust Kernel Density Estimation
12 0.2061713 4 jmlr-2012-A Kernel Two-Sample Test
13 0.19645634 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors
14 0.17771244 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices
15 0.17753765 97 jmlr-2012-Regularization Techniques for Learning with Matrices
16 0.17459208 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
17 0.17392105 30 jmlr-2012-DARWIN: A Framework for Machine Learning and Computer Vision Research and Development
18 0.17055351 56 jmlr-2012-Learning Linear Cyclic Causal Models with Latent Variables
19 0.16999565 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization
20 0.16768654 54 jmlr-2012-Large-scale Linear Support Vector Regression
topicId topicWeight
[(7, 0.016), (20, 0.275), (21, 0.037), (26, 0.043), (27, 0.013), (29, 0.034), (35, 0.021), (49, 0.025), (56, 0.014), (64, 0.016), (66, 0.054), (69, 0.025), (75, 0.163), (77, 0.019), (92, 0.046), (96, 0.095)]
simIndex simValue paperId paperTitle
1 0.77027619 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
Author: Jun Zhu, Amr Ahmed, Eric P. Xing
Abstract: A supervised topic model can use side information such as ratings or labels associated with documents or images to discover more predictive low dimensional topical representations of the data. However, existing supervised topic models predominantly employ likelihood-driven objective functions for learning and inference, leaving the popular and potentially powerful max-margin principle unexploited for seeking predictive representations of data and more discriminative topic bases for the corpus. In this paper, we propose the maximum entropy discrimination latent Dirichlet allocation (MedLDA) model, which integrates the mechanism behind the max-margin prediction models (e.g., SVMs) with the mechanism behind the hierarchical Bayesian topic models (e.g., LDA) under a unified constrained optimization framework, and yields latent topical representations that are more discriminative and more suitable for prediction tasks such as document classification or regression. The principle underlying the MedLDA formalism is quite general and can be applied for jointly max-margin and maximum likelihood learning of directed or undirected topic models when supervising side information is available. Efficient variational methods for posterior inference and parameter estimation are derived and extensive empirical studies on several real data sets are also provided. Our experimental results demonstrate qualitatively and quantitatively that MedLDA could: 1) discover sparse and highly discriminative topical representations; 2) achieve state of the art prediction performance; and 3) be more efficient than existing supervised topic models, especially for classification. Keywords: supervised topic models, max-margin learning, maximum entropy discrimination, latent Dirichlet allocation, support vector machines
same-paper 2 0.74540138 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
Author: Fei Yan, Josef Kittler, Krystian Mikolajczyk, Atif Tahir
Abstract: Sparsity-inducing multiple kernel Fisher discriminant analysis (MK-FDA) has been studied in the literature. Building on recent advances in non-sparse multiple kernel learning (MKL), we propose a non-sparse version of MK-FDA, which imposes a general ℓ p norm regularisation on the kernel weights. We formulate the associated optimisation problem as a semi-infinite program (SIP), and adapt an iterative wrapper algorithm to solve it. We then discuss, in light of latest advances in MKL optimisation techniques, several reformulations and optimisation strategies that can potentially lead to significant improvements in the efficiency and scalability of MK-FDA. We carry out extensive experiments on six datasets from various application areas, and compare closely the performance of ℓ p MK-FDA, fixed norm MK-FDA, and several variants of SVM-based MKL (MK-SVM). Our results demonstrate that ℓ p MK-FDA improves upon sparse MK-FDA in many practical situations. The results also show that on image categorisation problems, ℓ p MK-FDA tends to outperform its SVM counterpart. Finally, we also discuss the connection between (MK-)FDA and (MK-)SVM, under the unified framework of regularised kernel machines. Keywords: multiple kernel learning, kernel fisher discriminant analysis, regularised least squares, support vector machines
3 0.58878422 44 jmlr-2012-Feature Selection via Dependence Maximization
Author: Le Song, Alex Smola, Arthur Gretton, Justin Bedo, Karsten Borgwardt
Abstract: We introduce a framework for feature selection based on dependence maximization between the selected features and the labels of an estimation problem, using the Hilbert-Schmidt Independence Criterion. The key idea is that good features should be highly dependent on the labels. Our approach leads to a greedy procedure for feature selection. We show that a number of existing feature selectors are special cases of this framework. Experiments on both artificial and real-world data show that our feature selector works well in practice. Keywords: kernel methods, feature selection, independence measure, Hilbert-Schmidt independence criterion, Hilbert space embedding of distribution
4 0.58366776 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization
Author: Francesco Orabona, Luo Jie, Barbara Caputo
Abstract: In recent years there has been a lot of interest in designing principled classification algorithms over multiple cues, based on the intuitive notion that using more features should lead to better performance. In the domain of kernel methods, a principled way to use multiple features is the Multi Kernel Learning (MKL) approach. Here we present a MKL optimization algorithm based on stochastic gradient descent that has a guaranteed convergence rate. We directly solve the MKL problem in the primal formulation. By having a p-norm formulation of MKL, we introduce a parameter that controls the level of sparsity of the solution, while leading to an easier optimization problem. We prove theoretically and experimentally that 1) our algorithm has a faster convergence rate as the number of kernels grows; 2) the training complexity is linear in the number of training examples; 3) very few iterations are sufficient to reach good solutions. Experiments on standard benchmark databases support our claims. Keywords: multiple kernel learning, learning kernels, online optimization, stochastic subgradient descent, convergence bounds, large scale
5 0.58008558 110 jmlr-2012-Static Prediction Games for Adversarial Learning Problems
Author: Michael Brückner, Christian Kanzow, Tobias Scheffer
Abstract: The standard assumption of identically distributed training and test data is violated when the test data are generated in response to the presence of a predictive model. This becomes apparent, for example, in the context of email spam filtering. Here, email service providers employ spam filters, and spam senders engineer campaign templates to achieve a high rate of successful deliveries despite the filters. We model the interaction between the learner and the data generator as a static game in which the cost functions of the learner and the data generator are not necessarily antagonistic. We identify conditions under which this prediction game has a unique Nash equilibrium and derive algorithms that find the equilibrial prediction model. We derive two instances, the Nash logistic regression and the Nash support vector machine, and empirically explore their properties in a case study on email spam filtering. Keywords: static prediction games, adversarial classification, Nash equilibrium
6 0.56276125 14 jmlr-2012-Activized Learning: Transforming Passive to Active with Improved Label Complexity
7 0.52961302 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
8 0.51132768 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection
9 0.50304455 89 jmlr-2012-Pairwise Support Vector Machines and their Application to Large Scale Problems
10 0.49995601 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
11 0.49290532 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
12 0.49201867 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
13 0.49062467 23 jmlr-2012-Breaking the Curse of Kernelization: Budgeted Stochastic Gradient Descent for Large-Scale SVM Training
14 0.48811835 16 jmlr-2012-Algorithms for Learning Kernels Based on Centered Alignment
15 0.48605812 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
16 0.48486733 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
17 0.4836337 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks
18 0.48320711 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
19 0.4778007 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices
20 0.47776383 100 jmlr-2012-Robust Kernel Density Estimation