jmlr jmlr2008 jmlr2008-66 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jieping Ye, Shuiwang Ji, Jianhui Chen
Abstract: Regularized kernel discriminant analysis (RKDA) performs linear discriminant analysis in the feature space via the kernel trick. Its performance depends on the selection of kernels. In this paper, we consider the problem of multiple kernel learning (MKL) for RKDA, in which the optimal kernel matrix is obtained as a linear combination of pre-specified kernel matrices. We show that the kernel learning problem in RKDA can be formulated as convex programs. First, we show that this problem can be formulated as a semidefinite program (SDP). Based on the equivalence relationship between RKDA and least square problems in the binary-class case, we propose a convex quadratically constrained quadratic programming (QCQP) formulation for kernel learning in RKDA. A semi-infinite linear programming (SILP) formulation is derived to further improve the efficiency. We extend these formulations to the multi-class case based on a key result established in this paper. That is, the multi-class RKDA kernel learning problem can be decomposed into a set of binary-class kernel learning problems which are constrained to share a common kernel. Based on this decomposition property, SDP formulations are proposed for the multi-class case. Furthermore, it leads naturally to QCQP and SILP formulations. As the performance of RKDA depends on the regularization parameter, we show that this parameter can also be optimized in a joint framework with the kernel. Extensive experiments have been conducted and analyzed, and connections to other algorithms are discussed. Keywords: model selection, kernel discriminant analysis, semidefinite programming, quadratically constrained quadratic programming, semi-infinite linear programming
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper, we consider the problem of multiple kernel learning (MKL) for RKDA, in which the optimal kernel matrix is obtained as a linear combination of pre-specified kernel matrices. [sent-9, score-0.541]
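For readers who want to see the linear-combination model in code, the sketch below builds a combined Gram matrix K(theta) = sum_i theta_i K_i from a handful of pre-specified base kernels. It is a minimal numpy illustration, not the authors' implementation; the RBF bandwidths and the toy data are hypothetical.

```python
import numpy as np

def rbf_gram(X, sigma):
    """Gram matrix of an RBF kernel with bandwidth sigma (illustrative choice)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma**2))

def combined_kernel(base_grams, theta):
    """Linear combination K(theta) = sum_i theta_i * K_i with theta >= 0."""
    theta = np.asarray(theta, dtype=float)
    assert np.all(theta >= 0), "MKL weights are constrained to be non-negative"
    return sum(t * K for t, K in zip(theta, base_grams))

# Toy usage: three RBF base kernels on random data (hypothetical bandwidths).
X = np.random.randn(20, 5)
grams = [rbf_gram(X, s) for s in (0.5, 1.0, 2.0)]
theta = np.array([0.2, 0.5, 0.3])   # weights normalized to sum to one, as in the paper's convention
K = combined_kernel(grams, theta)
```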
2 We show that the kernel learning problem in RKDA can be formulated as convex programs. [sent-10, score-0.233]
3 Based on the equivalence relationship between RKDA and least square problems in the binary-class case, we propose a convex quadratically constrained quadratic programming (QCQP) formulation for kernel learning in RKDA. [sent-12, score-0.395]
4 That is, the multi-class RKDA kernel learning problem can be decomposed into a set of binary-class kernel learning problems which are constrained to share a common kernel. [sent-15, score-0.363]
5 Based on this decomposition property, SDP formulations are proposed for the multi-class case. [sent-16, score-0.232]
6 Keywords: model selection, kernel discriminant analysis, semidefinite programming, quadratically constrained quadratic programming, semi-infinite linear programming 1. [sent-20, score-0.292]
7 The key fact underlying the success of kernel methods is that the embedding into feature space can be determined uniquely by specifying a kernel function that computes the dot product between data points in the feature space. [sent-26, score-0.342]
8 (2004b) pioneered the work of multiple kernel learning (MKL) in which the optimal kernel matrix is obtained as a linear combination of pre-specified kernel matrices. [sent-33, score-0.541]
9 Micchelli and Pontil (2005, 2007) studied the problem of finding an optimal kernel from a prescribed convex set of kernels by regularization. [sent-41, score-0.312]
10 In general, approaches based on learning a linear combination of kernel matrices offer the additional advantage of facilitating heterogeneous data integration from multiple sources. [sent-47, score-0.229]
11 Such formulations have been applied to combine various biological data, for example, amino acid sequences, hydropathy profiles, and gene expression data, for enhanced biological inference (Lanckriet et al. [sent-48, score-0.232]
12 (2005) showed that the learning of kernels can be accomplished by defining a reproducing kernel Hilbert space on the space of kernels itself, and the resulting optimization problem is an SDP. [sent-51, score-0.409]
13 (2007) showed that the kernel matrix can be learned in a nonparametric manner by solving an SDP. [sent-55, score-0.221]
14 This paper addresses the issue of kernel learning for regularized kernel discriminant analysis (RKDA) (Mika et al. [sent-60, score-0.389]
15 Based on this simplified form and the equivalence relationship between RKDA and least square problems in the binary-class case, we propose a convex quadratically constrained quadratic programming (QCQP) formulation for this problem. [sent-67, score-0.224]
16 While most existing work on kernel learning only deals with binary-class problems, we show that all of our formulations can be extended naturally to the multi-class setting. [sent-69, score-0.375]
17 In particular, we show that the kernel learning problem for multi-class RKDA can be decomposed into a set of binary-class kernel learning problems that are constrained to share a common kernel. [sent-70, score-0.363]
18 The key contributions of this paper can be highlighted as follows: • We propose a simplified SDP formulation for the RKDA kernel learning problem in the binary-class case. [sent-78, score-0.286]
19 • We show that the multi-class RKDA kernel learning problem can be decomposed into k binary-class kernel learning problems where k is the number of classes. [sent-80, score-0.342]
20 • We show that all the proposed formulations can be recast to optimize the regularization parameter simultaneously. [sent-83, score-0.332]
21 To demonstrate the effectiveness of the proposed formulations for heterogeneous data integration, we apply these formulations to combine multiple data sources derived from gene expression pattern images (Tomancak et al. [sent-86, score-0.526]
22 The optimal weight vector $w^* \equiv \arg\max_w \{F_1(w, K)\}$ that maximizes the objective function in Equation (1) for a fixed kernel function K and a fixed regularization parameter λ is given by $w^* = \big(\tfrac{m_+}{m} S_K^{+} + \tfrac{m_-}{m} S_K^{-} + \lambda I\big)^{-1} (\mu_+ - \mu_-)$. [sent-112, score-0.274]
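The closed-form solution above can be made concrete with a short numpy sketch that assembles the regularized within-class scatter and solves the resulting linear system. It works in an explicit feature space rather than through the kernel trick, and the variable names are illustrative stand-ins rather than the paper's notation.

```python
import numpy as np

def rkda_direction(X_pos, X_neg, lam):
    """Regularized discriminant direction w = (m+/m S+ + m-/m S- + lam I)^{-1} (mu+ - mu-)."""
    m_pos, m_neg = len(X_pos), len(X_neg)
    m = m_pos + m_neg
    mu_pos, mu_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    S_pos = np.cov(X_pos, rowvar=False, bias=True)   # per-class scatter (covariance) matrices
    S_neg = np.cov(X_neg, rowvar=False, bias=True)
    d = X_pos.shape[1]
    A = (m_pos / m) * S_pos + (m_neg / m) * S_neg + lam * np.eye(d)
    return np.linalg.solve(A, mu_pos - mu_neg)
```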
23 (2006) that the optimal Gram matrix G based on the kernel function K that maximizes $F_1^*(K)$ given in Equation (4) can be obtained by solving a semidefinite program (SDP) (Vandenberghe and Boyd, 1996; Boyd and Vandenberghe, 2004). [sent-120, score-0.263]
24 It follows that the optimal kernel learning problem in RKDA, which maximizes $F_2^*(K)$ in Equation (10) for a fixed regularization parameter λ, is equivalent to minimizing $\lambda a^T (\lambda I + \tilde{G})^{-1} a = a^T \big(I + \tfrac{1}{\lambda}\tilde{G}\big)^{-1} a$ (Equation (11)), subject to the constraint that $\tilde{G} \in \mathcal{G}$. [sent-143, score-0.264]
25 Mathematically, the optimal kernel learning problem can be formulated as follows: $\min_{\theta} \; a^T \big(I + \tfrac{1}{\lambda}\sum_{i=1}^{p} \theta_i \tilde{G}_i\big)^{-1} a$ subject to $\theta \geq 0, \; \theta^T r = 1$. [sent-144, score-0.222]
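Since $a^T M^{-1} a$ is the matrix-fractional function, which is convex in $M \succ 0$, the program above can be prototyped directly with a disciplined-convex-programming toolbox. The cvxpy sketch below is an assumption on tooling (the paper reports solving its SDP/QCQP/SILP formulations with SeDuMi, not this code); the centered Gram matrices, the vector a, and the constraint weights r are supplied by the caller.

```python
import numpy as np
import cvxpy as cp

def learn_kernel_weights_binary(G_tilde, a, lam, r):
    """Minimize a^T (I + (1/lam) sum_i theta_i G_i)^{-1} a  s.t.  theta >= 0, r^T theta = 1."""
    n = G_tilde[0].shape[0]
    p = len(G_tilde)
    theta = cp.Variable(p, nonneg=True)
    M = (1.0 / lam) * sum(theta[i] * G_tilde[i] for i in range(p)) + np.eye(n)
    objective = cp.Minimize(cp.matrix_frac(a, M))        # a^T M^{-1} a, convex in theta
    constraints = [cp.sum(cp.multiply(r, theta)) == 1]
    cp.Problem(objective, constraints).solve()
    return theta.value
```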
26 In this subsection, we show that this kernel learning problem can be reformulated equivalently as a quadratically constrained quadratic program (QCQP) (Boyd and Vandenberghe, 2004), which can then be solved more efficiently than SDP. [sent-152, score-0.265]
27 1 The optimal kernel function K solving the optimization problem in Equation (11) is also the minimizer of the objective function in Equation (12). [sent-157, score-0.253]
28 The formulation in Equation (13) is a quadratically constrained quadratic program (QCQP), which is a special form of second order cone program (SOCP) (Lobo et al. [sent-180, score-0.229]
29 (2006) for SVM kernel learning involves solving a constrained quadratic program (QP) and a linear program. [sent-212, score-0.235]
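The wrapper structure mentioned here, alternating a linear program over the kernel weights with an inner problem that supplies new constraints, is the exchange strategy underlying SILP-type formulations. The sketch below is a generic column-generation loop with the inner problem abstracted as a callback; it is a schematic under the stated sign convention (maximize the lower bound gamma), not the specific QP/LP pair of the cited method or of the paper's SILP.

```python
import numpy as np
from scipy.optimize import linprog

def silp_exchange(inner_oracle, p, tol=1e-4, max_iter=100):
    """Exchange loop for a semi-infinite LP of the form
         max_{theta, gamma} gamma  s.t.  sum_i theta_i * s_i(beta) >= gamma for all beta,
                                          theta >= 0, sum_i theta_i = 1.
    inner_oracle(theta) returns the vector (s_1(beta*), ..., s_p(beta*)) for the most
    violating beta* under the current weights (problem-specific, left abstract here)."""
    theta = np.full(p, 1.0 / p)
    rows, gamma = [], -np.inf
    for _ in range(max_iter):
        s = np.asarray(inner_oracle(theta))
        if rows and s @ theta >= gamma - tol:            # no sufficiently violated constraint left
            break
        rows.append(s)
        # LP variables x = (theta_1, ..., theta_p, gamma); maximize gamma.
        c = np.zeros(p + 1)
        c[-1] = -1.0
        A_ub = np.hstack([-np.array(rows), np.ones((len(rows), 1))])   # gamma - s.theta <= 0
        b_ub = np.zeros(len(rows))
        A_eq = np.hstack([np.ones((1, p)), np.zeros((1, 1))])
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0, None)] * p + [(None, None)])
        theta, gamma = res.x[:p], res.x[-1]
    return theta
```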
30 4 Time Complexity Analysis We analyze the time complexity of the proposed formulations for the binary-class case. [sent-221, score-0.232]
31 (2004b) that the proposed SDP and QCQP formulations have the worst-case time complexity of $O((p + n)^2 n^2)$. [sent-223, score-0.232]
32 Based on the properties of matrix trace, we have $F_5^*(K) = \sum_{i=1}^{k} (Gh_i)^T (S_t^K)^{-1} Gh_i = \sum_{i=1}^{k} \operatorname{trace}\big((S_t^K)^{-1} Gh_i (Gh_i)^T\big) = \operatorname{trace}\big((S_t^K)^{-1} \sum_{i=1}^{k} Gh_i h_i^T G^T\big) = \operatorname{trace}\big((S_t^K)^{-1} G H H^T G^T\big) = \operatorname{trace}\big((S_t^K)^{-1} S_b^K\big) = F_4^*(K)$. [sent-262, score-0.298]
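The chain of trace identities above is easy to sanity-check numerically; the sketch below draws random matrices of the right shapes and verifies that the sum of quadratic forms equals the single trace expression. All matrix names are stand-ins for the quantities in the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 12, 3
G = rng.standard_normal((n, n))
H = rng.standard_normal((n, k))            # columns h_1, ..., h_k
A = rng.standard_normal((n, n))
St = A @ A.T + np.eye(n)                   # stand-in for S_t^K (positive definite)
St_inv = np.linalg.inv(St)

lhs = sum((G @ H[:, i]) @ St_inv @ (G @ H[:, i]) for i in range(k))
rhs = np.trace(St_inv @ G @ H @ H.T @ G.T)
assert np.isclose(lhs, rhs), "sum of quadratic forms should equal the trace expression"
```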
33 Following the results from the last section for the binary-class case, the optimal kernel learning problem for multi-class RKDA can be formulated as follows: $\min_{t_1,\cdots,t_k,\theta} \; \sum_{j=1}^{k} t_j$ subject to $\begin{pmatrix} I + \frac{1}{\lambda}\sum_{i=1}^{p}\theta_i \tilde{G}_i & h_j \\ h_j^T & t_j \end{pmatrix} \succeq 0$ for $j = 1, \cdots, k$, $\theta \geq 0$, $\theta^T r = 1$. [sent-291, score-0.27]
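Each semidefinite constraint above is the Schur complement of the scalar condition $t_j \geq h_j^T \big(I + \tfrac{1}{\lambda}\sum_i \theta_i \tilde{G}_i\big)^{-1} h_j$, so the multi-class problem can be prototyped as a sum of matrix-fractional terms. The cvxpy sketch below illustrates this structure; it is a hedged prototype, not the paper's SeDuMi implementation.

```python
import numpy as np
import cvxpy as cp

def learn_kernel_weights_multiclass(G_tilde, H, lam, r):
    """Minimize sum_j h_j^T (I + (1/lam) sum_i theta_i G_i)^{-1} h_j  s.t. theta >= 0, r^T theta = 1."""
    n = G_tilde[0].shape[0]
    p = len(G_tilde)
    k = H.shape[1]
    theta = cp.Variable(p, nonneg=True)
    M = (1.0 / lam) * sum(theta[i] * G_tilde[i] for i in range(p)) + np.eye(n)
    # Each summand equals t_j at the optimum of the SDP with the Schur-complement constraint.
    objective = cp.Minimize(sum(cp.matrix_frac(H[:, j], M) for j in range(k)))
    cp.Problem(objective, [cp.sum(cp.multiply(r, theta)) == 1]).solve()
    return theta.value
```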
34 1 Given a set of p centered kernel matrices $G_1, \cdots, G_p$, the optimal kernel matrix, in the form of a linear combination of the given p kernel matrices, that maximizes the objective function in Equation (26) can be found by solving the SDP problem in Equation (29). [sent-319, score-0.623]
35 The formulation in Equation (32) is an approximation to the exact formulation in Equation (29). [sent-337, score-0.23]
36 In order to formulate the multi-class RKDA kernel learning problem as a QCQP problem, we first consider the minimization of the following objective function: $F_6(W, K) = \sum_{i=1}^{k} \big( \|(\phi_K(X) P)^T w_i - h_i\|^2 + \lambda \|w_i\|^2 \big)$ (Equation (33)), where $W = [w_1, \cdots, w_k]$. [sent-344, score-0.285]
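Each term of this objective is a regularized least-squares (kernel ridge) problem with a class-indicator target, so the minimizing $w_i$ has a closed form whose dual coefficients involve only the centered Gram matrix. The sketch below records that closed form; the centering and the targets H are assumed to be supplied by the caller, and it is a schematic of the equivalence rather than the paper's code.

```python
import numpy as np

def kernel_least_squares_dual(G_tilde, H, lam):
    """Dual coefficients of the regularized least-squares problems, one column per class.

    G_tilde : centered Gram matrix (e.g., P^T K P), H : class-indicator targets, lam : regularization.
    Each column alpha_j satisfies (G_tilde + lam I) alpha_j = h_j, so training-set
    predictions are G_tilde @ alpha.
    """
    n = G_tilde.shape[0]
    return np.linalg.solve(G_tilde + lam * np.eye(n), H)
```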
37 Motivated by this equivalence result, we derive an efficient QCQP formulation for the multi-class RKDA kernel learning problem, as summarized in the following theorem. [sent-349, score-0.286]
38 Since strong duality holds, the optimal kernel matrix $\tilde{G}$ is given by solving the following optimization problem. [sent-360, score-0.247]
39 4 Time Complexity Analysis In this subsection, we analyze the time complexity of the proposed formulations in the multi-class case. [sent-380, score-0.232]
40 By following a similar analysis to the binary-class case, we can show that the proposed (approximate) SDP and QCQP formulations have worst-case time complexity of $O((p + n)^2 (k + n)^2)$. [sent-381, score-0.232]
41 The complexity of multi-class RKDA kernel learning formulations is summarized in Table 1. [sent-385, score-0.375]
42 Joint Kernel and Regularization Parameter Learning The formulations presented in the last two sections focus on the estimation of the kernel matrix only, while the regularization parameter λ is pre-specified. [sent-387, score-0.451]
43 In this section, we show that all the formulations proposed in this paper can be reformulated equivalently, and this new formulation leads naturally to the estimation of the regularization parameter λ in a joint framework. [sent-389, score-0.419]
44 1 Joint Learning for Binary-class Problems One key advantage of the kernel learning formulation in Equation (8) in comparison with the one in Kim et al. [sent-392, score-0.286]
45 In particular, all the formulations (SDP, QCQP, and SILP) for the binary-class RKDA kernel learning problems, presented in Theorems 2. [sent-394, score-0.375]
46 We can thus treat the regularization parameter as one of the coefficients for the kernel matrix and optimize them simultaneously. [sent-407, score-0.271]
47 This can be formulated to optimize the regularization parameter as one of the coefficients for the kernel matrix as follows: $\max_{\beta, t} \; \beta^T a - \frac{1}{4} t$ subject to $t \geq \beta^T \tilde{G}_i \beta, \; i = 0, \cdots, p$. [sent-415, score-0.322]
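As the constraint index i = 0, ..., p above suggests, the joint formulation treats the regularization term as an extra base kernel: the identity matrix is prepended to the list of centered Gram matrices so that its weight plays the role of the regularization parameter. The cvxpy sketch below shows this bookkeeping and the resulting QCQP; it is an illustrative prototype, not the authors' code, and in formulations of this type the kernel weights can typically be read off the constraints' dual variables up to scaling.

```python
import numpy as np
import cvxpy as cp

def joint_kernel_and_regularization(G_tilde, a):
    """QCQP sketch: max_{beta,t} beta^T a - t/4  s.t.  t >= beta^T G_i beta for i = 0..p,
    where G_0 = I absorbs the regularization parameter into the kernel weights."""
    n = G_tilde[0].shape[0]
    G_aug = [np.eye(n)] + list(G_tilde)   # index 0 plays the role of lambda * I
    beta = cp.Variable(n)
    t = cp.Variable()
    # Each G must be (numerically) positive semidefinite for quad_form to be accepted.
    constraints = [t >= cp.quad_form(beta, G) for G in G_aug]
    prob = cp.Problem(cp.Maximize(beta @ a - t / 4.0), constraints)
    prob.solve()
    # constraints[i].dual_value relates (up to scaling) to the weight of kernel i.
    return beta.value, t.value
```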
48 2 Joint Learning for Multi-class Problems All formulations for the multi-class RKDA kernel learning problems presented in Section 3 can be recast to optimize the regularization parameter jointly. [sent-423, score-0.475]
49 1 and noticing the relationship with the binary-class case, we derive the following SDP formulation for the multi-class RKDA kernel learning problem: $\min_{t_1,\cdots,t_k,\tilde{\theta}} \; \sum_{j=1}^{k} t_j$ subject to $\begin{pmatrix} \sum_{i=0}^{p}\tilde{\theta}_i \tilde{G}_i & h_1 & h_2 & \cdots \\ h_1^T & t_1 & 0 & \cdots \\ h_2^T & 0 & t_2 & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix} \succeq 0$. [sent-429, score-0.334]
50 We demonstrate the effectiveness of the proposed MKL formulations for heterogeneous data integration in the second part of the experiments. [sent-462, score-0.257]
51 The SDP formulations in Equations (8), (32), (41), and (45) are solved using the optimization package SeDuMi (Sturm, 1999). [sent-463, score-0.23]
52 The source codes of the proposed formulations for the experiments are available online. [sent-467, score-0.232]
53 2 We first evaluate the proposed formulations for binary-class problems in Section 5. [sent-468, score-0.232]
54 We demonstrate the effectiveness of the proposed formulations for heterogeneous data integration in Section 5. [sent-472, score-0.257]
55 1 Experiments on Binary-class Problems In the binary-class case, we compare our formulations with the 1-norm soft margin SVM, 2-norm soft margin SVM with and without the regularization parameter C optimized jointly as proposed in 1. [sent-478, score-0.419]
56 1, SDP formulation with λ optimized jointly as proposed in Equation (41), QCQP formulation with λ fixed as proposed in Theorem 2. [sent-606, score-0.315]
57 2, QCQP formulation with λ optimized jointly as proposed in Equation (43), SILP formulation with λ fixed as proposed in Theorem 2. [sent-607, score-0.315]
58 3, SILP formulation with λ optimized jointly as proposed in Equation (44), SDP formulation proposed in Kim et al. [sent-608, score-0.315]
59 The ten pre-specified kernels are all RBF kernels and the σ values used are 0. [sent-612, score-0.253]
60 The coefficients for the proposed six formulations are normalized to sum to one while those for other compared approaches are reported as obtained from their formulations. [sent-626, score-0.232]
61 The second section includes RKDA and SVM with kernel and regularization parameter chosen by double cross-validation. [sent-628, score-0.255]
62 The third section shows the accuracies of RKDA and SVM when the kernel is fixed and the regularization parameters are chosen by cross-validation. [sent-630, score-0.257]
63 Accuracy of RKDA when the kernel is fixed to each of the ten candidate kernels and λ is chosen by cross-validation. [sent-647, score-0.318]
64 Accuracy of SVM when the kernel is fixed to each of the ten candidate kernels and C is chosen by cross-validation. [sent-648, score-0.318]
65 In terms of performance, formulations that optimize λ jointly achieve similar accuracies to the ones with λ fixed. [sent-668, score-0.295]
66 Thus, formulations that optimize λ jointly are expected to work better in such situations. [sent-671, score-0.257]
67 To understand the relative importance of each kernel when used individually, we fix the kernel to each of the ten pre-specified kernels, tune the regularization parameter using cross-validation, and record the accuracy of each kernel. [sent-676, score-0.708]
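The per-kernel protocol just described amounts to a simple loop: for each candidate Gram matrix, select the regularization parameter on held-out data and record the accuracy. The sketch below uses a plain hold-out split and a kernel-ridge-style classifier as a stand-in for RKDA; the split, the λ grid, and the classifier are assumptions, not the paper's exact cross-validation protocol.

```python
import numpy as np

def holdout_accuracy(K, y, lam, tr, te):
    """Kernel-ridge-style classifier on a precomputed Gram matrix (stand-in for RKDA); y in {-1,+1}."""
    alpha = np.linalg.solve(K[np.ix_(tr, tr)] + lam * np.eye(len(tr)), y[tr])
    pred = np.sign(K[np.ix_(te, tr)] @ alpha)
    return np.mean(pred == y[te])

def best_lambda_per_kernel(grams, y, lam_grid, tr, te):
    """For each candidate kernel, pick lambda on held-out data and record the resulting accuracy."""
    results = []
    for K in grams:
        accs = [holdout_accuracy(K, y, lam, tr, te) for lam in lam_grid]
        results.append((lam_grid[int(np.argmax(accs))], max(accs)))
    return results
```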
68 For the heart data, cross-validated SVM favors kernels corresponding to θ 9 and θ10 (they were chosen 9 and 17 times out of 30, respectively) while cross-validated RKDA uses kernels corresponding to θ7 , θ8 , and θ9 most frequently. [sent-680, score-0.237]
69 Our six formulations all give kernels corresponding to θ1 and θ10 large weights, especially to θ1, while SVM-based MKL formulations all set θ10 to zero. [sent-681, score-0.514]
70 Another interesting observation is that all the ten MKL formulations give the first kernel a large weight while it is the worst kernel when used individually. [sent-683, score-0.587]
71 This implies that the best individual kernel may not be assigned a large weight when used in combination with others, and a poorly performing individual kernel may contain complementary information that is useful when combined with other kernels. [sent-684, score-0.363]
72 For the ionosphere data, the best three individual kernels chosen by cross-validation are kernels corresponding to θ5 , θ6 and θ7 . [sent-686, score-0.241]
73 For the cancer data, all kernels achieve similar performance when used separately, while MKL-based formulations tend to assign a large weight to the kernel corresponding to θ2. [sent-688, score-0.503]
74 To compare the efficiency of the proposed formulations with methods based on cross-validation, we record the computation time of the proposed QCQP and SILP formulations along with that of methods based on double cross-validation. [sent-689, score-0.5]
75 It can be seen that the proposed SILP formulations are more efficient than cross-validation based methods. [sent-698, score-0.232]
76 2 Experiments on Multi-class Problems In the multi-class experiments, we compare our formulations with RKDA and SVM with kernels and regularization parameters tuned using double cross-validation. [sent-928, score-0.394]
77 In general, all the six proposed formulations achieve similar performance on the ten data sets. [sent-1082, score-0.273]
78 Compared to the QCQP and SILP formulations, which are exact, our approximate SDP formulation for the multi-class problems works well in most cases. [sent-1083, score-0.319]
79 1, the SDP formulation with λ optimized jointly as proposed in Equation (45), the QCQP formulation with λ fixed as proposed in Theorem 3. [sent-1159, score-0.315]
80 2, the QCQP formulation with λ optimized jointly as proposed in Equation (46), the SILP formulation with λ fixed as proposed in Theorem 3. [sent-1160, score-0.315]
81 3, the SILP formulation with λ optimized jointly as proposed in Equation (47), RKDA and SVM with kernels and regularization parameters chosen by double cross-validation. [sent-1161, score-0.362]
82 Thus for these data sets, the kernels selected by cross-validation and multiple kernel learning (MKL) agree. [sent-1239, score-0.277]
83 The first ten columns show the number of times that a kernel is chosen by doubly cross-validated RKDA over 30 randomizations. [sent-1246, score-0.239]
84 The first ten columns show the number of times that a kernel is chosen by doubly cross-validated SVM over 30 randomizations. [sent-1249, score-0.239]
85 We expect that complementary information exists among kernels for this data set such that a subset of kernels can be combined to obtain the optimal performance though none of them is the best kernel when used individually. [sent-1478, score-0.404]
86 A similar phenomenon can be observed on the segment(3) and segment(4) data sets, in which the first kernel is assigned the largest weight by the MKL-based formulations while it is never selected by cross-validation. [sent-1479, score-0.375]
87 In contrast, the proposed SILP formulations are more efficient than methods based on cross-validation on all of the ten data sets. [sent-1694, score-0.273]
88 3 Gene Expression Pattern Image Classification In this experiment, we demonstrate the effectiveness of the proposed multiple kernel learning (MKL) formulations for data (feature) integration. [sent-1696, score-0.403]
89 To exploit the complementary information in kernels constructed from different Gabor images, we apply the proposed SILP formulation to learn a linear combination of the 48 kernel matrices. [sent-1949, score-0.441]
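Once any of the formulations has produced a weight vector over the 48 Gabor-derived Gram matrices, prediction reduces to forming the weighted combination and training a single classifier on it; the uniform-weight combination discussed a few sentences below is the same operation with equal weights. The sketch below shows this combination step with hypothetical array names; it is not the authors' pipeline.

```python
import numpy as np

def combine_grams(grams, weights):
    """Weighted combination of precomputed Gram matrices (e.g., one per Gabor-filtered image)."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()        # normalize to sum to one, as in the paper's convention
    return sum(w * K for w, K in zip(weights, grams))

# grams = [...]  # 48 precomputed Gram matrices, one per Gabor image representation (hypothetical)
# K_mkl     = combine_grams(grams, learned_theta)        # weights from the SILP formulation
# K_uniform = combine_grams(grams, np.ones(len(grams)))  # uniform-weight baseline
```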
90 To see how each of the 48 kernel matrices works when used individually, we fix the kernel matrix and tune the λ value using cross-validation. [sent-1953, score-0.403]
91 The maximum, minimum, and average accuracies achieved across the 48 kernel matrices are 72. [sent-1954, score-0.242]
92 We also assign a uniform weight of 1 to each of the 48 kernel matrices and the combined kernel matrix achieves an accuracy of about 72. [sent-1958, score-0.403]
93 These results demonstrate that different Gabor images contain complementary information, which is critical for stage classification, and the proposed MKL formulations are effective in exploiting this information by combining different kernel matrices. [sent-1960, score-0.461]
94 While most existing work on kernel learning deals only with binary-class problems, we show that our binary-class formulations can be extended naturally to the multi-class setting. [sent-2037, score-0.375]
95 When combining kernels from a single source of data, the proposed formulations achieve performance similar to that of approaches based on double cross-validation. [sent-2040, score-0.374]
96 When the candidate kernels contain complementary information, we show that the proposed formulations are effective in exploiting such information. [sent-2041, score-0.359]
97 When evaluating the relative importance of each kernel (either used separately or in linear combination), we found that the best individual kernel sometimes coincides with the highly-weighted kernels in linear combination and sometimes disagrees considerably. [sent-2043, score-0.448]
98 This results in the same optimal transformation matrix as the original criterion in Equation (26) when a common (fixed) kernel matrix is used. [sent-2048, score-0.227]
99 Most existing formulations for learning SVM kernels are restricted to the binary-class case. [sent-2051, score-0.31]
100 The implementation of the proposed QCQP formulations is based on code for SVM kernel learning provided by Dr. [sent-2083, score-0.403]
wordName wordTfidf (topN-words)
[('silp', 0.458), ('qcqp', 0.421), ('sdp', 0.368), ('rkda', 0.351), ('formulations', 0.204), ('kernel', 0.171), ('iscriminant', 0.133), ('rkdak', 0.119), ('hen', 0.119), ('lanckriet', 0.117), ('formulation', 0.115), ('kernels', 0.106), ('tsa', 0.098), ('stk', 0.089), ('ernel', 0.086), ('mkl', 0.085), ('ghi', 0.084), ('gi', 0.083), ('caption', 0.077), ('equation', 0.074), ('footnotes', 0.071), ('pgp', 0.07), ('satimage', 0.068), ('gabor', 0.064), ('svm', 0.06), ('kim', 0.059), ('irm', 0.056), ('trace', 0.054), ('ri', 0.053), ('regularization', 0.048), ('discriminant', 0.047), ('earning', 0.044), ('semide', 0.043), ('ormulation', 0.042), ('ten', 0.041), ('ye', 0.041), ('hi', 0.04), ('wi', 0.04), ('si', 0.038), ('accuracies', 0.038), ('images', 0.037), ('sb', 0.037), ('double', 0.036), ('sonar', 0.036), ('ite', 0.036), ('sedumi', 0.036), ('andersen', 0.035), ('drosophila', 0.035), ('sdpkim', 0.035), ('svmc', 0.035), ('convex', 0.035), ('wt', 0.034), ('objective', 0.034), ('matrices', 0.033), ('usps', 0.033), ('hk', 0.032), ('soft', 0.03), ('mika', 0.03), ('developmental', 0.03), ('ionosphere', 0.029), ('pg', 0.029), ('jointly', 0.029), ('quadratically', 0.028), ('matrix', 0.028), ('gene', 0.028), ('gpg', 0.028), ('recast', 0.028), ('svs', 0.028), ('proposed', 0.028), ('vandenberghe', 0.028), ('formulated', 0.027), ('doubly', 0.027), ('ni', 0.026), ('master', 0.026), ('boyd', 0.026), ('optimization', 0.026), ('heterogeneous', 0.025), ('heart', 0.025), ('margin', 0.025), ('programming', 0.025), ('reformulated', 0.024), ('tj', 0.024), ('subject', 0.024), ('optimize', 0.024), ('pct', 0.024), ('hettich', 0.024), ('sonnenburg', 0.023), ('cone', 0.023), ('cancer', 0.022), ('solving', 0.022), ('table', 0.022), ('program', 0.021), ('ghh', 0.021), ('jianhui', 0.021), ('shuiwang', 0.021), ('tomancak', 0.021), ('maximizes', 0.021), ('lagrangian', 0.021), ('constrained', 0.021), ('complementary', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999946 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
Author: Jieping Ye, Shuiwang Ji, Jianhui Chen
Abstract: Regularized kernel discriminant analysis (RKDA) performs linear discriminant analysis in the feature space via the kernel trick. Its performance depends on the selection of kernels. In this paper, we consider the problem of multiple kernel learning (MKL) for RKDA, in which the optimal kernel matrix is obtained as a linear combination of pre-specified kernel matrices. We show that the kernel learning problem in RKDA can be formulated as convex programs. First, we show that this problem can be formulated as a semidefinite program (SDP). Based on the equivalence relationship between RKDA and least square problems in the binary-class case, we propose a convex quadratically constrained quadratic programming (QCQP) formulation for kernel learning in RKDA. A semi-infinite linear programming (SILP) formulation is derived to further improve the efficiency. We extend these formulations to the multi-class case based on a key result established in this paper. That is, the multi-class RKDA kernel learning problem can be decomposed into a set of binary-class kernel learning problems which are constrained to share a common kernel. Based on this decomposition property, SDP formulations are proposed for the multi-class case. Furthermore, it leads naturally to QCQP and SILP formulations. As the performance of RKDA depends on the regularization parameter, we show that this parameter can also be optimized in a joint framework with the kernel. Extensive experiments have been conducted and analyzed, and connections to other algorithms are discussed. Keywords: model selection, kernel discriminant analysis, semidefinite programming, quadratically constrained quadratic programming, semi-infinite linear programming
2 0.24565193 86 jmlr-2008-SimpleMKL
Author: Alain Rakotomamonjy, Francis R. Bach, Stéphane Canu, Yves Grandvalet
Abstract: Multiple kernel learning (MKL) aims at simultaneously learning a kernel and the associated predictor in supervised learning settings. For the support vector machine, an efficient and general multiple kernel learning algorithm, based on semi-infinite linear programming, has been recently proposed. This approach has opened new perspectives since it makes MKL tractable for large-scale problems, by iteratively using existing support vector machine code. However, it turns out that this iterative algorithm needs numerous iterations for converging towards a reasonable solution. In this paper, we address the MKL problem through a weighted 2-norm regularization formulation with an additional constraint on the weights that encourages sparse kernel combinations. Apart from learning the combination, we solve a standard SVM optimization problem, where the kernel is defined as a linear combination of multiple kernels. We propose an algorithm, named SimpleMKL, for solving this MKL problem and provide a new insight on MKL algorithms based on mixed-norm regularization by showing that the two approaches are equivalent. We show how SimpleMKL can be applied beyond binary classification, for problems like regression, clustering (one-class classification) or multiclass classification. Experimental results show that the proposed algorithm converges rapidly and that its efficiency compares favorably to other MKL algorithms. Finally, we illustrate the usefulness of MKL for some regressors based on wavelet kernels and on some model selection problems related to multiclass classification problems. Keywords: multiple kernel learning, support vector machines, support vector regression, multiclass SVM, gradient descent ∗. Also at Heudiasyc, CNRS/Universit´ de Technologie de Compi` gne (UMR 6599), 60205 Compi` gne, France. e e e c 2008 Alain Rakotomamonjy, Francis R. Bach, St´ phane Canu and Yves Grandvalet. e R AKOTOMAMONJY, BACH , C ANU AND G RANDVALET
3 0.11369937 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
Author: Mikio L. Braun, Joachim M. Buhmann, Klaus-Robert Müller
Abstract: We show that the relevant information of a supervised learning problem is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem in the sense that it can asymptotically represent the function to be learned and is sufficiently smooth. Thus, kernels do not only transform data sets such that good generalization can be achieved using only linear discriminant functions, but this transformation is also performed in a manner which makes economical use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data for supervised learning problems. Practically, we propose an algorithm which enables us to recover the number of leading kernel PCA components relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to aid in model selection, and (3) to denoise in feature space in order to yield better classification results. Keywords: kernel methods, feature space, dimension reduction, effective dimensionality
4 0.087204695 92 jmlr-2008-Universal Multi-Task Kernels
Author: Andrea Caponnetto, Charles A. Micchelli, Massimiliano Pontil, Yiming Ying
Abstract: In this paper we are concerned with reproducing kernel Hilbert spaces HK of functions from an input space into a Hilbert space Y , an environment appropriate for multi-task learning. The reproducing kernel K associated to HK has its values as operators on Y . Our primary goal here is to derive conditions which ensure that the kernel K is universal. This means that on every compact subset of the input space, every continuous function with values in Y can be uniformly approximated by sections of the kernel. We provide various characterizations of universal kernels and highlight them with several concrete examples of some practical importance. Our analysis uses basic principles of functional analysis and especially the useful notion of vector measures which we describe in sufficient detail to clarify our results. Keywords: multi-task learning, multi-task kernels, universal approximation, vector-valued reproducing kernel Hilbert spaces
5 0.077915132 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
Author: Hsuan-Tien Lin, Ling Li
Abstract: Ensemble learning algorithms such as boosting can achieve better performance by averaging over the predictions of some base hypotheses. Nevertheless, most existing algorithms are limited to combining only a finite number of hypotheses, and the generated ensemble is usually sparse. Thus, it is not clear whether we should construct an ensemble classifier with a larger or even an infinite number of hypotheses. In addition, constructing an infinite ensemble itself is a challenging task. In this paper, we formulate an infinite ensemble learning framework based on the support vector machine (SVM). The framework can output an infinite and nonsparse ensemble through embedding infinitely many hypotheses into an SVM kernel. We use the framework to derive two novel kernels, the stump kernel and the perceptron kernel. The stump kernel embodies infinitely many decision stumps, and the perceptron kernel embodies infinitely many perceptrons. We also show that the Laplacian radial basis function kernel embodies infinitely many decision trees, and can thus be explained through infinite ensemble learning. Experimental results show that SVM with these kernels is superior to boosting with the same base hypothesis set. In addition, SVM with the stump kernel or the perceptron kernel performs similarly to SVM with the Gaussian radial basis function kernel, but enjoys the benefit of faster parameter selection. These properties make the novel kernels favorable choices in practice. Keywords: ensemble learning, boosting, support vector machine, kernel
6 0.077178776 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
7 0.0685848 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
8 0.061208148 23 jmlr-2008-Comments on the Complete Characterization of a Family of Solutions to a GeneralizedFisherCriterion
9 0.05988355 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
10 0.057238258 58 jmlr-2008-Max-margin Classification of Data with Absent Features
11 0.046721023 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
12 0.046714891 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
13 0.038073055 83 jmlr-2008-Robust Submodular Observation Selection
14 0.037522431 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
15 0.036574747 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
16 0.036400463 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds
17 0.03418659 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
18 0.031841192 15 jmlr-2008-An Information Criterion for Variable Selection in Support Vector Machines (Special Topic on Model Selection)
19 0.031239845 71 jmlr-2008-On the Equivalence of Linear Dimensionality-Reducing Transformations
20 0.029711505 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
topicId topicWeight
[(0, 0.198), (1, -0.14), (2, 0.094), (3, 0.029), (4, 0.21), (5, -0.119), (6, 0.074), (7, 0.373), (8, 0.038), (9, 0.088), (10, 0.069), (11, -0.035), (12, 0.069), (13, -0.045), (14, -0.136), (15, -0.118), (16, 0.023), (17, 0.271), (18, 0.066), (19, -0.038), (20, 0.021), (21, -0.225), (22, 0.131), (23, 0.019), (24, -0.132), (25, 0.18), (26, -0.081), (27, -0.012), (28, 0.177), (29, -0.019), (30, 0.071), (31, 0.002), (32, -0.005), (33, 0.014), (34, -0.113), (35, -0.019), (36, -0.05), (37, -0.057), (38, 0.052), (39, 0.0), (40, -0.001), (41, -0.081), (42, -0.054), (43, -0.023), (44, 0.039), (45, -0.017), (46, -0.064), (47, -0.034), (48, -0.029), (49, 0.037)]
simIndex simValue paperId paperTitle
same-paper 1 0.9203977 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
Author: Jieping Ye, Shuiwang Ji, Jianhui Chen
Abstract: Regularized kernel discriminant analysis (RKDA) performs linear discriminant analysis in the feature space via the kernel trick. Its performance depends on the selection of kernels. In this paper, we consider the problem of multiple kernel learning (MKL) for RKDA, in which the optimal kernel matrix is obtained as a linear combination of pre-specified kernel matrices. We show that the kernel learning problem in RKDA can be formulated as convex programs. First, we show that this problem can be formulated as a semidefinite program (SDP). Based on the equivalence relationship between RKDA and least square problems in the binary-class case, we propose a convex quadratically constrained quadratic programming (QCQP) formulation for kernel learning in RKDA. A semi-infinite linear programming (SILP) formulation is derived to further improve the efficiency. We extend these formulations to the multi-class case based on a key result established in this paper. That is, the multi-class RKDA kernel learning problem can be decomposed into a set of binary-class kernel learning problems which are constrained to share a common kernel. Based on this decomposition property, SDP formulations are proposed for the multi-class case. Furthermore, it leads naturally to QCQP and SILP formulations. As the performance of RKDA depends on the regularization parameter, we show that this parameter can also be optimized in a joint framework with the kernel. Extensive experiments have been conducted and analyzed, and connections to other algorithms are discussed. Keywords: model selection, kernel discriminant analysis, semidefinite programming, quadratically constrained quadratic programming, semi-infinite linear programming
2 0.88968164 86 jmlr-2008-SimpleMKL
Author: Alain Rakotomamonjy, Francis R. Bach, Stéphane Canu, Yves Grandvalet
Abstract: Multiple kernel learning (MKL) aims at simultaneously learning a kernel and the associated predictor in supervised learning settings. For the support vector machine, an efficient and general multiple kernel learning algorithm, based on semi-infinite linear programming, has been recently proposed. This approach has opened new perspectives since it makes MKL tractable for large-scale problems, by iteratively using existing support vector machine code. However, it turns out that this iterative algorithm needs numerous iterations for converging towards a reasonable solution. In this paper, we address the MKL problem through a weighted 2-norm regularization formulation with an additional constraint on the weights that encourages sparse kernel combinations. Apart from learning the combination, we solve a standard SVM optimization problem, where the kernel is defined as a linear combination of multiple kernels. We propose an algorithm, named SimpleMKL, for solving this MKL problem and provide a new insight on MKL algorithms based on mixed-norm regularization by showing that the two approaches are equivalent. We show how SimpleMKL can be applied beyond binary classification, for problems like regression, clustering (one-class classification) or multiclass classification. Experimental results show that the proposed algorithm converges rapidly and that its efficiency compares favorably to other MKL algorithms. Finally, we illustrate the usefulness of MKL for some regressors based on wavelet kernels and on some model selection problems related to multiclass classification problems. Keywords: multiple kernel learning, support vector machines, support vector regression, multiclass SVM, gradient descent ∗. Also at Heudiasyc, CNRS/Universit´ de Technologie de Compi` gne (UMR 6599), 60205 Compi` gne, France. e e e c 2008 Alain Rakotomamonjy, Francis R. Bach, St´ phane Canu and Yves Grandvalet. e R AKOTOMAMONJY, BACH , C ANU AND G RANDVALET
3 0.3520143 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
Author: Hsuan-Tien Lin, Ling Li
Abstract: Ensemble learning algorithms such as boosting can achieve better performance by averaging over the predictions of some base hypotheses. Nevertheless, most existing algorithms are limited to combining only a finite number of hypotheses, and the generated ensemble is usually sparse. Thus, it is not clear whether we should construct an ensemble classifier with a larger or even an infinite number of hypotheses. In addition, constructing an infinite ensemble itself is a challenging task. In this paper, we formulate an infinite ensemble learning framework based on the support vector machine (SVM). The framework can output an infinite and nonsparse ensemble through embedding infinitely many hypotheses into an SVM kernel. We use the framework to derive two novel kernels, the stump kernel and the perceptron kernel. The stump kernel embodies infinitely many decision stumps, and the perceptron kernel embodies infinitely many perceptrons. We also show that the Laplacian radial basis function kernel embodies infinitely many decision trees, and can thus be explained through infinite ensemble learning. Experimental results show that SVM with these kernels is superior to boosting with the same base hypothesis set. In addition, SVM with the stump kernel or the perceptron kernel performs similarly to SVM with the Gaussian radial basis function kernel, but enjoys the benefit of faster parameter selection. These properties make the novel kernels favorable choices in practice. Keywords: ensemble learning, boosting, support vector machine, kernel
4 0.33776781 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
Author: Mikio L. Braun, Joachim M. Buhmann, Klaus-Robert Müller
Abstract: We show that the relevant information of a supervised learning problem is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem in the sense that it can asymptotically represent the function to be learned and is sufficiently smooth. Thus, kernels do not only transform data sets such that good generalization can be achieved using only linear discriminant functions, but this transformation is also performed in a manner which makes economical use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data for supervised learning problems. Practically, we propose an algorithm which enables us to recover the number of leading kernel PCA components relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to aid in model selection, and (3) to denoise in feature space in order to yield better classification results. Keywords: kernel methods, feature space, dimension reduction, effective dimensionality
5 0.32499039 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
Author: Matthias W. Seeger
Abstract: We propose a highly efficient framework for penalized likelihood kernel methods applied to multiclass models with a large, structured set of classes. As opposed to many previous approaches which try to decompose the fitting problem into many smaller ones, we focus on a Newton optimization of the complete model, making use of model structure and linear conjugate gradients in order to approximate Newton search directions. Crucially, our learning method is based entirely on matrix-vector multiplication primitives with the kernel matrices and their derivatives, allowing straightforward specialization to new kernels, and focusing code optimization efforts to these primitives only. Kernel parameters are learned automatically, by maximizing the cross-validation log likelihood in a gradient-based way, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical structure on thousands of classes, achieving state-of-the-art results in an order of magnitude less time than previous work. Parts of this work appeared in the conference paper Seeger (2007). Keywords: multi-way classification, kernel logistic regression, hierarchical classification, cross validation optimization, Newton-Raphson optimization
6 0.29513794 92 jmlr-2008-Universal Multi-Task Kernels
7 0.25346074 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
8 0.19911323 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
9 0.19267361 58 jmlr-2008-Max-margin Classification of Data with Absent Features
10 0.18309674 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
11 0.17625502 23 jmlr-2008-Comments on the Complete Characterization of a Family of Solutions to a GeneralizedFisherCriterion
12 0.17557201 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
13 0.17418528 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
14 0.16415277 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
15 0.150572 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
16 0.14696515 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
17 0.14602423 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
18 0.14291696 52 jmlr-2008-Learning from Multiple Sources
20 0.13879324 85 jmlr-2008-Shark (Machine Learning Open Source Software Paper)
topicId topicWeight
[(0, 0.028), (5, 0.023), (9, 0.01), (40, 0.053), (54, 0.046), (58, 0.06), (66, 0.059), (74, 0.021), (76, 0.021), (78, 0.027), (88, 0.058), (92, 0.023), (93, 0.384), (94, 0.057), (99, 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.72148556 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
Author: Jieping Ye, Shuiwang Ji, Jianhui Chen
Abstract: Regularized kernel discriminant analysis (RKDA) performs linear discriminant analysis in the feature space via the kernel trick. Its performance depends on the selection of kernels. In this paper, we consider the problem of multiple kernel learning (MKL) for RKDA, in which the optimal kernel matrix is obtained as a linear combination of pre-specified kernel matrices. We show that the kernel learning problem in RKDA can be formulated as convex programs. First, we show that this problem can be formulated as a semidefinite program (SDP). Based on the equivalence relationship between RKDA and least square problems in the binary-class case, we propose a convex quadratically constrained quadratic programming (QCQP) formulation for kernel learning in RKDA. A semi-infinite linear programming (SILP) formulation is derived to further improve the efficiency. We extend these formulations to the multi-class case based on a key result established in this paper. That is, the multi-class RKDA kernel learning problem can be decomposed into a set of binary-class kernel learning problems which are constrained to share a common kernel. Based on this decomposition property, SDP formulations are proposed for the multi-class case. Furthermore, it leads naturally to QCQP and SILP formulations. As the performance of RKDA depends on the regularization parameter, we show that this parameter can also be optimized in a joint framework with the kernel. Extensive experiments have been conducted and analyzed, and connections to other algorithms are discussed. Keywords: model selection, kernel discriminant analysis, semidefinite programming, quadratically constrained quadratic programming, semi-infinite linear programming
2 0.64753288 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
Author: Gal Elidan, Stephen Gould
Abstract: With the increased availability of data for complex domains, it is desirable to learn Bayesian network structures that are sufficiently expressive for generalization while at the same time allow for tractable inference. While the method of thin junction trees can, in principle, be used for this purpose, its fully greedy nature makes it prone to overfitting, particularly when data is scarce. In this work we present a novel method for learning Bayesian networks of bounded treewidth that employs global structure modifications and that is polynomial both in the size of the graph and the treewidth bound. At the heart of our method is a dynamic triangulation that we update in a way that facilitates the addition of chain structures that increase the bound on the model’s treewidth by at most one. We demonstrate the effectiveness of our “treewidth-friendly” method on several real-life data sets and show that it is superior to the greedy approach as soon as the bound on the treewidth is nontrivial. Importantly, we also show that by making use of global operators, we are able to achieve better generalization even when learning Bayesian networks of unbounded treewidth. Keywords: Bayesian networks, structure learning, model selection, bounded treewidth
3 0.32586351 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
Author: Matthias W. Seeger
Abstract: The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007). Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing
4 0.32296205 86 jmlr-2008-SimpleMKL
Author: Alain Rakotomamonjy, Francis R. Bach, Stéphane Canu, Yves Grandvalet
Abstract: Multiple kernel learning (MKL) aims at simultaneously learning a kernel and the associated predictor in supervised learning settings. For the support vector machine, an efficient and general multiple kernel learning algorithm, based on semi-infinite linear programming, has been recently proposed. This approach has opened new perspectives since it makes MKL tractable for large-scale problems, by iteratively using existing support vector machine code. However, it turns out that this iterative algorithm needs numerous iterations for converging towards a reasonable solution. In this paper, we address the MKL problem through a weighted 2-norm regularization formulation with an additional constraint on the weights that encourages sparse kernel combinations. Apart from learning the combination, we solve a standard SVM optimization problem, where the kernel is defined as a linear combination of multiple kernels. We propose an algorithm, named SimpleMKL, for solving this MKL problem and provide a new insight on MKL algorithms based on mixed-norm regularization by showing that the two approaches are equivalent. We show how SimpleMKL can be applied beyond binary classification, for problems like regression, clustering (one-class classification) or multiclass classification. Experimental results show that the proposed algorithm converges rapidly and that its efficiency compares favorably to other MKL algorithms. Finally, we illustrate the usefulness of MKL for some regressors based on wavelet kernels and on some model selection problems related to multiclass classification problems. Keywords: multiple kernel learning, support vector machines, support vector regression, multiclass SVM, gradient descent ∗. Also at Heudiasyc, CNRS/Universit´ de Technologie de Compi` gne (UMR 6599), 60205 Compi` gne, France. e e e c 2008 Alain Rakotomamonjy, Francis R. Bach, St´ phane Canu and Yves Grandvalet. e R AKOTOMAMONJY, BACH , C ANU AND G RANDVALET
5 0.32112768 58 jmlr-2008-Max-margin Classification of Data with Absent Features
Author: Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, Daphne Koller
Abstract: We consider the problem of learning classifiers in structured domains, where some objects have a subset of features that are inherently absent due to complex relationships between the features. Unlike the case where a feature exists but its value is not observed, here we focus on the case where a feature may not even exist (structurally absent) for some of the samples. The common approach for handling missing features in discriminative models is to first complete their unknown values, and then use a standard classification procedure over the completed data. This paper focuses on features that are known to be non-existing, rather than have an unknown value. We show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate an objective function, based on the geometric interpretation of the margin, that aims to maximize the margin of each sample in its own relevant subspace. In this formulation, the linearly separable case can be transformed into a binary search over a series of second order cone programs (SOCP), a convex problem that can be solved efficiently. We also describe two approaches for optimizing the general case: an approximation that can be solved as a standard quadratic program (QP) and an iterative approach for solving the exact problem. By avoiding the pre-processing phase in which the data is completed, both of these approaches could offer considerable computational savings. More importantly, we show that the elegant handling of missing values by our approach allows it to both outperform other methods when the missing values have non-trivial structure, and be competitive with other methods when the values are missing at random. We demonstrate our results on several standard benchmarks and two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images. Keywords: max margin, missing features, network reconstruction, metabolic pa
6 0.31893823 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
7 0.31758621 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
8 0.31545991 57 jmlr-2008-Manifold Learning: The Price of Normalization
9 0.31504947 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
10 0.3134585 83 jmlr-2008-Robust Submodular Observation Selection
11 0.31315315 56 jmlr-2008-Magic Moments for Structured Output Prediction
12 0.31020665 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
13 0.30883521 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
14 0.30485499 9 jmlr-2008-Active Learning by Spherical Subdivision
15 0.30483091 16 jmlr-2008-Approximations for Binary Gaussian Process Classification
16 0.30374411 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning
17 0.29696354 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection
18 0.29671127 28 jmlr-2008-Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines
19 0.29669917 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers