jmlr jmlr2006 jmlr2006-1 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mingrui Wu, Bernhard Schölkopf, Gökhan Bakır
Abstract: Many kernel learning algorithms, including support vector machines, result in a kernel machine, such as a kernel classifier, whose key component is a weight vector in a feature space implicitly introduced by a positive definite kernel function. This weight vector is usually obtained by solving a convex optimization problem. Based on this fact we present a direct method to build sparse kernel learning algorithms by adding one more constraint to the original convex optimization problem, such that the sparseness of the resulting kernel machine is explicitly controlled while at the same time performance is kept as high as possible. A gradient based approach is provided to solve this modified optimization problem. Applying this method to the support vector machine results in a concrete algorithm for building sparse large margin classifiers. These classifiers essentially find a discriminating subspace that can be spanned by a small number of vectors, and in this subspace, the different classes of data are linearly well separated. Experimental results over several classification benchmarks demonstrate the effectiveness of our approach. Keywords: sparse learning, sparse large margin classifiers, kernel learning algorithms, support vector machine, kernel Fisher discriminant
Reference: text
sentIndex sentText sentNum sentScore
1 Applying this method to the support vector machine results in a concrete algorithm for building sparse large margin classifiers. [sent-14, score-0.187]
2 These classifiers essentially find a discriminating subspace that can be spanned by a small number of vectors, and in this subspace, the different classes of data are linearly well separated. [sent-15, score-0.076]
3 Keywords: sparse learning, sparse large margin classifiers, kernel learning algorithms, support vector machine, kernel Fisher discriminant 1. [sent-17, score-0.359]
4 Introduction Many kernel learning algorithms (KLA) have been proposed for solving different kinds of problems. [sent-18, score-0.076]
5 , 2002) is another competitive algorithm for classification, the one-class SVM (Schölkopf and Smola, 2002) is a useful tool for novelty detection, while the kernel Fisher discriminant (KFD) (Mika et al. [sent-20, score-0.076]
6 , 2003) and the kernel PCA (KPCA) (Schölkopf and Smola, 2002) are powerful algorithms for feature extraction. [sent-21, score-0.076]
7 Many kernel learning algorithms result in a kernel machine (KM) (such as a kernel classifier), whose output can be calculated as $\tau(x) = \sum_{i=1}^{N_{XV}} \hat{\alpha}_i K(\hat{x}_i, x) + b$. (1) [sent-22, score-0.248]
8 Usually K is a positive definite kernel (Schölkopf and Smola, 2002), which implicitly introduces a feature space F. [sent-24, score-0.076]
9 Hence (1) can also be written as a linear function $\tau(x) = \langle w, \phi(x)\rangle + b$, (2) where $w = \sum_{i=1}^{N_{XV}} \hat{\alpha}_i \phi(\hat{x}_i)$ (3) is the weight vector of the KM and it equals the linear expansion of XVs in the feature space F. [sent-26, score-0.084]
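To make (1)–(3) concrete, here is a minimal sketch of evaluating such a kernel machine; the Gaussian kernel, the expansion vectors `XV`, the coefficients `alpha` and the bias `b` below are illustrative placeholders rather than quantities taken from the paper.

```python
import numpy as np

def gaussian_kernel(x1, x2, gamma=0.5):
    # One common positive definite kernel: K(x, x') = exp(-gamma * ||x - x'||^2)
    return np.exp(-gamma * np.sum((np.asarray(x1) - np.asarray(x2)) ** 2))

def km_output(x, XV, alpha, b, kernel=gaussian_kernel):
    # tau(x) = sum_i alpha_hat_i * K(x_hat_i, x) + b, as in (1)
    return sum(a * kernel(xv, x) for a, xv in zip(alpha, XV)) + b

# toy usage with made-up numbers
XV = np.array([[0.0, 1.0], [1.0, -1.0]])   # N_XV = 2 expansion vectors
alpha = np.array([0.7, -0.3])              # expansion coefficients
print(km_output([0.5, 0.5], XV, alpha, b=0.1))
```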
10 First, each of them results in a KM whose key component is a weight vector w, which can be expressed as the linear expansion of XVs in the feature space F. [sent-32, score-0.084]
11 Thus several sparse learning algorithms have been proposed to build KMs with small NXV . [sent-38, score-0.129]
12 The $\hat{x}_i$, $1 \le i \le N_{XV}$, have different names in different kernel learning algorithms. [sent-48, score-0.1]
13 In this paper we uniformly call them expansion vectors for the sake of simplicity. [sent-50, score-0.087]
14 In (Lee and Mangasarian, 2001), the reduced support vector machine (RSVM) algorithm is proposed, which randomly selects Nz vectors from the training set as XVs, and then computes the expansion coefficients. [sent-52, score-0.119]
15 This algorithm can be applied to build sparse kernel classifiers. [sent-53, score-0.205]
16 The relevance vector machine (RVM) (Tipping, 2001) is another algorithm which leads to sparse KMs. [sent-55, score-0.106]
17 The basic idea of the RVM is to assume a prior of the expansion coefficients which favors sparse solutions. [sent-56, score-0.156]
18 In this paper, based on the common points of KLAs mentioned before, we present a direct method to build sparse kernel learning algorithms (SKLA). [sent-57, score-0.229]
19 We will also show that applying this method to the SVM will result in a specific algorithm for building sparse large margin classifiers (SLMC). [sent-60, score-0.187]
20 In section 2, we describe a direct method for building SKLAs. [sent-62, score-0.092]
21 The SLMC algorithm is presented in section 3, where we will also point out that it actually finds a discriminating subspace of the feature space F. [sent-64, score-0.076]
22 A Direct Method of Building SKLAs In this section, we propose a direct method for building SKLAs. [sent-68, score-0.092]
23 $N_\xi$ is the number of auxiliary variables, $N_g$ is the number of inequality constraints specified by $g_i(w, b, \xi)$, while $N_h$ is the number of equality constraints specified by $h_j(w, b, \xi)$. [sent-80, score-0.135]
24 To achieve this, we propose to solve the following problem: $\min_{w,b,\xi,\beta,Z} f(w, b, \xi)$, (8) subject to $g_i(w, b, \xi) \le 0$, $1 \le i \le N_g$, (9) $h_j(w, b, \xi) = 0$, $1 \le j \le N_h$, (10) and $w = \sum_{i=1}^{N_z} \phi(z_i)\beta_i$, (11) where (8)–(10) are exactly the same as (5)–(7), while $Z = [z_1, \ldots, z_{N_z}]$. [sent-82, score-0.152]
25 It can be seen that the above problem is the problem (5)–(7) with one added constraint (11) saying that the weight vector of the resulting KM equals the expansion of the φ(zi ), 1 ≤ i ≤ Nz . [sent-89, score-0.084]
26 Due to the constraint (11), solving the problem (8)–(11) will naturally lead to a sparse KM. [sent-91, score-0.088]
27 However, our gradient based minimization will be performed only on the expansion vectors Z but not on all the variables. [sent-94, score-0.121]
28 To this end, we define the following marginal function W(Z), which is obtained by keeping the expansion vectors Z in problem (8)–(11) fixed, i.e. [sent-95, score-0.087]
29 $W(Z) := \min_{w,b,\xi,\beta} f(w, b, \xi)$, (12) subject to $g_i(w, b, \xi) \le 0$, $1 \le i \le N_g$, (13) $h_j(w, b, \xi) = 0$, $1 \le j \le N_h$, (14) and $w = \sum_{i=1}^{N_z} \phi(z_i)\beta_i$. (15) [sent-97, score-0.152]
30 According to the constraint (11), the weight vector w can be completely determined by β and Z, so the functions f , gi and hj can be regarded as functions of b, ξ, β and Z. [sent-109, score-0.151]
31 Without causing confusion, we write them as $f(x, Z)$, $g_i(x, Z)$ and $h_j(x, Z)$ in the following, where $x := [b, \xi^\top, \beta^\top]^\top$. [sent-110, score-0.135]
32 Substituting (15) into (12)–(14), we have $\min_x f(x, Z)$, (16) subject to $g_i(x, Z) \le 0$, $1 \le i \le N_g$, (17) and $h_j(x, Z) = 0$, $1 \le j \le N_h$, (18) and W(Z) is the optimal value of the above optimization problem. [sent-111, score-0.152]
33 Furthermore, let $\bar{\alpha}_i$ and $\bar{\beta}_j$ be the unique corresponding Lagrange multipliers associated with $g_i$ and $h_j$ respectively, $1 \le i \le N_g$, $1 \le j \le N_h$. [sent-113, score-0.153]
34 Building an SLMC. Having described our direct method for building SKLAs, we will focus on a concrete application of this method in the rest of this paper. [sent-121, score-0.092]
35 1 Objective. Now we begin to consider the binary classification problem, where we are given a set of training data $\{(x_i, y_i)\}_{i=1}^N$, where $x_i \in \mathcal{X}$ is the input data, and $y_i \in \{-1, 1\}$ is the class label. [sent-126, score-0.154]
36 Our objective is as follows: given a training data set and a positive integer $N_z$, we want to build a classifier such that the number of XVs of the classifier equals $N_z$ and the margin of the classifier is as large as possible.
37 This way we build a large margin classifier whose sparseness is explicitly controlled. [sent-128, score-0.097]
38 Substituting (27) into (24) and (25), we have $\min_{\xi,b,\beta} \frac{1}{2}\beta^\top K_z \beta + C\sum_{i=1}^N \xi_i$, (28) subject to $y_i(\beta^\top \psi_z(x_i) + b) \ge 1 - \xi_i$, $\forall i$, (29) and $\xi_i \ge 0$, $\forall i$, (30) where $\psi_z(x_i) = [K(z_1, x_i), \ldots, K(z_{N_z}, x_i)]^\top$ (31) [sent-158, score-0.09]
39 is the empirical kernel map (Schölkopf and Smola, 2002) and $K_z$ is the kernel matrix of the $z_i$, i.e. $K^z_{ij} = K(z_i, z_j)$. [sent-161, score-0.251]
40 Note that when $N_z = N$ and $z_i = x_i$, $1 \le i \le N$, this is the standard SVM training problem. [sent-164, score-0.131]
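The quantities in (28)–(31) translate directly into a few lines of array code; the sketch below (with a Gaussian kernel and hypothetical helper names) builds the empirical kernel map ψ_z and the kernel matrix K_z of the expansion vectors, and these helpers are reused in the later sketches.

```python
import numpy as np

def gaussian_K(A, B, gamma=0.5):
    # pairwise Gaussian kernel matrix between the rows of A and the rows of B
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)

def empirical_kernel_map(X, Z, gamma=0.5):
    # rows are psi_z(x_i) = [K(z_1, x_i), ..., K(z_Nz, x_i)], as in (31); shape (N, Nz)
    return gaussian_K(X, Z, gamma)

def kernel_matrix_of_XVs(Z, gamma=0.5):
    # (K_z)_{ij} = K(z_i, z_j); shape (Nz, Nz)
    return gaussian_K(Z, Z, gamma)
```

In these terms, (28)–(30) is a linear SVM over the rows of `empirical_kernel_map(X, Z)` whose regularizer is the $K_z$-weighted norm $\frac{1}{2}\beta^\top K_z\beta$ (since $\|w\|^2 = \beta^\top K_z\beta$ under (11)), not a plain squared norm of $\beta$.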
41 In contrast, the problem (28)–(30) is to train a linear SVM in a subspace spanned by $\phi(z_i)$, $1 \le i \le N_z$, where the $z_i$ are not necessarily training examples.
42 The derivatives of $L(\xi, b, \beta, \alpha, \gamma)$ with respect to the primal variables must vanish: $\frac{\partial L}{\partial \beta} = K_z\beta - \sum_{i=1}^N \alpha_i y_i \psi_z(x_i) = 0$, (33) $\frac{\partial L}{\partial b} = -\sum_{i=1}^N \alpha_i y_i = 0$, (34) and $\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \gamma_i = 0$, $\forall i$. (35) [sent-168, score-0.098]
43 The function $\hat{K}_z(\cdot, \cdot)$ defined by (40) is a positive definite kernel function (Schölkopf and Smola, 2002). [sent-171, score-0.076]
44 Therefore given Z, computing the expansion coefficients of the SVM with kernel function K is equivalent to training an SVM with the modified kernel function $\hat{K}_z$ defined by (40). [sent-176, score-0.076]
45 The map defined in (41) is called the “whitened” empirical kernel map or “kernel PCA map” (Schölkopf and Smola, 2002). [sent-177, score-0.076]
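Equations (40)–(41) themselves are not reproduced in this excerpt; assuming the usual definitions, $\hat{K}_z(x, x') = \psi_z(x)^\top K_z^{-1}\psi_z(x')$ and $\phi_z(x) = K_z^{-1/2}\psi_z(x)$ (the whitened empirical kernel map), a sketch looks like this — treat those exact forms as an assumption.

```python
import numpy as np

def whitened_map(Psi, Kz, jitter=1e-10):
    # phi_z(x) = Kz^{-1/2} psi_z(x): assumed form of the map in (41)
    # Psi is (N, Nz) with rows psi_z(x_i); Kz is the (Nz, Nz) kernel matrix of the XVs
    w, V = np.linalg.eigh(Kz + jitter * np.eye(Kz.shape[0]))
    Kz_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T     # symmetric inverse square root
    return Psi @ Kz_inv_sqrt

def modified_kernel(Psi_a, Psi_b, Kz, jitter=1e-10):
    # hatK_z(x, x') = psi_z(x)^T Kz^{-1} psi_z(x'): assumed form of the kernel in (40)
    Kz_reg = Kz + jitter * np.eye(Kz.shape[0])
    return Psi_a @ np.linalg.solve(Kz_reg, Psi_b.T)
```

Under these assumed forms, training a standard linear SVM on the whitened features $\phi_z(x_i)$ recovers problem (28)–(30), because $\|w\|^2 = \beta^\top K_z\beta$ when $w = \sum_i \phi(z_i)\beta_i$.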
47 So given Z, assuming $\alpha^z_i$, $1 \le i \le N$, are the solution of (37)–(39), then we can compute W(Z) as $W(Z) = \sum_{i=1}^N \alpha^z_i - \frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \alpha^z_i \alpha^z_j y_i y_j \hat{K}_z(x_i, x_j)$. (44) [sent-180, score-0.119]
48 According to (36), the expansion coefficients $\beta$ can be calculated as $\beta = (K_z)^{-1}\sum_{i=1}^N \alpha^z_i y_i \psi_z(x_i) = (K_z)^{-1}(K_{zx})Y\alpha^z$, (45) where $\psi_z(\cdot)$ is defined by (31), $K_{zx}$ is the matrix defined by $K^{zx}_{ij} = K(z_i, x_j)$, and Y is a diagonal matrix of class labels, i.e. $Y_{ii} = y_i$. [sent-181, score-0.189]
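Given the dual solution α^z of (37)–(39) for the current Z, (44) and (45) are a couple of lines of linear algebra. The helpers below reuse the matrices from the earlier sketches; the tiny jitter added before solving is an implementation choice of this sketch, not something taken from the paper.

```python
import numpy as np

def W_of_Z(alpha_z, y, K_hat):
    # W(Z) from (44): sum_i alpha_i - 0.5 * sum_ij alpha_i alpha_j y_i y_j hatK_z(x_i, x_j)
    a_y = alpha_z * y
    return alpha_z.sum() - 0.5 * a_y @ K_hat @ a_y

def expansion_coefficients(alpha_z, y, Kz, Kzx, jitter=1e-10):
    # beta from (45): beta = Kz^{-1} Kzx Y alpha^z, with Y = diag(y) and (Kzx)_{ij} = K(z_i, x_j)
    rhs = Kzx @ (y * alpha_z)
    return np.linalg.solve(Kz + jitter * np.eye(Kz.shape[0]), rhs)
```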
49 As can be seen in corollary 2, to apply lemma 1 to calculate W(Z), we only need to make two assumptions on problem (37)–(39): the kernel function $\hat{K}_z(\cdot, \cdot)$ is strictly positive definite and the resulting support vectors come from both classes. [sent-192, score-0.095]
50 4 The Kernel Function $\hat{K}_z$ and Its Corresponding Feature Space $\mathcal{F}_z$. The kernel function $\hat{K}_z$ plays an important role in our approach. [sent-210, score-0.076]
51 It is well known that training an SVM with a nonlinear kernel function K in the input space X is equivalent to building a linear SVM in a feature space F. [sent-212, score-0.176]
52 Given the expansion vectors $z_1, \ldots, z_{N_z}$, training an SVM with kernel function K is equivalent to building an SVM with another kernel function $\hat{K}_z$, which is in turn equivalent to constructing a linear SVM in another feature space. [sent-218, score-0.252]
53 Based on this point of view, we can see that our proposed approach essentially finds a subspace Fz where the margin of the training data is maximized. [sent-237, score-0.107]
54 After the expansion vectors $z_1, \ldots, z_{N_z}$ are obtained, the expansion coefficients $\beta$ are computed by minimizing (4), which leads to (Schölkopf and Smola, 2002) $\beta = (K_z)^{-1}(K_{zx})Y\alpha$, (47) where $K_{zx}$ and Y are defined as in (45), and $\alpha$ is the solution of building an SVM with kernel function K on the training data set $\{(x_i, y_i)\}_{i=1}^N$. [sent-244, score-0.293]
55 The only difference is that in (47), $\alpha$ is the solution of training an SVM with kernel function K, while in (45), $\alpha^z$ is the solution of training an SVM with the kernel function $\hat{K}_z$, which takes the XVs $z_i$ into consideration. [sent-247, score-0.291]
56 2 Comparison with RSVM and a Modified RSVM Algorithm. One might argue that our approach appears to be similar to the RSVM, because the RSVM algorithm also restricts the weight vector of the decision hyperplane to be a linear expansion of $N_z$ XVs. [sent-251, score-0.084]
57 The first one (and probably the fundamental one) is that in the RSVM approach, Nz XVs are randomly selected from the training data in advance, but are not computed by finding a discriminating subspace Fz . [sent-253, score-0.108]
58 This step immediately reduces the problem (28)–(30) to a standard linear SVM training problem (Lin and Lin, 2003), where $\beta$ becomes the weight vector of the decision hyperplane and the training set becomes $\{(\psi_z(x_i), y_i)\}_{i=1}^N$.
59 On the other hand, our method of computing $\beta$ is to build a linear SVM in the subspace $\mathcal{F}_z$, which is to train a linear SVM for the training data set $\{(\phi_z(x_i), y_i)\}_{i=1}^N$.
60 Analogous to the modified RS method, we propose a modified RSVM algorithm: Firstly, Nz training data are randomly selected as XVs, then the expansion coefficients β are computed by (45). [sent-264, score-0.1]
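A sketch of this modified RSVM baseline, reusing the helpers from the earlier sketches; scikit-learn's `SVC` with a precomputed kernel is used here only as one convenient solver for the dual (37)–(39), and the packing of its dual coefficients is handled under that assumption.

```python
import numpy as np
from sklearn.svm import SVC

def modified_rsvm(X, y, Nz, gamma=0.5, C=1.0, seed=0):
    # 1) randomly pick Nz training points as the expansion vectors
    rng = np.random.default_rng(seed)
    Z = X[rng.choice(len(X), size=Nz, replace=False)]

    # 2) solve the dual (37)-(39) with the modified kernel hatK_z on the training set
    Kz = kernel_matrix_of_XVs(Z, gamma)
    Kzx = gaussian_K(Z, X, gamma)                                   # (Nz, N)
    K_hat = Kzx.T @ np.linalg.solve(Kz + 1e-10 * np.eye(Nz), Kzx)   # (N, N)
    svm = SVC(C=C, kernel="precomputed").fit(K_hat, y)

    alpha_z = np.zeros(len(X))
    alpha_z[svm.support_] = np.abs(svm.dual_coef_.ravel())          # dual_coef_ stores y_i * alpha_i

    # 3) expansion coefficients via (45)
    beta = expansion_coefficients(alpha_z, y, Kz, Kzx)
    return Z, beta, svm.intercept_[0]
```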
61 3 Comparison with the RVM. The RVM (Tipping, 2001) algorithm and many other sparse learning algorithms, such as sparse greedy algorithms (Nair et al. [sent-266, score-0.176]
62 Consequently, with the same number of XVs, SLMC can have better classification performance than the RVM and other sparse learning algorithms which select the XVs only from the training data. [sent-270, score-0.12]
63 4 SLMC vs Neural Networks. Since the XVs of the SLMC do not necessarily belong to the training set and training an SLMC is a gradient-based process, the SLMC can be thought of as a neural network with weight regularization (Bishop, 1995). [sent-274, score-0.114]
64 To this end, the regularizer takes into account the kernel matrix Kz . [sent-277, score-0.076]
65 It is straightforward to apply this method to build a sparse one-class SVM algorithm (Schölkopf and Smola, 2002), to which there is no obvious neural network counterpart. [sent-281, score-0.129]
66 Note that the prediction time (the number of kernel evaluations) of a soft margin SVM scales linearly with the number of training patterns (Steinwart, 2003). [sent-283, score-0.139]
67 This is also similar to the recent work of Snelson and Ghahramani (2006) on building sparse Gaussian processes, which was done at almost the same time as our previous work (Wu et al. [sent-287, score-0.156]
68 3 Parameter Selection. A Gaussian kernel is used in the experiments: $K(x, x') = \exp(-\gamma \|x - x'\|^2)$. [sent-310, score-0.076]
69 RS method: The RS method uses the same kernel parameter as the standard SVM, since it aims to simplify the standard SVM solution. [sent-315, score-0.076]
70 In Table 1, NSV stands for the number of support vectors (SVs) of the standard SVM, Nz represents the number of XVs of other sparse learning algorithms. [sent-342, score-0.107]
71 For each data set, the result of the RVM is shown in boldface if it is the best compared to the other sparse learning algorithms. [sent-446, score-0.088]
72 Table 1 also illustrates that SLMC outperforms the other sparse learning algorithms in most cases. [sent-449, score-0.088]
73 Theoretically, if the Gaussian kernel is used and the $z_i$, $1 \le i \le N_z$, are different from each other, then $K_z$ has full rank and its inverse can be computed without any problems. [sent-496, score-0.151]
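In practice, $K_z$ can still become badly conditioned if two $z_i$ drift close to each other during the gradient updates, so a common implementation safeguard (not something prescribed by the paper) is to avoid forming the explicit inverse:

```python
import numpy as np

def solve_Kz(Kz, B, jitter=1e-8):
    # Solve Kz @ X = B without explicitly forming inv(Kz).
    n = Kz.shape[0]
    try:
        L = np.linalg.cholesky(Kz + jitter * np.eye(n))   # works when Kz (+ jitter) is positive definite
        return np.linalg.solve(L.T, np.linalg.solve(L, B))
    except np.linalg.LinAlgError:
        return np.linalg.pinv(Kz) @ B                     # fall back if Kz is numerically singular
```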
74 8 Training Time of SLMC. Building an SLMC is a gradient-based process, where each iteration consists of computing the $\phi_z(x_i)$, $1 \le i \le N$, training a linear SVM over $\{(\phi_z(x_i), y_i)\}_{i=1}^N$, and then computing the gradient of W(Z). [sent-520, score-0.149]
75 Let $T_{SVM}(N, d)$ denote the time complexity of training a linear SVM over a data set containing N d-dimensional vectors; then the time complexity of training an SLMC is $O(n \times (N N_z d + T_{SVM}(N, N_z) + N_z^3 + N_z^2 d))$, where n is the number of iterations of the SLMC training process. [sent-521, score-0.096]
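The per-iteration structure described above can be sketched as the loop below, which reuses the earlier helpers (`gaussian_K`, `kernel_matrix_of_XVs`, `W_of_Z`). The paper uses an analytic gradient of W(Z); the finite-difference gradient and fixed step size here are stand-ins chosen only to keep the sketch self-contained, and they are much slower than the quoted complexity.

```python
import numpy as np
from sklearn.svm import SVC

def W_and_alpha(X, y, Z, gamma, C):
    # inner problem: solve (37)-(39) for fixed Z and return (W(Z), alpha^z)
    Kz = kernel_matrix_of_XVs(Z, gamma) + 1e-10 * np.eye(len(Z))
    Kzx = gaussian_K(Z, X, gamma)
    K_hat = Kzx.T @ np.linalg.solve(Kz, Kzx)
    svm = SVC(C=C, kernel="precomputed").fit(K_hat, y)
    alpha = np.zeros(len(X))
    alpha[svm.support_] = np.abs(svm.dual_coef_.ravel())
    return W_of_Z(alpha, y, K_hat), alpha

def train_slmc_Z(X, y, Z0, gamma=0.5, C=1.0, step=0.1, n_iter=10, eps=1e-4):
    # outer loop: gradient descent of W(Z) over the expansion vectors Z (shape (Nz, d))
    Z = Z0.astype(float).copy()
    for _ in range(n_iter):
        G = np.zeros_like(Z)
        for u in range(Z.shape[0]):
            for v in range(Z.shape[1]):
                Zp, Zm = Z.copy(), Z.copy()
                Zp[u, v] += eps
                Zm[u, v] -= eps
                G[u, v] = (W_and_alpha(X, y, Zp, gamma, C)[0]
                           - W_and_alpha(X, y, Zm, gamma, C)[0]) / (2.0 * eps)
        Z -= step * G          # descend on W(Z), since problem (8)-(11) is a minimization
    return Z
```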
76 For applications where processing speed is important, such as real time computer vision, sparse KMs can be of great value. [sent-531, score-0.088]
77 Furthermore, some kernel learning algorithms are not sparse at all, such as kernel ridge regression, KFD, KPCA, which means that all the training data need to be saved as the XVs in the resulting KMs trained by these algorithms. [sent-533, score-0.272]
78 Hence building sparse versions of these algorithms can not only accelerate the evaluation of the test data, but also dramatically reduce the storage required for the resulting KMs.
79 Equivalently we can build an SVM with the kernel function $\hat{K}_z$ over $\{(x_i, y_i)\}_{i=1}^N$. [sent-535, score-0.166]
80 Conclusions. We present a direct method to build sparse kernel learning algorithms. [sent-540, score-0.229]
81 There are mainly two advantages of this method: First, it can be applied to sparsify many current kernel learning algorithms. [sent-541, score-0.094]
82 Based on this method we propose an approach to build sparse large margin classifiers, which essentially finds a discriminating subspace Fz of the feature space F. [sent-543, score-0.236]
83 Experimental results indicate that this approach often exhibits a better classification accuracy, at comparable sparsity, than the sparse learning algorithms to which we compared. [sent-544, score-0.088]
84 Possible future work may include applying the proposed method to other kernel learning algorithms and running the direct method greedily to find the XVs one after another in order to accelerate the training procedure. [sent-549, score-0.132]
85 Sparse KFD. We have derived the SLMC algorithm as an example of our direct sparse learning approach. [sent-552, score-0.112]
86 Here we will show another example to build sparse KFD (Mika et al. [sent-553, score-0.129]
87 Following our proposed approach, in order to build a sparse KFD (SKFD), we have to solve the following problem: $\min_{w \in \mathcal{F},\, \beta \in \mathbb{R}^{N_z},\, Z \in \mathbb{R}^{d \times N_z}} -w^\top A w$, (52) subject to $w^\top \hat{B} w = 1$, (53) and $w = \sum_{i=1}^{N_z} \phi(z_i)\beta_i$. (54) [sent-566, score-0.146]
88 Therefore, according to lemma 1, the derivative of W(Z) with respect to $z_{uv}$, which denotes the v-th component of vector $z_u$, $1 \le u \le N_z$, $1 \le v \le d$, exists and can be computed as $\frac{\partial W(Z)}{\partial z_{uv}} = -\bar{\beta}^\top \frac{\partial A_z}{\partial z_{uv}} \bar{\beta} + \bar{\lambda}\, \bar{\beta}^\top \frac{\partial B_z}{\partial z_{uv}} \bar{\beta}$. [sent-587, score-0.123]
89 After obtaining the XVs zi by solving the problem (52)–(54), we can take the solution βi of this problem as the expansion coefficients. [sent-589, score-0.143]
90 As mentioned before, having the $z_i$, we can also apply our proposed method, in which the expansion coefficients and the bias are calculated by solving the problem (37)–(39). [sent-592, score-0.163]
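To ground the SKFD discussion, here is one plausible instantiation of the inner solve of (52)–(54) with Z held fixed, using the conventional Fisher between-class and within-class scatter of the mapped data ψ_z(x_i); the paper's exact A and B̂ are not reproduced in this excerpt, so both the scatter definitions and the small regularizer are assumptions of this sketch.

```python
import numpy as np

def skfd_inner(X, y, Z, gamma=0.5, reg=1e-3):
    # Fisher discriminant restricted to the span of {phi(z_i)}, parametrized by beta.
    Psi = empirical_kernel_map(X, Z, gamma)          # (N, Nz), rows psi_z(x_i)
    Psi_p, Psi_m = Psi[y == 1], Psi[y == -1]
    d = Psi_p.mean(axis=0) - Psi_m.mean(axis=0)      # difference of class means

    # within-class scatter, regularized so that it is invertible
    Cp = Psi_p - Psi_p.mean(axis=0)
    Cm = Psi_m - Psi_m.mean(axis=0)
    B = Cp.T @ Cp + Cm.T @ Cm + reg * np.eye(Z.shape[0])

    beta = np.linalg.solve(B, d)                     # maximizes (beta^T d)^2 / (beta^T B beta)
    return beta / np.sqrt(beta @ B @ beta)           # scale so that beta^T B beta = 1, cf. (53)
```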
91 The best results among the sparse learning algorithms are in boldface. [sent-644, score-0.088]
92 For the KFD algorithm, the number of expansion vectors Nz is the same as the number of training data since KFD is not sparse at all. [sent-646, score-0.207]
93 Furthermore, the original KFD algorithm is not sparse at all, i.e. [sent-648, score-0.088]
94 Formal Definitions of the Two Conditions in Lemma 1. For each $Z \in \mathbb{R}^{d \times N_z}$, let S(Z) denote the feasible set of problem (16)–(18): $S(Z) = \{x \mid g_i(x, Z) \le 0,\ 1 \le i \le N_g\} \cap \{x \mid h_j(x, Z) = 0,\ 1 \le j \le N_h\}$. [sent-658, score-0.152]
95 There exists a vector v such that $v^\top \nabla_x h_j(x, Z)|_{x=\bar{x}} = 0$, $1 \le j \le N_h$, (65) and the gradients $\{\nabla_x h_j(x, Z)|_{x=\bar{x}},\ 1 \le j \le N_h\}$ are linearly independent; here $\nabla_x$ denotes the gradient with respect to the variables x. [sent-662, score-0.256]
96 First, for the soft margin SVM training problem (37)–(39), it is known that if the kernel function $\hat{K}_z(\cdot, \cdot)$ is strictly positive definite then the problem has a unique solution and the corresponding Lagrange multipliers are also unique. [sent-665, score-0.157]
97 In this case, the component $v_{s_3}$ can be easily computed as $v_{s_3} = -\frac{y_{s_3}}{y_{s_3}^2} \sum_{i \in S_1 \cup S_2} v_i y_i$, where $y = [y_1, \ldots, y_N]^\top$. [sent-682, score-0.081]
98 Therefore the left side of equation (65) equals $v^\top y = \sum_{i \in S_1} v_i y_i + \sum_{i \in S_2} v_i y_i = \sum_{i \in S_1} v_i y_i + \sum_{i \in S_2,\, y_i > 0} v_i - \sum_{i \in S_2,\, y_i < 0} v_i$. (66) [sent-689, score-0.535]
99 Recall that we construct v such that $v_i < 0$ for $i \in S_2$. [sent-690, score-0.081]
100 Some greedy learning algorithms for sparse regression and classification with Mercer kernels. [sent-787, score-0.088]
wordName wordTfidf (topN-words)
[('nz', 0.511), ('slmc', 0.388), ('kz', 0.379), ('xvs', 0.37), ('rsvm', 0.147), ('coe', 0.135), ('rs', 0.134), ('zuv', 0.123), ('bak', 0.115), ('kfd', 0.112), ('rvm', 0.112), ('svm', 0.102), ('znz', 0.097), ('sparse', 0.088), ('hj', 0.087), ('az', 0.082), ('mrsvm', 0.079), ('kernel', 0.076), ('nh', 0.075), ('zi', 0.075), ('km', 0.072), ('mrs', 0.07), ('nxv', 0.07), ('rnz', 0.07), ('building', 0.068), ('expansion', 0.068), ('bz', 0.068), ('fz', 0.068), ('wu', 0.064), ('erent', 0.064), ('klas', 0.062), ('skfd', 0.062), ('uz', 0.06), ('scholkopf', 0.059), ('di', 0.055), ('xj', 0.052), ('yi', 0.049), ('gi', 0.048), ('kla', 0.044), ('kzx', 0.044), ('subspace', 0.044), ('banana', 0.043), ('build', 0.041), ('nsv', 0.037), ('usps', 0.037), ('gradient', 0.034), ('lkopf', 0.034), ('vi', 0.032), ('discriminating', 0.032), ('training', 0.032), ('margin', 0.031), ('mika', 0.03), ('ng', 0.03), ('sch', 0.029), ('tipping', 0.029), ('zj', 0.027), ('gauvin', 0.026), ('kms', 0.026), ('sklas', 0.026), ('classi', 0.026), ('sparseness', 0.025), ('smola', 0.025), ('titanic', 0.025), ('direct', 0.024), ('xi', 0.024), ('lagrange', 0.023), ('svs', 0.022), ('rd', 0.022), ('german', 0.021), ('bw', 0.02), ('kpca', 0.02), ('cients', 0.02), ('calculated', 0.02), ('compact', 0.019), ('breast', 0.019), ('waveform', 0.019), ('vectors', 0.019), ('lin', 0.019), ('benchmarks', 0.018), ('erence', 0.018), ('yj', 0.018), ('regularity', 0.018), ('relevance', 0.018), ('dubeau', 0.018), ('erentiable', 0.018), ('mingrui', 0.018), ('nair', 0.018), ('snelson', 0.018), ('sparsify', 0.018), ('tsv', 0.018), ('multipliers', 0.018), ('subject', 0.017), ('feasible', 0.017), ('aw', 0.017), ('suykens', 0.017), ('cancer', 0.017), ('weight', 0.016), ('duplicated', 0.015), ('erences', 0.015), ('khan', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms
Author: Mingrui Wu, Bernhard Schölkopf, Gökhan Bakır
Abstract: Many kernel learning algorithms, including support vector machines, result in a kernel machine, such as a kernel classifier, whose key component is a weight vector in a feature space implicitly introduced by a positive definite kernel function. This weight vector is usually obtained by solving a convex optimization problem. Based on this fact we present a direct method to build sparse kernel learning algorithms by adding one more constraint to the original convex optimization problem, such that the sparseness of the resulting kernel machine is explicitly controlled while at the same time performance is kept as high as possible. A gradient based approach is provided to solve this modified optimization problem. Applying this method to the support vector machine results in a concrete algorithm for building sparse large margin classifiers. These classifiers essentially find a discriminating subspace that can be spanned by a small number of vectors, and in this subspace, the different classes of data are linearly well separated. Experimental results over several classification benchmarks demonstrate the effectiveness of our approach. Keywords: sparse learning, sparse large margin classifiers, kernel learning algorithms, support vector machine, kernel Fisher discriminant
2 0.071299292 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
Author: Sayan Mukherjee, Qiang Wu
Abstract: We introduce an algorithm that simultaneously estimates a classification function as well as its gradient in the supervised learning framework. The motivation for the algorithm is to find salient variables and estimate how they covary. An efficient implementation with respect to both memory and time is given. The utility of the algorithm is illustrated on simulated data as well as a gene expression data set. An error analysis is given for the convergence of the estimate of the classification function and its gradient to the true classification function and true gradient. Keywords: Tikhnonov regularization, variable selection, reproducing kernel Hilbert space, generalization bounds, classification
3 0.068224043 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis
Author: Tonatiuh Peña Centeno, Neil D. Lawrence
Abstract: In this paper we consider a novel Bayesian interpretation of Fisher’s discriminant analysis. We relate Rayleigh’s coefficient to a noise model that minimises a cost based on the most probable class centres and that abandons the ‘regression to the labels’ assumption used by other algorithms. Optimisation of the noise model yields a direction of discrimination equivalent to Fisher’s discriminant, and with the incorporation of a prior we can apply Bayes’ rule to infer the posterior distribution of the direction of discrimination. Nonetheless, we argue that an additional constraining distribution has to be included if sensible results are to be obtained. Going further, with the use of a Gaussian process prior we show the equivalence of our model to a regularised kernel Fisher’s discriminant. A key advantage of our approach is the facility to determine kernel parameters and the regularisation coefficient through the optimisation of the marginal log-likelihood of the data. An added bonus of the new formulation is that it enables us to link the regularisation coefficient with the generalisation error.
4 0.068195499 45 jmlr-2006-Learning Coordinate Covariances via Gradients
Author: Sayan Mukherjee, Ding-Xuan Zhou
Abstract: We introduce an algorithm that learns gradients from samples in the supervised learning framework. An error analysis is given for the convergence of the gradient estimated by the algorithm to the true gradient. The utility of the algorithm for the problem of variable selection as well as determining variable covariance is illustrated on simulated data as well as two gene expression data sets. For square loss we provide a very efficient implementation with respect to both memory and time. Keywords: Tikhnonov regularization, variable selection, reproducing kernel Hilbert space, generalization bounds
Author: S. Sathiya Keerthi, Olivier Chapelle, Dennis DeCoste
Abstract: Support vector machines (SVMs), though accurate, are not preferred in applications requiring great classification speed, due to the number of support vectors being large. To overcome this problem we devise a primal method with the following properties: (1) it decouples the idea of basis functions from the concept of support vectors; (2) it greedily finds a set of kernel basis functions of a specified maximum size (dmax ) to approximate the SVM primal cost function well; (3) it is 2 efficient and roughly scales as O(ndmax ) where n is the number of training examples; and, (4) the number of basis functions it requires to achieve an accuracy close to the SVM accuracy is usually far less than the number of SVM support vectors. Keywords: SVMs, classification, sparse design
7 0.04844559 43 jmlr-2006-Large Scale Multiple Kernel Learning (Special Topic on Machine Learning and Optimization)
8 0.046623401 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples
10 0.038473669 91 jmlr-2006-The Interplay of Optimization and Machine Learning Research (Special Topic on Machine Learning and Optimization)
13 0.036335133 44 jmlr-2006-Large Scale Transductive SVMs
14 0.035653934 93 jmlr-2006-Universal Kernels
16 0.034065124 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms
17 0.033845387 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization
18 0.03324255 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space
19 0.029898595 76 jmlr-2006-QP Algorithms with Guaranteed Accuracy and Run Time for Support Vector Machines
topicId topicWeight
[(0, 0.155), (1, -0.097), (2, 0.104), (3, -0.053), (4, 0.114), (5, -0.086), (6, 0.006), (7, -0.019), (8, 0.049), (9, -0.053), (10, 0.07), (11, -0.036), (12, 0.044), (13, -0.035), (14, 0.04), (15, -0.005), (16, 0.037), (17, 0.003), (18, -0.023), (19, -0.042), (20, 0.171), (21, -0.058), (22, -0.194), (23, 0.004), (24, 0.083), (25, -0.002), (26, -0.063), (27, -0.252), (28, 0.076), (29, 0.13), (30, 0.079), (31, -0.012), (32, -0.146), (33, -0.125), (34, -0.173), (35, -0.064), (36, 0.182), (37, 0.03), (38, 0.151), (39, 0.11), (40, -0.229), (41, -0.082), (42, -0.132), (43, -0.058), (44, -0.173), (45, -0.015), (46, 0.206), (47, 0.162), (48, 0.108), (49, -0.215)]
simIndex simValue paperId paperTitle
same-paper 1 0.8987174 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms
Author: Mingrui Wu, Bernhard Schölkopf, Gökhan Bakır
Abstract: Many kernel learning algorithms, including support vector machines, result in a kernel machine, such as a kernel classifier, whose key component is a weight vector in a feature space implicitly introduced by a positive definite kernel function. This weight vector is usually obtained by solving a convex optimization problem. Based on this fact we present a direct method to build sparse kernel learning algorithms by adding one more constraint to the original convex optimization problem, such that the sparseness of the resulting kernel machine is explicitly controlled while at the same time performance is kept as high as possible. A gradient based approach is provided to solve this modified optimization problem. Applying this method to the support vector machine results in a concrete algorithm for building sparse large margin classifiers. These classifiers essentially find a discriminating subspace that can be spanned by a small number of vectors, and in this subspace, the different classes of data are linearly well separated. Experimental results over several classification benchmarks demonstrate the effectiveness of our approach. Keywords: sparse learning, sparse large margin classifiers, kernel learning algorithms, support vector machine, kernel Fisher discriminant
2 0.46189052 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis
Author: Tonatiuh Peña Centeno, Neil D. Lawrence
Abstract: In this paper we consider a novel Bayesian interpretation of Fisher’s discriminant analysis. We relate Rayleigh’s coefficient to a noise model that minimises a cost based on the most probable class centres and that abandons the ‘regression to the labels’ assumption used by other algorithms. Optimisation of the noise model yields a direction of discrimination equivalent to Fisher’s discriminant, and with the incorporation of a prior we can apply Bayes’ rule to infer the posterior distribution of the direction of discrimination. Nonetheless, we argue that an additional constraining distribution has to be included if sensible results are to be obtained. Going further, with the use of a Gaussian process prior we show the equivalence of our model to a regularised kernel Fisher’s discriminant. A key advantage of our approach is the facility to determine kernel parameters and the regularisation coefficient through the optimisation of the marginal log-likelihood of the data. An added bonus of the new formulation is that it enables us to link the regularisation coefficient with the generalisation error.
Author: S. Sathiya Keerthi, Olivier Chapelle, Dennis DeCoste
Abstract: Support vector machines (SVMs), though accurate, are not preferred in applications requiring great classification speed, due to the number of support vectors being large. To overcome this problem we devise a primal method with the following properties: (1) it decouples the idea of basis functions from the concept of support vectors; (2) it greedily finds a set of kernel basis functions of a specified maximum size (dmax ) to approximate the SVM primal cost function well; (3) it is 2 efficient and roughly scales as O(ndmax ) where n is the number of training examples; and, (4) the number of basis functions it requires to achieve an accuracy close to the SVM accuracy is usually far less than the number of SVM support vectors. Keywords: SVMs, classification, sparse design
4 0.28775811 43 jmlr-2006-Large Scale Multiple Kernel Learning (Special Topic on Machine Learning and Optimization)
Author: Sören Sonnenburg, Gunnar Rätsch, Christin Schäfer, Bernhard Schölkopf
Abstract: While classical kernel-based learning algorithms are based on a single kernel, in practice it is often desirable to use multiple kernels. Lanckriet et al. (2004) considered conic combinations of kernel matrices for classification, leading to a convex quadratically constrained quadratic program. We show that it can be rewritten as a semi-infinite linear program that can be efficiently solved by recycling the standard SVM implementations. Moreover, we generalize the formulation and our method to a larger class of problems, including regression and one-class classification. Experimental results show that the proposed algorithm works for hundred thousands of examples or hundreds of kernels to be combined, and helps for automatic model selection, improving the interpretability of the learning result. In a second part we discuss general speed up mechanism for SVMs, especially when used with sparse feature maps as appear for string kernels, allowing us to train a string kernel SVM on a 10 million real-world splice data set from computational biology. We integrated multiple kernel learning in our machine learning toolbox SHOGUN for which the source code is publicly available at http://www.fml.tuebingen.mpg.de/raetsch/projects/shogun. Keywords: multiple kernel learning, string kernels, large scale optimization, support vector machines, support vector regression, column generation, semi-infinite linear programming
Author: Juho Rousu, Craig Saunders, Sandor Szedmak, John Shawe-Taylor
Abstract: We present a kernel-based algorithm for hierarchical text classification where the documents are allowed to belong to more than one category at a time. The classification model is a variant of the Maximum Margin Markov Network framework, where the classification hierarchy is represented as a Markov tree equipped with an exponential family defined on the edges. We present an efficient optimization algorithm based on incremental conditional gradient ascent in single-example subspaces spanned by the marginal dual variables. The optimization is facilitated with a dynamic programming based algorithm that computes best update directions in the feasible set. Experiments show that the algorithm can feasibly optimize training sets of thousands of examples and classification hierarchies consisting of hundreds of nodes. Training of the full hierarchical model is as efficient as training independent SVM-light classifiers for each node. The algorithm’s predictive accuracy was found to be competitive with other recently introduced hierarchical multicategory or multilabel classification learning algorithms. Keywords: kernel methods, hierarchical classification, text categorization, convex optimization, structured outputs
7 0.23098435 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
9 0.20458177 27 jmlr-2006-Ensemble Pruning Via Semi-definite Programming (Special Topic on Machine Learning and Optimization)
10 0.20270655 45 jmlr-2006-Learning Coordinate Covariances via Gradients
12 0.18541811 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples
13 0.17170912 69 jmlr-2006-One-Class Novelty Detection for Seizure Analysis from Intracranial EEG
14 0.17049161 93 jmlr-2006-Universal Kernels
15 0.1604635 44 jmlr-2006-Large Scale Transductive SVMs
16 0.16010417 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms
17 0.15258145 13 jmlr-2006-Adaptive Prototype Learning Algorithms: Theoretical and Experimental Studies
18 0.1438136 34 jmlr-2006-Generalized Bradley-Terry Models and Multi-Class Probability Estimates
19 0.14227626 76 jmlr-2006-QP Algorithms with Guaranteed Accuracy and Run Time for Support Vector Machines
topicId topicWeight
[(8, 0.056), (35, 0.013), (36, 0.05), (45, 0.012), (46, 0.015), (50, 0.033), (53, 0.386), (61, 0.013), (63, 0.038), (71, 0.013), (76, 0.016), (78, 0.043), (81, 0.017), (84, 0.023), (90, 0.06), (91, 0.046), (96, 0.076)]
simIndex simValue paperId paperTitle
same-paper 1 0.65400165 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms
Author: Mingrui Wu, Bernhard Schölkopf, Gökhan Bakır
Abstract: Many kernel learning algorithms, including support vector machines, result in a kernel machine, such as a kernel classifier, whose key component is a weight vector in a feature space implicitly introduced by a positive definite kernel function. This weight vector is usually obtained by solving a convex optimization problem. Based on this fact we present a direct method to build sparse kernel learning algorithms by adding one more constraint to the original convex optimization problem, such that the sparseness of the resulting kernel machine is explicitly controlled while at the same time performance is kept as high as possible. A gradient based approach is provided to solve this modified optimization problem. Applying this method to the support vector machine results in a concrete algorithm for building sparse large margin classifiers. These classifiers essentially find a discriminating subspace that can be spanned by a small number of vectors, and in this subspace, the different classes of data are linearly well separated. Experimental results over several classification benchmarks demonstrate the effectiveness of our approach. Keywords: sparse learning, sparse large margin classifiers, kernel learning algorithms, support vector machine, kernel Fisher discriminant
2 0.62353987 46 jmlr-2006-Learning Factor Graphs in Polynomial Time and Sample Complexity
Author: Pieter Abbeel, Daphne Koller, Andrew Y. Ng
Abstract: We study the computational and sample complexity of parameter and structure learning in graphical models. Our main result shows that the class of factor graphs with bounded degree can be learned in polynomial time and from a polynomial number of training examples, assuming that the data is generated by a network in this class. This result covers both parameter estimation for a known network structure and structure learning. It implies as a corollary that we can learn factor graphs for both Bayesian networks and Markov networks of bounded degree, in polynomial time and sample complexity. Importantly, unlike standard maximum likelihood estimation algorithms, our method does not require inference in the underlying network, and so applies to networks where inference is intractable. We also show that the error of our learned model degrades gracefully when the generating distribution is not a member of the target class of networks. In addition to our main result, we show that the sample complexity of parameter learning in graphical models has an O(1) dependence on the number of variables in the model when using the KL-divergence normalized by the number of variables as the performance criterion.1 Keywords: probabilistic graphical models, parameter and structure learning, factor graphs, Markov networks, Bayesian networks
3 0.31530565 70 jmlr-2006-Online Passive-Aggressive Algorithms
Author: Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer
Abstract: We present a family of margin based online learning algorithms for various prediction tasks. In particular we derive and analyze algorithms for binary and multiclass categorization, regression, uniclass prediction and sequence prediction. The update steps of our different algorithms are all based on analytical solutions to simple constrained optimization problems. This unified view allows us to prove worst-case loss bounds for the different algorithms and for the various decision problems based on a single lemma. Our bounds on the cumulative loss of the algorithms are relative to the smallest loss that can be attained by any fixed hypothesis, and as such are applicable to both realizable and unrealizable settings. We demonstrate some of the merits of the proposed algorithms in a series of experiments with synthetic and real data sets.
4 0.30946344 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis
Author: Tonatiuh Peña Centeno, Neil D. Lawrence
Abstract: In this paper we consider a novel Bayesian interpretation of Fisher’s discriminant analysis. We relate Rayleigh’s coefficient to a noise model that minimises a cost based on the most probable class centres and that abandons the ‘regression to the labels’ assumption used by other algorithms. Optimisation of the noise model yields a direction of discrimination equivalent to Fisher’s discriminant, and with the incorporation of a prior we can apply Bayes’ rule to infer the posterior distribution of the direction of discrimination. Nonetheless, we argue that an additional constraining distribution has to be included if sensible results are to be obtained. Going further, with the use of a Gaussian process prior we show the equivalence of our model to a regularised kernel Fisher’s discriminant. A key advantage of our approach is the facility to determine kernel parameters and the regularisation coefficient through the optimisation of the marginal log-likelihood of the data. An added bonus of the new formulation is that it enables us to link the regularisation coefficient with the generalisation error.
Author: Ben Taskar, Simon Lacoste-Julien, Michael I. Jordan
Abstract: We present a simple and scalable algorithm for maximum-margin estimation of structured output models, including an important class of Markov networks and combinatorial models. We formulate the estimation problem as a convex-concave saddle-point problem that allows us to use simple projection methods based on the dual extragradient algorithm (Nesterov, 2003). The projection step can be solved using dynamic programming or combinatorial algorithms for min-cost convex flow, depending on the structure of the problem. We show that this approach provides a memoryefficient alternative to formulations based on reductions to a quadratic program (QP). We analyze the convergence of the method and present experiments on two very different structured prediction tasks: 3D image segmentation and word alignment, illustrating the favorable scaling properties of our algorithm.1 Keywords: Markov networks, large-margin methods, structured prediction, extragradient, Bregman projections
7 0.30231628 36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution
9 0.30166769 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation
10 0.30163127 43 jmlr-2006-Large Scale Multiple Kernel Learning (Special Topic on Machine Learning and Optimization)
11 0.30042681 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples
12 0.29781502 44 jmlr-2006-Large Scale Transductive SVMs
13 0.29652423 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
14 0.29529604 61 jmlr-2006-Maximum-Gain Working Set Selection for SVMs (Special Topic on Machine Learning and Optimization)
15 0.29442731 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space
17 0.29188934 14 jmlr-2006-An Efficient Implementation of an Active Set Method for SVMs (Special Topic on Machine Learning and Optimization)
18 0.29084799 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix